[closed] Slow copy: memcpy of 26 MB takes 190 ms with MMAP. IO_METHOD_USERPTR now works.

You might find this reference useful:

If parts of the system are reducing power until needed, then this could increase latency quite a bit.

I have brought all 4 CPU cores online and set performance mode. The Tegra still takes 200 ms to copy a 26 MB buffer from regular to pinned memory.

Are you activating anything CUDA related? There is usually an INIT overhead there…

Yes, I do init:

//NPP init

    int deviceCount;
    checkCudaErrors( cudaGetDeviceCount( & deviceCount ) );

    if( deviceCount == 0 ) {
        qDebug() << "CUDA error: no devices supporting CUDA.";
        exit( EXIT_FAILURE );
    }

    int dev{ findCudaDevice( NULL, NULL ) };

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties( & deviceProp, dev );
    qDebug() << "cudaSetDevice GPU" << dev << " = " << deviceProp.name;
    qDebug() << "cuda mapped mem support " << deviceProp.canMapHostMemory;
    checkCudaErrors( cudaSetDevice( dev ) );

    const NppLibraryVersion * libVer{ nppGetLibVersion() };
    printf("NPP Library Version %d.%d.%d\n", libVer->major, libVer->minor, libVer->build);

    int driverVersion, runtimeVersion;
    cudaDriverGetVersion( & driverVersion );
    cudaRuntimeGetVersion( & runtimeVersion );
    printf("CUDA Driver Version: %d.%d\n", driverVersion / 1000, ( driverVersion % 100 ) / 10 );
    printf("CUDA Runtime Version: %d.%d\n", runtimeVersion / 1000, ( runtimeVersion % 100 ) / 10 );

    checkCudaCapabilities( 1, 0 );

    qDebug() << "GpuNumSMs " << nppGetGpuNumSMs();
    qDebug() << "MaxThreadsPerBlock " << nppGetMaxThreadsPerBlock();
    qDebug() << "MaxThreadsPerSM " << nppGetMaxThreadsPerSM();
    qDebug() << "GpuName " << nppGetGpuName();

    //cudaSetDeviceFlags( cudaDeviceMapHost );

    qDebug() << "Num CPU's "<< omp_get_num_procs() << omp_get_max_threads();
    //omp_set_dynamic( 0 );
    omp_set_num_threads( 4 );

I ran the test again:

  • memcpy from a userspace pointer to a userspace pointer takes about 33 ms per 26 MB
  • memcpy from a userspace pointer to a pinned pointer takes about 33 ms per 26 MB
  • memcpy from an mmap v4l pointer to pinned memory takes about 200 ms per 26 MB

So mmap is slow?!
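A minimal sketch of how copy times like these could be measured, assuming plain heap buffers. The helper names `time_memcpy_ms` and `bench_heap_copy_ms` are hypothetical and not from my program; to time the mmap case, the source pointer would be the v4l2 mmap address instead.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Returns the wall-clock time in milliseconds taken by one memcpy of
 * `size` bytes from src to dst. Uses C11 timespec_get (not monotonic). */
static double time_memcpy_ms(void *dst, const void *src, size_t size)
{
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    memcpy(dst, src, size);
    timespec_get(&t1, TIME_UTC);

    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Hypothetical helper: allocates two heap buffers of `size` bytes, touches
 * them so the pages are mapped before timing, and returns the copy time in
 * ms. Returns -1.0 on allocation failure or if the copy does not match. */
static double bench_heap_copy_ms(size_t size)
{
    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    double ms = -1.0;

    if (src && dst) {
        memset(src, 0xA5, size);
        memset(dst, 0, size);
        ms = time_memcpy_ms(dst, src, size);
        if (memcmp(dst, src, size) != 0)
            ms = -1.0;
    }
    free(src);
    free(dst);
    return ms;
}
```

Calling `bench_heap_copy_ms(26661888)` would time one copy of a frame of my size between two userspace buffers; a single run is noisy, so averaging several runs is probably wiser.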

V4L2_MEMORY_USERPTR works fast now.

I found a v4l2 example (http://www.friendlyarm.net/forum/topic/1006) where the userptr buffers are allocated with memalign from malloc.h instead of malloc. Otherwise I get error 22 (Invalid argument).

Now I set IO_METHOD_USERPTR, and in init_userp:

        unsigned int page_size;

        page_size = getpagesize ();
        /* round buffer_size up to a multiple of the page size */
        buffer_size = (buffer_size + page_size - 1) & ~(page_size - 1);

        for (n_buffers = 0; n_buffers < 4; ++n_buffers) {
                buffers[n_buffers].length = buffer_size;
                buffers[n_buffers].start = memalign (/* boundary */ page_size, buffer_size);

                if (!buffers[n_buffers].start) {
                        fprintf (stderr, "Out of memory\n");
                        exit (EXIT_FAILURE);
                }
        }

Then I copy from buf.m.userptr to pinned memory and use that in the CUDA functions.
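For reference, here is a small standalone sketch of the same page-aligned allocation. It swaps in C11 `aligned_alloc` for `memalign` (my code above uses `memalign`), and the helper names `round_up` and `alloc_page_aligned` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Rounds size up to the next multiple of align.
 * align must be a power of two (page sizes are). */
static size_t round_up(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}

/* Hypothetical helper: returns a page-aligned buffer of at least *size
 * bytes and stores the rounded-up length back in *size.
 * Returns NULL on failure. Release the buffer with free(). */
static void *alloc_page_aligned(size_t *size)
{
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0)
        return NULL;

    /* aligned_alloc expects the size to be a multiple of the alignment */
    *size = round_up(*size, (size_t)page);
    return aligned_alloc((size_t)page, *size);
}
```

The returned pointer and the rounded length are what would go into buffers[n].start and buffers[n].length before queuing the buffer.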

Thanks to all.

Best regards, Viktor.

hi viktor,

I used memalign in my application and the user pointer started working, but somehow the data copy from buf.m.userptr fails. See the code snippet below:

ioctl (fd_v4l, VIDIOC_DQBUF, &buf);
memcpy(test_buff, (void *)buf.m.userptr, buf.length);
if ( memcmp( test_buff, (void *)buf.m.userptr, buf.length) )
    printf("error in memcpy\n");

memcmp fails. Do you observe this issue?

Another point: if I replace memcpy with my own my_memcpy function, there is no issue, but it takes more time.

void my_memcpy(unsigned char *dst, unsigned char *src, unsigned int len)
{
    while (len--) *dst++ = *src++;   /* byte-by-byte copy */
}

It seems the cache for the user pointer buffer is not invalidated by the driver after it writes data into it.

Hi Pratik,

I ran tests and didn’t see any copy failures.

I am working via buffer:

buffer = memalign(…)

buf…m.userptr = buffer

memmove( pinned, buffer, size )

I will run more tests later and post the results.

hi viktor,

I have similar code. I have 8 buffers allocated with memalign and passed to the v4l2 buffers.

Did you try comparing pinned and buf.m.userptr in your code using memcmp? Does memcmp always return zero for you?

I have one more global buffer (test_buff, allocated with memalign), and after VIDIOC_DQBUF I copy the data from buf.m.userptr to this global buffer. After the copy, I compare test_buff and buf.m.userptr using memcmp. What I have observed is that memcmp often returns a non-zero value, which means the buffers are not the same.

following is my code

for (i = 0; i < TEST_BUFFER_NUM; i++)
buffers[i].start = memalign

buf.m.userptr = (unsigned long)buffers[i].start;


test_buff = memalign

ioctl (fd_v4l, VIDIOC_DQBUF, &buf);
memcpy(test_buff, (void *)buf.m.userptr, buf.length);
if ( memcmp( test_buff, (void *)buf.m.userptr, buf.length) )
    printf("error in memcpy\n");
ioctl (fd_v4l, VIDIOC_QBUF, &buf);

Hi Pratik,

Did you check v4l errors and EAGAIN?


                buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
                buf.memory = V4L2_MEMORY_USERPTR;

                if (-1 == xioctl(fd, VIDIOC_DQBUF, &buf)) {
                        switch (errno) {
                        case EAGAIN:
                                return 0;

                        case EIO:
                                /* Could ignore EIO, see spec. */

                                /* fall through */

                        default:
                                /* errno_exit: error helper as in the V4L2 capture example */
                                errno_exit("VIDIOC_DQBUF");
                        }
                }

                /* find which of our buffers the driver handed back */
                for (i = 0; i < n_buffers; ++i)
                        if (buf.m.userptr == (unsigned long)buffers[i].start
                            && buf.length == buffers[i].length)
                                break;

                assert(i < n_buffers);

                process_image((void *)buf.m.userptr, buf.bytesused);

                if (-1 == xioctl(fd, VIDIOC_QBUF, &buf))
                        errno_exit("VIDIOC_QBUF");
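The `xioctl` call used above is not defined in this thread. In the V4L2 capture example it is a thin wrapper that retries `ioctl` when a signal interrupts it; a sketch along those lines, assuming that is what is meant here:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Calls ioctl and retries for as long as it fails with EINTR
 * (interrupted by a signal). Any other result, including other
 * errors such as EAGAIN, is returned to the caller unchanged. */
static int xioctl(int fh, unsigned long request, void *arg)
{
    int r;

    do {
        r = ioctl(fh, request, arg);
    } while (-1 == r && EINTR == errno);

    return r;
}
```

With a non-blocking device, EAGAIN from VIDIOC_DQBUF only means no frame is ready yet, so the caller returns and tries again later rather than treating it as a failure.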

hi viktor,

Yes, I do check for v4l2 errors.

if (ioctl (fd_v4l, VIDIOC_DQBUF, &buf) < 0) {
    fprintf(fp,"-----   VIDIOC_DQBUF failed.   -----\n");
    return -1;
}

did you try to compare pinned and buf.m.userptr in your code using memcmp ? do you always get return value of memcmp as zero ?

Hi viktor,

One more update. When memcmp fails (returns non-zero), I checked all the data in both buffers manually, byte by byte, and found that it is normally the data after 3 MB that differs. In my case the image size is ~4 MB (to be precise, 4085760 bytes).
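A byte-by-byte check of this kind could look like the sketch below. The function names `first_mismatch` and `count_mismatches` are hypothetical and are not taken from either poster's code.

```c
#include <stddef.h>

/* Returns the offset of the first byte where a and b differ,
 * or len if the first len bytes are identical. */
static size_t first_mismatch(const unsigned char *a,
                             const unsigned char *b, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (a[i] != b[i])
            break;
    return i;
}

/* Returns how many of the first len bytes differ between a and b. */
static size_t count_mismatches(const unsigned char *a,
                               const unsigned char *b, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if (a[i] != b[i])
            n++;
    return n;
}
```

Printing the first mismatch offset and the mismatch count after each failed memcmp would show whether the differences always start near the same place in the frame.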

what is the image size in your case ?

Hi Pratik, tomorrow I will run memcmp and post the results. My image size is 3156x4224x2 (YUYV).
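As a quick sanity check on these numbers: YUYV uses 2 bytes per pixel, so a 3156x4224 frame is 26,661,888 bytes, which matches the "26 MB" quoted earlier. At 200 ms per frame that works out to roughly 133 MB/s, against roughly 808 MB/s at 33 ms. The helper names in this sketch are hypothetical.

```c
#include <stddef.h>

/* Size in bytes of one packed YUYV frame (2 bytes per pixel). */
static size_t yuyv_frame_bytes(size_t width, size_t height)
{
    return width * height * 2;
}

/* Copy throughput in MB/s (1 MB = 1e6 bytes) for `bytes` copied in `ms`. */
static double throughput_mb_per_s(size_t bytes, double ms)
{
    return ((double)bytes / 1e6) / (ms / 1e3);
}
```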

hi viktor.

Just to add, I am not using the CUDA API to copy the image. I have allocated another buffer, test_buff, with memalign, and I copy the data using memcpy(test_buff, buf.m.userptr, buf.length).

Do you use the CUDA API to copy the image?

If yes, can you try doing the image copy without CUDA?

Hi Pratik,

I ran a test with memcmp and a byte-by-byte comparison:

  1. memalign a global pinned buffer and set its address in buf.m.userptr
  2. in the loop, read the v4l2 buf (after many EAGAIN returns each cycle)
  3. memmove the global buffer to the pinned CUDA buffer (cudaMalloc). I couldn’t assign the pinned buffer directly to buf.m.userptr: v4l2 QBUF freezes the program
  4. memcmp or a byte-by-byte comparison of the global buffer and the pinned CUDA buffer always finds them not equal

I will discuss why and where they differ with my colleague over the next few days and report back to you.
But I looked at the images retrieved from the pinned buffer and there is no visible damage and there are no skipped frames.

I will also memcmp a non-CUDA buffer against the v4l2 buf.

Thank you.

Hi viktor,

Thanks for the reply.

What is the FPS (frames per second) of your camera?

Hi Pratik, I have 10 fps now.
I still don’t see any visual defects in the images from the camera.

Best regards, Viktor.

hi viktor,

We have a camera running at 90 FPS. We don’t see any visual defects either, because the frame rate is high and the frame changes very frequently.

In our case, when memcmp fails we dump both buffers to files. Visually there is no difference between the two buffers, but when we compare the dumped files, some bytes have changed.

Did you find out why memcmp fails at your end? Did you try dumping both buffers?

Hi Pratik,

I am still in the process of understanding this issue.
I will run more tests later.

Best regards, Viktor.