cuda zero-copy copy data failed!

Hi, I was using Qt managing the cuda code on my jetson TX1, and i failed to use the zero-copy to transform my data, it compiled ok, but crashed when running the executable file “protonect.cpp”, i think it’s the problem in the copy data to host variables, as it can run with other constant data.

int* h_map_dist, *d_map_dist;
    const size_t imageSize = 4 * 424 * 512;
    libfreenect2::Registration registration(dev->getIrCameraParams(), dev->getColorCameraParams());
  //cudaMemcpy(d_map_dist,registration.impl_->distort_map,          imageSize, cudaMemcpyHostToDevice);//512*424;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void**)&h_map_dist, imageSize, cudaHostAllocMapped);
    h_map_dist = registration.impl_->distort_map;
    cudaHostGetDevicePointer((void**)&d_map_dist,(void*)h_map_dist,0);

i wonder it’s the data format problem? if the "h_map_dist = registration.impl_->distort_map;"wrong?

Almost certainly it is wrong.

This sets the value of h_map_dist:

cudaHostAlloc((void**)&h_map_dist, imageSize, cudaHostAllocMapped);

this overwrites that value with something else:

h_map_dist = registration.impl_->distort_map;

So I don’t think that makes any sense.

Furthermore, this:

cudaHostGetDevicePointer((void**)&d_map_dist,(void*)h_map_dist,0);

should probably be returning an error code, but since you are not checking CUDA errors, you are missing it.

If you are having trouble with a CUDA code, you should always do proper CUDA error checking.

If you don’t know what proper CUDA error checking is, google “proper CUDA error checking” and take the first hit, study it, and apply it to your code.

Thanks for your suggestions of do proper CUDA error checking!
First:
I can’t understand your opinion:
This sets the value of h_map_dist:

cudaHostAlloc((void**)&h_map_dist, imageSize, cudaHostAllocMapped);

What’s the value of h_map_dist, i just think it only allocate a memory.(maybe misunderstanded)

this overwrites that value with something else:

h_map_dist = registration.impl_->distort_map;

I want to make the variable h_map_dist = registration.impl_->distort_map what should i do?

Second:
I add error checking on the code:

err = cudaHostAlloc((void**)&h_map_dist, imageSize, cudaHostAllocMapped);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate host memory (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    h_map_dist = registration.impl_->distort_map;
    err = cudaHostGetDevicePointer((void**)&d_map_dist,(void*)h_map_dist,0);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to get device pointer (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

It returned “Failed to get device pointer (error code invalid argument)!”

cudaHostAlloc does the following:

  1. allocate system (CPU) memory
  2. register that memory in such a way that it’s address in the system will be fixed, and then maps that memory into GPU address space
  3. puts the (host) address of that memory in the pointer that you pass to it (i.e. h_map_dist in this case)

In your very next line of code, you are overwriting the value that was placed in h_map_dist in step 3 above. So that is a really bad idea. Logically it is broken.

THere are clearly some concepts about zero-copy that you don’t understand. Furthermore, I can’t tell from this little snippet what you are trying to accomplish exactly.

But if you want to get your CUDA API calls to be logically correct, you need to get rid of this line:

h_map_dist = registration.impl_->distort_map;

If your intent is that you want h_map_dist to point to the same data that registration.impl_->distort_map points to, then you will either need to:

  1. copy the data from registration.impl_->distort_map to h_map_dist (you can’t do this by jamming the pointer value into h_map_dist)

OR

  1. attempt to pass registration.impl_->distort_map directly to cudaHostRegister, and use that as your zero-copy memory instead. I don’t know for sure that this method 2 will work, since you’ve just shown a snippet of code and I don’t know what the origin of registration.impl_->distort_map is or what it points to exactly.

cudaHostAlloc was not implemented on TK1, I am not sure if it is available on TX1.

Thanks for your patient explaining!
I will have a try according to your advice, and the “registration.impl_->distort_map” is from a frame, so, it’s difficult to copy the data to h_map_dist.
Do you think i can use the unified virtual addressing(UVA) to repalce the zero-copy as the TX1’s cpu and gpu memory is unified ?