how to make zero copy work


I’m kind of a CUDA newbie, so please bear with me.
Recently, I ported some existing C code to CUDA and got a substantial speedup. The code calls the same kernel several hundred times, and each time I have to send a large array back to the host. Since I keep sending the same array back and forth, from what I’ve read it seemed like using zero copy on that array would be a good idea.

So, I’m trying to use the simpleZeroCopy example from the SDK (which runs fine on my machine) as a guide, but when I try to run the code in my project, I get an error during the memory allocation. Here’s basically what I’m doing:

float *Drv, *Hrv;

cudaHostAlloc((void **)&Hrv, sizeof(float) * arraySize, cudaHostAllocMapped); // error occurs here

for (int i = 0; i < arraySize; i++)
    Hrv[i] = 0;

cudaHostGetDevicePointer((void **)&Drv, (void *)Hrv, 0);

for (int i = 0; i < manyIterations; i++) {
    kernel<<<blocksPerGrid, threadsPerBlock>>>(Drv, someOtherStuff);
    // do some stuff on the rv array
}

The data size I’m working with for the zero copy is about 2048 * 512 floats, which is the same size as the data from the simpleZeroCopy example. Anybody see what I’m doing wrong? Thanks
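(For anyone debugging something similar: a minimal sketch of the allocation path above with per-call error checking, so the failing call and its actual error string are visible. The `check` helper and `main` scaffolding are illustrative, not from the original post; `arraySize` matches the size the poster mentions.)

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: print a readable message and abort if a CUDA call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    // Mapped (zero-copy) allocations require this flag to be set
    // before the CUDA context is created.
    check(cudaSetDeviceFlags(cudaDeviceMapHost), "cudaSetDeviceFlags");

    const size_t arraySize = 2048 * 512;
    float *Hrv = NULL, *Drv = NULL;

    check(cudaHostAlloc((void **)&Hrv, sizeof(float) * arraySize,
                        cudaHostAllocMapped), "cudaHostAlloc");
    check(cudaHostGetDevicePointer((void **)&Drv, Hrv, 0),
          "cudaHostGetDevicePointer");

    // ... launch kernels using Drv ...

    check(cudaFreeHost(Hrv), "cudaFreeHost");
    return 0;
}
```

Checking each return value directly (rather than calling cudaGetLastError() once at the end) tells you exactly which call is the one that fails.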

What error do you actually get, and what GPU and OS are you using?

I’m running a GTX 285 on Windows XP.

cudaGetLastError() just returns “unknown error”

what driver and toolkit are you running?

Toolkit version 2.3, driver 190.38

Anyone have any ideas on this? Am I going about it the right way?

Is that cudaHostAlloc the first API call you make that isn’t related to device enumeration or selection? (i.e., do you already have a context when you call cudaSetDeviceFlags?)

I’ve been a little confused when I’ve read about the idea of contexts on the forums before. Here are all the CUDA-related calls that I’ve made up to the point of the error:


cudaEvent_t start_event, stop_event;
float timer;

cudaEventRecord(start_event, 0);

char *device = NULL;
unsigned int flags;
cudaDeviceProp deviceProp;
int idev = 0, deviceCount;

cudaGetDeviceProperties(&deviceProp, idev);
if (!deviceProp.canMapHostMemory)
    fprintf(stderr, "Device %d cannot map host memory!\n", idev);

cudaSetDeviceFlags(cudaDeviceMapHost);

// declare device variables
float *Dx, *Ddata, *Drv, *Dx1, *Dx2, *Dshifts, *Hrv;

flags = cudaHostAllocMapped;
cudaHostAlloc((void **)&Hrv, sizeof(float) * xElem * shiftsSize, flags);
cudaMalloc((void **)&Dx, sizeof(float) * xElem);
cudaMalloc((void **)&Ddata, sizeof(float) * xElem * yElem);
checkCUDAError("memory allocation");

Thanks, tmurray for your interest in helping me out so far.

cudaEventRecord will create a context before you set the device flags. It shouldn’t be returning unknown error, but I guess that’s a test hole we have. You need to set the device flags and the device before any other CUDA calls.
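(A sketch of the ordering this implies, for anyone who finds this thread later. The variable names are illustrative; the point is simply that cudaSetDevice/cudaSetDeviceFlags come before any call that can create the context, including the event calls used for timing.)

```c
// 1. Select the device and set flags FIRST. These must precede any call
//    that creates the CUDA context (cudaEventCreate/Record, cudaMalloc, ...).
cudaSetDevice(0);
cudaSetDeviceFlags(cudaDeviceMapHost);

// 2. Only now start the timer. This is the call that creates the context,
//    and the context now comes up with mapped-memory support enabled.
cudaEvent_t start_event, stop_event;
cudaEventCreate(&start_event);
cudaEventCreate(&stop_event);
cudaEventRecord(start_event, 0);

// 3. The mapped (zero-copy) allocation now succeeds.
float *Hrv = NULL, *Drv = NULL;
cudaHostAlloc((void **)&Hrv, sizeof(float) * arraySize, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&Drv, Hrv, 0);
```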

Hey, great! I disabled the timer and it’s working now. Thanks for the help