I’m kind of a CUDA newbie, so please bear with me.
Recently I added CUDA to some existing C code and got a substantial speedup. The code launches the same kernel several hundred times, and after each launch I have to copy a large array back to the host. Since I keep sending the same array back and forth, I thought, based on what I've read, that using zero copy (mapped, page-locked host memory) for that array would be a good idea.
So, I’m trying to use the simpleZeroCopy example from the SDK (which runs fine on my machine) as a guide, but when I try to run the code in my project, I get an error during the memory allocation. Here’s basically what I’m doing:
for (int i = 0; i < manyIterations; i++)
{
    kernel<<<blocksPerGrid, threadsPerBlock>>>(Drv, someOtherStuff);
    cudaDeviceSynchronize(); // launches are asynchronous; wait before the host reads rv
    // do some stuff on the rv array
}
The data I'm using zero copy for is about 2048 * 512 floats (4 MB), the same size as the data in the simpleZeroCopy example. Does anybody see what I'm doing wrong? Thanks.
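For reference, here is a minimal sketch of the zero-copy pattern described above, following the same order of calls as simpleZeroCopy: check `canMapHostMemory`, select the device and set `cudaDeviceMapHost`, allocate with `cudaHostAllocMapped`, get the device alias with `cudaHostGetDevicePointer`, then launch repeatedly. `scaleKernel`, `initZeroCopyDevice`, and `runZeroCopy` are hypothetical names, device 0 is assumed, and error handling is reduced to returning `-1.0f`. `cudaDeviceSynchronize` is the current name for the call; toolkits from this thread's era used `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's kernel: doubles every element in place.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Checks mapping support, selects device 0 and enables mapped host memory.
// Runs its body only once, and must run before anything else creates a context.
bool initZeroCopyDevice()
{
    static bool ready = false;
    if (ready) return true;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory)
        return false;
    if (cudaSetDevice(0) != cudaSuccess) return false;
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return false;
    ready = true;
    return true;
}

// Launches scaleKernel `iterations` times on a zero-copy buffer of n floats
// (initialized to 1.0f) and returns rv[0] as seen by the host, or -1.0f on error.
float runZeroCopy(int n, int iterations)
{
    if (!initZeroCopyDevice()) return -1.0f;

    float *rv = nullptr;   // host pointer to page-locked, mapped memory
    float *Drv = nullptr;  // device alias of the same memory
    if (cudaHostAlloc((void **)&rv, n * sizeof(float), cudaHostAllocMapped) != cudaSuccess)
        return -1.0f;
    if (cudaHostGetDevicePointer((void **)&Drv, rv, 0) != cudaSuccess) {
        cudaFreeHost(rv);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) rv[i] = 1.0f;

    const int threadsPerBlock = 256;
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    for (int it = 0; it < iterations; ++it) {
        scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(Drv, n);
        // Launches are asynchronous: wait for the kernel before the host reads rv.
        if (cudaGetLastError() != cudaSuccess || cudaDeviceSynchronize() != cudaSuccess) {
            cudaFreeHost(rv);
            return -1.0f;
        }
        // ...host-side work on rv goes here; no cudaMemcpy is needed...
    }

    float result = rv[0];
    cudaFreeHost(rv);
    return result;
}

int main()
{
    const int n = 2048 * 512;  // element count from the question
    float v = runZeroCopy(n, 3);
    if (v < 0.0f) {
        fprintf(stderr, "zero-copy setup or launch failed\n");
        return 1;
    }
    printf("rv[0] after 3 launches: %f\n", v);
    return 0;
}
```

On a GPU that supports mapped memory, three doublings of 1.0f should leave 8 in `rv[0]` with no explicit copy; the synchronize after each launch is what makes reading `rv` on the host safe.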
Is that cudaHostAlloc the first API call you make that isn't related to device enumeration or selection? In other words, does a context already exist when you call cudaSetDeviceFlags?
Contexts have confused me a bit whenever I've read about them on the forums. Here are all the CUDA-related calls I make up to the point of the error:
cudaEventRecord creates a context before you set the device flags. It shouldn't be returning an unknown error, but I guess that's a hole in our testing. You need to select the device and set its flags before making any other CUDA calls.
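The ordering fix can be sketched as follows, assuming the events are used for timing (the poster's actual call list isn't shown) and that device 0 is the target. `setupBeforeAnythingElse` and `fillKernel` are hypothetical names; the point is only that `cudaSetDevice` and `cudaSetDeviceFlags` run before `cudaEventCreate`/`cudaEventRecord` or anything else that creates a context.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only so there is something to time.
__global__ void fillKernel(float *p, int n, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

// Hypothetical helper: select the device and request mapped host memory.
// Call it before anything that creates a context (events, allocations, launches).
cudaError_t setupBeforeAnythingElse(int dev)
{
    cudaError_t err = cudaSetDevice(dev);
    if (err != cudaSuccess) return err;
    return cudaSetDeviceFlags(cudaDeviceMapHost);
}

int main()
{
    cudaError_t err = setupBeforeAnythingElse(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "device setup failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Only now create the timing events; on older toolkits, creating or
    // recording them first would create a context without cudaDeviceMapHost.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int n = 2048 * 512;
    float *rv = nullptr, *Drv = nullptr;
    err = cudaHostAlloc((void **)&rv, n * sizeof(float), cudaHostAllocMapped);
    if (err == cudaSuccess) err = cudaHostGetDevicePointer((void **)&Drv, rv, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "mapped allocation failed: %s\n", cudaGetErrorString(err));
        if (rv) cudaFreeHost(rv);
        return 1;
    }

    cudaEventRecord(start, 0);
    fillKernel<<<(n + 255) / 256, 256>>>(Drv, n, 3.0f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // waits for the kernel too, so rv is safe to read

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("fill took %.3f ms, rv[0] = %.1f\n", ms, rv[0]);

    cudaFreeHost(rv);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Moving the two event calls above `setupBeforeAnythingElse` reproduces the situation described in the thread, where a context exists before the flags are set.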