I have used this function so many times, but this time I don't know why it doesn't work.
In EmuRelease mode the code works fine and I get the right result.
But when I switch to Release mode, it seems the data is never actually copied back from the device: the host buffer still holds the same values I initialized it with.
I am wondering if anybody can give me a clue about how to solve this problem.
I never ran into this problem before, even though I have already used this function in several projects. I compared this code with all the successful projects, but I couldn't find any difference.
I have the same problem. I want to copy an integer array from the device to the host with cudaMemcpy. Everything works fine in emu mode, but when I compile a release build, nothing is copied. At first I suspected the debug mode and the CUDA_SAFE_CALL macro, but all it does (as far as I understand) is check for errors and print them to stderr if something goes wrong.
So I don't know what else to do. Maybe it's a general problem/bug in CUDA. Any help is appreciated.
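For reference, an error-checking wrapper in the spirit of CUDA_SAFE_CALL usually looks roughly like the sketch below (the exact macro in the SDK's cutil.h may differ; the name CUDA_CHECK is mine):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Minimal sketch: run the call, and if it did not return cudaSuccess,
   print the error string with file/line information and bail out. */
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

So the wrapper itself should not change what gets copied; it only reports failures.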
Sorry, but I am really lost here. I still have the problem and I have checked everything for errors. Here is my code. Am I doing something wrong?
unsigned int test_size = 256;
unsigned char * d_test= new unsigned char[test_size];
CUDA_SAFE_CALL(cudaMalloc((void**) &d_test, test_size*sizeof(char)));
...
cudaProcess<<<grid, block, sbytes>>>(d_test);
...
unsigned char* h_test = new unsigned char[test_size];
CUDA_SAFE_CALL(cudaMemcpy(h_test, d_test, test_size*sizeof(unsigned char), cudaMemcpyDeviceToHost));
for (int i= 0; i<test_size; i++) {
fprintf(stderr, " Test %d \n",h_test[i]);
}
...
__global__ void cudaProcess(unsigned char * g_test)
{
int thid = threadIdx.x;
g_test[thid]=5;
}
Now h_test should be an array full of 5s, and that is what should be printed. But I get random numbers, whether or not I modify g_test in the kernel.
You allocate memory for the pointer d_test twice: once on the host (with the new operator) and once on the device (by calling cudaMalloc). Since you never call delete[] on the host allocation before cudaMalloc overwrites the pointer, you leak that host memory.
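In other words, keep separate pointers for host and device memory. A minimal sketch of the usual pattern (buffer names are just illustrative):

unsigned int test_size = 256;

/* Host buffer: allocated with new (or malloc), released with delete[] (or free). */
unsigned char *h_test = new unsigned char[test_size];

/* Device buffer: allocated only with cudaMalloc, released with cudaFree. */
unsigned char *d_test = 0;
CUDA_SAFE_CALL(cudaMalloc((void**)&d_test, test_size * sizeof(unsigned char)));

/* ... launch the kernel, then copy the result back ... */
CUDA_SAFE_CALL(cudaMemcpy(h_test, d_test, test_size * sizeof(unsigned char),
                          cudaMemcpyDeviceToHost));

cudaFree(d_test);
delete[] h_test;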
Your cudaProcess kernel will not initialize the entire array to 5s unless you launch only one block and that block has 256 threads in the x-dimension. That is because, as written, threads with the same thread ID in different blocks write to the same location. For one-dimensional blocks you can fix that with something like:
int thid = blockIdx.x*blockDim.x+threadIdx.x;
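Put together, the kernel could look like the sketch below. Note that I also pass the array length as an extra parameter and add a bounds check, which is not in the original code but protects you if the launch configuration creates more threads than there are elements:

__global__ void cudaProcess(unsigned char *g_test, unsigned int n)
{
    /* Global thread index across all blocks of a 1D grid. */
    unsigned int thid = blockIdx.x * blockDim.x + threadIdx.x;

    /* Only threads that map to a valid element write anything. */
    if (thid < n)
        g_test[thid] = 5;
}

The launch then becomes cudaProcess<<<grid, block, sbytes>>>(d_test, test_size);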
What are your grid, block, and sbytes values? When I run your (slightly modified) code, I get all 5s in the output. My modifications are:
changing new to malloc (new is a C++ operator, and on Win32 nvcc doesn't seem to link it easily).
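For example, the host allocation written with malloc instead of new (just a sketch of that substitution, not the exact code I ran):

#include <stdlib.h>

unsigned char *h_test = (unsigned char*)malloc(test_size * sizeof(unsigned char));
/* ... use h_test as before ... */
free(h_test);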