Emulation mode works but GPU mode fails

I have a simple program that works in emulation mode. I am trying to use a device function to change some data (*d_data) passed from the host side.

float3* d_data=NULL;

CUDA_SAFE_CALL( cudaMalloc( (void**) &d_data, size));

kernel<<<gridSize, blockSize>>>(d_data);

It works well. But when I switch to GPU mode, the value of d_data hasn’t changed after the device function call. Do I need to copy the results from device to host like this?

float3* h_odata=NULL;

CUDA_SAFE_CALL( cudaMalloc( (void**) &h_odata, size));

CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_data, size, cudaMemcpyDeviceToHost) );

But this doesn’t work either. :mad: Please help, thanks!


You should copy from the device AFTER the kernel launch!

It would be safe to issue a “cudaThreadSynchronize()” after the kernel launch and then follow it up with a copy.
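To illustrate the synchronize-then-check idea, here is a minimal sketch. The kernel addOne, the helper launchAndWait, and the element count are made up for the example and are not from the original program. It uses cudaDeviceSynchronize(), which replaced the now-deprecated cudaThreadSynchronize() in later CUDA releases.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every component.
__global__ void addOne(float3* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i].x += 1.0f;
        data[i].y += 1.0f;
        data[i].z += 1.0f;
    }
}

// Hypothetical helper: launches the kernel, waits for it to finish,
// and returns the first error seen (or cudaSuccess).
cudaError_t launchAndWait(float3* d_data, int n)
{
    const int blockSize = 128;
    const int gridSize = (n + blockSize - 1) / blockSize;

    addOne<<<gridSize, blockSize>>>(d_data, n);

    // Reports launch problems such as an invalid grid/block configuration.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Blocks the host until the kernel is done; reports execution errors.
    return cudaDeviceSynchronize();
}

int main()
{
    const int n = 1000;  // assumed element count
    float3* d_data = NULL;

    if (cudaMalloc((void**)&d_data, n * sizeof(float3)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float3));

    cudaError_t err = launchAndWait(d_data, n);
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Checking the status right after the launch makes a failing kernel visible immediately instead of showing up later as unchanged data.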

Your second block of code:

You are allocating memory for h_odata with cudaMalloc, so h_odata points to device memory.

You are calling cudaMemcpy with h_odata as the first argument and cudaMemcpyDeviceToHost as the last one, so CUDA assumes h_odata points to host memory, which is not correct in your example.

You must do the following:

float3* h_data;

float3* d_data;

h_data=(float3*)malloc(size); // Host allocation

cudaMalloc((void**)&d_data, size); // Device allocation

//Write the test data in h_data


cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice); // Upload the data

kernel<<<gridSize, blockSize>>>(d_data); // Compute the data

cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost); // Download the data
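For reference, the same upload / compute / download sequence as a complete program might look like the sketch below. The CHECK macro (standing in for CUDA_SAFE_CALL), the scaleKernel kernel, the roundTrip helper, and the test values are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro, used here in place of CUDA_SAFE_CALL.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel for illustration: doubles every component.
__global__ void scaleKernel(float3* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i].x *= 2.0f;
        data[i].y *= 2.0f;
        data[i].z *= 2.0f;
    }
}

// Runs the full round trip and returns the number of wrong elements.
int roundTrip(int n)
{
    const size_t size = n * sizeof(float3);

    float3* h_data = (float3*)malloc(size);        // host allocation
    if (h_data == NULL) exit(EXIT_FAILURE);
    float3* d_data = NULL;
    CHECK(cudaMalloc((void**)&d_data, size));      // device allocation

    // Write the test data into h_data.
    for (int i = 0; i < n; ++i)
        h_data[i] = make_float3((float)i, 2.0f * i, 3.0f * i);

    CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));

    const int blockSize = 128;
    const int gridSize = (n + blockSize - 1) / blockSize;
    scaleKernel<<<gridSize, blockSize>>>(d_data, n);
    CHECK(cudaGetLastError());

    // This copy runs after the kernel and blocks the host until it is done.
    CHECK(cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost));

    // Verify on the host that the kernel really changed the data.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_data[i].x != 2.0f * i || h_data[i].y != 4.0f * i ||
            h_data[i].z != 6.0f * i)
            ++mismatches;
    }

    CHECK(cudaFree(d_data));
    free(h_data);
    return mismatches;
}

int main()
{
    int mismatches = roundTrip(1024);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

The key point is that the destination of the download is ordinary host memory from malloc, while only d_data comes from cudaMalloc.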

Correct, just one addition: if you want the best performance, you should use page-locked host memory, allocated by:

float3* h_data;

cudaMallocHost((void**)&h_data, size); // Host allocation
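A small sketch of the page-locked variant follows; the helper name pinnedCopyMatches and the test values are assumptions for the example. Note that memory from cudaMallocHost must be released with cudaFreeHost, not free(), and that pinning very large amounts of host memory can hurt overall system performance.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: copies n elements from a pinned host buffer to the
// device and back into a second pinned buffer. Returns true if they match.
bool pinnedCopyMatches(int n)
{
    const size_t size = n * sizeof(float3);
    float3* h_in = NULL;
    float3* h_out = NULL;
    float3* d_data = NULL;
    bool ok = true;

    // Page-locked (pinned) host allocations and one device allocation.
    if (cudaMallocHost((void**)&h_in, size) != cudaSuccess) return false;
    if (cudaMallocHost((void**)&h_out, size) != cudaSuccess) {
        cudaFreeHost(h_in);
        return false;
    }
    if (cudaMalloc((void**)&d_data, size) != cudaSuccess) {
        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        return false;
    }

    for (int i = 0; i < n; ++i) {
        h_in[i] = make_float3((float)i, (float)(i + 1), (float)(i + 2));
        h_out[i] = make_float3(0.0f, 0.0f, 0.0f);
    }

    // Upload, then download into the second buffer.
    ok = ok && cudaMemcpy(d_data, h_in, size, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(h_out, d_data, size, cudaMemcpyDeviceToHost) == cudaSuccess;

    for (int i = 0; ok && i < n; ++i) {
        ok = h_in[i].x == h_out[i].x && h_in[i].y == h_out[i].y &&
             h_in[i].z == h_out[i].z;
    }

    cudaFree(d_data);
    cudaFreeHost(h_in);   // pinned memory is freed with cudaFreeHost
    cudaFreeHost(h_out);
    return ok;
}

int main()
{
    bool ok = pinnedCopyMatches(4096);
    printf("pinned round trip %s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```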

Thank you so much guys, it works! :laugh: