I have a simple program that works in emulation mode. I am trying to use a device function to change some data (*d_data) passed from the host side:
float3* d_data=NULL;
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_data, size));
kernel<<<gridSize, blockSize>>>(d_data);
It works well in emulation. But when I switch to GPU mode, d_data's values haven't changed after the kernel call. Do I need to copy the results from device to host, like this?
float3* h_odata=NULL;
CUDA_SAFE_CALL( cudaMalloc( (void**) &h_odata, size));
CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_data, size, cudaMemcpyDeviceToHost) );
But this doesn't work either. :mad: Please help, thanks!
-timothy
You should copy from the device AFTER the kernel launch!
It would be safest to issue a cudaThreadSynchronize() after the kernel launch and then follow it up with the copy.
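A minimal sketch of that ordering, assuming the kernel, launch configuration, d_data, and size from the original post (note that h_odata must be ordinary host memory here, not a cudaMalloc pointer; also, cudaThreadSynchronize() was the API of the day, and modern CUDA uses cudaDeviceSynchronize() instead):

```cuda
kernel<<<gridSize, blockSize>>>(d_data);

// Block until the kernel has actually finished before reading results.
CUDA_SAFE_CALL(cudaThreadSynchronize());

// Destination for a DeviceToHost copy must be HOST memory.
float3* h_odata = (float3*)malloc(size);
CUDA_SAFE_CALL(cudaMemcpy(h_odata, d_data, size, cudaMemcpyDeviceToHost));
```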
pfccpp
August 7, 2008, 6:51am
3
Your second block of code:
float3* h_odata=NULL;
CUDA_SAFE_CALL( cudaMalloc( (void**) &h_odata, size));
CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_data, size, cudaMemcpyDeviceToHost) );
You are allocating memory for h_odata with cudaMalloc → h_odata points to device memory.
You are then calling cudaMemcpy with h_odata as the first argument and cudaMemcpyDeviceToHost as the last one → CUDA assumes h_odata points to host memory, and that's not the case in your example.
You must do the following:
float3* h_data;
float3* d_data;
h_data=(float3*)malloc(size); // Host allocation
cudaMalloc((void**)&d_data, size); // Device allocation
//Write the test data in h_data
...
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice); // Upload the data
kernel<<<gridSize, blockSize>>>(d_data); // Compute the data
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost); // Download the data
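Put together, a complete minimal test program might look like the sketch below. The kernel body is a made-up placeholder (it just doubles each component) so the round trip is observable; the original poster's actual kernel was not shown:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder kernel: doubles each component of d_data.
__global__ void kernel(float3* d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d_data[i].x *= 2.0f;
        d_data[i].y *= 2.0f;
        d_data[i].z *= 2.0f;
    }
}

int main()
{
    const int n = 256;
    const size_t size = n * sizeof(float3);

    float3* h_data = (float3*)malloc(size);        // Host allocation
    for (int i = 0; i < n; ++i)
        h_data[i] = make_float3(1.0f, 2.0f, 3.0f); // Test data

    float3* d_data = NULL;
    cudaMalloc((void**)&d_data, size);             // Device allocation
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice); // Upload

    kernel<<<(n + 255) / 256, 256>>>(d_data, n);   // Compute

    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost); // Download
    printf("h_data[0].x = %f\n", h_data[0].x);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```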
VrahoK
August 7, 2008, 9:14am
4
Correct, just one addition: for best transfer performance you should use page-locked (pinned) host memory, allocated by:
float3* h_data;
cudaMallocHost((void**)&h_data, size); // Host allocation
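A short sketch of the pinned-memory variant (h_data and size as in the earlier example). The one extra thing to remember is that a cudaMallocHost buffer must be released with cudaFreeHost, not free():

```cuda
float3* h_data = NULL;
// Page-locked (pinned) host allocation: transfers to/from this buffer
// can be DMA'd directly and are typically faster than pageable memory.
CUDA_SAFE_CALL(cudaMallocHost((void**)&h_data, size));

// ... use h_data with cudaMemcpy exactly as before ...

// Pinned memory is freed with cudaFreeHost, not free().
CUDA_SAFE_CALL(cudaFreeHost(h_data));
```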
Thank you so much guys, it works! :laugh: