I used to have a piece of code that worked well on CUDA 4.0. After moving to the official CUDA 4.1 release, I found that cudaMemcpy from a device memory buffer, allocated within a kernel using dynamic memory allocation, back to a host memory buffer returns an “invalid argument” error.
I wrote a simple example program that shows the problem (attached).
The error is returned by the cudaMemcpy call near the end of the program. The same error is returned when the host buffer is page-locked.
Can anyone give me a hint about what I did wrong? Or is this a bug in the latest driver/CUDA release? Thanks!
My system: Intel Xeon E5507, GTX 480, CUDA 4.1 toolkit, Ubuntu 11.04, kernel version 2.6.38-13-generic, driver NVIDIA-Linux-x86_64-285.05.33. The problem also appears on another system running CentOS 5.5 (Intel Xeon E5560, M2070, CUDA 4.1, kernel 2.6.18-194.el5.perfctr).
Attachment: simple_memcpy_test.cu (972 Bytes)
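A minimal sketch of the pattern described above, not the attached file itself: a kernel allocates a buffer with in-kernel malloc, publishes its address through d_bufptr, and the host then calls cudaMemcpy with that address as the source. The names alloc_kernel, copy_from_device_heap, heap_ptr and the size N are hypothetical; only d_bufptr and h_buf come from the thread. In-kernel malloc needs a device of compute capability 2.0 or higher.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 16  // hypothetical element count, chosen for illustration

// Allocates a buffer on the device heap and publishes its address.
__global__ void alloc_kernel(int **d_bufptr)
{
    int *buf = (int *)malloc(N * sizeof(int));  // in-kernel dynamic allocation
    if (buf != NULL) {
        for (int i = 0; i < N; ++i) buf[i] = i;
    }
    *d_bufptr = buf;
}

// Returns the status of the last CUDA call that ran; the call of interest is
// the final cudaMemcpy, whose source is the device-heap pointer.
// The device-heap buffer is never freed here; this sketch only probes the copy.
cudaError_t copy_from_device_heap(int *h_buf)
{
    int **d_bufptr = NULL;  // device slot that receives the heap pointer
    int *heap_ptr = NULL;   // host copy of that pointer value

    cudaError_t err = cudaMalloc((void **)&d_bufptr, sizeof(int *));
    if (err != cudaSuccess) return err;

    alloc_kernel<<<1, 1>>>(d_bufptr);
    err = cudaDeviceSynchronize();

    // Fetch the pointer value itself; d_bufptr came from cudaMalloc.
    if (err == cudaSuccess)
        err = cudaMemcpy(&heap_ptr, d_bufptr, sizeof(int *), cudaMemcpyDeviceToHost);

    // Copy out of the in-kernel allocation; this is the call under discussion.
    if (err == cudaSuccess)
        err = cudaMemcpy(h_buf, heap_ptr, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_bufptr);
    return err;
}
```

Calling copy_from_device_heap(h_buf) from main and printing cudaGetErrorString of the result shows what a given toolkit and driver combination reports for the final copy.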
To add more information: the code works with CUDA 4.0 (runtime + driver), but not with CUDA 4.1 (runtime + driver).
Even if I use the CUDA 4.0 runtime with the 4.1 driver, it still returns the same error. Does this mean there is a driver bug somewhere?
Sorry, I didn’t check your code thoroughly. It could be a mix-up of host and device pointers combined with the explicitly specified direction for cudaMemcpy. Have you tried allocating h_buf via cudaHostAlloc and copying with the default direction?
Nope.
It returns the same error message whether the host buffer is allocated through malloc, cudaMalloc, or cudaHostAlloc (with all different flags).
The same happens for cudaMemcpy with the cudaMemcpyDefault parameter. (I would expect a different error message, perhaps “invalid direction”, if the direction parameter were the problem.)
Do you have a CUDA 4.1 environment to verify this bug? Thanks!
Yes, there is probably a bug in the CUDA driver. I checked with the runtime API (“invalid argument”) and with the driver API (“global function call is not configured”). As a workaround, you can use cudaHostAlloc for h_buf and then call “memcpy” inside a kernel to copy from *d_bufptr to h_buf; that works correctly.
The “CUDA C Programming Guide” does not mention memcpy (probably because it is not an extension), but the compiler handles it just like other standard C functions (e.g. memset).
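The workaround could look roughly like the sketch below. It assumes h_buf is allocated with the cudaHostAllocMapped flag and that the kernel reaches it through the device pointer returned by cudaHostGetDevicePointer; the post only says to use cudaHostAlloc, so both choices are assumptions. The names alloc_kernel, copy_out_kernel, copy_via_kernel, d_alias and the size N are hypothetical.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define N 16  // hypothetical element count, chosen for illustration

// Allocates a buffer on the device heap, fills it, and publishes its address.
__global__ void alloc_kernel(int **d_bufptr)
{
    int *buf = (int *)malloc(N * sizeof(int));
    if (buf != NULL) {
        for (int i = 0; i < N; ++i) buf[i] = 2 * i;
    }
    *d_bufptr = buf;
}

// Copies from the device-heap buffer into mapped host memory inside a kernel,
// then releases the heap buffer with the in-kernel free.
__global__ void copy_out_kernel(int *mapped_dst, int **d_bufptr)
{
    if (*d_bufptr != NULL) {  // skip the copy if the in-kernel malloc failed
        memcpy(mapped_dst, *d_bufptr, N * sizeof(int));
        free(*d_bufptr);
        *d_bufptr = NULL;
    }
}

// Fills out[0..N-1] through the in-kernel memcpy path; returns the CUDA status.
cudaError_t copy_via_kernel(int *out)
{
    int *h_buf = NULL;      // page-locked, mapped host buffer
    int *d_alias = NULL;    // device-side address of h_buf
    int **d_bufptr = NULL;  // device slot that receives the heap pointer

    // Older toolkits require this flag before the context is created; the
    // status is cleared because it may report that the device is already active.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    (void)cudaGetLastError();

    cudaError_t err = cudaHostAlloc((void **)&h_buf, N * sizeof(int), cudaHostAllocMapped);
    if (err != cudaSuccess) return err;
    err = cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);
    if (err == cudaSuccess)
        err = cudaMalloc((void **)&d_bufptr, sizeof(int *));

    if (err == cudaSuccess) {
        alloc_kernel<<<1, 1>>>(d_bufptr);
        copy_out_kernel<<<1, 1>>>(d_alias, d_bufptr);
        err = cudaDeviceSynchronize();  // wait until the mapped writes are done
    }
    if (err == cudaSuccess)
        memcpy(out, h_buf, N * sizeof(int));  // ordinary host-side copy

    cudaFree(d_bufptr);
    cudaFreeHost(h_buf);
    return err;
}
```

No cudaMemcpy call ever sees the device-heap pointer in this version: the only runtime copies involved are of memory obtained from cudaHostAlloc and cudaMalloc.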