cudaMemcpy error "invalid argument" from in-kernel malloc'ed device mem buffer on cuda 4


I use to have a piece of code work well on CUDA 4.0. But after using CUDA 4.1 official release, I found cudaMemcpy from a device memory buffer, which is allocated within a kernel using dynamic memory, back to host memory buffer returns an “invalid argument” error.

I wrote a simple example program to show this problem as in the following: (also attached)

#define SIZE 1024

#define BYTE_C 87

#include <stdlib.h>

#include <stdio.h>

__global__ void alloc_kernel(char **buf_ptr, size_t size, char content) {

    if(threadIdx.x == 0 && blockIdx.x == 0) {

        *buf_ptr = (char *) malloc(size);

        for(int i = 0; i < size; i++) {

            (*buf_ptr)[i] = content;




#define CUDA_CALL(cmd) do { \

    cudaError_t err; \

    err = cmd;       \

    if( err != cudaSuccess ) {       \

        printf("%s line %d: CUDA Error: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \

    }               \


int main() {

    char **d_bufptr, *h_bufptr, *h_buf;

CUDA_CALL( cudaSetDevice(0) );

    CUDA_CALL( cudaMalloc((void **) &d_bufptr, sizeof(char *) ) );

alloc_kernel<<<1, 1>>>(d_bufptr, SIZE, BYTE_C);

CUDA_CALL( cudaDeviceSynchronize() );

CUDA_CALL( cudaMemcpy(&h_bufptr, d_bufptr, sizeof(char *), cudaMemcpyDeviceToHost) );

    h_buf = (char *)malloc(SIZE);

    CUDA_CALL( cudaMemcpy(h_buf, h_bufptr, SIZE, cudaMemcpyDeviceToHost) );

return 0;


In the line where I do cudaMemcpy near the end, that is the place this error is returned. The same error returns for page-locked host buffer.

Can anyone give me a hint on where I did anything wrong? Or is this a bug in the latest driver/CUDA release? Thanks!

My system: Intel Xeon E5507, GTX 480. CUDA 4.1 toolkit. Ubuntu 11.04. Kernel version 2.6.38-13-generic. Driver NVIDIA-Linux-x86_64-285.05.33. This problem is also seen in another CentOS 5.5 system (Intel Xeon E5560, M2070. CUDA 4.1, kernel 2.6.18-194.el5.perfctr). (972 Bytes)

Hi, does someone verify my code’s problem?

To give more information on this, this code works on cuda 4.0 (runtime + driver), but not with cuda 4.1 (runtime + driver).
Even if I use cuda 4.0 runtime with a 4.1’s driver, it still returns the same error. Does this mean there is driver bug somewhere in there?


Check syntax of cudaMemcpy, you are using h_bufptr in this code the different ways - as an address of a pointer and as a pointer.

Hi, Thanks, but I don’t get what you mean.

h_bufptr, defined as a char pointer, is a host variable, whose value is a device memory address.

Therefore, in

CUDA_CALL( cudaMemcpy(&h_bufptr, d_bufptr, sizeof(char *), cudaMemcpyDeviceToHost) );

The device buffer address stored in d_bufptr is transferred into h_bufptr. After that, this device buffer address can be used on the host.

Then, in

CUDA_CALL( cudaMemcpy(h_buf, h_bufptr, SIZE, cudaMemcpyDeviceToHost) );

The device buffer address stored in h_bufptr is used to transfer the in-kernel malloc’ed device buffer to the host side into h_buf.

Could you explain how this contradicts the syntax of cudaMalloc? Thanks!

Sorry, didn’t check your code thoroughly. It can be a mix of host/device pointers along with explicitly defined direction for cudaMemcpy. Have you tried to allocate h_buf via cudaHostAlloc and copy with default direction?

It returns the same error message whether the host buffer is allocated through malloc, cudaMalloc, or cudaHostAlloc (with all different flags).
And it is the same case for cudaMemcpy with cudaMemcpyDefault parameter. (I guess the error message will be a different one, perhaps Invalid Direction if this direction parameter is the problem).

Do you have a CUDA 4.1 environment to verify this bug? Thanks!

Yes, probably there is some bug in CUDA driver, I checked with run-time API (“Invalid argument”) and with driver API (“global function call is not configured”). But as a workaround you may use cudaHostAlloc for h_buf and then “memcpy” inside a kernel to copy from *d_bufptr to h_buf, it works correctly.

I never know there is a “memcpy” function that can be called in a kernel.

Could you point to the relevant sections in the CUDA programming guide about this “memcpy” function?

There is no mention in “CUDA C Programming Guide” about memcpy (probably because it’s not an extention), but compiler handles it as well as other usual C-functions (e.g. memset).