cudaErrorMemoryCopyFailed… but I don't use cudaMemcpy at all?!


I'm trying to create a little test app where the kernel does nothing more than fill an array with values. Unfortunately, kernel execution fails with the error “cudaErrorMemoryCopyFailed”.

I don’t use cudaMemcpy at all before running the kernel… Any ideas what I can do to fix it?

The app is running fine on the device emulator.

Hardware: Pentium D 3.4 GHz / NX8800 GTS

Software: WinXP Pro SP2 / Cuda SDK 0.8 / Driver 97.73 / Compiler Visual Studio 2003



Kernel Code:

__global__ void testKernel(float* g_odata)
{
    const unsigned int count = MEMSIZE / blockDim.x;
    const unsigned int from  = count * threadIdx.x;

    for (unsigned int i = from; i < from + count; i++)
        g_odata[i] = sinf(i * 0.01f);
}



Host Code:


void runTest(int argc, char** argv)
{
    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer(&timer) );
    CUT_SAFE_CALL( cutStartTimer(timer) );

    unsigned int mem_size = sizeof(float) * MEMSIZE;

    // allocate device memory for the result
    float* d_odata;
    CUDA_SAFE_CALL( cudaMalloc((void**) &d_odata, mem_size) );

    // set up execution parameters
    dim3 grid(1, 1, 1);
    dim3 threads(NUMTHREADS, 1, 1);

    // execute the kernel
    testKernel<<< grid, threads, mem_size >>>(d_odata);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed"); // *** THIS CHECK FAILS ***

    // allocate memory for the result on the host side
    float* h_odata = (float*) malloc(mem_size);

    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy(h_odata, d_odata, mem_size, cudaMemcpyDeviceToHost) );

    CUT_SAFE_CALL( cutStopTimer(timer) );
    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    CUT_SAFE_CALL( cutDeleteTimer(timer) );

    writeFile("C:\\out.wav", h_odata, MEMSIZE * sizeof(float));

    // cleanup memory
    free(h_odata);
    CUDA_SAFE_CALL( cudaFree(d_odata) );
}


How large is “mem_size” in your code? The third parameter in the testKernel launch configuration (<<< >>>) specifies the amount of dynamically allocated shared memory per block, but I don't see any use of shared memory in your code. If “mem_size” is larger than 16KB, the launch will fail. It's unclear why it would return that particular error code, but that may be unrelated. The main point is that you don't need a third configuration parameter unless you are using shared memory.

What release are you using? The cudaErrorMemoryCopyFailed error code was removed from the software stack a long time ago and is no longer present in the publicly released beta version.

Thanks for your replies.

MEMSIZE is 4096, multiplied by sizeof(float), so the 16KB boundary is hit exactly. You're right, I forgot to remove the parameter. I'll remove it and see what happens…

I thought I’m using the public release now, I’ll recheck my configuration to be sure. I upgraded today.

Before upgrading, I received “Unknown CUDA error” though, so the error message might still exist in the current public release?

Thanks for your help,


It WORKS :) Thanks for your help!

Note that function parameters are also passed via shared memory, so if your function has any parameters, you don’t have 16KB available.

Also, in general it's usually best to use less than half of the available shared memory per thread block. This allows multiple thread blocks to be active per multiprocessor. If there are multiple thread blocks per multiprocessor, an available block can be swapped in while others are waiting on memory accesses and synchronization. This helps hide memory latency.


Thanks for the hint. And you were right: I was linking against an old cutil library, so the error messages were wrong. I had compiled with the new headers but forgot to recompile the cutil lib.

I've got another problem: when using 512 threads per block (tested with 512-1-1 and 256-2-1), I get an “invalid configuration” error. Since the manual states that a maximum of 512 threads per block is possible, I'm wondering why this doesn't work. Are some threads “reserved” or something like that, so that I can't use the full 512?

many thanks,


I’ve used 512-thread blocks myself, so I’m not sure what the problem is. If you can provide a simple example we can test it here.