cudaErrorMemoryCopyFailed… but I don't use cudaMemcpy at all?!

Hello,

I'm trying to create a little test app whose kernel does nothing more than fill an array. Unfortunately, kernel execution fails with the error "cudaErrorMemoryCopyFailed".

I don’t use cudaMemcpy at all before running the kernel… Any ideas what I can do to fix it?

The app is running fine on the device emulator.

Hardware: Pentium D 3.4 GHz / NX8800 GTS

Software: WinXP Pro SP2 / Cuda SDK 0.8 / Driver 97.73 / Compiler Visual Studio 2003

Thanks,

Nils

Kernel Code:

__global__ void testKernel(float* g_odata)
{
    const unsigned int count = MEMSIZE / blockDim.x;
    const unsigned int from  = count * threadIdx.x;

    for (unsigned int i = from; i < from + count; i++)
    {
        g_odata[i] = sin(i * 0.01f);
    }
}

Host Code:

void
runTest(int argc, char** argv)
{
    CUT_CHECK_DEVICE();

    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer(&timer));
    CUT_SAFE_CALL( cutStartTimer(timer));

    unsigned int mem_size = sizeof(float) * MEMSIZE;

    // allocate device memory for the result
    float* d_odata;
    CUDA_SAFE_CALL( cudaMalloc((void**) &d_odata, mem_size));

    // setup execution parameters
    dim3 grid(1, 1, 1);
    dim3 threads(NUMTHREADS, 1, 1);

    // execute the kernel
    testKernel<<< grid, threads, mem_size >>>(d_odata);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed"); // *** THIS CHECK FAILS ***

    // allocate memory for the result on the host side
    float* h_odata = (float*) malloc(mem_size);

    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy(h_odata, d_odata, mem_size, cudaMemcpyDeviceToHost));

    CUT_SAFE_CALL( cutStopTimer(timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    CUT_SAFE_CALL( cutDeleteTimer(timer));

    writeFile("C:\\out.wav", h_odata, MEMSIZE * sizeof(float));

    // cleanup memory
    free(h_odata);
    CUDA_SAFE_CALL( cudaFree(d_odata));
}

How large is "mem_size" in your code? The third parameter of the kernel launch configuration (<<< >>>) specifies the amount of dynamically allocated shared memory, but I don't see any use of shared memory in your code. If "mem_size" is larger than 16KB, the launch will fail. It's unclear why that would come back as this particular error code; that might be unrelated. The main point is that you don't need a third configuration parameter unless you are actually using shared memory.
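In other words, a launch like this minimal sketch (using the names from your code above) should be all you need:

    // No third configuration parameter: the kernel declares no __shared__
    // memory, so no dynamic shared memory has to be requested.
    dim3 grid(1, 1, 1);
    dim3 threads(NUMTHREADS, 1, 1);
    testKernel<<< grid, threads >>>(d_odata);
    CUT_CHECK_ERROR("Kernel execution failed");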

What release are you using? The cudaErrorMemoryCopyFailed error code was removed from the software stack a long time ago and is no longer present in the publicly released beta version.
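If you want to see what the runtime itself reports, independent of the cutil macros, a small check along these lines (plain runtime API calls, not taken from your code) prints the error string directly:

    // Query the runtime for the last error and print its message.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Kernel launch failed: %s\n", cudaGetErrorString(err));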

Thanks for your replies.

@ngoodnight
MEMSIZE is 4096; multiplied by sizeof(float) that is exactly 16384 bytes, so the 16k boundary is hit exactly. You're right, I forgot to remove the parameter. I'll remove it and see what happens…

@baarts
I thought I was using the public release; I'll recheck my configuration to be sure. I upgraded today.

Before upgrading I received "Unknown CUDA error", though, so perhaps the error message still exists in the current public release?

Thanks for your help,

Nils

It WORKS :) Thanks for your help!

Note that function parameters are also passed via shared memory, so if your kernel has any parameters, you don't have the full 16KB available.

Also, in general it's usually best to use less than half of the available shared memory per thread block. This allows multiple thread blocks to be active per multiprocessor, so a ready block can be swapped in while others are waiting on memory accesses or synchronization. This helps hide memory latency.
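As a rough illustration, here is a minimal sketch (a hypothetical kernel, not the one from this thread, assuming a device array d_data of at least 4096 floats) that uses dynamically sized shared memory while staying well under half of the 16KB per multiprocessor:

    // Each thread stages one element in shared memory, then writes it back doubled.
    __global__ void scaleKernel(float* g_data)
    {
        extern __shared__ float s_buf[];   // sized by the launch's third parameter

        const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        s_buf[threadIdx.x] = g_data[i];
        __syncthreads();
        g_data[i] = s_buf[threadIdx.x] * 2.0f;
    }

    // Host side: 256 threads * sizeof(float) = 1KB of shared memory per block,
    // far below half of 16KB, so several blocks can be resident per multiprocessor.
    dim3 grid(16, 1, 1);
    dim3 threads(256, 1, 1);
    scaleKernel<<< grid, threads, 256 * sizeof(float) >>>(d_data);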

Mark

Thanks for the hint. And you're right: I was using an old cutil library, so the error messages were wrong. I had compiled with the new headers but forgot to recompile the cutil lib.

I've got another problem: when using 512 threads per block (tested with 512-1-1 and 256-2-1), I get an "invalid configuration" error. Since the manual states that a maximum of 512 threads per block is possible, I'm wondering why this doesn't work. Are there some "reserved" threads or something like that, so that I can't use the full 512?
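Roughly what I'm doing (launch code only, assuming the same testKernel and d_odata as above; MEMSIZE is 4096, so 512 threads divide it evenly):

    dim3 grid(1, 1, 1);
    dim3 threads(512, 1, 1);                       // (256, 2, 1) fails the same way
    testKernel<<< grid, threads >>>(d_odata);
    CUT_CHECK_ERROR("Kernel execution failed");    // reports "invalid configuration"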

many thanks,

Nils

I’ve used 512-thread blocks myself, so I’m not sure what the problem is. If you can provide a simple example we can test it here.

Mark