I'm trying to create a little test app whose kernel does nothing other than fill up an array with zeroes. Unfortunately, kernel execution fails with the error "cudaErrorMemoryCopyFailed".
I don't use cudaMemcpy at all before running the kernel… Any ideas what I can do to fix it?
The app is running fine on the device emulator.
Hardware: Pentium D 3.4 GHz / NX8800 GTS
Software: WinXP Pro SP2 / CUDA SDK 0.8 / Driver 97.73 / Compiler: Visual Studio 2003
Thanks,
Nils
Kernel Code:
__global__ void testKernel(float* g_odata)
{
    // Each thread fills a contiguous chunk of MEMSIZE / blockDim.x elements.
    const unsigned int count = MEMSIZE / blockDim.x;
    const unsigned int from  = count * threadIdx.x;
    for (unsigned int i = from; i < from + count; i++)
    {
        g_odata[i] = sin(i * 0.01f);
    }
}
Host Code:
void
runTest(int argc, char** argv)
{
    CUT_CHECK_DEVICE();

    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    unsigned int mem_size = sizeof(float) * MEMSIZE;

    // allocate device memory for the result
    float* d_odata;
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_odata, mem_size));

    // set up execution parameters
    dim3 grid(1, 1, 1);
    dim3 threads(NUMTHREADS, 1, 1);

    // execute the kernel
    testKernel<<< grid, threads, mem_size >>>(d_odata);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed"); // *** THIS CHECK FAILS ***

    // allocate memory for the result on the host side
    float* h_odata = (float*) malloc(mem_size);

    // copy the result from device to host
    CUDA_SAFE_CALL(cudaMemcpy(h_odata, d_odata, mem_size, cudaMemcpyDeviceToHost));

    CUT_SAFE_CALL(cutStopTimer(timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));

    // note: the backslash must be escaped in a C string literal
    writeFile("C:\\out.wav", h_odata, MEMSIZE * sizeof(float));

    // clean up memory
    free(h_odata);
    CUDA_SAFE_CALL(cudaFree(d_odata));
}
How large is "mem_size" in your code? The third parameter in the kernel configuration (<<< >>>) requests dynamic shared memory, but I don't see any use of shared memory in your code. If "mem_size" is larger than 16 KB, the launch will fail. It's unclear why it would return that particular error code, but that might be unrelated. The main point is that you don't need a third configuration parameter unless you are actually using shared memory.
What release are you using? The cudaErrorMemoryCopyFailed error code was removed from the software stack a long time ago, and is no longer present in the publicly released beta version.
@ngoodnight
MEMSIZE is 4096, multiplied by sizeof(float), so the 16 KB boundary is reached exactly. You're right, I forgot to remove the parameter. I'll remove it and see what happens…
@baarts
I thought I was using the public release; I'll recheck my configuration to be sure. I upgraded today.
Before upgrading I received "Unknown CUDA error", though, so the error message might still exist in the current public release?
Note that function parameters are also passed via shared memory, so if your function has any parameters, you don’t have 16KB available.
Also, in general it's usually best to use less than half of the available shared memory per thread block. This allows multiple thread blocks to be active per multiprocessor. When several blocks are resident, a ready block can be swapped in while others are waiting on memory accesses or synchronization. This helps hide memory latency.
Thanks for the hint. And you're right: I was linking against an old cutil library, so the error messages were wrong. I had compiled with the new headers but forgot to recompile the cutil lib.
I've got another problem: when using 512 threads per block (tested with 512-1-1 and 256-2-1), I get an "invalid configuration" error. Since the manual states that a maximum of 512 threads per block is possible, I'm wondering why this doesn't work. Are there some "reserved" threads or something like that, so that I can't use the full 512?