I’m using the cudaError_t value returned by cudaFree to work out what could be causing the memory build-up I’m seeing.
The cudaMalloc for each of the six arrays returns cudaSuccess, but when I try to deallocate them with cudaFree, all six calls return an error code my switch doesn’t recognise, i.e. not cudaSuccess, cudaErrorInvalidDevicePointer or cudaErrorInitializationError:
switch (err)   // err holds the cudaError_t returned by cudaFree
{
    case cudaErrorInvalidDevicePointer : printf("\n x cudaErrorInvalidDevicePointer");
        break;
    case cudaSuccess : printf("\n x cudaSuccess");
        break;
    case cudaErrorInitializationError : printf("\n x cudaErrorInitializationError");
        break;
    default : printf("\n x return not recognised");
        break;
}
But for 1024x1024 particles it appears that cudaFree is not deallocating fully.
For the first kernel, with peak memory usage of 11%, cudaFree deallocates all the memory.
For the second kernel, with peak memory usage of 80%, the allocations through cudaMalloc succeed but cudaFree returns an unrecognised cudaError_t value.
Use the CUDA_SAFE_CALL(cudafunctionhere) macro in debug mode; it should print a more detailed error description. Worst case it will just say “unknown error”, but it may catch an error you’re not testing for in your switch statement.
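If you don’t have that macro available, something along these lines does much the same job (just a sketch, not the SDK’s exact definition; it uses cudaGetErrorString for the readable message, and d_array is a placeholder name):

#include <stdio.h>
#include <cuda_runtime.h>

// Rough stand-in for CUDA_SAFE_CALL: print file, line and a readable
// description whenever a runtime call does not return cudaSuccess.
#define CUDA_SAFE_CALL(call)                                        \
    do {                                                            \
        cudaError_t e = (call);                                     \
        if (e != cudaSuccess)                                       \
            printf("CUDA error at %s:%d: %s\n",                     \
                   __FILE__, __LINE__, cudaGetErrorString(e));      \
    } while (0)

// e.g. CUDA_SAFE_CALL(cudaFree(d_array));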
That corresponds to cudaErrorLaunchFailure. It seems that your kernel fails to launch for some reason. cudaFree can return errors from previous asynchronous launches, so it might just be returning the error from the kernel.
Try the following code right after your kernel. That way, cudaFree() shouldn’t return any error.
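Something along these lines (the kernel name, launch configuration and argument are placeholders for your own):

myKernel<<<grid, block>>>(d_data);

// Block until the kernel has finished, so that any launch error is
// reported here rather than by the next cudaFree().
cudaThreadSynchronize();

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));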
I noticed that with the 1024x1024 particle problem, the first kernel that is not deallocated shows 80% memory use once all the cudaMalloc calls have been made. With a 1024x512 particle problem, which I have just run, memory use at the same stage was only 15%, and with the 512x512 particle problem it is only 11%.
Why is there such a jump for the 1024x1024 problem?
And does this large memory use for that problem leave enough memory for the switching between blocks?
OK, I put cudaThreadExit() and cudaThreadSynchronize() after the kernel. Now cudaGetLastError reports an unspecified launch failure, and all six calls to cudaFree() return cudaErrorInvalidDevicePointer, even though the six cudaMalloc calls that allocated the memory on the GPU in the first place returned cudaSuccess.
Is the amount of data being passed simply too big for the GPU, with the compiler not picking it up? As stated earlier, it seems odd that after six calls to cudaMalloc I get 80% memory use, which does seem a lot. Is there a limit to the amount of memory one can allocate on the GPU?
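One way to see how much headroom is actually left right after the six cudaMalloc calls is something like this (a sketch using cudaMemGetInfo, assuming the toolkit version provides it):

size_t free_bytes = 0, total_bytes = 0;

// Query free vs. total device memory straight after the allocations.
cudaMemGetInfo(&free_bytes, &total_bytes);

printf("device memory: %lu of %lu bytes free (%.1f%% used)\n",
       (unsigned long)free_bytes, (unsigned long)total_bytes,
       100.0 * (double)(total_bytes - free_bytes) / (double)total_bytes);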
Sorry to bother you… I found the error… I was searching in the wrong places.
In a few places I called cudaMemcpy with a transfer size larger than the array I was reading from.
The cudaFree that threw the error came directly after one of these cudaMemcpy calls.
BUT: cudaGetLastError() didn’t report anything, and the cudaMemcpy itself returned cudaSuccess as well… so I ignored that call…
After correcting this mistake, I no longer get any “cudaErrorLaunchFailure” on any cudaFree call.
So this mistake in the cudaMemcpy affected cudaFree…
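To illustrate, the pattern was roughly like this (names and sizes are made up, not my actual code):

float *d_data;
cudaMalloc((void**)&d_data, 256 * sizeof(float));   // device array holds 256 floats

float h_data[512];

// Bug: the copy reads 512 floats from an array that only holds 256.
// In my case the cudaMemcpy itself still returned cudaSuccess and
// cudaGetLastError() reported nothing...
cudaMemcpy(h_data, d_data, 512 * sizeof(float), cudaMemcpyDeviceToHost);

// ...and the error only showed up here, on the very next cudaFree.
cudaFree(d_data);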
@chrismc: have you checked your cudaMemcpy calls? Maybe you have the same bug as me…
You are welcome to discuss what might have happened backstage here ;)