cuda-memcheck identifies libcuda.so as source of a cudaErrorIllegalAddress error

Olumide · December 19, 2017, 1:22am

I’m getting a cudaErrorIllegalAddress error (an illegal memory access) on a CUDA API call to cudaFree. cuda-memcheck identifies libcuda.so as the source of the error, as shown below:

=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x32f753]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudart.so.8.0 (cudaFree + 0x186) [0x3de66]
=========     Host Frame:./myProgram [0x12cb76]
=========     Host Frame:./myProgram [0x15f645]
=========     Host Frame:./myProgram [0x191416]
=========     Host Frame:./myProgram [0x191161]
=========     Host Frame:./myProgram [0x45ea9]
=========     Host Frame:./myProgram [0x37523]
=========     Host Frame:./myProgram [0x2cab7]
=========     Host Frame:./myProgram [0x27718]
=========     Host Frame:./myProgram [0x24c42]
=========     Host Frame:./myProgram [0x20a64]
=========     Host Frame:./myProgram [0x1f0bd]
=========     Host Frame:./myProgram [0x1e8c7]
=========     Host Frame:./myProgram [0x1e39a]
=========     Host Frame:./myProgram [0x1c076]
=========     Host Frame:./myProgram [0x1c043]
=========     Host Frame:./myProgram [0x1bfc8]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf1) [0x203f1]
=========     Host Frame:./myProgram [0xbdfa]

The error was reported after an attempt to free some memory. I did an error check after my last kernel call, before the cudaFree call that failed, but the check did not detect any errors.

I am at a loss as to how to proceed, as the error does not point to any of my kernels but rather to libcuda.so. Are such errors typical? What other debugging options should I be looking at in this case? BTW, I am running CUDA 8 (instead of CUDA 9) because I’ve got a Fermi GPU.

I will post a minimal example as soon as I am able to (the actual code is somewhat complicated).

Update

Thinking that my kernel had accessed invalid memory, I decided to try an allocate, memset and deallocate right after the kernel, like so, hoping one of those calls would expose the error:

myKernel<<<1,1>>>();

int* d_test;

status = cudaMalloc( &d_test , 25 * sizeof(int) );
ERROR_CHECK( status )  // OK

status = cudaMemset( d_test , 25 * sizeof(int) , 0 );
ERROR_CHECK( status )	// OK

status = cudaFree( d_test );
ERROR_CHECK( status ) // error

However, the error occurs only on the cudaFree call. I’ve got several other cudaFree calls. The first cudaFree call always generates this error.

Robert_Crovella · December 19, 2017, 2:15am

This call doesn’t look correct to me:

status = cudaMemset( d_test , 25 * sizeof(int) , 0 );

but I don’t think that it has anything to do with what you are reporting.
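For reference, the runtime API signature is cudaMemset(devPtr, value, count), with the byte value second and the size in bytes third. The call quoted above passes 25 * sizeof(int) as the value and 0 as the count, so it asks for zero bytes to be set, which would explain why it returned no error. A minimal sketch of the presumably intended call follows; the helper name countNonZeroAfterMemset is made up for illustration and is not from the original code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: zero-fills n ints on the device, copies them back,
// and returns how many are non-zero (0 if the memset worked, -1 on any
// CUDA error).
int countNonZeroAfterMemset(int n) {
    int* d_test = nullptr;
    if (cudaMalloc(&d_test, n * sizeof(int)) != cudaSuccess) return -1;

    // Argument order: destination pointer, byte value, size in bytes.
    cudaError_t status = cudaMemset(d_test, 0, n * sizeof(int));

    // Start the host buffer at a non-zero value so the copy is observable.
    std::vector<int> h_test(n, -1);
    if (status == cudaSuccess) {
        status = cudaMemcpy(h_test.data(), d_test, n * sizeof(int),
                            cudaMemcpyDeviceToHost);
    }
    cudaFree(d_test);
    if (status != cudaSuccess) return -1;

    int nonZero = 0;
    for (int v : h_test) {
        if (v != 0) ++nonZero;
    }
    return nonZero;
}

int main() {
    printf("non-zero elements after memset: %d\n", countNonZeroAfterMemset(25));
    return 0;
}
```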

My guess is that you have an illegal access occurring in a kernel. To test, try adding this just prior to the cudaFree call that returns the error:

status=cudaDeviceSynchronize();
ERROR_CHECK( status )

If the error moves to that call, then you have the tiger by the tail - work backwards, or just do rigorous error checking.
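The reason this test works is that kernel launches are asynchronous: an error check placed right after the launch normally sees only launch failures (a bad configuration, for example), while a fault during execution is reported by a later synchronizing call, which may be an unrelated cudaFree. Here is a minimal sketch of that behaviour. faultyKernel and launchAndCheck are hypothetical names, and the kernel writes through a null pointer on purpose as a stand-in for the real bug.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately faulty kernel (hypothetical stand-in for myKernel):
// it writes through whatever pointer it is given.
__global__ void faultyKernel(int* p) {
    *p = 42;
}

// Launches the faulty kernel with a null pointer and records the status
// seen right after the launch and the status seen after synchronizing.
void launchAndCheck(cudaError_t* launchStatus, cudaError_t* syncStatus) {
    faultyKernel<<<1, 1>>>(nullptr);

    // Reports launch problems; the kernel may not have run yet.
    *launchStatus = cudaGetLastError();

    // Blocks until the kernel finishes; execution faults surface here.
    *syncStatus = cudaDeviceSynchronize();
}

int main() {
    cudaError_t launchStatus, syncStatus;
    launchAndCheck(&launchStatus, &syncStatus);

    printf("after launch: %s\n", cudaGetErrorString(launchStatus));
    printf("after sync:   %s\n", cudaGetErrorString(syncStatus));
    return 0;
}
```

Checking both cudaGetLastError() and cudaDeviceSynchronize() after each kernel while debugging pins the failure to the kernel that caused it. The synchronize call can be removed again once the bug is found, since it serializes host and device.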

Olumide · December 19, 2017, 6:59pm

You’re right! I sandwiched my kernel between two cudaDeviceSynchronize() calls, like so:

status = cudaDeviceSynchronize();
CHECK_ERROR( status )    // OK

myKernel<<<1,1>>>();

status = cudaDeviceSynchronize();
CHECK_ERROR( status )   // NOT OK

Clearly the problem is caused by my kernel, and I’m sort of pleased that’s the case. I’d rather fix my kernel than be stuck with a broken runtime library.

Thanks txBob

Update
Found and fixed the bug in my kernel :)