How to debug kernel throwing an exception?

I have a kernel with calls to curand_uniform and curand and about 100 lines of code; one block of 1000 threads is launched.

It seems I have hit some sort of limit on the size of my kernel, because when I add more code it throws an “exception: cudaError_enum at memory location xxxxx” when running under the VS 2008 debugger.

It runs if I cut out other sections of code, especially the calls to the curand functions. All the code works if the kernel is small enough (pared down in other places).

When I run it with NSight 3.0, the debugger just terminates with no message.

I am running on a 525M with Optimus and local debugging. Driver v3.20.

How can I find out what capacity limit I’m hitting? Or is it something else?

Further info; kernel code:

curandState_t cstate;
curand_init (1234, tx * 64, 0, &cstate);
float dfrand = curand_uniform (&cstate);

for(int i = 0; i < niterations; i++)
{
	if(tx == 0)
	{
		//dfrand = curand_uniform (&cstate);
		pdfrand[0] = dfrand;
	}
...
		__syncthreads();
...
}
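For context, here is a hedged reconstruction of what such a kernel might look like as a self-contained unit. The kernel name, the shared-memory declaration, and the loop body are my assumptions (the original elides them with "..."); the CURAND calls mirror the snippet above. Note that curand_init for the default XORWOW generator is relatively heavy in registers and stack, which likely accounts for the large "cumulative stack size" in the ptxas output quoted later in the thread.

```cuda
#include <curand_kernel.h>

// Hypothetical reconstruction of the kernel under discussion.
__global__ void spikeSketch(int niterations)
{
    __shared__ float pdfrand[1];        // the original says pdfrand points at shared memory
    int tx = threadIdx.x;

    // Same seed for every thread; a thread-unique sequence number
    // (tx * 64, as in the original) gives each thread an independent stream.
    curandState_t cstate;
    curand_init(1234, tx * 64, 0, &cstate);

    float dfrand = curand_uniform(&cstate);

    for (int i = 0; i < niterations; i++) {
        if (tx == 0) {
            dfrand = curand_uniform(&cstate);   // the call that triggers the exception
            pdfrand[0] = dfrand;
        }
        __syncthreads();    // every thread in the block must reach this barrier
        // ... per-thread work using pdfrand[0] (elided in the original) ...
    }
}
```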

If I uncomment the call to curand_uniform inside the loop, the kernel throws the exception and won’t run at all in NSight.

pdfrand is a pointer to shared memory. The call to curand_uniform inside the loop seems to be causing most of the problem.

ETA: tx is the thread index; 1 block, 1000 threads

I am also getting a similar “exception: cudaError_enum at memory location xxxxx” when I include:

int x = 11111;

int y = (int) (.5 * x);

in my loop.

Does anyone know why CUDA would have a problem casting float to int?

float y = .5 * x;

works just fine.

Further info:

1>ptxas : info : 77736 bytes gmem, 72 bytes cmem[2], 48 bytes cmem[14]
1>ptxas : info : Used 26 registers, 6552 bytes cumulative stack size, 40 bytes cmem[0]

when building the kernel which throws an exception.

If I remove one line:

u += a * (b * v - u);

where all variables are floats, no exception is thrown (incorrect results, of course).

ptxas info is the same.

The occupancy calculator indicates I am not exceeding my cards limits.

Does anyone know how to figure this out?

Further info:

My full kernel can execute with 750 threads, but throws the exception with 812 threads.

Seems like I’m exceeding some memory capacity, but which one?

And how do I find out?

Full ptxas log:

1>ptxas : info : 77736 bytes gmem, 72 bytes cmem[2], 48 bytes cmem[14]
1>ptxas : info : Function properties for __float2int_rz
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas : info : Compiling entry function ‘_Z7cuSpikePfS_iiiS_S_iP6uchar4’ for ‘sm_21’
1>ptxas : info : Function properties for _Z7cuSpikePfS_iiiS_S_iP6uchar4
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas : info : Used 37 registers, 68 bytes cmem[0]
1>ptxas : info : Compiling entry function ‘_Z6cuInitPfiiS_’ for ‘sm_21’
1>ptxas : info : Function properties for _Z6cuInitPfiiS_
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas : info : Used 12 registers, 48 bytes cmem[0]

Your initial post says you were trying to launch one thread block with 1000 threads. There is a hard limit of 768 threads per thread block on the Fermi (sm_2x) architecture, which would seem to be the origin of your issue. The GT 525M is a compute capability 2.1 part (Fermi GF108, I think):

http://www.nvidia.in/object/cuda_gpus_in.html

These kinds of per-architecture hardware limits are documented in an appendix at the back of the Programming Guide, but I don’t have the section number handy. To run with a large number of threads you will need multiple thread blocks. As a rule of thumb, most CUDA code does fine with thread blocks of between 128 and 256 threads each. Obviously there can be cases where one would want to choose a smaller or larger number of threads per block, but the indicated range is usually a good starting point.

Certainly that was the first thing I checked when writing my kernel code:

Page 149 of the programming guide:

Table 10. Technical Specifications per Compute Capability (relevant rows):

Technical Specification                               1.0-1.3   2.x     3.0 / 3.5
Maximum dimensionality of a grid of thread blocks     2         3       3
Maximum x-dimension of a grid of thread blocks        65535     65535   2^31 - 1
Maximum y- or z-dimension of a grid of thread blocks  65535     65535   65535
Maximum dimensionality of a thread block              3         3       3
Maximum x- or y-dimension of a block                  512       1024    1024
Maximum z-dimension of a block                        64        64      64
Maximum number of threads per block                   512       1024    1024

Fermi compute capability 2.1 allows 1024 threads per block.

As I indicated, the code runs with 1000 threads if I pare it down enough.

I took some initialization code out, put it in another kernel, and now it runs with full functionality in 1000 threads.

I just want to know how to determine the per-thread memory requirements of kernel code when the occupancy calculator says I’m not exceeding my device’s capacity for registers or other data storage.

And what is the device capacity for kernel code storage?

Thanks for responding.

Sorry, I should have looked up the table instead of relying on my memory. Yes, 1024 threads per block on sm_2x.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

That same table also states the maximum number of instructions per kernel on sm_2x as 512 million. Each instruction comprises 8 bytes (there are some exceptions on older architectures). When I need to check on kernel code size, I always dump disassembly with cuobjdump --dump-sass, there could be better methods (maybe cuobjdump --dump-elf ?).

If you haven’t done so yet, make sure to check the error status of every CUDA API call and every kernel launch, as an undetected upstream error can lead to weird behavior further downstream. Have you tried running with cuda-memcheck to see whether there are any out-of-bounds accesses or race conditions? When running on a single GPU, kernel run times are limited by operating system watchdog timers, so make sure you don’t exceed the watchdog limit (a few seconds, typically). Have you double-checked the CURAND calls to make sure correct data is passed in? I don’t have personal experience with CURAND. What happens if you run your app outside the debugger, as a release build?

I ran cuobjdump --dump-sass on my code, and the kernel which threw the exception has an ending address of 0x0A88. Does that mean it’s taking 2696 bytes per thread?

And how do I compare that with my device (525M) capacity?

I took all the curand calls out of that kernel in order to get it to run, which it does now with 1000 threads. I am now running the curand calls in a separate kernel and passing the results in via a new buffer.

I just want to know what limit I’m exceeding with too much code in my kernel.

A kernel whose last instruction address is 0xA88 is 0xA90 bytes long, which roughly corresponds to 2700 bytes, as you note. So this is not a particularly large kernel, and it is not anywhere close to the program size limit or the memory capacity of this card. In other words, kernel code size is not your issue.

I re-iterate my advice to put error status checks after every CUDA and CURAND API call, and after every kernel launch. Also check the return status of host allocations. It might also be a good idea to run the app standalone and as a release build to eliminate debugger and profilers as possible source of interference. You may also want to consider building your application from the command line as this way you will know exactly how the code is being built.

OK, I put error checks on every CUDA, CURAND, and kernel call according to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api

No errors were reported.

I then changed my working kernel to include a float cast to int, and the exception is thrown as before.

I will try to run the release build as soon as I get it to link.

What happens when you disable the trapping of C++ exceptions in the debugger? A C++ exception being thrown is not necessarily indicative of anything going wrong with the app; it could be part of the normal program execution. Although the fact that it happens only when the kernel is modified in a particular way does not really jibe with that hypothesis.

I disabled C++ exceptions, and the dump is as follows. Line 528 of spike.cu is the call to the kernel. The app just terminates with no other notice. I’m guessing the memory leaks occur because premature termination bypasses the memory freeing; I don’t get them with normal termination.

Could this be the result of allocating a PBO for Cuda/OpenGL interoperability?

BTW, the kernel code size has grown to 2804 bytes with 2 floats cast to int added.

'StockMFC.exe': Loaded 'C:\Windows\SysWOW64\nvcuda.dll', Binary was not built with debug information.
'StockMFC.exe': Loaded 'C:\Windows\SysWOW64\nvapi.dll'
First-chance exception at 0x7552b9bc in StockMFC.exe: Microsoft C++ exception: cudaError_enum at memory location 0x00bee404..
GPUassert: too many resources requested for launch e:/DataFile/Projects/Correlation/spike.cu 528
Detected memory leaks!
Dumping objects ->
{353} normal block at 0x00CD8EF0, 4 bytes long.
 Data: <    > 88 8E CD 00 
{352} normal block at 0x00CD8E88, 40 bytes long.
 Data: <                > 00 00 00 00 CD CD CD CD 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {339} normal block at 0x00CD9B28, 12 bytes long.
 Data: <>           > 3E 0B 04 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {338} normal block at 0x00CD9AE0, 12 bytes long.
 Data: <j 5         > 6A 0E 35 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {337} normal block at 0x00CD9A98, 12 bytes long.
 Data: <  $         > 94 0C 24 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {336} normal block at 0x00CD9A50, 12 bytes long.
 Data: <,           > 2C 0E 1A 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {335} normal block at 0x00CD9A08, 12 bytes long.
 Data: <D           > 44 0E 13 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {334} normal block at 0x00CD99C0, 12 bytes long.
 Data: <  %         > F8 0D 25 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\plex.cpp(29) : {333} normal block at 0x00CD9908, 124 bytes long.
 Data: <H               > 48 94 CD 00 18 99 CD 00 B8 94 CD 00 C0 98 CD 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {332} normal block at 0x00CD98C0, 12 bytes long.
 Data: <  3         > D2 06 33 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {331} normal block at 0x00CD9878, 12 bytes long.
 Data: <            > 18 0E 09 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {330} normal block at 0x00CD9830, 12 bytes long.
 Data: <H           > 48 04 9C 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {329} normal block at 0x00CD97E8, 12 bytes long.
 Data: <  M         > 14 0E 4D 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {328} normal block at 0x00CD97A0, 12 bytes long.
 Data: <            > B8 09 1E 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {327} normal block at 0x00CD9758, 12 bytes long.
 Data: <            > D8 0C 1D 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {326} normal block at 0x00CD9710, 12 bytes long.
 Data: <(           > 28 0E 17 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {325} normal block at 0x00CD96C8, 12 bytes long.
 Data: <V V         > 56 0E 56 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {324} normal block at 0x00CD9680, 12 bytes long.
 Data: <            > 82 0C 05 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {323} normal block at 0x00CD9638, 12 bytes long.
 Data: <            > FA 0B 0D 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\plex.cpp(29) : {322} normal block at 0x00CD9580, 124 bytes long.
 Data: <            X   > 00 00 00 00 00 00 00 00 18 0D 04 00 58 92 CD 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\map_pp.cpp(69) : {321} normal block at 0x00CD9500, 68 bytes long.
 Data: <                > 00 00 00 00 84 95 CD 00 00 00 00 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\plex.cpp(29) : {320} normal block at 0x00CD9448, 124 bytes long.
 Data: <    p   d       > 00 00 00 00 70 94 CD 00 64 94 CD 00 00 94 CD 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(313) : {319} normal block at 0x00CD9400, 12 bytes long.
 Data: <    X       > 00 00 00 00 58 92 CD 00 00 00 00 00 
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\oleinit.cpp(84) : {310} client block at 0x00CD9380, subtype c0, 68 bytes long.
a CCmdTarget object at $00CD9380, 68 bytes long
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occmgr.cpp(781) : {309} client block at 0x00CD9258, subtype c0, 236 bytes long.
a CCmdTarget object at $00CD9258, 236 bytes long
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occmgr.cpp(143) : {306} client block at 0x00CD90A0, subtype c0, 128 bytes long.
a CCmdTarget object at $00CD90A0, 128 bytes long
Object dump complete.
The thread 'Win32 Thread' (0x62ac0) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x3114) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x63ee4) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x63604) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x623ec) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x642e4) has exited with code 7 (0x7).
The program '[402908] StockMFC.exe: Native' has exited with code 7 (0x7).

I think I just ran into a similar debugging challenge.

The Debug version of my application does not fail while the Release version throws an exception.

NSight reports that the kernel is failing on a memory exception but provides basically zero information about where it was dying in the release kernel. This is on CUDA 5.5 / NSight 3.1 / 320.17.

I suspect gdb on Linux would be better at providing information but I’m on Windows.

So my solution on Windows was to recompile with “Generate Line Number Info” (-lineinfo) enabled and then run cuda-memcheck.

The cuda-memcheck utility reports the SASS address where the exception occurred but with line number info enabled I was surprised to see it also reports the source filename and line number of the exception.

[ Someone at NVIDIA went to the trouble of writing a large PDF on CUDA-Memcheck so I suppose I should read it more carefully. :) ]

OK, I think I have solved this mystery. My kernel which runs is reported to use 26 registers.

When I add code that multiplies 4 floats and casts to 2 ints, the register usage jumps to 39, which I discovered by the method outlined here:

https://devtalk.nvidia.com/default/topic/547194/cuda-programming-and-performance/how-to-get-register-count-before-kernel-launch-/
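The approach from that thread, querying the compiled kernel's per-thread resource usage at run time, can be sketched as follows (the kernel name and signature here are hypothetical):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *out) { /* ... */ }

int main()
{
    // cudaFuncGetAttributes reports the compiled kernel's resource usage
    // before any launch, including registers per thread.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread  : %d\n", attr.numRegs);
    printf("local memory / thread : %zu bytes\n", attr.localSizeBytes);
    printf("shared memory / block : %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```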

It is also revealed in the verbose compiler report, but it was hard to discern among all the other kernels’ register usage reports.

39 registers is too many for 1000 threads. I still don’t know why multiplying the floats and casting to int would take 13 additional registers.

If a kernel cannot be launched because it requires too many resources (registers, shared memory, etc.), this should be reflected in the error status of the kernel launch. Yet you stated above that no error was reported. If you add the following macro to your code and invoke it right after the call to your kernel, does it report an error?

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

I stated that an error was reported if C++ exceptions are turned off in Visual Studio debugging.

If C++ exceptions are caught, the application terminates before any error is reported.

As stated in my last post, too many registers are used for 1000 threads in the kernels which fail. I did not catch this because I failed to understand that register usage is reported for each kernel in the verbose compiler output, and that simple code additions can add 50% more register usage (i.e., 26 to 39).
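One way to keep such a kernel launchable at a full-size block is to tell the compiler the intended block size with __launch_bounds__, which makes ptxas cap register usage (spilling to local memory if necessary). The kernel body below is a hypothetical sketch built around the u += a * (b * v - u) update mentioned earlier in the thread:

```cuda
#include <cuda_runtime.h>

// Promise the compiler this kernel is launched with at most 1024 threads
// per block; ptxas then limits registers per thread so a full-size block
// can still launch, spilling to local memory if it has to.
__global__ void __launch_bounds__(1024)
spikeSketch(float *u, const float *v, float a, float b, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        u[tid] += a * (b * v[tid] - u[tid]);
    }
}

// The same effect can be had globally with the nvcc flag -maxrregcount=N,
// but __launch_bounds__ applies per kernel and is usually preferable.
```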