I have a kernel with calls to curand_uniform and curand and about 100 lines of code; 1 block of 1000 threads is launched.
It seems I have hit some sort of size limit for my kernel, because when I add more code it throws an "exception: cudaError_enum at memory location xxxxx" when running under the VS 2008 debugger.
It runs if I cut out other sections of code, especially the calls to curand functions. All the code works if the kernel is small enough (pared down in other places).
When I run it with NSight 3.0, the debugger just terminates with no message.
I am running on a 525M with Optimus and local debugging. Driver v3.20.
How can I find out what capacity limit I’m hitting? Or is it something else?
Your initial post says you were trying to launch one thread block with 1000 threads. There is a hard limit of 768 threads per thread block on the Fermi (sm_2x) architecture, which would seem to be the origin of your issue. The GT 525M is a compute capability 2.1 part (Fermi GF108, I think):
These kinds of per-architecture hardware limits are documented in an appendix at the back of the Programming Guide, but I don’t have the section number handy. To run with a large number of threads you will need to run with multiple thread blocks. As a rule of thumb, most CUDA code will do fine when using thread blocks that comprise between 128 and 256 threads each. Obviously there can be cases where one would want to choose a smaller or larger number of threads per block, but the indicated range is usually a good starting point.
Certainly that was the first thing I checked when writing my kernel code:
Page 149 of the programming guide:
Table 10. Technical Specifications per Compute Capability (excerpt):

Maximum dimensionality of grid of thread blocks: 2 (CC 1.x), 3 (CC 2.x and later)
Maximum x-dimension of a grid of thread blocks: 65535 (CC 1.x, 2.x), 2^31-1 (CC 3.0, 3.5)
Maximum y- or z-dimension of a grid of thread blocks: 65535
Maximum dimensionality of a thread block: 3
Maximum x- or y-dimension of a block: 512 (CC 1.x), 1024 (CC 2.x and later)
Maximum z-dimension of a block: 64
Maximum number of threads per block: 512 (CC 1.x), <b>1024</b> (CC 2.x and later)
So a compute capability 2.1 (Fermi) device allows 1024 threads per block.
As I indicated, the code runs with 1000 threads if I pare it down enough.
I took some initialization code out, put it in another kernel, and now it runs with full functionality at 1000 threads.
I just want to know how to determine the per-thread memory requirements of kernel code when the occupancy calculator says I’m not exceeding my device’s capability with registers or other data storage.
And what is the device capacity for kernel code storage?
That same table also states the maximum number of instructions per kernel on sm_2x as 512 million. Each instruction comprises 8 bytes (there are some exceptions on older architectures). When I need to check on kernel code size, I always dump the disassembly with cuobjdump --dump-sass; there could be better methods (maybe cuobjdump --dump-elf?).
If you haven’t done so yet, make sure to check the error status of every CUDA API call and every kernel launch, as an undetected upstream error can lead to weird behavior further downstream. Have you tried running with cuda-memcheck to see whether there are any out-of-bounds accesses or race conditions? When running on a single GPU, kernel run times are limited by operating system watchdog timers, so make sure you don’t exceed the watchdog limit (a few seconds, typically). Have you double-checked the CURAND calls to make sure correct data is passed in? I don’t have personal experience with CURAND. What happens if you run your app outside the debugger, as a release build?
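For reference, a typical invocation looks like this (the file name `StockMFC.exe` matches the app in this thread; cuobjdump can also be pointed at a .cubin or object file):

```shell
# Dump the SASS disassembly of every kernel embedded in the binary;
# the last instruction address gives an upper bound on the code size.
cuobjdump --dump-sass StockMFC.exe

# ELF-level view, which also shows per-section sizes
cuobjdump --dump-elf StockMFC.exe
```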
I ran the cuobjdump --dump-sass on my code and the kernel which threw the exception has an ending address of 0A88. Does that mean it’s taking 2696 bytes per thread?
And how do I compare that with my device (525M) capacity?
I took all the curand calls out of that kernel in order to get it to run, which it does now with 1000 threads. I am now running the curand calls in a separate kernel and passing the results in via a new buffer.
I just want to know what limit I’m exceeding with too much code in my kernel.
A kernel whose last instruction is at address 0xA88 is 0xA90 bytes long, which roughly corresponds to 2700 bytes, as you note. Note also that kernel code is stored once per kernel, not replicated per thread. So this is not a particularly large kernel, and it is nowhere near the program size limit or the memory capacity of this card. In other words, kernel code size is not your issue.
I reiterate my advice to put error status checks after every CUDA and CURAND API call, and after every kernel launch. Also check the return status of host allocations. It might also be a good idea to run the app standalone and as a release build to eliminate the debugger and profiler as possible sources of interference. You may also want to consider building your application from the command line, as that way you will know exactly how the code is being built.
What happens when you disable the trapping of C++ exceptions in the debugger? A C++ exception being thrown is not necessarily indicative of anything going wrong with the app; it could be part of normal program execution. Although the fact that it happens only when the kernel is modified in a particular way does not really jibe with that hypothesis.
I disabled C++ exception trapping, and the dump is as follows. Line 528 of spike.cu is the call to the kernel. The app just terminates with no other notice. I’m guessing the memory leaks occur because premature termination bypasses the memory-freeing code, since I don’t get them with normal termination.
Could this be the result of allocating a PBO for CUDA/OpenGL interoperability?
BTW, the kernel code size has grown to 2804 bytes after adding two floats cast to int.
'StockMFC.exe': Loaded 'C:\Windows\SysWOW64\nvcuda.dll', Binary was not built with debug information.
'StockMFC.exe': Loaded 'C:\Windows\SysWOW64\nvapi.dll'
First-chance exception at 0x7552b9bc in StockMFC.exe: Microsoft C++ exception: cudaError_enum at memory location 0x00bee404..
GPUassert: too many resources requested for launch e:/DataFile/Projects/Correlation/spike.cu 528
Detected memory leaks!
Dumping objects ->
{353} normal block at 0x00CD8EF0, 4 bytes long.
Data: < > 88 8E CD 00
{352} normal block at 0x00CD8E88, 40 bytes long.
Data: < > 00 00 00 00 CD CD CD CD 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {339} normal block at 0x00CD9B28, 12 bytes long.
Data: <> > 3E 0B 04 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {338} normal block at 0x00CD9AE0, 12 bytes long.
Data: <j 5 > 6A 0E 35 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {337} normal block at 0x00CD9A98, 12 bytes long.
Data: < $ > 94 0C 24 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {336} normal block at 0x00CD9A50, 12 bytes long.
Data: <, > 2C 0E 1A 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {335} normal block at 0x00CD9A08, 12 bytes long.
Data: <D > 44 0E 13 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {334} normal block at 0x00CD99C0, 12 bytes long.
Data: < % > F8 0D 25 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\plex.cpp(29) : {333} normal block at 0x00CD9908, 124 bytes long.
Data: <H > 48 94 CD 00 18 99 CD 00 B8 94 CD 00 C0 98 CD 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {332} normal block at 0x00CD98C0, 12 bytes long.
Data: < 3 > D2 06 33 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {331} normal block at 0x00CD9878, 12 bytes long.
Data: < > 18 0E 09 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {330} normal block at 0x00CD9830, 12 bytes long.
Data: <H > 48 04 9C 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {329} normal block at 0x00CD97E8, 12 bytes long.
Data: < M > 14 0E 4D 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {328} normal block at 0x00CD97A0, 12 bytes long.
Data: < > B8 09 1E 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {327} normal block at 0x00CD9758, 12 bytes long.
Data: < > D8 0C 1D 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {326} normal block at 0x00CD9710, 12 bytes long.
Data: <( > 28 0E 17 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {325} normal block at 0x00CD96C8, 12 bytes long.
Data: <V V > 56 0E 56 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {324} normal block at 0x00CD9680, 12 bytes long.
Data: < > 82 0C 05 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(923) : {323} normal block at 0x00CD9638, 12 bytes long.
Data: < > FA 0B 0D 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\plex.cpp(29) : {322} normal block at 0x00CD9580, 124 bytes long.
Data: < X > 00 00 00 00 00 00 00 00 18 0D 04 00 58 92 CD 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\map_pp.cpp(69) : {321} normal block at 0x00CD9500, 68 bytes long.
Data: < > 00 00 00 00 84 95 CD 00 00 00 00 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\plex.cpp(29) : {320} normal block at 0x00CD9448, 124 bytes long.
Data: < p d > 00 00 00 00 70 94 CD 00 64 94 CD 00 00 94 CD 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occcont.cpp(313) : {319} normal block at 0x00CD9400, 12 bytes long.
Data: < X > 00 00 00 00 58 92 CD 00 00 00 00 00
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\oleinit.cpp(84) : {310} client block at 0x00CD9380, subtype c0, 68 bytes long.
a CCmdTarget object at $00CD9380, 68 bytes long
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occmgr.cpp(781) : {309} client block at 0x00CD9258, subtype c0, 236 bytes long.
a CCmdTarget object at $00CD9258, 236 bytes long
f:\dd\vctools\vc7libs\ship\atlmfc\src\mfc\occmgr.cpp(143) : {306} client block at 0x00CD90A0, subtype c0, 128 bytes long.
a CCmdTarget object at $00CD90A0, 128 bytes long
Object dump complete.
The thread 'Win32 Thread' (0x62ac0) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x3114) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x63ee4) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x63604) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x623ec) has exited with code 7 (0x7).
The thread 'Win32 Thread' (0x642e4) has exited with code 7 (0x7).
The program '[402908] StockMFC.exe: Native' has exited with code 7 (0x7).
I think I just ran into a similar debugging challenge.
The Debug version of my application does not fail while the Release version throws an exception.
NSight reports that the kernel is failing on a memory exception but provides basically zero information about where it was dying in the release kernel. This is on CUDA 5.5 / NSight 3.1 / 320.17.
I suspect gdb on Linux would be better at providing information but I’m on Windows.
So my solution on Windows was to recompile with “Generate Line Number Info” (-lineinfo) enabled and then run cuda-memcheck.
The cuda-memcheck utility reports the SASS address where the exception occurred but with line number info enabled I was surprised to see it also reports the source filename and line number of the exception.
[ Someone at NVIDIA went to the trouble of writing a large PDF on CUDA-Memcheck so I suppose I should read it more carefully. :) ]
If a kernel cannot be launched because it requires too many resources (registers, shared memory, etc.), this should be reflected in the error status for the kernel launch. Yet you stated above that no error was reported. If you add the following macro to your code and invoke it right after the call to your kernel, does it report an error?
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                              \
do {                                                                      \
    /* Check synchronous errors, i.e. pre-launch */                       \
    cudaError_t err = cudaGetLastError();                                 \
    if (cudaSuccess != err) {                                             \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",     \
                __FILE__, __LINE__, cudaGetErrorString(err));             \
        exit(EXIT_FAILURE);                                               \
    }                                                                     \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */             \
    err = cudaThreadSynchronize();                                        \
    if (cudaSuccess != err) {                                             \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",     \
                __FILE__, __LINE__, cudaGetErrorString(err));             \
        exit(EXIT_FAILURE);                                               \
    }                                                                     \
} while (0)
I stated that an error was reported once C++ exception trapping was turned off in Visual Studio debugging.
When C++ exceptions are caught, the application terminates before any error is reported.
As stated in my last post, too many registers are used for 1000 threads in the kernels that fail. I did not catch this because I failed to understand that register usage is reported for each kernel in the verbose compiler output, and that simple code additions can increase register usage by 50% (i.e., from 26 to 39 registers per thread).