So I’ve got a kernel that’s getting rather lengthy source-code-wise, but to start with it works just dandy. (VS tells me 12636 lines, but I comment “big-ly” too, so it’s not really that much working code.)
Now I need it to do a few more things, so I go to paste in a new chunk of code and suddenly the kernel will no longer launch!? It tries to launch but returns CUresult = 209, which, peeking into cuda.h, is CUDA_ERROR_NO_BINARY_FOR_GPU, along with some vague comments about “no kernel image suitable for the device.” I haven’t touched any compile parameters, mind you. Just added some code and clicked “Debug” (yes, that compiles it too; yes, I checked). Now if I remove my new lines of source code it magically launches fine again. I know what you’re thinking: “clearly that code has errors.” I thought that too at first, except the exact same chunk of code gets executed in 50 other places higher up in the kernel. (Think: hash algorithm. I make a small handful of vars to start and then do a big long list of math operations that are performed on those same variables over and over.)
Here is the real tricky part I’ve found, which also leads me to believe it’s not a coding-error issue: I can remove a chunk of source code from anywhere in the kernel and it will start working again. Very start, tail end, a spot in the middle chosen at random; it doesn’t matter what lines I remove. So long as what gets pulled out doesn’t cause compile errors, I can hack out any piece of this kernel and it will resume launching successfully. WTF???
I also tried to narrow down exactly how many lines of code would still work, and instead found a wonderful narrow band of gray area where I could compile, run, and debug and have it work, then, without changing a single thing in the code, click compile, run, and debug again and have it fail! I repeat, WTF???
This all makes me lean towards a resource limit that’s getting hit. Now, I’ve hit various resource limits before with this code, but for all of them either the compiler warned me or Memory Checker caught them, and I could measure what “it” was, change the design, and get back well below the limits. The compiler currently reports nothing wrong with the kernel in both the working state and the broken state. Memory Checker complains about nothing when it’s working, and when it’s broken the kernel won’t even launch, so Memory Checker is useless there. So my next best guess is that I could be hitting some limit the compiler doesn’t check for. However, I have searched far and wide and cannot find ANYthing about my kernel that is even remotely approaching a resource limit… Here is what I know to check:
Some Particulars:
Quadro 2000 (1024 MB), Driver v376.33
GF106GL, 4 SMs , CC = SM_21
Shared Memory = 49152 bytes
Constant Memory = 65536 bytes
Total Global Memory = 1073741824 bytes
CUDA 8.0 installed
So the easy things to check first:
Launch config = 4 blocks of 512 threads
SM Block Limit = I run 1 block per SM, Limit is 8
SM Thread Limit = I run 512 threads per SM, Limit is 1536
Now this is what gets spit out with --ptxas-options=-v when I compile:
1> ptxas info : 0 bytes gmem, 232 bytes cmem[2]
1> ptxas info : Compiling entry function 'mykernel' for 'sm_20'
1> ptxas info : Function properties for mykernel
1> 1792 bytes stack frame, 188 bytes spill stores, 424 bytes spill loads
1> ptxas info : Used 63 registers, 1792 bytes cumulative stack size, 32768 bytes smem, 48 bytes cmem[0], 1 textures
So I’m interpreting that info to mean:
Compute Capability = I have it set to CC2.0, my hardware is capable of CC2.1
(Yep, tried compiling for 2.1. Made no difference.)
Shared Memory Limit = I use 32768 bytes smem, Limit is 49152
Constant Memory Limit = I use 232 bytes of cmem[2] plus 48 bytes of cmem[0]. I’m not really sure why it lists two amounts (apparently the banks hold different things, e.g. kernel arguments versus constant data), but even adding the two together doesn’t get remotely close to the 65536-byte limit.
Global Memory Limit = It lists 0 bytes gmem, but my understanding is that the stack frame (and spills on top of it) lives in local memory, which is backed by global memory (assuming L1/L2 caches both completely miss).
So worst case 1792 + 188 = 1980 bytes per thread, while the global memory limit is supposedly some 1073741824 bytes. That usage seems suspiciously minuscule compared to the limit, but even if every one of my 2048 running threads used that amount of stack frame and spill, it would still only be 4055040 bytes (aka a hair under 4 MB out of 1 GB). Also, I should note that I originally ran with a stack frame size up around 2400-ish, then chopped in half a couple of arrays that were sized for worst-case inputs, which got it down to that 1792 number, but the issue is the same regardless of the stack frame size.
Register Limit = I use 63 registers per thread; the limit is exactly 63. I max these out deliberately for performance. My understanding per the CUDA docs is that doing so is perfectly acceptable, and anything that doesn’t fit in a register naturally spills to local memory (thus taking a performance hit). However, I have also tried limiting the registers just in case you need to leave some free for “something” (I don’t know what!). Did a go of it with -maxrregcount=32 and still see the same behavior; the only effect was that my spills naturally shot up.
Also some more obscure Limits I’ve pried from the clutches of “The Google”:
There is a Max number of instructions per kernel (SASS, I assume?)
As well as possibly a max number of PTX instructions (only found mentioned on this forum.)
Max instructions per kernel = I’m generating 41600 SASS instructions; the limit is 512 million.
I performed a cuobjdump -sass on my .cubin file to get that number, and the limit for CC2.0 and up is per Wikipedia.
Max PTX instructions = fewer than 314270; the limit mentioned was 2 million.
Not sure this is a real thing, but I opened the .ptx file that gets spit out during compile anyway and found it to be that many lines long. So even if every line in that file were a PTX instruction, which doesn’t seem to be the case (lots of lines that are just “}”), it would still be well under a 2 million limit.
Also, since it seems highly relevant to the error message, here is the full build command used to compile (Visual Studio 2013 comes up with this for me based on the project settings, so it’s not like I could fat-finger it even if I wanted to):
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -G --keep --keep-dir Debug -maxrregcount=0 --ptxas-options=-v --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /FS /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "D:\Code\kernel.cu"
So here’s where I’m at: Kernel works. Add code. Boom. Remove code. Kernel works.
Also, yes, I did search this forum with every keyword and error message I could think of, and 4 or 5 pages into the search results they definitely seemed, ahem, “less than relevant” to what I typed in. (#aintnogoogle) However, I did find this post on top of one such search-result pile which sounds eerily similar in nature, but he gets a different error and there really isn’t anything helpful there:
I’m out of ideas and things to check, so I’m throwing up a hail mary here, any ideas?
Did I miss something crucial?
Is there a limit I haven’t checked yet?
Aliens?
http://s2.quickmeme.com/img/67/67fffb91c3cc4ab9c0137383fe0ef02059b01ca3015a53c7de2e55c8bcc2361e.jpg
I hereby humbly await your bequeathal of knowledge…