Can a Kernel be too big?? CUDA_ERROR_NO_BINARY_FOR_GPU error 209

So I’ve got a kernel that’s getting rather lengthy “source code”-wise here, but it does work just dandy to start. (VS tells me 12636 lines but I comment “big-ly” too so its not really that much working code.)

Now I need it to do a few more things so I go to paste in a new chunk of code and suddenly the kernel will no longer launch!? It tries to launch, but returns a CUResult = 209 which, peaking into cuda.h, is CUDA_ERROR_NO_BINARY_FOR_GPU which also has some vague comments about “no kernel image suitable for the device.” I haven’t touched any compile parameters mind you. Just added some code and clicked “Debug” (yes, that compiles it too, yes I checked). Now if I remove my new lines of source code it magically launches fine again. I know what your thinking, “clearly that code has errors” - I thought that too at first, except the exact same chunk of code gets executed 50 other places in the kernel higher up. (Think: hash algorithm, I make a small hand full of vars to start and then do a big long list of math operations that are performed on those same variables over and over.)

Here is the real tricky part I’ve found, which also leads me to believe its not a coding error issue, I can remove a chunk of source code from anywhere in the kernel and it will start working again. Very Start, Tail End , Location in the Middle Chosen at Random, doesn’t matter what lines I remove. So long as what gets pulled out doesn’t cause compile errors I can hack out any piece of this kernel and it will resume launching successfully. WTF???

I also tried to narrow down exactly how many lines of code would still work, and instead found this wonderful narrow band of gray area where I could actually compile, run and debug and have it work, then without changing a single thing in the code just click compile, run and debug again and have it fail! I repeat, WTF???

This all makes me lean more towards a resource limit that’s getting hit. Now I’ve hit various resource limits before with this code, but I think all of them either the compiler has warned me about or Memory Checker has caught and I could measure what was “it” was and change the design and got them back well below the limits. The compiler reports nothing wrong with the kernel currently in both the working state and the broken state. And Memory Checker complains of nothing when its working and when its broken it won’t even launch so MemoryChecker is useless. So my next best guess is I could be hitting some limit the compiler doesn’t check for. However, I have searched far and wide but cannot find ANYthing about my kernel that is even remotely approaching a resource limit… Here is what I know to check:

Some Particulars:
Quadro 2000 (1024GB) Driver v376.33

GF106GL, 4 SMs , CC = SM_21
Shared Memory = 49152 bytes
Constant Memory = 65536 bytes
Total Global Memory = 1073741824 bytes
CUDA 8.0 installed

So the easy things to check first:
Launch config = 4 blocks of 512 threads
SM Block Limit = I run 1 block per SM, Limit is 8
SM Thread Limit = I run 512 threads per SM, Limit is 1536

Now this is what gets spit out with --ptxas-options=-v when I compile:

1>  ptxas info    : 0 bytes gmem, 232 bytes cmem[2]
1>  ptxas info    : Compiling entry function 'mykernel' for 'sm_20'
1>  ptxas info    : Function properties for mykernel
1>      1792 bytes stack frame, 188 bytes spill stores, 424 bytes spill loads
1>  ptxas info    : Used 63 registers, 1792 bytes cumulative stack size, 32768 bytes smem, 48 bytes cmem[0], 1 textures

So I’m interpreting that info to mean:

Compute Capability = I have it set to CC2.0, my hardware is capable of CC2.1
(Yep, tried compiling for 2.1. Made no difference.)

Shared Memory Limit = I use 32768 bytes smem, Limit is 49152

Constant Memory Limit = I use either 232 bytes cmem or 48 bytes cmem not sure really why it lists two amounts there but even adding those two together it doesn’t get remotely close to the 65536 byte limit.

Global Memory Limit = It lists 0 bytes gmem but my understanding is stack frame and possibly spill stores on top of that all end up in Global Memory also. (assuming L1/L2 caches both completely miss)

So worst case 1792 + 188 bytes = 1980 bytes but Global Memory Limit is supposedly some 1073741824 bytes That seems suspiciously miniscule compared to the limit, but even if that were per thread and every thread running used that amount of Stack Frame and spill it still would only be 4055040 bytes (aka 40.5MB out of 1GB) Also I should note that I was running originally with a stack frame size up at around 2400-ish and chopped a couple arrays in-half that were sized larger for worst case scenario inputs and that got it down to that 1792 number but the issue is the same regardless of the stack frame size.

Register Limit = I use 63 registers per thread, Limit is exactly 63. I max these out deliberately for performance, my understanding per the CUDA docs is that doing so is perfectly acceptable and anything that doesn’t fit in a register naturally spills to Global Memory (thus taking a performance hit.) However I have also tried limiting the registers just in case you need to leave some free for “something” (I don’t know what!). Did a go of it with mine set to 32 via -maxregcount and still see the same behavior, only effect was my spills naturally shot up.

Also some more obscure Limits I’ve pried from the clutches of “The Google”:
There is a Max number of instructions per kernel (SASS, I assume?)
As well as possibly a Max number of PTX (only found mentioned on this form.)

Max instructions per Kernel = I’m generating 41600 SASS instructions, Limit is 512 million.
I performed a cuobjdump -sass on my .cubin file to get that number, and the limit for CC2.0 and up is per Wikipedia.

Max PTX instructions = less than 314270, Limit mentioned was 2 million
Not sure this is a real thing, but I opened up the .ptx file that gets spit out during compile anyways and found it to be that many lines long so even if every line in that file were a ptx instruction, which doesn’t seem to be the case (lots of lines that are just “}”), that would still be well under a 2 million limit for ptx instructions.

Also since it seems highly relevant to the error message here is the full build command getting used to compile (VisualStudio2013 comes up with this for me based on the project settings, so its not like I could fat finger this even if I wanted to):

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include"  -G  --keep --keep-dir Debug -maxrregcount=0 --ptxas-options=-v --machine 32 --compile -cudart static  -g   -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /FS /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "D:\Code\kernel.cu"

So here’s where I’m at: Kernel works. Add code. Boom. Remove code. Kernel works.

Also, Yes, I did search this forum with every keyword and error message I could think of and after 4 or 5 pages into the search results they definitely seemed ahem “less than relevant” to what I typed in.(#aintnogoogle) However I did find this post on top of one such search result pile which sounds eerily similar in nature, but he gets a different error and there really isn’t anything helpful there:
https://devtalk.nvidia.com/default/topic/508322/?comment=3625318

I’m out of ideas and things to check, so I’m throwing up a hail mary here, any ideas?

Did I miss something crucial?

Is there a limit I haven’t checked yet?

Aliens?
http://s2.quickmeme.com/img/67/67fffb91c3cc4ab9c0137383fe0ef02059b01ca3015a53c7de2e55c8bcc2361e.jpg

I hereby humbly await your bequeathal of knowledge…

The usual cause of a CUDA_ERROR_NO_BINARY_FOR_GPU error is the lack of an embedded binary image for the specific architecture of the GPU one tries to run on, combined with the absence of PTX code that could be JIT compiled. That doesn’t seem to be the case here since your build target and your GPU are both sm_20, which is supported by CUDA 8.

There is a limit on device program size, but it is quite large and I have never encountered any code that ran into it. I have dealt with third-party code that had trouble compiling because it was in the > 100K instruction range, but you state that your code compiles fine, it just doesn’t run.

One of the appendices in the CUDA Programming Guide lists the architecture-dependent limits. The relevant limit is either the number of machine instructions in the machine code or the equivalent amount of code bytes. Number of machine instructions can be many times larger than the number of source code lines, especially with all the function inlining and loop unrolling the CUDA compiler often performs.

Use cuobjdump --dump-sass to dump the machine code and compare the number of instructions or the highest address used against the limit stated in the Programming Guide.

[Later:] The kernel size limit was 2 million instructions for sm_1x, and has been 512 million instructions since sm_20, so I don’t think you are hitting that. It would still be interesting to see how many instructions there are when you dump the code of the failing configuration with cuobjdump --dump-sass

I think you may actually be exactly at the limit for that one. My memory is very hazy, but as I recall sm_2x had a configurable shared memory / L1 cache split, and it would default to the smaller shared memory size? Try configuring for the “PreferShared” setting.

you’re building a 32-bit code?

Intentionally?

You’re using the driver API, I take it?

Are you carefully checking all error codes from the module load process?

Do you have any static device variable definitions in your code?

How much dynamic device memory allocation are you doing before the kernel launch?

So I take it your working hypothesis is that there isn’t enough contiguous memory to load the binary image because it’s mostly taken up by dynamic data allocation?

Sounds plausible and would jibe with the “grey zone” description where the kernel sometimes launches and sometimes doesn’t. But I would expect an out-of-resources error in that case, not CUDA_ERROR_NO_BINARY_FOR_GPU.

Yes, I suspect a memory issue. And no, it doesn’t fit all the data points cleanly.

I do know that a too-large static allocation will cause a module load to fail, which can cause weird errors later like “invalid symbol” (e.g. on cudaMemcpyToSymbol) and “no binary for GPU” on kernel launch. But if solid error checking was done up to that point, in the driver API (this is actually hidden in the runtime API - and must be discovered inferentially) I’m pretty sure you will get a module load error before you get to trying a kernel launch.

I have also seen weirdness around JIT failures that result in a module load failure and are observed later inferentially during other activity.

shootin from the hip. I normally steer clear of questions that don’t provide a solid repro case, and I broke my own rule here.

Well I certainly do appreciate you all taking a look! Even your questions here have already given me a few new ideas for things to check. To answer some of your questions:

njuffa is definitely correct on there being a split mode Shared Memory/L1 cache. Since my own recollection was also foggy, I just double checked the docs and for Fermi cores it’s 48KB Shared/16KB L1 by default. You can flip-flop it with the mode preference parameter to be 16KB Shared/48KB L1. (Kepler added a 32/32 split mode, but I’m definitely working with Fermi hardware here.) If all else fails I can hard code the parameter so its not just left to a default value, but I’ve over-run the shared limit before and the compiler flat out refused once that happened.

Also I looked to see if either of those two chunks of memory would cut into my Constant memory and from what I could find even though both Shared/L1 and Constant memory chunks live on-chip the 64KB Constant memory appears to be completely separate from 48KB/16kB chunk. So I think I’m in-bounds on the shared mem, but just to triple check I can hack out the shared memory completely from the kernel and make sure its not the culprit. I’ll let you know what turns out.

As far as txbob’s questions those made me think a bit. I’ll be honest in that I really didn’t know what “Driver API” meant and I definitely omitted a rather large bit of information in my original wall of text since I was fairly confident it wasn’t the problem and I was trying to stave off any red herrings. I’m no longer as confident. LOL

Sooooo, to fire off my CUDA kernel I’m actually using a C# winform that’s utilizing the ManagedCuda project library. After some googling it turns out that yes, I am in fact using the Driver Api simply by virtue of that’s what ManagedCuda uses under its hood.

I’m only semi-deliberately compiling for 32-bit, and if memory serves that was a decision I made very early on while trying to get CUDA working on my box. And it was simply because when starting out CUDA didn’t work so I flipped that switch and it started working so I left it alone ever since. I can’t quite recall now if that was a limitation of ManagedCuda or now looking back perhaps I just didn’t have my winform code project properly lined up for 64-bit goodness. I can definitely take a second wander through those settings now that I’ve got a few more wits-about-me in regards to CUDA.

I’m a little confused by the dynamic allocation question, I was under the impression that “dynamic” allocation usually referred to allocation of memory from within the kernel itself (as in calling new/delete or malloc/free as the kernel fires off). I’m now wondering if I might be way off base on my terms though. I don’t think I’m doing anything like that though, at least that I can see.

All of my vars are either locals within the kernel (as in int x[32] = {0};) or pushed out onto the device memory from the host code before the kernel is called.

As far as device variables the only thing I have using that is before my actual kernel function I have defined a few constant arrays and then an array of shared mem:

__shared__ __device__ unsigned int s[512 * 16];
extern __device__ __constant__ unsigned int block1[16];
extern __device__ __constant__ unsigned int block2[16];
extern __device__ __constant__ unsigned int block3[25];
extern __device__ __constant__ int sizeA;

Now like I said I’m talking at all this through the ManagedCuda library so to put values in there I call a method in the host code:

myCudaKernel.SetConstantVariable("block1", <i>initialvalueblah</i>);

I think that has to be calling the cudaMemcpyToSymbol function under its hood. SetConstantVariable is a void return so its either quietly eating an exception or not throwing any. Its probably also worth noting the values all appear correctly in the kernel when it runs and I debug.

The shared array only gets touched from inside kernel code, host code doesn’t do anything with it.

The only other device memory I’m utilizing is two device arrays plunked down in device global memory by the host code. One that contains input values and another that I write to to get at some output values. This is where I just realized while reading your questions I need to take a harder look. I used to run this kernel with a lot higher occupancy and those arrays should be getting sized based on the number of threads. Its been a while since I looked at that side of the code so I’ll double check I’m not over allocating something in memory there. FWIW The input array is an just an array of ints, one for each thread and then the output array was just a stop-gap measure so that I could get something back out of the kernel to verify calculations its quite a bit larger at 36 bytes per thread.

Check my math but that should only eat:
512(threads) * 4(blocks) * 4(bytes/int) = 14336 bytes
512(threads) * 4(blocks) * 9(uint) * 4(bytes/uint) = 73728 bytes

That’s just little over 880KB right??

Though let’s say “I no type so good” on the array dimensions, even if I was writing a much larger array than needed, it seems like a really weird error/symptom to manifest. I guess if its chewing up so much mem that the kernel doesn’t have room??? (That’s a lot of memory to fill up…) :-S

If I have time I’ll try and get a comparison cuobjdump of the sass before broken and after broken and maybe try to shave it down back into that gray area. If its like last time though the threshold kinda moves around on me (which yeah seems like resources intermitantly filled over filled or something of the like.)

I mean if all else fails I can definitely throw some code at you, but I can tell you right now you’re not going to like the look of it… haha! It would probably be worth the days-worth of work it would take for me to clean slate a test-case kernel that repro’s the issue. I was hoping it wouldn’t come to that though.

That’s the dynamic memory allocation I was referring to. In CUDA, it would be e.g. via cudaMalloc. In managedCuda, it’s something different, but I’m sure you know that.

It doesn’t look to me so far that you would likely be out of memory at the point of kernel launch (dynamic in-kernel allocations don’t come into play, yet). Which means I’m mostly out of ideas. As a double-check, if it were me, I would do a cudaMemGetInfo() right before the kernel launch, but I have no idea how to do that in managedCuda.

So just did the cuobjdump comparison and went from working version to broken by simply uncommenting about 10 lines of source code (each line has several math operations though). The SASS dump tallied up as follows:

41458 SASS instructions = Works

41600 SASS instructions = Fails

That second number looked suspiciously “round” to me, so I tried to narrow in a bit more line by line:

41518 instructions = Works again
41530 instructions = Still works
41547 instructions = Still works
41564 instructions = Still works
41582 instructions = Still works
41591 instructions = Still works
41595 instructions = Still works
41597 instructions = Still works
41604 instructions = BROKEN!
41597 instructions = Works again.

Sooo the difference between those last three compiles was just uncommenting and then re-commenting this teensy bit of source code:

x = tmp3 | tmp;

Also something that just jumped out at me is the initial broken count I did that was exactly 41600 SASS instructions is the EXACT same lines of source code that compiled into the 41604 that broke the second time, which is quite curious to me how it can suddenly take an extra 4 instructions to perform the exact same work…

Is that normal for the compiler to have variability in the number of SASS it outputs??? If that’s the case that could account for my intermitent symptoms I was seeing. If i’m right on that razor edge and the compiler spits out a couple extra instructions one build it would be broken then i recompile and maybe get a few less instructions and suddenly that same exact source code seems to work.

The bigger question:
41,600??? Does that number mean anything to anyone?

Cuz I’m staring right at
CUDA Toolkit v8.0 Programming guide >
Appendix G. Compute Capabilities >
G.1. Features and Technical Specifications >
Table 13. Technical Specifications per Compute Capability

And that thing clearly shows anything CC2.0 and up should be able to handle 512 million of those bad boys.

The kernel is big but not huge. As I said, I have dealt with 3rd party kernels in excess of 100K instructions, which ran just fine but their compile times here horrendous at the time (30 minutes). The number 41,604 doesn’t ring any bells for me, and it does not seem to factor into anything interesting looking.

It is certainly possible you are running into a bug somewhere that makes things fail for particular size programs. If the bug is in the CUDA toolchain, reporting it probably will not do any good because support for sm_2x has been discontinued with CUDA 9 (after being deprecated in CUDA 8). If the bug is in the CUDA driver, it may affect more architectures than just sm_2x, and may get fixed. You could try updating to the latest driver package in case your observation is related to a known driver bug that got fixed by now.

But you mentioned earlier that you use CUDA only indirectly, from C# through ManagedCUDA, and I think it is quite possible there is a bug in that software layer somewhere. I have no idea who provides the ManagedCUDA software and how robust that software is generally deemed to be but you may want to ask in an appropriate forum or mailing list for that software.

As the author of managedCuda I’ll add my two cents:
ManagedCuda is basically just a C# wrapper around the Driver API and every API call is checked to return CUResult.Success. An exception is thrown if the return value differs: No exception means the API is happy.

But more than that, I can also just wild guess. I’d try the following things:

  • Compile C#-application and Cuda to 64-bit
  • Try to load the PTX-Kernel and not the Cubin
  • Try CudaContext.LoadModule and CudaContext.LoadModulePTX, the last one also with the Cubin. LoadModule uses cuModuleLoad of the driver API, the latter cuModuleLoadDataEx. Not sure what the exact differences are inside Cuda.
  • I came across the same error when loading a 32-bit cubin-kernel in an 64-bit application and vice versa. PTX doesn’t throw an error on loading but can’t launch afterwards. As C# supports the “Any CPU” compilation, the 32/64-bitness can change on every application start. Make sure this doesn’t happen. (E.g. check the size of IntPtr in C#)

Yes, with my own quick drill into the ManagedCuda source I noticed what you just confirmed, that its really just doing sanity checks on the managed vars and then calling the native cuda method and passing the result back. I guess if I really want to confirm it’s not ManagedCuda amiss I can code up a native kernel launcher, but then I’d have to remember such unclean thoughts as how to malloc stuff again… and that just makes my insides feel funny. Maybe I’ve gone soft. lol

Also I went back and double checked and I definitely have had both my C# project AND the CUDA project compiler settings hard locked to x86 & --32bit this whole time. I also re-found what led me to do that initially. It was the fact that I have to pass in the Device Memory pointer that I allocate for my input/output arrays as a kernel parameter. If it was 32bit everywhere I was certain that I could fit all the address bits into a standard int on both sides. If it was 64bit then I wasn’t sure what to pass the bits as. I’m realizing now that wasn’t really an issue.

Switching the CUDA compiler to 64 bit will only change the “Memory Address” size to 64 bits right? An int “value” would still be 4 bytes (32 bits)?

Since the CudaDeviceVariable handles the pointer on the host end I would just need the corresponding pointer type on the kernel side which would naturally size the pointer correctly, yes?

I’m actually already using LoadModule() to load the PTX file. (I think whatever tutorial I initially looked at when setting things up said to do it that way.) I didn’t realize it was meant for cubin files even, nor had I heard about the LoadModulePTX() method. I can give those alternate combos I haven’t used yet a whirl and see if anything changes.

(1) Addresses / pointers will increase from 32 bits to 64 bits
(2) Variables of types size_t, uintptr_t, etc will increase from 32 bits to 64 bits
(3) Variables of type ‘long’ will increase from 32 to 64 bits (except on Windows)

These changes almost always causes register use to increase, sometimes considerably. In general, CUDA maintains the sizes of all data types exactly as they are in the host code. This makes for seamless integration of host and device code, but also causes tight dependencies on host header files (which is why CUDA requires specific host tool chain versions).