Is it possible for the JIT compilation of PTX to machine code to fail in certain cases? I always assumed it was rock-solid and would work in all cases (except perhaps for gigantic libraries).
We recently had a case where a (not very big) CUDA library did not work on a particular notebook GPU (Quadro NVS 4200M, compute capability 2.1) because of a failure we couldn’t track down. The library was compiled with the following settings (CUDA Toolkit 5.0, 64-bit, CMake, Windows 7 64-bit):
set(CUDA_NVCC_FLAGS
    -gencode arch=compute_11,code=sm_11
    -gencode arch=compute_20,code=sm_20
    -gencode arch=compute_30,code=sm_30
    -gencode arch=compute_20,code=compute_20)
By the last entry in the list, PTX code for CC 2.0 was embedded, which, as I understand it, should then have been just-in-time compiled at runtime to machine code for CC 2.1. But it seems that didn’t work properly.
The library was then recompiled for every possible compute capability, with both PTX and machine code (so that no JIT takes place, at least for the compute capabilities known to the NVCC compiler in CUDA Toolkit 5.0), and suddenly the library worked on the Quadro NVS 4200M.
I would like to hear whether that is possible or has happened to others as well. I couldn’t track it down 100%, but I’m fairly sure that the JIT compilation was the problem. Could it also be GPU- or driver-dependent? The GPU in question is quite weak.
I am interested in the stability of the JIT mechanism because, if we can’t rely on it 100%, we would have to compile our libraries for many more compute capabilities than we currently do, which I want to avoid because the DLLs may get big.
You might try dumping the executable with "cuobjdump --list-ptx " to verify that the compute_20 PTX is in the fatbin.
I see that you specified arch/code twice for compute_20. I thought this was legal but perhaps it’s tickling a bug. You might also try specifying your compute_20 targets with:
[ Edit: as @njuffa notes, this would be a pointless declaration since by default the sm_20/sm_21 binary would be loaded and no JIT would occur. ]
Finally, you should consider that the kernel failed for some other reason (memory?).
[Postscript:] allanmac already addressed various points I make in this post. His post wasn’t there when I started typing, and apparently I type more slowly than he does :-)
When you say “it seems it didn’t work properly” what error status was returned by CUDA? On what API call or kernel invocation? Is it possible the app failed due to a lack of GPU memory, or because a kernel timed out? The Quadro NVS 4200M is a low-end device so both of those failure modes seem like plausible hypotheses.
Unless you force JIT compilation, the CUDA driver will first look for a matching SASS binary for the architecture of the GPU present. Looking at your compilation options, it appears SASS for sm_20 is embedded (-gencode arch=compute_20,code=sm_20) into the executable. sm_20 SASS is binary compatible with sm_21 (I have a Quadro 2000 here which is sm_21 so have hands-on experience), so that should get loaded on your sm_21 GPU and no JIT compilation from PTX should take place.
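The selection rule described above can be sketched in a few lines of Python. This is only a simplified model of the documented compatibility rules (SASS runs on devices of the same major version with an equal or higher minor version; PTX can be JIT-compiled for any device of equal or higher capability), not the driver's actual code. The function name pick_code and the tuple representation are made up for illustration; the target lists are taken from the -gencode flags quoted in the question.

```python
from typing import List, Optional, Tuple

CC = Tuple[int, int]  # (major, minor), e.g. (2, 1) for sm_21


def pick_code(device: CC, sass: List[CC], ptx: List[CC]) -> Optional[str]:
    """Return the embedded image a driver would be expected to load.

    Simplified model (an assumption, not the driver's real logic):
    - SASS is usable only on the same major version with device minor >= SASS minor.
    - PTX is usable on any device whose capability is >= the PTX target.
    - Usable SASS is preferred; PTX JIT is the fallback.
    """
    usable_sass = [s for s in sass if s[0] == device[0] and s[1] <= device[1]]
    if usable_sass:
        major, minor = max(usable_sass)
        return f"sass sm_{major}{minor}"
    usable_ptx = [p for p in ptx if p <= device]
    if usable_ptx:
        major, minor = max(usable_ptx)
        return f"jit compute_{major}{minor}"
    return None  # nothing in the fatbin can run on this device


# Targets taken from the -gencode flags in the question.
SASS = [(1, 1), (2, 0), (3, 0)]
PTX = [(2, 0)]

print(pick_code((2, 1), SASS, PTX))  # Quadro NVS 4200M -> sass sm_20
print(pick_code((5, 0), SASS, PTX))  # no SASS for major 5 -> jit compute_20
```

Under this model the sm_21 device loads the sm_20 binary directly, so the compute_20 PTX would only be JIT-compiled on a device with no compatible SASS in the fatbin.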
So I am puzzled by your observation. You can use
cuobjdump --dump-sass
to see for which architectures SASS was embedded, and
cuobjdump --dump-ptx
to check for which architectures PTX was embedded. You can also look for JIT activity by clearing the JIT cache prior to app invocation and then checking whether any files were deposited there.
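To test the JIT path deliberately rather than wait for it to trigger, one option is to launch the application with the CUDA_FORCE_PTX_JIT and CUDA_CACHE_DISABLE environment variables set. Both are documented in the CUDA programming guide, but whether a given driver version honours them should be checked. The sketch below is a minimal Python launcher; the helper names jit_test_env and run_with_forced_jit, and any command line passed to them, are hypothetical.

```python
import os
import subprocess
from typing import Dict, List, Optional


def jit_test_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and add the variables that force a fresh PTX JIT."""
    env = dict(os.environ if base is None else base)
    env["CUDA_FORCE_PTX_JIT"] = "1"  # ignore embedded SASS and compile the PTX
    env["CUDA_CACHE_DISABLE"] = "1"  # bypass the on-disk JIT cache
    return env


def run_with_forced_jit(command: List[str]) -> int:
    """Run the application under the forced-JIT environment; return its exit code."""
    return subprocess.run(command, env=jit_test_env()).returncode
```

For example, run_with_forced_jit(["app.exe"]) would start a (hypothetical) app.exe so that every kernel goes through the JIT compiler. If the failure reproduces on a GPU where the precompiled binary works, that points at the PTX path; if it does not, a non-JIT cause becomes more likely.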
Are bugs in the JIT compiler possible? Yes; it is a complex piece of code. However, the sm_20 portion of it has been around for years, so it is unlikely that a bug would be encountered at this point. In the spirit of Occam’s razor, I would suggest double-checking for non-JIT sources of failure first.
Another sledgehammer tool you can use to verify your application is:
cuda-memcheck --report-api-errors all <app.exe>
I always forget to use this one!
Thanks for all the comments.
@allanmac: The ‘-gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20’ should be OK because according to my understanding it should compile machine code for CC 2.0 (first part of statement) and PTX code for CC 2.0 (second part).
@njuffa, @allanmac: We get the GPU-accelerated library as a binary package (header files, lib, DLL) from a partner (but we can tell them how they should compile it), so I don’t have the source files of the library. I will try out cuda-memcheck (I didn’t know about it).