I have a kernel which behaves differently depending on whether it is compiled by ptxas or by the JIT compiler. A couple of answers would help my investigation:
- Is it possible to see the output (the SASS, I guess) from the JIT compiled version? It would be interesting to compare the two versions.
- What software does the JIT compilation? The runtime (cudart.lib)? The "kernel" driver (385.90 on my version of windows)? Something else? In other words, what software should I alter to experiment with different versions?
The GPU driver (only) does the JIT.
When you use the driver that ships in the particular CUDA toolkit installer that you used to install CUDA, it’s expected that the driver JIT compilation mechanism and the ptxas tool in the CUDA toolkit should produce basically the same thing. However, if you later update the driver, that may not be the case. Bugs get fixed in newer drivers all the time, and a significant forward movement of the driver to a newer branch (e.g. moving from an r384 driver branch to an r387 driver branch) may even introduce new features, new optimizations, etc.
I don’t know of a convenient way to capture the JIT output produced by the driver (unless you try digging through the JIT cache), so my recommendation would be to try a newer CUDA toolkit version, to see if the behavioral difference is reproducible there with ptxas. If it is, you may be able to compare SASS that way. Obviously, if you are on the latest CUDA toolkit version already, that method won’t work. But you are not on the latest CUDA toolkit version if you are using 385.90.
I guess I should have mentioned that I’m compiling with nvcc 8.0.44 and using driver 385.90. Are these out of sync for windows server 2012 r2?
8.0.44 is called CUDA 8.0 GA1 by NVIDIA. Even for folks who wish to stay on CUDA 8.0, NVIDIA would recommend moving forward to CUDA 8.0 GA2, where the version number is 8.0.61/8.0.62.
And after that, both CUDA 9.0 and CUDA 9.1 have been released.
Current CUDA (9.1 at the moment): http://www.nvidia.com/getcuda
CUDA toolkit archive: https://developer.nvidia.com/cuda-toolkit-archive
Since you are on 8.0.44, installing CUDA 8.0.61 may show you different behavior from the offline compiler, and you would then be able to inspect the ptxas-compiled (SASS) code. A similar statement could be made for CUDA 9.0 and CUDA 9.1.
Both CUDA 8.0.61 and CUDA 9.0 should be usable with your r384-branch driver, without having to change your GPU driver.
Last I checked, the files in the JIT cache are regular ELF files with some amount of metadata as a prefix. So if you have a simple application, you can nuke the JIT cache, run the application so the kernel is JIT-compiled, edit the newly generated file(s) to remove everything before the ELF marker, and finally disassemble with cuobjdump --dump-sass.
However, unless you have practice reading through GPU machine code, it is unlikely that you will spot any salient differences, in particular for optimized builds, where it can be exceedingly hard to identify which machine instructions correspond to which source code lines.
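The "remove everything before the ELF marker" step can be sketched in a few lines of host-side C++. This is only an illustration under the assumption stated in the post, namely that a cache file is some metadata followed by an ordinary ELF image. The function names strip_to_elf and strip_file are hypothetical, and the cache file layout is not documented by NVIDIA.

```cpp
#include <algorithm>
#include <array>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Returns the bytes starting at the first ELF magic number (0x7F 'E' 'L' 'F').
// Returns an empty vector if the magic number does not occur in the input.
std::vector<unsigned char> strip_to_elf(const std::vector<unsigned char>& raw) {
    static const std::array<unsigned char, 4> kElfMagic = {0x7F, 'E', 'L', 'F'};
    auto it = std::search(raw.begin(), raw.end(),
                          kElfMagic.begin(), kElfMagic.end());
    return std::vector<unsigned char>(it, raw.end());
}

// Reads in_path, drops everything before the ELF magic, writes the rest to
// out_path. Returns false if the input cannot be read, contains no ELF
// magic, or the output cannot be written. Both paths are caller-supplied.
bool strip_file(const std::string& in_path, const std::string& out_path) {
    std::ifstream in(in_path, std::ios::binary);
    if (!in) return false;
    std::vector<unsigned char> raw((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());
    const std::vector<unsigned char> elf = strip_to_elf(raw);
    if (elf.empty()) return false;
    std::ofstream out(out_path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(elf.data()),
              static_cast<std::streamsize>(elf.size()));
    return static_cast<bool>(out);
}
```

The file written by strip_file would then be the input to cuobjdump --dump-sass, as described above.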
In general, one should never assume that the offline compiler and the JIT compiler produce identical code, even when using the JIT compiler from the driver that ships with a particular CUDA version. To the best of my knowledge, NVIDIA provides no guarantees of identical code generation between the two compiler versions and in my observation, there are often differences: mostly minor, sometimes significant.
You did not say how the code behaves differently with the two compilers. While compiler bugs can and do occur, these days the CUDA toolchain is sufficiently robust that these are rare occurrences. I assume you have already checked that your code does not contain latent bugs, e.g. invoking undefined C++ behavior or violating the CUDA programming model, which may be exposed by switching compilers. When the code is executed under control of cuda-memcheck, does it flag any issues? Note that cuda-memcheck cannot find all possible bugs; for example, it can only find a subset of possible race conditions.
Thanks txbob and njuffa. I’ll try to update my compiler. Also, I’m successfully disassembling my kernel from the JIT cache. Thanks for the tip.
As to how the results differ: The ptxas version gives the correct answer all the time. The JIT version gives me a different answer every time I run it (on a K80 with --generate-code=arch=compute_37,code=compute_37); sometimes the result is correct, sometimes not. cuda-memcheck says it’s OK. Of course, as you say, that doesn’t mean my code is correct.
When you drill down into the code in NVVP, it displays a machine code listing. Is that listing accurate regardless of whether the kernel was compiled by ptxas or by the JIT?
I appreciate your guys’ help.
I would assume that NVVP displays whatever code has actually been loaded into the GPU (regardless of how it was compiled), but I do not know. Only the engineers who maintain NVVP could tell you for sure.
When you refer to “incorrect” results, are these results from integer or floating-point computation? If the latter, are these floating-point results somewhat different numerically, or are they completely bogus?
They’re floating point values. Some are correct (bit-wise identical) and others are completely bogus. My kernel involves some shared memory transactions, so I probably have a race-condition.
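The distinction asked about here (bit-wise identical, slightly different numerically, or completely bogus) can be checked mechanically when comparing a known-good result buffer against a suspect one. Below is a minimal host-side sketch. The name classify, the enum and the default tolerance of 1e-5 are assumptions chosen for illustration, not anything from the thread.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

enum class Mismatch { Identical, SmallNumericDifference, Bogus };

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// Compares one reference value against one value from the suspect run.
// rel_tol is an arbitrary illustrative threshold; choose one that suits
// the numerics of the kernel being checked.
Mismatch classify(float reference, float actual, float rel_tol = 1e-5f) {
    std::uint32_t r_bits, a_bits;
    std::memcpy(&r_bits, &reference, sizeof r_bits);  // well-defined bit copy
    std::memcpy(&a_bits, &actual, sizeof a_bits);
    if (r_bits == a_bits) return Mismatch::Identical;

    // A NaN or infinity on only one side is treated as garbage.
    if (!std::isfinite(reference) || !std::isfinite(actual)) return Mismatch::Bogus;

    const float scale = std::fmax(std::fabs(reference), std::fabs(actual));
    if (std::fabs(reference - actual) <= rel_tol * scale)
        return Mismatch::SmallNumericDifference;
    return Mismatch::Bogus;
}
```

Small differences usually point at legitimate code generation changes such as different FMA contraction or operation ordering. A mix of bit-identical and wildly wrong values, as reported here, is more typical of reading data that was not ready yet.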
If nvvp is to be believed, the sass for ptxas and jit are quite different.
That can happen if the release dates are fairly far apart. However you might also want to check whether you are in fact using completely identical compilation switches for both compiles.
Like @njuffa, I’m reasonably confident that the SASS code shown in the source-disassembly view of the visual profiler will always reflect the SASS code that is executing. The source code should always reflect the source, of course. However, I’m not confident the mapping between the two is as trustworthy in the JIT-compiled case as in the ptxas-compiled case. A proper mapping there depends on the -lineinfo switch, and I’m not sure the lineinfo information is generated or preserved at JIT compile time.
A number of compiler switches are passed through to the JIT compiler, but I am not sure -lineinfo is one of them.
Because JIT compilation tends to decrease programmer visibility, my usual recommendation is to use JIT compilation only where necessary. That means that where dynamic code generation is not required, it is best to use fat binaries with code compiled offline for all desired target architectures.
Hi: I’ve been trying different combinations of CUDA and drivers to pin this down a little better. My table looks like:
- CUDA 8.0 GA 2 (NVCC 8.0.60), Driver 376.51
- CUDA 8.0 GA 2, Driver 377.55
- CUDA 8.0 GA 2, Driver 385.08
code=sm_37 : OK
code=compute_37 : Fails
- CUDA 9.1 (NVCC 9.1.85), Driver 388.19
code=sm_37 : Fails
So, it looks like the 38x series driver and CUDA 9.1 are generating bad code. Or I have a latent bug that they are exciting. However, I’ve tested my code on Maxwell and Pascal devices and it seems to work OK. Furthermore, the kernel itself is templated and there are a dozen different instantiated versions of it, only one of which appears to give the wrong results. Also, if I run under cuda-memcheck, it gives the right answer.
I guess I’m on to trying to make a test case to submit to support.
Thanks for your help and insights.
(I have an auxiliary question: what do the ISCADD and TEXDEPBAR instructions do? The instruction set reference is a bit short on details. There are considerably more instances of TEXDEPBAR in the Kepler machine code generated by CUDA 8 compared to CUDA 9. I wonder if that is causing my irreproducible results?)
NVIDIA doesn’t provide full public documentation of their machine ISA. They should, IMHO, but haven’t done so for the entire existence of CUDA, so the situation is unlikely to change.
ISCADD is an “integer scale and add”, i.e. (a << s) + b, and was often used for address computations in the past. It might fall naturally out of some bit-twiddling codes that use such a sequence, or be converted from integer multiply-add if the multiplier turns out to be a power of two (ISCADD likely uses less energy than IMAD, making the substitution desirable).
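As a host-side illustration of the arithmetic just described (not of the hardware instruction itself, whose exact semantics are undocumented), the following sketch shows that a shift-and-add equals a multiply-add whenever the multiplier is a power of two. The function names are made up for this example.

```cpp
#include <cstdint>

// Models the "(a << s) + b" arithmetic described above.
// Unsigned operands keep the shift well-defined; s must be less than 32.
constexpr std::uint32_t iscadd(std::uint32_t a, std::uint32_t b, unsigned s) {
    return (a << s) + b;
}

// Models an integer multiply-add, a * m + b.
constexpr std::uint32_t imad(std::uint32_t a, std::uint32_t m, std::uint32_t b) {
    return a * m + b;
}

// With a power-of-two multiplier (here 4 == 1 << 2) the two agree.
static_assert(iscadd(5, 100, 2) == imad(5, 4, 100), "shift-add equals multiply-add");

// Typical address computation: byte address of element `index` in an array
// of 4-byte values starting at `base`.
constexpr std::uint32_t element_address(std::uint32_t base, std::uint32_t index) {
    return iscadd(index, base, 2);  // index * 4 + base
}
```

For instance, element_address(0x1000, 3) evaluates to 0x100C, which is the kind of indexing computation in which such an instruction would be expected to appear.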
TEXDEPBAR is some sort of texture dependency barrier that usually occurs after a batch of load instructions that read through the texture cache. I don’t know what it does, but based on the name I envision a barrier that inhibits further execution until an internal buffer related to texture cache access has been drained. So presumably (speculation!) this is needed first and foremost for correct execution, and an incorrectly placed TEXDEPBAR could cause code to malfunction.
In my experience, that observation is strongly correlated with having an inherent race condition in your code.
In my experience, newer compilers tend to get more aggressive in terms of using optimizations based on assumptions about your code. These assumptions tend to expose race conditions more readily.
If you are using shared memory in your code, I would encourage you to test the failing case with
cuda-memcheck --tool racecheck …
For this particular test, I would encourage you to build and test both debug and release versions of your code (i.e. with and without the -G compile switch).
If you are not using shared memory, or if that tool reports no issues, it’s possible you still have a global memory race in your code.
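To make concrete why a shared memory race can give correct results with one compiler and garbage with another, here is a deliberately simplified host-side model. It is not CUDA code: the "threads" are loop iterations and the "shared memory" is a std::vector. The two functions differ only in whether all writes complete before any read, which is the ordering a __syncthreads() between the two phases would enforce on the device. All names and the stale marker value are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Stands in for whatever happened to be in shared memory before the kernel.
constexpr int kStale = -1;

// Every "thread" t stores in[t] to shared[t], then reads its right-hand
// neighbour's slot. Here all stores complete before any load, as a barrier
// between the two phases would guarantee.
std::vector<int> rotate_with_barrier(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> shared(n, kStale), out(n);
    for (std::size_t t = 0; t < n; ++t) shared[t] = in[t];             // phase 1
    for (std::size_t t = 0; t < n; ++t) out[t] = shared[(t + 1) % n];  // phase 2
    return out;
}

// Same work without the barrier, under ONE possible schedule: each thread
// stores and immediately loads before the next thread has run. Other
// schedules are equally legal, so the result is schedule-dependent.
std::vector<int> rotate_without_barrier(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> shared(n, kStale), out(n);
    for (std::size_t t = 0; t < n; ++t) {
        shared[t] = in[t];
        out[t] = shared[(t + 1) % n];  // neighbour may not have stored yet
    }
    return out;
}
```

With input {10, 20, 30, 40}, the barrier version returns {20, 30, 40, 10}, while this particular unsynchronized schedule returns {-1, -1, -1, 10}: one correct value and three stale ones. On a real GPU the effective schedule depends on instruction ordering and timing, which is why a different code generator, or the slowdown introduced by cuda-memcheck, can hide or expose such a bug.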
Another aspect of modern compilers (for CPU and GPU) is that they are apt to relentlessly exploit any instance of undefined C++ behavior (such as signed integer overflow), which by C++ standard rules gives them great freedom in code generation (usually not what the programmer intended!). This has bitten even seasoned programmers working on well-known open-source projects whose code had worked correctly for years, in some cases decades, and then suddenly failed after a compiler upgrade.
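A classic instance of this, sketched in plain C++: an overflow check that itself relies on signed overflow. Because signed overflow is undefined, an optimizing compiler is allowed to assume x + 1 is always greater than x and reduce the first function to "return false". The function names are hypothetical, and the well-defined variants are just one possible way to write such checks.

```cpp
#include <limits>

// Intended as "does x + 1 wrap around?", but the test only succeeds if
// x + 1 overflows, which is undefined behavior for int. A compiler may
// legally fold this whole function to `return false`.
bool increment_overflows_ub(int x) {
    return x + 1 < x;
}

// Well-defined alternative: compare against the limit before adding.
bool increment_overflows(int x) {
    return x == std::numeric_limits<int>::max();
}

// Well-defined checked addition: returns false and leaves `sum` untouched
// if a + b would not fit in an int.
bool checked_add(int a, int b, int& sum) {
    const int kMax = std::numeric_limits<int>::max();
    const int kMin = std::numeric_limits<int>::min();
    if ((b > 0 && a > kMax - b) || (b < 0 && a < kMin - b)) return false;
    sum = a + b;
    return true;
}
```

Code in the style of increment_overflows_ub can appear to work for years at one optimization level or compiler version and then silently change behavior after an upgrade, which is the failure pattern described above.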
On the other hand, compilers are complex pieces of software incorporating many heuristics, so bugs are a fact of life. Since the machine ISAs of NVIDIA GPUs lack binary compatibility, bugs are more frequent when targeting a new architecture such as Volta, because code generators have to be adjusted / rewritten for each new architecture.