I am facing a tricky issue with a CUDA program of mine. It is difficult to localize and to provide an MWE for, since almost anything I remove from the device-side code makes the issue go away.
The symptoms:
- In print_floating_point(), between the two printf() instructions, the local variable value changes from -42.0 to 0.0, for no apparent reason.
- When applying compute-sanitizer to the program, I get notified of an “Invalid local read of size 8 bytes” at line 667 (an invocation of print_decimal_number()).
I’m using CUDA 11.6.55 and compiling for compute capability 6.1 (GTX 1050 Ti). I suspect this may have something to do with register spilling, but can’t say for sure.
The program is here… I know, I know, it’s a big program (915 lines), but I cut it down as far as I could without having the effect disappear. I just can’t seem to localize it. Maybe it has a global aspect, related to spilled registers or something?
Notes:
- The motivation is a full-fledged printf()-family implementation for CUDA code, i.e. including the missing specifiers in CUDA’s built-in printf, support for printing binaries / bitmasks, and most importantly - sprintf(), which is sorely missed. It would be a port of this library.
- For the purposes of this post, I am not concerned with the final output not being correct. This program doesn’t have the entire printf’ing code anyway. Once I get past the weird, unexplainable behavior I’ll make sure this, and the other ~500 testcases, pass.
- Due disclosure: I also asked this on StackOverflow…
verbose ptxas output:
$ ptxas --verbose --gpu-name sm_61 test/test_suite_device.ptx 2>&1 | cu++filt
ptxas info : 62 bytes gmem
ptxas info : Compiling entry function 'snprintf_kernel(char *, unsigned long)' for 'sm_61'
ptxas info : Function properties for snprintf_kernel(char *, unsigned long)
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 70 registers, 336 bytes cmem[0], 412 bytes cmem[2]
ptxas info : Function properties for snprintf_(char *, unsigned int, const char *, ...)
232 bytes stack frame, 140 bytes spill stores, 140 bytes spill loads
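Given the stack frame and spill figures reported above, one experiment that may help narrow things down is to check how much per-thread stack the device currently allows, and to rerun with a larger limit to see whether the symptom changes. The following is only a sketch of that experiment, not a diagnosis. kernel_under_test is a hypothetical stand-in for snprintf_kernel, and the 4096-byte limit is an arbitrary value picked for illustration; the runtime calls themselves (cudaDeviceGetLimit, cudaDeviceSetLimit, cudaFuncGetAttributes) are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel under investigation; in the real
// program this would be snprintf_kernel(char*, unsigned long).
__global__ void kernel_under_test(char* buffer, unsigned long size)
{
    if (size > 0) { buffer[0] = '\0'; }
}

// Report a failing runtime call and leave main() with a non-zero status.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t status_ = (call);                                \
        if (status_ != cudaSuccess) {                                \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(status_));               \
            return 1;                                                \
        }                                                            \
    } while (0)

int main()
{
    // Per-thread stack size currently configured for the device.
    size_t stack_limit = 0;
    CHECK(cudaDeviceGetLimit(&stack_limit, cudaLimitStackSize));
    std::printf("stack limit per thread:  %zu bytes\n", stack_limit);

    // Static resource usage of the kernel as compiled.
    cudaFuncAttributes attrs;
    CHECK(cudaFuncGetAttributes(&attrs, kernel_under_test));
    std::printf("registers per thread:    %d\n", attrs.numRegs);
    std::printf("local memory per thread: %zu bytes\n", attrs.localSizeBytes);

    // Experiment: raise the stack limit before launching, then rerun.
    // 4096 is an arbitrary illustrative value, not a recommendation.
    CHECK(cudaDeviceSetLimit(cudaLimitStackSize, 4096));

    const unsigned long size = 64;
    char* buffer = nullptr;
    CHECK(cudaMalloc(&buffer, size));
    kernel_under_test<<<1, 1>>>(buffer, size);
    CHECK(cudaGetLastError());        // launch-time errors
    CHECK(cudaDeviceSynchronize());   // execution-time errors
    CHECK(cudaFree(buffer));
    return 0;
}
```

If the behavior changes with a larger stack limit, that would point toward stack sizing; if it stays the same, the cause probably lies elsewhere. Note that the local memory figure reported for the kernel may not include the frames of device functions that are called rather than inlined, so it is best read alongside the ptxas output rather than instead of it.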
sanitizer output (snipped):
COMPUTE-SANITIZER
Invalid __local__ read of size 8 bytes
at 0x11d8 in /home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:507:_ZN51_INTERNAL_507c18dc_20_test_suite_device_cu_c3781aca20print_decimal_numberEP8gadget_tdjjjPcj
by thread (0,0,0) in block (0,0,0)
Address 0xfffd20 is out of bounds
Device Frame:/home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:658:_ZN51_INTERNAL_507c18dc_20_test_suite_device_cu_c3781aca8print_fpEP8gadget_tdjjjb [0x11c8]
Device Frame:/home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:797:_ZN51_INTERNAL_507c18dc_20_test_suite_device_cu_c3781aca10_vsnprintfEP8gadget_tPKcP13__va_list_tag [0x10f8]
Device Frame:/home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:869:snprintf_(char *, unsigned int, const char *, ...) [0x170]
Device Frame:/home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:879:snprintf_kernel(char *, unsigned long) [0x78]