"Invalid local read" and local variable mysteriously changing to 0; what could be the cause?

I am facing a tricky issue with a CUDA program of mine, which is difficult to localize and provide an MWE for, since almost anything I remove from the device-side code makes the issue go away.

The symptoms:

  1. In print_floating_point() between the two printf() instructions, the local variable value changes from -42.0 to 0.0, for no apparent reason.
  2. When applying compute-sanitizer to the program, I get notified of an “Invalid local read of size 8 bytes” at line 667 (an invocation of print_decimal_number()).

I’m using CUDA 11.6.55 and compiling for compute capability 6.1 (GTX 1050 Ti). I suspect this may have something to do with register spilling, but can’t say for sure.

The program is here… I know, I know, it’s a big program, 915 lines, but I cut it down as far as I could without having the effect disappear. I just can’t seem to localize it. Maybe it has a global aspect, related to spilled registers or something?


  • The motivation is a full-fledged printf()-family implementation for CUDA code, i.e. including the specifiers missing from CUDA’s built-in printf, support for printing binaries / bitmasks, and most importantly sprintf(), which is sorely missed. It would be a port of this library.
  • For the purposes of this post, I am not concerned with the final output not being correct. This program doesn’t have the entire printf’ing code anyway. Once I get by the weird, unexplainable behavior I’ll make sure this, and the other ~500 testcases, pass.
  • Due disclosure: I also asked this on StackOverflow…

verbose ptxas output:

$ ptxas --verbose --gpu-name sm_61 test/test_suite_device.ptx 2>&1 | cu++filt
ptxas info    : 62 bytes gmem
ptxas info    : Compiling entry function 'snprintf_kernel(char *, unsigned long)' for 'sm_61'
ptxas info    : Function properties for snprintf_kernel(char *, unsigned long)
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 70 registers, 336 bytes cmem[0], 412 bytes cmem[2]
ptxas info    : Function properties for snprintf_(char *, unsigned int, const char *, ...)
    232 bytes stack frame, 140 bytes spill stores, 140 bytes spill loads

sanitizer output (snipped):

Invalid __local__ read of size 8 bytes
    at 0x11d8 in /home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:507:_ZN51_INTERNAL_507c18dc_20_test_suite_device_cu_c3781aca20print_decimal_numberEP8gadget_tdjjjPcj
    by thread (0,0,0) in block (0,0,0)
    Address 0xfffd20 is out of bounds
    Device Frame:/home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:658:_ZN51_INTERNAL_507c18dc_20_test_suite_device_cu_c3781aca8print_fpEP8gadget_tdjjjb [0x11c8]
    Device Frame:/home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:797:_ZN51_INTERNAL_507c18dc_20_test_suite_device_cu_c3781aca10_vsnprintfEP8gadget_tPKcP13__va_list_tag [0x10f8]
    Device Frame:/home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:869:snprintf_(char *, unsigned int, const char *, ...) [0x170]
    Device Frame:/home/eyalroz/src/mine/printf/test/cuda/test_suite_device.cu:879:snprintf_kernel(char *, unsigned long) [0x78]

The error does not appear on my machine when compiling in debug mode -G.
Did you forget to copy the buffer back to the host? When I add the D2H copy, I get the following output. The digit 2 seems to be missing from the device result.

here 1: value =  -42.00000
here 2: value =  -42.00000
Format:   "%15e"
Actual:   "         -4e+01"
Expected: "  -4.200000e+01

When I make every function __host__ __device__ and try to run snprintf_ from the host, the digit 2 is missing from the host output, too. Compiling with -Xcompiler "-Wconversion" gives two errors.

error: conversion from ‘int_fast64_t’ {aka ‘long int’} to ‘double’ may change value [-Werror=conversion]
599 | if ((flags & FLAGS_ADAPT_EXP) && floored_exp10 >= -1 && dcc.integral == power_of_10(floored_exp10 + 1)) {

error: conversion from ‘size_t’ {aka ‘long unsigned int’} to ‘p_size_t’ {aka ‘unsigned int’} may change value [-Werror=conversion]
904 | snprintf_(buffer, buffer_size, FORMAT, -42.);

void invoke_on_host(char* buffer, size_t buffer_size)
{
  snprintf_(buffer, buffer_size, FORMAT, -42.);
}

int main() {
  char buffer_for_device[30] = "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
  char buffer_for_host[30] = "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
  invoke_on_device(buffer_for_device, sizeof(buffer_for_device));
  invoke_on_host(buffer_for_host, sizeof(buffer_for_host));
  printf(
    "Format:   \"%s\"\n"
    "Actual device:   \"%s\"\n"
    "Actual host:   \"%s\"\n"
    "Expected: \"", FORMAT, buffer_for_device, buffer_for_host);
  printf(FORMAT, -42.);
}
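
For reference, the two -Wconversion diagnostics quoted in this post can be silenced by making the conversions explicit. The following is only a minimal sketch: p_size_t is assumed to be unsigned int (as the diagnostic states), and the helper names clamp_to_p_size and equals_as_double are hypothetical, not taken from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Assumption: matches the alias named in the compiler diagnostic.
using p_size_t = unsigned int;

// Hypothetical helper: narrow a size_t to p_size_t, saturating instead of
// silently wrapping when the value does not fit.
inline p_size_t clamp_to_p_size(std::size_t n)
{
    constexpr std::size_t limit = std::numeric_limits<p_size_t>::max();
    return static_cast<p_size_t>(n < limit ? n : limit);
}

// Hypothetical helper: compare a 64-bit integer against a double with the
// conversion spelled out. The conversion is exact only while |i| <= 2^53.
inline bool equals_as_double(std::int_fast64_t i, double d)
{
    return static_cast<double>(i) == d;
}

// Usage example: a 30-byte buffer size and a small power of 10 convert exactly.
inline void check_conversions()
{
    assert(clamp_to_p_size(sizeof(char[30])) == 30u);
    assert(equals_as_double(100, 1e2));
}
```

For buffer sizes like the 30 bytes used here, and for the small powers of 10 in the comparison, both conversions are exact, so the warnings by themselves do not indicate wrong results.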

The error does not appear on my machine when compiling in debug mode -G .

Yes, that’s true on my system as well. Will look into the conversion business. I think the conversion warnings are harmless (edit:) … yes, it’s fine.

Did you forget to copy the buffer back to the host?

Ah, actually yes, but it doesn’t matter, i.e. I wasn’t complaining about getting the wrong actual vs expected. Fixed it - updated the link to point to a new version of the source.

Update: If I disable -G, but add __noinline__ to the large function of mine (_vsnprintf()) - I no longer experience the two issues. Now, it’s true this could simply be masking some bug, but I now have a nagging suspicion that this might be a compiler/assembler issue rather than a bug of mine. See the CUDA support branch of my printf-family-functions library for the WIP.

You may not even need to put __noinline__ at the top of the call stack. Marking only the function putchar_via_gadget as __noinline__ works as well. It might be simpler to compare the assembly if only a small part is not inlined.
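
To illustrate what marking only that one function looks like, here is a minimal sketch. The names gadget_t and putchar_via_gadget come from this thread, but the struct fields and the function body are assumptions made for illustration, not the library's actual code. The macros are also hypothetical; they exist only so that the same file builds with nvcc and as plain C++.

```cpp
#include <cassert>
#include <cstddef>

// Under nvcc, use the CUDA qualifiers; otherwise fall back to a host-compiler
// attribute so the sketch also compiles as ordinary C++.
#if defined(__CUDACC__)
#  define HOST_DEVICE __host__ __device__
#  define NO_INLINE   __noinline__
#elif defined(__GNUC__)
#  define HOST_DEVICE
#  define NO_INLINE   __attribute__((noinline))
#elif defined(_MSC_VER)
#  define HOST_DEVICE
#  define NO_INLINE   __declspec(noinline)
#else
#  define HOST_DEVICE
#  define NO_INLINE
#endif

// Assumed layout: an output buffer, a running position, and a capacity.
struct gadget_t {
    char*       buffer;
    std::size_t pos;
    std::size_t max_chars;
};

// Only this small helper is kept out of line; its callers may still be inlined.
HOST_DEVICE NO_INLINE void putchar_via_gadget(gadget_t* gadget, char c)
{
    std::size_t write_pos = gadget->pos++;   // count every character, even dropped ones
    if (write_pos < gadget->max_chars) {
        gadget->buffer[write_pos] = c;       // store only while there is room
    }
}

// Usage example on the host: the third character is dropped, but still counted.
inline void putchar_example()
{
    char storage[2] = {0, 0};
    gadget_t gadget{storage, 0, 2};
    putchar_via_gadget(&gadget, 'x');
    putchar_via_gadget(&gadget, 'y');
    putchar_via_gadget(&gadget, 'z');
    assert(gadget.pos == 3 && storage[0] == 'x' && storage[1] == 'y');
}
```

Keeping the out-of-line part this small limits how much of the generated code changes between the two builds, which is the point of the suggestion.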

@striker159 : Well I’ll be…! You’re right.

But here’s something weird: The SASS differs by > 200,000 lines (and each SASS is about 250K!!! lines), and the PTX situation is not much better.

… it looks like a lot, if not most, of the differences are register numbers; and there are occasional differences of placement of the same line. But there are still enough of other kinds of differences.
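
Since so many of the differences are register numbers, one way to shrink the diff is to replace numbered registers with placeholders in both dumps before comparing them. The following is a rough sketch of that idea. The function name normalize_registers is hypothetical, and it assumes registers appear as R<number> and P<number> in the disassembly text. It does not deal with reordered instructions or differing addresses.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Replace numbered general-purpose registers (R4, R12, ...) and predicate
// registers (P0, P3, ...) with placeholders. Special names such as RZ and PT
// contain no digits and are left untouched.
inline std::string normalize_registers(const std::string& sass_line)
{
    static const std::regex general_reg(R"(\bR[0-9]+\b)");
    static const std::regex predicate_reg(R"(\bP[0-9]+\b)");
    std::string result = std::regex_replace(sass_line, general_reg, "R#");
    return std::regex_replace(result, predicate_reg, "P#");
}

// Usage example: two lines differing only in register allocation compare equal.
inline void normalize_registers_example()
{
    assert(normalize_registers("IADD R4, R12, R7 ;") ==
           normalize_registers("IADD R9, R3, R20 ;"));
}
```

Applying this to every line of the two disassemblies (for example, the output of cuobjdump -sass) and then diffing the results should leave mostly the differences that are not just register renaming.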