Clock64() return value incorrect when debugged

We discovered that the clock64() return values weren’t sensible… but only when the process was being debugged with our debugger. But then we discovered that the same problem occurred when the process was debugged with cuda-gdb. So we suspect this is a driver, or possibly even hardware problem that is activated for any debugged process, regardless of the debugger.

The following is a pared-down example based on the testbed test that discovered this. It performs a spin loop waiting for clock64() to cross a threshold. (Don’t tell me that this is an insane thing to do; debugger tests frequently are insane.) When the process is debugged, that threshold never is crossed.

I’ve added some printf’s to track the clock64() values several times. I’ve also added a bug-out so that it gives up after 4 million iterations, so that the printf’s eventually get dumped to the user. Without that, the kernel would just run forever.

I suspect that the clock64() return value is being 32-bit truncated somewhere along the way, because I never see values larger than 2^32. Or even larger than 2^31, for that matter.

We found this on the Orin platform (capability 8.7). The software is L4T R35.4.1 and CUDA 11.4.
It could have a wider reach than just this platform, but I haven’t seen it anywhere else, which is why I’m reporting it here.

Here’s the promised test program:

#include <stdio.h>
#include <unistd.h>

// CUDA kernel
__global__ void Clocker()
{
   int i = blockDim.x * blockIdx.x + threadIdx.x;

   long long  clk    = clock64();
   long long  target = clk + 6500000000ull;
   long long  iters  = 0;
   long long  now;
   while ((now = clock64()) < target) { 
      iters++; 
      if (i == 0 && (iters % 500000) == 0) {
         printf("still waiting(%llu) now = %llu\n", iters, now);
      }
#if 1
      if (iters >= 4000000) {
         if (i == 0) {
            printf("Taking too long.  Something is wrong.  Bugging out.\n");
         }
         break;
      }
#endif
   }

   if (i == 0) {
      printf("clk       = %llu\n", clk);
      printf("target    = %llu\n", target);
      printf("iters     = %llu\n", iters);
      printf("clock64() = %llu\n", clock64());
      printf("clock()   = %llu\n", clock());
   }
}

#define  threadsPerBlock  32
#define  blocksPerGrid    32

// Host code
int main(int argc, char** argv)
{
   Clocker<<<blocksPerGrid, threadsPerBlock>>>();
   cudaError_t err = cudaGetLastError();
   if (err != cudaSuccess) {
      fprintf(stderr,
              "kernel launch failure: %s (%d)\n",
              cudaGetErrorString(err), err);
      exit(-1);
   }                                                             

   cudaDeviceSynchronize();

   printf("Done\n");

   exit(0);
}

To see it fail, built the above program, then just let it run under cuda-gdb. No breakpoints or anything needed.

/usr/local/cuda/bin/cuda-gdb clocks64
r

Hi,

We observe something different and want to confirm with you first.

The “Taking too long …” message appears when normally executed and debugged(with cuda-gdb).
The same error also occurs when running the sample on a desktop GPU.
Changing clock64 to clock, the same message also shows up.

So the problem is that clock64 is somehow truncated and the issue occurs on both dGPU and iGPU. And the issue is not limited to debugging time. Is that correct?

Thanks.

No, that is not the behavior I see. On my Orin system, and without the involvement of cuda-gdb, clock64() values do eventually exceed the threshold, and the loop exits normally without any “Taking too long” message.

For desktop systems: The original test that this pared-down example is based on has been working since forever. But that calculated the clock addend using the clock rate from properties. I suppose it’s possible that clock rates in olden times might have allowed the addend to remain below 2^32. I do have one convenient desktop system to try this on right now, so I did. The clock64() values correctly produce values >= 2^32, both debugged and non-debugged. The particulars of this desktop system were:

CentOS 7.5.1804 userland, CUDA 12.2, Nvidia driver 535.104.05.

More details from the Orin:

orion>/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Sun_Oct_23_22:16:07_PDT_2022
Cuda compilation tools, release 11.4, V11.4.315
Build cuda_11.4.r11.4/compiler.31964100_0
orion>head -1 /etc/nv_tegra_release 
# R35 (release), REVISION: 4.1, GCID: 33958178, BOARD: t186ref, EABI: aarch64, DATE: Tue Aug  1 19:57:35 UTC 2023
orion>cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for aarch64  35.4.1  Release Build  (root@ubuntu)  Mon 07 Aug 2023 10:12:22 AM EDT
GCC version:  gcc version 9.3.0 (Buildroot 2020.08) 

orion>make clock64
PATH=${PATH}:/usr/local/cuda/bin nvcc -gencode=arch=compute_87,code=\"sm_87,compute_87\"  --compiler-options -fno-strict-aliasing -DUNIX -g -G -Wno-deprecated-gpu-targets  -c -o clock64.o clock64.cu
g++ -L/usr/local/cuda/targets/aarch64-linux/lib -Wl,-rpath,/usr/local/cuda/targets/aarch64-linux/lib  clock64.o  -lcudart_static -lrt -lpthread -ldl -o clock64
orion>./clock64
still waiting(500000) now = 2454925652
still waiting(1000000) now = 4909515521
clk       = 383669
target    = 6500383669
iters     = 1323750
clock64() = 6500515673
clock()   = 18446744071620173766
Done
orion>/usr/local/cuda/bin/cuda-gdb ./clock64
NVIDIA (R) CUDA Debugger
11.4 release
Portions Copyright (C) 2007-2021 NVIDIA Corporation
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-elf-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./clock64...
(cuda-gdb) r
Starting program: /bob/raptorimage/claw/todd/cuda/clock64 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 39732]
[New Thread 0xffffdfffc900 (LWP 39735)]
[New Thread 0xffffd3ffc900 (LWP 39736)]
still waiting(500000) now = 636504484
still waiting(1000000) now = 1253086261
still waiting(1500000) now = 30912931
still waiting(2000000) now = 649615545
still waiting(2500000) now = 1269521920
still waiting(3000000) now = 49370057
still waiting(3500000) now = 660923395
still waiting(4000000) now = 1281331993
Taking too long.  Something is wrong.  Bugging out.
clk       = 176612
target    = 6500176612
iters     = 4000000
clock64() = 1281571812
clock()   = 1281612055
Done
[Thread 0xffffd3ffc900 (LWP 39736) exited]
[Thread 0xfffff7ff1900 (LWP 39728) exited]
[Inferior 1 (process 39728) exited normally]
(cuda-gdb) quit

Hmm. I just noticed that I reported this against Orin NX, but I’m using AGX Orin. (Too many Orins!) Any chance this could account for the difference in behavior?

Hi,

We observe the same on Orin:

$ ./test 
still waiting(500000) now = 72198615
still waiting(1000000) now = 143957544
still waiting(1500000) now = 215691763
still waiting(2000000) now = 287425721
still waiting(2500000) now = 359159312
still waiting(3000000) now = 430893477
still waiting(3500000) now = 502627265
still waiting(4000000) now = 574361464
Taking too long.  Something is wrong.  Bugging out.
clk       = 511253
target    = 6500511253
iters     = 4000000
clock64() = 574583264
clock()   = 574622240
Done
$ sudo /usr/local/cuda/bin/cuda-gdb ./test
NVIDIA (R) CUDA Debugger
11.4 release
Portions Copyright (C) 2007-2021 NVIDIA Corporation
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-elf-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test...
(No debugging symbols found in ./test)
(cuda-gdb) r
Starting program: /home/nvidia/topic_269270/test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 78951]
[New Thread 0xffffdfffc900 (LWP 78954)]
still waiting(500000) now = 71917020
still waiting(1000000) now = 143676985
still waiting(1500000) now = 215410927
still waiting(2000000) now = 287269440
still waiting(2500000) now = 359003082
still waiting(3000000) now = 430736842
still waiting(3500000) now = 502470541
still waiting(4000000) now = 574204043
Taking too long.  Something is wrong.  Bugging out.
clk       = 229658
target    = 6500229658
iters     = 4000000
clock64() = 574424402
clock()   = 574463582
Done
[Thread 0xffffdfffc900 (LWP 78954) exited]
[Inferior 1 (process 78947) exited normally]
(cuda-gdb) q

Which power mode do you use?
We test it with MaxN and fix the clock to the maximum.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

It looks like you’re seeing very different results. In both your cases, the clock64 is continually increasing and does produce values >= 2^32. The timings are different for some reason, but you could just adjust the bail-out threshold to 6,000,000 instead and they would stop. When I run this, the clock64 values are incapable of reaching the 6,500,000 value because of the truncation. (Well, assuming it is a truncate and not something weirder.)

As for power&clocks, I had left them as the default. These were correctness tests, not performance tests. But I change the power&clocks as you specified, and tried again, with similar results:

orion>./clock64
still waiting(500000) now = 2454731551
still waiting(1000000) now = 4909308866
clk       = 220202
target    = 6500220202
iters     = 1324071
clock64() = 6500342171
clock()   = 18446744071620002119
Done


orion>/usr/local/cuda/bin/cuda-gdb ./clock64
NVIDIA (R) CUDA Debugger
11.4 release
Portions Copyright (C) 2007-2021 NVIDIA Corporation
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-elf-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./clock64...
(cuda-gdb) r
Starting program: /bob/raptorimage/claw/todd/cuda/clock64 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 3981]
[New Thread 0xffffdfffc900 (LWP 3984)]
still waiting(500000) now = 2454741013
still waiting(1000000) now = 1009464084
still waiting(1500000) now = 3464029456
still waiting(2000000) now = 2014668308
still waiting(2500000) now = 567002766
still waiting(3000000) now = 3021567986
still waiting(3500000) now = 1572303999
still waiting(4000000) now = 123372737
Taking too long.  Something is wrong.  Bugging out.
clk       = 231524
target    = 6500231524
iters     = 4000000
clock64() = 123617028
clock()   = 123658628
Done
[Thread 0xfffff7ff1900 (LWP 3977) exited]
[Inferior 1 (process 3977) exited normally]

Hi,

Would you mind sharing the below output with us?

$ cat /etc/nv_tegra_release 

More, which Orin do you use? 32GB/64GB/devkit?
Thanks.

orion>cat /etc/nv_tegra_release 
# R35 (release), REVISION: 4.1, GCID: 33958178, BOARD: t186ref, EABI: aarch64, DATE: Tue Aug  1 19:57:35 UTC 2023

orion>grep Mem /proc/meminfo
MemTotal:       31326152 kB
MemFree:        28981296 kB
MemAvailable:   29711416 kB

The guy who flashed it also thought that the JetPack version might be important. I was skeptical because I though that was just the installer, but maybe it’s relevant. He tells me it was:

SDK Manager 2.0.0-11402 with JetPack 5.1.2

He also had this to say:

Also, sadly, all of our Orins here appear to be “illegitimate” – we
have 12 core Orins with 32 GB of RAM and according to NVIDIA that’s not
a sold configuration. Not sure if that could be related to anything, but
I wanted to mention it.

Is this accurate? Did we get some strange early-release Orins, maybe?

There is no update from you for a period, assuming this is not an issue any more.
Hence we are closing this topic. If need further support, please open a new one.
Thanks

Hi,

Yes, SDKmanager version should not impact the content of the installation.
Could you share the output of lshw and cat /proc/device-tree/nvidia,dtsfilename with us?

Thanks.