Clock64() return value incorrect when debugged, part 2

It looks like my original post on this was closed, so I’m posting again. Here’s the original post, so I won’t have to reiterate it:

https://forums.developer.nvidia.com/t/clock64-return-value-incorrect-when-debugged/269270

The last request was for some system-identifying data. I was able to get access to the system again, so here it is:

orion                       
    description: Computer
    product: Jetson AGX Orin Developer Kit
    vendor: Unknown
    version: Not Specified
    serial: 1421122110254
    width: 64 bits
    capabilities: smbios-3.0.0 dmi-3.0.0 smp tagged_addr_disabled
    configuration: boot=normal family=Unknown sku=Unknown
  *-core
       description: Motherboard
       product: Jetson AGX Orin Developer Kit
       vendor: Unknown
       physical id: 0
       version: Not Specified
       serial: 1421122110254
       slot: Unknown
     *-cache:0 DISABLED
          description: L1 cache
          physical id: 6
          slot: L1 Instruction Cache
          size: 64KiB
          capacity: 64KiB
          capabilities: internal instruction
          configuration: level=1
     *-cache:1 DISABLED
          description: L1 cache
          physical id: 1
          slot: L1 Data Cache
          size: 64KiB
          capacity: 64KiB
          capabilities: internal data
          configuration: level=1
     *-cache:2 DISABLED
          description: L2 cache
          physical id: 2
          slot: L2 Cache
          size: 256KiB
          capacity: 256KiB
          capabilities: internal unified
          configuration: level=2
     *-cache:3 DISABLED
          description: L3 cache
          physical id: 3
          slot: L3 Cache
          size: 2MiB
          capacity: 2MiB
          capabilities: internal unified
          configuration: level=3
     *-cpu:0
          description: CPU
          product: ARMv8 (Not Specified)
          vendor: Not Specified
          physical id: 4
          bus info: cpu@0
          version: Not Specified
          serial: Not Specified
          slot: CPU01
          size: 2201MHz
          capacity: 2201MHz
          clock: 31MHz
          capabilities: lm cpufreq
          configuration: cores=12 enabledcores=12 threads=1
     *-firmware
          description: BIOS
          vendor: EDK II
          physical id: 5
          version: 4.1-33958178
          date: 08/01/2023
          size: 1MiB
          capabilities: pci pnp upgrade shadowing cdboot bootselect acpi uefi
     *-memory
          description: Generic Memory
          physical id: b
          size: 29GiB
          capacity: 32GiB
     *-cpu:1 DISABLED
          description: CPU
          product: cpu-map
          physical id: 7
          bus info: cpu@0
     *-cpu:2 DISABLED
          description: CPU
          product: cpu
          physical id: 8
          bus info: cpu@1
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:3 DISABLED
          description: CPU
          product: cpu
          physical id: 9
          bus info: cpu@2
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:4 DISABLED
          description: CPU
          product: cpu
          physical id: a
          bus info: cpu@3
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:5 DISABLED
          description: CPU
          product: cpu
          physical id: c
          bus info: cpu@4
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:6 DISABLED
          description: CPU
          product: cpu
          physical id: d
          bus info: cpu@5
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:7 DISABLED
          description: CPU
          product: cpu
          physical id: e
          bus info: cpu@6
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:8 DISABLED
          description: CPU
          product: cpu
          physical id: f
          bus info: cpu@7
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:9 DISABLED
          description: CPU
          product: cpu
          physical id: 10
          bus info: cpu@8
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:10 DISABLED
          description: CPU
          product: cpu
          physical id: 11
          bus info: cpu@9
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:11 DISABLED
          description: CPU
          product: cpu
          physical id: 12
          bus info: cpu@10
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:12 DISABLED
          description: CPU
          product: cpu
          physical id: 13
          bus info: cpu@11
          size: 2201MHz
          capacity: 2201MHz
          capabilities: cpufreq
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:13 DISABLED
          description: CPU
          product: cpu
          physical id: 14
          bus info: cpu@12
        *-cache
             description: L1 Cache
             physical id: 0
             size: 64KiB
     *-cpu:14 DISABLED
          description: CPU
          product: cpu_core_power_states
          physical id: 15
          bus info: cpu@13
     *-pci:0
          description: PCI bridge
          product: NVIDIA Corporation
          vendor: NVIDIA Corporation
          physical id: 100
          bus info: pci@0001:00:00.0
          version: a1
          width: 32 bits
          clock: 33MHz
          capabilities: pci pm msi pciexpress msix normal_decode bus_master cap_list
          configuration: driver=pcieport
          resources: irq:64 ioport:1000(size=4096) memory:20a8000000-20a80fffff
        *-network DISABLED
             description: Wireless interface
             product: RTL8822CE 802.11ac PCIe Wireless Network Adapter
             vendor: Realtek Semiconductor Co., Ltd.
             physical id: 0
             bus info: pci@0001:01:00.0
             logical name: wlan0
             version: 00
             serial: ec:2e:98:ca:55:79
             width: 64 bits
             clock: 33MHz
             capabilities: pm msi pciexpress bus_master cap_list ethernet physical wireless
             configuration: broadcast=yes driver=rtl88x2ce driverversion=5.10.120-rt70-r35.4.1-tegra-Red latency=0 multicast=yes wireless=unassociated
             resources: irq:312 ioport:1000(size=256) memory:20a8000000-20a800ffff
     *-pci:1
          description: PCI bridge
          product: NVIDIA Corporation
          vendor: NVIDIA Corporation
          physical id: 0
          bus info: pci@0005:00:00.0
          version: a1
          width: 32 bits
          clock: 33MHz
          capabilities: pci pm msi pciexpress msix normal_decode bus_master cap_list
          configuration: driver=pcieport
          resources: irq:68 memory:2b28000000-2b280fffff
        *-generic
             description: System peripheral
             product: Concurrent Real-Time
             vendor: Concurrent Real-Time
             physical id: 0
             bus info: pci@0005:01:00.0
             version: 01
             width: 32 bits
             clock: 33MHz
             capabilities: msi pm pciexpress bus_master cap_list
             configuration: driver=rcim latency=0
             resources: irq:308 memory:2b28040000-2b28043fff memory:2b28000000-2b2803ffff
  *-usbhost:0
       product: xHCI Host Controller
       vendor: Linux 5.10.120-rt70-r35.4.1-tegra-RedHawk-8.4.7-trace xhci-hcd
       physical id: 1
       bus info: usb@1
       logical name: usb1
       version: 5.10
       capabilities: usb-2.00
       configuration: driver=hub slots=4 speed=480Mbit/s
     *-usb:0
          description: Bluetooth wireless interface
          product: Bluetooth Radio
          vendor: Realtek
          physical id: 3
          bus info: usb@1:3
          version: 0.00
          serial: 00e04c000001
          capabilities: bluetooth usb-1.00
          configuration: driver=rtk_btusb maxpower=500mA speed=12Mbit/s
     *-usb:1
          description: USB hub
          product: 4-Port USB 2.0 Hub
          vendor: Generic
          physical id: 4
          bus info: usb@1:4
          version: 1.40
          capabilities: usb-2.10
          configuration: driver=hub slots=4 speed=480Mbit/s
        *-usb
             description: Keyboard
             product: USB Receiver
             vendor: Logitech
             physical id: 2
             bus info: usb@1:4.2
             version: 12.09
             capabilities: usb-2.00
             configuration: driver=usbhid maxpower=98mA speed=12Mbit/s
  *-usbhost:1
       product: xHCI Host Controller
       vendor: Linux 5.10.120-rt70-r35.4.1-tegra-RedHawk-8.4.7-trace xhci-hcd
       physical id: 2
       bus info: usb@2
       logical name: usb2
       version: 5.10
       capabilities: usb-3.10
       configuration: driver=hub slots=4 speed=10000Mbit/s
     *-usb
          description: USB hub
          product: 4-Port USB 3.0 Hub
          vendor: Generic
          physical id: 3
          bus info: usb@2:3
          version: 1.40
          capabilities: usb-3.20
          configuration: driver=hub slots=4 speed=10000Mbit/s
  *-network:0 DISABLED
       description: Ethernet interface
       physical id: 3
       logical name: dummy0
       serial: 66:9f:46:20:65:9b
       capabilities: ethernet physical
       configuration: broadcast=yes driver=dummy driverversion=5.10.120-rt70-r35.4.1-tegra-Red
  *-network:1
       description: Ethernet interface
       physical id: 4
       logical name: eth0
       serial: 48:b0:2d:78:04:20
       size: 1Gbit/s
       capacity: 10Gbit/s
       capabilities: ethernet physical 100bt 100bt-fd 1000bt-fd 10000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=nvethernet driverversion=5.10.120-rt70-r35.4.1-tegra-Red duplex=full ip=10.134.31.140 link=yes multicast=yes port=twisted pair speed=1Gbit/s
  *-network:2
       description: Ethernet interface
       physical id: 5
       logical name: rndis0
       serial: 36:38:0e:26:d6:75
       capabilities: ethernet physical
       configuration: broadcast=yes driver=g_ether driverversion=29-May-2008 firmware=tegra-xudc link=no multicast=yes
  *-network:3
       description: Ethernet interface
       physical id: 6
       logical name: usb0
       serial: 36:38:0e:26:d6:77
       capabilities: ethernet physical
       configuration: broadcast=yes driver=g_ether driverversion=29-May-2008 firmware=tegra-xudc link=no multicast=yes

The 2nd question was for the device-tre, so here’s that:

orion>cat /proc/device-tree/nvidia,dtsfilename; echo
/dvs/git/dirty/git-master_linux/kernel/kernel-5.10/arch/arm64/boot/dts/../../../../../../hardware/nvidia/platform/t23x/concord/kernel-dts/tegra234-p3701-0000-p3737-0000.dts

Hi,

Orin devkit should have 12 cores+32GB memory.
So your device looks good.

Is it possible to check what value the clock64() is truncated?
This will help us to narrow down the cause.

Thanks.

I wrote a different test to get close at least:

#include <stdio.h>
#include <unistd.h>

// CUDA kernel
__global__ void Clocker()
{
   int i = blockDim.x * blockIdx.x + threadIdx.x;

   long long  iters  = 0;
   long long  last   = 0;
   int        bad    = 0;
   for (;;) {
      long long  now = clock64();
      if (i == 0 && now < last) {
         printf("iters=%llu last=%llu -> now=%llu\n", iters, last, now);
         bad++;
      }
      last = now;
      iters++;
      if (now >= 6000000000) break;
      if (iters >= 5000000) break;
      if (bad >= 4) break;
   }

   if (i == 0) {
      printf("iters     = %llu\n", iters);
      printf("bad       = %u\n", bad);
      printf("clock64() = %llu\n", clock64());
      printf("clock()   = %llu\n", clock());
   }
}

#define  threadsPerBlock  32
#define  blocksPerGrid    32

// Host code
int main(int argc, char** argv)
{
   Clocker<<<blocksPerGrid, threadsPerBlock>>>();
   cudaError_t err = cudaGetLastError();
   if (err != cudaSuccess) {
      fprintf(stderr,
              "kernel launch failure: %s (%d)\n",
              cudaGetErrorString(err), err);
      exit(-1);
   }                                                             

   cudaDeviceSynchronize();

   printf("Done\n");

   exit(0);
}

The output:

orion>make clock64-truncate
PATH=${PATH}:/usr/local/cuda/bin nvcc -gencode=arch=compute_87,code=\"sm_87,compute_87\"  --compiler-options -fno-strict-aliasing -DUNIX -g -G -Wno-deprecated-gpu-targets  -c -o clock64-truncate.o clock64-truncate.cu
g++ -L/usr/local/cuda/targets/aarch64-linux/lib -Wl,-rpath,/usr/local/cuda/targets/aarch64-linux/lib  clock64-truncate.o  -lcudart_static -lrt -lpthread -ldl -o clock64-truncate
rm clock64-truncate.o
orion>./clock64-truncate
iters     = 3599132
bad       = 0
clock64() = 6000101108
clock()   = 1705176466
Done
orion>/usr/local/cuda/bin/cuda-gdb -quiet ./clock64-truncate
Reading symbols from ./clock64-truncate...
(cuda-gdb) r
Starting program: /bob/raptorimage/claw/todd/cuda/clock64-truncate 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9634]
[New Thread 0xffffdfffc900 (LWP 9637)]
[New Thread 0xffffd3ffc900 (LWP 9638)]
iters=2304439 last=3841722939 -> now=361210
iters=4646915 last=3905378433 -> now=375795
iters     = 5000000
bad       = 2
clock64() = 589113465
clock()   = 589158354
Done
[Thread 0xffffdfffc900 (LWP 9637) exited]
[Thread 0xfffff7ff1900 (LWP 9630) exited]
[Inferior 1 (process 9630) exited normally]
(cuda-gdb) quit

I include the run without cuda-gdb to show that it still produces sensible numbers.
In the run under cuda-gdb, you can see that now “reverts” to small values twice before the arbitrary iteration count limit I set up:

iters=2304439 last=3841722939 -> now=361210
iters=4646915 last=3905378433 -> now=375795

It’s strange: those values are kind of close to 2^32, but not as close as I would’ve expected.
It appears that the clock64() is advancing by around 3841722939/2304439 = ~1667 each iteration, and the value 3905378433 much further away from 2^32 than that, around 390 million.

Hi,

Thanks for your patience.

We can reproduce this issue in our environment with the new example now!
Somehow our device runs slower and cannot capture this issue with iter=5000000.
After increasing the iter to 100000000, we also got the truncated clock output under cuda-gdb.

(cuda-gdb) r
Starting program: /home/nvidia/topic_273130/test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 30227]
[New Thread 0xffffdfffc900 (LWP 30230)]
iters=22683742 last=1837566831 -> now=338080
iters=45370366 last=1838012875 -> now=334787
iters=68062832 last=1838480292 -> now=334028
iters=90745568 last=1837691132 -> now=333707
iters     = 90745569
bad       = 4
clock64() = 750071608
clock()   = 750108779
Done
[Thread 0xffffdfffc900 (LWP 30230) exited]
[Inferior 1 (process 30223) exited normally]

We need to check this with our internal team. Will share more details with you later.

Thanks.

I’m glad it’s reproducible there, at least. It’s strange that the truncation happens at different values on your system. I think, on my latest run, I didn’t use jetson_clocks, so maybe slower clocks affected the results.
Anyway, we’ll assume that this is a bug known to you folks, and that you’ll fix it in due course. That’s an acceptable explanation for the failure of the test in our own debugger’s testbed.

Hi,

This behavior is not expected.
But we are checking with the cuda-gdb team is this is a bug.

Thanks.