Clock64() reversed hi and lo

I tried to use the clock64() function to measure some device code, but noticed that the hi and lo 32-bits seem to be reversed, It’s a simple post-processing step to account for this, but I couldn’t find any reference to this behavior anywhere so I’m wondering if there larger underlying problem.

To verify I read the special registers %clock64, %clock, and %clock_hi directly with inline assembly. %clock and %clock_hi are what I expect, but %clock64 seems to hold %clock << 32 | %clock_hi instead of that I expected (%clock_hi << 32 | %clock).

Interestingly the same seems to be the case for the special register %globaltimer and its 32-bit counterparts %globaltimer_hi and %globaltimer_lo.

I’m using OptiX 7.6 with PTX (rather than OptiX IR), but the OptiX docs state that the %clock... and &globaltimer... register interpretation is unchanged. I tried this on two different machines with a RTX 3090 and a GTX 1070 respectively, both with the same result.

Is this the expected behavior and I just failed to RTFM or did I mess up somewhere else in the process?

clock64() returns an long long that is arranged exactly as you would expect. The most rapidly changing bits are in the LSB direction, and the least rapidly changing bits are in the MSB direction.

I’ve never witnessed anything like what you’re suggesting using clock64() (which is not PTX). If what you are saying is true, then I would expect that two successive reads of clock64 would show unchanging lower 32-bit value. I’ve never witnessed that.

I can’t speak to Optix. There is a separate forum for Optix questions. If you have a non-Optix CUDA short, complete test case, my suggestion would be to post it. English text descriptions of what you are doing are far less useful.

sorry about that, I didn’t pay attention when selecting the forum. I changed it to OptiX.

If I write a minimal example in pure CUDA it seems to work as expected. So I assume I am using doing something wrong with OptiX.

Since OptiX requires quite a lot of boilerplate code I don’t want to write a minimal example from scratch. I was able to reproduce the behavior in the optix7course by Ingo Wald. Specifically, by modifying optix7course/example04_firstTriangleMesh/, adding the following after line 101:

if (ix == 0 && iy == 0) {
    printf("clock64(): %016llx\t %%clock64: %016lx \t%%clock_hi: %08x \t%%clock: %08x\n",
           clock64(), __clock64(), __clock_hi(), __clock_lo());

and the following before line 97:

static __device__ __inline__ uint64_t __clock64() {
    uint64_t val;
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(val));
    return val;

static __device__ __inline__ uint32_t __clock_hi() {
    uint32_t val;
    asm volatile("mov.u32 %0, %%clock_hi;" : "=r"(val));
    return val;

static __device__ __inline__ uint32_t __clock_lo() {
    uint32_t val;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(val));
    return val;

I compiled the code using OptiX 7.6 and nvcc V11.8.89 and ran it on a RTX 3090 with driver version 520.61.05.

Here is a sample of the output I’m seeing when running ./ex04_firstTriangleMesh:

clock64(): f8fda81400000010     %clock64: f8fda82100000010     %clock_hi: 00000010     %clock: f8fda833
clock64(): fa39241a00000010     %clock64: fa39242700000010     %clock_hi: 00000010     %clock: fa392439
clock64(): fb7501af00000010     %clock64: fb7501bc00000010     %clock_hi: 00000010     %clock: fb7501ce
clock64(): fcb0716700000010     %clock64: fcb0717400000010     %clock_hi: 00000010     %clock: fcb07186
clock64(): fdfe327100000010     %clock64: fdfe327e00000010     %clock_hi: 00000010     %clock: fdfe3290
clock64(): ff27c84700000010     %clock64: ff27c85400000010     %clock_hi: 00000010     %clock: ff27c866
clock64(): 00637bcd00000011     %clock64: 00637bda00000011     %clock_hi: 00000011     %clock: 00637bec
clock64(): 01abd42f00000011     %clock64: 01abd43c00000011     %clock_hi: 00000011     %clock: 01abd44e
clock64(): 02dac19400000011     %clock64: 02dac1a100000011     %clock_hi: 00000011     %clock: 02dac1b3
clock64(): 04166bb000000011     %clock64: 04166bbd00000011     %clock_hi: 00000011     %clock: 04166bcf
clock64(): 055213cb00000011     %clock64: 055213d800000011     %clock_hi: 00000011     %clock: 055213ea
clock64(): 068de35000000011     %clock64: 068de35d00000011     %clock_hi: 00000011     %clock: 068de36f

The upper 32-bit are counting up and it’s rolling over into the lower 32-bit. The same happens when reading the 64-bit %clock64 register directly. Reading the 32-bit registers holding the hi and lo 32-bits individually behaves as expected.

Thanks for the example code.
I filed an internal OptiX bug report for investigation.

I used the clock_t clock() function for a time view rendering mode without issues in the past.
Example code here:

Please check current R530 driver releases. This should have been fixed in the meantime.


I tried the new driver (530.30.02) and also tried upgrading CUDA (12.1.105) and OptiX (7.7.0) but the timestamp behavior is unchanged.

Moreover, with the new driver printf doesn’t seem to be working from device code anymore. Running the same code (both same binary and recompiled) on a machine with the new driver simply doesn’t print anything. CUDA and OptiX version don’t seem to have an affect on this issue. The lack of printf isn’t terrible, but makes debugging more tedious.

Sorry about that. I don’t know the exact display driver versions which would have received the fix. It might not have made it into that specific R530 release, yet

Yes, the printf issue is known and will be solved in a future driver release.
Until then please set the environment variable OPTIX_FORCE_DEPRECATED_LAUNCHER=1 mentioned inside the OptiX release notes for debugging and profiling purposes.

I checked the integrations and the fix should be inside the New Feature Branch driver 530.41.03 which is available for your RTX 3090 since March 23, 2023.
Please find it here: