We discovered that the clock64() return values weren’t sensible… but only when the process was being debugged with our debugger. But then we discovered that the same problem occurred when the process was debugged with cuda-gdb. So we suspect this is a driver, or possibly even hardware problem that is activated for any debugged process, regardless of the debugger.
The following is a pared-down example based on the testbed test that discovered this. It performs a spin loop waiting for clock64() to cross a threshold. (Don’t tell me that this is an insane thing to do; debugger tests frequently are insane.) When the process is debugged, that threshold never is crossed.
I’ve added some printf’s to track the clock64() values several times. I’ve also added a bug-out so that it gives up after 4 million iterations, so that the printf’s eventually get dumped to the user. Without that, the kernel would just run forever.
I suspect that the clock64() return value is being 32-bit truncated somewhere along the way, because I never see values larger than 2^32. Or even larger than 2^31, for that matter.
We found this on the Orin platform (capability 8.7). The software is L4T R35.4.1 and CUDA 11.4.
It could have a wider reach than just this platform, but I haven’t seen it anywhere else, which is why I’m reporting it here.
Here’s the promised test program:
#include <stdio.h>
#include <unistd.h>
// CUDA kernel
__global__ void Clocker()
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
long long clk = clock64();
long long target = clk + 6500000000ull;
long long iters = 0;
long long now;
while ((now = clock64()) < target) {
iters++;
if (i == 0 && (iters % 500000) == 0) {
printf("still waiting(%llu) now = %llu\n", iters, now);
}
#if 1
if (iters >= 4000000) {
if (i == 0) {
printf("Taking too long. Something is wrong. Bugging out.\n");
}
break;
}
#endif
}
if (i == 0) {
printf("clk = %llu\n", clk);
printf("target = %llu\n", target);
printf("iters = %llu\n", iters);
printf("clock64() = %llu\n", clock64());
printf("clock() = %llu\n", clock());
}
}
#define threadsPerBlock 32
#define blocksPerGrid 32
// Host code
int main(int argc, char** argv)
{
Clocker<<<blocksPerGrid, threadsPerBlock>>>();
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
fprintf(stderr,
"kernel launch failure: %s (%d)\n",
cudaGetErrorString(err), err);
exit(-1);
}
cudaDeviceSynchronize();
printf("Done\n");
exit(0);
}
To see it fail, built the above program, then just let it run under cuda-gdb. No breakpoints or anything needed.
/usr/local/cuda/bin/cuda-gdb clocks64
r