Kernel launch timeout

I see a number of mentions of kernel timeout issues, but none of them has helped resolve mine.

I’m using a Jetson Nano 2GB Developer Kit, running Ubuntu 18.04 LTS, with nvcc 10.2.

I’m evaluating performance for an application port with a simple scatter-gather benchmark:

Running this:

on a V100 with 1024 blocks and threads I get 1.7 TFlop:

$ ./cudapi
npts = 1048576, nloop = 1000000, pi = 3.141593
time = 3.067751, estimated MFlops = 1709030.491719

On a Nano with 32 blocks and threads I get 1.6 GFlop:

$ ./cudapi
npts = 1024, nloop = 1000000, pi = 3.141593
time = 3.201548, estimated MFlops = 1599.226374

Increasing that to 64 blocks and threads I get the dreaded launch timeout:

cudaDeviceSynchronize: the launch timed out and was terminated

Any suggestions for how to prevent that?
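
For reference, "the launch timed out and was terminated" is the string the CUDA runtime returns for `cudaErrorLaunchTimeout`. The following is a minimal sketch, not taken from the benchmark (whose source is not shown in this thread), of a check that prints errors in the same `call: message` form and flags a timeout specifically. The helper name `is_launch_timeout` is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints "<what>: <error string>" for any failure and
// returns true only when the failure is a kernel launch timeout.
bool is_launch_timeout(cudaError_t err, const char *what) {
    if (err != cudaSuccess)
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return err == cudaErrorLaunchTimeout;
}
```

It could be called as `is_launch_timeout(cudaDeviceSynchronize(), "cudaDeviceSynchronize")` right after a kernel launch.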

Hi,

Since the Nano 2GB has quite limited resources, the timeout may originate from running out of memory.
Would you mind monitoring your device with tegrastats while the program runs?

$ sudo tegrastats

Thanks.

Here’s the logfile while I run the program.

Note that I’m asking for just 64*64 doubles, and cudaMalloc doesn’t give an error.

logfile (4.5 KB)

Hi,

Based on the log, it seems that you are using the Nano 4GB module rather than 2GB.
Could you confirm this?

The memory looks good to us.
We are going to reproduce this issue in our environment.
We will share more information with you later.

Thanks.

Hi,

cudaDeviceSynchronize: the launch timed out and was terminated
cudaMemcpy: the launch timed out and was terminated

Is this the timeout issue you mentioned?
We got this error with 1024, 32 and 64 blocks.

Have you checked the output data?
Could you confirm whether the output is correct?

Thanks.

I had gotten both 2GB and 4GB modules to test. I don’t recall which one I used for this, but the log is clear that it’s the 4GB. The memory I’m requesting when the kernel fails is much less than either.

That is the timeout issue. The output is incorrect; it should be pi (that’s one of the reasons I use this test).

I’ve run this on many other GPUs without issues.

Hi,

We got some information from our internal team.
This is expected behavior.

On Jetson Nano, there is a 5-second limit for each kernel.
Any kernel that does not complete within 5 seconds will be killed by the watchdog.

Other Jetson platforms, like Xavier or TX2, don’t have such a limit on kernel execution time.
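
The CUDA runtime exposes a `kernelExecTimeoutEnabled` field in `cudaDeviceProp`, which reports whether a run time limit applies to kernels on a device. The sketch below queries it; whether this field reflects the Nano watchdog described here is an assumption, since the thread does not say. The function name `kernel_timeout_enabled` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if the runtime reports a run time limit on kernels for `device`,
// 0 if it reports none, and -1 if the query itself fails (e.g. no device).
int kernel_timeout_enabled(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}
```

Checking this at startup would let an application decide up front whether long kernels need special handling on a given board.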

Thanks.

That’s a surprising limit. Where is it documented? And we have a good use case for kernels that take longer than that – is it possible to modify that time, or is it necessary to switch boards?

Hi,

A workaround is to disable the GPU timeout watchdog:

$ sudo -s
$ echo N > /sys/kernel/debug/gpu.0/timeouts_enabled

But please note that Nano doesn’t have slice support.
Under full GPU load, other kernels will need to wait for the previous one to finish before they can get the resources.
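
Since the limit applies to each kernel, an alternative that leaves the watchdog enabled is to split a long loop across several shorter launches. The sketch below is hypothetical: the benchmark source is not shown in this thread, so the kernel `accumulate_chunk`, the wrapper `run_chunked`, and their parameters are all assumptions, and the per-iteration work is a placeholder. The chunk size would have to be tuned so that each launch finishes well within the limit.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one slice of a long loop: each thread adds the
// terms start, start+1, ..., start+iters-1 to its running total in acc[tid].
__global__ void accumulate_chunk(double *acc, long long start, long long iters) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = acc[tid];                      // resume from the previous launch
    for (long long i = 0; i < iters; ++i)
        sum += static_cast<double>(start + i);  // placeholder for the real work
    acc[tid] = sum;                             // one global write per launch
}

// Runs total_iters iterations per thread as several launches of at most
// chunk_iters each. d_acc must hold blocks * threads initialized doubles.
cudaError_t run_chunked(double *d_acc, int blocks, int threads,
                        long long total_iters, long long chunk_iters) {
    if (chunk_iters <= 0) return cudaErrorInvalidValue;  // avoid an endless loop
    for (long long start = 0; start < total_iters; start += chunk_iters) {
        long long remaining = total_iters - start;
        long long iters = remaining < chunk_iters ? remaining : chunk_iters;
        accumulate_chunk<<<blocks, threads>>>(d_acc, start, iters);
        cudaError_t err = cudaGetLastError();                   // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // run-time errors
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

With `total_iters = 1000000` and `chunk_iters = 10000`, this would issue 100 launches. Synchronizing after every launch costs some throughput, but it means a timeout or other failure is reported for the chunk that caused it.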

After doing this, we can get the correct PI output with your app.
Please also note that you can maximize the Nano performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Running:

sudo -s
echo N > /sys/kernel/debug/gpu.0/timeouts_enabled

I now get:

blocks  threads  MFlops
64      64       1,631
128     128      1,651

Adding:

sudo nvpmodel -m 0
sudo jetson_clocks

doesn’t make a difference:

blocks  threads  MFlops
32      32       1,642
64      64       1,642
128     128      1,654
256     256      1,642

On a V100 I get:

blocks  threads  MFlops
1024    1024     1,616,770

From the Jetson specs, shouldn’t these be closer? Are there any code optimizations needed for the Jetson?

Hi,

Thanks for the update.

Are you comparing against the performance listed below?
https://developer.nvidia.com/embedded/jetson-modules

Jetson Nano
AI Performance 472 GFLOPs

Please note that this is pure calculation performance.
Since there are some memory accesses in your source, they are expected to affect the performance.

Memory access speed depends on the device and the type of memory used.
In particular, Jetson is an integrated memory system.

We need to check further to see if any optimization can be applied.
We will share more information with you later.

Thanks.

Yes; this simple benchmark is predictive for our more complex workloads.

On a V100 it gives 1.6 TFlop vs. your spec of 7, or about 23%; that’s a typical ratio.

But on the Jetson it gives 1.6 GFlop vs. your spec of 472, or about 0.3%; that’s the gap I’d like to reduce.

Thanks for the confirmation.

We are checking possible optimizations internally.
We will get back to you later.

Any insight into the significant difference in the expected performance on this benchmark task? Should I post it as a separate issue, since the timeout issue has been addressed?

Hi,

We have done some investigation but found no obvious improvement so far.

It seems the sample accesses global memory frequently.
This kind of access is relatively slow on Nano due to its bandwidth limitation.
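
As an illustration of what reducing global memory traffic can look like, here is a hypothetical pair of kernels that do the same arithmetic. They are not the benchmark from this thread, whose source is not shown; the names and the per-iteration work are assumptions. No timings were measured for this sketch, and the compiler may already keep the value of the first variant in a register on its own.

```cuda
#include <cuda_runtime.h>

// Hypothetical variant A: as written, updates global memory every iteration.
__global__ void sum_in_global(double *out, int iters) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        out[tid] += 1.0 / (i + 1.0);
}

// Hypothetical variant B: same terms in the same order, accumulated in a
// local variable, with one global read and one global write per thread.
__global__ void sum_in_register(double *out, int iters) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = out[tid];
    for (int i = 0; i < iters; ++i)
        sum += 1.0 / (i + 1.0);
    out[tid] = sum;
}
```

Both kernels expect `out` to hold one initialized double per launched thread. Comparing their run times on the Nano would show how much of the gap, if any, comes from this access pattern.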

Thanks.
