I was trying to understand the CUDA __nanosleep function, available on Volta and later architectures, which puts threads to sleep.
This gets lowered to PTX nanosleep. According to this documentation, this instruction provides sleep in the range of [0,2*t].
That is quite a large range. Does it also mean the h/w can ignore the instruction completely, for example by sleeping for 0 ns irrespective of the argument passed to the instruction?
I wanted to know if this is some discrepancy in the documentation. If not, how can this instruction be used reliably, given that the h/w may simply not put the threads to sleep?
In the worst case, there is a large range. There is also no published indication of anything further characterizing the variability, AFAIK. I agree the function has questionable utility for situations that require some definition of exact timing.
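Since nothing published characterizes the variability, one option is to measure it on your own GPU. Here is a minimal sketch, not a definitive benchmark: it times repeated __nanosleep calls with the PTX globaltimer register and records the shortest and longest observed duration. The names globaltimer_ns, sample_nanosleep and sample_sleep are my own, hypothetical helpers. I'm assuming globaltimer is monotonic and counts nanoseconds; its update granularity may be coarse, so treat the numbers as approximate.

```cuda
#include <cuda_runtime.h>

// Read the PTX %globaltimer special register (a 64-bit nanosecond counter).
__device__ __forceinline__ unsigned long long globaltimer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Hypothetical kernel: time `trials` calls to __nanosleep(requested_ns)
// and record the shortest and longest duration seen by one thread.
__global__ void sample_nanosleep(unsigned requested_ns, int trials,
                                 unsigned long long* min_ns,
                                 unsigned long long* max_ns) {
    unsigned long long lo = ~0ULL, hi = 0ULL;
    for (int i = 0; i < trials; ++i) {
        const unsigned long long t0 = globaltimer_ns();
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
        __nanosleep(requested_ns);  // documented range: [0, 2 * requested_ns]
#endif
        const unsigned long long dt = globaltimer_ns() - t0;
        lo = dt < lo ? dt : lo;
        hi = dt > hi ? dt : hi;
    }
    *min_ns = lo;
    *max_ns = hi;
}

// Hypothetical host wrapper: returns false if any CUDA call fails.
bool sample_sleep(unsigned requested_ns, int trials,
                  unsigned long long* min_ns, unsigned long long* max_ns) {
    unsigned long long* d_out = nullptr;  // d_out[0] = min, d_out[1] = max
    unsigned long long h_out[2] = {0ULL, 0ULL};
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    sample_nanosleep<<<1, 1>>>(requested_ns, trials, d_out, d_out + 1);
    const bool ok = cudaMemcpy(h_out, d_out, sizeof(h_out),
                               cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    *min_ns = h_out[0];
    *max_ns = h_out[1];
    return ok;
}
```

Calling sample_sleep(1000, 100, &lo, &hi) from host code gives a rough idea of the spread for a 1000 ns request on one thread; results will likely differ with occupancy and across GPUs.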
In addition, the nanosleep function has a maximum requestable sleep value of ~1 ms. I expect a note to this effect to be present in the PTX docs of the next major CUDA release.
Depending on your interests, you might wish to explore the PTX special register globaltimer or the CUDA C++ clock64() function, to build your own delay. Yes, I’m aware that globaltimer also has wording that seems to discourage its use.
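As a rough illustration of building your own delay, here is a sketch that polls globaltimer until a minimum time has passed, using __nanosleep only as a back-off hint inside the loop. That way the lower bound comes from the timer, not from the sleep. The helper names (read_globaltimer, delay_at_least, measure_delay) and the 100 ns back-off are hypothetical choices of mine, and I'm assuming globaltimer is monotonic and counts nanoseconds.

```cuda
#include <cuda_runtime.h>

// Read the PTX %globaltimer special register (a 64-bit nanosecond counter).
__device__ __forceinline__ unsigned long long read_globaltimer() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Hypothetical helper: return only after at least min_ns have elapsed
// according to globaltimer.
__device__ void delay_at_least(unsigned long long min_ns) {
    const unsigned long long start = read_globaltimer();
    while (read_globaltimer() - start < min_ns) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
        // Back-off hint only; the loop condition enforces the minimum.
        __nanosleep(100);
#endif
    }
}

__global__ void delay_kernel(unsigned long long min_ns,
                             unsigned long long* elapsed_ns) {
    const unsigned long long t0 = read_globaltimer();
    delay_at_least(min_ns);
    *elapsed_ns = read_globaltimer() - t0;
}

// Hypothetical host wrapper: returns the elapsed time observed by the
// kernel in nanoseconds, or 0 if a CUDA call fails.
unsigned long long measure_delay(unsigned long long min_ns) {
    unsigned long long* d_elapsed = nullptr;
    unsigned long long h_elapsed = 0ULL;
    if (cudaMalloc(&d_elapsed, sizeof(h_elapsed)) != cudaSuccess) return 0ULL;
    delay_kernel<<<1, 1>>>(min_ns, d_elapsed);
    if (cudaMemcpy(&h_elapsed, d_elapsed, sizeof(h_elapsed),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        h_elapsed = 0ULL;
    }
    cudaFree(d_elapsed);
    return h_elapsed;
}
```

The delay can overshoot by however long the last back-off sleep and timer update take, so this gives a lower bound, not exact timing. The same loop could be written against clock64() if you prefer to count SM cycles instead of nanoseconds.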
For modern, forward-looking usage, the best approach is probably to see if you can adapt the libcu++ chrono functionality to your needs.
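A minimal sketch of what that could look like, assuming cuda::std::chrono::system_clock::now() is callable in device code in your libcu++ version. The names chrono_delay, chrono_delay_kernel and measure_chrono_delay are hypothetical helpers, not part of libcu++. Because the wait is a deadline loop on the clock, it is not subject to the ~1 ms cap on a single nanosleep request.

```cuda
#include <cuda/std/chrono>
#include <cuda_runtime.h>

// Hypothetical helper: busy-wait until at least `d` has elapsed
// according to cuda::std::chrono::system_clock.
__device__ void chrono_delay(cuda::std::chrono::nanoseconds d) {
    namespace chrono = cuda::std::chrono;
    const auto deadline = chrono::system_clock::now() + d;
    while (chrono::system_clock::now() < deadline) {
        // Spin; a short __nanosleep could be added here as a back-off hint.
    }
}

__global__ void chrono_delay_kernel(long long min_ns, long long* elapsed_ns) {
    namespace chrono = cuda::std::chrono;
    const auto t0 = chrono::system_clock::now();
    chrono_delay(chrono::nanoseconds(min_ns));
    const auto t1 = chrono::system_clock::now();
    *elapsed_ns = chrono::duration_cast<chrono::nanoseconds>(t1 - t0).count();
}

// Hypothetical host wrapper: returns the elapsed time observed by the
// kernel in nanoseconds, or 0 if a CUDA call fails.
long long measure_chrono_delay(long long min_ns) {
    long long* d_elapsed = nullptr;
    long long h_elapsed = 0;
    if (cudaMalloc(&d_elapsed, sizeof(h_elapsed)) != cudaSuccess) return 0;
    chrono_delay_kernel<<<1, 1>>>(min_ns, d_elapsed);
    if (cudaMemcpy(&h_elapsed, d_elapsed, sizeof(h_elapsed),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        h_elapsed = 0;
    }
    cudaFree(d_elapsed);
    return h_elapsed;
}
```

For example, measure_chrono_delay(2000000) requests a 2 ms delay, which a single nanosleep call could not express.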
The libcu++ chrono implementation relies on the PTX globaltimer. It provides an abstraction for CUDA usage, which is great. The 1 ms limitation does not seem to be a problem here either.
From the programmer's perspective, I think the end result might be the same, i.e., a thread gets delayed by some approximate number of cycles. However, on the hardware side, I believe (inferring from the documentation) that __nanosleep provides much deeper functionality: it suspends the thread in hardware, which might have other hardware implications.