I found some weird situation about cudaMemcpy() behavior.
{
c10::cuda::CUDAStreamGuard guard(stream);
for (int i = 0; i < 100; i++) {
long st = now();
std::vector<torch::jit::IValue> inputs;
auto input = torch::ones({1, 3, 224, 224}, torch::TensorOptions().dtype(torch::kFloat32));
inputs.push_back(input.to(device_2));
stream.synchronize();
long en = now();
std::cout << (en - st) / 1e6 << std::endl;
std::this_thread::sleep_for(std::chrono::milliseconds(std::stoi(argv[1])));
}
}
This is my code.
If sleep time(argv[1]) exceeds about 16s, the time for moving input data is getting slow.
under 16s â 0.3ms
over 16s â 13ms
So, i check the nsys profile with my program(over 16s) and get this results.
Itâs been documented in various forum posts that putting a thread to sleep may impact the performance of subsequent cuda activity. here is one example but there are at least several others. The performance will eventually return to normal âby itselfâ but for a period of time after the thread wakes, there is a noticeable performance impact.
The performance reduction may be connected to GPU clock (management) behavior and the clock management system in place when the thread comes out of sleep mode. You can try the following to lock the GPU into P0/max power /max clock . (Note that the machine might consume more power in this state , you can try with any power lower than max value if this could be a concern to you)
nvidia-smi -q -d POWER | findstr Max
note down the âMax Power Limitâ for all GPUs
nvidia-smi -pl maxPower -i GPUId
set the max power read in step 1 to GPU , GPUid refers to your local GPU id
nvidia-smi -q -d CLOCK | findstr SM
find and note down the max SM clock , just choose the highest if several print
nvidia-smi --lock-gpu-clocks=1356,1356 -i GPUid
replace 1356 with the max clock you read in step 3 , GPUid refers to your local GPU id
Then rerun your test to see if the performance after sleep recovers âmore quicklyâ. if that doesnât help, I donât have any further suggestions here. Itâs quite possible as you point out that there is some other mechanism involved (perhaps OS related, in the ioctl system.)
Currently, i just check the gpu device idle time and execute some small size kernel to maintain performance.
I will check my device setup with your advice!
And i think that this problem also happens not only when thread sleep but also when thread do not use gpu device for a long time (long time for cpu jobs and try to use gpu device).
ioctl (I/O control) is typically just a way of interfacing with device drivers for various hardware. You would need to figure out which device the ioctl calls refer to, what operations are requested from the device via ioctl, and what exactly the device driver does in response to those requests. If the driver software is open source, that is doable. If the driver is closed source, it will likely remain a mystery unless you invest much time into reverse engineering.
Over time ioctl has also become a method of communicating with pseudo-devices, such as system monitoring software.