Is there any timeout or lifetime in CUDA ioctl?

Hi.

I have found a weird situation with cudaMemcpy() behavior.

  {
    c10::cuda::CUDAStreamGuard guard(stream);  // stream and device_2 are set up elsewhere
    for (int i = 0; i < 100; i++) {
      long st = now();
      std::vector<torch::jit::IValue> inputs;
      auto input = torch::ones({1, 3, 224, 224},
                               torch::TensorOptions().dtype(torch::kFloat32));
      inputs.push_back(input.to(device_2));  // host-to-device copy being timed
      stream.synchronize();
      long en = now();
      std::cout << (en - st) / 1e6 << std::endl;  // elapsed copy time in ms

      std::this_thread::sleep_for(std::chrono::milliseconds(std::stoi(argv[1])));
    }
  }

This is my code.

If the sleep time (argv[1]) exceeds about 16 s, moving the input data to the device gets noticeably slower:

under 16 s → 0.3 ms
over 16 s → 13 ms

So I checked an nsys profile of my program (with sleep over 16 s) and got these results.

The ioctl in the OS runtime libraries trace blocks the cudaMemcpyAsync.

The situation with sleeps under 16 s does not show this ioctl blocking.

Is there any timeout or lifetime in the GPU device ioctl?

If so, can I keep this ioctl alive regardless of the sleep time?


Update: my new code, without libtorch.

#include <iostream>
#include <chrono>
#include <thread>
#include <cuda.h>
#include <cuda_runtime.h>

typedef std::chrono::steady_clock::time_point time_point;

time_point hrt()
{
    return std::chrono::steady_clock::now();
}

// Offset between the system clock epoch and the steady clock, captured once
// at startup, so now() returns wall-clock nanoseconds from a monotonic source.
long epoch_time = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::system_clock::now().time_since_epoch()).count();

time_point epoch = hrt();

long nanos(time_point t) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t - epoch).count() + epoch_time;
}

long now() {
    return nanos(hrt());
}

int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <sleep_ms>" << std::endl;
        return 1;
    }

    int N = 1 * 3 * 224 * 224;
    float *x, *d_x;

    x = (float*)malloc(N * sizeof(float));
    cudaMalloc(&d_x, N * sizeof(float));

    for (int i = 0; i < 10; i++) {
        long st = now();
        cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();
        long en = now();
        std::cout << (en - st) / 1e6 << std::endl;  // copy time in ms

        std::this_thread::sleep_for(std::chrono::milliseconds(std::stoi(argv[1])));
    }

    cudaFree(d_x);
    free(x);
}

This code also shows the same behavior.


It’s been documented in various forum posts that putting a thread to sleep may impact the performance of subsequent CUDA activity. Here is one example, but there are at least several others. The performance will eventually return to normal “by itself”, but for a period of time after the thread wakes there is a noticeable performance impact.

The performance reduction may be connected to GPU clock management behavior when the thread comes out of sleep. You can try the following to lock the GPU into P0 / max power / max clocks. (Note that the machine might consume more power in this state; you can use any power limit below the maximum if that is a concern to you.)

  1. nvidia-smi -q -d POWER | findstr Max
     Note down the ‘Max Power Limit’ for all GPUs.

  2. nvidia-smi -pl maxPower -i GPUid
     Set the max power limit read in step 1 on the GPU; GPUid refers to your local GPU id.

  3. nvidia-smi -q -d CLOCK | findstr SM
     Find and note down the max SM clock; if several values print, just choose the highest.

  4. nvidia-smi --lock-gpu-clocks=1356,1356 -i GPUid
     Replace 1356 with the max clock you read in step 3; GPUid refers to your local GPU id.

Then rerun your test to see if the performance after sleep recovers “more quickly”. If that doesn’t help, I don’t have any further suggestions here. It’s quite possible, as you point out, that there is some other mechanism involved (perhaps OS related, in the ioctl system).
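
If you prefer to apply the same settings from code rather than via nvidia-smi, a minimal NVML sketch could look like the following. This assumes device index 0, linking against -lnvidia-ml, and root privileges for the two set calls; it is only an illustration of the idea, not a drop-in tool.

// Minimal sketch: lock GPU clocks and raise the power limit via NVML,
// the same operations the nvidia-smi steps above perform.
// Assumes device index 0; link with -lnvidia-ml.
#include <nvml.h>
#include <cstdio>

int main()
{
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) return 1;

    // Steps 1+2: query the max power limit (milliwatts) and apply it.
    unsigned int minLimit = 0, maxLimit = 0;
    if (nvmlDeviceGetPowerManagementLimitConstraints(dev, &minLimit, &maxLimit)
            == NVML_SUCCESS) {
        nvmlDeviceSetPowerManagementLimit(dev, maxLimit);  // needs root
    }

    // Steps 3+4: query the max SM clock (MHz) and lock min == max to it.
    unsigned int maxSm = 0;
    if (nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_SM, &maxSm) == NVML_SUCCESS) {
        nvmlDeviceSetGpuLockedClocks(dev, maxSm, maxSm);   // needs root
    }

    printf("power limit %u mW, SM clock locked to %u MHz\n", maxLimit, maxSm);
    nvmlShutdown();
    return 0;
}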


Thanks for your reply.

Currently, I just check the GPU device idle time and run a small kernel to keep the performance up.
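
Roughly, my keep-alive idea looks like this sketch (the kernel, the 10 s threshold, and the names are just illustrative):

// Rough sketch of the keep-alive idea: if the GPU has been idle too long,
// launch a trivial kernel so the device does not fall back to low clocks.
// The 10 s threshold and all names here are illustrative.
#include <chrono>
#include <cuda_runtime.h>

__global__ void keepAliveKernel() {}

void maybeKeepAlive(std::chrono::steady_clock::time_point& lastGpuUse)
{
    auto idle = std::chrono::steady_clock::now() - lastGpuUse;
    if (idle > std::chrono::seconds(10)) {
        keepAliveKernel<<<1, 1>>>();     // tiny "do nothing" launch
        cudaDeviceSynchronize();
        lastGpuUse = std::chrono::steady_clock::now();
    }
}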

I will check my device setup with your advice!

And I think that this problem happens not only when the thread sleeps, but also when the thread does not use the GPU for a long time (e.g. a long stretch of CPU-only work before trying to use the GPU again).

That is also going to trigger a similar clock-reduction management scheme.

It should also be amenable to the “clock-locking” that I indicated.


I set my device to “clock-locking” (Max Power 280 → 320, max clock locked to 2100,2100).

Performance is slightly better but still too slow (improved by about 1 ms).

But the profile still shows the ioctl blocking.

Maybe I have to study the OS ioctl mechanism.

Thanks for your help!

ioctl (I/O control) is typically just a way of interfacing with device drivers for various hardware. You would need to figure out which device the ioctl calls refer to, what operations are requested from the device via ioctl, and what exactly the device driver does in response to those requests. If the driver software is open source, that is doable. If the driver is closed source, it will likely remain a mystery unless you invest much time into reverse engineering.

Over time ioctl has also become a method of communicating with pseudo-devices, such as system monitoring software.
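
To make that concrete, an ioctl call from user space is just a request code plus an argument pointer handed to a driver. Here is a generic illustration (deliberately unrelated to the NVIDIA driver) using the well-known TIOCGWINSZ request against the terminal driver:

// Generic illustration of the ioctl mechanism: ask the terminal driver
// for the window size via TIOCGWINSZ. The GPU driver uses the same
// syscall, just with its own private request codes and structures.
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    struct winsize ws;
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == 0)
        printf("terminal: %u rows x %u cols\n",
               (unsigned)ws.ws_row, (unsigned)ws.ws_col);
    else
        perror("ioctl");
    return 0;
}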
