Is there any timeout or lifetime in CUDA ioctl?

Hi.

I have found a weird situation with cudaMemcpy() behavior.

  {
    c10::cuda::CUDAStreamGuard guard(stream);  // stream and device_2 are set up elsewhere
    for (int i = 0; i < 100; i++) {
      long st = now();
      std::vector<torch::jit::IValue> inputs;
      auto input = torch::ones({1, 3, 224, 224},
                               torch::TensorOptions().dtype(torch::kFloat32));
      inputs.push_back(input.to(device_2));  // host-to-device copy being timed
      stream.synchronize();
      long en = now();
      std::cout << (en - st) / 1e6 << std::endl;  // elapsed copy time in ms

      std::this_thread::sleep_for(std::chrono::milliseconds(std::stoi(argv[1])));
    }
  }

This is my code.

If the sleep time (argv[1]) exceeds about 16 s, moving the input data to the device gets noticeably slower:

under 16 s → 0.3 ms
over 16 s → 13 ms

So I checked an nsys profile of my program (with sleep over 16 s) and got these results.

The ioctl in the OS runtime libraries trace blocks the cudaMemcpyAsync.

The situation with sleeps under 16 s does not show this ioctl blocking.

Is there any timeout or lifetime in the GPU device ioctl?

If so, can I keep this ioctl alive regardless of the sleep time?


Update: my new code, without libtorch.

#include <iostream>
#include <chrono>
#include <thread>
#include <cuda.h>
#include <cuda_runtime.h>

typedef std::chrono::steady_clock::time_point time_point;

time_point hrt()
{
    return std::chrono::steady_clock::now();
}

// Offset between the system clock epoch and the steady clock, captured once
// at startup, so now() returns wall-clock nanoseconds from a monotonic source.
long epoch_time = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::system_clock::now().time_since_epoch()).count();

time_point epoch = hrt();

long nanos(time_point t) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t - epoch).count() + epoch_time;
}

long now() {
    return nanos(hrt());
}

int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <sleep_ms>" << std::endl;
        return 1;
    }

    int N = 1 * 3 * 224 * 224;
    float *x, *d_x;

    x = (float*)malloc(N * sizeof(float));
    cudaMalloc(&d_x, N * sizeof(float));

    for (int i = 0; i < 10; i++) {
        long st = now();
        cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();
        long en = now();
        std::cout << (en - st) / 1e6 << std::endl;  // copy time in ms

        std::this_thread::sleep_for(std::chrono::milliseconds(std::stoi(argv[1])));
    }

    cudaFree(d_x);
    free(x);
}

This code also shows the same behavior.


It’s been documented in various forum posts that putting a thread to sleep may impact the performance of subsequent CUDA activity. Here is one example, but there are at least several others. The performance will eventually return to normal “by itself”, but for a period of time after the thread wakes there is a noticeable performance impact.

The performance reduction may be connected to GPU clock management behavior when the thread comes out of sleep. You can try the following to lock the GPU into P0 / max power / max clocks. (Note that the machine might consume more power in this state; you can use any power limit below the maximum if that is a concern to you.)

  1. nvidia-smi -q -d POWER | findstr Max
     Note down the ‘Max Power Limit’ for all GPUs.

  2. nvidia-smi -pl maxPower -i GPUid
     Set the max power limit read in step 1 on the GPU; GPUid refers to your local GPU id.

  3. nvidia-smi -q -d CLOCK | findstr SM
     Find and note down the max SM clock; if several values print, just choose the highest.

  4. nvidia-smi --lock-gpu-clocks=1356,1356 -i GPUid
     Replace 1356 with the max clock you read in step 3; GPUid refers to your local GPU id.

Then rerun your test to see if the performance after sleep recovers “more quickly”. If that doesn’t help, I don’t have any further suggestions here. It’s quite possible, as you point out, that there is some other mechanism involved (perhaps OS related, in the ioctl system).
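
If you prefer to apply the same settings from code rather than via nvidia-smi, a minimal NVML sketch could look like the following. This assumes device index 0, linking against -lnvidia-ml, and root privileges for the two set calls; it is only an illustration of the idea, not a drop-in tool.

// Minimal sketch: lock GPU clocks and raise the power limit via NVML,
// the same operations the nvidia-smi steps above perform.
// Assumes device index 0; link with -lnvidia-ml.
#include <nvml.h>
#include <cstdio>

int main()
{
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) return 1;

    // Steps 1+2: query the max power limit (milliwatts) and apply it.
    unsigned int minLimit = 0, maxLimit = 0;
    if (nvmlDeviceGetPowerManagementLimitConstraints(dev, &minLimit, &maxLimit)
            == NVML_SUCCESS) {
        nvmlDeviceSetPowerManagementLimit(dev, maxLimit);  // needs root
    }

    // Steps 3+4: query the max SM clock (MHz) and lock min == max to it.
    unsigned int maxSm = 0;
    if (nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_SM, &maxSm) == NVML_SUCCESS) {
        nvmlDeviceSetGpuLockedClocks(dev, maxSm, maxSm);   // needs root
    }

    printf("power limit %u mW, SM clock locked to %u MHz\n", maxLimit, maxSm);
    nvmlShutdown();
    return 0;
}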


Thanks for your reply.

Currently, I just check the GPU device idle time and run a small kernel to keep the performance up.
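
Roughly, my keep-alive idea looks like this sketch (the kernel, the 10 s threshold, and the names are just illustrative):

// Rough sketch of the keep-alive idea: if the GPU has been idle too long,
// launch a trivial kernel so the device does not fall back to low clocks.
// The 10 s threshold and all names here are illustrative.
#include <chrono>
#include <cuda_runtime.h>

__global__ void keepAliveKernel() {}

void maybeKeepAlive(std::chrono::steady_clock::time_point& lastGpuUse)
{
    auto idle = std::chrono::steady_clock::now() - lastGpuUse;
    if (idle > std::chrono::seconds(10)) {
        keepAliveKernel<<<1, 1>>>();     // tiny "do nothing" launch
        cudaDeviceSynchronize();
        lastGpuUse = std::chrono::steady_clock::now();
    }
}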

I will check my device setup with your advice!

And I think that this problem happens not only when the thread sleeps, but also when the thread does not use the GPU for a long time (e.g. a long stretch of CPU-only work before trying to use the GPU again).

That is also going to trigger a similar clock-reduction management scheme.

It should also be amenable to the “clock-locking” that I indicated.


I set my device to “clock-locking” (Max Power 280 → 320, max clock locked to 2100,2100).

Performance is slightly better but still too slow (improved by about 1 ms).

But the profile still shows the ioctl blocking.

Maybe I have to study the OS ioctl mechanism.

Thanks for your help!

ioctl (I/O control) is typically just a way of interfacing with device drivers for various hardware. You would need to figure out which device the ioctl calls refer to, what operations are requested from the device via ioctl, and what exactly the device driver does in response to those requests. If the driver software is open source, that is doable. If the driver is closed source, it will likely remain a mystery unless you invest much time into reverse engineering.

Over time ioctl has also become a method of communicating with pseudo-devices, such as system monitoring software.
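
To make that concrete, an ioctl call from user space is just a request code plus an argument pointer handed to a driver. Here is a generic illustration (deliberately unrelated to the NVIDIA driver) using the well-known TIOCGWINSZ request against the terminal driver:

// Generic illustration of the ioctl mechanism: ask the terminal driver
// for the window size via TIOCGWINSZ. The GPU driver uses the same
// syscall, just with its own private request codes and structures.
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    struct winsize ws;
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == 0)
        printf("terminal: %u rows x %u cols\n",
               (unsigned)ws.ws_row, (unsigned)ws.ws_col);
    else
        perror("ioctl");
    return 0;
}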
