CUDA slow performance after process sleep/wait

I’m currently writing code for a system that has multiple processes, some of them CPU- and GPU-intensive. I noticed that the performance I get from CUDA kernel functions at runtime is noticeably worse (tens of percent slower) than what I measure when I run those functions by themselves. One of the causes I found is that calls to functions such as “std::this_thread::sleep_for” and “condition_variable::wait” cause later runs of CUDA kernel functions to be much slower.

I created a code snippet for example:

CMakeLists.txt

cmake_minimum_required(VERSION 3.22)
project(CudaSlow)

set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
enable_language(CUDA)
find_package(Threads REQUIRED)
find_package(CUDA 10.2 REQUIRED)

set(CMAKE_CXX_STANDARD 17)

add_executable(CudaSlow main.cu)
target_include_directories(CudaSlow PUBLIC
        ${CUDA_INCLUDE_DIRS}
        )

target_link_libraries(CudaSlow PUBLIC
        Threads::Threads
        ${CUDA_LIBRARIES}
        )

main.cu

#include <iostream>
#include <cuda_device_runtime_api.h>
#include <cuda_runtime.h>
#include <chrono>
#include <thread>

// Element-wise multiply: a single block of threads strides across the array
__global__ void Func(float* A, float* B, float* C, int N) {
    for (int j = threadIdx.x; j < N; j+= blockDim.x) {
        C[j] = A[j] * B[j];
    }
}


int main() {
    // example function variables
    int N = 100000; // number of elements per array
    float* A;
    float* B;
    float* C;
    cudaMalloc(&A, N * sizeof(float));
    cudaMalloc(&B, N * sizeof(float));
    cudaMalloc(&C, N * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    auto iterations = 2000;
    // Run multiple iterations without std::this_thread::sleep_for
    {
        double totalTime = 0;
        for (int i = 0; i < iterations; ++i) {
            // steady_clock is monotonic, which is what we want for interval timing
            auto start = std::chrono::steady_clock::now();
            Func<<<1, 512, 0, stream>>>(A, B, C, N);
            cudaStreamSynchronize(stream);
            totalTime += std::chrono::duration_cast<std::chrono::duration<double>>(
                    std::chrono::steady_clock::now() - start).count();
        }
        std::cout << "Average time without std::this_thread::sleep_for is " << totalTime / (float) iterations << std::endl;
    }
    // Run multiple iterations with std::this_thread::sleep_for
    {
        double totalTime = 0;
        for (int i = 0; i < iterations; ++i) {
            auto start = std::chrono::steady_clock::now();
            Func<<<1, 512, 0, stream>>>(A, B, C, N);
            cudaStreamSynchronize(stream);
            totalTime += std::chrono::duration_cast<std::chrono::duration<double>>(
                    std::chrono::steady_clock::now() - start).count();
            // call to sleep that is outside the time calculation and after synchronizing the stream
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        std::cout << "Average time with std::this_thread::sleep_for is " << totalTime / (float) iterations << std::endl;
    }
    // clean up the stream and device allocations
    cudaStreamDestroy(stream);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}

The output when I run this program on a GeForce RTX 3090 with CUDA 11.4:

Average time without std::this_thread::sleep_for is 7.91676e-05
Average time with std::this_thread::sleep_for is 0.000129487

With these tiny kernels that do almost nothing, you’re mostly measuring launch overhead. I suspect the difference here would be lost in the noise if your kernels executed for a few milliseconds. For example, after wrapping your kernel body in a loop that repeats the work 100 times, I get results like this:

Average time without std::this_thread::sleep_for is 0.00415706
Average time with std::this_thread::sleep_for is 0.00416839

Now the difference is less than 1%.
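
In case it’s useful, here is roughly the modification I mean (a minimal sketch; the repeat count of 100 is just an arbitrary number large enough to push the kernel into the millisecond range):

__global__ void Func(float* A, float* B, float* C, int N) {
    // Repeat the same work many times so the kernel runs for milliseconds
    // instead of microseconds; 100 repetitions is an arbitrary choice.
    for (int r = 0; r < 100; ++r) {
        for (int j = threadIdx.x; j < N; j += blockDim.x) {
            C[j] = A[j] * B[j];
        }
    }
}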

So we are talking about a relatively small fixed cost (on the order of the kernel launch overhead) that appears when you put a thread to sleep and then launch a kernel after the thread wakes up.

Since CUDA has a lazy initialization process, it wouldn’t surprise me if there were some additional resource-initialization work needed to make the CUDA runtime usable again after a thread goes to sleep and wakes up.
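
If you want to probe that speculation, one experiment you could try (my own sketch, reusing the variables from your snippet; I haven’t measured this) is to time two back-to-back launches after each sleep. If only the first launch after waking is slow, that points at a per-wakeup cost in the launch path rather than in the kernel itself:

    {
        double firstTotal = 0, secondTotal = 0;
        for (int i = 0; i < iterations; ++i) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            // first launch after the thread wakes up
            auto t0 = std::chrono::steady_clock::now();
            Func<<<1, 512, 0, stream>>>(A, B, C, N);
            cudaStreamSynchronize(stream);
            auto t1 = std::chrono::steady_clock::now();
            // second launch, with no sleep in between
            Func<<<1, 512, 0, stream>>>(A, B, C, N);
            cudaStreamSynchronize(stream);
            auto t2 = std::chrono::steady_clock::now();
            firstTotal += std::chrono::duration<double>(t1 - t0).count();
            secondTotal += std::chrono::duration<double>(t2 - t1).count();
        }
        std::cout << "First launch after wake:  " << firstTotal / iterations << std::endl;
        std::cout << "Second launch (no sleep): " << secondTotal / iterations << std::endl;
    }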

I have no idea what is actually happening; that is just idle speculation. But good general advice for CUDA is to avoid launching kernels that run for only a few microseconds or a few tens of microseconds. If you do, then even independent of your observation here, the cost to launch a kernel becomes a significant part of your overall workflow, and you are using the GPU inefficiently.
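
For what it’s worth, the usual way to amortize launch overhead is to give each launch enough work: launch many blocks and use a grid-stride loop so that a single kernel call covers a large array. A generic sketch (the kernel name, launch configuration, and array size here are illustrative, not specific to your application):

// Grid-stride version: many blocks, each thread strides across the array,
// so one launch processes enough data to dwarf the launch overhead.
__global__ void Multiply(const float* A, const float* B, float* C, int N) {
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < N;
         j += gridDim.x * blockDim.x) {
        C[j] = A[j] * B[j];
    }
}

// e.g. Multiply<<<256, 256, 0, stream>>>(A, B, C, N); with N large enough
// that the kernel runs for milliseconds rather than microseconds.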

You can file a bug if you wish, but my guess is that:

  1. this is ultimately expected behavior
  2. it’s unlikely that significant resources would be applied to trying to improve this situation, because you would still be using the GPU inefficiently even if this were “fixed”.