CUDA slow performance after process sleep/wait

I’m currently writing code for a system that has multiple processes, some of them CPU- and GPU-intensive. I noticed that the performance I get from CUDA kernel functions at runtime is noticeably worse (tens of percent slower) than what I measure when I run those functions by themselves. One of the causes I found is that calls to functions such as “std::this_thread::sleep_for” and “condition_variable::wait” cause later runs of CUDA kernel functions to be much slower.

I created a code snippet for example:

CMakeLists.txt

cmake_minimum_required(VERSION 3.22)
project(CudaSlow)

set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
enable_language(CUDA)
find_package(Threads REQUIRED)
find_package(CUDA 10.2 REQUIRED)

set(CMAKE_CXX_STANDARD 17)

add_executable(CudaSlow main.cu)
target_include_directories(CudaSlow PUBLIC
        ${CUDA_INCLUDE_DIRS}
        )

target_link_libraries(CudaSlow PUBLIC
        Threads::Threads
        ${CUDA_LIBRARIES}
        )

main.cu

#include <iostream>
#include <cuda_device_runtime_api.h>
#include <cuda_runtime.h>
#include <chrono>
#include <thread>

// Element-wise multiply: a single block of threads strides across the array
__global__ void Func(float* A, float* B, float* C, int N) {
    for (int j = threadIdx.x; j < N; j+= blockDim.x) {
        C[j] = A[j] * B[j];
    }
}


int main() {
    // example function variables
    int N = 100000; // number of elements per array
    float* A;
    float* B;
    float* C;
    cudaMalloc(&A, N * sizeof(float));
    cudaMalloc(&B, N * sizeof(float));
    cudaMalloc(&C, N * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    auto iterations = 2000;
    // Run multiple iterations without std::this_thread::sleep_for
    {
        double totalTime = 0;
        for (int i = 0; i < iterations; ++i) {
            // steady_clock is monotonic, which is what we want for interval timing
            auto start = std::chrono::steady_clock::now();
            Func<<<1, 512, 0, stream>>>(A, B, C, N);
            cudaStreamSynchronize(stream);
            totalTime += std::chrono::duration_cast<std::chrono::duration<double>>(
                    std::chrono::steady_clock::now() - start).count();
        }
        std::cout << "Average time without std::this_thread::sleep_for is " << totalTime / (float) iterations << std::endl;
    }
    // Run multiple iterations with std::this_thread::sleep_for
    {
        double totalTime = 0;
        for (int i = 0; i < iterations; ++i) {
            auto start = std::chrono::steady_clock::now();
            Func<<<1, 512, 0, stream>>>(A, B, C, N);
            cudaStreamSynchronize(stream);
            totalTime += std::chrono::duration_cast<std::chrono::duration<double>>(
                    std::chrono::steady_clock::now() - start).count();
            // call to sleep that is outside the time calculation and after synchronizing the stream
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        std::cout << "Average time with std::this_thread::sleep_for is " << totalTime / (float) iterations << std::endl;
    }
    // clean up the stream and device allocations
    cudaStreamDestroy(stream);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}

The output when I run this program on a GeForce RTX 3090 with CUDA 11.4:

Average time without std::this_thread::sleep_for is 7.91676e-05
Average time with std::this_thread::sleep_for is 0.000129487

With these tiny kernels that do almost nothing, you’re mostly measuring launch overhead. I suspect the difference here would be lost in the noise if your kernels executed for a few milliseconds. For example, after wrapping your kernel body in a loop that repeats the work 100 times, I get results like this:

Average time without std::this_thread::sleep_for is 0.00415706
Average time with std::this_thread::sleep_for is 0.00416839

Now the difference is less than 1%.
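
In case it’s useful, here is roughly the modification I mean (a minimal sketch; the repeat count of 100 is just an arbitrary number large enough to push the kernel into the millisecond range):

__global__ void Func(float* A, float* B, float* C, int N) {
    // Repeat the same work many times so the kernel runs for milliseconds
    // instead of microseconds; 100 repetitions is an arbitrary choice.
    for (int r = 0; r < 100; ++r) {
        for (int j = threadIdx.x; j < N; j += blockDim.x) {
            C[j] = A[j] * B[j];
        }
    }
}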

So we are talking about a relatively small fixed cost (on the order of the kernel launch overhead) that appears when you put a thread to sleep and then launch a kernel after the thread wakes up.

Since CUDA has a lazy initialization process, it wouldn’t surprise me if there were some additional resource-initialization work needed to make the CUDA runtime usable again after a thread goes to sleep and wakes up.
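
If you want to probe that speculation, one experiment you could try (my own sketch, reusing the variables from your snippet; I haven’t measured this) is to time two back-to-back launches after each sleep. If only the first launch after waking is slow, that points at a per-wakeup cost in the launch path rather than in the kernel itself:

    {
        double firstTotal = 0, secondTotal = 0;
        for (int i = 0; i < iterations; ++i) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            // first launch after the thread wakes up
            auto t0 = std::chrono::steady_clock::now();
            Func<<<1, 512, 0, stream>>>(A, B, C, N);
            cudaStreamSynchronize(stream);
            auto t1 = std::chrono::steady_clock::now();
            // second launch, with no sleep in between
            Func<<<1, 512, 0, stream>>>(A, B, C, N);
            cudaStreamSynchronize(stream);
            auto t2 = std::chrono::steady_clock::now();
            firstTotal += std::chrono::duration<double>(t1 - t0).count();
            secondTotal += std::chrono::duration<double>(t2 - t1).count();
        }
        std::cout << "First launch after wake:  " << firstTotal / iterations << std::endl;
        std::cout << "Second launch (no sleep): " << secondTotal / iterations << std::endl;
    }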

I have no idea what is actually happening; that is just idle speculation. But good general advice for CUDA is to avoid launching kernels that run for only a few microseconds or a few tens of microseconds. If you do, then even independent of your observation here, the cost to launch a kernel becomes a significant part of your overall workflow, and you are using the GPU inefficiently.
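
For what it’s worth, the usual way to amortize launch overhead is to give each launch enough work: launch many blocks and use a grid-stride loop so that a single kernel call covers a large array. A generic sketch (the kernel name, launch configuration, and array size here are illustrative, not specific to your application):

// Grid-stride version: many blocks, each thread strides across the array,
// so one launch processes enough data to dwarf the launch overhead.
__global__ void Multiply(const float* A, const float* B, float* C, int N) {
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < N;
         j += gridDim.x * blockDim.x) {
        C[j] = A[j] * B[j];
    }
}

// e.g. Multiply<<<256, 256, 0, stream>>>(A, B, C, N); with N large enough
// that the kernel runs for milliseconds rather than microseconds.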

You can file a bug if you wish, but my guess is that:

  1. this is ultimately expected behavior
  2. it’s unlikely that significant resources would be applied to trying to improve this situation, because you would still be using the GPU inefficiently even if this were “fixed”.