CUDA streams are blocked when launching lots of kernels (>1000)

I found that a CUDA stream blocks when I launch a large number of kernels (more than 1000). Is there any configuration I can change?

In my experiments, I launch a small kernel 10000 times. Each kernel runs for a short time (about 190 us). The first 1000 kernels launch very fast, about 4~5 us per launch, but after that the launch path becomes slow: it takes about 190 us to launch each new kernel. The CUDA stream seems to wait for an earlier kernel to complete before accepting a new launch, and the queue depth appears to be about 1000 kernels. When I created 3 streams, each stream could launch 1000 kernels asynchronously. I want to make this buffer bigger. I tried setting cudaLimitDevRuntimePendingLaunchCount, but it did not work. Is there any way?

#include <stdio.h>
#include "cuda_runtime.h"

#define CUDACHECK(cmd) do {                         \
    cudaError_t e = cmd;                              \
    if( e != cudaSuccess ) {                          \
        printf("Failed: Cuda error %s:%d '%s'\n",             \
        __FILE__,__LINE__,cudaGetErrorString(e));   \
        exit(EXIT_FAILURE);                             \
    }                                                 \
} while(0)

// a dummy kernel for testing
__global__ void add(float *a, int n) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = 0; i < n; i++) {
        a[id] = sqrtf(a[id] + 1.0f);  // single-precision sqrtf for float data
    }
}

int main(int argc, char* argv[])
{

    // managing 1 device
    int nDev = 1;
    int nStream = 1;
    int size = 32*1024*1024;

    // allocating and initializing device buffers
    float** buffer = (float**)malloc(nDev * sizeof(float*));
    cudaStream_t* s = (cudaStream_t*)malloc(sizeof(cudaStream_t)*nDev*nStream);

    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        //CUDACHECK(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 10000));
        CUDACHECK(cudaMalloc(buffer + i, size * sizeof(float)));
        CUDACHECK(cudaMemset(buffer[i], 1, size * sizeof(float)));  // sets each byte to 1, not each float; fine for a dummy test
        for (int j = 0; j < nStream; j++)
            CUDACHECK(cudaStreamCreate(s + i*nStream + j));
    }

    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        for (int j=0; j<10000; j++) {
            for (int k = 0; k < nStream; k++)
                add<<<32, 1024, 0, s[i*nStream+k]>>>(buffer[i], 1000);
        }
    }

    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        CUDACHECK(cudaDeviceSynchronize());
    }

    // free device buffers
    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(i));
        CUDACHECK(cudaFree(buffer[i]));
    }

    printf("Success \n");
    return 0;
}

Here are the nvprof results:

When I create 3 streams, the first 3000 kernels launch quickly and then launches become slow.

When I create 1 stream, the first 1000 kernels launch quickly and then launches become slow.
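The same knee can also be seen without nvprof by timing each launch on the host. Below is a minimal sketch (assuming C++11 std::chrono; the <<<32, 1024>>> geometry and n=1000 mirror the program above) that prints the per-launch time, which should jump from a few microseconds to roughly the kernel duration once the queue fills:

#include <chrono>
#include <cstdio>
#include "cuda_runtime.h"

__global__ void add(float *a, int n) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = 0; i < n; i++)
        a[id] = sqrtf(a[id] + 1.0f);
}

int main() {
    float *buf;
    cudaMalloc(&buf, 32 * 1024 * sizeof(float));  // one float per thread of a <<<32, 1024>>> launch
    cudaStream_t s;
    cudaStreamCreate(&s);

    for (int j = 0; j < 10000; j++) {
        auto t0 = std::chrono::steady_clock::now();
        add<<<32, 1024, 0, s>>>(buf, 1000);
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (j % 500 == 0)
            printf("launch %5d took %.1f us\n", j, us);
    }
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}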

I also created a Stack Overflow question: https://stackoverflow.com/questions/53970187/cuda-stream-are-blocked-when-luanch-lots-of-kernels-1000

What you are seeing is expected behavior. The GPU command queue has a finite length, but it is pretty generous in size. What are you hoping to achieve at the application level by queuing up even more launches than you are doing now? Based on your description, your application’s throughput is already hopelessly bottlenecked on the GPU: the host is submitting new kernels at roughly 40x the rate at which the GPU can process them.
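To make that 40x figure concrete with the numbers from the post: one launch every ~4.7 us is a submission rate of ~210,000 kernels per second, while one ~190 us kernel completing at a time is a processing rate of ~5,300 kernels per second, and 190 / 4.7 ≈ 40.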

I was always under the impression that there is a command buffer size limit baked into the driver (which may change with the driver version). So I can’t help you with the configuration aspect, as I have never noticed or explored it. What status does the call to cudaDeviceSetLimit return?

cudaLimitDevRuntimePendingLaunchCount pertains to device-side kernel launches (dynamic parallelism), which are not in view in OP’s code. I’m not surprised it has no effect.
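For contrast, here is a minimal sketch of the case that limit does govern: a parent kernel launching child kernels from device code (compiled with nvcc -rdc=true, linking cudadevrt on older toolkits). The limit bounds how many device-side launches can be pending, not the host-side queue:

#include "cuda_runtime.h"

__global__ void child(float *a) {
    a[threadIdx.x] += 1.0f;
}

// the parent enqueues child kernels from the device; these device-side
// launches are what cudaLimitDevRuntimePendingLaunchCount bounds
__global__ void parent(float *a, int n) {
    for (int i = 0; i < n; i++)
        child<<<1, 32>>>(a);
}

int main() {
    float *buf;
    cudaMalloc(&buf, 32 * sizeof(float));
    // raises the buffer for pending *device-side* launches only;
    // launches issued from the host are unaffected
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 10000);
    parent<<<1, 1>>>(buf, 5000);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}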

I’m not aware of any direct/explicit user configuration control over the host launch queue depth.
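One application-level workaround (not a configuration knob, just explicit throttling) is to record an event every N launches and, before recording a new one, wait on the event from a few batches earlier; the host then blocks at a point of your choosing instead of stalling unpredictably inside a launch call. A sketch reusing add, buffer, and the stream array from the program above, with BATCH and DEPTH chosen arbitrarily:

// cap in-flight launches at roughly DEPTH * BATCH by waiting on the
// event recorded DEPTH batches earlier before issuing more work
const int BATCH = 100, DEPTH = 4;
cudaEvent_t ev[DEPTH];
for (int i = 0; i < DEPTH; i++)
    CUDACHECK(cudaEventCreateWithFlags(&ev[i], cudaEventDisableTiming));

for (int j = 0; j < 10000; j++) {
    add<<<32, 1024, 0, s[0]>>>(buffer[0], 1000);
    if (j % BATCH == BATCH - 1) {
        int slot = (j / BATCH) % DEPTH;
        if (j / BATCH >= DEPTH)
            CUDACHECK(cudaEventSynchronize(ev[slot]));  // wait for the batch DEPTH*BATCH launches back
        CUDACHECK(cudaEventRecord(ev[slot], s[0]));
    }
}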

Thanks for the reply. I filed a feature request on the bug report page.