Fewer concurrent kernels with Hardware Accelerated GPU Scheduling (HAGS)

I program in C++ using CUDA 12.2, Windows 10, and Visual Studio.

I wrote a program to explore CUDA kernel concurrency.

It launches long-running kernels (100 ms) into 48 different streams, and the kernels record the maximum number of concurrent kernels they observe.

I added code to explore the effects of recording and querying events, querying streams, etc.
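The full test kernel isn't shown here, but as context, a spin kernel that tracks the maximum concurrency it observes could look roughly like this. The global counters, the `clock64()`-based spin, and the ~1 GHz clock assumption are mine, not the original program's code:

```cuda
#include <cstdint>
#include <cuda_runtime.h>

__device__ int d_N_Concurrent    = 0;   // kernels currently running
__device__ int d_N_ConcurrentMax = 0;   // high-water mark across all launches

// Spin for roughly 100 ms and record how many kernels overlap.
// (Sketch only: counter names and the spin length are assumptions.)
__global__ void k_test_spin( uint32_t /*streamIdx*/ )
{
  if ( threadIdx.x == 0 ) {
    int n = atomicAdd( &d_N_Concurrent, 1 ) + 1;  // enter: bump active count
    atomicMax( &d_N_ConcurrentMax, n );           // record the peak
  }
  __syncthreads();

  long long start = clock64();
  while ( clock64() - start < 100LL * 1000 * 1000 ) { }  // ~100 ms at ~1 GHz SM clock

  __syncthreads();
  if ( threadIdx.x == 0 ) {
    atomicSub( &d_N_Concurrent, 1 );              // exit: drop active count
  }
}
```

After `cudaDeviceSynchronize()`, the host can read `d_N_ConcurrentMax` back with `cudaMemcpyFromSymbol` to print the result.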

With WDDM, my program reports a maximum of 32 concurrent kernels, which is to be expected for my GPU, a GTX 1070 Ti.

When I add event recording and queries, the concurrent kernel maximum drops to 1.
I understand this is to be expected with WDDM.

When I turn on HAGS and reboot the PC, I get:

  • Good news: querying events and streams no longer affects the maximum number of concurrent kernels.
  • Bad news: The maximum number of concurrent kernels drops to 8.

Is there an explanation for the reduced number of concurrent kernels when using HAGS?
Can anything be done about it?

My host CPU has 4 cores. 32/4 = 8. Coincidence?

The comments in the code below summarize the test results.

I can post the entire program (1 file) if anyone is interested.


  • The GPU is used only for data processing with CUDA. There is no display attached.
  • I tried changing my program’s “Graphics Performance Preference” to “High Performance”: No effect on the number of concurrent kernels.
    uint32_t  N_BlocksPerKernel  =    1;
    uint32_t  N_ThreadsPerBlock  =   64;
    for ( uint32_t i  =  0; i < nStreams; ++ i ) {  // e.g., nStreams = 48
                                                                   // vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv----- effects of adding the code
                                                                   //                          WDDM    HAGS
                                                                   //                          ====    ====
//    if (    i >=  5 ) {
//      cudaStreamWaitEvent( Streams[ i ],  Events[ i-5 ] );       // --> N_ConcurrentMax  ==     5       5    (include cudaEventRecord() code, below)
//    }                                                            //

      k_test_spin<<< N_BlocksPerKernel, N_ThreadsPerBlock,         // --> N_ConcurrentMax  ==    32       8
      0, Streams[ i ] >>> ( i );                                   //
      cudaEventRecord(        Events[ i ], Streams[ i   ] );       // --> N_ConcurrentMax  ==    32       8
//    cudaEventQuery(         Events[ i ]                 );       // --> N_ConcurrentMax  ==     1       8    (include cudaEventRecord() code, above)
//    cudaStreamQuery(       Streams[ i ]                 );       // --> N_ConcurrentMax  ==     1       8
//    cudaStreamWaitEvent(   Streams[ i ],  Events[ i   ] );       // --> N_ConcurrentMax  ==    32       8    (include cudaEventRecord() code, above)

      ce = cudaGetLastError();
      if ( ce != cudaSuccess ) {
        std::cout << "Launch failed: " << cudaGetErrorString( ce );
        return 1;
      }
    }                                                              // end of loop over streams

    ce = cudaDeviceSynchronize();
    if ( ce != cudaSuccess ) ...

  // code below prints out N_ConcurrentMax, as determined by the kernels

By default, the CUDA driver creates 8 connections (hardware work queues) to the GPU. Creating more than 8 compute streams causes multiple streams to interleave on the same connection, which can result in stalls and false dependencies that break concurrency. The cudaEventRecord issued after each grid launch also breaks concurrency, because the front end cannot process the cudaEventRecord until k_test_spin on the same stream completes.


  1. Set the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 32 to raise the connection count to its maximum. This uses more resources.
  2. If you apply recommendation 1 and reduce the number of streams to 32, or remove the cudaEventRecord, concurrency is likely to improve. cudaEvents are released at the front end; adding one to a stream stalls the connection until all prior work in that stream has completed.
  3. Use the NVIDIA trace tools to investigate concurrency. The trace tools use more advanced techniques to trace grids, and these techniques do not break concurrency.
  4. For this simple test case it would be easy to add inline PTX to read the special register %globaltimer, which returns an 8-byte timestamp in nanoseconds since 01/01/1970 on most platforms. If you are not running an NVIDIA trace tool, or are not on GH100 or newer, the resolution is in microseconds. In the test kernel you could read it at the start and the end of the kernel and write the values to memory to draw a timeline of grid execution.
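A minimal sketch of recommendation 4: read %globaltimer via inline PTX at kernel entry and exit and write both timestamps out. The output-array parameters are my addition for illustration:

```cuda
#include <cstdint>

// Read the 8-byte %globaltimer special register (nanoseconds).
__device__ __forceinline__ uint64_t globaltimer_ns()
{
  uint64_t t;
  asm volatile( "mov.u64 %0, %%globaltimer;" : "=l"( t ) );
  return t;
}

// Sketch: parameter names are illustrative, not the original signature.
__global__ void k_test_spin( uint32_t idx, uint64_t* startNs, uint64_t* endNs )
{
  uint64_t t0 = globaltimer_ns();

  // ... ~100 ms of spinning, concurrency counting, etc. ...

  uint64_t t1 = globaltimer_ns();
  if ( threadIdx.x == 0 ) {
    startNs[ idx ] = t0;   // ns since the epoch on most platforms
    endNs  [ idx ] = t1;
  }
}
```

Plotting each kernel's (start, end) interval on a common axis gives a timeline showing exactly which grids overlapped.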

Thx Greg.

I changed the environment variable as you suggested and it worked.

I was also curious about whether the concurrency of kernels launched dynamically into cudaStreamFireAndForget was limited by CUDA_DEVICE_MAX_CONNECTIONS.

I found that it was not. Even with CUDA_DEVICE_MAX_CONNECTIONS left at the default (8),
I could easily get 32 concurrent kernels by mixing host and dynamic launches.
This makes sense in light of this post, which explains a little more about connections:
How Many Streams?
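For reference, the dynamic launches I mixed in look roughly like this. This is a sketch, not my exact code; it assumes dynamic parallelism is available (compiled with -rdc=true) and the kernel names are illustrative:

```cuda
#include <cstdint>

// Child kernel: spin and count concurrency, as in the host-launched case.
__global__ void k_child( uint32_t i )
{
  // ... ~100 ms spin, concurrency counting, etc. ...
}

// Parent kernel: launch children into cudaStreamFireAndForget, a
// device-side stream handle whose work is not tied to a host connection.
__global__ void k_parent( uint32_t nChildren )
{
  if ( threadIdx.x == 0 && blockIdx.x == 0 ) {
    for ( uint32_t i = 0; i < nChildren; ++i ) {
      k_child<<< 1, 64, 0, cudaStreamFireAndForget >>>( i );
    }
  }
}
```

With this pattern, the children launched from the device ran concurrently with the host-launched kernels even at the default of 8 connections.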

Thx again.

To clarify, in the cases you cited only the resolution changes, not the units?
That is, the value read from the register always has units of ns?

Correct. The unit is always nanoseconds; only the update frequency changes.
