Fewer concurrent kernels with Hardware Accelerated GPU Scheduling (HAGS)

I program in C++ using CUDA 12.2, Windows 10, and Visual Studio.

I wrote a program to explore CUDA kernel concurrency.

It launches long-running kernels (100 ms) into 48 different streams, and the kernels record the maximum number of concurrent kernels they observe.

I added code to explore the effects of recording and querying events, querying streams, etc.

With WDDM, my program reports a maximum of 32 concurrent kernels, which is to be expected for my GPU, a GTX 1070 Ti.

When I add event recording and queries, the concurrent kernel maximum drops to 1.
I understand this is to be expected with WDDM.

When I turn on HAGS and reboot the PC, I get:

  • Good news: querying events and streams no longer affects the maximum number of concurrent kernels.
  • Bad news: The maximum number of concurrent kernels drops to 8.

Is there an explanation for the reduced number of concurrent kernels when using HAGS?
Can anything be done about it?

My host CPU has 4 cores. 32/4 = 8. Coincidence?

The comments in the code below summarize the test results.

I can post the entire program (1 file) if anyone is interested.

PS:

  • The GPU is used only for data processing with CUDA. There is no display attached.
  • I tried changing my program’s “Graphics Performance Preference” to “High Performance”: No effect on the number of concurrent kernels.
    uint32_t  N_BlocksPerKernel  =    1;
    uint32_t  N_ThreadsPerBlock  =   64;
    for ( uint32_t i  =  0; i < nStreams; ++ i ) {  // e.g., nStreams = 48
                                                                   // vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv----- effects of adding the code
                                                                   //                          WDDM    HAGS
                                                                   //                          ====    ====
//    if (    i >=  5 ) {
//      cudaStreamWaitEvent( Streams[ i ],  Events[ i-5 ] );       // --> N_ConcurrentMax  ==     5       5    (include cudaEventRecord() code, below)
//    }                                                            //

      k_test_spin<<< N_BlocksPerKernel, N_ThreadsPerBlock,         // --> N_ConcurrentMax  ==    32       8
      0, Streams[ i ] >>> ( i );                                   //
   
      cudaEventRecord(        Events[ i ], Streams[ i   ] );       // --> N_ConcurrentMax  ==    32       8
//    cudaEventQuery(         Events[ i ]                 );       // --> N_ConcurrentMax  ==     1       8    (include cudaEventRecord() code, above)
//    cudaStreamQuery(       Streams[ i ]                 );       // --> N_ConcurrentMax  ==     1       8
//    cudaStreamWaitEvent(   Streams[ i ],  Events[ i   ] );       // --> N_ConcurrentMax  ==    32       8    (include cudaEventRecord() code, above)

      ce = cudaGetLastError();
      if ( ce != cudaSuccess ) {
        std::cout  << "Launch failed: " << cudaGetErrorString(ce);    return 1;
      }
    }


    ce = cudaDeviceSynchronize();
    if ( ce != cudaSuccess ) ...

  // code below prints out N_ConcurrentMax, as determined by the kernels
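
k_test_spin itself isn't shown above. A minimal sketch of the bookkeeping idea (illustrative names and spin threshold, not my exact code): each kernel bumps a global "currently running" counter on entry, records the peak it sees, spins for roughly 100 ms, then decrements the counter on exit.

    __device__ unsigned int d_N_Concurrent    = 0;   // kernels currently executing
    __device__ unsigned int d_N_ConcurrentMax = 0;   // peak seen by any kernel

    __global__ void k_test_spin( uint32_t kernelId )  // kernelId unused in this sketch
    {
      if ( threadIdx.x == 0  &&  blockIdx.x == 0 ) {
        unsigned int n = atomicAdd( &d_N_Concurrent, 1 ) + 1;
        atomicMax( &d_N_ConcurrentMax, n );
      }

      // spin for roughly 100 ms (threshold assumes a ~1.6 GHz SM clock; adjust for your GPU)
      long long t0 = clock64();
      while ( clock64() - t0 < 160000000LL ) { }

      if ( threadIdx.x == 0  &&  blockIdx.x == 0 ) {
        atomicSub( &d_N_Concurrent, 1 );
      }
    }

    // host side, after cudaDeviceSynchronize():
    //   cudaMemcpyFromSymbol( &N_ConcurrentMax, d_N_ConcurrentMax, sizeof( unsigned int ) );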



By default the CUDA driver creates 8 connections to the GPU. Creating more than 8 compute streams results in multiple streams interleaving on the same connection, which can cause stalls and false dependencies that break concurrency. The cudaEventRecord issued after the grid launch will also break concurrency, because the front end cannot process the event record until k_test_spin on the same stream completes.

RECOMMENDATIONS

  1. Set the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 32 to increase the number of connections to the maximum value. This will use more resources.
  2. If you do recommendation 1 and reduce the streams to 32, or if you remove the cudaEventRecord, you are likely to improve concurrency. cudaEvents are released at the front end; adding one to a stream stalls the connection until all prior work in the stream has completed.
  3. Use NVIDIA trace tools to investigate concurrency. The trace tools use more advanced techniques to trace grids. These techniques do not break concurrency.
  4. For this simple test case it would be easy to add inline PTX to read the special register %globaltimer. %globaltimer returns an 8-byte timestamp in ns since 01/01/1970 on most platforms. If you are not running an NVIDIA trace tool, or on GH100 or newer, the resolution is in microseconds. For the test kernel you could read this at the start and end of the kernel and write the values to memory to draw a timeline of the grid execution; a sketch follows below.
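
For recommendation 4, a minimal sketch of the %globaltimer read and the start/end stamps (the array name and size are only illustrative):

    // read %globaltimer (nanosecond timestamp) via inline PTX
    __device__ __forceinline__ unsigned long long globaltimer_ns()
    {
      unsigned long long t;
      asm volatile( "mov.u64 %0, %%globaltimer;" : "=l"( t ) );
      return t;
    }

    __device__ unsigned long long d_Times[ 2 * 48 ];            // start/end pair per kernel, 48 streams

    __global__ void k_test_spin_timed( uint32_t kernelId )      // illustrative variant of k_test_spin
    {
      if ( threadIdx.x == 0  &&  blockIdx.x == 0 )
        d_Times[ 2 * kernelId + 0 ] = globaltimer_ns();         // grid start

      // ... the existing ~100 ms spin ...

      if ( threadIdx.x == 0  &&  blockIdx.x == 0 )
        d_Times[ 2 * kernelId + 1 ] = globaltimer_ns();         // grid end
    }

Copying d_Times back with cudaMemcpyFromSymbol() after the synchronize gives the start and end of every grid, which is enough to draw the timeline.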

Thx Greg.

I changed the environment variable as you suggested and it worked.
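
For anyone else trying this: the variable has to be set before the first CUDA call in the process, since the driver reads it when the context is created. A sketch of setting it from inside the program rather than in the system environment (MSVC CRT call; not my exact setup):

    #include <cstdlib>           // _putenv_s
    #include <cuda_runtime.h>

    int main()
    {
      // must be in the environment before the CUDA context is created
      _putenv_s( "CUDA_DEVICE_MAX_CONNECTIONS", "32" );

      cudaFree( 0 );             // force context creation with the new setting

      // ... create the streams and launch as before ...
      return 0;
    }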

I was also curious about whether the concurrency of kernels launched dynamically into cudaStreamFireAndForget was limited by CUDA_DEVICE_MAX_CONNECTIONS.

I found that it was not. Even with CUDA_DEVICE_MAX_CONNECTIONS left at the default (8),
I could easily get 32 concurrent kernels by mixing host and dynamic launches.
This makes sense in light of this post that explains a little more about connections:
How Many Streams?
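
For anyone curious, a minimal sketch of the kind of device-side launch I mean (illustrative, not my exact code; dynamic parallelism requires compiling with relocatable device code, -rdc=true):

    __global__ void k_parent( uint32_t nChildren )
    {
      // one thread launches the children; fire-and-forget means no implicit
      // synchronization between parent and children
      if ( threadIdx.x == 0  &&  blockIdx.x == 0 ) {
        for ( uint32_t i = 0; i < nChildren; ++i ) {
          k_test_spin<<< 1, 64, 0, cudaStreamFireAndForget >>>( i );
        }
      }
    }

k_parent goes into one of the host streams; the children launched this way do not appear to be limited by CUDA_DEVICE_MAX_CONNECTIONS.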

Thx again.

To clarify, in the cases you cited only the resolution changes, not the units?
That is, the value read from the register always has units of ns?

Correct. The unit is always nanoseconds; only the update frequency changes.
