I program in C++ with CUDA 12.2 on Windows 10, using Visual Studio.
I wrote a program to explore CUDA kernel concurrency.
It launches long-running kernels (100 ms) into 48 different streams, and the kernels record the maximum number of concurrent kernels they observe.
I added code to explore the effects of recording and querying events, querying streams, etc.
With WDDM, my program reports a maximum of 32 concurrent kernels, which is to be expected for my GPU, a GTX 1070 Ti.
When I add event or stream queries (cudaEventQuery() or cudaStreamQuery()), the concurrent kernel maximum drops to 1; recording events alone does not change it.
I understand this is to be expected with WDDM.
When I turn on HAGS and reboot the PC, I get:
- Good news: querying events and streams no longer affects the maximum number of concurrent kernels.
- Bad news: The maximum number of concurrent kernels drops to 8.
Is there an explanation for the reduced number of concurrent kernels when using HAGS?
Can anything be done about it?
My host CPU has 4 cores. 32/4 = 8. Coincidence?
The comments in the code below summarize the test results.
I can post the entire program (1 file) if anyone is interested.
PS:
- The GPU is used only for data processing with CUDA. There is no display attached.
- I tried changing my program’s “Graphics Performance Preference” to “High Performance”: No effect on the number of concurrent kernels.
uint32_t N_BlocksPerKernel = 1;
uint32_t N_ThreadsPerBlock = 64;
for ( uint32_t i = 0; i < nStreams; ++i ) {    // e.g., nStreams = 48
    // vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv----- effects of adding the code
    //                                                                                    WDDM  HAGS
    //                                                                                    ====  ====
    // if ( i >= 5 ) {
    //     cudaStreamWaitEvent( Streams[ i ], Events[ i-5 ] );  // --> N_ConcurrentMax ==    5     5  (include cudaEventRecord() code, below)
    // }
    k_test_spin<<< N_BlocksPerKernel, N_ThreadsPerBlock,        // --> N_ConcurrentMax ==   32     8
                   0, Streams[ i ] >>>( i );
    cudaEventRecord( Events[ i ], Streams[ i ] );               // --> N_ConcurrentMax ==   32     8
    // cudaEventQuery( Events[ i ] );                           // --> N_ConcurrentMax ==    1     8  (include cudaEventRecord() code, above)
    // cudaStreamQuery( Streams[ i ] );                         // --> N_ConcurrentMax ==    1     8
    // cudaStreamWaitEvent( Streams[ i ], Events[ i ] );        // --> N_ConcurrentMax ==   32     8  (include cudaEventRecord() code, above)
    ce = cudaGetLastError();
    if ( ce != cudaSuccess ) {
        std::cout << "Launch failed: " << cudaGetErrorString(ce);
        return 1;
    }
}
ce = cudaDeviceSynchronize();
if ( ce != cudaSuccess ) ...
// code below prints out N_ConcurrentMax, as determined by the kernels
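
For anyone who wants to reproduce the measurement, here is a minimal sketch of one way kernels could record the maximum concurrency they observe. It is an assumption about the general technique, not the actual k_test_spin from the program described above: the names g_running, g_concurrent_max, k_probe_spin, and probe_max_concurrency are hypothetical, and the spin time is only approximate because it is derived from the device's peak clock rate.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical device-side counters (not taken from the original program).
__device__ int g_running = 0;         // kernels currently executing
__device__ int g_concurrent_max = 0;  // highest value of g_running seen so far

// Hypothetical stand-in for k_test_spin: thread 0 does the bookkeeping
// and busy-waits for roughly spin_cycles clock ticks.
__global__ void k_probe_spin( long long spin_cycles )
{
    if ( threadIdx.x == 0 && blockIdx.x == 0 ) {
        int now = atomicAdd( &g_running, 1 ) + 1;   // this kernel has started
        atomicMax( &g_concurrent_max, now );        // record the peak
        long long start = clock64();
        while ( clock64() - start < spin_cycles ) { }
        atomicSub( &g_running, 1 );                 // this kernel is done
    }
}

// Launches one spinning kernel per stream and returns the peak number of
// kernels that were running at the same time, or -1 on a CUDA error.
int probe_max_concurrency( int nStreams, int spin_ms )
{
    int khz = 0;  // peak clock in kHz, i.e. clock ticks per millisecond
    if ( cudaDeviceGetAttribute( &khz, cudaDevAttrClockRate, 0 ) != cudaSuccess ) return -1;
    long long spin_cycles = (long long)khz * spin_ms;

    // Reset the counters and make sure the reset has reached the device.
    int zero = 0;
    if ( cudaMemcpyToSymbol( g_running, &zero, sizeof(int) ) != cudaSuccess ) return -1;
    if ( cudaMemcpyToSymbol( g_concurrent_max, &zero, sizeof(int) ) != cudaSuccess ) return -1;
    if ( cudaDeviceSynchronize() != cudaSuccess ) return -1;

    std::vector<cudaStream_t> streams;
    bool ok = true;
    for ( int i = 0; i < nStreams && ok; ++i ) {
        cudaStream_t s;
        ok = ( cudaStreamCreate( &s ) == cudaSuccess );
        if ( ok ) streams.push_back( s );
    }
    if ( ok ) {
        for ( cudaStream_t s : streams ) {
            k_probe_spin<<< 1, 64, 0, s >>>( spin_cycles );
        }
        ok = ( cudaGetLastError() == cudaSuccess );
    }
    // Always wait for outstanding work before reading back and cleaning up.
    ok = ( cudaDeviceSynchronize() == cudaSuccess ) && ok;

    int result = -1;
    if ( ok && cudaMemcpyFromSymbol( &result, g_concurrent_max, sizeof(int) ) != cudaSuccess ) {
        result = -1;
    }
    for ( cudaStream_t s : streams ) cudaStreamDestroy( s );
    return result;
}
```

Calling probe_max_concurrency( 48, 100 ) would correspond to the 48-stream, 100 ms case; the value it returns depends on the GPU, the driver model, and the HAGS setting.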