Why are kernel executions in different streams not parallel?

I just learned the stream technique in CUDA and tried it out. However, I get an undesired result: the streams do not run in parallel.

I have a data matrix of size (5, 2048) and a kernel to process the matrix.

My plan is to decompose the data into nStreams = 4 sectors and use 4 streams to execute the kernel in parallel.

Part of my code looks like the following:

int rows = 5;
int cols = 2048;

int blockSize = 32;
int gridSize = (rows*cols) / blockSize;    // 10240 / 32 = 320 blocks for the whole matrix
dim3 block(blockSize);
dim3 grid(gridSize);

int nStreams = 4;    // preparation for streams
cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
for(int ii=0;ii<nStreams;ii++){
    checkCudaErrors(cudaStreamCreate(&streams[ii]));
}

int streamSize = rows * cols / nStreams;   // 2560 elements per stream
dim3 streamGrid = streamSize/blockSize;    // 80 blocks per stream

for(int jj=0;jj<nStreams;jj++){
    int offset = jj * streamSize;
    Mykernel<<<streamGrid,block,0,streams[jj]>>>(&d_Data[offset],streamSize);
}    // d_Data is the matrix on gpu

The Visual Profiler shows that the 4 streams do not run in parallel. Stream 13 is the first to start and stream 16 is the last. There is a gap of 12.378 us between stream 13 and stream 14, and each kernel execution lasts around 5 us. In the 'Runtime API' row above, each launch shows up as 'cudaLaunch'.

Could you give me some advice? Thanks!

(I don’t know how to upload pictures in this forum, so I just describe the result in words.)

(1) GPUs are throughput-optimized architectures, not latency optimized. As long as a kernel can fill the machine with work (fully utilize the execution resources), it will not run concurrently with other kernels, as that is likely to reduce throughput. Your kernel configuration indicates that the kernel is large enough to fill the entire GPU, for all currently shipping GPUs. Note that a block size of 32 is rarely the best choice. The sweet spot for block size is typically 128 to 256 threads per block.
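For example, the same per-stream launch with 256-thread blocks might look like the sketch below. It reuses the variable names from your code (rows, cols, nStreams, streams, d_Data, Mykernel); if streamSize were not a multiple of the block size, the kernel would also need a bounds check.

// Sketch only: same per-stream launches as in the question, but with 256-thread blocks
int blockSize = 256;                                        // 8 warps per block
int streamSize = rows * cols / nStreams;                    // 2560 elements per stream
dim3 block(blockSize);
dim3 streamGrid((streamSize + blockSize - 1) / blockSize);  // round up -> 10 blocks per stream

for (int jj = 0; jj < nStreams; jj++) {
    int offset = jj * streamSize;
    Mykernel<<<streamGrid, block, 0, streams[jj]>>>(&d_Data[offset], streamSize);
}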

(2) Regardless of GPU or host platform, the minimal overhead of launching a kernel is about 5 microseconds. Since that has been the case for a dozen years, it is reasonable to assume that this is due to hardware limitations (in particular, PCIe latencies) and won’t change anytime soon.
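If you want to see that number on your own system, a rough microbenchmark is to time many back-to-back launches of an empty kernel and take the average. The sketch below (emptyKernel is just an invented name for this illustration) prints a per-launch cost that typically lands in the few-microsecond range; the exact figure varies with GPU, driver, and platform.

// Rough launch-overhead estimate: back-to-back launches of an empty kernel
// are dominated by launch cost, so the average time per launch approximates
// the per-launch overhead on this system.
#include <chrono>
#include <cstdio>

__global__ void emptyKernel() { }    // does no work; only the launch cost remains

int main()
{
    const int N = 10000;

    emptyKernel<<<1, 1>>>();         // warm-up launch (context creation, module load)
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        emptyKernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();         // wait until every queued launch has completed
    auto t1 = std::chrono::high_resolution_clock::now();

    double usPerLaunch = std::chrono::duration<double, std::micro>(t1 - t0).count() / N;
    printf("average time per kernel launch: %.2f us\n", usPerLaunch);
    return 0;
}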

(3) Without knowledge of the specific GPU, host system, and OS environment you used for your experiment it does not make much sense to hypothesize about your other observations. Please note that if you are on a Windows system with the default WDDM driver, you will frequently encounter performance artifacts due to launch batching (a technique the NVIDIA driver uses to overcome the massive latency imposed by the WDDM driver model). This effect may be worse if you use Windows 10, which uses the WDDM 2.x driver model.

Sorry, I forgot to provide additional information. My GPU is a Tesla M6 and the OS is Red Hat Enterprise Linux 7.

The answer on your cross-posting is on-target:

c++ - Why kernel executions in different streams are not parallel? - Stack Overflow

CUDA streams do not guarantee concurrent execution. They simply provide the possibility for it. If you fail to meet other requirements of concurrent execution, you will not see concurrency. For example, if one of your kernels “occupies” the GPU, there is no “room” for anything else to run on the GPU.

Essentially what njuffa referred to in his point #1.
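To make the overlap visible, each launch has to be both small enough to leave room on the GPU and long enough to still be running when the next one starts. A minimal sketch of that idea (spinKernel and its arguments are invented for illustration and assume d_Data is a float array; this is not the code from the thread):

// Each launch is deliberately tiny (one block) but long-running, so several
// launches can coexist on the GPU and the overlap becomes visible in the profiler.
__global__ void spinKernel(float *data, int n, int iters)
{
    int idx = threadIdx.x;
    if (idx < n) {
        float v = data[idx];
        for (int i = 0; i < iters; i++) {   // artificial work to stretch the runtime
            v = v * 0.999f + 0.001f;
        }
        data[idx] = v;
    }
}

// one small block per stream: far too little work to fill the GPU,
// so there is "room" for the four launches to run concurrently
for (int jj = 0; jj < nStreams; jj++) {
    int offset = jj * streamSize;
    spinKernel<<<1, 256, 0, streams[jj]>>>(&d_Data[offset], streamSize, 1000000);
}

With one block per launch, the four kernels should overlap in the profiler timeline on any GPU that supports concurrent kernel execution.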

Thanks Robert!