Why are kernel executions in different streams not parallel?

I just learned the stream technique in CUDA and tried it out. However, I get an undesired result: the streams do not run in parallel.

I have a data matrix of size (5, 2048) and a kernel to process the matrix.

My plan is to decompose the data into nStreams = 4 sectors and use 4 streams to execute the kernel in parallel.

Part of my code looks like the following:

int rows = 5;
int cols = 2048;

int blockSize = 32;
int gridSize = (rows*cols) / blockSize;    // 10240 / 32 = 320 blocks for the whole matrix
dim3 block(blockSize);
dim3 grid(gridSize);

int nStreams = 4;    // preparation for streams
cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
for(int ii=0;ii<nStreams;ii++){
    checkCudaErrors(cudaStreamCreate(&streams[ii]));
}

int streamSize = rows * cols / nStreams;   // 2560 elements per stream
dim3 streamGrid = streamSize/blockSize;    // 80 blocks per stream

for(int jj=0;jj<nStreams;jj++){
    int offset = jj * streamSize;
    Mykernel<<<streamGrid,block,0,streams[jj]>>>(&d_Data[offset],streamSize);
}    // d_Data is the matrix on gpu

The Visual Profiler shows that the 4 streams do not run in parallel. Stream 13 is the first to start and stream 16 is the last. There is a gap of 12.378 us between stream 13 and stream 14, and each kernel execution lasts around 5 us. In the 'Runtime API' row above, each launch shows up as 'cudaLaunch'.

Could you give me some advice? Thanks!

(I don’t know how to upload pictures in this forum, so I just describe the result in words.)

(1) GPUs are throughput-optimized architectures, not latency optimized. As long as a kernel can fill the machine with work (fully utilize the execution resources), it will not run concurrently with other kernels, as that is likely to reduce throughput. Your kernel configuration indicates that the kernel is large enough to fill the entire GPU, for all currently shipping GPUs. Note that a block size of 32 is rarely the best choice. The sweet spot for block size is typically 128 to 256 threads per block.
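For example, the same per-stream launch with 256-thread blocks might look like the sketch below. It reuses the variable names from your code (rows, cols, nStreams, streams, d_Data, Mykernel); if streamSize were not a multiple of the block size, the kernel would also need a bounds check.

// Sketch only: same per-stream launches as in the question, but with 256-thread blocks
int blockSize = 256;                                        // 8 warps per block
int streamSize = rows * cols / nStreams;                    // 2560 elements per stream
dim3 block(blockSize);
dim3 streamGrid((streamSize + blockSize - 1) / blockSize);  // round up -> 10 blocks per stream

for (int jj = 0; jj < nStreams; jj++) {
    int offset = jj * streamSize;
    Mykernel<<<streamGrid, block, 0, streams[jj]>>>(&d_Data[offset], streamSize);
}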

(2) Regardless of GPU or host platform, the minimal overhead of launching a kernel is about 5 microseconds. Since that has been the case for a dozen years, it is reasonable to assume that this is due to hardware limitations (in particular, PCIe latencies) and won’t change anytime soon.
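If you want to see that number on your own system, a rough microbenchmark is to time many back-to-back launches of an empty kernel and take the average. The sketch below (emptyKernel is just an invented name for this illustration) prints a per-launch cost that typically lands in the few-microsecond range; the exact figure varies with GPU, driver, and platform.

// Rough launch-overhead estimate: back-to-back launches of an empty kernel
// are dominated by launch cost, so the average time per launch approximates
// the per-launch overhead on this system.
#include <chrono>
#include <cstdio>

__global__ void emptyKernel() { }    // does no work; only the launch cost remains

int main()
{
    const int N = 10000;

    emptyKernel<<<1, 1>>>();         // warm-up launch (context creation, module load)
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        emptyKernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();         // wait until every queued launch has completed
    auto t1 = std::chrono::high_resolution_clock::now();

    double usPerLaunch = std::chrono::duration<double, std::micro>(t1 - t0).count() / N;
    printf("average time per kernel launch: %.2f us\n", usPerLaunch);
    return 0;
}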

(3) Without knowledge of the specific GPU, host system, and OS environment you used for your experiment it does not make much sense to hypothesize about your other observations. Please note that if you are on a Windows system with the default WDDM driver, you will frequently encounter performance artifacts due to launch batching (a technique the NVIDIA driver uses to overcome the massive latency imposed by the WDDM driver model). This effect may be worse if you use Windows 10, which uses the WDDM 2.x driver model.

Sorry, I forgot to provide additional information. My GPU is a Tesla M6 and the OS is Red Hat Enterprise Linux 7.

The answer on your cross-posting is on-target:

c++ - Why kernel executions in different streams are not parallel? - Stack Overflow

CUDA streams do not guarantee concurrent execution. They simply provide the possibility for it. If you fail to meet other requirements of concurrent execution, you will not see concurrency. For example, if one of your kernels “occupies” the GPU, there is no “room” for anything else to run on the GPU.

Essentially what njuffa referred to in his point #1.
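To make the overlap visible, each launch has to be both small enough to leave room on the GPU and long enough to still be running when the next one starts. A minimal sketch of that idea (spinKernel and its arguments are invented for illustration and assume d_Data is a float array; this is not the code from the thread):

// Each launch is deliberately tiny (one block) but long-running, so several
// launches can coexist on the GPU and the overlap becomes visible in the profiler.
__global__ void spinKernel(float *data, int n, int iters)
{
    int idx = threadIdx.x;
    if (idx < n) {
        float v = data[idx];
        for (int i = 0; i < iters; i++) {   // artificial work to stretch the runtime
            v = v * 0.999f + 0.001f;
        }
        data[idx] = v;
    }
}

// one small block per stream: far too little work to fill the GPU,
// so there is "room" for the four launches to run concurrently
for (int jj = 0; jj < nStreams; jj++) {
    int offset = jj * streamSize;
    spinKernel<<<1, 256, 0, streams[jj]>>>(&d_Data[offset], streamSize, 1000000);
}

With one block per launch, the four kernels should overlap in the profiler timeline on any GPU that supports concurrent kernel execution.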

Thanks Robert!