Question about parallel execution of CUDA kernels

Hi,
I am currently testing the parallel execution of multiple CUDA kernels on the Jetson Orin device, which has 16 SMs. Each SM has a maximum of 1536 resident threads and a maximum of 16 resident thread blocks. Below is the test program I am using.
sm_test.cu (5.4 KB)
I have implemented two kernels for vector addition and measured the execution time of a single kernel as well as two kernels running concurrently. If the execution time of two kernels running concurrently is equivalent to that of a single kernel, then we can conclude that the two kernels are running in parallel. Below are the test results.

nElem   block  grid  blocks_per_sm  warps_per_block  single_kernel_time  two_kernel_time
12288   768    16    2              24               9.014720 ms         21 ms
8192    512    16    3              16               5.970816 ms         12 ms
6144    384    16    4              12               4.747264 ms          9 ms
4096    256    16    6               8               4.064064 ms          5 ms
3072    192    16    8               6               4.070944 ms          4 ms
2048    128    16    12              4               3.962016 ms          4 ms

From the results, we can observe that in the bottom three tests the kernels run in parallel, while in the top three tests they do not.
We've noticed a pattern: once the combined thread count of the two kernels on each SM exceeds 512, parallel execution no longer occurs.
What constraints lead to this phenomenon?
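
For reference, here is a small sketch (separate from sm_test.cu; the kernel dummyAdd is just a stand-in for my vector-add kernels) showing how the per-SM limits and the theoretical co-residency for each block size can be queried through the runtime occupancy API:

#include <cstdio>
#include <initializer_list>
#include <cuda_runtime.h>

// Stand-in kernel with the same launch footprint as the vector-add kernels.
__global__ void dummyAdd(const float* a, const float* b, float* c, int n) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) c[idx] = a[idx] + b[idx];
}

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("SMs: %d, max threads/SM: %d, max blocks/SM: %d\n",
         prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
         prop.maxBlocksPerMultiProcessor);
  for (int block : {128, 192, 256, 384, 512, 768}) {
    int maxBlocks = 0;
    // How many blocks of this kernel can be resident on one SM at once?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocks, dummyAdd, block, 0);
    printf("block=%4d -> max resident blocks/SM=%2d (threads/SM=%d)\n",
           block, maxBlocks, maxBlocks * block);
  }
  return 0;
}

Since each kernel launches only one block per SM (grid = 16), two concurrent kernels need just two resident blocks per SM, which the 1536-thread / 16-block limits permit for every block size above, so occupancy alone does not seem to explain the cutoff.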

Thanks very much!

On this forum, please post code inline, to make it searchable and more easily studied and discussed. I also recommend indicating the CUDA version you used, and the command line you used to compile the code.

I suppose one possibility is running into a bandwidth limit. Once you hit that limit, additional memory-bound work will increase total duration linearly. This could be confirmed or refuted with a profiler.
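
For example, a single run under Nsight Compute can show how close one kernel gets to the DRAM and L2 throughput limits. Something along these lines should do it (treat this as a sketch: the exact metric names vary between ncu versions, and the binary name sm_test is assumed from your attachment):

$ ncu --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed,lts__throughput.avg.pct_of_peak_sustained_elapsed ./sm_test

If one kernel alone already sits near 100% of either throughput, a second concurrent kernel can only stretch the timeline rather than overlap for free.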

I don’t think we can jump to that conclusion with certainty just based on the data you have shown. I don’t have a Jetson Orin device to work on (if you want to discuss this with the Jetson Orin crowd, there is a separate/specific forum for that), but I modified your code to run on a GTX 1660 Super (22 SMs, cc7.5) device. Here is an example with some profiler output:

$ cat t39.cu
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <time.h>

#include <chrono>
#include <iostream>

void cudaCheck(cudaError_t ret, std::ostream& err = std::cerr) {
  if (ret != cudaSuccess) {
    // report the CUDA error string on the supplied stream, then abort
    err << "Cuda failure: " << cudaGetErrorString(ret) << std::endl;
    abort();
  }
}

void sumArrays(float* a, float* b, float* res, const int size) {
  for (int i = 0; i < size; i += 4) {
    res[i] = a[i] + b[i];
    res[i + 1] = a[i + 1] + b[i + 1];
    res[i + 2] = a[i + 2] + b[i + 2];
    res[i + 3] = a[i + 3] + b[i + 3];
  }
}
void initialData(float* ip, int size) {
  time_t t;
  srand((unsigned)time(&t));
  for (int i = 0; i < size; i++) {
    ip[i] = (float)(rand() & 0xffff) / 1000.0f;
  }
}

void checkResult(float* hostRef, float* gpuRef, const int N) {
  double epsilon = 1.0E-8;
  for (int i = 0; i < N; i++) {
    if (abs(hostRef[i] - gpuRef[i]) > epsilon) {
      printf("Results don\'t match!\n");
      printf("%f(hostRef[%d] )!= %f(gpuRef[%d])\n", hostRef[i], i, gpuRef[i], i);
      return;
    }
  }
  printf("Check result success!\n");
}

__global__ void sumArraysGPU(const float* a, const float* b, float* c, int32_t n) {
  int32_t idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) {
    // NB: this loop exists only to stretch the kernel's duration for timing
    for (int i = 0; i < 100000; i++) {
      c[idx] = a[idx] + b[idx];
    }
  }
}

// Identical copy of the kernel above, so that two distinct kernels can be
// launched into two different streams for the concurrency test.
__global__ void sumArraysGPU_1(const float* a, const float* b, float* c, int32_t n) {
  int32_t idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) {
    for (int i = 0; i < 100000; i++) {
      c[idx] = a[idx] + b[idx];
    }
  }
}

int main(int argc, char** argv) {
  int dev = 0;
  cudaSetDevice(dev);
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaStream_t stream_1;
  cudaStreamCreate(&stream_1);
  cudaEvent_t startEvent, stopEvent;
  cudaCheck(cudaEventCreate(&startEvent));
  cudaCheck(cudaEventCreate(&stopEvent));
  cudaEvent_t startEvent_1, stopEvent_1, startEvent_2, stopEvent_2;
  cudaEventCreate(&startEvent_1);
  cudaEventCreate(&stopEvent_1);
  cudaEventCreate(&startEvent_2);
  cudaEventCreate(&stopEvent_2);
  const int nsms = 22;
  int nElem = nsms * 512;  // default; recomputed for each test case below

  int nByte = sizeof(float) * nElem;
  float* a_h = (float*)malloc(nByte);
  float* b_h = (float*)malloc(nByte);
  float* res_h = (float*)malloc(nByte);
  float* res_from_gpu_h = (float*)malloc(nByte);
  memset(res_h, 0, nByte);
  memset(res_from_gpu_h, 0, nByte);

  float *a_d, *b_d, *res_d;
  cudaMalloc((float**)&a_d, nByte);
  cudaMalloc((float**)&b_d, nByte);
  cudaMalloc((float**)&res_d, nByte);

  initialData(a_h, nElem);
  initialData(b_h, nElem);

  cudaMemcpy(a_d, a_h, nByte, cudaMemcpyHostToDevice);
  cudaMemcpy(b_d, b_h, nByte, cudaMemcpyHostToDevice);

  float* a_h_1 = (float*)malloc(nByte);
  float* b_h_1 = (float*)malloc(nByte);
  float* res_h_1 = (float*)malloc(nByte);
  float* res_from_gpu_h_1 = (float*)malloc(nByte);
  memset(res_h_1, 0, nByte);
  memset(res_from_gpu_h_1, 0, nByte);

  float *a_d_1, *b_d_1, *res_d_1;
  cudaMalloc((float**)&a_d_1, nByte);
  cudaMalloc((float**)&b_d_1, nByte);
  cudaMalloc((float**)&res_d_1, nByte);

  initialData(a_h_1, nElem);
  initialData(b_h_1, nElem);

  cudaMemcpy(a_d_1, a_h_1, nByte, cudaMemcpyHostToDevice);
  cudaMemcpy(b_d_1, b_h_1, nByte, cudaMemcpyHostToDevice);
//  int test_cases[] = {128, 192, 256, 384, 512};
  int test_cases[] = {512};
  for (int i = 0; i < sizeof(test_cases)/sizeof(test_cases[0]); i++){
    dim3 block(test_cases[i]);
    nElem = block.x*nsms;
    dim3 grid((nElem + block.x - 1) / block.x);
    printf("Execution configuration<<<%d,%d>>>\n", grid.x, block.x);

    printf("####### Launch single kernel #######\n");
    cudaCheck(cudaEventRecord(startEvent, stream));
    sumArraysGPU<<<grid, block, 0, stream>>>(a_d, b_d, res_d, nElem);
    cudaCheck(cudaEventRecord(stopEvent, stream));
    cudaCheck(cudaEventSynchronize(stopEvent));
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, startEvent, stopEvent);
    printf("Single kernel execution time: %f ms\n", elapsedTime);

    printf("####### Launch two kernel #######\n");

    auto total_start = std::chrono::high_resolution_clock::now();  // record wall-clock start time
    cudaCheck(cudaEventRecord(startEvent, stream));
    sumArraysGPU<<<grid, block,  0, stream>>>(a_d, b_d, res_d, nElem);
    cudaCheck(cudaEventRecord(stopEvent, stream));

    cudaCheck(cudaEventRecord(startEvent_1, stream_1));
    sumArraysGPU_1<<<grid, block, 0, stream_1>>>(a_d_1, b_d_1, res_d_1, nElem);
    cudaCheck(cudaEventRecord(stopEvent_1, stream_1));

    cudaCheck(cudaEventSynchronize(stopEvent));
    cudaCheck(cudaEventSynchronize(stopEvent_1));
    cudaEventElapsedTime(&elapsedTime, startEvent, stopEvent_1);
    auto total_stop = std::chrono::high_resolution_clock::now();  // record wall-clock stop time

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(total_stop - total_start);
    std::cout << "Two kernel execution time: " << duration.count() << "," << elapsedTime << " ms\n";  // wall-clock ms, event-timer ms
  }
  // cudaMemcpy(res_from_gpu_h, res_d, nByte, cudaMemcpyDeviceToHost);
  // cudaMemcpy(res_from_gpu_h_1, res_d_1, nByte, cudaMemcpyDeviceToHost);
  // sumArrays(a_h, b_h, res_h, nElem);
  // checkResult(res_h, res_from_gpu_h, nElem);
  // sumArrays(a_h_1, b_h_1, res_h_1, nElem);
  // checkResult(res_h_1, res_from_gpu_h_1, nElem);
  cudaFree(a_d);
  cudaFree(b_d);
  cudaFree(res_d);
  cudaFree(a_d_1);
  cudaFree(b_d_1);
  cudaFree(res_d_1);
  cudaStreamDestroy(stream);
  cudaStreamDestroy(stream_1);
  // Destroy the CUDA events
  cudaEventDestroy(startEvent);
  cudaEventDestroy(stopEvent);
  cudaEventDestroy(startEvent_1);
  cudaEventDestroy(stopEvent_1);
  cudaEventDestroy(startEvent_2);
  cudaEventDestroy(stopEvent_2);
  free(a_h);
  free(b_h);
  free(res_h);
  free(res_from_gpu_h);
  free(a_h_1);
  free(b_h_1);
  free(res_h_1);
  free(res_from_gpu_h_1);
  cudaDeviceReset();

  return 0;
}

$ nvcc -o t39 t39.cu -arch=sm_75 -Xptxas=-v
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z14sumArraysGPU_1PKfS0_Pfi' for 'sm_75'
ptxas info    : Function properties for _Z14sumArraysGPU_1PKfS0_Pfi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 380 bytes cmem[0]
ptxas info    : Compiling entry function '_Z12sumArraysGPUPKfS0_Pfi' for 'sm_75'
ptxas info    : Function properties for _Z12sumArraysGPUPKfS0_Pfi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 380 bytes cmem[0]
$ ./t39
Execution configuration<<<22,512>>>
####### Launch single kernel #######
Single kernel execution time: 8.440832 ms
####### Launch two kernel #######
Two kernel execution time: 16,16.7485 ms
$ nsys nvprof --print-gpu-trace ./t39
WARNING: t39 and any of its children processes will be profiled.

Execution configuration<<<22,512>>>
####### Launch single kernel #######
Single kernel execution time: 8.449664 ms
####### Launch two kernel #######
Two kernel execution time: 16,16.7538 ms
Generating '/tmp/nsys-report-271d.qdstrm'
[1/3] [========================100%] report5.nsys-rep
[2/3] [========================100%] report5.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

 Start (ns)   Duration (ns)  CorrId  GrdX  GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd               Device                Ctx  Strm                             Name
 -----------  -------------  ------  ----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  ---------------------------------  ---  ----  ----------------------------------------------------------
 259,427,846          2,880     131                                                                                0.045         15,644.434  Pageable  Device    NVIDIA GeForce GTX 1660 SUPER (0)    1     7  [CUDA memcpy HtoD]
 259,440,166          2,688     132                                                                                0.045         16,761.868  Pageable  Device    NVIDIA GeForce GTX 1660 SUPER (0)    1     7  [CUDA memcpy HtoD]
 259,780,935          2,688     136                                                                                0.045         16,761.868  Pageable  Device    NVIDIA GeForce GTX 1660 SUPER (0)    1     7  [CUDA memcpy HtoD]
 259,792,583          2,688     137                                                                                0.045         16,761.868  Pageable  Device    NVIDIA GeForce GTX 1660 SUPER (0)    1     7  [CUDA memcpy HtoD]
 259,849,575      8,418,292     139    22     1     1   512     1     1       18         0.000         0.000                                                     NVIDIA GeForce GTX 1660 SUPER (0)    1    13  sumArraysGPU(const float *, const float *, float *, int)
 268,298,555     10,648,441     144    22     1     1   512     1     1       18         0.000         0.000                                                     NVIDIA GeForce GTX 1660 SUPER (0)    1    13  sumArraysGPU(const float *, const float *, float *, int)
 268,305,755     16,740,392     147    22     1     1   512     1     1       18         0.000         0.000                                                     NVIDIA GeForce GTX 1660 SUPER (0)    1    14  sumArraysGPU_1(const float *, const float *, float *, int)

Generated:
    /home/bob/bobc/misc/report5.nsys-rep
    /home/bob/bobc/misc/report5.sqlite
$

(CUDA 12.1)

Key items:

  • when run with the 22-block, 512-threads-per-block test case, the code shows approximately double the execution time for the two-kernel case, even though the two kernels should be able to run concurrently on this sm_75 device with 22 SMs.
  • the profiler suggests that in the two-kernel case the kernels may be running concurrently: the start times of the two kernels are within 8 microseconds of each other.
  • the duration of the second kernel clearly accounts for the overall duration in the two-kernel case. However, I conclude there is overlap, because the first kernel runs in ~8 ms in the one-kernel case but ~10 ms in the two-kernel case. If there were no concurrency, we would not expect the duration of the first launched kernel to be noticeably impacted by the second launch.

Yes, I realize this is not a full explanation. I'm simply pointing out that the data presented is not sufficient to conclude that the two kernels are not running in parallel (i.e., concurrently). You are not running into an occupancy limiter; the performance limiter is somewhere else. Bandwidth may be an issue. Note that these codes and footprints are small enough that the caches play an important role, so the limiter may not be device memory bandwidth but instead possibly L1 or L2 cache bandwidth. In my test case, the footprint for the 128-threads-per-block case is 128*22*4*3 bytes = ~33KB. The footprint for the 512-threads-per-block case is therefore 4x that, or ~132KB. The 16-block case will be somewhat smaller. All of those should fit in the L2, with perhaps some or all fitting in the L1. (My GTX 1660 Super has ~1.5MB of L2.)
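
If you want to check that arithmetic on your own device, here is a small sketch (my addition, not part of the posted code) that queries the L2 size at runtime and compares it to the working set:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int l2_bytes = 0;
  // cudaDevAttrL2CacheSize reports the device's L2 cache size in bytes.
  cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, 0);
  const int nsms = 22;    // SM count used in t39.cu above
  const int block = 512;  // largest tested block size
  // Three float arrays (a, b, res) of nsms*block elements per kernel.
  size_t per_kernel = 3UL * nsms * block * sizeof(float);
  printf("L2: %d bytes, one-kernel footprint: %zu bytes, two-kernel: %zu bytes\n",
         l2_bytes, per_kernel, 2 * per_kernel);
  return 0;
}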

For amusement, if you’d like to see some further compiler optimization, try adding the __restrict__ keyword to each of your kernel pointer parameters.
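
That is, something like this sketch (a drop-in variant of the first kernel in t39.cu):

// Same kernel, with __restrict__ asserting that a, b, and c never alias.
// Without it, the compiler must assume a store through c could change
// a[idx] or b[idx], so it re-loads and re-stores on every loop iteration.
__global__ void sumArraysGPU(const float* __restrict__ a,
                             const float* __restrict__ b,
                             float* __restrict__ c, int32_t n) {
  int32_t idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) {
    for (int i = 0; i < 100000; i++) {
      c[idx] = a[idx] + b[idx];
    }
  }
}

Once the pointers are known not to alias, the loop body becomes invariant, and the compiler is free to collapse the whole loop into a single load/add/store.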

Thank you for your response.

I have tried again using a different GPU device with 68 SMs and analyzed the runs with NVIDIA Nsight Compute (CUDA 11.3, cc 7.5). Please find the results below.

Test 1
Using a grid of 68 and a block of 128, the execution time for the 2-kernel case is approximately equivalent to that of the 1-kernel case.

Vector size:8704
Execution configuration<<<68,128>>>
####### Launch single kernel #######
Single kernel execution time: 4.264160 ms
####### Launch two kernel #######
Two kernel execution time: 4.994528 ms
Check result success!

Test 2
Using a grid of 68 and a block of 256, the execution time for the 2-kernel case is approximately double that of the 1-kernel case.

Vector size:17408
Execution configuration<<<68,256>>>
####### Launch single kernel #######
Single kernel execution time: 4.964416 ms
####### Launch two kernel #######
Two kernel execution time: 10.354144 ms
Check result success!

Test 3
As you suggested, I modified the kernel input pointer c to __restrict__ and maintained a grid of 68 and a block of 256. With these adjustments, the execution time for the 2-kernel case is approximately equivalent to that of the 1-kernel case. Additionally, both cases demonstrated significantly faster performance compared to the previous test without __restrict__.

Vector size:17408
Execution configuration<<<68,256>>>
####### Launch single kernel #######
Single kernel execution time: 0.020576 ms
####### Launch two kernel #######
Two kernel execution time: 0.017472 ms
Check result success!

Questions
Based on the tests conducted, I would like to raise a few questions.

Q1: According to the Nsight Compute screenshot, Test 1 reached a maximum bandwidth of 43.75%, while Test 2 achieved a higher maximum bandwidth of 73.80%. I'm wondering whether the higher max bandwidth in the second test is a result of the 2-kernel case being unable to execute “in parallel”, or whether something else contributed to it.

Q2: I might have some misconceptions about maximum bandwidth. In Test 2, each kernel achieved a maximum bandwidth of 73.80%, so based on my understanding from Q1 they couldn't be executed entirely in parallel on the timeline. I'm wondering why the two kernels can't each run at a maximum bandwidth of 50%, which would allow them to run fully in parallel on the timeline.

Q3: Could you explain how the __restrict__ keyword affects nvcc's optimizations? I observed that adding this keyword can vastly enhance performance, with improvements of up to two orders of magnitude.

I apologize in advance if some of my questions seem silly as I am not an expert in CUDA programming. Thank you for your understanding and patience.

I haven’t fully analyzed what is happening here, so my responses to Q1 and Q2 will be rather limited and possibly incorrect.

From my perspective, if the two kernels could not be executed “in parallel”, why would we expect the max bandwidth to be any different? The fact that the max bandwidth is different (along with other observations like the one I pointed out) suggests to me quite conclusively that, to some degree, the two-kernel case is executing both kernels concurrently. In this particular case, the L2 cache traffic could be affected by what is actually in the L2 cache, and in the two-kernel case there may be differences. But I don't think the other observation I made (kernel durations) can be confused that way: the kernel durations getting noticeably longer suggests one kernel is impacting the runtime of the other.

Yes, I would agree that when they are executed “in parallel” they are not executed “entirely in parallel”. There is some limited overlap and/or some other limiter that comes into play when both are executing.

Sorry, I should not have diverted your attention with this. According to my inspection, this decoration allows the compiler to eliminate your for-loop:

  for (int i = 0; i < 100000; i++) {

That for-loop exists just to give your kernel a sufficient duration to make analysis easier. Letting the compiler remove it isn't really what you want, so you should probably just ignore that suggestion.
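
If you do want a busy-loop that survives __restrict__, one option (a sketch; the kernel name and the accumulator idea are my own, not from the posted code) is to introduce a loop-carried dependency that the compiler cannot legally collapse under default floating-point rules:

// Busy-loop variant whose iterations must actually execute: the accumulator
// depends on the previous iteration, and floating-point addition is not
// associative, so the 100000 adds cannot be folded into one multiply.
__global__ void sumArraysGPU_busy(const float* __restrict__ a,
                                  const float* __restrict__ b,
                                  float* __restrict__ c, int32_t n) {
  int32_t idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) {
    float acc = 0.0f;
    for (int i = 0; i < 100000; i++) {
      acc += a[idx] + b[idx];  // loop-carried dependency
    }
    c[idx] = acc;  // note: no longer a plain vector add
  }
}

Keep in mind the stored result then differs from a plain vector add, so the host-side checkResult comparison would need adjusting.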