Best way to implement element-wise add kernel

Hello. I’m trying to implement an element-wise add kernel.
I would like to add 2^26 elements of uint32_t. I tested the code below.

__global__ void add_b114(uint32_t *dst, uint32_t *src1, uint32_t *src2) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = tid; i < ELEM_SIZE ; i += blockDim.x * gridDim.x) {
    dst[i] = src1[i] + src2[i];
  }
}

And I tested thrust library like below

thrust::transform(thrust::device, src1->begin(), src1->end(), src2->begin(), dst->begin(), thrust::plus<uint32_t>());

At first glance I thought thrust would be fast, but it is 8 times slower.
This is the Nsight Compute result. I also made a unit test as a sanity check.

What’s the problem, and is there a best way to implement element-wise operation kernels?

When posting code on these forums, please format it correctly. Here is one possible method.

  1. Edit your post by clicking on the pencil icon below it.
  2. Select the code
  3. click the </> button at the top of the edit window
  4. save your changes

Please do that now.

There is nothing wrong with your non-thrust code, and a thrust implementation should not be 8 times slower. When I hear of such a difference, it makes me think you might be compiling the thrust code with -G, or building a debug project on Windows.

To my eye, your thrust usage is also a little bit unusual, so it would probably be good to see the entire thrust code.

Finally, because of various behaviors such as lazy loading on modern GPUs, it’s more important to do proper benchmarking to compare apples-to-apples. At a minimum, I would recommend launching both operations twice in sequence, and using the second run of each for comparison.
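The warm-up-then-measure pattern can be sketched in host code. This is a minimal, hypothetical helper (not from the thread); for an actual CUDA kernel you would call cudaDeviceSynchronize() before stopping the clock, or better, time with cudaEvent records:

```cpp
#include <chrono>
#include <functional>

// Time the SECOND of two back-to-back runs, so that one-time costs
// (lazy module loading, JIT compilation, cache warm-up) are absorbed
// by the first run and do not pollute the measurement.
double time_second_run_us(const std::function<void()> &op) {
    op();  // warm-up run: absorbs one-time initialization cost
    auto t0 = std::chrono::steady_clock::now();
    op();  // measured run
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```

Applying the same warm-up discipline to both the hand-written kernel and the thrust call is what makes the comparison apples-to-apples.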


I’d recommend providing a complete test case. I’m a bit skeptical that your kernel implementation at 97us is correct. For 2^26 elements of size uint32_t, the kernel will load and store a combined 2^26 x 12 bytes. For that to happen in 97us works out to a delivered bandwidth of approximately 8TB/s. The memory bandwidth of an H100 PCIe is not 8TB/s (it is around 2TB/s).

Here is my complete test case, running on a L4 GPU:

# cat t263.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>
#include <cstdint>

const int nBLK = 1824;
const int nTPB = 256;
const int ELEM_SIZE = 67108864;

__global__ void add_b114(uint32_t *dst, uint32_t *src1, uint32_t *src2) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = tid; i < ELEM_SIZE ; i += blockDim.x * gridDim.x) {
    dst[i] = src1[i] + src2[i];
  }
}

using mt = uint32_t;

int main(){

  thrust::host_vector<mt> h_src1(ELEM_SIZE, 1);
  thrust::host_vector<mt> h_src2(ELEM_SIZE, 2);
  thrust::device_vector<mt> d_src1 = h_src1;
  thrust::device_vector<mt> d_src2 = h_src2;
  thrust::device_vector<mt> d_dst = h_src1;

  mt *h_s1, *h_s2, *d_s1, *d_s2, *d_r;
  h_s1 = new mt[ELEM_SIZE];
  h_s2 = new mt[ELEM_SIZE];
  cudaMalloc(&d_s1, sizeof(mt)*ELEM_SIZE);
  cudaMalloc(&d_s2, sizeof(mt)*ELEM_SIZE);
  cudaMalloc(&d_r, sizeof(mt)*ELEM_SIZE);
  cudaMemcpy(d_s1, h_s1, sizeof(mt)*ELEM_SIZE, cudaMemcpyHostToDevice);
  cudaMemcpy(d_s2, h_s2, sizeof(mt)*ELEM_SIZE, cudaMemcpyHostToDevice);
  add_b114<<<nBLK, nTPB>>>(d_r, d_s1, d_s2); // warm-up
  add_b114<<<nBLK, nTPB>>>(d_r, d_s1, d_s2);
  thrust::transform(d_src1.begin(), d_src1.end(), d_src2.begin(), d_dst.begin(), thrust::plus<mt>()); // warm-up
  thrust::transform(d_src1.begin(), d_src1.end(), d_src2.begin(), d_dst.begin(), thrust::plus<mt>());
  cudaDeviceSynchronize();
}
# nvcc -o t263 t263.cu
# nsys profile --stats=true ./t263
Generating '/tmp/nsys-report-729a.qdstrm'
[1/8] [========================100%] report6.nsys-rep
[2/8] [========================100%] report6.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /root/bobc/report6.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)        Name
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  --------------
     78.0      701,196,264         17  41,246,839.1  15,967,484.0     7,247  100,168,335  45,719,929.2  poll
     19.4      174,385,865        508     343,279.3      15,608.5     1,028   78,930,428   3,519,373.3  ioctl
      2.2       19,443,142         31     627,198.1       5,928.0     2,118   19,235,267   3,453,514.3  fopen
      0.2        2,017,079         27      74,706.6      12,527.0    10,175    1,245,774     235,366.0  mmap64
      0.1          837,463         44      19,033.3      17,629.5     6,921       57,665       7,695.8  open64
      0.0          428,842          9      47,649.1      44,542.0    37,626       75,691      11,494.1  sem_timedwait
      0.0          272,818          2     136,409.0     136,409.0   123,441      149,377      18,339.5  pthread_create
      0.0          210,100         20      10,505.0       4,956.0     2,614       70,433      14,733.6  mmap
      0.0          153,256         15      10,217.1       6,600.0     4,078       56,235      12,875.1  munmap
      0.0           84,286         49       1,720.1          67.0        60       80,780      11,529.6  fgets
      0.0           75,534         25       3,021.4       2,820.0     1,617        6,970       1,072.9  fclose
      0.0           61,157         53       1,153.9       1,047.0       729        5,063         581.4  fcntl
      0.0           41,237          6       6,872.8       6,395.5       388       13,879       4,429.5  fread
      0.0           38,664          6       6,444.0       6,236.5     2,775       11,018       2,765.1  open
      0.0           30,682         13       2,360.2       2,130.0     1,618        3,992         737.9  read
      0.0           29,739         10       2,973.9       2,896.0     1,424        5,035         950.2  write
      0.0           18,437          2       9,218.5       9,218.5     6,221       12,216       4,239.1  socket
      0.0           14,275          1      14,275.0      14,275.0    14,275       14,275           0.0  connect
      0.0            9,440          1       9,440.0       9,440.0     9,440        9,440           0.0  pipe2
      0.0            6,415          7         916.4         903.0       846        1,015          59.5  dup
      0.0            2,405          1       2,405.0       2,405.0     2,405        2,405           0.0  bind
      0.0            1,574          1       1,574.0       1,574.0     1,574        1,574           0.0  listen
      0.0              711         10          71.1          54.0        48          226          54.6  fflush

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)            Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  ----------------------
     38.5      266,895,811          3   88,965,270.3   61,525,726.0   60,742,431  144,627,654  48,206,629.3  cudaMemcpyAsync
     31.6      218,884,717          2  109,442,358.5  109,442,358.5  109,415,546  109,469,171      37,918.6  cudaMemcpy
     27.2      188,355,854          6   31,392,642.3      687,393.0      513,222  185,239,703  75,369,411.4  cudaMalloc
      1.9       13,426,295          5    2,685,259.0       76,570.0       72,204    9,958,527   4,291,585.2  cudaStreamSynchronize
      0.6        4,170,669          3    1,390,223.0    1,503,024.0      562,406    2,105,239     777,577.3  cudaFree
      0.1          772,845          4      193,211.3       20,731.5        9,551      721,831     352,534.3  cudaLaunchKernel
      0.0            5,770          1        5,770.0        5,770.0        5,770        5,770           0.0  cudaDeviceSynchronize
      0.0            1,640          1        1,640.0        1,640.0        1,640        1,640           0.0  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     51.4        6,833,477          2  3,416,738.5  3,416,738.5  3,383,554  3,449,923     46,930.0  add_b114(unsigned int *, unsigned int *, unsigned int *)
     48.6        6,467,236          2  3,233,618.0  3,233,618.0  3,221,410  3,245,826     17,264.7  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)      Min (ns)    Max (ns)    StdDev (ns)       Operation
 --------  ---------------  -----  ------------  -------------  ----------  -----------  ------------  ------------------
    100.0      484,170,208      5  96,834,041.6  108,984,009.0  60,624,904  144,063,903  35,689,009.3  [CUDA memcpy HtoD]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
  1,342.177      5   268.435   268.435   268.435   268.435        0.000  [CUDA memcpy HtoD]

Generated:
    /root/bobc/report6.nsys-rep
    /root/bobc/report6.sqlite
#

We see that the kernel execution time and the thrust time are both in the range of 3.3ms, +/- 0.1ms; there is not an 8-times difference. Furthermore, 3.3ms on L4 vs. 770us on H100 is a reasonable speed-up (H100 about 4x faster than L4). I don’t think 3.3ms on L4 vs. 97us on H100 is a reasonable speed-up (H100 about 34x faster than L4): the ratio of peak memory bandwidths is 2TB/s : 300GB/s, so about 6:1. Your kernel results look questionable to me, so a full test case (like the one I have provided) would be needed to sort things out.

Again, please fix and format your code before proceeding further with posting here.


Thank you for your detailed response. I believe there may have been a mistake on my part in what you pointed out. I will conduct more precise tests as you suggested, and if something still seems strange, I’ll ask again. Thank you.