I’d recommend providing a complete test case. I’m skeptical that your kernel implementation is correct at 97us. For 2^26 elements of size uint32_t, the kernel will load and store (combined) 2^26 x 12 bytes: two 4-byte loads and one 4-byte store per element. For that to happen in 97us, the delivered bandwidth would have to be approximately 8TB/s. The memory bandwidth of an H100 PCIe is not 8TB/s (it is around 2TB/s).
Here is my complete test case, running on a L4 GPU:
# cat t263.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>
#include <cstdint>
const int nBLK = 1824;
const int nTPB = 256;
const int ELEM_SIZE = 67108864;
__global__ void add_b114(uint32_t *dst, uint32_t *src1, uint32_t *src2) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = tid; i < ELEM_SIZE; i += blockDim.x * gridDim.x) {
    dst[i] = src1[i] + src2[i];
  }
}
using mt = uint32_t;
int main(){
  thrust::host_vector<mt> h_src1(ELEM_SIZE, 1);
  thrust::host_vector<mt> h_src2(ELEM_SIZE, 2);
  thrust::device_vector<mt> d_src1 = h_src1;
  thrust::device_vector<mt> d_src2 = h_src2;
  thrust::device_vector<mt> d_dst = h_src1;
  mt *h_s1, *h_s2, *d_s1, *d_s2, *d_r;
  h_s1 = new mt[ELEM_SIZE];
  h_s2 = new mt[ELEM_SIZE];
  cudaMalloc(&d_s1, sizeof(mt)*ELEM_SIZE);
  cudaMalloc(&d_s2, sizeof(mt)*ELEM_SIZE);
  cudaMalloc(&d_r,  sizeof(mt)*ELEM_SIZE);
  cudaMemcpy(d_s1, h_s1, sizeof(mt)*ELEM_SIZE, cudaMemcpyHostToDevice);
  cudaMemcpy(d_s2, h_s2, sizeof(mt)*ELEM_SIZE, cudaMemcpyHostToDevice);
  add_b114<<<nBLK, nTPB>>>(d_r, d_s1, d_s2); // warm-up
  add_b114<<<nBLK, nTPB>>>(d_r, d_s1, d_s2);
  thrust::transform(d_src1.begin(), d_src1.end(), d_src2.begin(), d_dst.begin(), thrust::plus<mt>()); // warm-up
  thrust::transform(d_src1.begin(), d_src1.end(), d_src2.begin(), d_dst.begin(), thrust::plus<mt>());
  cudaDeviceSynchronize();
}
# nvcc -o t263 t263.cu
# nsys profile --stats=true ./t263
Generating '/tmp/nsys-report-729a.qdstrm'
[1/8] [========================100%] report6.nsys-rep
[2/8] [========================100%] report6.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /root/bobc/report6.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ -------- ----------- ------------ --------------
78.0 701,196,264 17 41,246,839.1 15,967,484.0 7,247 100,168,335 45,719,929.2 poll
19.4 174,385,865 508 343,279.3 15,608.5 1,028 78,930,428 3,519,373.3 ioctl
2.2 19,443,142 31 627,198.1 5,928.0 2,118 19,235,267 3,453,514.3 fopen
0.2 2,017,079 27 74,706.6 12,527.0 10,175 1,245,774 235,366.0 mmap64
0.1 837,463 44 19,033.3 17,629.5 6,921 57,665 7,695.8 open64
0.0 428,842 9 47,649.1 44,542.0 37,626 75,691 11,494.1 sem_timedwait
0.0 272,818 2 136,409.0 136,409.0 123,441 149,377 18,339.5 pthread_create
0.0 210,100 20 10,505.0 4,956.0 2,614 70,433 14,733.6 mmap
0.0 153,256 15 10,217.1 6,600.0 4,078 56,235 12,875.1 munmap
0.0 84,286 49 1,720.1 67.0 60 80,780 11,529.6 fgets
0.0 75,534 25 3,021.4 2,820.0 1,617 6,970 1,072.9 fclose
0.0 61,157 53 1,153.9 1,047.0 729 5,063 581.4 fcntl
0.0 41,237 6 6,872.8 6,395.5 388 13,879 4,429.5 fread
0.0 38,664 6 6,444.0 6,236.5 2,775 11,018 2,765.1 open
0.0 30,682 13 2,360.2 2,130.0 1,618 3,992 737.9 read
0.0 29,739 10 2,973.9 2,896.0 1,424 5,035 950.2 write
0.0 18,437 2 9,218.5 9,218.5 6,221 12,216 4,239.1 socket
0.0 14,275 1 14,275.0 14,275.0 14,275 14,275 0.0 connect
0.0 9,440 1 9,440.0 9,440.0 9,440 9,440 0.0 pipe2
0.0 6,415 7 916.4 903.0 846 1,015 59.5 dup
0.0 2,405 1 2,405.0 2,405.0 2,405 2,405 0.0 bind
0.0 1,574 1 1,574.0 1,574.0 1,574 1,574 0.0 listen
0.0 711 10 71.1 54.0 48 226 54.6 fflush
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ------------ ----------------------
38.5 266,895,811 3 88,965,270.3 61,525,726.0 60,742,431 144,627,654 48,206,629.3 cudaMemcpyAsync
31.6 218,884,717 2 109,442,358.5 109,442,358.5 109,415,546 109,469,171 37,918.6 cudaMemcpy
27.2 188,355,854 6 31,392,642.3 687,393.0 513,222 185,239,703 75,369,411.4 cudaMalloc
1.9 13,426,295 5 2,685,259.0 76,570.0 72,204 9,958,527 4,291,585.2 cudaStreamSynchronize
0.6 4,170,669 3 1,390,223.0 1,503,024.0 562,406 2,105,239 777,577.3 cudaFree
0.1 772,845 4 193,211.3 20,731.5 9,551 721,831 352,534.3 cudaLaunchKernel
0.0 5,770 1 5,770.0 5,770.0 5,770 5,770 0.0 cudaDeviceSynchronize
0.0 1,640 1 1,640.0 1,640.0 1,640 1,640 0.0 cuModuleGetLoadingMode
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- --------- --------- ----------- ----------------------------------------------------------------------------------------------------
51.4 6,833,477 2 3,416,738.5 3,416,738.5 3,383,554 3,449,923 46,930.0 add_b114(unsigned int *, unsigned int *, unsigned int *)
48.6 6,467,236 2 3,233,618.0 3,233,618.0 3,221,410 3,245,826 17,264.7 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- ------------ ------------- ---------- ----------- ------------ ------------------
100.0 484,170,208 5 96,834,041.6 108,984,009.0 60,624,904 144,063,903 35,689,009.3 [CUDA memcpy HtoD]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------
1,342.177 5 268.435 268.435 268.435 268.435 0.000 [CUDA memcpy HtoD]
Generated:
/root/bobc/report6.nsys-rep
/root/bobc/report6.sqlite
#
We see that the kernel execution time and the thrust time are both in the range of 3.3ms, +/- 0.1ms; there is not an 8x difference between them. Furthermore, 3.3ms on L4 vs. 770us on H100 is a plausible speed-up (H100 about 4x faster than L4). I don’t think 3.3ms on L4 vs. 97us on H100 is a plausible speed-up (H100 about 34x faster than L4): the ratio of peak memory bandwidths is 2TB/s : 300GB/s, so about 6:1. So I am skeptical of your kernel results, and a full test case (like the one I have provided) would be needed to sort things out.
Again, please fix and format your code before posting further here.