Here is a quick test of your most recent posted code:
# cat t382.cu
using T = float;
__global__ void batch_dot_product_kernel_threaded(
    const T* __restrict__ d_A,
    const T* __restrict__ d_B,
    T* __restrict__ d_C,
    const int vector_size,
    const int num_vectors
) {
    extern __shared__ T sdata[];
    int vec_id = blockIdx.x;  // Each block computes one vector's dot product
    int tid = threadIdx.x;    // Thread index within block
    if (vec_id >= num_vectors) return;
    int offset = vec_id * vector_size; // Starting index for this vector
    T sum = 0.0f;
    // Each thread computes a partial sum
    for (int i = tid; i < vector_size; i += blockDim.x) {
        sum += d_A[offset + i] * d_B[offset + i];
    }
    // Store partial sum into shared memory
    sdata[tid] = sum;
    __syncthreads();
    // Reduce partial sums to one value per block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // First thread writes result
    if (tid == 0) {
        d_C[vec_id] = sdata[0];
    }
}

int main(){
    const int bs = 512;
    T *d_A, *d_B, *d_C;
    const int nv = 300;
    const int vs = 3000;
    const int ds = sizeof(T)*nv*vs;
    cudaMalloc(&d_A, ds);
    cudaMalloc(&d_B, ds);
    cudaMalloc(&d_C, sizeof(T)*nv);
    batch_dot_product_kernel_threaded<<<nv, bs, bs*sizeof(T)>>>(d_A, d_B, d_C, vs, nv);
    batch_dot_product_kernel_threaded<<<nv, bs, bs*sizeof(T)>>>(d_A, d_B, d_C, vs, nv);
    cudaDeviceSynchronize();
}
# nvcc -o t382 t382.cu -arch=sm_89 -lineinfo
# nsys nvprof --print-gpu-trace ./t382
WARNING: t382 and any of its children processes will be profiled.
Generating '/tmp/nsys-report-9437.qdstrm'
[1/3] [========================100%] report65.nsys-rep
[2/3] [========================100%] report65.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MBps) SrcMemKd DstMemKd Device Ctx Strm Name
----------- ------------- ------ ---- ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- ------------- --- ---- ----------------------------------------------------------------------------------
674,706,950 41,312 121 300 1 1 512 1 1 16 0.000 0.002 NVIDIA L4 (0) 1 7 batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)
674,748,902 8,160 122 300 1 1 512 1 1 16 0.000 0.002 NVIDIA L4 (0) 1 7 batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)
Generated:
/root/bobc/report65.nsys-rep
/root/bobc/report65.sqlite
#
# ./junk/cuda-samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA L4
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 8.5
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 6.7
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 253.6
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
#
On the second kernel run, we can see the kernel duration is about 8 microseconds. This is really almost too short to measure reliably (the problem is too small to give a good picture of GPU capability), but the calculation would look like this:
L4 peak memory bandwidth as measured by bandwidthTest: ~250GB/s

The kernel must read two float vectors of 3000 elements each, for each of 300 dot products. The kernel also has to write some data, but it's small by comparison, so we will ignore it.

3000 x 2 x 4 x 300 = 7,200,000 bytes read by the kernel in ~8us

So that works out to an observed bandwidth of about 900GB/s, which is well above what you can achieve on the L4 operating from GPU DRAM. The explanation for this (I believe) is that this data set is so small it fits in the L2 cache (48MB on L4), which has higher bandwidth than main memory. In any event, that vector-dot-product processing rate works out to ~37M dot products per second. If we convert that to FP32 multiply ops per second, it is ~112GF/s. This is a tiny fraction of the L4 peak FP32 rate of ~30TF/s, but that is due, as a first-order factor, to the memory-bound nature of this problem, even operating out of L2.
Anyway I suspect your most recent posted code is “pretty good”.
Let’s make nv = 30000 to make the problem 100x larger. Now the difference between first and second kernel runs is negligible:
# nsys nvprof --print-gpu-trace ./t382
WARNING: t382 and any of its children processes will be profiled.
Generating '/tmp/nsys-report-f1a3.qdstrm'
[1/3] [========================100%] report66.nsys-rep
[2/3] [========================100%] report66.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report
Start (ns) Duration (ns) CorrId GrdX GrdY GrdZ BlkX BlkY BlkZ Reg/Trd StcSMem (MB) DymSMem (MB) Bytes (MB) Throughput (MBps) SrcMemKd DstMemKd Device Ctx Strm Name
----------- ------------- ------ ------ ---- ---- ---- ---- ---- ------- ------------ ------------ ---------- ----------------- -------- -------- ------------- --- ---- ----------------------------------------------------------------------------------
681,129,267 2,864,002 121 30,000 1 1 512 1 1 16 0.000 0.002 NVIDIA L4 (0) 1 7 batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)
683,993,973 2,845,507 122 30,000 1 1 512 1 1 16 0.000 0.002 NVIDIA L4 (0) 1 7 batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)
Generated:
/root/bobc/report66.nsys-rep
/root/bobc/report66.sqlite
Our new kernel duration is 2.8ms and our data size 720MB, for an observed bandwidth of ~257GB/s, just as predicted by bandwidthTest. This suggests to me the kernel code is approximately optimal. Of course, our delivered FP32 rate has dropped by about a factor of 4, again due to the memory-bound nature of the problem.
Yes, I have probably been sloppy in mixing e.g. MiB and MB in my calculations. You could clean that up, but I don’t think the conclusion is any different.