How to get the most dot products of batched vectors out of L4 GPU

So I’ve been looking at this:

and some forum posts.

I have a demo application where I can use 1 to (let's say) 32 CPU processes to issue chunks of work, each consisting of dot products of (let's say) 300 vector pairs with 3000 FP32 elements per vector. The 300 vectors are laid out consecutively in host memory.
I tried doing this with blocks and threads, with blocks only, and with threads only.

What I want to optimize is the number of processes that can be used before the work the L4 GPU can do is saturated.

Let's take the 300-blocks (== vectors), 1-thread-per-block regime. Each block works on 3000 element pairs until done.
Experimentally it takes a little over 2 processes (each doing 300 vector pairs of 3000 FP32 elements) to saturate the work that the L4 can do: one process finishes in 20 seconds, 2 at a time also finish in 20 seconds, but with 4 at a time it ramps up to 40 seconds. So no reason to issue 4 at a time, I guess?

This makes me think that the L4 GPU can support only 600 independent FP32 calculations at once (or maybe I'm getting bitten by host-to-device memcpy time?). However, looking at the L4 specs, I see about 7,500 FP32 CUDA cores, which makes me think I could get 10x more work out of it.

How do I improve my dot product throughput on the L4 (preferably without using cuBLAS or Thrust) so that it amounts to more than 600 FP32 calculations at once?

extern "C" void isccu_batch_dot_product_no_malloc(const float* h_A, const float* h_B, float* h_C, const float* d_A, const float* d_B, float* d_C, const int vector_size, const int num_vectors) {

    size_t vecBytes = num_vectors * vector_size * sizeof(float);
    size_t resultBytes = num_vectors * sizeof(float);

    cudaMemcpy((float *)d_A, h_A, vecBytes, cudaMemcpyHostToDevice);
    cudaMemcpy((float *)d_B, h_B, vecBytes, cudaMemcpyHostToDevice);

    batch_dot_product_kernel<<<num_vectors, 1>>>(d_A, d_B, d_C, vector_size, num_vectors);
    //const int threads_per_block = 256; 
    //int shared_mem_size = threads_per_block * sizeof(float);
    //batch_dot_product_kernel_threaded<<<num_vectors, threads_per_block, shared_mem_size>>>(
    //d_A, d_B, d_C, vector_size, num_vectors);

    cudaMemcpy(h_C, d_C, resultBytes, cudaMemcpyDeviceToHost);
}

// CUDA kernel: each block processes one dot product
__global__ void batch_dot_product_kernel(const float* __restrict__ d_A, const float* __restrict__ d_B, float* __restrict__ d_C, const int vector_size, const int num_vectors) {

    // One block per vector pair; each block has a single thread
    int idx = blockIdx.x;

    // Check that the block is within valid range
    if (idx < num_vectors)
    {
        float sum = 0.0f;
        // Compute dot product for one vector pair
        for (int i = 0; i < vector_size; i++) {
            sum += d_A[idx * vector_size + i] * d_B[idx * vector_size + i];
        }
        // Store result
        d_C[idx] = sum;
    }
}

when posting code here, please format properly. In a nutshell:

  1. edit your post above, clicking on the pencil icon below it
  2. select the code in the edit window
  3. click the </> button at the top of the edit pane
  4. save your changes

Please do that now.

Using only 1 thread per block is never a good idea on a CUDA GPU when you are interested in performance/throughput. You may want to get some basic understanding of what makes for high-performance CUDA code; the first four units of this online tutorial could be a good start.

Relatively higher performance (than what you have shown) batched vector dot product code is demonstrated and discussed in this blog series.

I did try this (below) and I didn't see the process saturation point change.

__global__ void batch_dot_product_kernel_threaded(
    const float* __restrict__ d_A,
    const float* __restrict__ d_B,
    float* __restrict__ d_C,
    const int vector_size,
    const int num_vectors
) {
    extern __shared__ float sdata[];

    int vec_id = blockIdx.x;         // Each block computes one vector's dot product
    int tid = threadIdx.x;           // Thread index within block

    if (vec_id >= num_vectors) return;

    int offset = vec_id * vector_size; // Starting index for this vector
    float sum = 0.0f;

    // Each thread computes a partial sum
    for (int i = tid; i < vector_size; i += blockDim.x) {
        sum += d_A[offset + i] * d_B[offset + i];
    }

    // Store partial sum into shared memory
    sdata[tid] = sum;
    __syncthreads();

    // Reduce partial sums to one value per block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // First thread writes result
    if (tid == 0) {
        d_C[vec_id] = sdata[0];
    }
}

I haven't studied it carefully, but after a quick look I would say that is better-looking code.

It should not require multiple processes to saturate a GPU, assuming by “process” we are using a typical definition of the word - an operating system process.

Batched vector dot products will be a memory-bound operation. Your upper bound on throughput is the rate at which you can stream vector elements (or vectors, if you prefer) from memory, which is limited by the peak memory bandwidth of the GPU you are operating on. That is how I would approach estimating performance or measuring improvement, rather than a “process saturation point.”
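As a back-of-the-envelope example: each 3000-element FP32 dot product has to read 2 x 3000 x 4 = 24,000 bytes, so at roughly 250GB/s of deliverable DRAM bandwidth the best you could hope for is on the order of 250e9 / 24e3 ≈ 10 million such dot products per second, no matter how many CUDA cores are available (more if the working set happens to fit in cache, as we'll see below).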

Here is a quick test of your most recent posted code:

# cat t382.cu
using T = float;
__global__ void batch_dot_product_kernel_threaded(
    const T* __restrict__ d_A,
    const T* __restrict__ d_B,
    T* __restrict__ d_C,
    const int vector_size,
    const int num_vectors
) {
    extern __shared__ T sdata[];

    int vec_id = blockIdx.x;         // Each block computes one vector's dot product
    int tid = threadIdx.x;           // Thread index within block

    if (vec_id >= num_vectors) return;

    int offset = vec_id * vector_size; // Starting index for this vector
    T sum = 0.0f;

    // Each thread computes a partial sum
    for (int i = tid; i < vector_size; i += blockDim.x) {
        sum += d_A[offset + i] * d_B[offset + i];
    }

    // Store partial sum into shared memory
    sdata[tid] = sum;
    __syncthreads();

    // Reduce partial sums to one value per block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // First thread writes result
    if (tid == 0) {
        d_C[vec_id] = sdata[0];
    }
}

int main(){
  const int bs = 512;
  T *d_A, *d_B, *d_C;
  const int nv = 300;
  const int vs = 3000;
  const int ds = sizeof(T)*nv*vs;
  cudaMalloc(&d_A,ds);
  cudaMalloc(&d_B,ds);
  cudaMalloc(&d_C,sizeof(T)*nv);
  batch_dot_product_kernel_threaded<<<nv, bs, bs*sizeof(T)>>>(d_A, d_B, d_C, vs, nv);
  batch_dot_product_kernel_threaded<<<nv, bs, bs*sizeof(T)>>>(d_A, d_B, d_C, vs, nv);
  cudaDeviceSynchronize();
}

# nvcc -o t382 t382.cu -arch=sm_89 -lineinfo
# nsys nvprof --print-gpu-trace ./t382
WARNING: t382 and any of its children processes will be profiled.

Generating '/tmp/nsys-report-9437.qdstrm'
[1/3] [========================100%] report65.nsys-rep
[2/3] [========================100%] report65.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

 Start (ns)   Duration (ns)  CorrId  GrdX  GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd     Device      Ctx  Strm                                         Name       
 -----------  -------------  ------  ----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------  ---  ----  ----------------------------------------------------------------------------------
 674,706,950         41,312     121   300     1     1   512     1     1       16         0.000         0.002                                                     NVIDIA L4 (0)    1     7  batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)
 674,748,902          8,160     122   300     1     1   512     1     1       16         0.000         0.002                                                     NVIDIA L4 (0)    1     7  batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)

Generated:
    /root/bobc/report65.nsys-rep
    /root/bobc/report65.sqlite
#
# ./junk/cuda-samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA L4
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     8.5

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     253.6

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
#

On the second kernel run, we can see the kernel duration is about 8 microseconds. This is really almost too short to be measuring (too small a problem size to get a good idea of GPU capability) but the calculation would look like this:

L4 peak memory bandwidth as measured by bandwidthTest: ~250GB/s

The kernel must read two float vectors of 3000 elements each for each of the 300 dot products. The kernel also has to write some data, but it's small by comparison, so we will ignore that.

3000x2x4x300 = 7,200,000 bytes read by the kernel in ~8us

So that works out to an observed bandwidth of about 900GB/s which is well above what you can achieve on the L4 (operating from GPU DRAM). The explanation for this (I believe) is that this data set is so small it is fitting in the L2 cache (48MB on L4) which has higher bandwidth than main memory. In any event, that vector-dot-product processing rate works out to 37M dot-products per second. If we convert that to FP32 multiply ops per second, it is 112GF/s. This is a tiny fraction of the L4 FP32 rate of ~30TF/s, but that is due to the memory-bound nature of this problem, as a first-order factor, even operating out of L2.

Anyway I suspect your most recent posted code is “pretty good”.
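If you want to shave the reduction overhead anyway, one option is a warp-shuffle reduction in place of the shared-memory tree. This is just a minimal, untested sketch (it assumes blockDim.x is a multiple of 32 and at most 1024); since the operation is memory-bound I would not expect it to change throughput much, but it removes most of the __syncthreads() calls and the shared-memory traffic:

// Sketch only: warp-shuffle block reduction, one block per vector pair.
// Assumes blockDim.x is a multiple of 32 and <= 1024. No dynamic shared memory needed.
__global__ void batch_dot_product_kernel_shfl(
    const float* __restrict__ d_A,
    const float* __restrict__ d_B,
    float* __restrict__ d_C,
    const int vector_size,
    const int num_vectors)
{
    __shared__ float warp_sums[32];        // one partial sum per warp

    int vec_id = blockIdx.x;
    if (vec_id >= num_vectors) return;

    int tid  = threadIdx.x;
    int lane = tid % 32;
    int warp = tid / 32;
    int offset = vec_id * vector_size;

    // strided partial sum, same (coalesced) access pattern as before
    float sum = 0.0f;
    for (int i = tid; i < vector_size; i += blockDim.x)
        sum += d_A[offset + i] * d_B[offset + i];

    // reduce within each warp
    for (int s = 16; s > 0; s >>= 1)
        sum += __shfl_down_sync(0xFFFFFFFFu, sum, s);
    if (lane == 0) warp_sums[warp] = sum;
    __syncthreads();

    // first warp reduces the per-warp totals and writes the result
    if (warp == 0) {
        float v = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        for (int s = 16; s > 0; s >>= 1)
            v += __shfl_down_sync(0xFFFFFFFFu, v, s);
        if (lane == 0) d_C[vec_id] = v;
    }
}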

Let’s make nv = 30000 to make the problem 100x larger. Now the difference between first and second kernel runs is negligible:

# nsys nvprof --print-gpu-trace ./t382
WARNING: t382 and any of its children processes will be profiled.

Generating '/tmp/nsys-report-f1a3.qdstrm'
[1/3] [========================100%] report66.nsys-rep
[2/3] [========================100%] report66.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

 Start (ns)   Duration (ns)  CorrId   GrdX   GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd     Device      Ctx  Strm                                         Name     
 -----------  -------------  ------  ------  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------  ---  ----  ----------------------------------------------------------------------------------
 681,129,267      2,864,002     121  30,000     1     1   512     1     1       16         0.000         0.002                                                     NVIDIA L4 (0)    1     7  batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)
 683,993,973      2,845,507     122  30,000     1     1   512     1     1       16         0.000         0.002                                                     NVIDIA L4 (0)    1     7  batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)

Generated:
    /root/bobc/report66.nsys-rep
    /root/bobc/report66.sqlite

Our new kernel duration is 2.8ms and our data size 720MB, for an observed bandwidth of ~257GB/s, just as predicted by bandwidthTest. This suggests to me the kernel code is approximately optimal. Of course, our FP32 delivered rate has dropped by about a factor of 4, again due to the memory-bound nature of the problem.

Yes, I have probably been sloppy in mixing e.g. MiB and MB in my calculations. You could clean that up, but I don’t think the conclusion is any different.

AFAIK most Nvidia GPUs have an L2-to-global-memory bandwidth ratio between 4:1 and 2:1, which also fits the 900 / 250 GB/s. Accelerating memory accesses is only part of L2's job; it also serves as the consistency layer between the SMs, and that second task is probably the more important one.

Perhaps you can fuse kernels or computation steps. How are the vectors generated? Perhaps you do not have to read them from global memory at all.
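For example, if the B vectors could be produced by a cheap formula instead of being stored, a fused kernel would only have to read the A vectors from global memory, roughly halving the traffic. Purely hypothetical sketch; gen_b() is a made-up placeholder, not something from the code in this thread:

// Hypothetical: B is computed on the fly instead of being read from global memory.
// gen_b() is a placeholder for whatever formula would otherwise have produced B.
// Launch like the threaded kernel: <<<num_vectors, threads, threads * sizeof(float)>>>.
__device__ __forceinline__ float gen_b(int vec_id, int i) {
    return 1.0f / (1.0f + (float)i);      // placeholder formula
}

__global__ void fused_dot_product_kernel(
    const float* __restrict__ d_A,
    float* __restrict__ d_C,
    const int vector_size,
    const int num_vectors)
{
    extern __shared__ float sdata[];
    int vec_id = blockIdx.x;
    int tid = threadIdx.x;
    if (vec_id >= num_vectors) return;

    int offset = vec_id * vector_size;
    float sum = 0.0f;
    for (int i = tid; i < vector_size; i += blockDim.x)
        sum += d_A[offset + i] * gen_b(vec_id, i);   // B never touches DRAM

    // shared-memory reduction, same as the threaded kernel (power-of-two blockDim assumed)
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) d_C[vec_id] = sdata[0];
}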

If they come from the CPU or an external device, PCIe speed may be the limiting factor, which is even slower than global memory!
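If the inputs do have to cross PCIe on every call, you can at least overlap the copy of one chunk with the kernel working on the previous chunk within a single process, using pinned host memory and cudaMemcpyAsync on a couple of streams. A rough sketch only: the chunking and the wrapper name are made up, it reuses the threaded kernel from earlier in the thread, and h_A/h_B/h_C must be pinned (cudaMallocHost / cudaHostRegister) for the async copies to actually overlap:

#include <cuda_runtime.h>

// threaded kernel as posted earlier in the thread
__global__ void batch_dot_product_kernel_threaded(const float*, const float*, float*, int, int);

void batched_dot_products_overlapped(const float* h_A, const float* h_B, float* h_C,
                                     float* d_A, float* d_B, float* d_C,
                                     int vector_size, int num_vectors)
{
    const int nstreams = 2;                                     // double buffering
    cudaStream_t streams[nstreams];
    for (int i = 0; i < nstreams; i++) cudaStreamCreate(&streams[i]);

    const int chunk = (num_vectors + nstreams - 1) / nstreams;  // vectors per chunk
    for (int c = 0; c * chunk < num_vectors; c++) {
        int first = c * chunk;
        int count = (num_vectors - first < chunk) ? (num_vectors - first) : chunk;
        size_t off   = (size_t)first * vector_size;
        size_t bytes = (size_t)count * vector_size * sizeof(float);
        cudaStream_t s = streams[c % nstreams];

        // copies and kernel for this chunk are queued on the same stream, so they
        // run in order, but they can overlap with the other stream's work
        cudaMemcpyAsync(d_A + off, h_A + off, bytes, cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(d_B + off, h_B + off, bytes, cudaMemcpyHostToDevice, s);
        batch_dot_product_kernel_threaded<<<count, 256, 256 * sizeof(float), s>>>(
            d_A + off, d_B + off, d_C + first, vector_size, count);
        cudaMemcpyAsync(h_C + first, d_C + first, count * sizeof(float), cudaMemcpyDeviceToHost, s);
    }
    for (int i = 0; i < nstreams; i++) { cudaStreamSynchronize(streams[i]); cudaStreamDestroy(streams[i]); }
}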

If some of the vectors or their elements are reused (e.g. several vectors share elements), you could possibly save bandwidth. The same applies if the vectors are sparse, or if some can be computed on the fly (e.g. if they are coefficients).

Or perhaps FP16 is enough, or a fixed-point format?
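If FP16 inputs were acceptable, the bytes that have to move are cut in half, which for a memory-bound kernel roughly doubles the achievable rate. A minimal, untested sketch that stores the vectors as __half but accumulates in FP32 to limit rounding error:

#include <cuda_fp16.h>

// Sketch: FP16 storage, FP32 accumulation. Launch like the threaded FP32 version,
// i.e. <<<num_vectors, threads, threads * sizeof(float)>>>, power-of-two blockDim assumed.
__global__ void batch_dot_product_kernel_fp16(
    const __half* __restrict__ d_A,
    const __half* __restrict__ d_B,
    float* __restrict__ d_C,
    const int vector_size,
    const int num_vectors)
{
    extern __shared__ float sdata[];
    int vec_id = blockIdx.x;
    int tid = threadIdx.x;
    if (vec_id >= num_vectors) return;

    int offset = vec_id * vector_size;
    float sum = 0.0f;
    for (int i = tid; i < vector_size; i += blockDim.x)
        sum += __half2float(d_A[offset + i]) * __half2float(d_B[offset + i]);

    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) d_C[vec_id] = sdata[0];
}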