How to get the most dot products of batched vectors out of L4 GPU

So I’ve been looking at this:

and some forum posts.

I have a demo application where I can use 1->(lets say 32) CPU processes to issue chunks of work that consist of dot products of (lets say) 300 vector pairs, with 3000 FP32 elements per vector. The 300 vectors are laid out consecutively in host memory.
I tried doing this with blocks and threads and blocks only and threads only.

What I want to optimize is the amount of processes that can be used to saturate the work of the L4 GPU.

Lets take the 300 blocks(==vectors), 1 thread per block regime. Each block works on 3000 element pairs till done.
Experimentally I get a little over 2 processes (each 300 vector-pairs of 3000 elements FP32) to saturate the work that the L4 can do. So normally one process would finish in 20 seconds, 2 at a time would finish also in 20 seconds, and 4 at a time - well it ramps up to 40 seconds. So no reason to issue 4 at a time then I guess?

This makes me think that the L4 GPU can support only 600 independent FP32 calculations at once (I’m probably getting bit by host to device memcpy time maybe?). However looking at the L4 specs, I see 7500 FP32 cuda cores. Which makes me think I could get 10x more work out of it.

How do I improve my dot product throughput (prefer not to use cuBLAS or thrust) on the L4 to be more than 600 FP32 calculations at once?

extern "C" void isccu_batch_dot_product_no_malloc(const float* h_A, const float* h_B, float* h_C, const float* d_A, const float* d_B, float* d_C, const int vector_size, const int num_vectors) {

    size_t vecBytes = num_vectors * vector_size * sizeof(float);
    size_t resultBytes = num_vectors * sizeof(float);

    cudaMemcpy((float *)d_A, h_A, vecBytes, cudaMemcpyHostToDevice);
    cudaMemcpy((float *)d_B, h_B, vecBytes, cudaMemcpyHostToDevice);

    batch_dot_product_kernel<<<num_vectors, 1>>>(d_A, d_B, d_C, vector_size, num_vectors);
    //const int threads_per_block = 256; 
    //int shared_mem_size = threads_per_block * sizeof(float);
    //batch_dot_product_kernel_threaded<<<num_vectors, threads_per_block, shared_mem_size>>>(
    //d_A, d_B, d_C, vector_size, num_vectors);

    cudaMemcpy(h_C, d_C, resultBytes, cudaMemcpyDeviceToHost);
}

// CUDA kernel: each block processes one dot product
__global__ void batch_dot_product_kernel(const float* __restrict__ d_A, const float* __restrict__ d_B, float* __restrict__ d_C, const int vector_size, const int num_vectors) {

   // Calculate global thread index
   int idx = blockIdx.x;
    
   // Check if thread is within valid range
   if (idx < num_vectors)
   {
       float sum = 0.0f;
       // Compute dot product for one vector pair
       for (int i = 0; i < vector_size; i++) {
          sum += d_A[idx * vector_size + i] * d_B[idx * vector_size + i];
       }
       // Store result
       d_C[idx] = sum;
    }

}

when posting code here, please format properly. In a nutshell:

  1. edit your post above, clicking on the pencil icon below it
  2. select the code in the edit window
  3. click the </> button at the top of the edit pane
  4. save your changes

Please do that now.

That’s never a good idea on a CUDA GPU when you are interested in performance/throughput. You may want to get some basic understanding of what makes for high-performance CUDA code. The first four units of this online tutorial could be a good start. Your code design uses only 1 thread per block, which is not how to get performance out of a CUDA GPU.

Relatively higher performance (than what you have shown) batched vector dot product code is demonstrated and discussed in this blog series.

I did try these (below) and I didnt see the process saturation point change.

__global__ void batch_dot_product_kernel_threaded(
    const float* __restrict__ d_A,
    const float* __restrict__ d_B,
    float* __restrict__ d_C,
    const int vector_size,
    const int num_vectors
) {
    extern __shared__ float sdata[];

    int vec_id = blockIdx.x;         // Each block computes one vector's dot product
    int tid = threadIdx.x;           // Thread index within block

    if (vec_id >= num_vectors) return;

    int offset = vec_id * vector_size; // Starting index for this vector
    float sum = 0.0f;

    // Each thread computes a partial sum
    for (int i = tid; i < vector_size; i += blockDim.x) {
        sum += d_A[offset + i] * d_B[offset + i];
    }

    // Store partial sum into shared memory
    sdata[tid] = sum;
    __syncthreads();

    // Reduce partial sums to one value per block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // First thread writes result
    if (tid == 0) {
        d_C[vec_id] = sdata[0];
    }
}

I haven’t studied it carefully, but after a quick look I would say that is a better looking code.

It should not require multiple processes to saturate a GPU, assuming by “process” we are using a typical definition of the word - an operating system process.

Batched vector dot products will be a memory bound operation. Your upper bound on throughput will be the rate at which you are processing vector elements (or vectors, if you prefer), compared to the peak memory bandwidth of the GPU you are operating on. That is how I would approach trying to estimate performance or measure improvement, rather than “process saturation point.”

Here is a quick test of your most recent posted code:

# cat t382.cu
using T = float;
__global__ void batch_dot_product_kernel_threaded(
    const T* __restrict__ d_A,
    const T* __restrict__ d_B,
    T* __restrict__ d_C,
    const int vector_size,
    const int num_vectors
) {
    extern __shared__ T sdata[];

    int vec_id = blockIdx.x;         // Each block computes one vector's dot product
    int tid = threadIdx.x;           // Thread index within block

    if (vec_id >= num_vectors) return;

    int offset = vec_id * vector_size; // Starting index for this vector
    T sum = 0.0f;

    // Each thread computes a partial sum
    for (int i = tid; i < vector_size; i += blockDim.x) {
        sum += d_A[offset + i] * d_B[offset + i];
    }

    // Store partial sum into shared memory
    sdata[tid] = sum;
    __syncthreads();

    // Reduce partial sums to one value per block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // First thread writes result
    if (tid == 0) {
        d_C[vec_id] = sdata[0];
    }
}

int main(){
  const int bs = 512;
  T *d_A, *d_B, *d_C;
  const int nv = 300;
  const int vs = 3000;
  const int ds = sizeof(T)*nv*vs;
  cudaMalloc(&d_A,ds);
  cudaMalloc(&d_B,ds);
  cudaMalloc(&d_C,sizeof(T)*nv);
  batch_dot_product_kernel_threaded<<<nv, bs, bs*sizeof(T)>>>(d_A, d_B, d_C, vs, nv);
  batch_dot_product_kernel_threaded<<<nv, bs, bs*sizeof(T)>>>(d_A, d_B, d_C, vs, nv);
  cudaDeviceSynchronize();
}

# nvcc -o t382 t382.cu -arch=sm_89 -lineinfo
# nsys nvprof --print-gpu-trace ./t382
WARNING: t382 and any of its children processes will be profiled.

Generating '/tmp/nsys-report-9437.qdstrm'
[1/3] [========================100%] report65.nsys-rep
[2/3] [========================100%] report65.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

 Start (ns)   Duration (ns)  CorrId  GrdX  GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd     Device      Ctx  Strm                                         Name       
 -----------  -------------  ------  ----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------  ---  ----  ----------------------------------------------------------------------------------
 674,706,950         41,312     121   300     1     1   512     1     1       16         0.000         0.002                                                     NVIDIA L4 (0)    1     7  batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)
 674,748,902          8,160     122   300     1     1   512     1     1       16         0.000         0.002                                                     NVIDIA L4 (0)    1     7  batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)

Generated:
    /root/bobc/report65.nsys-rep
    /root/bobc/report65.sqlite
#
# ./junk/cuda-samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA L4
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     8.5

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     253.6

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
#

On the second kernel run, we can see the kernel duration is about 8 microseconds. This is really almost too short to be measuring (too small a problem size to get a good idea of GPU capability) but the calculation would look like this:

L4 peak memory bandwidth as measured by bandwidthTest: ~250GB/s

The kernel must read two float vectors of length 3000 elements for each of 300 dot-products. The kernel also has to write some data but its small by comparison so we will ignore that.

3000x2x4x300 = 7,200,000 bytes read by the kernel in ~8us

So that works out to an observed bandwidth of about 900GB/s which is well above what you can achieve on the L4 (operating from GPU DRAM). The explanation for this (I believe) is that this data set is so small it is fitting in the L2 cache (48MB on L4) which has higher bandwidth than main memory. In any event, that vector-dot-product processing rate works out to 37M dot-products per second. If we convert that to FP32 multiply ops per second, it is 112GF/s. This is a tiny fraction of the L4 FP32 rate of ~30TF/s, but that is due to the memory-bound nature of this problem, as a first-order factor, even operating out of L2.

Anyway I suspect your most recent posted code is “pretty good”.

Let’s make nv = 30000 to make the problem 100x larger. Now the difference between first and second kernel runs is negligible:

# nsys nvprof --print-gpu-trace ./t382
WARNING: t382 and any of its children processes will be profiled.

Generating '/tmp/nsys-report-f1a3.qdstrm'
[1/3] [========================100%] report66.nsys-rep
[2/3] [========================100%] report66.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

 Start (ns)   Duration (ns)  CorrId   GrdX   GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd     Device      Ctx  Strm                                         Name     
 -----------  -------------  ------  ------  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------  ---  ----  ----------------------------------------------------------------------------------
 681,129,267      2,864,002     121  30,000     1     1   512     1     1       16         0.000         0.002                                                     NVIDIA L4 (0)    1     7  batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)
 683,993,973      2,845,507     122  30,000     1     1   512     1     1       16         0.000         0.002                                                     NVIDIA L4 (0)    1     7  batch_dot_product_kernel_threaded(const float *, const float *, float *, int, int)

Generated:
    /root/bobc/report66.nsys-rep
    /root/bobc/report66.sqlite

Our new kernel duration is 2.8ms and our data size 720MB, for an observed bandwidth of ~257GB/s, just as predicted by bandwidthTest. This suggests to me the kernel code is approximately optimal. Of course, our FP32 delivered rate has dropped by about a factor of 4, again due to the memory-bound nature of the problem.

Yes, I have probably been sloppy in mixing e.g. MiB and MB in my calculations. You could clean that up, but I don’t think the conclusion is any different.

AFAIK most Nvidia GPUs have a L2 to global memory bandwidth ratio between 4:1 and 2:1. Which also fits the 900 / 250 GB/s. Accelerating memory is only part of the task of L2, it is also used as consistency layer between the SMs. And the second task probably is the more important one.

Perhaps you can fusion kernels or computation steps. How are the vectors generated? Perhaps you do not have to read them from global memory.

If they come from the CPU or an external device, PCIe speed may be the limiting factor, which is even slower than global memory!

If some of the vectors or their elements are reused (e.g. several vectors share elements), you possibly could save bandwidth. The same, if the vectors are sparse, or some can be computed on-the-fly (e.g. if they are coefficients).

Or perhaps FP16 is enough, or a fixed-point format?

Thank you. I’'ll have to think carefully about what you wrote. Where the vector comes from depends on the use case. Sometimes its disk, sometimes its cpu memory. Its not a zero copy situation so I will try and cut down the memory copying.

Do you think using “float4” to load 16 byte aligned data will improve the dot product saturation potential of the L4 GPU? I will try it since it seems easy to do.

So, in terms of timings: 1 FP32 dot product FMA op takes 2*4 bytes/“time unit” of host to device bandwidth. Both the FMA calculation time and the result writeback are negligible. The memory is moved from CUDA device memory to the cuda processor at a much higher and therefore negligible rate.

So lets say we have 9 GB/sec of bandwidth to add up to.

How do we get from 9 GB/sec (and L4 GPU) to 600 cuda processors fed at once (the saturation point) which I experimentally measured? How many cuda processors do you think you should be able to feed at once before we start running out of host to device bandwidth?

Actually I think its 13.5 in my case, not 9:

~/cuda-samples/cuda-samples/Samples/1_Utilities/bandwidthTest$ ./bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA L4
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.5

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			245.8

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
~/cuda-samples/cuda-samples/Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA L4"
  CUDA Driver Version / Runtime Version          12.4 / 12.0
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 22478 MBytes (23570219008 bytes)
  (058) Multiprocessors, (128) CUDA Cores/MP:    7424 CUDA Cores
  GPU Max Clock rate:                            2040 MHz (2.04 GHz)
  Memory Clock rate:                             6251 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 50331648 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 49 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.4, CUDA Runtime Version = 12.0, NumDevs = 1
Result = PASS

As Robert mentioned you have 30TF/s for 58 SMs, that means with 9 GB/s and 8 byte/FMA you can fill 0.002175 SMs or 0.2784 Cuda cores for a saturation point.

If you not only use the normal Cuda cores, but also the Tensor Cores (you need several Tensor Core operations per dot product to get to FP32 accuracy), then you can get even more dot products computed.

But the real limit is host->device bandwidth.

As a general software design principle, if one performs data processing where performance is bound by memory throughput, moving that processing to the GPU is not helpful unless the source data is already resident on the GPU. Adding additional data movement to a task already bound by memory bandwidth constraints is counterproductive, especially when this involves a severe bottleneck such as PCIe interconnect or access to SSD.

Performing a roofline analysis can be helpful in assessing the situation.

One more question - that host to device bandwidth of 13.5 GB/sec - it was on a rented AWS instance. But it says pinned memory. Why isn’t it anywhere close to the 32 gigabytes per second you would expect from 16 lanes of PCIE 4.0 ? Do I need to boot ubuntu 24 with big page table entries or some other setting? Also, on windows, what would be needed to get close to 16 lanes of PCIE 4.0 == 32 Gigabytes/sec?

Is there a planned PCIE 5.0 equivalent of the L4 coming out? Something that would show up on AWS?
We can’t really afford H100 or H200.

31:00.0 3D controller: NVIDIA Corporation AD104GL [L4] (rev a1)
	Subsystem: NVIDIA Corporation AD104GL [L4]
	Physical Slot: 3-1
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 254
	NUMA node: 0
	Region 0: Memory at cd000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 20800000000 (64-bit, prefetchable) [size=32G]
	Region 3: Memory at 20032000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM not supported
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x8 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq+ OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer+ 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
		Vector table: BAR=0 offset=00b90000
		PBA: BAR=0 offset=00ba0000
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nvidia_drm, nvidia

ubuntu@ip-172-31-17-172:~/cuda-samples/cuda-samples/Samples/1_Utilities/bandwidthTest$ 
ubuntu@ip-172-31-17-172:~/cuda-samples/cuda-samples/Samples/1_Utilities/bandwidthTest$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA L4
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.5

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			254.5

Result = PASS

N

13.5 GB/s sounds like you are getting Gen3 speed (or possibly a x8 link at gen4, see below). In my opinion, nvidia-smi -a output makes it easier to diagnose link issues. This:

makes it look like the server might not be providing a full x16 Gen4 link, but that may also be just power management in the server. It shouldn’t require any special OS settings to get full speed out of the PCIE link, in my experience.

The blackwell generation of GPUs will offer PCIE 5.0 in certain/various configurations. For example RTX Pro 6000 Blackwell Server edition offers PCIE Gen5 link/capability. This is a relatively recent product announcement, so I’m not sure about cloud availability yet. I wouldn’t be able to comment on future/unannounced products.

Any advice on how to get it to do 16 lanes?

I rented AWS G6 instance type and put ubuntu 24 on it.

ubuntu@ip-172-31-17-172:~/cuda-samples/cuda-samples/Samples/1_Utilities/bandwidthTest$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Wed May 28 16:00:43 2025
Driver Version                            : 550.144.03
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:31:00.0
    Product Name                          : NVIDIA L4
    Product Brand                         : NVIDIA
    Product Architecture                  : Ada Lovelace
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1651524060865
    GPU UUID                              : GPU-80ad7f9e-f248-7048-de85-f92d065f1b91
    Minor Number                          : 0
    VBIOS Version                         : 95.04.65.00.37
    MultiGPU Board                        : No
    Board ID                              : 0x3100
    Board Part Number                     : 900-2G193-A800-001
    GPU Part Number                       : 27B8-895-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G193.0200.00.01
        OEM Object                        : 2.1
        ECC Object                        : 6.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : 1970/01/02 10:36:35.905
        Latest Duration                   : 102279 us
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    GSP Firmware Version                  : 550.144.03
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x31
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x27B810DE
        Bus Id                            : 00000000:31:00.0
        Sub System Id                     : 0x16CA10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 350 KB/s
        Rx Throughput                     : 300 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 23034 MiB
        Reserved                          : 556 MiB
        Used                              : 1 MiB
        Free                              : 22479 MiB
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 1 MiB
        Free                              : 32767 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
            SRAM Threshold Exceeded       : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : 0
            SRAM SM                       : 0
            SRAM Microcontroller          : 0
            SRAM PCIE                     : 0
            SRAM Other                    : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 96 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 31 C
        GPU T.Limit Temp                  : 49 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
    GPU Power Readings
        Power Draw                        : 16.27 W
        Current Power Limit               : 72.00 W
        Requested Power Limit             : 72.00 W
        Default Power Limit               : 72.00 W
        Min Power Limit                   : 40.00 W
        Max Power Limit                   : 72.00 W
    GPU Memory Power Readings 
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 765 MHz
    Applications Clocks
        Graphics                          : 2040 MHz
        Memory                            : 6251 MHz
    Default Applications Clocks
        Graphics                          : 2040 MHz
        Memory                            : 6251 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2040 MHz
        SM                                : 2040 MHz
        Memory                            : 6251 MHz
        Video                             : 1770 MHz
    Max Customer Boost Clocks
        Graphics                          : 2040 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 655.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes                             : None


my guess would be that is a limitation of the server design. The AWS G6 instance type can support up to 8 L4 GPUs. They may have designed the server in such a way that only 8 lanes per GPU are provided. That’s just a guess. Only AWS could confirm/refute, and they don’t seem to document this aspect in their instance type documentation that I could find.

Can you re-run nvidia-smi -a with the GPU under load and look at the PCIe data?

The listing you show is with the card idle and both the Generation and width are reduced as a power saving measure.