Trouble Reaching Peak Bandwidth of A100

Hello,

I’m trying to write a test kernel that uses a pointer-chasing technique to measure the peak read bandwidth of DRAM on my GPU. However, I’m stuck at 1.80 TB/s rather than the 2.04 TB/s stated in the data sheet for my A100 80GB SXM. I computed the bandwidth from the number of load operations in the kernel and the kernel’s execution time (timed with CUDA events). I didn’t account for the latency of the kernel launch, but according to this post it should be negligible compared to the overall run time (tens of ms). I also profiled the kernel with Nsight Compute (NCU), and it too agrees that I am at 88% of peak bandwidth.
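For reference, the measurement itself looks roughly like this (a simplified sketch of my harness; num_blocks, total_loads, and the ILP value of 4 are placeholders rather than my exact settings, and benchmark_kernel_double is the kernel shown further down):

// Sketch of the measurement (placeholder values, not my exact harness):
// time the kernel with CUDA events and divide the bytes loaded by the elapsed time.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
benchmark_kernel_double<4><<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length, iters);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// total_loads = loads issued per thread * total number of threads; each load reads an 8-byte double
double bytes = (double)total_loads * sizeof(double);
double bw_gb_per_s = bytes / (ms * 1.0e6);   // bytes per millisecond -> GB/s
printf("bandwidth = %.2f GB/s\n", bw_gb_per_s);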


Questions

  1. Is there an example of a kernel that achieves the advertised 2.0 TB/s for A100?

In this GTC talk from earlier this year, the maximum memory bandwidth achieved by any kernel was around 90% of peak. The authors of this micro-benchmark also report a similar bandwidth for their read kernel on the A100. If you know of other relevant examples, I’d be grateful to see them.

  2. Is there something wrong in my kernel or setup that would prevent me from getting the maximum bandwidth?

I have tried to do my due diligence on this front, but it remains likely that the issue is in my code. Below are the benchmark kernel and the kernel that initializes the data.

#define ELEMENTS_PER_THREAD (32)

template <unsigned int ilp>
__global__ void benchmark_kernel_double(uintptr_t *addr_chain, long long int chain_length, int iters){
    // this kernel is useful for benchmarking the latency due to a mix of loads and math

    long long int idx_0 = THREADS_PER_BLOCK*chain_length*blockIdx.x + threadIdx.x;
    long long int idxs[ilp];
    double rolling_sums[ilp];

    int sub_chain_length = chain_length / ilp;

    for (int i = 0; i < ilp; i++){
        idxs[i] = idx_0 + (i * sub_chain_length)*THREADS_PER_BLOCK;
        rolling_sums[i] = __longlong_as_double((long long)&addr_chain[idxs[i]]);
    }

    uintptr_t addrs[ilp];

    int outer_iterations = iters/ilp;

    #pragma unroll 1
    for(int k=0; k<outer_iterations; k++){

        #pragma unroll
        for(int j=0; j<ELEMENTS_PER_THREAD; j++){

            // Load elements (memory-intensive part).
            // Reinterpret the rolling_sum as an address; this is OK because it was
            // initialized as an address and its value has not been changed by any math.
            for (int i = 0; i < ilp; i++){
                addrs[i] = __double_as_longlong(rolling_sums[i]);
                rolling_sums[i] = *(double *)addrs[i];
            }

        }
    }

    // If this were not here, the compiler would realize that the kernel does nothing
    // and eliminate the loads.
    for (int i = 0; i < ilp; i++){
        if( rolling_sums[i] < 0.0 ){ // designed so it never executes
            addr_chain[idxs[i]] = (uintptr_t)__double_as_longlong(rolling_sums[i]);
        }
    }
}

__global__ void setup_pchase(uintptr_t* addr_chain, long long int chain_length){

    // each block's chain covers a width of blockDim.x*chain_length elements

    long long int idx_0 = THREADS_PER_BLOCK*chain_length*blockIdx.x + threadIdx.x;
    long long int idx = idx_0;
    long long int idx_next;

    for(int j=0; j<(chain_length-1); j++){
        idx_next = idx + THREADS_PER_BLOCK;
        addr_chain[idx] = (uintptr_t)&addr_chain[idx_next];
        idx = idx_next;
    }

    // the last element has no successor, so point it back to the start
    // so that the sequence of addresses loops around
    addr_chain[idx] = (uintptr_t)&addr_chain[idx_0];
}
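For completeness, the host side is roughly as follows (a simplified sketch; the grid size, chain length, and iteration count are illustrative rather than my exact values, and THREADS_PER_BLOCK is the same macro the kernels use):

// Host-side sketch (illustrative sizes, not my exact configuration).
const int       num_blocks   = 4096;                 // placeholder grid size
const long long chain_length = 1 << 14;              // elements in each thread's chain
const long long n_elements   = (long long)num_blocks * THREADS_PER_BLOCK * chain_length;
const int       iters        = 1 << 16;              // placeholder; loads per thread scale with this

uintptr_t *addr_chain;
cudaMalloc(&addr_chain, n_elements * sizeof(uintptr_t));

// Build the per-thread pointer chains once.
setup_pchase<<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length);
cudaDeviceSynchronize();

// Chase the chains (timed with events as described above).
benchmark_kernel_double<4><<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length, iters);
cudaDeviceSynchronize();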

What I have considered

dynamic clock

My first thought was that dynamic clock scaling was the culprit. Unfortunately, I don’t have permission to set the clocks on this system via nvidia-smi. I tried modifying the kernel to load the array 100 times and launching 20 instances of the same kernel back to back, but the measured bandwidth of each instance was not significantly different. I monitored the clocks during this with nvidia-smi -q -d CLOCK --loop-ms=100, but they didn’t appear to change. Perhaps this is because the memory clock was already at its maximum?
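The repeated-launch check was along these lines (a sketch); the nvidia-smi output below was captured while it ran:

// Launch the same kernel 20 times back to back and time each instance
// separately, to see whether later instances (with clocks warmed up) run faster.
for (int rep = 0; rep < 20; rep++){
    cudaEventRecord(start);
    benchmark_kernel_double<4><<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("instance %d: %.3f ms\n", rep, ms);
}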

==============NVSMI LOG==============

Timestamp                                 : Mon Jul 14 18:52:23 2025
Driver Version                            : 575.57.08
CUDA Version                              : 12.9

Attached GPUs                             : 1
GPU 00000000:C7:00.0
    Clocks
        Graphics                          : 1155 MHz
        SM                                : 1155 MHz
        Memory                            : 1593 MHz
        Video                             : 1050 MHz
    Applications Clocks
        Graphics                          : 1155 MHz
        Memory                            : 1593 MHz
    Default Applications Clocks
        Graphics                          : 1155 MHz
        Memory                            : 1593 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1593 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    SM Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Memory Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A

However, as I understand it, NCU fixes the clocks to reduce the variability between samples. The fact that NCU’s measurement of the memory bandwidth relative to peak matched mine suggests that this is probably not due to clock scaling.

not enough concurrency

To test this theory I modified the kernel to use variable levels of instruction-level parallelism (ILP). At ILP=1 the kernel achieves full occupancy (64 warps per SM) and chases one double load per warp at a time, so the “bytes in flight per SM” is 0.5 KiB. I increased the ILP from 1 through 16, for a maximum “bytes in flight per SM” of 4 KiB. At high levels of ILP, register usage limited the occupancy of the kernel.

ILP   Occupancy (warps/SM)   Bytes in flight per SM (KiB)   Bandwidth (GB/s)
 1             64                      0.5                      1758.82
 2             64                      1                        1810.52
 4             64                      2                        1771.17
 8             48                      3                        1765.74
16             32                      4                        1751.57
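For reference, the ILP sweep is just a set of separate template instantiations selected on the host, roughly like this (sketch):

// The template parameter must be a compile-time constant, so each ILP level
// is a separate instantiation of benchmark_kernel_double.
switch (ilp){
    case 1:  benchmark_kernel_double<1> <<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length, iters); break;
    case 2:  benchmark_kernel_double<2> <<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length, iters); break;
    case 4:  benchmark_kernel_double<4> <<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length, iters); break;
    case 8:  benchmark_kernel_double<8> <<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length, iters); break;
    case 16: benchmark_kernel_double<16><<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length, iters); break;
}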

extra instructions

I tried my best to get a nice clean stream of LDGs in the SASS. According to NCU 99.6% of the instructions are LDs.

waves

I tweaked the launch configuration so that the grid runs in 22.99 waves per SM (the closest to an integer number I could get without going over), to help rule out a tail effect from a partial final wave.
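For context, the wave count was computed roughly as follows (a sketch; the A100 has 108 SMs, and num_blocks is a placeholder for my grid size):

// Rough wave calculation: waves = blocks launched / (SMs * resident blocks per SM).
int blocks_per_sm = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                              benchmark_kernel_double<4>,
                                              THREADS_PER_BLOCK, 0);
double waves = (double)num_blocks / (108.0 * blocks_per_sm);   // target: just under an integer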

remaining thoughts about what could be going wrong

  1. This is actually an issue with dynamic clock scaling, and my naive attempts to test for it did not work.
  2. The large number of memory requests is causing a lot of contention in the memory system. To reach 2.0 TB/s, perhaps you need to use wide loads to reduce the overall number of requests (a rough sketch of what I mean is shown below).
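To illustrate point 2, a wide-load variant might look something like the following (untested sketch, not part of my current benchmark; it streams 16-byte double2 loads instead of chasing 8-byte pointers):

// Untested sketch: stream the array with 16-byte (double2) loads, halving the
// number of load requests needed to move the same number of bytes.
__global__ void wide_read_kernel(const double2 * __restrict__ in, long long n, double *out){
    long long i      = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    long long stride = (long long)gridDim.x * blockDim.x;
    double acc = 0.0;
    for (; i < n; i += stride){
        double2 v = in[i];             // one 128-bit load per iteration
        acc += v.x + v.y;
    }
    if (acc < 0.0) out[0] = acc;       // keeps the loads from being optimized away
}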

Thank you for your help,
Josh

The advertised number is a peak theoretical number. It cannot be achieved in actual measurement for various reasons. 1.80/2.0 = 90% is a reasonable upper bound on what can be achieved in actual code/measurement. This is common for many GPUs and is not unique/specific to the A100.

I think it would be correct to say that nobody in the history of CUDA has ever demonstrated a kernel that achieves peak theoretical bandwidth.

The theoretical memory bandwidth calculation is as follows:

5 HBM stacks x 1024 bits/stack x 1593 MHz memory clock x 2 transfers/clock (double-pumped) / 8 bits/byte = 2.039 TB/s.

There are 16GB/stack for a total of 80GB.


If (1) 2.04 TB/sec is the theoretical bandwidth based on the product of the number of memory channels × the bit width of each memory channel × the operating frequency of each memory channel × the DDR/QDR transmission characteristics; and (2) 1.80 TB/s is the maximum practically achievable measured transfer rate, then I would say this is exactly what you should expect:

For all processor-attached bulk memory implementations, measured maximum throughput is typically 85% to 90% of theoretical maximum.


On one hand, I feel glad that I can stop hunting for a problem; on the other, I feel a bit misled by the official documentation.

Can you provide any more insight into why you would expect the practical limit to be less than the theoretical one?

Prepare for much disappointment with every data sheet or specification that you will come across. In my four decades of experience, they all state theoretical throughput numbers.

Not really. It has been 30 years since I last needed to model the bulk memory behavior of an x86 processor. In terms of the “pipe” connecting two endpoints, there is an assumption of error-free transmission, and in terms of the endpoints, assumptions like zero-cycle context switching and infinitely deep buffers come to mind. None of these assumptions holds up in reality.

I appreciate your insights here. I guess I was naive to take the data sheet at face value, although I’m glad to have reassurance that I wasn’t missing something.

Thank you both for your timely responses.

You could try a kernel that does simpler accesses than the pointer-chasing technique: just access large amounts of memory inside a loop, which can be unrolled. That could lead to more optimized compilation and better-filled pipelines.

Since you can already see the current ~90% of maximum throughput in Nsight Compute, you could compare the two directly.

If you see a higher throughput, then the gap depended on your kernel and its access profile.
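Something along these lines (a rough, untested sketch):

// Rough, untested sketch: each thread reads the array with a grid-stride loop;
// the independent loads give the compiler room to unroll and fill the pipeline.
__global__ void simple_read_kernel(const double * __restrict__ in, long long n, double *out){
    long long idx    = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    long long stride = (long long)gridDim.x * blockDim.x;
    double sum = 0.0;
    #pragma unroll 4
    for (long long i = idx; i < n; i += stride){
        sum += in[i];                  // independent loads, no pointer chasing
    }
    if (sum < 0.0) out[0] = sum;       // prevents dead-code elimination
}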

Some of the possible contributors to measured DRAM bandwidth not matching the theoretical peak:

  1. DRAM access generally does not permit retrieval of data (or transmission of payload data) on every clock cycle. Some of the cycles may be used for other activities such as refresh, row/column selection, precharge, etc.
  2. Some of the payload cycles may be used for other activities, including but not limited to instruction fetch, TLB access, or other activities needed from time to time by the GPU.
  3. Depending on the DRAM design, some payload data may be consumed by ECC.
  4. The DRAM bus cannot typically “turn around” in zero cycles. Therefore any activity that mixes read cycles with write cycles may result in further inefficiency.

I’m not suggesting all or any of these are explanations for this specific case. I certainly acknowledge that people will have reasons why any or all of these don’t apply, or don’t explain the difference between 90% and 100%. I don’t have detailed technical writeups to support these claims; I understand folks may be skeptical. This article discusses various ideas/opinions, not specific to GPUs.
