Hello,
I’m trying to write a test kernel that uses a pointer-chasing technique to measure the peak DRAM read bandwidth of my GPU. However, I’m stuck at 1.80 TB/s rather than the 2.04 TB/s stated in the data sheet for my A100 80GB SXM. I computed the bandwidth from the number of load operations in the kernel and the kernel’s execution time (timed with CUDA events). I didn’t account for the latency of the kernel launch, but according to this post it should be negligible compared to the overall run time (tens of ms). I also profiled the kernel with Nsight Compute (NCU), and it agrees that I am at 88% of peak bandwidth.
Questions
- Is there an example of a kernel that achieves the advertised 2.0 TB/s for A100?
In this GTC talk from earlier this year, the maximum memory bandwidth achieved by any kernel was around 90% of peak. Data from this micro-benchmark also shows a similar bandwidth for their read kernel on the A100. If you know of other relevant examples, I’d be grateful to see them.
- Is there something wrong in my kernel or setup that would prevent me from getting the maximum bandwidth?
I have tried to do my due diligence on this front, but it remains likely that the issue is in my code. Below are the benchmark kernel and the kernel that initializes the pointer chain.
#include <cstdint>

#define THREADS_PER_BLOCK (128) // assumed block size for this listing
#define ELEMENTS_PER_THREAD (32)

template <unsigned int ilp>
__global__ void benchmark_kernel_double(uintptr_t *addr_chain, long long int chain_length, int iters){
    // this kernel is useful for benchmarking the latency due to a mix of loads and math
    long long int idx_0 = THREADS_PER_BLOCK*chain_length*blockIdx.x + threadIdx.x;
    long long int idxs[ilp];
    double rolling_sums[ilp];
    int sub_chain_length = chain_length / ilp;
    for (int i = 0; i < ilp; i++){
        idxs[i] = idx_0 + (i * sub_chain_length)*THREADS_PER_BLOCK;
        rolling_sums[i] = __longlong_as_double((long long)&addr_chain[idxs[i]]);
    }
    uintptr_t addrs[ilp];
    int outer_iterations = iters/ilp;
    #pragma unroll 1
    for(int k=0; k<outer_iterations; k++){
        #pragma unroll
        for(int j=0; j<ELEMENTS_PER_THREAD; j++){
            // Load elements (memory-intensive part).
            // Reinterpret the rolling_sum as an address. This is ok because it
            // was initialized as an address and we have not changed the value
            // with our math operations.
            for (int i = 0; i < ilp; i++){
                addrs[i] = __double_as_longlong(rolling_sums[i]);
                rolling_sums[i] = *(double *)addrs[i];
            }
        }
    }
    // If this were not here, the compiler would realize that the kernel does
    // nothing and eliminate the loads. The branch is designed to never execute.
    for (int i = 0; i < ilp; i++){
        if( rolling_sums[i] < 0.0 ) {
            addr_chain[idxs[i]] = (uintptr_t)rolling_sums[i];
        }
    }
}
__global__ void setup_pchase(uintptr_t* addr_chain, long long int chain_length){
    // each block will read a width of blockDim.x*chain_length
    long long int idx_0 = THREADS_PER_BLOCK*chain_length*blockIdx.x + threadIdx.x;
    long long int idx = idx_0;
    long long int idx_next;
    for(int j=0; j<(chain_length-1); j++){
        idx_next = idx + THREADS_PER_BLOCK;
        addr_chain[idx] = (uintptr_t)&addr_chain[idx_next];
        idx = idx_next;
    }
    // The address past the end does not exist, so the last element is changed
    // to point back to the start of the chain.
    addr_chain[idx] = (uintptr_t)&addr_chain[idx_0];
}
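For reference, the host-side harness looks roughly like this; the grid size, chain length, and iteration count below are placeholder values, not the exact ones from my runs:

#include <cstdio>

int main(){
    const int num_blocks = 3456;          // placeholder value
    const long long chain_length = 2048;  // placeholder; working set must far exceed the 40 MB L2
    const int iters = 4096;               // placeholder; must be a multiple of ILP
    const unsigned int ILP = 4;           // chain_length must also be divisible by ILP

    long long n = (long long)num_blocks * THREADS_PER_BLOCK * chain_length;
    uintptr_t *addr_chain;
    cudaMalloc(&addr_chain, n * sizeof(uintptr_t));
    setup_pchase<<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    benchmark_kernel_double<ILP><<<num_blocks, THREADS_PER_BLOCK>>>(addr_chain, chain_length, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each thread performs iters * ELEMENTS_PER_THREAD eight-byte loads.
    double bytes = (double)num_blocks * THREADS_PER_BLOCK * iters
                 * ELEMENTS_PER_THREAD * sizeof(double);
    printf("bandwidth: %.2f GB/s\n", bytes / (ms * 1e6));

    cudaFree(addr_chain);
    return 0;
}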
What I have considered
dynamic clock
My first thought was that dynamic clock scaling was the culprit. Unfortunately I don’t have permission to set the clocks on the system via nvidia-smi. I tried modifying the kernel to load the array 100 times and launching 20 instances of the same kernel, but the measured bandwidth of each was not significantly different. I monitored the clocks during this with nvidia-smi -q -d CLOCK --loop-ms=100, but they didn’t appear to change. Perhaps this is because the memory clock was already at its maximum?
==============NVSMI LOG==============

Timestamp                                 : Mon Jul 14 18:52:23 2025
Driver Version                            : 575.57.08
CUDA Version                              : 12.9

Attached GPUs                             : 1
GPU 00000000:C7:00.0
    Clocks
        Graphics                          : 1155 MHz
        SM                                : 1155 MHz
        Memory                            : 1593 MHz
        Video                             : 1050 MHz
    Applications Clocks
        Graphics                          : 1155 MHz
        Memory                            : 1593 MHz
    Default Applications Clocks
        Graphics                          : 1155 MHz
        Memory                            : 1593 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1593 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    SM Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Memory Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
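As a side note, the clocks can also be sampled programmatically with NVML while the benchmark runs, instead of the nvidia-smi loop. A minimal sketch (link with -lnvidia-ml; the 100 ms / 100-sample loop here is just illustrative):

#include <nvml.h>
#include <cstdio>
#include <unistd.h>

int main(){
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    // Sample the SM and memory clocks every 100 ms, like the nvidia-smi loop above.
    for (int i = 0; i < 100; i++){
        unsigned int sm_mhz = 0, mem_mhz = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_mhz);
        printf("SM: %u MHz  Memory: %u MHz\n", sm_mhz, mem_mhz);
        usleep(100 * 1000);
    }
    nvmlShutdown();
    return 0;
}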
However, as I understand it, NCU locks the clocks to reduce variability between samples. The fact that NCU’s measurement of memory bandwidth relative to peak matched mine suggests that clock scaling is not the cause.
not enough concurrency
To test this theory I modified the kernel to use variable levels of instruction-level parallelism (ILP). At ILP=1 the kernel achieves full occupancy (64 warps per SM) and keeps one double load in flight per warp at a time, so the “bytes in flight per SM” is 0.5 KiB. I increased the ILP from 1 through 16, for a maximum “bytes in flight per SM” of 4 KiB.
At high levels of ILP, register usage limited the kernel’s occupancy; a rough estimate of the required concurrency follows the table.
| ILP | Occupancy (warps/SM) | Bytes in flight per SM (KiB) | Bandwidth (GB/s) |
|---|---|---|---|
| 1 | 64 | 0.5 | 1758.82 |
| 2 | 64 | 1 | 1810.52 |
| 4 | 64 | 2 | 1771.17 |
| 8 | 48 | 3 | 1765.74 |
| 16 | 32 | 4 | 1751.57 |
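As a sanity check on these concurrency numbers: Little’s law says the bytes in flight needed to sustain a given bandwidth is bandwidth × latency. Assuming a DRAM load latency of roughly 500 ns (an assumption on my part), 2039 GB/s × 500 ns ≈ 1.02 MB in flight for the whole GPU, or about 9.2 KiB per SM across the A100’s 108 SMs. Whether the table clears that bar depends on the accounting: the table counts 8 bytes per warp per outstanding load, while a warp-level load of one double per thread actually moves 32 × 8 = 256 bytes.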
extra instructions
I tried my best to get a clean stream of LDGs in the SASS. According to NCU, 99.6% of the executed instructions are LDs.
waves
I tweaked the launch so that the grid runs in 22.99 waves per SM (the closest I could get to an integer without going over), to help rule out a tail effect from a mostly-empty final wave.
remaining thoughts about what could be going wrong
- This is actually an issue with dynamic clock scaling, and my naive attempts to test for it did not work.
- The large number of memory requests is causing contention in the memory system. To reach 2.0 TB/s, perhaps one needs wide loads to reduce the overall number of requests; a sketch of what I mean follows.
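For concreteness, this is roughly the kind of kernel I mean: a plain streaming read (not a pointer chase) using 16-byte vector loads, so each warp-level load instruction moves twice as many bytes. The kernel name and grid-stride pattern here are just illustrative:

__global__ void wide_read_kernel(const double2 *__restrict__ src,
                                 long long n, double *sink){
    long long stride = (long long)gridDim.x * blockDim.x;
    double acc = 0.0;
    for (long long i = (long long)blockIdx.x*blockDim.x + threadIdx.x; i < n; i += stride){
        double2 v = src[i];   // one 16-byte load per iteration
        acc += v.x + v.y;     // keep the loads live
    }
    // One 8-byte store per thread: negligible traffic next to the reads,
    // but it stops the compiler from eliminating the loop.
    sink[(long long)blockIdx.x*blockDim.x + threadIdx.x] = acc;
}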
Thank you for your help,
Josh

