I am hitting a bottleneck when accessing data from global memory. I am working with a GTX 980 card. Although the documentation states that the L2 cache transaction size is 32 bytes, I am not able to reach anywhere near the theoretical global memory bandwidth unless I am loading coalesced blocks of at least 256 bytes (and I am pretty sure this would also bottleneck if the memory bandwidth were high enough).
From my measurements it looks like the bottleneck is at the level of the GPU as a whole: the throughput seems to be limited to 1 transaction (one coalesced patch of memory) per clock cycle, irrespective of how big that transaction is. At a core clock of roughly 1.2 GHz, one transaction per cycle would cap the throughput at about 38 GB/s for 32-byte patches and about 154 GB/s for 128-byte patches - only with 256-byte patches does that cap exceed the memory bandwidth.
I came to this conclusion because:
changing the memory clock makes no difference unless it is so low that it becomes the limiting factor
changing the GPU clock changes the throughput close to linearly
increasing the number of independent load requests has no effect, so latency is not the cause of the problem
issuing only 8 blocks - so that only the first half of the 16 SMs (as shown by Nsight) is executing - again makes no change (this suggests that the bottleneck is not at the SM or GPC level either)
So… can anyone tell me what the bottleneck is? Is it documented somewhere? Is there any way to somehow work around this (I do not have much hope that it is possible)?
I am providing the code so you can try it out yourself.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#define DATA_SIZE (1 << 30)
#define DATA_ACCESSES (1 << 6)
#define BLOCK_SIZE 128
#define BLOCKS_COUNT 1024
template<int COUNT, int PAGE_SIZE, typename T>
__launch_bounds__(BLOCK_SIZE, 3)
__global__ void kernel(T *data)
{
const int pages_count = DATA_SIZE / (PAGE_SIZE * sizeof(T));
const int bid = (blockIdx.x * BLOCK_SIZE + threadIdx.x) / PAGE_SIZE;
const int tid = (blockIdx.x * BLOCK_SIZE + threadIdx.x) % PAGE_SIZE;
unsigned int dummy[COUNT];
for (int c = 0; c < COUNT; c++)
{
dummy[c] = 0;
}
for (int i = 0; i < DATA_ACCESSES; i++)
{
#pragma unroll
for (int c = 0; c < COUNT; c++)
{
unsigned int page = ((bid * COUNT + c) * DATA_ACCESSES + i) * 1031;
unsigned int index = (page % pages_count) * PAGE_SIZE + tid;
T v = data[index];
unsigned int * va = reinterpret_cast<unsigned int *>(&v);
#pragma unroll
for (int j = 0; j < (sizeof(T) / sizeof(unsigned int)); j++)
{
dummy[c] ^= va[j];
}
}
}
if (bid != 1 << 24) return;
for (unsigned int c = 0; c < COUNT; c++)
{
reinterpret_cast<unsigned int *>(data)[COUNT * c + tid] = dummy[c];
}
}
int main()
{
uint4 * data;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaMalloc(reinterpret_cast<void**>(&data), DATA_SIZE);
const int count = 32;
const int page_size = 8;
kernel<count, page_size> << <BLOCKS_COUNT, BLOCK_SIZE >> >(data);
cudaDeviceSynchronize();
cudaEventRecord(start, nullptr);
kernel<count, page_size> << <BLOCKS_COUNT, BLOCK_SIZE >> >(data);
cudaEventRecord(stop, nullptr);
cudaDeviceSynchronize();
float duration;
cudaEventElapsedTime(&duration, start, stop);
printf("%f ms\n", duration);
printf("%f GB/s\n", (1000.0f/duration) * count * sizeof(uint4) * DATA_ACCESSES * BLOCKS_COUNT * BLOCK_SIZE / static_cast<float>(1 << 30));
return 0;
}
You can change the size of the coalesced memory block by changing the page_size constant (block size in bytes = page_size * sizeof(uint4)). You can also change the element type to uint2 or plain uint and increase page_size accordingly; it makes no difference.
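Just to illustrate the parameterization (these particular instantiations are only examples, not something I measured): the same coalesced block size can be built from different element types, and a 256-byte block only requires doubling page_size. When changing the element type, also adjust sizeof(uint4) in the bandwidth printf accordingly.
// Illustrative instantiations - drop-in replacements for the launch in main():
kernel<count, 8><<<BLOCKS_COUNT, BLOCK_SIZE>>>(data);                                    // uint4,  8 * 16 B = 128-byte blocks
kernel<count, 16><<<BLOCKS_COUNT, BLOCK_SIZE>>>(data);                                   // uint4, 16 * 16 B = 256-byte blocks
kernel<count, 16><<<BLOCKS_COUNT, BLOCK_SIZE>>>(reinterpret_cast<uint2 *>(data));        // uint2, 16 *  8 B = 128-byte blocks
kernel<count, 32><<<BLOCKS_COUNT, BLOCK_SIZE>>>(reinterpret_cast<unsigned int *>(data)); // uint,  32 *  4 B = 128-byte blocks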
Your findings appear correct. As a rough rule of thumb on most cards you need to have two 128 byte transactions in flight per warp to reach about 80% of theoretical global memory bandwidth. This is to cover the latency of a fully loaded memory system, which can be higher than that of an otherwise idle GPU (more like 2000 cycles rather than the 400-800 cycles the Programming Guide used to mention - sorry I don’t have current numbers at hand from either the doc or newer devices).
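To put some rough numbers behind that rule of thumb (all figures below are ballpark assumptions for a GTX 980-class card, not measured values), Little's law says the data that must be in flight equals bandwidth times latency:
#include <stdio.h>

// Back-of-the-envelope estimate; every constant here is an assumption.
int main(void)
{
    const double bandwidth    = 224e9;  // bytes/s, theoretical DRAM bandwidth (assumed)
    const double core_clock   = 1.2e9;  // Hz, rough boost clock (assumed)
    const double latency      = 2000.0; // cycles, loaded-latency guess from above
    const int    sms          = 16;     // SMs on a GTX 980
    const int    warps_per_sm = 64;     // maximum resident warps per Maxwell SM

    double bytes_per_cycle = bandwidth / core_clock;      // ~187 bytes per core cycle
    double bytes_in_flight = bytes_per_cycle * latency;   // ~365 KB must be in flight
    double bytes_per_warp  = bytes_in_flight / (sms * warps_per_sm);

    printf("bytes in flight needed: %.0f KB\n", bytes_in_flight / 1024.0);
    printf("per resident warp at full occupancy: %.0f bytes (~%.1f 128-byte transactions)\n",
           bytes_per_warp, bytes_per_warp / 128.0);
    return 0;
}
That comes out to roughly 365 bytes, i.e. close to three 128-byte transactions, per resident warp at full occupancy, which is in the same ballpark as the rule of thumb above.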
tera, thank you for your comment. I agree that everything you said applies in the general case… however, in this particular case latency is not the problem.
I am already issuing way more loads than should be needed to cover any latency, to the point where I should be achieving full practical global memory bandwidth (which is usually a little more than 90% of the theoretical bandwidth).
What is actually happening is that as I launch more warps in parallel, the performance actually decreases… the GPU seems to be choking as I increase the number of transactions in flight.
Concretely, if I use the code above and make these changes (smaller blocks, so I can increase BLOCKS_COUNT with finer granularity):
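(A variant along these lines - the exact numbers below are placeholders, not the configuration behind the measurements - is what I mean:)
// Illustrative only - placeholder values, not the measured configuration:
#define BLOCK_SIZE 32        // smaller blocks, so BLOCKS_COUNT can be swept in finer steps
#define BLOCKS_COUNT 256     // swept upwards over several runs
// and, for the second case, 256-byte coalesced blocks:
const int page_size = 16;    // 16 * sizeof(uint4) = 256 bytes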
Now this looks the way it should look when the limiting factor is first latency and then the global memory bandwidth. My memory is running at 3505 MHz over a 256-bit-wide bus, which translates to 219 GB/s of theoretical bandwidth. So in this case we are reaching almost 90% (the actual bandwidth is even a little higher, as the timing is not that precise and includes all the overhead of starting a kernel).
So you see… the only difference between these two cases is how big the coalesced patch is; everything else is the same - we are issuing the same number of independent loads, which load the same overall amount of data. And we also see that in the first case the bottleneck isn't the memory bandwidth, and it also isn't latency.
The questions are:
What is the bottleneck?
Is it documented anywhere?
Is there some workaround?
Achieving 90% of theoretical bandwidth is an extremely good result; I do not recall seeing more than about 85% in actual applications on GPUs without ECC.
What you are observing are likely artifacts of the interaction between internal buffering, re-ordering, scheduling, and coalescing mechanisms in the memory controller, which change from GPU generation to GPU generation. I am not aware of any detailed documentation of the internal mechanisms of Maxwell-class GPU memory controllers, either from NVIDIA or from third parties who have reverse engineered them.
As for a workaround, it seems you have already discovered it: Use coalesced regions of 256 bytes. You may also want to look at the impact of the total number of threads running on memory throughput. In a previous GPU generation, maximizing memory throughput required up to 20x “oversubscription” of each SM’s warp execution resources. That is, the total number of thread blocks in the launched grid was ideally 20 times the number of thread blocks able to run concurrently. I do not know whether this heuristic still applies to Maxwell.
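A sketch of how that heuristic could be applied to the benchmark above (hypothetical code; the occupancy query is just one way to obtain the number of concurrently resident blocks, and the 20x factor is the old rule of thumb that may not carry over to Maxwell):
// Hypothetical grid sizing following the oversubscription heuristic;
// meant to drop into main() of the benchmark in the first post.
int resident_blocks_per_sm = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&resident_blocks_per_sm,
                                              kernel<count, page_size, uint4>,
                                              BLOCK_SIZE, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

const int oversubscription = 20;  // heuristic factor from older GPU generations
int blocks = oversubscription * resident_blocks_per_sm * prop.multiProcessorCount;
kernel<count, page_size><<<blocks, BLOCK_SIZE>>>(data);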
Thank you guys for your comments. @njuffa regarding the workaround… that is precisely the one thing that I cannot change… the algorithm has to work with regions of 128 bytes, and there is no way to coalesce the accesses, as the regions are more or less randomly scattered in a rather large block of memory (more than 1 GB).
It is unfortunate that this bottleneck is not properly documented; it could have saved me quite a lot of time that I wasted trying to achieve higher throughput… now it is obvious that every attempt at that was doomed to fail :/
The transactions in flight do not need to be adjacent. Try prefetching the next 128-byte block while you are processing the previous one. This might need checking with "cuobjdump -sass" and some fiddling and twiddling to ensure the compiler orders these as intended. I used to have a bit of trouble with the compiler placing a register-to-register move directly after the load (which defeats the purpose of prefetching), so check that this is not the case for your code.
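As a sketch of what I mean, applied to the inner loop of the kernel above (assuming COUNT == 1 to keep it short; untested, just to show the shape of the load/consume overlap):
// Prefetch sketch for the kernel body, assuming COUNT == 1: the load for
// iteration i+1 is issued before the value from iteration i is consumed.
unsigned int page = (bid * DATA_ACCESSES) * 1031;
T next = data[(page % pages_count) * PAGE_SIZE + tid];       // prefetch for i == 0
unsigned int acc = 0;
for (int i = 0; i < DATA_ACCESSES; i++)
{
    T v = next;                                               // loaded one iteration ago
    if (i + 1 < DATA_ACCESSES)
    {
        page = (bid * DATA_ACCESSES + i + 1) * 1031;
        next = data[(page % pages_count) * PAGE_SIZE + tid];  // issue the next load early
    }
    unsigned int *va = reinterpret_cast<unsigned int *>(&v);
    for (int j = 0; j < (sizeof(T) / sizeof(unsigned int)); j++)
    {
        acc ^= va[j];                                         // the XOR reduction from the original kernel
    }
}
Whether the load actually stays that early in the generated SASS is exactly what needs checking with cuobjdump.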
I do not think we can assume that there is a specific bottleneck somewhere in the Maxwell memory controller’s design. When there are multiple mechanisms interacting there could be any number of scenarios based on specific address patterns presented to the controller, and as tera points out, that is also a function of the machine code generated by the compiler, which at minimum schedules but may also re-order memory operations.
I suspect it is a bit like internet congestion, which is likewise a collection of scheduling, buffering, and protocol mechanisms, often tuneable: It is very difficult to get a handle on the overall system behavior even for people familiar with all the details, and non-intuitive results, such as added buffering lowering throughput, easily occur.
So even given a detailed description of the memory controller it is doubtful whether it would be clear what the optimal strategy is. I used to work on, and also optimize software for, x86 CPUs, and the last time any optimization strategy could be completely devised on paper was probably the Pentium MMX (P55C) and the AMD K6-2, assuming a solid understanding of the CPU, chipset, and memory used at the time. Ever since then, optimization has been an experimental process that may well be informed by detailed device documentation, but cannot simply rely on it.
As for the efficiency of memory throughput on Maxwell, I see numbers in line with my previously stated rule of thumb. My Quadro K2200 (sm_50) has a theoretical throughput of 80.16 GB/sec. The measured throughputs of DCOPY and ZCOPY are only about 80% of that:
> dcopy -n16777216
dcopy: operating on vectors of 16777216 doubles (= 1.342e+008 bytes)
dcopy: using 128 threads per block, 65520 blocks
dcopy: mintime = 4.167 msec throughput = 64.42 GB/sec
> zcopy -n16777216
zcopy: operating on vectors of 16777216 double2s (= 2.684e+008 bytes)
zcopy: using 128 threads per block, 65520 blocks
zcopy: mintime = 8.313 msec throughput = 64.58 GB/sec
I guess the difference from CudaaduC's test is that *COPY requires equal amounts of reads and writes, and is therefore exposed to the DRAM's read-write turnaround, while the sum reduction only performs loads?
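For reference, the copy kernels boil down to something like the following (a minimal grid-stride sketch in the same spirit, not the actual dcopy source), so every byte read from DRAM is also written back:
// Minimal grid-stride copy kernel of the dcopy flavor (illustrative sketch only):
// every element read from src is written to dst, so the benchmark also pays for
// the DRAM read-write turnaround that a load-only kernel avoids.
__global__ void dcopy_kernel(const double * __restrict__ src,
                             double * __restrict__ dst, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
    {
        dst[i] = src[i];
    }
}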