I’m a bit of a CUDA n00b and my first program is a simple bandwidth test that measures read performance over random blocks. The test uses a single warp to read 32 values of N bytes each from consecutive memory locations, where the base address is a 32*N-byte aligned pseudo-random number taken modulo the size of a large buffer (I use a 2GB device memory area). Each warp loops around a few times and then exits.
By using blocks of (32,2,1) I can get 100% occupancy in Nsight ((32,1,1) gives 50% occupancy). I then issue a large number of blocks to target around 500ms of execution time. I time the kernel on the CPU and compute memory bandwidth in GB/s.
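For reference, here is a minimal sketch of the kind of kernel and timing loop I’m describing, for N=4 (the buffer size, iteration count, block count, and the little LCG are illustrative stand-ins, not my exact code):

```
#include <cstdio>
#include <cstdint>
#include <chrono>
#include <cuda_runtime.h>

// One 128B coalesced read per warp per iteration, at a pseudo-random
// 128B-aligned offset within the buffer.
__global__ void randomReadKernel(const uint32_t *buf, size_t numLines,
                                 int iters, uint32_t *sink)
{
    const unsigned lane   = threadIdx.x;                 // lane 0..31 within the warp
    const unsigned warpId = blockIdx.x * blockDim.y + threadIdx.y;
    uint32_t state = warpId * 2654435761u + 1u;          // per-warp PRNG seed
    uint32_t acc = 0;

    for (int i = 0; i < iters; ++i) {
        state = state * 1664525u + 1013904223u;          // 32-bit LCG step
        size_t line = state % numLines;                  // random 128B line index
        acc += buf[line * 32 + lane];                    // 32 lanes x 4B = 128B read
    }
    if (acc == 0xDEADBEEFu) *sink = acc;                 // defeat dead-code elimination
}

int main()
{
    const size_t bytes    = size_t(2) << 30;             // ~2GB device buffer
    const size_t numLines = bytes / 128;                 // number of 128B lines
    const int    iters    = 1000;                        // reads per warp
    const int    blocks   = 200000;                      // tune for the desired runtime

    uint32_t *buf, *sink;
    cudaMalloc(&buf, bytes);
    cudaMalloc(&sink, sizeof(uint32_t));
    cudaMemset(buf, 1, bytes);

    dim3 block(32, 2, 1);                                 // 2 warps per block
    randomReadKernel<<<blocks, block>>>(buf, numLines, iters, sink);   // warm-up
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    randomReadKernel<<<blocks, block>>>(buf, numLines, iters, sink);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    double secs      = std::chrono::duration<double>(t1 - t0).count();
    double bytesRead = double(blocks) * 2 /*warps*/ * iters * 128;
    printf("%.1f GB/s\n", bytesRead / secs / 1e9);

    cudaFree(buf);
    cudaFree(sink);
    return 0;
}
```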
From what I’ve read about CUDA optimization and about GDDR5X, I would expect to hit peak memory bandwidth at N=4 for 128B reads from memory. Peak memory bandwidth on the GTX 1080 Ti is advertised as 484GB/s but bandwidthtest.exe only achieves about 360GB/s (75%). So I would expect to get near to 360GB/s with my test at N=4.
But that’s not the case.
With 128B random reads, I only get 240GB/s, which is almost exactly 50% of 484GB/s.
With 256B random reads, I can hit 350GB/s, which is basically as good as bandwidthtest.exe.
So am I right in concluding that the GTX 1080 Ti is performing 256B fetches from GDDR5X?
Is there a way to make it perform 128B fetches so I can get nearer to the bandwidth limit in the N=4 case?
I am a software guy with some exposure to hardware at various points in my career. If you need an authoritative answer I would suggest consulting someone who designs DRAM-based memory subsystems for a living.
GDDR5X is specified in JEDEC standard JESD 232 (November 2015). It supports DDR and QDR operating modes, where DDR mode provides 6 Gbps per pin and QDR mode provides 12 Gbps per pin. QDR mode uses twice the burst length of DDR mode. There is thus a fixed coupling between the theoretical throughput and the granularity of access.
So, if I understand it correctly: assuming the GPU’s memory controller could switch into either mode, you could increase the efficiency (percent of peak bandwidth used) of 128B accesses by switching to DDR mode, but this would also cut the theoretical throughput in half, which would leave you no better off than using QDR mode at lower efficiency. In fact, almost certainly worse, because one cannot get anywhere close to 100% efficiency.
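To put rough illustrative numbers on it, taking the advertised 484 GB/s and pretending, just for the sake of argument, that access granularity were the only source of inefficiency:

```
QDR:  484 GB/s * (128 B used / 256 B fetched)      ~= 242 GB/s
DDR:  (484/2) GB/s * (128 B used / 128 B fetched)   = 242 GB/s, at best
```

So for 128B accesses, DDR mode would at best break even, and would fall behind once the other real-world overheads are added.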
The observation that only about 75%-80% of theoretical throughput can be achieved in practical terms is a common feature of modern DRAMs. E.g. on an Intel CPU you might find a DDR4 system memory that delivers 75 GB/sec theoretical but 60 GB/sec practically achievable bandwidth. You can’t get to 100% due to overhead like address/data pin sharing and read/write turnaround. As DRAMs have gotten faster this overhead has taken up more and more of the total available cycles. With older DRAM architectures efficiencies around 85% were common, but of course the maximum achievable throughput was also lower.
The doubled burst length raises the access granularity from 32B to 64B per transfer, so the system has to fetch at least 64B at a time. That is still within the parameters that allow a 128B transfer size. The actual transfer size is a function of the memory controller design, not of GDDR5X itself; whether it can be modified is another question.
Are there NVIDIA engineers here who can definitively answer for the specific memory controller on the GTX 1080 Ti?
@njuffa, yeah, I’ve read parts of that spec. There’s no overhead due to address/data pin sharing because on GDDR5X the pins are not shared. Similarly, if you’re only performing read tests, there is no overhead due to read/write turnaround. There may be other overheads though (opening banks and so on); it’s all very complex, and the final performance depends on the memory controller design. I get 75% from bandwidthtest.exe, so that’s an indication of the overheads that exist in reality.
DDR mode definitely doesn’t help here. 128B reads against a 256B access granularity halve the bandwidth; 128B reads against a 128B access granularity at half the data rate also halve the bandwidth :)
Looks like you already know everything there is to know about this aspect of the hardware then. Trying to suss out low-level details of the GPU hardware architecture is an exercise in frustration because NVIDIA traditionally does not make most of the details publicly available. Given the accelerated and unprecedented speed of the GDDR5X rollout, it would be reasonable to assume that NVIDIA made minimal changes to their previous GDDR5 memory controller, leading to a doubling of the GPU’s access granularity as a direct consequence of the doubling of the DRAM’s access granularity.
I would suggest moving on to learning more aspects of CUDA, so you can shake your n00b status :-). The software is quite well documented, and the entire ecosystem has grown to an enormous size. I certainly can’t keep up with all of it anymore, so I would suggest focusing on those parts that are most relevant to your use case(s).
I have a master’s degree in Electrical Engineering and I am currently employed by NVIDIA in a technical capacity. However, I am not the engineer who designed the DRAM controller for the 1080 Ti (or any GPU), nor does that engineer patrol these forums.
I’m not aware of any user-controllable parameters that allow an end user to make an adjustment of this type.
The reason a 64-byte DRAM transfer seems like it should be fine in the context of a 128-byte global memory transaction, but actually is not, is the division of global memory into physical partitions. Global memory has a logical byte ordering such that, within a given 128-byte transaction, the bytes are “striped” across multiple DRAM partitions. A DRAM partition is essentially a standalone DRAM array, and GPUs like the 1080 Ti have multiple DRAM partitions connected to their multiple on-chip DRAM controllers. A given 128-byte global transaction, if it must be serviced from DRAM, will engage multiple DRAM partitions on GPUs that have more than one partition (generally one partition and one memory controller per 64 bits of DRAM bus width).
I’m not going to try to spell out the precise details of all the mappings that go on to determine which byte from a global memory line/transaction will be sourced from which partition. However, it stands to reason that if I have 4 partitions, and I want all 4 partitions of GDDR5X to be engaged for a given global transaction of 128 bytes, and I am using 64 bytes per partition per transaction, then I will have twice as many bytes as I need.
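Purely as illustrative arithmetic with that hypothetical 4-partition case (not a confirmed figure for any particular GPU):

```
128 B transaction striped across 4 partitions  -> 32 B needed from each partition
minimum GDDR5X (QDR) fetch per partition        = 64 B
total fetched = 4 * 64 B = 256 B                -> 128 B / 256 B = 50% efficiency
```

That 50% figure lines up with the ~240GB/s (roughly half of 484GB/s) you measured for 128B random reads.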
As njuffa states, if you are looking for a more precise answer than that, I don’t think you’ll find it publicly available, and NVIDIA doesn’t generally respond to requests for this level of technical detail.