I am a beginner reading the “CUDA C Programming Guide”.
I don’t understand the concept of throughput in the manual.
For example, section 5.3.2 contains the sentence “For example, if a 32-byte memory transaction is generated for each thread’s 4-byte access, throughput is divided by 8.”
Can someone explain throughput in detail?
Thanks a lot.
Questions like this come up frequently. There is a great deal of published information on it. You might want to study slides 30-48 in the following presentation:
In a nutshell, DRAM subsystems on GPUs have a minimum addressable quantity, which is usually 32 bytes. If you request 32 bytes and use all 32 bytes, then that is full throughput for the memory bus: every requested byte is actually used by the program. If you request 32 bytes (the minimum) but only use 4 bytes, then 28 of the bytes transferred are wasted. Only 4/32 = 1/8 of the bus traffic is useful, which is what the guide means by throughput being divided by 8.
When adjacent threads in a warp request data and that data is all adjacent in memory, the 32-byte transactions requested from DRAM are shared among the threads in the warp, so every byte transferred is used. This is 100% utilization, or full throughput. If, on the other hand, each thread generates a non-adjacent address, then many more DRAM transactions are required to satisfy each thread's needs, a lot of “wasted” bytes are transferred, and “throughput” goes down.
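To put numbers on this, here is a rough host-side sketch in plain C++ (no GPU needed) that counts how many 32-byte transactions one warp would generate for a given access stride. It is only a simplified model under stated assumptions: a warp of 32 threads, a 32-byte minimum transaction, a base address aligned to 32 bytes, and no caching effects. The names `sectors_touched` and `bus_efficiency` are made up for illustration and are not part of any CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed values for illustration; they are not queried from a real device.
constexpr std::size_t kWarpSize = 32;     // threads per warp
constexpr std::size_t kSectorBytes = 32;  // minimum DRAM transaction size

// Count the distinct 32-byte sectors touched when thread t of one warp reads
// elem_bytes bytes starting at byte offset t * stride_elems * elem_bytes.
// The base address is assumed to be aligned to a sector boundary.
std::size_t sectors_touched(std::size_t stride_elems, std::size_t elem_bytes) {
    assert(stride_elems >= 1);
    assert(elem_bytes > 0 && elem_bytes <= kSectorBytes);
    std::set<std::size_t> sectors;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        std::size_t first = t * stride_elems * elem_bytes;
        std::size_t last = first + elem_bytes - 1;
        // An element no larger than a sector spans at most two sectors.
        sectors.insert(first / kSectorBytes);
        sectors.insert(last / kSectorBytes);
    }
    return sectors.size();
}

// Fraction of the transferred bytes that the warp actually uses.
double bus_efficiency(std::size_t stride_elems, std::size_t elem_bytes) {
    double used = static_cast<double>(kWarpSize * elem_bytes);
    double moved = static_cast<double>(
        sectors_touched(stride_elems, elem_bytes) * kSectorBytes);
    return used / moved;
}
```

In this model, `bus_efficiency(1, 4)` (adjacent 4-byte accesses) gives 1.0, because the warp's 128 bytes fit in exactly 4 transactions. `bus_efficiency(8, 4)` (each thread 32 bytes apart) gives 0.125, because every thread triggers its own 32-byte transaction and uses only 4 bytes of it. That is the "divided by 8" case from the guide.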
Thanks a lot.
It’s a very clear answer. I understand throughput much better now.