Question about the concepts of throughput and latency

I am a beginner and am reading the “CUDA C Programming Guide”.

I don’t understand the concept of throughput in the manual.

For example, section 5.3.2 contains the sentence “For example, if a 32-byte memory transaction is generated for each thread’s 4-byte access, throughput is divided by 8.”

Can someone explain the throughput in detail?

Thanks a lot.

Questions like this come up frequently, and there is a great deal of published information on the topic. You might want to study slides 30-48 in the following presentation:

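In a nutshell, DRAM subsystems on GPUs have a minimum addressable quantity, which is usually 32 bytes. If you request 32 bytes and use all 32 bytes, that is full throughput for the memory bus: every requested byte is actually used by the program. If you request 32 bytes (the minimum) but only use 4 of them, then the other 28 bytes transferred are wasted. Only 4 of every 32 bytes moved are useful, so the useful throughput is 4/32 = 1/8 of what the bus can deliver. That is the “divided by 8” in the sentence you quoted.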

When adjacent threads in a warp request data that is itself adjacent in memory, each 32-byte transaction requested from DRAM is fully used by several threads in the warp. This is 100% utilization, or full throughput. If, on the other hand, each thread generates a non-adjacent address, then many more DRAM transactions are needed to satisfy each thread’s request, a lot of “wasted” bytes are transferred, and “throughput” goes down.
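To put numbers on this, here is a small host-side C++ sketch (not GPU code) that counts how many 32-byte segments one warp touches for a given access stride. It is only a simplified model under stated assumptions: a warp of 32 threads, a 32-byte minimum transaction, no caching, and a base address aligned to 32 bytes. The names `segments_touched` and `bus_efficiency` are made up for illustration and are not part of any CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Assumed model parameters (illustrative, not queried from real hardware).
constexpr std::size_t kWarpSize = 32;      // threads per warp
constexpr std::size_t kSegmentBytes = 32;  // minimum DRAM transaction size

// Count the distinct 32-byte segments touched when thread t of a warp reads
// elem_bytes bytes starting at byte offset t * stride_elems * elem_bytes.
std::size_t segments_touched(std::size_t elem_bytes, std::size_t stride_elems) {
    assert(elem_bytes > 0 && stride_elems > 0);
    std::set<std::uint64_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        std::uint64_t first = t * stride_elems * elem_bytes;
        std::uint64_t last = first + elem_bytes - 1;
        // An access may straddle a segment boundary, so record every segment it covers.
        for (std::uint64_t s = first / kSegmentBytes; s <= last / kSegmentBytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}

// Fraction of the transferred bytes that the warp actually uses.
double bus_efficiency(std::size_t elem_bytes, std::size_t stride_elems) {
    double used = static_cast<double>(kWarpSize * elem_bytes);
    double moved = static_cast<double>(
        segments_touched(elem_bytes, stride_elems) * kSegmentBytes);
    return used / moved;
}
```

In this model, `segments_touched(4, 1)` is 4 (adjacent 4-byte reads, 128 bytes moved, all of them used), while `segments_touched(4, 8)` is 32 (one full segment per thread, 1024 bytes moved for 128 useful bytes). Correspondingly, `bus_efficiency(4, 8)` comes out to 0.125, which is the “divided by 8” case from the guide.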

Nice resource, thanks

Thanks a lot.

That’s a very clear answer. I understand throughput much better now.