CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH

Hi CUDA masters,
I want to better understand what this device attribute means. It is defined as “global memory bus width in bits” in the CUDA Driver API.
Is this the bus width between global device memory and the L2 cache level?
And does this mean (for example, if the value is 256 bits = 32 bytes) that if the L1/L2 cache line width is 128 bytes, an L1 or L2 cache miss will require 4 transactions between global memory and L2 to transfer a single cache line?

Also, is the device attribute CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE the clock rate for this global memory only, or for all memory levels (L1, L2, global, texture memory) on the device?
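
For reference, a minimal driver-API sketch that queries both attributes for device 0 (error checking omitted) would look something like this:

```
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    int busWidthBits = 0, memClockKHz = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    // width of the data bus between the GPU and its DRAM, in bits
    cuDeviceGetAttribute(&busWidthBits,
                         CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH, dev);

    // peak memory clock frequency, in kHz
    cuDeviceGetAttribute(&memClockKHz,
                         CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE, dev);

    printf("memory bus width : %d bits\n", busWidthBits);
    printf("memory clock rate: %d kHz\n", memClockKHz);
    return 0;
}
```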

Thanks!

Related question: why is the global memory bus width = 256 bits? I assume it has to do with the GDDR5/6 memory units. Can someone clarify a bit more? I’m not DIMM-witted (sorry, nerd joke).
Thx.

It is intended to convey the number of data bus wires used to implement the electrical interface between the GPU chip itself and the GPU DRAM memory (which is off-chip). It would be approximately correct to say it is the bus width between (off-chip) device memory and the L2 cache.
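
As a rough sketch of how those two attributes are usually combined, the theoretical peak DRAM bandwidth can be estimated like this (the factor of 2 assumes double-data-rate signaling, which applies to GDDR/HBM parts):

```
// memClockKHz : CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE (kilohertz)
// busWidthBits: CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH (bits)
double peak_bandwidth_GBs(int memClockKHz, int busWidthBits)
{
    return 2.0                             // double data rate: 2 transfers per clock
         * (double)memClockKHz * 1000.0    // memory clock in Hz
         * (busWidthBits / 8.0)            // bytes moved per transfer across the bus
         / 1.0e9;                          // -> GB/s
}
// e.g. a 256-bit bus with a reported 5 GHz memory clock works out to ~320 GB/s
```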

I’m not aware of any CUDA GPU ever that had an L2 cache line of 128 bytes. To my knowledge it is uniformly 32 bytes. An L2 cache line miss would typically be serviced from (off-chip) DRAM memory. The transactions associated with the DRAM bus are not typically at the width of the DRAM bus. Typically, the DRAM bus is broken into so-called “partitions”. Depending on the mapping of L2 cache lines to physical DRAM memory, the memory controller will issue one or more transactions on one or more partitions to service L2 cache miss(es).

An L1 cache miss would usually attempt to hit in L2 first. If it missed in L2, see above.

I’m not sure it is documented, but if we were for example to pretend that a partition has a width of 64 bits, then a cache line miss for L2 would require 4 partition transactions. Whether those partition transactions would all occur on the same partition or would be spread across partitions is something that I can’t answer, is generally undocumented, and probably depends on GPU architecture, the exact addresses involved, and possibly other factors. You can imagine that the GPU designers want to make it such that typical data access patterns will tend to fall into a mapping arrangement such that DRAM bus bandwidth will be maximally/optimally utilized.
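
To make that pretend arithmetic concrete (both values below are illustrative, not taken from any spec):

```
#include <stdio.h>

int main(void)
{
    int l2LineBytes        = 32;   // L2 line granularity discussed above
    int partitionWidthBits = 64;   // the "pretend" partition width, purely hypothetical

    // 32 bytes / 8 bytes per partition transaction = 4 transactions per L2 miss
    printf("partition transactions per L2 miss: %d\n",
           l2LineBytes / (partitionWidthBits / 8));
    return 0;
}
```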

GPUs have varying DRAM bus widths. I have seen GPUs that have a bus width as low as 64 bits and as high as 4096 bits (I think A100 GPUs are even higher, but I haven’t checked). If you have a GPU where the bus width is 256 bits, then that is because the GPU designers felt that was best for that particular product/SKU. I think there are many factors that go into such a design choice, including cost, DRAM technology, and desired bandwidth for the product segment (i.e. desired performance).

If I recall correctly from my processor-building days, the cost aspect (and possibly the power aspect, but my memory is very hazy there) of a memory interface encourages using as narrow a DRAM interface as possible, running at as high a clock speed as possible, to achieve a targeted transfer rate. The targeted transfer rate is a function of the performance class a particular GPU falls into. This applies to classical DRAM technology. GPUs like the A100 use a newer DRAM technology called HBM2, which typically uses very wide interfaces.
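
As a ballpark illustration of that trade-off (the per-pin data rates below are made-up round numbers, not specs for any particular product), the same target transfer rate can be reached with a narrow, fast interface or a wide, slow one:

```
#include <stdio.h>

// raw transfer rate across the bus: pins (bits) times per-pin data rate
static double bandwidth_GBs(int busWidthBits, double gbitPerSecPerPin)
{
    return busWidthBits * gbitPerSecPerPin / 8.0;   // Gbit/s -> GB/s
}

int main(void)
{
    printf("256-bit  @ 14.0 Gbit/s/pin: %.0f GB/s\n", bandwidth_GBs( 256, 14.0)); // narrow, fast
    printf("1024-bit @  3.5 Gbit/s/pin: %.0f GB/s\n", bandwidth_GBs(1024,  3.5)); // wide, slow
    return 0;
}
```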

@Robert_Crovella - I based my comment about the 128-byte L1/L2 cache line on this info from the current/latest Nsight Compute docs:
Sector: Aligned 32 byte-chunk of memory in a cache line or device memory. An L1 or L2 cache line is four sectors, i.e. 128 bytes.
see Quantities table, “sector” row: Kernel Profiling Guide :: Nsight Compute Documentation

Now, per your comment that “transactions associated with the DRAM bus are not typically at the width of the DRAM bus”… suppose, as in the case of the Turing T4, the DRAM bus is 256 bits. Does this have to do with the GDDR DIMM package having a 64-bit interface? So in the case of the 4 transactions needed to pull a full 32-byte “sector”: is there a situation similar to shared memory banks, i.e. the possibility of bank conflicts (partition conflicts?), or at least reduced efficiency, if all 4 sectors had to come from the same partition versus being interleaved amongst 4 partitions where they could be accessed in parallel?

Yes, the concept you are now asking about is called “partition camping”. There isn’t really anything you can do about it. It’s been discussed elsewhere, so if you want more info, a Google search may suffice.

I was under the impression that partition camping hasn’t been a serious issue for over a decade now, as it was addressed by address bit scrambling heuristics. Of course those can only minimize the risk, not eliminate it completely, but they have been quite successful at preventing partition-camping effects for commonly encountered addressing patterns.