Coalesced read/write memory details: more information about coalesced memory

I have a question aimed at CUDA developers much more experienced than I am. A lot of information has been written about coalesced memory read/write operations. I understand the basics of the concept, and I also understand how to achieve high read/write bandwidth with coalescing. I have read the CUDA Programming Guide carefully, as well as the Optimization Strategies described by Mark Harris (SC 2007), and I have also read many topics on this forum. But … I still don’t fully understand the “background” of coalescing. To be more precise:

Devices support 128-bit memory reads using one read instruction. The 384-bit memory bus of the device should thus support reading 3 float4 vectors during one clock cycle. Before I tried coalescing, I believed that using float4 data types would give maximal memory bandwidth (uncoalesced float4 read/write), but it doesn’t. So what exactly does coalescing mean? Does it support reading/writing whole blocks of device memory? And if so, what is the reason for such behaviour? Does it mean that the address bus of the device supports only one address per clock cycle, so that the whole block at that address must be read/written? And is there any instruction which permits a coalesced read/write of more than 128 bits?

I would be grateful for any explanation, or links to technical details about CUDA-related devices that I should read. I’m using CUDA while working on my thesis and I really need to understand the concept of coalesced memory. Thank you very much…

Sorry, the information you are requesting is implementation-specific and deeply RTL-level.

Hello, thank you for your reaction. I really need to find some info about coalescing. Maybe we misunderstood each other. I’m not asking for implementation- or hardware-specific details. I would just like to know what ‘coalescing’ means. I don’t believe that so many developers use it and nobody knows what the term ‘coalescing’ means. More precisely, I would like to know:

  • Does coalescing cause one instruction per block to be used for a memory read? Or are standard float4 read instructions used, and coalescing causes something else?

This kind of information doesn’t seem to me to be an implementation-specific or deeply RTL-level detail, but maybe I’m just wrong :-(

Coalescing means that the memory reads issued by consecutive threads in a warp are combined by the hardware into a few wide memory reads. The requirement is that the threads in the warp must be reading memory in order. For example, if you have a float array called data and you want to read many floats starting at offset n, then thread 0 in the warp must read data[n], thread 1 must read data[n+1], and so on. Those 32-bit reads, which are issued simultaneously, are merged into several 384-bit reads in order to use the memory bus efficiently. Coalescing is a warp-level activity, not a thread-level activity.
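To make the “in order” requirement concrete, here is a rough host-side C++ model (not CUDA code, and not how the hardware is actually built). It computes the byte address each thread would read for data[n + t * stride] and counts how many aligned segments those addresses touch. The helper names threadAddresses and segmentsTouched and the 128-byte segment size used in the note below are assumptions made for illustration; the real coalescing rules in the programming guide are stricter than “one read per touched segment”.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Byte address read by each thread t for the access data[n + t * stride],
// where every element is elemBytes wide and the array starts at base.
// (Hypothetical helper for this model; not a CUDA API.)
std::vector<std::uint64_t> threadAddresses(std::uint64_t base, std::size_t n,
                                           std::size_t stride, std::size_t elemBytes,
                                           std::size_t threads) {
    std::vector<std::uint64_t> addrs;
    addrs.reserve(threads);
    for (std::size_t t = 0; t < threads; ++t)
        addrs.push_back(base + (n + t * stride) * elemBytes);
    return addrs;
}

// Number of distinct aligned segments of segmentBytes that contain at least
// one of the addresses. ASSUMPTION: elements are naturally aligned and no
// wider than a segment, so only the starting address of each element matters.
std::size_t segmentsTouched(const std::vector<std::uint64_t>& addrs,
                            std::size_t segmentBytes) {
    assert(segmentBytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : addrs) segments.insert(a / segmentBytes);
    return segments.size();
}
```

In this model, 32 threads reading consecutive floats from an aligned start touch a single 128-byte segment, while the same threads reading with a stride of 32 floats touch 32 different segments, so most of every wide read would be thrown away.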

You also stumbled upon a performance bug regarding 128-bit reads (such as float4). Many people have found that memory throughput is maximized when 32- or 64-bit reads are coalesced, but for some reason 128-bit coalescing has half the throughput. (8- and 16-bit reads are not coalesced at all.)

Does that help clarify things?

Thank you very much, Seibert, for your reply. Your explanation is by far the best one I have found so far. It clarifies things enough, although I had hoped to find some sort of technical article or other information describing even more “technical” details of coalescing. So that you understand my motivation for this thread, here are the thoughts that have occupied my mind:

The width of the device memory bus is 384 bits. So the device should be able to supply 384/8/4 = 12 float values per memory clock cycle. The real issue starts here. If the device (more precisely, its address bus) supported 12 independent addresses per clock cycle, coalescing would not be an issue, and 12 independent threads would be able to fetch 12 independent float values in one clock cycle. But it now seems to me that the device supports far fewer independent addresses per memory clock cycle, so those fetches must be coalesced into larger blocks, and that is the point. Do you think I’m close to the truth with these thoughts, or is there some other hardware issue behind coalescing?

I understand that I must seem like a real “fault-finder” to all participants of this forum, but I only wanted to find some technical background that I could use in the theoretical and practical sections of my thesis. Thank you once more, and please excuse my poor English; I’m the very opposite of a native English speaker :) :magic:

Actually I think a lot of people appreciate the hard questions and going into detail is always interesting. So don’t worry.
I personally find all this very interesting. I agree that Seibert’s explanation is the best so far on the dark secrets behind coalesced memory access. I am also longing for more technical details since, like you, I am writing a thesis at the moment. Although it would be great to get more details on the reasons for coalescing and how it is implemented, I’m not getting my hopes up. I think NVIDIA tends to keep things like this confidential.

From my understanding of the problem, I think you’re close to the truth with your thoughts.

Unfortunately, this is pretty much all the detail I’ve seen mentioned in the forum, so I’m not sure if anything else has been revealed.

You might also try fishing through these course lectures:

This class was team-taught by a UIUC professor along with NVIDIA’s Chief Scientist. There might be some interesting details buried in those slides.

Thank you guys very much for your support and motivation. It still seems really strange to me that so many developers use CUDA and coalescing and so few of them are curious about what their code really does on the device side. I’m downloading the slides Seibert mentioned right now. I believe Mark Harris from NVIDIA would be able to answer our questions, but I have serious doubts he will ever read this :ermm: Nonetheless, I promise I will post here any further information I manage to find about coalescing. Thx once more…

Well it is simple, I just want my program to be super fast, I don’t care how the device does it…

Well, from a practical standpoint there isn’t much of a reason to need to understand the lowest level of the hardware. Coalescing per the manual results in good performance (except for 128-bit types, as was mentioned) and not coalescing results in bad performance. That’s all we need to know to write efficient programs. Any kernel with a sufficient number of blocks that does coalesced reads/writes will max out the 70 GiB/s memory bandwidth on the GPU. Try writing a bit of C/C++ code that copies memory in a for loop and see if you get near the maximum bandwidth of your RAM. I don’t recall ever getting close to the max without writing SSE code.
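For anyone who wants to try the CPU experiment, a minimal sketch of such a copy loop might look like the following. The function name copyBandwidthGiBs is made up for this example, and counting both the bytes read and the bytes written as traffic is an assumption about how the bandwidth should be reported. The number you get depends heavily on compiler flags, buffer size and caches, so treat it as a rough indication only.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Copies src into dst with a plain for loop and returns the measured
// bandwidth in GiB/s. ASSUMPTION: traffic is counted as bytes read plus
// bytes written, i.e. twice the buffer size.
double copyBandwidthGiBs(const std::vector<float>& src, std::vector<float>& dst) {
    assert(src.size() == dst.size());
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = src[i];
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double bytes = 2.0 * static_cast<double>(src.size()) * sizeof(float);
    return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Call it with two equally sized vectors that are much larger than the last-level cache, repeat the call a few times, and compare the best result with the theoretical bandwidth of your RAM.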

Additionally, I don’t see coalescing as any different from using stride 1 memory accesses on the CPU, which everybody should know results in better performance than not using stride 1 accesses. These two behaviors probably stem from the same root cause: changing the high bits in DRAM addressing takes time because it needs to power up those sections of the memory. Changing the low bits in the memory address can be done much more quickly. This is why CPU memory controllers automatically read the next so many bytes into cache on a memory read, because they can do it “for free”. Think of coalescing as the same thing, a look-ahead cache read, except that there is no cache to put it in. Unless a thread is going to use the look-ahead value, the read has gone to waste, the same as it would on the CPU if you weren’t using stride 1 accesses.

So the number of bits in the memory bus and the warp size don’t add up. I wouldn’t waste your time thinking about it. NVIDIA engineers have done something to solve it. There are 32*32 = 1024 bits in a warp’s coalesced read of 32-bit values and the memory bus is 384 bits wide, so the full coalesced read is 2.67 trips across the memory bus. Maybe there is a second memory controller that can use that other 1/3 of a trip, or maybe it is wasted and that is why we can’t reach exactly the theoretical peak memory performance. Only the NVIDIA engineers who designed it know, and I can’t possibly imagine a situation where knowing one way or another (or another) would in any way help you write efficient code. As I said before, all you need to do is coalesce your accesses and the hardware will perform beautifully (with the only caveat being the 128-bit reads, which is well documented on the forums and at least mentioned in the programming guide).
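The arithmetic behind the 2.67 figure can be written down as a few small C++ helpers. The function names are invented for this sketch, and treating the bus as something that moves exactly busBits per “trip” is a simplification of whatever the memory controller really does.

```cpp
#include <cassert>
#include <cstddef>

// Total bits requested when every thread in a warp reads one element.
constexpr std::size_t warpRequestBits(std::size_t threads, std::size_t bitsPerThread) {
    return threads * bitsPerThread;
}

// Fractional number of bus-wide transfers needed for the request.
constexpr double busTrips(std::size_t requestBits, std::size_t busBits) {
    return static_cast<double>(requestBits) / static_cast<double>(busBits);
}

// Whole transfers needed if a partly used transfer cannot be shared with
// another request (ASSUMPTION: this is the "wasted 1/3 of a trip" case).
constexpr std::size_t wholeBusTrips(std::size_t requestBits, std::size_t busBits) {
    return (requestBits + busBits - 1) / busBits;
}
```

With 32 threads reading 32 bits each on a 384-bit bus, busTrips gives about 2.67 and wholeBusTrips gives 3; on a 256-bit bus the same request is exactly 4 trips with nothing left over.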

A look at general SDRAM theory can clarify at least a bit of it.
From a really broad perspective (that is, not specifically for GPUs), coalescing in memory controllers is done to improve the burstiness of traffic to the SDRAM. It is possible to get close to the theoretical bandwidth of an SDRAM, but only when there are long bursts during which addresses stay within a number of specific ranges (due to the bank-paged nature of SDRAMs).

If you have multiple agents reading from and writing to memory with different addresses, it will hurt performance if you serve them on a first-come, first-served basis, because requests will interleave and the requested address will jump from one to the other, instead of having bursts with nicely linear addresses. Most contemporary SDRAMs only support bursts of a certain minimum length. A burst length of 4 is common. This means that a 32-bit wide memory chip can only transfer blobs of data with an atomic minimum size of 4 * 32 bits. So even if your program only needs one 32-bit word, the memory controller HAS to read all 4 of them, which means your maximum efficiency can be no higher than 25% of the theoretical bandwidth. Clearly, that’s not acceptable (and it is just one way in which to lose bandwidth!)
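The 25% figure follows directly from the minimum burst size, which a short C++ sketch can show. The helper names are hypothetical, and the model assumes that the useful words are consecutive and start on a burst boundary, so it gives the best case for a given number of words.

```cpp
#include <cassert>
#include <cstddef>

// Smallest amount of data, in bytes, that one burst can transfer.
constexpr std::size_t minBurstBytes(std::size_t chipWidthBits, std::size_t burstLength) {
    return chipWidthBits * burstLength / 8;
}

// Fraction of the transferred words that the requester actually wanted.
// ASSUMPTION: the useful words are consecutive and burst-aligned.
double burstEfficiency(std::size_t usefulWords, std::size_t burstLength) {
    assert(burstLength > 0);
    if (usefulWords == 0) return 0.0;
    const std::size_t bursts = (usefulWords + burstLength - 1) / burstLength;
    return static_cast<double>(usefulWords) /
           static_cast<double>(bursts * burstLength);
}
```

For a 32-bit chip with burst length 4, minBurstBytes gives 16 bytes, and a request for a single word has an efficiency of 0.25, while four consecutive words reach 1.0.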

So regular memory controllers that can serve multiple agents will typically NOT issue a request to the memory as soon as the request arrives, but wait instead, with some time-out counter, to see if there isn’t another request from the same agent with an address that follows the previous one. If there is, then those 2 (or 3 or 4…) consecutive requests are all issued in 1 burst, instead of one by one.

This is coalescing in the time domain: over a certain period of time, you gather as many coherent requests as possible. It’s a simple technique that can be incredibly effective at lifting the bandwidth of a memory controller.
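A very reduced C++ model of this kind of merging might look like the following. It takes the word addresses of one agent in arrival order and glues consecutive addresses into bursts of at most maxBurst words. The Burst struct and the coalesceRequests function are invented for this sketch; the time-out counter, alignment rules and multiple agents of a real controller are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One burst issued to the memory: a run of consecutive word addresses.
struct Burst {
    std::uint64_t firstWord;  // word address of the first request in the burst
    std::size_t length;       // number of consecutive words in the burst
};

// Merges requests, in arrival order, into bursts. A request joins the current
// burst only if its address directly follows the burst's last word and the
// burst is not yet full. ASSUMPTION: all requests arrive within the time-out.
std::vector<Burst> coalesceRequests(const std::vector<std::uint64_t>& wordAddrs,
                                    std::size_t maxBurst) {
    assert(maxBurst > 0);
    std::vector<Burst> bursts;
    for (std::uint64_t addr : wordAddrs) {
        if (!bursts.empty()) {
            Burst& last = bursts.back();
            if (last.length < maxBurst && addr == last.firstWord + last.length) {
                ++last.length;  // extend the current burst
                continue;
            }
        }
        bursts.push_back(Burst{addr, 1});  // start a new burst
    }
    return bursts;
}
```

With a maximum burst of 4, the sequence 100, 101, 102, 103, 200, 201 becomes two bursts, whereas the interleaved sequence 100, 200, 101, 201 stays as four single-word bursts, which is exactly the first-come, first-served problem described earlier.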

In the case of CUDA, you have multiple threads requesting memory at the same time, so the coalescing here doesn’t happen in the time domain, but the basic principle is the same: try to find a whole bunch of transactions that are coherent (as defined by the rules in the CUDA manual), so you can issue them together instead of as a number of individual small requests that are very inefficient. That’s really all there is to it.

One observation wrt 384-bit: as Mr. Anderson pointed out, I wouldn’t worry about it. The CUDA programming manual doesn’t give different coalescing rules for different GPUs: they seem to be identical for a 384-bit 8800GTX, a 320-bit 8800GTS640, a 256-bit 8800GT, a 192-bit 8800GS and a 128-bit 8600GT… In other words: it’s an implementation detail that just seems to work, no matter how wide the external bus really is. (Which is really a good thing for us programmers: imagine having to program differently for each card with a different memory bus width!)