coalescing heuristics

Are there any specified coalescing heuristics that I can use as a rule of thumb when considering global memory operations?

I don't know what kind of heuristics you want. I think the general explanation in the programming guide covers global memory access quite well. Do you have any specific problems?

I was hoping to get a general sense of the best way to read and write numerical values to and from global memory. Stuff like:

(I am not sure whether this is correct. That is why I made the post in the first place.)

-global memory is optimized for reading 4, 8, and 16 bytes per bank

-there are 16 (or is it 32?) banks, so it can do 16 such reads per memory clock cycle.

-It takes 400-600 clock cycles to access global memory

–I think it is 5 clock cycles per flop, so for a G80…

— so you must do at least (400 clock cycles (cc) / 5 cc per flop) * 128 stream processors = 10240 flops to hide one memory access

  • so that is something like 10240 flops / (16 * 16 bytes) = 40 flops per byte read, or 160 flops per word

-accesses should be aligned (i.e. the kth thread writes to the kth element)

So as a rule of thumb,

-at what point do you have enough arithmetic intensity to mask the global memory latency?

—is it a minimum of 160 flops / word?

-if you don't have enough arithmetic intensity, at what rate is the performance degraded?

-What else would I need to know?

The term bank is used in the shared memory model. When reading from global memory into registers or shared memory, it is important to have the right access pattern in order to achieve maximum bandwidth. There are two requirements to achieve that goal: 1. sizeof(type) must be equal to 4, 8, or 16. 2. coalesced memory access, achieved by choosing the right access pattern within a half-warp, which is further explained in the programming guide.

The shared memory is divided into 16 memory banks, so that 16 threads can access their shared memory simultaneously to achieve high memory bandwidth (as fast as registers; I don't know if that really takes only a single clock cycle). However, it is important to have no bank conflicts, which happen when multiple threads try to access the same bank.

I don't think I can really answer your question of how many floating-point operations one must provide in order to hide memory latency (5 cc is not correct; it depends on the operation, e.g. 4 cycles for an FMAD). There are far more variables in the system. Read chapter 5 carefully: there you will see that the effective instruction throughput is determined by many factors, not only by the degree of arithmetic intensity. You might have code of high arithmetic intensity, but if your program does not provide the right access pattern, that will limit your effective instruction throughput. You must test your program with different configurations and choose the best one. The CUDA occupancy calculator and the CUDA profiler are very nice tools for optimizing one's code.