coalescing future..

Hi,

Been playing with cuda for a bit and it is impressive how easy it is to use, hats off to the nv folk. However… am quite disappointed with the rather extreme coalescing rules, as it has such a large perf penalty. It seems the most efficient use is akin to SoA vs AoS style programming e.g. passing alot of arrays into a kernel vs an array of a structure gives significant perf advantage.

Gather functionality is no problem using the tex unit, so randomish patterns still run at reasonable speed, but theres no nice solution for scatter functionality at speed… best I can find is resolve to a 2 pass kind of like solution so, pass 1 writes the index the 2nd pass should read randomly(via tex) and write in a nice orderly fashion. e.g. mapping from a scatter to a gather pattern at a x2 cost in bandwidth… not nice… but beats using uncoalesced writes…

So guess my question is, in the future will uncoalesed rw always just suck so badly? or will the rules get relaxed more? Currently, I think this is the biggest flaw in cuda.

Cheers.

Coalescing is as far as I understand it, directly related to memory bus width, so for a 256 bit bus, you need to be writing 8 consecutive floats to fill the bus for 1 clock cycle. So you need to write 8 floats coalesced to be able to fill the memory bus.

So I am afraid this is just a hardware limitation that is not easily ‘fixed’ (unless you can make your memory bus 64 times as small & your frequency 64 times as high…

But I remember reading not so long ago some slides about CUDA (that were already a bit old) where it was stated that CUDA did not support texture writes yet. Now I don’t know what was meant by it, possible the fact you cannot write to a CudaArray was meant, but maybe they have some cache in mind to ease the trouble of coalesced writes.

Ahh yeah sorry, should probably change the wording a bit, to be randomish global reads/writes at a reasonable speed.

It seems any uncoalesed rw forces the entire part of the wrap to serialize e.g. guessing theres only 1 load/store issued to global memory per multi processor per PC(divergent or not). So thats 32(or probably 16)x32b load/store for an entire wrap PC, so a cache of 1… Thus if anything in that wrap misses then it costs an entire wrap pass + no latency hiding - presumably 1000+ cycles, not pretty.

Guess what Im asking the hw gods for is, if at some point there will be a small mini-global memory cache per multi-processor (note. this does NOT mean globally coherent memory…) A 4x32B cache would go a long way and can get rid of that direct mapping of threadID to position in the coalesce buffer -> thread0 -> buffer[0], thread1->buffer[1] etc etc which is just nasty.

Having said that… guess this is why you guys do

if (threadid == 0)
sharedmem = fetch data in a friendly way

sync threads
.
.
blah = sharedmem[rand_line][rand_offset]
.
.
sync threads

if (threadid == 0)
copy out

essentially a software memory cache, what a pain.

Hmm seems Ive answered my own question. But for those of us considering to “just ride the gpu wave” e.g. 6 months from now its not a huge problem, thats all I`m looking for. Alternatively if a sw cache is the way to go in the future, thats also helpful.

Fetching data is usually not in a if (threadIdx.x == 0), you want all threads reading a part of global mem into shared mem. And this is usually done if you need to access the same index in global mem from several threads in a block. Or if you need to have random access in a small part of global memory.

But I think what you should realize is that this shared memory is something new for CUDA, as far as I have read, you had nothing like it in GPGPU before. So this is already much, much more flexible as before. Also writing to random memory locations might be slower than coalesced, it was not possible at all before I believe.

For the threadIdx.x==0 thing was thinking along the lines of a situation like

warp[0] -> 16B line @ offset 0x0000
warp[1] -> 16B line @ offset 0x0010
warp[2] -> 16B line @ offset 0x0000
warp[3] -> 16B line @ offset 0x0010

warp[4] -> 16B line @ offset 0x0020
warp[5] -> 16B line @ offset 0x0030
warp[6] -> 16B line @ offset 0x0020
warp[7] -> 16B line @ offset 0x0030
.
.
.

etc. e.g. there is overlap between the threads in the memory they access.

Would think in this kind of situation its better for thread 0 to always fetch the memory, e.g. only fetch from offset 0x0000, 0x0010, 0x0020, 0x0030, same for store. Assuming the MP will do suboptimial things when theres memory aliasing, should probably write a test.

Sorry if its quite a negative sounding post… yeah i know shared memory and global memory are new and really cool and cuda is quite impressive. Just looking for direction on where it will go, and what parts nv plan to improve.

Its the same old story of gpu roadmaps and plotting a point some time in the future… should you investing the time to wire register combinders optimialy or focus on other areas and wait for SM1.1 etc etc. Or stop posting on the interweb and just code :)

If you have to e.g. fetch 16 values to shared mem, you would be better off using :

if threadIdx.x < 16
shared_mem[threadIdx.x] = globalmem[offset+threadIdx.x];

__syncthreads();

If I understand correctly what you want to do. And about the future, NVIDIA is usually quite tightlipped about what is coming, it was only this weekend when it was announced 3D textures will be part of 2.0 (which goes in beta the coming weeks), so further in the future is even more difficult ;)

By using 1-D textures mapped to memory you can get reasonable speed with non-coalesced access. It really helps a lot (especially if your access is semi-serialized, but too complex to do nice coalescing)… of course, this only applies to reading