Hi folks! I have been scouring the forums, the book (Programming Massively Parallel Processors by Kirk & Hwu), and the Programming Guide, and I can't seem to find an answer to this question. I am using a GTX 470 and intend to target only devices with compute capability 2.x and higher.
I am building a Genetic Algorithm in CUDA, so I have set each block to the maximum thread count (1024 for the device I am using). As I understand it, my device can also run 1024 threads per SM. I need to compute a set of data, at most 16 kB, for each thread, and then perform a calculation on each piece of that data.
One option is to allocate a 16 kB chunk for each thread in global memory. I checked my maximum number of blocks and confirmed I can do this without running out of global memory. The issue is that global memory is quite slow; I don't think the L1 and L2 caches will help much here, and coalescing won't work either, for reasons I won't bore you with.
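For concreteness, option 1 would look roughly like this. This is only a sketch: `gaKernel`, `scratch`, and `nBlocks` are placeholder names, not my real GA code.

```cuda
// Sketch of option 1: one private 16 kB slab per thread, in global memory.
#define SCRATCH_BYTES (16 * 1024)

__global__ void gaKernel(char *scratch)
{
    // Each thread indexes its own slab by global thread id,
    // so no two threads ever touch the same bytes.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    char *mine = scratch + (size_t)tid * SCRATCH_BYTES;

    // ... fill `mine` with this thread's working set, then compute on it ...
    (void)mine;
}

// Host side: one allocation covering every thread in the grid, e.g.
//   char *scratch;
//   cudaMalloc(&scratch, (size_t)nBlocks * 1024 * SCRATCH_BYTES);
//   gaKernel<<<nBlocks, 1024>>>(scratch);
```

The per-thread indexing is what makes this safe; the downside, as I said, is that every access goes to slow, uncoalesced global memory.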
The second option I am considering is to allocate a single 16 kB chunk of shared memory and have all the threads use it.
From the book, I understand that threads are scheduled in some order, but it is not clear to me whether they execute simultaneously on the SM. If they do execute simultaneously, each thread would overwrite the shared "buffer" and garble the others' data, making this idea worthless.
The other concern I have with this idea is: what if all the threads are not on one SM (i.e., the warps are scheduled on different SMs)? Slide 4 of www.sdsc.edu/us/training/assets/docs/NVIDIA-01-Intro.pdf seems to indicate that each SM has its own physical shared memory. Does this mean that if threads 0-511 are on SM0 and threads 512-1023 are on SM1, there would be two copies of my shared "buffer", one in each SM's physical shared memory? Or is there only one copy of the shared "buffer" visible across all SMs, which would also make this idea worthless?
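In case it clarifies what I'm asking, this is the shared-memory layout I'm picturing for option 2 (again just a sketch, with placeholder names):

```cuda
// Sketch of option 2: one 16 kB buffer declared in shared memory.
#define SCRATCH_BYTES (16 * 1024)

__global__ void gaKernelShared(void)
{
    // My understanding: __shared__ storage is per *block*. Every block
    // gets its own copy in the shared memory of whichever SM it runs
    // on, and a block never spans two SMs.
    __shared__ char buf[SCRATCH_BYTES];

    // My worry: all 1024 threads of this block see this same buffer,
    // so if they run concurrently and each tries to stage its own
    // 16 kB of data here, they clobber each other -- unless the block
    // serializes access, e.g. one thread at a time with
    // __syncthreads() between turns, which defeats the purpose.
    (void)buf;
}
```

So the question boils down to whether that per-block, per-SM picture in the comments is correct.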
I know this idea seems like a stretch; I just want to squeeze out every performance advantage and understand the memory hierarchy better.