Maximising memory per thread

Hi All

I am trying to decide how best to maximize the amount of shared memory I have per thread, without leaving any threads idle.

My problem is that each thread in my algorithm has to access a lot of memory. Because these reads can't be coalesced, given how the algorithm works, I want to pre-load the data into shared memory for faster access. I can use several threads per block of memory, but I incur an overhead for each thread, as well as an overhead for splitting my memory chunks up.

So… I would like to be able to load as much memory into shared memory for each active thread as possible.

My card is a GTX 260 with 27 SMs in total.

As far as I understand, the number of threads per SM is 8 SPs per SM and 4 threads per SP, equal to 32 threads (also known as a warp). With 16 KB of shared memory per SM, I could in theory have 512 bytes of memory per thread. Is this correct?

Because if I enter this into the CUDA GPU Occupancy Calculator, I only get an occupancy of 25%, which seems less than ideal.

Thoughts and suggestions are greatly appreciated!


Henrik Andresen

32 is the absolute minimum number of active threads per multiprocessor. Your hardware supports up to 32 warps, or 1024 threads, simultaneously. Shared memory is managed and allocated at the block level rather than per thread, so it is better to think in terms of total shared memory per block (then divide by the number of threads per block you are using to get the equivalent number of bytes per thread).
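A sketch of that per-block view (the kernel, names, and sizes here are hypothetical, not from the original code): the block cooperatively stages a chunk into shared memory with coalesced reads, and each thread then accesses the tile in whatever pattern it needs.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each block stages one chunk of global data into
// shared memory, then every thread works on the tile at shared-memory speed.
__global__ void stageAndProcess(const float *in, float *out, int chunkSize)
{
    // Dynamic shared memory: one allocation per BLOCK, sized at launch time.
    extern __shared__ float tile[];

    // Cooperative, coalesced load: consecutive threads read consecutive words.
    for (int i = threadIdx.x; i < chunkSize; i += blockDim.x)
        tile[i] = in[blockIdx.x * chunkSize + i];
    __syncthreads();  // make the whole tile visible to every thread

    // Each thread may now read the tile in an arbitrary (uncoalesced)
    // pattern without the global-memory penalty.
    float acc = 0.0f;
    for (int i = 0; i < chunkSize; ++i)
        acc += tile[(i * 7 + threadIdx.x) % chunkSize];  // placeholder pattern
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Launch: 256 threads per block, 16 KB of dynamic shared memory per block,
// i.e. 16384 / 256 = 64 bytes (16 floats) of tile per thread:
//   stageAndProcess<<<numBlocks, 256, 16384>>>(d_in, d_out, 4096);
```

The third launch parameter is the per-block dynamic shared-memory size in bytes, which is what makes the "per block, not per thread" accounting concrete.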

The occupancy calculator tells you how many active warps per multiprocessor there will be for a given set of execution parameters. On a compute 1.3 card, 25% = 8 active warps per MP = 256 threads. So my guess is you used 256 threads per block with 16 KB of shared memory per block and fewer than 64 registers per thread. That won't give you 512 bytes per thread, only 16384 / 256 = 64.

You are correct, of course. I only get 3% occupancy with the 32 threads.

Should high occupancy be something I aim for, and how should this metric be used in my design?

Thank you


On the GT200, there are instruction pipeline latencies and other things which need to be covered, and it turns out you need 6 active warps, or 192 threads, per multiprocessor to do that (i.e. 18.75% occupancy). Beyond that, how much performance improves with increasing occupancy is code dependent, but it is pretty rare to see any gains above about 75% occupancy. So the design aim should be something like "aim for a block size that gives at least 18.75% occupancy and maximizes performance". Usually that means experimenting with block sizes to see which is fastest, and it can vary widely from code to code.
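That block-size experiment is easy to automate. A minimal host-side sketch, assuming a kernel of your own (`myKernel` below is a hypothetical stand-in) that runs correctly at any of the candidate sizes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical benchmark loop: time the same kernel at several block sizes
// and report the fastest. On GT200, 192 threads per SM (6 active warps)
// is the floor worth including among the candidates.
void findBestBlockSize(const float *d_in, float *d_out, int n)
{
    const int candidates[] = {64, 96, 128, 192, 256, 384, 512};
    const int numCandidates = sizeof(candidates) / sizeof(candidates[0]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int bestBlock = 0;
    float bestMs  = 1e30f;
    for (int i = 0; i < numCandidates; ++i) {
        int block = candidates[i];
        int grid  = (n + block - 1) / block;  // cover all n elements

        cudaEventRecord(start);
        // myKernel<<<grid, block>>>(d_in, d_out, n);  // your real kernel here
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; bestBlock = block; }
    }
    printf("fastest block size: %d (%.3f ms)\n", bestBlock, bestMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

In practice you would warm up the kernel once before timing and average several runs per candidate, since the first launch includes one-time setup costs.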

OK, thx a bunch.

I’ll try to achieve that and see what I get.