Simple questions about block memory from the 64-bin Histogram SDK example

Hello:
Please pardon my entry-level question, but I am trying to understand memory layout and am using the 64-bin Histogram example from the SDK… In the whitepaper, it says that the maximum shared memory per block is 16,384 bytes. So for a typical block size of 192 threads/block, we are limited to 85 bytes/thread. OK. In the next sentence, they say “so at a maximum, subhistograms with up to 64 bins using single-byte counters can fit into shared memory”.
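
For concreteness, that division is just the following (the constant names are mine, not from the SDK code):

    #include <stdio.h>

    int main(void)
    {
        const int smemPerBlock = 16384; /* max shared memory per block, in bytes */
        const int threadN      = 192;   /* threads per block                     */
        /* 16384 / 192 = 85 with integer division */
        printf("%d bytes/thread\n", smemPerBlock / threadN);
        return 0;
    }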

I assume the 64-bin figure comes from (16,384 bytes/block) × (1 block / 256 threads) = 64 bytes/thread. But 64 is smaller than the value of 85 that was just calculated above. So why is 64 bytes/thread the maximum and not 85?

Next, they say that using single-byte counters introduces a 255-byte limit on the data size processed by each thread. Where does this number come from? They just stated that the maximum was 64 bytes per thread. I see that (85)(3) = 255, but if that is where the 255 figure comes from, I don’t know why.

A page or so later they say that arrays are of size 4, 8, or 16 bytes, and input data is loaded as 4-byte words. OK. Then they say the data size processed by each thread is limited to 63 double words. This is (63)(4)(2) = 504 bytes, which does not match any of the figures calculated above.

Lastly, they go on to say that the data size processed by the entire thread block is limited to (THREAD_N)(63 double words) = 48,384 bytes for 192 threads. But we have a limit of 16,384 bytes/block. So we are over the limit. Also, (63 double words/thread)(8 bytes/double word)(192 threads/block) = 96,768 bytes/block, which is twice as large as their figure of 48,384 bytes/block.

Can someone please explain all of these apparent discrepancies?

Thanks much.

John

  1. So why is 64 bytes/thread the max value and not 85?
    – The number of bins in a radix histogram is always a power of two, and we can’t have 128: 128 one-byte counters would need 128 bytes/thread, which is more than the 85-byte budget, so we go down to 64. (See the sketch right after this list.)
  2. Where does this number come from?
    – If we use a single byte to store a count, it can hold at most 2^8 - 1 = 255 before it overflows. In the worst case every byte a thread reads falls into the same bin, so each thread can safely process at most 255 bytes of data.
  3. The data size processed by each thread is limited to 63 double words,
  and 4. So we are over the limit.
    – You are referring to this sentence:
    “For the reasons mentioned above, the data size processed by each thread is limited to 63 double words, and the data size processed by the entire thread block is limited to THREAD_N * 63 double words. (48,384 bytes for 192 threads)”
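
To make 1 and 2 concrete, here is a rough sketch of the per-thread layout. It is not the actual SDK kernel (that one loads 4-byte words, uses a bank-conflict-friendly layout, and merges the sub-histograms at the end); the names and the simple byte-at-a-time loop are just for illustration:

    #define THREAD_N        192   // threads per block
    #define BIN_COUNT        64   // one-byte counters per thread (a power of two)
    #define DATA_PER_THREAD 252   // 63 four-byte words; 252 <= 255, so no counter overflows

    __global__ void histogram64Sketch(const unsigned char *d_Data)
    {
        // 192 threads * 64 one-byte counters = 12,288 bytes of shared memory,
        // which fits in 16,384. With 128 counters per thread it would be
        // 24,576 bytes and would not fit.
        __shared__ unsigned char s_Hist[THREAD_N * BIN_COUNT];
        unsigned char *myHist = s_Hist + threadIdx.x * BIN_COUNT;

        for (int i = 0; i < BIN_COUNT; i++)
            myHist[i] = 0;

        // Each thread touches at most 252 input bytes, so even if they all
        // land in the same bin the byte counter stays below 255.
        const unsigned char *myData =
            d_Data + (blockIdx.x * THREAD_N + threadIdx.x) * DATA_PER_THREAD;
        for (int i = 0; i < DATA_PER_THREAD; i++)
            myHist[myData[i] >> 2]++;   // fold 8-bit values into 64 bins

        __syncthreads();
        // Merging the 192 per-thread sub-histograms into the final 64-bin
        // result is omitted here.
    }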

Note that the 48,384 bytes in that quoted sentence is not data resident in shared memory; it is the total amount of input data the block reads in and processes. Your method of dividing things through is correct, though.
As for the 63: a “double word” in the whitepaper is a 4-byte word (the input is loaded as 4-byte words, each holding four single-byte values), not 8 bytes. With the 255-byte-per-thread limit from point 2, 255 / 4 = 63.75, so each thread can process at most 63 full words, i.e. 252 bytes. The block-level figure follows directly: 192 threads × 63 words × 4 bytes = 48,384 bytes.
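
And just to put the numbers side by side (plain host-side arithmetic; the names are mine, not the SDK’s):

    #include <stdio.h>

    int main(void)
    {
        const int threadN        = 192;             /* threads per block                   */
        const int binCount       = 64;              /* one-byte counters per thread        */
        const int wordBytes      = 4;               /* a "double word" here is 4 bytes     */
        const int wordsPerThread = 255 / wordBytes; /* 63: most full words within 255 B    */

        printf("resident in shared memory: %d bytes\n", threadN * binCount);         /* 12,288 */
        printf("processed per thread:      %d bytes\n", wordsPerThread * wordBytes); /* 252    */
        printf("processed per block:       %d bytes\n",
               threadN * wordsPerThread * wordBytes);                                /* 48,384 */
        return 0;
    }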