Simple questions about block memory from the 64-bin Histogram SDK example

Sorry for the double-post…I realize that I should have posted my question to this discussion rather than the General CUDA GPU Computing Discussion.

Please pardon my entry-level question, but I am trying to understand memory layout and am using the 64-bin Histogram example from the SDK. The whitepaper says that the maximum shared memory per block is 16,384 bytes, so for a typical block size of 192 threads/block we are limited to 85 bytes/thread (16,384/192, rounded down). OK. In the next sentence, they say “so at a maximum, subhistograms with up to 64 bins using single-byte counters can fit into shared memory”.

I assume the 64-bin figure comes from (16,384 bytes/block)(1 block/256 threads) = 64 bytes/thread. But 64 is smaller than the value of 85 that they just calculated above. So why is 64 bytes/thread the maximum and not 85?

Next, they say that using single-byte counters introduces a 255-byte limit to the data size processed by each thread. Where does this number come from? They just stated that the maximum was 64 bytes per thread. I see that (85)(3)=255, but if that is where the 255 figure comes from, I don’t know why.

A page or so later they say that arrays are of size 4, 8, or 16 bytes, and input data is loaded as 4-byte words. OK. Then they say the data size processed by each thread is limited to 63 double words. This is (63)(4)(2)=504 bytes, which does not match any of the figures calculated above.

Lastly, they go on to say that the data size processed by the entire thread block is limited to (THREAD_N)(63 double words) = 48,384 bytes for 192 threads. But we have a limit of 16,384 bytes/block. So we are over the limit. Also, (63 double words/thread)(8 bytes/double word)(192 threads/block) = 96,768 bytes/block, which is twice as large as their figure of 48,384 bytes/block.
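
For reference, the arithmetic in the paragraphs above can be laid out in a few lines. This is only a sketch: the constants are the figures quoted from the whitepaper, the variable names are made up for illustration, and the two readings of "double word" (4 bytes versus 8 bytes) are assumptions being compared, not something the whitepaper is quoted as stating. Note that the 4-byte reading reproduces the 48,384 figure exactly, and that the shared memory taken by the counters is a separate quantity from the amount of input data a block processes.

```python
# Figures quoted from the whitepaper in the question.
SHARED_BYTES_PER_BLOCK = 16_384
THREADS_PER_BLOCK = 192
BINS = 64
WORDS_PER_THREAD = 63  # "63 double words" per thread

# Shared-memory budget per thread, rounded down.
budget_per_thread = SHARED_BYTES_PER_BLOCK // THREADS_PER_BLOCK

# Shared memory actually used if every thread keeps one
# single-byte counter per bin (assumed layout).
shared_used = BINS * 1 * THREADS_PER_BLOCK

# Input data processed per block under two readings of "double word".
input_if_4_byte_words = THREADS_PER_BLOCK * WORDS_PER_THREAD * 4
input_if_8_byte_words = THREADS_PER_BLOCK * WORDS_PER_THREAD * 8

print(budget_per_thread)      # 85
print(shared_used)            # 12288
print(input_if_4_byte_words)  # 48384
print(input_if_8_byte_words)  # 96768
```

Under these assumptions, the counters occupy 12,288 bytes, which fits within the 16,384-byte limit, while 48,384 bytes would be the amount of input consumed rather than the amount of shared memory used.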

Can someone please explain all of these apparent discrepancies?

Thanks much.


Without reading the whole doc:

The 255 figure is just the maximum value of an 8-bit (1-byte) unsigned counter. So if you want to use byte counters (and not ints or something wider), each counter can only reach 255 before it overflows.

2^8-1 = 255
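
To make the overflow concrete, here is a minimal sketch that models a single-byte counter in Python. The function names are hypothetical, and the masking with 0xFF is only a stand-in for how an unsigned 8-bit value wraps around. It assumes the worst case, where every byte a thread processes lands in the same bin, which is presumably why the per-thread data size is capped at 255 bytes; if the input is read in 4-byte words, that cap works out to 63 whole words.

```python
MAX_COUNT = 2**8 - 1  # largest value an unsigned 8-bit counter can hold


def bump(counter: int) -> int:
    """Increment a simulated unsigned 8-bit counter, wrapping on overflow."""
    return (counter + 1) & 0xFF


def worst_case_count(n_bytes: int) -> int:
    """Counter value after n_bytes inputs that all fall into the same bin."""
    counter = 0
    for _ in range(n_bytes):
        counter = bump(counter)
    return counter


print(MAX_COUNT)              # 255
print(worst_case_count(255))  # 255, still correct
print(worst_case_count(256))  # 0, the counter has wrapped around
print(MAX_COUNT // 4)         # 63 whole 4-byte words fit under the cap
```

With 63 four-byte words a thread sees at most 252 bytes, so no counter can wrap even in the worst case.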