simple questions about block memory from 64-bin Histogram sdk example

John_260 · August 4, 2008, 1:52am

Hello:
Sorry for the double post. I realize I should have posted my question in this forum rather than in the General CUDA GPU Computing Discussion.

Please pardon my entry-level question. I am trying to understand memory layout and am using the 64-bin Histogram example from the SDK. The whitepaper says that the maximum shared memory per block is 16,384 bytes, so for a typical block size of 192 threads we are limited to 85 bytes per thread (16,384 / 192 ≈ 85.3). OK. In the next sentence, it says “so at a maximum, subhistograms with up to 64 bins using single-byte counters can fit into shared memory”.

I assume the 64-bin figure comes from (16,384 bytes/block)(1 block/256 threads) = 64 bytes/thread. But 64 is smaller than the 85 they just calculated. So why is 64 bytes/thread the maximum and not 85?

Next, they say that using single-byte counters introduces a 255-byte limit on the data size processed by each thread. Where does this number come from? They just stated that the maximum was 64 bytes per thread. I see that (85)(3) = 255, but if that is where the 255 figure comes from, I don’t know why.

A page or so later they say that the arrays are 4, 8, or 16 bytes in size, and that input data is loaded as 4-byte words. OK. Then they say the data size processed by each thread is limited to 63 double words. That is (63)(4)(2) = 504 bytes, which does not match any of the figures calculated above.

Lastly, they say that the data size processed by the entire thread block is limited to (THREAD_N)(63 double words) = 48,384 bytes for 192 threads. But we have a limit of 16,384 bytes per block, so we are over the limit. Also, (63 double words/thread)(8 bytes/double word)(192 threads/block) = 96,768 bytes/block, which is twice their figure of 48,384 bytes/block.

Can someone please explain all of these apparent discrepancies?

Thanks much.

John

VrahoK · August 4, 2008, 9:43am

Without reading the whole doc:

The 255 figure is just reminding you of the maximum value of an 8-bit (1-byte) unsigned counter. If you use byte counters (rather than ints or something wider), a counter can reach at most 255 before it overflows.

2^8 - 1 = 255

Vrah

Related topics

simple questions about block memory from 64-bin Histogram sdk example (CUDA Programming and Performance) · 1 reply · 4062 views · August 5, 2008
oclHistogram sample. Don't understand shared memory restrictions.... (CUDA Programming and Performance) · 11 replies · 5417 views · March 27, 2011
the memory using about histogram (CUDA Programming and Performance) · 0 replies · 1796 views · November 1, 2007
Size limit on dynamic allocated shared memory (CUDA Programming and Performance) · 2 replies · 1526 views · November 6, 2008
Shared memory limits and cudaError_enum How to precisely determine how much of the shared memory is (CUDA Programming and Performance) · 5 replies · 2905 views · April 29, 2009
question about the sample of histogram64 (CUDA Programming and Performance) · 2 replies · 2394 views · October 12, 2007
max number of block (CUDA Programming and Performance) · 21 replies · 18138 views · April 20, 2010
Fast 256-bin histogram (CUDA Programming and Performance) · 6 replies · 2495 views · May 9, 2016
maximum number of blocks (CUDA Programming and Performance) · 3 replies · 2462 views · April 10, 2008
Maximum element amount in shared memory (CUDA Programming and Performance) · 3 replies · 3658 views · April 20, 2010