oclHistogram sample. Don't understand shared memory restrictions....

Hi Folks,
I’m having trouble understanding the memory limitations in the 64-bin sample provided by NVIDIA, oclHistogram.

The PDF doc states:

"Such strategy however introduces some serious limitations: 16 KB per average 192 work-items in a group amount to the maximum of ~85 bytes of local memory per work-item. So this approach limits the histogram resolution to 64 bins on G8x / G9x / G10x NVIDIA GPUs. From the implementation perspective, byte counters also introduce 255-byte limit to the data size processed by single work-item, which must be taken into account during data subdivision between the execution threads."

So if you have a work group size of 64, with 192 work items, and each uses 64 bytes (one byte per counter in the 64 bins), that gives 192 x 64 = 12288 bytes.

Why not just reduce the work group size to 32 work items? Then you would have 32 x 64 = 2048 bytes. You could even increase the bin count to 256 and still be under the 16 KB limit.

Obviously I’m missing something. Any ideas?

Any advice much appreciated.


Actually, the 192 work items are per work-group. I’m not sure what you mean when you talk about a work group of 64 having 192 work items.

Either you have 64 work items in a work group (work group size of 64) or you have 192 work items per work group (work group size of 192).

The 85 bytes per work item is due to the local memory size of 16384 bytes per work group divided by 192 work items, which means each work item has at most about 85 bytes of local memory that can be allocated to it.

The 255 limitation is simply that a byte counter can store values of 0 to 255, which means you have to be careful not to give any single work-item too much data to process. In the worst case, all the values a work-item processes land in the same bin, and if it processes more than 255 of them the counter will overflow (wrap the count for that bin back to zero).

I too find the explanation rather vague for a first-time OpenCL user. This example uses more advanced concepts than your simple matrix multiply. The whitepaper is not written well.

he says
“16 KB per average 192 work-items in a group amount to the maximum of ~85 bytes of local memory per work-item. So this approach limits the histogram resolution to 64 bins”

  1. How does this limit it to 64 bins?

  2. How did he arrive at 192 items in a work group? It could have been any number from 64 to 256, but he chose 192. Why?

For the 256-bin version:

  1. Why do you have to do it per warp? Please explain this concept. I don’t understand how the parallelization of the histogram is happening.

“192 work-items per work-group / 32 work-items per warp * 256 counters per sub-histogram * 4 bytes per counter = 6KB per work-group”
  2. I got the math: 6 KB per work-group. But why 6 KB when you can have 16 KB per work-group?

Oops, I double posted. Please delete this post.

  1. How did he arrive at 192 items in a work group? It could have been any number from 64 to 256, but he chose 192. Why?

I’m going to check out the oclHistogram example and white paper before looking at your other questions, but I’ll give the question of why 192 a shot.

Actually, there is no reason it has to be 192 work-items per work-group (I have to admit, the OpenCL nomenclature is just more confusing than the thread and thread block nomenclature of CUDA). However, it is a compromise (what isn’t a compromise in engineering?). If you add more threads per block, the local memory per thread is reduced, which can be a problem if you need a lot of local memory (which is why you don’t want too many threads). At the same time, you don’t want too low a number of threads per block either, as then you lose out on the benefits of parallelism (especially if you are using each thread to move data from device memory to shared memory).

Thanks a lot, film. I am looking at how to parallelize a 256-bin histogram for 1024x768 and larger images.

I totally agree, the CUDA blocks were somehow much easier to understand. Now you have authors referring to OpenCL work-groups/work-items/blocks/warps/sub-groups/compute units/PEs, and that slows you down. They should have just kept the terms simple and stated which are hardware concepts and which are logical abstractions. You find the explanations of them scattered across different PDFs.

OK, thanks, I see the reasoning for some optimal size for the work group. So I guess he did some profiling and 192 happened to be a heuristic, which could well change for a different architecture.

Thanks for the clarification on the 85 byte issue.

Sorry, when I said work group size I meant the number of work groups, that being 64.

Incidentally, I have been through the 256-bin example with a fine-tooth comb. I’ll continue to post thoughts on my OpenCL blog at http://maxopencl.blogspot.com/

Kind regards,


OK, when? I have been at it too. It has taken me on a journey through different lands in the GPU. Currently I am lost in local memory (i.e. CUDA shared mem) bank conflicts. All I can say is it’s a roller coaster.

I have some stuff up on the blog already. Did you have a look?

Yeah, it’s not easy. It is very interesting though. And the good news is that because it’s not that easy, not everyone will be doing it!

Which means, if you can make it through the pain barrier you will have some very valuable new skills.

Some new tools will be available shortly from gremedy.com, NVIDIA (already on Linux/Windows), etc. that will make life easier.

Once you’ve figured out the parallel stuff, it makes “normal” sequential programming seem dull in comparison!

I quickly skimmed through, didn’t find the histogram post. However, I am close to cracking it, one more day hopefully. Will post what I understand here to help other dumb people like me :pirate:

OK, I have fully understood how the 64-bin histogram works (5 days x 8 hours, yeah, new to GPU computing). This example has really worked me up mentally and driven me up the wall. But at the end, all I can say is WOW, this is some pretty hardcore stuff. Anyone with doubts, please ask!

Hi Folks,

I hope it is OK to reactivate this thread, but I don’t want to open a new one if the topic is almost the same. I’m trying to understand the 256-bin histogram example. Besides, it is directly copied from the CUDA version, where the white paper is much better written, but there are also some key points missing.

So I have my information from the OCL paper and this one: the CUDA whitepaper. There he speaks about:

6 warps (192 threads) * 256 counters * 4 bytes per counter == 6KB

So what does this mean exactly? Do we need just 6 KB of local mem for this solution? So could we use more warps? I have to explain this in my master’s thesis, and I hope you guys can help me out here.


The 64-bin histogram is clear to me:
16384 bytes / 192 ≈ 85 bytes means 85 bytes per thread/work-item, and so it is limited to 64 bins.

PPS: Does anyone have a good explanation of why the 256-bin histogram doesn’t work on an AMD 5830? I think it is because of the smaller local mem (8 KB) and the smaller SM? Am I right?

PPPS: If it is the case that I’m correct and we only need 6 KB of the 16 KB, could we use 16 warps on a GTX 580 with its 16 SM units? Because we need 1 KB per warp/sub-histogram.