oclHistogram sample. Don't understand shared memory restrictions....

shobogenzo · February 19, 2010, 11:04am

Hi Folks,
I’m having trouble understanding the memory limitations in the 64-bin sample provided by NVIDIA, oclHistogram.

The PDF doc states:

"Such strategy however introduces some serious limitations: 16 KB per average 192 work-items in a group amount to the maximum of ~85 bytes of local memory per work-item. So this approach limits the histogram resolution to 64 bins on G8x / G9x / G10x NVIDIA GPUs. From the implementation perspective, byte counters also introduce 255-byte limit to the data size processed by single work-item, which must be taken into account during data subdivision between the execution threads."

So if you have a work group size of 64, with 192 work items and each uses 64 bytes (one byte per counter in the 64 bin) that gives 192 x 64 = 12288 bytes.

Why not just reduce the work group size to 32 work items? Then you would have 32 x 64 = 2048 bytes. You could even increase the number of bins to 256 and still be under the 16 KB limit.

Obviously I’m missing something. Any ideas?

Any advice much appreciated.

Cheers,
Max

fjlm · February 27, 2010, 5:53pm

Actually, the 192 work items is per work-group. I’m not sure what you mean when you talk about a work group of 64 having 192 work items.

Either you have 64 work items in a work group (work group size of 64) or you have 192 work items per work group (work group size of 192).

The 85 bytes per work item is due to the local memory size of 16384 bytes per work group divided by 192 work items, which means each work item has at most about 85 bytes of local memory that can be allocated to it.
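The arithmetic behind the ~85 bytes can be checked in a few lines of Python. This is only a sketch: the constant names are made up for illustration, the 16 KB and 192 figures come from the whitepaper quote, and the restriction to a power-of-two bin count is an assumption (it lets a bin index be taken directly from the top bits of a byte).

```python
LOCAL_MEM_BYTES = 16 * 1024   # local memory per work-group (figure quoted from the whitepaper)
WORK_GROUP_SIZE = 192         # work-items per work-group (figure quoted from the whitepaper)

# Local memory available to each work-item if it is split evenly.
bytes_per_work_item = LOCAL_MEM_BYTES // WORK_GROUP_SIZE

# Assumption: one byte per counter and a power-of-two number of bins.
# Find the largest such bin count that fits in the per-work-item budget.
max_pow2_bins = 1
while max_pow2_bins * 2 <= bytes_per_work_item:
    max_pow2_bins *= 2

print(bytes_per_work_item, max_pow2_bins)
```

With 85 bytes available, 64 one-byte counters fit but 128 do not, which is where the 64-bin limit comes from.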

The 255 limitation is simply that a bin can store values of 0 to 255, which means you have to be careful not to have work groups that are too large, so that even if all the values you are writing to the bins are close to each other there is no overflow. Otherwise, it’s possible that when you write to the bins you will overflow (reset your count for the bin to zero).
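To make the 255 limit concrete, here is a small Python sketch of a one-byte counter. The helper name `increment_byte_counter` is made up for illustration; it only mimics the modulo-256 wrap of an unsigned 8-bit value and is not code from the sample.

```python
def increment_byte_counter(count):
    """Increment an 8-bit counter; the result wraps modulo 256 like an unsigned char."""
    return (count + 1) & 0xFF

count = 0
for _ in range(255):                      # 255 hits on the same bin
    count = increment_byte_counter(count)
safe = count                              # still the true count

wrapped = increment_byte_counter(count)   # the 256th hit wraps the counter

print(safe, wrapped)
```

So as long as no single counter receives more than 255 increments before it is merged into a wider counter, the byte counters stay correct.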

Dr.Synth · March 29, 2010, 8:48pm

i too find the explanation rather vague for a first time openCL user. this example uses more advanced concepts than your simple matrix mult. the whitepaper is not written well

he says
“16 KB per average 192 work-items in a group amount to the maximum of ~85 bytes of local memory per work-item. So this approach limits the histogram resolution to 64 bins”

how does this limit it to 64bins?
how did he arrive at 192 items in a work group? it could have been any number from 64 to 256, but he chose 192. why?

for the 256-bin version:

why do you have to do it per warp? please explain this concept. i dont understand how the parallelization of the histogram is happening.

“192 work-items per work-group / 32 work-items per warp * 256 counters per sub-histogram * 4 bytes per counter = 6KB per work-group”
i got the math, 6 KB per work-group. but why 6 KB when you can have 16 KB per work-group?
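The 6 KB figure in the quoted formula can be reproduced directly. This is a rough Python sketch with made-up constant names; the numbers are the ones in the whitepaper quote. Read this way, 6 KB is simply what this configuration uses (one 256-counter sub-histogram per warp), not a cap on what a work-group may allocate.

```python
WORK_GROUP_SIZE = 192     # work-items per work-group (from the quote)
WARP_SIZE = 32            # work-items per warp (from the quote)
BINS = 256                # counters per sub-histogram
BYTES_PER_COUNTER = 4     # 32-bit counters

warps_per_group = WORK_GROUP_SIZE // WARP_SIZE                    # one sub-histogram per warp
bytes_per_sub_histogram = BINS * BYTES_PER_COUNTER
local_bytes_needed = warps_per_group * bytes_per_sub_histogram

print(warps_per_group, bytes_per_sub_histogram, local_bytes_needed)
```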

fjlm · March 29, 2010, 9:32pm

“how did he arrive at 192 items in a work group? it could have been any number from 64 to 256, but he chose 192. why?”

I’m going to check out the oclHistogram example and white paper before looking at your other questions, but I’ll give the question of why 192 a shot.

Actually, there is no reason it has to be 192 work-items per work-group (I have to admit, the OpenCL nomenclature is just more confusing than the thread and thread block nomenclature of CUDA). However, it is a compromise (what isn’t a compromise in engineering?). If you add more threads per block, the local memory per thread is less which can be a problem if you need a lot of local memory (which is why you don’t want too many threads). At the same time, you don’t really want to go to a low number of threads per block as then you lose out on the benefits of parallelism (especially if you are using each thread to move data from device memory to the shared memory).
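The trade-off can be seen by tabulating the per-work-item local memory budget for a few work-group sizes. This is a sketch that assumes a 16 KB local memory split evenly between work-items; the function name and the list of sizes are arbitrary choices for illustration.

```python
LOCAL_MEM_BYTES = 16 * 1024   # assumed local memory per work-group

def local_bytes_per_work_item(work_group_size):
    """Even share of local memory for each work-item in the group."""
    return LOCAL_MEM_BYTES // work_group_size

budget = {size: local_bytes_per_work_item(size) for size in (32, 64, 128, 192, 256)}
for size, nbytes in budget.items():
    print(f"{size:3d} work-items -> {nbytes:3d} bytes each")
```

Smaller groups leave more bytes per work-item but fewer work-items to hide memory latency, so a size like 192 is one possible middle ground rather than a hard requirement.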

Dr.Synth · March 30, 2010, 7:18am

thanks a lot fjlm, i am looking at how to parallelize a 256-bin histogram for images of 1024x768 and above.

i totally agree, the cuda blocks were somehow much easier to understand. Now you have authors referring to openCL wrk-grps/wrk-items/blocks/warps/sub-groups/compute units/PEs and that slows you down. they should have just kept the terms simple and stated which are the hardware ones and which are logical abstractions. you find the explanation of them scattered in different pdf’s.

ok thanks i see the reasoning for some optimal size for the work group. so i guess he did some profiling and 192 happened to be a heuristic which could well change for a different architecture.

shobogenzo · March 30, 2010, 3:22pm

Thanks for the clarification on the 85 byte issue.

Sorry, when I said work group size I meant, the number of work groups. That being 64.

Incidentally, I have been through the 256 bin example with a fine tooth comb. I’ll continue to post thoughts on my OpenCL blog at http://maxopencl.blogspot.com/

Kind regds,

Max

Dr.Synth · March 31, 2010, 1:10pm

ok when? i have been at it too. it has taken me on a journey on different lands in the gpu. currently i am lost in local memory (ie cuda shared mem) bank conflicts. all i can say is its a roller coaster

shobogenzo · March 31, 2010, 2:24pm

I have some stuff up on the blog already. Did you have a look?

Yeah, it’s not easy. It is very interesting though. And the good news is that because it’s not that easy, not everyone will be doing it!

Which means, if you can make it through the pain barrier you will have some very valuable new skills.

Some new tools will be available shortly from gremedy.com, NVIDIA (already on Linux/Windows), etc. that will make life easier.

Once you’ve figured out the parallel stuff, “normal” sequential programming seems dull in comparison!

Dr.Synth · March 31, 2010, 8:41pm

i quickly skimmed thru, didnt find the histo post. however i am close to breaking it, one more day hopefully. will post what i understand here to help other dumb people like me

Dr.Synth · April 3, 2010, 9:52pm

ok i have fully understood how the 64-bin histogram works (5 days x 8 hours, yeah, new to gpu computing). this example has really worked me up mentally and driven me up the wall. But at the end, all i can say is WOW, this is some pretty hardcore stuff. if anyone has doubts, please ask!

DerGraue · March 27, 2011, 2:31pm

Hi Folks,

I hope it is OK to reactivate this thread, but I don’t want to open a new one when the topic is almost the same. I’m trying to understand the 256-bin histogram example. It is copied directly from the CUDA version, whose white paper is much better written, but some key points are missing there too.

So I have my information from the OCL paper and this one: CUDA whitepaper. There he speaks about:

6 warps (192 threads) * 256 counters * 4 bytes per counter == 6KB

So what does this mean exactly? Do we need just 6KB from local mem for this solution? So we could use more warps? I have to explain this in my master thesis and I hope you guys could help me out here.

Regards
DerGraue

PS:
The 64-bin histogram is clear to me:
16384 bytes / 192 ≈ 85 bytes, which means 85 bytes per thread/work-item, and so it is limited to 64 bins.

PPS: Does someone have a good explanation why the 256-bin histogram doesn’t work on an AMD 5830? I think it is because of the smaller local memory (8 KB) and the smaller SM? Am I right?

PPPS: If it is the case that I’m correct and we only need 6KB from the 16KB, we could use 16 Warps on a GTX580 with its 16 SM units? Because we need 1 KB per Warp/Sub-Hist.
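As a rough check of the "1 KB per warp" reasoning, the sketch below counts how many per-warp sub-histograms would fit in 16 KB. It is arithmetic only: the constant names are made up, and it assumes the whole 16 KB is usable, ignoring any local memory the runtime may reserve and the device's maximum work-group size.

```python
LOCAL_MEM_BYTES = 16 * 1024          # assumed usable local memory per work-group
WARP_SIZE = 32                       # work-items per warp
BYTES_PER_SUB_HISTOGRAM = 256 * 4    # 256 counters * 4 bytes = 1 KB per warp

max_warps = LOCAL_MEM_BYTES // BYTES_PER_SUB_HISTOGRAM   # sub-histograms that fit
max_work_items = max_warps * WARP_SIZE                   # corresponding work-group size

print(max_warps, max_work_items)
```

Note that this limit is per work-group and comes from the local memory size; it is independent of how many multiprocessors the GPU has.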
