Shared memory problem

I have a problem with shared memory that I don’t fully understand.

I have functionality like this:

__shared__ unsigned char sh_progs[2048];//2048 comes from maximum number of nodes in a binary tree of depth 11 (2047 actually)
__shared__ unsigned char sh_adf1_progs[2048];
__shared__ unsigned char sh_adf2_progs[2048];

The variables sh_adf1_progs and sh_adf2_progs are only filled with data if another parameter, nrProgsArrays, is > 1. So it looks like this (part of a loop):

	sh_progs[offset] = progs[base+offset];
	if(nrProgsArrays==2)
		sh_adf1_progs[offset] = adfProgs1[(int)adfProgs1[programIndex]+offset];
	if(nrProgsArrays==3){
		sh_adf1_progs[offset] = adfProgs1[(int)adfProgs1[programIndex]+offset];
		sh_adf2_progs[offset] = adfProgs2[(int)adfProgs2[programIndex]+offset];
	}
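
For context, the whole load looks roughly like the sketch below (simplified: the kernel name and the base-offset arithmetic are just placeholders for what I actually do), with a __syncthreads() before the data is used:

__global__ void loadPrograms(unsigned char *progs, unsigned char *adfProgs1,
                             unsigned char *adfProgs2, int nrProgsArrays)
{
	__shared__ unsigned char sh_progs[2048];
	__shared__ unsigned char sh_adf1_progs[2048];
	__shared__ unsigned char sh_adf2_progs[2048];

	int base = blockIdx.x * 2048;	// simplified per-block base offset
	for (int offset = threadIdx.x; offset < 2048; offset += blockDim.x) {
		sh_progs[offset] = progs[base + offset];
		if (nrProgsArrays >= 2)
			sh_adf1_progs[offset] = adfProgs1[base + offset];
		if (nrProgsArrays == 3)
			sh_adf2_progs[offset] = adfProgs2[base + offset];
	}
	__syncthreads();	// the loaded data is now visible to every thread in the block
	// ... call stackInterpreter(...) on the shared arrays from here ...
}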

What I do here is move a small part of a very big array that is kept in global memory into shared memory. The data that is copied is unique to the block (I'm using 2048 blocks) and will be used very often, so having it in shared memory should increase performance quite a lot, at least that was my intention.

The only place I use this data in the kernel is in a call to another function, where it is used frequently. It looks like this:

if(nrProgsArrays==1)
	stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==2)
	stackInterpreter(sh_progs, sh_adf1_progs, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==3)
	stackInterpreter(sh_progs, sh_adf1_progs, sh_adf2_progs, stack, bar, lastSignal);

My problem is this: if nrProgsArrays=1, so that sh_adf1_progs and sh_adf2_progs are not really used at all, the execution takes X time. But if I change the calls to pass NULL instead of sh_adf1_progs and sh_adf2_progs (because they are not used when nrProgsArrays=1), so that it looks like this:

if(nrProgsArrays==1)
	stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==2)
	stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==3)
	stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);

then the execution takes 0.5 X time. So for some reason it takes twice as long when the arguments (which aren't used, because nrProgsArrays=1) aren't NULL.
How can this happen?

I also tried skipping shared memory for sh_adf1_progs and sh_adf2_progs and passing the global memory pointers directly into the function (so the data still resides in global memory, I assume), and that was also faster than using shared memory.
I think I may be missing some important aspect of shared memory here; perhaps there is some threshold beyond which shared memory is not as beneficial because some swapping overhead comes into play?

I’m not sure I really understand what you are doing, but I’ve noticed two things:

1. How does the size of sh_progs relate to the fact that you are using 2048 blocks?

   I’d rather expect that the size of sh_progs times the number of blocks gives the size of progs, i.e. the total size of the array you are processing.

2. Accesses to contiguous parts of sh_progs (in shared memory) will cause 4-way bank conflicts. On devices with compute capability 1.2 or higher, accesses to progs (in global memory) will not suffer from these due to internal reorder buffers. To avoid the bank conflicts, you could either access sh_progs with a stride of 4, or declare sh_progs as unsigned int (which would then of course restrict you to a smaller part of the global array).

Thanks for the answer!

You are absolutely right that the size of sh_progs has nothing to do with the fact that I'm using 2048 blocks. That was a mistake on my part when I added that comment for this post.

The reason the size of the arrays is set to 2048 is so they can hold a binary tree with a max depth of 11. (I edited my first post to avoid further confusion.) I am using 2048 blocks though, so I mixed things up when I wrote the post.

All nodes are represented as integers in postfix notation. No integer is bigger than 200, which is why I am using unsigned char, so I don't use more shared memory than I have to.

This sounds promising but I'm not sure I totally understand what you mean here. What does “access sh_progs with a stride of 4” mean? Is the access pattern to shared memory dependent on the data type that I store in the memory?

Access to shared memory is optimized for contiguous chunks of 32-bit words, as that is the native size of the most common data types (int, float, 8-bit RGBA).

If you can arrange things so that thread n accesses sh_progs[4n+const] then no bank conflicts will occur, as every access goes into its own 32-bit word (inside a contiguous address range). For 16-bit types, you would need accesses to 2n+const. 32-bit types need accesses at n+const, as this is what the device is optimized for. Contiguous access to 64-bit types will cause bank conflicts again, as threads n and n+8 go into the same memory bank. This should be explained in the programming guide.

If you can’t change the access pattern, it might (or might not) still be worth wasting 3/4 of the shared memory by using ints where chars would suffice in order to have bank-conflict free access.
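
To make that concrete, here is a rough sketch of the two variants; the kernel skeleton and the names out and k are placeholders, not code from this thread:

__global__ void bankConflictExample(unsigned char *out, int k)
{
	// Variant 1: keep unsigned char, but have thread n read byte 4*n + k
	// (with k constant, 0..3), so every thread of a half-warp touches its
	// own 32-bit word and therefore its own bank.
	__shared__ unsigned char sh_progs[2048];
	// ... fill sh_progs cooperatively and __syncthreads() first ...
	unsigned char c = sh_progs[4 * threadIdx.x + k];

	// Variant 2: spend four times the shared memory and keep one value per
	// 32-bit word, so the natural contiguous access at index n + const is
	// already free of bank conflicts.
	__shared__ unsigned int sh_progs_wide[2048];
	unsigned int v = sh_progs_wide[threadIdx.x];

	out[threadIdx.x] = c + (unsigned char)v;	// keep the reads from being optimized away
}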

Ok, thanks for a good explanation.

By the way, my first post was probably a bit unclear, so I'll give a short summary of what I meant.

My intention was for the code to be general in some sense, so that I could use the same code regardless of how many of the shared memory variables are actually used. So I always declare all the shared memory variables:

__shared__ unsigned char sh_progs[2048];//2048 comes from maximum number of nodes in a binary tree of depth 11 (2047 actually)
__shared__ unsigned char sh_adf1_progs[2048];
__shared__ unsigned char sh_adf2_progs[2048];

I'm using binary trees for my particular problem, and the number of trees needed can range between 1 and 3. The arrays contain a postfix notation for these binary trees. The first array is always filled with data. If the variable nrProgsArrays=2, that means I want to use two trees, and sh_adf1 will also be filled with data. If nrProgsArrays=3, both sh_adf1 and sh_adf2 will be filled with data. So if nrProgsArrays=1, there is no data available for sh_adf1 and sh_adf2, and hence they can't be filled with anything. The number of trees used changes between different invocations of the kernel.

Because I have three different possibilities, I have three different function calls available: one for nrProgsArrays=1, one for nrProgsArrays=2 and one for nrProgsArrays=3. (This probably isn't needed anymore, but it remained from before, when I used global memory.) The only difference between them is that I pass NULL for arrays that are not used. For example, when nrProgsArrays=1, sh_adf1 and sh_adf2 are not used, so NULL is passed instead of them. (This may seem totally unnecessary now, but before this I did some operations in the called function, so the arrays had to be filled or I would read out of memory, etc.) When I used the code below and had nrProgsArrays=1, the execution time was X seconds.

if(nrProgsArrays==1)
	stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==2)
	stackInterpreter(sh_progs, sh_adf1_progs, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==3)
	stackInterpreter(sh_progs, sh_adf1_progs, sh_adf2_progs, stack, bar, lastSignal);

Then the only thing I changed, as a test, was to replace sh_adf1 with NULL when nrProgsArrays==2 and sh_adf1, sh_adf2 with NULL when nrProgsArrays==3, so that afterwards the code looked like this:

if(nrProgsArrays==1)
	stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==2)
	stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==3)
	stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);

That was the only thing I changed, and I expected it to have no influence whatsoever because the execution won't go into those branches when nrProgsArrays==1. But to my surprise, the execution time was half of what it was before.

So what confuses me is how this change, which I think should have no effect at all, causes this huge performance difference. The reason I think it should have no effect is that the changed code is not even executed when nrProgsArrays==1, which it is here. The only explanation I can think of is that the compiler skips the allocation of the shared memory in the second case because it can see that sh_adf1 and sh_adf2 are never used. But even if that is the case, I can't see why this huge performance difference occurs. Then there must be some kind of overhead when using more shared memory.

For example, if in one case I used 40% of the available shared memory and in another case 60%, there could be overhead if the context were swapped out so that another block could execute while the first block is waiting on some operation. I don't even know whether a block is "put away" while waiting on some expensive operation like an access to global memory, so that another block can execute in the meantime. If that were the case, then the shared memory would probably have to be copied to global memory when it occupies 60% of the available memory, because otherwise the new block wouldn't have room for its shared memory. If both used 40%, there shouldn't be any need to copy the shared memory to global memory, because both would fit into the available memory.

I hope what I'm trying to say makes sense; I am just guessing, though, because I can't really understand what is happening.

What I said about bank conflicts was based on the assumption that you have mostly contiguous accesses to shared memory. Given that you actually operate on trees, I guess this is not the case though.

I think this is exactly what is happening. Note that functions on the device are always inlined, so the compiler is free to optimize different invocations of the same function according to the specific parameters used. And the Open64 compiler used for Cuda is known for its aggressive optimization, so I fully expect it to optimize the arrays away if you don’t read from them (even if you initialize them to zero).

No such swapping occurs. However, with 40% of shared memory used, two blocks can execute simultaneously on each multiprocessor, mutually filling the gaps while the other block waits on some operation, as you described. This way execution of the second block is almost free, so the throughput can be up to twice as large as when 60% of shared memory is used and only one block can execute on each multiprocessor at any one time.
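
As a rough back-of-the-envelope illustration, assuming a compute capability 1.x device with 16 KB of shared memory per multiprocessor and ignoring the register and blocks-per-multiprocessor limits (and the small amount of statically allocated shared memory):

3 × 2048 bytes = 6144 bytes per block  →  16384 / 6144 = 2 blocks per multiprocessor
1 × 2048 bytes = 2048 bytes per block  →  16384 / 2048 = 8 blocks per multiprocessor

So if the compiler really does drop the two unused arrays, several more blocks can be resident at once, which would fit the factor of two you are seeing.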

How many threads do you use per block? The scheduling is actually done in warps of 32 threads, not in full blocks. So to take full advantage of the gaps as said above, you would want six warps running concurrently on each multiprocessor (more than that still helps to hide global memory latency). For reasons not discussed here, you also want the number of threads per block to be at least 64 (or a multiple of that) for optimum throughput. Numbers for Fermi are different.

Check “occupancy” in the programming guide.

As said above, no such swapping to main memory happens. Rather, if no further block can be launched due to lack of shared memory, the multiprocessor just sits idle, and all clock cycles during the latency of operations in the first block are just wasted.

It absolutely makes sense. Hope you got enough insight now to see why this is happening.

Thanks a lot for a great answer. It has been night here, so I've been asleep; that's the reason for the late reply.

I'm starting to understand how things work. My problem most likely comes from the fact that no further block can be launched due to lack of shared memory, so when I increase the shared memory used, the positive effect is much smaller than the negative effect of the multiprocessor just sitting idle because it can't launch more blocks.

I am only using 32 threads at the moment, so I might benefit from increasing the number of threads to a multiple of 64, as you suggested.

I am gonna read about “occupancy” in the programming guide now =)

It’s been night here as well… :)

If you can increase the number of threads per block without increasing the shared memory needs, that would be the easiest way to improve occupancy. You would need at least 192 threads if only one block fits into shared memory concurrently.

The other option of course is to reduce the shared memory need per block.
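
For the first option, the change is basically just the launch configuration (the kernel name and argument list below are placeholders); since the shared arrays are per block, more threads per block do not increase the shared memory use, the cooperative load loop just takes fewer iterations per thread:

dim3 grid(2048);	// 2048 blocks, as in your posts
dim3 block(128);	// was 32; use 256 if only one block fits per multiprocessor
stackKernel<<<grid, block>>>(progs, adfProgs1, adfProgs2, nrProgsArrays);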

Oh, I thought we only had nights here in Sweden =)

I checked the Cuda_occupancy_calculator, which was interesting for seeing how things affect each other. I compiled my code with flags so I could see how much shared memory and how many registers I use.

The usage of shared memory was a major concern for me, but another thing I noticed was that I use loads of registers, and that affected performance a lot as well. I am using 57 registers at the moment, which sounds like quite a lot. The problem I have is quite complex, so I need quite a few variables, but I think I should be able to cut down the register use considerably. Is the relation between the number of registers used and the variables in my code so simple that one register is needed for each unique variable?

Another thing I noticed in the Cuda_occupancy_calculator that caught my attention was this line: “Total # of 32-bit registers / Multiprocessor = 16384”. Can this be correct? Can each MP really have 16384 registers? Doesn't an ordinary CPU have something like 20 registers?

I can reassure you we also have nights in the UK.

Not necessarily, as the compiler does quite a bit of optimisation in between. This might either lead to more registers needed (e.g. to hold common subexpressions), or fewer (if variables can be optimised away).

Yes, the architecture is quite different from a CPU. Registers are actually more like per-thread on-chip memory. So you'll have to multiply the number of registers from nvcc's output (57 in your case) by the number of threads per block to get the number of registers needed per multiprocessor. Thus the 57 registers aren't a big concern; you can still reach the 192 threads per multiprocessor required for optimal scheduling (in fact you can even have 256 threads).

So if shared memory only allows one block per MP, I’d recommend 256 threads per block. Otherwise, try 128, so that two blocks can fit simultaneously.
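
Ignoring allocation granularity, the register arithmetic from above works out like this:

57 registers/thread × 256 threads = 14592 ≤ 16384 registers per multiprocessor
57 registers/thread × 320 threads = 18240 > 16384, so much more than 256 threads would not fit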

Ok thanks again! I really appreciate your good answers.