Question about shared memory usage How to use as reg and volatile effect

There’s some topic discuss about reduce number of register by using shared memory.But I don’t fully understand how it work, because reg is local to the function while shared memory is shared between threads in the same block, so all the threads can change value of shared memory that may lead to undefined behavior.

If we want to use shared memory as a register, what should we do

What is the meaning of
volatile _shared T smem;

i try to test the effect of volatile shared by a small test
volatile shared int sid;
if( tid ==5)
sid = tid;
g_odata = sid;

and print out g_odata, i got
5 5 5 5 5 5 5 5 5 5 5 5 5
however without volatile , i got
0 0 0 0 0 5 0 0 0 0 0 0
Can any one explain the results.

The similar code
shared int sid;
if( tid ==5)
sid = tid;
__synthreads();
g_odata = sid;

yields the same result with volatile shared mem. Why it is.
I don’t think what happen here is not clearly explained in doc

Any idea is appreciated. Thank you

I’m not sure there’s been too much success in reducing register count by using shared memory (please post if you’ve achieved speedup in this way). The reason is that while shared memory can be one of the instruction’s operands, the destination for the result is always a register. If the result has to be stored in smem, that involves computing the result and storing it in a register first, then moving the contents of the register to smem. You can look at .ptx code to see this in action (or PTX spec that comes with CUDA toolkit).

Paulius

I had success once by putting 9 read-only parameters in shared memory, turning 27 registers into 18 regs and 9 shared memory slots, and increased occupancy a lot. Now that we have decuda, this will be a lot easier since one can closely monitor ptxas’s behavior. Volatile is to prevent nvcc from doing stupid things like CSEing shared memory (like in your first example), and is usually mandatory in 1.0.

I have one kernel that loops over 27 different iterations, a=-1,b=-1,c=-1 … a=1,b=1,c=1. These values are constant over each iteration of the loop and shared for the entire block. Putting them into shared memory reduced the register count by three. There were a few other tricks I used in that kernel I don’t recall at the moment: the total register usage reduction was from 30 down to 22.

To the OP: You can still used shared memory even when each thread accesses a different “non-shared” value. Just declare a shared array to have the width of the block size and index sdata[threadIdx.x] every time you want to access that variable. Do note that the compiler will load that value into a register to use it in computations, so putting it in shared memory may not actually reduce your register count unless that value is used only for a short time and the compiler has other values that can be dumped into the same tmp register afer you are done with sdata[threadIdx.x]. I have had a little success with this technique, but not much.

Overall, I’ve found little gain from increasing the occupancy beyond 50%. Sometimes the things you add (shared mem access) to push the occupancy higher add overhead that makes things slower in the end.

Interesting results, asadafag and MisterAnderson42.

Have you guys also tried constant memory for read-only data? In the kernel with which I’ve used constant memory, performance was equivalent to using literals in the code. Though in my case, all threads were reading the same exact value. I wasn’t trying to save registers either, just comparing constant mem vs literals. It’d be interesting to compare constant vs smem in your scenarios.

Paulius

I had tried putting three of them in constant memory. In my case, nearly (but not always) all threads are reading the same value. In the end, the performance dropped despite higher occupancy. The others are different per thread.

I’ve also had good success using shared memory as a place to store things that have to hang around a while and would otherwise consume registers. Several versions of our Coulomb potential kernels used this as a means of reducing register pressure when unrolling loops to re-use atom data multiple times, resulting in increased arithmetic intensity…

John Stone

We have a couple of kernels that are the perfect use case for the constant memory as they loop over several thousand data items with all threads reading the same elements at the same time in exactly the pattern where constant memory performs well, and we’ve had excellent results. YMMV of course…

John

Constant memory would be a good idea for my loop, but I oversimplified it a little earlier. Each block loops over 27 values a=A-1, b=B-1, c=C-1 … a=A+1, b=B+1, c=C+1 where A,B,C vary from block to block, so constant mem is not really an option there.

In some early code, I used constant memory to store device pointers rather than passing the device pointer as an argument to the function. There were significant (1-2%) performance gains, probably because the pointer didn’t need to be passed across the PCI-Express on every kernel call. Unfortunately, the API headache of detecting when these pointers change (it happens, though rarely when an array is expanded) prevented me from using it in the full application.

In my experience, constant memory always performed (a little bit) better than reading things into shared memory at the beginning of the kernel and using that. Both in the case that all threads read the same location, and in the case that each thread (potentially) reads a different one.
Note that I did not evaluate the case that every thread always reads a different value. It’s very possible that shared mem is faster in that case (especially if you can avoid bank conflicts).

Interesting,
For the following scenario what would be better?

Every thread requires access to a look up matrix for computing further results.

[1] I load the lookup table in the global memory, let each thread cache this table in
the shared memory and then work on the computation.

                                        OR           

[2] I load the table into constant memory and perform the same computations?

What effect would it have on :
[1] Performance
[2] Register Usage
[3] Bank Conflicts

See this post: http://forums.nvidia.com/index.php?showtopic=46742 for a crazy benchmark I did a while back to test this.

The answer is: it depends. If every thread in a warp always accesses the same element in the table at the same time, constant memory wins for performance. If every thread in the warp accesses different elements (even with bank conflicts) then shared memory is better.