optimizing registers by using shared memory when specifying -maxrregcount (maximizing the utility of spare shared memory)

I noticed that my program is only using 200 bytes out of the 16K/block shared memory, but the register count is nearly 60. When I used -maxrregcount 30, nvcc did reduce the register count, but it uses the slow local (global) memory instead. I use a block size of 128, which means the remaining shared memory could hold an extra ~30 registers' worth of data per thread.

I am wondering if there is a way to let nvcc use the spare shared memory, rather than local memory, for register optimization. I admit that by hand-coding shared memory arrays I can do this myself, but it makes the code very verbose and difficult to extend.

If this feature does not exist, may I request that it be added in a future CUDA release?

I don’t see how it is possible to spill to shared memory. The amount of available shared memory per thread is not known at compile time.

nvcc should know how much shared memory a kernel consumes, and the hardware shared memory limit follows from the “-arch compute_1?” option, so what else is not known at compile time?

But the execution configuration isn’t known at compile time: how many threads per block, and how much dynamic shared memory will be requested. The compiler would have to know the available shared memory resources on a per-thread basis. That is unlike registers and local memory, where there are hard per-thread limits which are known a priori.

ok, I guess the block size is not known at compile time :(

Any coding suggestions for converting registers to a shared memory array? What I have in mind is something like

extern __shared__ float sreg[];  // dynamic shared memory, sized at launch

 sreg[threadIdx.x] = ...;  // use sreg[threadIdx.x] as a register var

but the code will look messy when I replace 30 of my registers this way …

Be really mindful of shared memory bank conflicts. It is pretty easy to cause a lot of warp serialization if you are not careful. The other issue is that any dynamic shared memory allocation you make at runtime coexists with your “spilled” registers, so you need to code in offsets everywhere in that case.
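For illustration, here is a minimal sketch of that layout (the kernel name and the SREG macro are invented for this example, not anything nvcc generates): the kernel’s real dynamic shared data goes at the start of the allocation, and the hand-spilled registers live behind it, one array per “register” so consecutive threads touch consecutive banks:

extern __shared__ float smem[];

__global__ void my_kernel(int nSharedFloats)
{
    float *sdata = smem;                  // the kernel's own shared data
    float *sregs = smem + nSharedFloats;  // hand-spilled registers start here

    // "register" i of this thread; one array per register, so consecutive
    // threads touch consecutive banks and no conflicts occur
    #define SREG(i) sregs[(i) * blockDim.x + threadIdx.x]

    SREG(0) = 1.0f;             // use SREG(0), SREG(1), ... like named registers;
    SREG(1) = SREG(0) * 2.0f;   // no __syncthreads() needed, since each thread
                                // only touches its own cells
    sdata[threadIdx.x] = SREG(1);

    #undef SREG
}

For two spilled registers you would launch with (nSharedFloats + 2 * block.x) * sizeof(float) bytes of dynamic shared memory.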

Just thinking about it for one more second: it seems to me that there is still a way to predict a lower bound for the spare shared memory per thread.

For a given CC device, nvcc knows the maximum registers per block, NR, and the maximum shared memory bytes, NS. A kernel using Nr registers and Ns bytes of shared memory can run with at most M = int(NR/Nr) threads per block; with more than M threads, CUDA will refuse to launch the kernel, so M is an upper bound on the block size. Hence nvcc can predict that the minimum amount of shared memory available to each thread is (NS - Ns)/M.

For example, my kernel uses 60 registers, and for a CC 1.1 device the register limit is 8192 per block, so the maximum threads/block is int(8192/60) = 136. If I use 200 B of a 16384 B shared memory, that leaves at least (16384 - 200)/136 = 119 B per thread, which is about 29 float registers.
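As a sanity check, here is the same arithmetic as a tiny host-side helper (the constants are just the CC 1.1 limits used in the example above):

#include <stdio.h>

int main(void)
{
    const int NR = 8192;   // max registers per block (CC 1.1)
    const int NS = 16384;  // max shared memory bytes per block (CC 1.1)
    const int Nr = 60;     // registers used by the kernel
    const int Ns = 200;    // shared memory bytes used by the kernel

    int M     = NR / Nr;         // upper bound on threads/block: 136
    int spare = (NS - Ns) / M;   // guaranteed spare bytes per thread: 119

    printf("M = %d threads, %d B spare (~%d float registers) per thread\n",
           M, spare, spare / (int)sizeof(float));
    return 0;
}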

If nvcc could use this amount of shared memory during register optimization, it could take full advantage of the low latency of shared memory. Complicated kernels, such as mine, could gain a lot from this optimization.

Do you agree?

yes, that’s the part that I really hate to do; it makes the code unreadable.

I would agree that the compiler could do this. It would be even more interesting if it was done by a just-in-time compiler after the execution configuration was known. Compared to other things, though, it may not get much attention; there are a lot of interesting compiler optimizations that could be done in CUDA (hoisting redundant computations out of a group of threads and then broadcasting the results, for example). This is really only applicable to kernels that use a lot of registers, and it may be easier to just add more registers to the next generation of hardware.

If someone wants to try it out and maybe write a paper on it, you could easily add another register allocation pass to Ocelot that would do this before executing each kernel. You can probably modify one of the existing register allocation passes and just edit the part that adds in the spill code.

OpenCL should certainly be able to do this a lot more easily.

I would say this optimization will look attractive whenever people are considering the -maxrregcount option (which is usually the case for register-intensive kernels).

Do you mean optimizing the Ocelot-emitted CPU code from PTX?

I’ve used exactly this approach in one of my kernels, which used a large number of registers.

It did help quite a bit - especially because without it the register pressure drove the occupancy to less than 25%.

You do have to take into account the following:

  1. What avidday said about smem bank conflicts.

  2. You’ll probably need more __syncthreads() calls, which take time.

  3. It seemed to me that sometimes the compiler had some overhead when evaluating a formula which used both registers and smem variables - it looked a bit slower done that way - but I’m not sure.

  4. Make sure there is enough shared mem to hold the registers - otherwise different blocks/threads will write to the same location.

I think it might be worth it. You can even re-use this smem array to hold more than one “register” if there is no dependency between them - see the sketch below.
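A minimal sketch of that reuse (all names invented): if the first temporary is dead before the second is written, both can live in the same slot, with no synchronisation since each thread only touches its own cell:

#define BLOCK_SIZE 128

__global__ void reuse_demo(const float *in, float *out)
{
    __shared__ float sreg[BLOCK_SIZE];
    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    // Phase 1: the slot holds the first "register"
    sreg[threadIdx.x] = in[i] * 2.0f;
    float r1 = sreg[threadIdx.x] + 1.0f;   // last use of the first value

    // Phase 2: the first value is dead, so the same slot holds the second
    sreg[threadIdx.x] = in[i] - 3.0f;
    out[i] = r1 * sreg[threadIdx.x];
}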

the best compiler is in your head :)

You probably won’t be able to do that - you’ll need BLOCK_THREAD_SIZE * 30 * sizeof(float) of smem, and with 128 threads that is 128 * 30 * 4 = 15360 bytes - it’s too big :)

Remember that the smem size you use will also affect the occupancy.

hope that helps,

eyal

A JIT “optimization pass” with the known execution parameters definitely makes the most sense. Doing it without execution parameters seems rather less elegant. Of course, another way to go would be to use compiler directives to pass a target set of execution parameters and have the compiler generate PTX which tries to maximize “fast” memory utilization. That starts to get much closer to the OpenCL compilation model.

If you use shared memory for explicit spilling of 32-bit registers, you shouldn’t have any bank conflicts, and no __syncthreads() is needed! You just need to keep the spilled registers as a structure of arrays:

__shared__ int reg1[BLOCK_SIZE];

__shared__ int reg2[BLOCK_SIZE];

__shared__ int reg3[BLOCK_SIZE];

(assuming BLOCK_SIZE is a compile-time constant)

and you would access it via

reg2[threadIdx.x]

Since no 2 threads access the same cell there is no need for synchronisation.

Since the access pattern is perfectly coalesced, consecutive threads access consecutive banks and no bank conflict is incurred.

If you need to store 64-bit values, split them into two 32-bit values and keep them separately (or accept a 2-way bank conflict).
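A sketch of that 64-bit split (names invented): keep the low and high halves in separate 32-bit arrays so each access stays within a single bank:

#define BLOCK_SIZE 128

__global__ void spill64_demo(const unsigned long long *in, unsigned long long *out)
{
    __shared__ unsigned int reg_lo[BLOCK_SIZE];
    __shared__ unsigned int reg_hi[BLOCK_SIZE];
    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    unsigned long long v = in[i];
    reg_lo[threadIdx.x] = (unsigned int)(v & 0xffffffffULL);  // low 32 bits
    reg_hi[threadIdx.x] = (unsigned int)(v >> 32);            // high 32 bits

    // ... later, reassemble the spilled 64-bit "register" ...
    out[i] = ((unsigned long long)reg_hi[threadIdx.x] << 32) | reg_lo[threadIdx.x];
}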

However, some registers hold the same value across several threads, or values that differ by 1 between neighbouring threads (e.g. that is often the case with loop counters). You might want to try storing a single shared base value and recomputing base + threadIdx.x on demand. A little more computation, but it may reduce your register pressure a bit. In my code I often have 1 or 2 for loops which guide the global execution, and most of the register-pressure stuff is inside. Since the loop counter register is “live” all that time, it occupies an expensive register, or is spilled to local.
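Something like this is what I mean (a sketch with made-up names): store the per-block base once in shared memory and rebuild the per-thread index on demand, instead of keeping it live in a register across the whole loop:

__global__ void base_demo(const float *in, float *out, int nSteps)
{
    __shared__ int base;
    if (threadIdx.x == 0)
        base = blockIdx.x * blockDim.x;   // same value for the whole block
    __syncthreads();                      // one sync to publish the base

    for (int step = 0; step < nSteps; ++step) {
        int i = base + threadIdx.x;       // recomputed on demand, never kept live
        out[i] = in[i] + (float)step;
    }
}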

On the other hand, do not be paranoid about local memory. Granted, it is slow, but the compiler will spill those registers that are seldom used (unless you often use all of them, obviously). It is not the size of local memory that matters, but how often you access it!

On the new Fermi architecture, local memory will be primarily cached in L1 which, from my understanding, will be as fast as shared memory (since it is physically the same thing). So you will be given 16 KB or 48 KB (depending on the setting) of fast local memory.
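If you want to bias that split toward local-memory caching, the runtime exposes a per-kernel preference for it (my_kernel here is just a placeholder):

#include <cuda_runtime.h>

__global__ void my_kernel(void) { }

int main(void)
{
    // Ask for the 48 KB L1 / 16 KB shared split on Fermi;
    // cudaFuncCachePreferShared requests the opposite.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
    my_kernel<<<1, 1>>>();
    return cudaThreadSynchronize() == cudaSuccess ? 0 : 1;
}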

Ocelot can also execute CUDA programs on any CUDA-capable NVIDIA GPU as well as any x86 CPU. It supports PTX-to-PTX transformation passes as well.

I mean running a pass on the original PTX to insert spills to shared memory when there is enough space. The pass could be run immediately before a kernel is executed, so it would have the launch parameters and could determine the amount of free shared memory.