How to specify maxrregcount to clBuildProgram?

How can I specify the ‘maxrregcount’ option to the PTX assembler through OpenCL? Passing “--maxrregcount=64” to clBuildProgram gives me the following error message:

I realize this will spill some variables out to global memory, but I’m ok with that. The inner loop of my algorithm has 20-40 texture accesses. I know that loop can execute with only 64 registers based on executing it in isolation. That loop is then executed 20 times by another loop. So I can afford a few global memory accesses at that level since they will occur so infrequently. The entire purpose of this exercise is to increase the work-group size executing on a single multiprocessor. Currently, I’m bounded to 192 work items by register pressure. I’ve heard somewhere that at least 256 work items are needed to start effectively hiding texture latency. Furthermore, texture latency appears to be a constant constraint based on this paragraph from the CUDA best practices guide:
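As a back-of-the-envelope check (assuming a GT200-class GPU with 16384 registers per multiprocessor): 192 work items works out to roughly 85 registers per work item, while capping the kernel at 64 registers would leave room for 16384 / 64 = 256 work items, which is exactly the occupancy I’m after.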

I’m also curious what will change with textures in the upcoming Fermi architecture? The white papers appear to be sparse on the issue.

Thanks,

Brian

Hi,
I can’t answer your question, really, but would it help to define some variables as globals in your kernel?
Downside is you maybe waste a few registers; upside is you can select your more infrequently used vars as globals, I think.
Maybe also use some shared memory, which is faster than global.
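Roughly what I mean with the shared memory idea, just an untested sketch (the kernel name, work-group size and number of variables are made up):

#define WG_SIZE 192          /* assumed work-group size, must match the launch */
#define VARS_PER_ITEM 8      /* made-up number of spilled variables per item */

__kernel void my_kernel(/* ... existing arguments ... */)
{
    /* One slice of shared (__local) memory per work item, instead of
       private variables that eat registers. */
    __local float scratch[WG_SIZE * VARS_PER_ITEM];
    __local float *mine = &scratch[get_local_id(0) * VARS_PER_ITEM];

    mine[0] = 0.0f;
    /* ... use mine[i] wherever an infrequently used private var was ... */
}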
Correct me if I’m wrong!
Jan

Unfortunately, there’s no easy way to do that. For example, I have a bunch of private variables declared like the following:

float orientation[7];

Trying the following:

__global float orientation[7];

Yields the following error:

This makes the job of migrating variables from private to global much more difficult since the memory will have to be allocated by my host code. Furthermore, to get good performance the access will have to be nicely aligned for good memory coalescing. All quite doable, but it would be preferable if the compiler did it for me. The depressing thing is the compiler is already pushing some of my variables out to global. Reading the PTX assembly I see several “ld.local” and “st.local” instructions. I just wish I could make it a little more aggressive about it. I know it’s possible to do in the CUDA world with the maxrregcount flag to the PTX assembler, so there should be some way of doing it with NVidia’s OpenCL.
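To make that concrete, the manual version would be roughly the following sketch (the kernel name and scratch buffer are made up; the host would allocate 7 * global_size floats and pass them in):

/* Sketch only: 'scratch' is a host-allocated buffer of
   7 * get_global_size(0) floats, passed in as a kernel argument. */
__kernel void my_kernel(__global float *scratch /* , ... other args ... */)
{
    size_t gid = get_global_id(0);
    size_t gsz = get_global_size(0);

    /* Structure-of-arrays layout: element i of every work item is
       contiguous, so neighbouring work items touch neighbouring
       addresses and the accesses coalesce. */
    #define ORIENTATION(i) scratch[(i) * gsz + gid]

    ORIENTATION(0) = 0.0f;
    /* ... use ORIENTATION(i) wherever the private orientation[i] was ... */
}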

Also, I’m already using the limit on shared memory for other data, so unfortunately I can’t spill into that memory space.

Thanks,

Brian

Hi Brian,

Pity you/we’re up against a barrier that does not exist in CUDA. Hopefully, there will be an equivalent of maxrregcount or some other means to influence register use in OpenCL.

Only thing: perhaps just declaring:
float orientation[7];
outside a function body will put it in global memory without needing to allocate it through your host code. It shouldn’t cause a warning.
I didn’t test this with a very small kernel; I think the variable is required to be visible to threads in other work-groups / running on other SMs as well, so it can’t live in a register private to a thread or in shared memory visible only to a single work-group. This would fail if you need the variable to be private to / modifiable by several threads, but if you can assign the modification to a single thread and all the other threads just need to read it, it should be OK.
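An untested sketch of what I mean, in case it isn’t clear (the kernel name is made up):

/* Untested: declared at file scope, outside any function, hoping the
   compiler places it in global memory. */
float orientation[7];

__kernel void my_kernel(/* ... */)
{
    /* Only safe if a single work item does the writing and the rest
       only read, as described above; the barrier only orders things
       within one work-group. */
    if (get_local_id(0) == 0)
        orientation[0] = 1.0f;
    barrier(CLK_GLOBAL_MEM_FENCE);
    float x = orientation[0];
    /* ... use x ... */
}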
Anybody, correct me if I’m wrong.

Then, coalescing is something you need to achieve by orchestrating the threads; it’s not clear that a compiler can do that for you, so you might want to do it yourself sooner or later anyway, to achieve optimal parallelisation.

Last attempt at advice: save registers by recalculating. Write ugly code or make it look better through macros. The notion is that recalculation is often more efficient in parallel GPU code than storing/loading a variable.
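A trivial made-up example of the recalculation trick ('theta', 'result' and 'w' are just stand-ins):

/* Before: the value stays live in a register across the whole loop body. */
float sin_t = sin(theta);
/* ... much later ... */ result += sin_t * w;

/* After: recompute at each use, hidden behind a macro so the code stays
   readable; no register has to hold the value across the loop body. */
#define SIN_T (sin(theta))
/* ... much later ... */ result += SIN_T * w;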

good luck,
Jan

That worked! Better yet, it improved my kernel’s performance! Though it looks a little scary. Where in the OpenCL standard does it say that variables declared outside of functions are in global device memory? I ask because the array appears to behave like a private variable of each work item.

My assumption is that NVidia’s compiler writers know how to arrange work item privates into global memory to achieve good coalescing. I think it’s a safe one to make. Furthermore, the upcoming Fermi architecture will have an L1 cache for global memory, so those extra private variables in global memory shouldn’t hurt quite so much.

This was my first attempt. I pinpointed which variables to remove by commenting them out until I got the work group size I wanted. Then I moved all the math inside the loop (a lot of math), and the compiler took registers right back, !@#$%^&!

Well it shouldn’t work like that AFAIK.

The standard says

__constant is implicitly global, but simply declaring “float orientation[7]” at file scope doesn’t even imply __global, let alone __constant.

Such a declaration is undefined in the standard:

It doesn’t say what the default address space for file scope variables is. And there shouldn’t be one, IMO; the first quotation says we need to be explicit about the address space of file scope variables. Or perhaps the default address space could be __constant, I guess.

A proper __constant file scope variable would be read only and would need to be initialized at compile time. If a compiler allows mutable file scope variables (like CUDA does), it has far-reaching consequences: with constant, read-only variables the only dependencies between kernels are the ones carried by the arguments (writable buffers and images). If we have mutable file scope vars, any reasoning about kernel dependencies is impossible without thorough code analysis. This might be particularly evil with out-of-order execution or multi-device computation…
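To illustrate (values made up), the first declaration below is the only kind of file scope variable the spec clearly allows; the second is the undefined one under discussion:

/* Legal: __constant, read only, initialized at compile time. */
__constant float coeffs[3] = { 0.25f, 0.5f, 0.25f };

/* Undefined by the spec: no address space qualifier, and it would have
   to be writable for the use case in this thread anyway. */
float orientation[7];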

Do you actually write to orientation?

Yes.

So the verdict is that it works just fine on the NVidia implementation. The compiler creates a private array (apparently in global memory since it reduces my register usage) and all my tests pass. On OSX I get the following error:

So apparently this is non-standard OpenCL. So I’m back to square one.

Sorry to hear that. Thanks anyway for pointing all this out. The solution seemed obvious; I didn’t think of a reason why vars in global memory should not be allowed at file scope…

For the compiler and specification, I would argue for an implementation of explicit __global vars at function scope. Allocation would be rather similar to the dynamic allocation already needed for registers and shared memory. The alternative is allocating global memory through the host, which would require creating (and deleting) a buffer for each kernel invocation. Host allocation would also make using the variables in the function more complex, e.g. an array of variables indexed by get_global_id(). The host-allocation scheme would be further complicated by multi-GPU, and lastly, the allocated variables would have an unnecessarily long lifetime.

So allowing

void afunc(void)
{
    __global float orientation[7];
    ......
}

would help a lot when somewhat larger local variables are needed.

Jan

I agree, I hope this is added in a later standard. It’s something the compiler should be able to do for the user fairly easily and in the most efficient way possible for the hardware.

Also, I found the flag I wanted. An OpenCL guru at NVidia who is now my saint told me about the “-cl-nv-maxrregcount=n” flag that can be used with the NVidia OpenCL implementation. Yay! It even provides marginal performance gains… Doh! Back to the profiler…
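In case it helps anyone else, this is roughly how I’m passing it; just a sketch, assuming ‘program’ and ‘device’ were already created the usual way:

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Build with the NVidia-specific register cap and dump the log on failure. */
void build_with_reg_cap(cl_program program, cl_device_id device)
{
    const char *options = "-cl-nv-maxrregcount=64";
    cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        char *log = malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              log_size, log, NULL);
        printf("Build log:\n%s\n", log);
        free(log);
    }
}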

-Brian