smart ideas for an interesting problem

So I’m having a bit of a problem that’ll take a little while to explain, although it all boils down to the fact that the gpu doesn’t support function pointers. So on the gpu I’ve got a big library of complex variable functions such as:

float2 sin(float2 z)
float2 exp(float2 z)
float2 tanh(float2 z)
etc…

And I’ve also got another function which is executed once per thread. Let’s say

__device__ float2 doMath(float2 z) {
    return z + sin(exp(z + z * z));
}

The thing is that during the lifetime of the application, the body of the doMath function needs to change, say to z*z + tanh(z) / z, or something. The rest of the shader remains constant, but the function being computed changes. With function pointers, this would be fairly easy, but as it stands, my options are limited and somewhat complex.

Currently I am solving the problem by modifying the .cu file, recompiling to a shared object, and then dynamically relinking to the library from inside my application. This process takes about 3 seconds (almost all of which is in the compilation phase), which is unacceptable for my purposes. The application is ported from a gpgpu application using glsl, in which I did a similar process. That only took about 0.5s, so I’m guessing I can do better than the 3s. I’m sure though that the nvcc compiler is being much, much smarter than whatever was compiling the glsl code, which I imagine is what’s responsible for the extra time.

So I’ve got a few ideas of where to go with this, none of which I’ve seriously investigated, as I’d like some advice first.

  1. Compile the shader with less aggressive optimization. Will this get me anywhere? Any general measures of how much of a performance hit this would take? Is there anything I could do in structuring my shader to compensate?

  2. Switch from the runtime api to the driver api. As the driver api’s modules still take cubin files as input, I’d still need to do some of the recompilation gymnastics to get the cubin files, but maybe I’d gain something from this that I don’t quite understand. As the driver api’s modules seem to somewhat support the idea of dynamically loading/reloading code, I suspect there may be some advantage to doing things in this fashion.

  3. Putting the doMath function in some kind of library and only recompiling that file. I’ve tried this and it doesn’t seem to work. I think I recall reading somewhere that all function calls are inlined, which would explain why this would fail. Is there anything I’m missing here?

  4. Try to do something extremely clever to fake function pointers directly on the gpu, or maybe just somehow for this one particular circumstance.

  5. Start writing ptx or cubin files manually in order to avoid some of the compilation process. This seems like madness. Would it even get me anywhere?

Any other thoughts? I’ve been racking my brain on this for a while now, and could use some advice. Thanks!
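For what it’s worth, option 4 in its simplest form would look something like the sketch below: a plain host-side C++ stand-in (real floats instead of float2, and on the gpu the selector would be passed in as a kernel argument and the helpers would be __device__ functions — the names here are made up for illustration):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical selector the host would set before each kernel launch.
enum MathOp { OP_SIN_EXP, OP_TANH_RATIO };

// Stand-ins for two bodies of doMath; real code would use float2 complex math.
static float doMathA(float z) { return z + std::sin(std::exp(z + z * z)); }
static float doMathB(float z) { return z * z + std::tanh(z) / z; }

// The "fake function pointer": dispatch on an opcode chosen by the host.
float doMath(MathOp op, float z) {
    switch (op) {
        case OP_SIN_EXP:    return doMathA(z);
        case OP_TANH_RATIO: return doMathB(z);
    }
    return 0.0f;
}
```

The obvious limitation is that this only selects among a predefined set of bodies, so by itself it can’t cover arbitrary expressions.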

Under what conditions (exactly) do you need to change the doMath function’s internals?

Does it change depending on the value of a (combination of?) variable(s)?
Does it change over time/frames/etc?
etc…

Without knowing when it’s supposed to change, and by how much - the advice people can give is limited.

It changes basically whenever I want it to. I need to be able to freely change between any syntactically valid expression, with no warning. The process that does this is not quite relevant to the question at hand. Let’s just say I hit a key on my keyboard, and the host changes the function to some other valid expression of a complex variable.

The details of this aren’t quite relevant. The question is fairly general: how best to dynamically recompile/reload a program in order to overcome the lack of function pointers. I can’t imagine anyone except myself coming up with a solution particular to my implementation.

CUDA is definitely not designed for this level of dynamic code generation. However, the latest CUDA 2.1 beta Programming Guide describes a new way to load PTX into the driver and have it compiled there (Section 4.5.3.4) without having to make an external call to ptxas. That probably speeds things up, but regenerating the PTX for your entire kernel just to change a device function might be a little tedious.

Interesting. Maybe I can compile to ptx a version of my kernel without the doMath function, or with a filler function, and then, when necessary, recompile just doMath into a ptx file, manually splice the results, and then use the driver api to load the resulting file. Sounds like madness, but it might actually work.

Is there really no way to just normally compile 2 .cu files (the main kernel, and the doMath function) to objects, and then have the linker deal with it when necessary? I’ve tried declaring doMath as an extern in the main kernel, but the compiler tells me that extern device functions are not supported. Is that the end of that?

Maybe you should use a “switch”/“case” statement and return different values for different scenarios…

Is this change deterministic, or randomized? If it’s deterministic, why not write some macros or something to make separate versions of your “outer” functions that call each version of your math function, then determine what needs to be called in your host code and go from there?

I don’t quite understand… you say the expression could be anything yet you also say function pointers will be sufficient for you? Function pointers will only let you pick from a predefined set of functions. The same that a switch() statement can accomplish.

I really do need the full generality here, so switch/case won’t work, neither will precompiling the scenarios, etc. I have a library of complex functions, along with arithmetic operations (+, -, *, /) and need to be able to render any valid string (from z*z to z*sin(cos(1.0 / z)) + z / tan(z)). There are an exponential number (in the length of the string) of possibilities. And for all practical purposes the change is random and asynchronous.

I think you’re completely right about the function pointers. I think the actual problem is the lack of recursion. A recursive algorithm with a switch statement simulating the function pointers would do the trick. But even then, I’m not sure I’d want to proceed in that fashion, as the kernel would then have the overhead of the algorithm for each thread, which seems like a waste, as the flow is known beforehand and doesn’t change across threads.

As far as the GPU is concerned, there is no linker. I don’t think there is any way for a function from one cubin to jump to a function in another cubin (which sounds like what you are asking). Also, the nvcc compiler inlines pretty much everything, so in most (all?) cases, device functions don’t exist as callable entities anyway. If you are writing your own PTX, the ISA does define call and ret instructions, but the call instruction does not support indirection.

You can replace your recursion with some clever loops, but, yes, the overhead will remain. I wonder how significant it will be. I don’t think it will be more than 5x.

E.g., you can make an interpreter for a simple bytecode. On the CPU, convert your expression into a series of operations, then loop through and execute them on the GPU.
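A runnable host-side sketch of that interpreter loop (plain C++ standing in for the kernel body; the opcode names and stream layout are made up for illustration — on the gpu each thread would walk the same stream against its own registers):

```cpp
#include <cassert>
#include <cmath>
#include <cstring>

// Made-up opcodes; a CPU front end would emit a flat stream of these.
enum { OP_END, OP_ADD, OP_MUL, OP_SIN, OP_LOADK };

// Walks the op stream against a tiny register file. On the GPU this loop
// would be the kernel body, with one set of registers per thread.
float runProgram(const int* v, float* reg) {
    while (v[0] != OP_END) {
        switch (v[0]) {
            case OP_ADD: reg[v[1]] = reg[v[2]] + reg[v[3]]; v += 4; break;
            case OP_MUL: reg[v[1]] = reg[v[2]] * reg[v[3]]; v += 4; break;
            case OP_SIN: reg[v[1]] = std::sin(reg[v[2]]);   v += 3; break;
            case OP_LOADK: {  // constant stored bit-cast into the int stream
                float k; std::memcpy(&k, &v[2], sizeof k);
                reg[v[1]] = k; v += 3; break;
            }
        }
    }
    return reg[0];
}
```

For example, with z preloaded into reg[0], the stream { MUL 1 0 0, LOADK 2 1.0f, ADD 0 1 2 } computes z*z + 1.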

That’s an interesting idea. I haven’t gotten around to optimizing my kernel yet, so I’m not sure where the bottlenecks are. I imagine it’s memory bound, so maybe the overhead wouldn’t matter so much? This function is called in the most critical section of the shader, so I’m not quite sure. It would definitely take something mighty clever, as the expressions being evaluated are arbitrary trees (because I also have binary arithmetic operations).

It’s a difficult decision though. Overall, I would prefer the overhead of the compilation to a reduction in efficiency of my kernel, if that’s what it came to. I imagine the weird ptx idea from earlier would in fact improve the compilation performance, although it’s unclear by how much. It would be difficult to do it correctly, although it seems entirely possible.

Why not build yourself a normal (i.e. host) function that parses your string and generates PTX code? You could then call one of the driver functions (can’t remember the name of it) and execute the PTX code dynamically. I don’t know how fast this would all work out, but it’s got to be somewhat faster than regenerating .cu files and actually re-compiling the code each time (though the driver would have to compile the PTX to machine code).

You could do that conversion in the GPU itself as a separate kernel.

First kernel would convert to mnemonics. (which will be faster than CPU)

Second kernel would execute them.

Yeah, something like this is what I’m thinking. I’m not sure at all how the ptx compilation stuff works though. I doubt I could just recompile and add a single function, although if I can that would be fantastic, and somebody should correct me.

What I think is more likely is that I keep a copy of the rest of the kernel, sans my function, lying around in ptx form, and then, when necessary, recompile the one function to ptx, splice it into the kernel ptx, and hand that to the driver api. This seems complicated, but possible. Can anyone give me any advice about this?

I’ve already tried compiling just the math function to ptx, and it takes a fraction of a second, which is good. Anybody have any ideas how long it would take for the driver api to compile the ptx into machine code? I guess relative to how long nvcc takes to compile?

ptxas is sometimes quick, sometimes takes ten minutes. It’s at its worst when it’s trying to shrink the register usage.

The trouble with the insert-ptx method is if you change the original kernel, you’ll have to figure out anew how to splice your piece in. (The registers change, etc.) Doesn’t seem too robust.

Can you isolate the math part into its own, separate kernel? A small kernel will also compile quick.

But is it ever slower than using nvcc? Isn’t compilation to ptx an intermediate stage in the normal nvcc compilation process?

Yep, you’re absolutely right that a simple implementation wouldn’t be robust. I’d thought about the register problem, but maybe a very smart regular expression + search/replace scheme could do this robustly. Maybe I’m being naively optimistic, as I’m not particularly familiar with ptx files, but it seems like it is theoretically possible to do this correctly/consistently.

That’s an interesting idea that I hadn’t even considered. Thanks. I’ll have to seriously think about that. I’m guessing though that it won’t quite be good enough, although it’s hard to explain why without a whole lot of other information about the system. I think it would absolutely work, but not without an unacceptable performance decrease. Although I’m not 100% sure.

If that’s true, then function pointers would ALSO be useless to you. Since the function doesn’t even exist until you get the string, there’s no function to point to.

Perhaps one promising answer, which is quite CUDA friendly, is to make a small state machine… a very small virtual machine which is just a big switch statement driven by a “program”, a simple array of opcodes.

You get your function string, and on the CPU before launching the GPU kernel, you parse that string and make an evaluation tree for it. This evaluation tree is in effect a compilation, making a recipe that basically takes the string of “sin(x+22)/(1+x*x)” and creates simple instructions to control some subset of registers R1 R2 R3 R4…

that “program” would look something like:

R2 = R1 + 22
R2 = sin(R2)
R3 = R1 * R1
R3 = R3 + 1
R1 = R2 / R3

Your state machine could use a program something like one int for the operation (+ - sin log, set-to-constant, add constant, etc) , one int to identify the destination register, and one or more values to identify the source register(s) or constants, each opcode defining its number and type of required values any way it needed.

So the above program would be something like:

add-constant 2 1 22.0f
sin 2 2
* 3 1 1
add-constant 3 3 1.0f
/ 1 2 3

where the first item on each line is actually just whatever opcode you assigned that operation to.

The CUDA code would probably be something like the following, storing registers in shared memory; each thread would have just 3 or 4 of them, more if you really needed them for super-complex evals (but you can do almost anything with only 4).

while (v[0] != END_OPCODE)
    switch (v[0])
    {
    case ADD_CONSTANT_OPCODE:
        reg[v[1]] = reg[v[2]] + ((float *)v)[3];
        v += 4; // ate 4 values from stream
        break;
    case SIN_OPCODE:
        reg[v[1]] = sin(reg[v[2]]);
        v += 3; // ate 3 values from stream
        break;
    case MULT_OPCODE:
        reg[v[1]] = reg[v[2]] * reg[v[3]];
        v += 4; // ate 4 values from stream
        break;
    // ... other operations here.
    }

The nice part is that in CUDA, the interpreting part is quite reasonable since you get no divergence in your warps because every thread is evaluating the same expression and will take the same branches.

Making such a parser/compiler/interpreter is tedious but straightforward… kind of a classic homework assignment for university computer science classes. It’s easy to google for “expression parsing” or “formula parsing”
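As a rough illustration of how small that homework assignment actually is, here’s a host-side recursive-descent parser that compiles an expression string straight into a flat op stream, plus the matching interpreter loop. Everything here is an assumption for the sketch: a stack machine rather than registers, a grammar limited to + - * / sin, parentheses, numbers, and a single variable z — a real version would add the full function library and complex (float2) arithmetic:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Made-up stack-machine opcodes; constants are stored bit-cast into the stream.
enum Op { P_END, P_Z, P_CONST, P_ADD, P_SUB, P_MUL, P_DIV, P_SIN };

struct Compiler {
    const char* s;            // cursor into the expression string
    std::vector<int> out;     // emitted op stream

    void emit(int op) { out.push_back(op); }
    void emitConst(float k) {
        int bits; std::memcpy(&bits, &k, sizeof bits);
        emit(P_CONST); emit(bits);
    }
    void skip() { while (*s == ' ') ++s; }

    // expr := term (('+' | '-') term)*
    void expr() {
        term();
        for (skip(); *s == '+' || *s == '-'; skip()) {
            char op = *s++; term(); emit(op == '+' ? P_ADD : P_SUB);
        }
    }
    // term := factor (('*' | '/') factor)*
    void term() {
        factor();
        for (skip(); *s == '*' || *s == '/'; skip()) {
            char op = *s++; factor(); emit(op == '*' ? P_MUL : P_DIV);
        }
    }
    // factor := '(' expr ')' | "sin(" expr ')' | 'z' | number
    void factor() {
        skip();
        if (*s == '(') { ++s; expr(); skip(); ++s; }
        else if (std::strncmp(s, "sin", 3) == 0) {
            s += 4; expr(); skip(); ++s; emit(P_SIN);
        }
        else if (*s == 'z') { ++s; emit(P_Z); }
        else { char* end; float k = std::strtof(s, &end); s = end; emitConst(k); }
    }
};

std::vector<int> compile(const std::string& text) {
    Compiler c; c.s = text.c_str();
    c.expr(); c.emit(P_END);
    return c.out;
}

// The interpreter loop; on the GPU this would be the per-thread kernel body.
float eval(const int* v, float z) {
    float st[16]; int sp = 0;                       // tiny evaluation stack
    while (*v != P_END) switch (*v++) {
        case P_Z:     st[sp++] = z; break;
        case P_CONST: { float k; std::memcpy(&k, v++, sizeof k); st[sp++] = k; } break;
        case P_ADD:   --sp; st[sp - 1] += st[sp]; break;
        case P_SUB:   --sp; st[sp - 1] -= st[sp]; break;
        case P_MUL:   --sp; st[sp - 1] *= st[sp]; break;
        case P_DIV:   --sp; st[sp - 1] /= st[sp]; break;
        case P_SIN:   st[sp - 1] = std::sin(st[sp - 1]); break;
    }
    return st[0];
}
```

The compiled stream changes with every keypress, but the interpreter kernel never needs recompiling, which is the whole point.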

SPWorley, I guess great minds think alike ;)