So I’m having a bit of a problem that’ll take a little while to explain, although it all boils down to the fact that the GPU doesn’t support function pointers. On the GPU I’ve got a big library of complex-variable functions such as:
float2 sin(float2 z)
float2 exp(float2 z)
float2 tanh(float2 z)
etc…
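(For context, these are just float2 overloads implementing the usual complex identities; roughly like this, give or take how my real versions are written:)

__device__ float2 exp(float2 z) {
    // e^(x+iy) = e^x * (cos y + i*sin y)
    float e = expf(z.x);
    return make_float2(e * cosf(z.y), e * sinf(z.y));
}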
And I’ve also got another function which is executed once per thread. Let’s say
__device__ float2 doMath(float2 z) {
    return z + sin(exp(z + z * z));
}
The thing is that during the lifetime of the application, the body of the doMath function needs to change, say to z * z + tanh(z) / z, or something like that. The rest of the shader stays constant; only the function being computed changes. With function pointers this would be fairly easy, but as it stands, my options are limited and somewhat complex.
Currently I am solving the problem by modifying the .cu file, recompiling it to a shared object, and then dynamically relinking to the library from inside my application. This process takes about 3 seconds (almost all of it in the compilation phase), which is unacceptable for my purposes. The application is ported from a GPGPU application using GLSL, in which I did a similar process. That only took about 0.5 s, so I’m guessing I can do better than the 3 s. I’m sure, though, that the nvcc compiler is being much, much smarter than whatever was compiling the GLSL code, which I imagine is what’s responsible for the extra time.
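Roughly what the rebuild/reload cycle looks like right now (file and symbol names are just placeholders, not my actual ones):

// regenerate doMath.cu with the new expression, then rebuild the shared object:
//     nvcc -Xcompiler -fPIC -shared -o libdomath.so doMath.cu    <- the ~3 s is almost all here
// then back in the application:
#include <dlfcn.h>

typedef void (*launch_fn)(float2 *dData, int n);

void *handle = dlopen("./libdomath.so", RTLD_NOW);
launch_fn launch = (launch_fn) dlsym(handle, "launchDoMath"); // extern "C" wrapper that runs the kernel
launch(dData, n);
// ... and dlclose(handle) before the next rebuild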
So I’ve got a few ideas of where to go with this, none of which I’ve seriously investigated, as I’d like some advice first.
- Compile the shader with less aggressive optimization. Will this get me anywhere? Any general sense of how much of a performance hit this would take? Is there anything I could do in structuring my shader to compensate? (A sketch of the flags I have in mind is after this list.)
- Switch from the runtime API to the driver API. Since the driver API’s modules still take cubin files as input, I’d still need to do some of the recompilation gymnastics to get the cubin files, but maybe I’d gain something from this that I don’t quite understand. As the driver API’s modules seem to somewhat support the idea of dynamically loading/reloading code, I suspect there may be some advantage to doing things this way. (See the module-loading sketch after this list.)
- Put the doMath function in some kind of library and only recompile that file. I’ve tried this and it doesn’t seem to work. I think I recall reading somewhere that all device function calls are inlined, which would explain why this fails. Is there anything I’m missing here?
- Try to do something extremely clever to fake function pointers directly on the GPU, or maybe just for this one particular circumstance. (The only concrete version of this I’ve come up with is the interpreter sketch after this list.)
- Start writing PTX or cubin files manually in order to avoid some of the compilation process. This seems like madness. Would it even get me anywhere?
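For the first idea, the concrete thing I’d try is just dialing the optimizer down, something like this (file names again just placeholders):

# -O0 for the host compiler, -Xptxas -O0 for the PTX assembler
nvcc -O0 -Xptxas -O0 -Xcompiler -fPIC -shared -o libdomath.so doMath.cu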
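For the driver API idea, the module interface I’m picturing is roughly this (error checking omitted, kernel/file names made up; assumes cuInit() and a context are already set up):

#include <cuda.h>

CUmodule module;
CUfunction kernel;

// after a fresh rebuild of the cubin:
cuModuleLoad(&module, "doMath.cubin");
cuModuleGetFunction(&kernel, module, "doMathKernel");
// set arguments and launch as usual; then, when the expression changes:
cuModuleUnload(module);
// rebuild the cubin and load the new one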
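For the fourth idea, the only concrete scheme I’ve come up with so far is storing the expression as a tiny postfix opcode program in constant memory and interpreting it on the GPU, so changing the function is just a cudaMemcpyToSymbol rather than a recompile. Very rough sketch (opcodes, stack size, etc. invented for illustration):

// hypothetical opcodes for a stack-based evaluator
#define OP_PUSH_Z 0
#define OP_SIN    1
#define OP_EXP    2
#define OP_TANH   3
#define OP_ADD    4
#define OP_MUL    5
#define OP_END    6

__constant__ int d_program[64];   // written from the host with cudaMemcpyToSymbol

__device__ float2 doMath(float2 z) {
    float2 stack[16];
    int sp = 0;
    for (int pc = 0; d_program[pc] != OP_END; ++pc) {
        switch (d_program[pc]) {
            case OP_PUSH_Z: stack[sp++] = z; break;
            case OP_SIN:  stack[sp - 1] = sin(stack[sp - 1]);  break;
            case OP_EXP:  stack[sp - 1] = exp(stack[sp - 1]);  break;
            case OP_TANH: stack[sp - 1] = tanh(stack[sp - 1]); break;
            // + and * here are my existing complex float2 operators
            case OP_ADD:  stack[sp - 2] = stack[sp - 2] + stack[sp - 1]; --sp; break;
            case OP_MUL:  stack[sp - 2] = stack[sp - 2] * stack[sp - 1]; --sp; break;
        }
    }
    return stack[0];
}

The obvious worry there is how much the switch and the stack traffic cost compared to the straight-line code nvcc currently generates.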
Any other thoughts? I’ve been racking my brain on this for a while now and could use some advice. Thanks!