I’m trying to do an indirect jump in a kernel. I tried this with old hardware (The Official NVIDIA Forums | NVIDIA) and found that it was impossible, but it is my understanding that Fermi can handle indirect jumps, and reading through the PTX 2.1 documentation seems to confirm this.
However, I can’t get it to work. Trying the code below I get no warnings (other than not using x), but the compiler segfaults. This sort of thing is pretty easy to do in gcc, any idea how to make it work in nvcc? Is this a compiler bug?
__global__ void testkernel(int jump)
{
int x = 0;
void *jumptable[3] = {&&label0, &&label1, &&label2};
goto *jumptable[jump];
label0:
x = 4;
return;
label1:
x = 10;
return;
label2:
x = 20;
return;
}
The compiler should not segfault, no matter what files you feed it. If this happens with the CUDA 4.0 toolchain, can you please file a bug against the compiler, attaching your repro case. Please also state the exact commandline used to invoke nvcc, and platform (Win32, Win64, Linux32, Linux64) used as this will help with repro on our side. Thank you for your help, and sorry for the inconvenience.
You can submit bugs from a link off the registered developer website. Sorry, I do not know whether the code should work, the syntax of the jump table initialization is not familiar to me. Do you know whether this is ANSI C/C++ or possibly a gcc extension?
I have a switch statement I’m attempting to optimize with a jump table. I’ve looked at the PTX generated from my kernel and from what I can tell (based on the PTX documentation for the bra instruction) it is not automatically optimizing the switch with indirect jumps, so I’m attempting to handle it explicitly.
From the PTX 2.1 manual:
I’m attempting to access this instruction interface through CUDA, and my code above is the best guess I have for doing so. Any other ideas would be appreciated.
That works, thanks for your help. Unfortunately it turns out that using a table of function pointers is slower for me than using a simple switch statement. From what I can tell looking at the ptx for the device functions, some arguments are written to local memory, and it seems that this overhead eclipses the advantage of the jump table. Do you have any ideas about this? Obviously this isn’t an issue with inline functions, but using a jump table seems to make inlining impossible. It would be nice if you could declare register variables that were shared between the kernel and device functions, but as far as I know this isn’t possible either. It looks like a switch statement is still my best option.