I have a question about using a switch-case statement inside a CUDA kernel.
I have the following kernel:
template <typename T>
__global__ void myKernel(T *output, bool b_ex, int type,
                         T a1, T a2, T a3,
                         T s1, T s2, T s3,
                         T param1, T param2,
                         int n23, int n3, int n)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int z = blockDim.z * blockIdx.z + threadIdx.z;
    int idx = x * n23 + y * n3 + z;
    if (idx >= n) return;

    T d1 = a2*s3 + s1;
    for (int i = 0; i < N_DEEP; ++i, ++idx)
    {
        T d3 = s1*a3 + s2*a1;
        T sqr = d1 + d3*d3;
        // select the calculation variant
        switch (type)
        {
        case 1: deviceFunction_1(output, idx, b_ex, param1, sqr); break;
        case 2: deviceFunction_2(output, idx, param1, param2, sqr); break;
        case 3: deviceFunction_3(output, idx, param1, sqr); break;
        }
    }
}
My question is about the performance of my kernel. In the sequential code I have a function whose variants differ only in a few operations. My first CUDA implementation is the kernel shown above: I split the three specific calculations into three device functions, and inside the kernel I check the case and execute the matching one. This implementation works fine.
My doubt is whether the switch-case statement (or an equivalent if statement) will affect the performance of the kernel.
Will performance be greatly affected?
Would it be better to use three different implementations of the kernel?
You could use a device function pointer instead, to save the lookup (at the cost of having to unify the function signatures and pass sometimes unused arguments).
However, as you have already figured, it would be even more efficient to use three different kernels, which can be conveniently achieved with templates - as you’ve already done for T.
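For reference, here is a minimal sketch of the device function pointer option mentioned above. The names (OpFn, opAdd, opMulAdd, applyOp, runOp) and the arithmetic are hypothetical stand-ins, not the device functions from the question. It only illustrates the unified signature and how the host obtains a device function address, which it cannot take directly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical unified signature: every variant takes the same arguments,
// even if it does not use all of them.
typedef float (*OpFn)(float sqr, float p1, float p2);

__device__ float opAdd(float sqr, float p1, float p2)    { return sqr + p1; }      // p2 unused
__device__ float opMulAdd(float sqr, float p1, float p2) { return sqr * p1 + p2; }

// Device-side variables holding the function addresses. The host reads
// them back with cudaMemcpyFromSymbol and passes the value to the kernel.
__device__ OpFn d_opAdd    = opAdd;
__device__ OpFn d_opMulAdd = opMulAdd;

__global__ void applyOp(float *out, OpFn op, float sqr, float p1, float p2, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) out[idx] = op(sqr, p1, p2);   // indirect call instead of a switch
}

// Runs the kernel with one of the two variants and returns out[0].
float runOp(bool useMulAdd)
{
    OpFn h_op = nullptr;
    if (useMulAdd) cudaMemcpyFromSymbol(&h_op, d_opMulAdd, sizeof(OpFn));
    else           cudaMemcpyFromSymbol(&h_op, d_opAdd, sizeof(OpFn));

    const int n = 256;
    float *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));
    applyOp<<<(n + 127) / 128, 128>>>(d_out, h_op, 3.0f, 2.0f, 1.0f, n);

    float first = 0.0f;
    cudaMemcpy(&first, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return first;
}

int main()
{
    float a = runOp(false);
    float b = runOp(true);
    printf("add: %g, muladd: %g\n", a, b);
    return 0;
}
```

Note that an indirect call like this generally prevents the compiler from inlining the device function, so it is worth measuring before assuming it beats a uniform switch.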
Without knowing what the called functions represent, it is impossible to give any sort of accurate assessment. The performance impact is likely minor.
However, since the kernel is already templated, why not add ‘type’ to the template parameters to avoid the switch in the first place?
My question about performance was specifically about the switch-case statement: every thread executes the for loop and has to select the correct option on each iteration, and maybe this can be a problem (in terms of performance).
The idea was to avoid code repetition.
Since the switch is driven by ‘type’, which is independent of thread ID, it adds branching to the code, but that branching is uniform: all threads in a warp take the same path, so there is no divergence. That leads me to think the performance impact is minor, especially if the functions invoked are “beefy”, diluting any branch overhead.
Templating the kernel by ‘type’, and invoking the appropriate version from the host, via a function pointer (see tera’s reply in #2), will get rid of this overhead, and doesn’t clutter the code any more than the current approach. In particular, there won’t be any code repetition, just multiple instantiations of the template.
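A minimal sketch of that approach follows, under stated assumptions: the kernel is reduced to one dimension, and variant1/variant2/variant3, myKernelT and runVariant are hypothetical stand-ins for the real device functions and launcher. Because TYPE is a compile-time constant, each instantiation keeps only its own branch, and the host picks the instantiation through a table of kernel function pointers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the three device functions in the question.
template <typename T>
__device__ void variant1(T *out, int idx, T param, T sqr) { out[idx] = sqr + param; }
template <typename T>
__device__ void variant2(T *out, int idx, T param, T sqr) { out[idx] = sqr * param; }
template <typename T>
__device__ void variant3(T *out, int idx, T param, T sqr) { out[idx] = sqr - param; }

// TYPE is a template parameter, so the selection below is resolved at
// compile time and no run-time switch remains in the generated kernel.
template <typename T, int TYPE>
__global__ void myKernelT(T *out, T param, T sqr, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= n) return;
    if (TYPE == 1)      variant1(out, idx, param, sqr);
    else if (TYPE == 2) variant2(out, idx, param, sqr);
    else                variant3(out, idx, param, sqr);
}

// Host-side dispatch: choose the instantiation once, then launch it
// through a function pointer. The caller must pass type in 1..3.
template <typename T>
T runVariant(int type, T param, T sqr)
{
    typedef void (*KernelPtr)(T *, T, T, int);
    const KernelPtr table[3] = { myKernelT<T, 1>, myKernelT<T, 2>, myKernelT<T, 3> };
    KernelPtr kernel = table[type - 1];

    const int n = 256;
    T *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(T));
    kernel<<<(n + 127) / 128, 128>>>(d_out, param, sqr, n);

    T first = T(0);
    cudaMemcpy(&first, d_out, sizeof(T), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return first;   // value written by thread 0
}

int main()
{
    for (int type = 1; type <= 3; ++type)
        printf("type %d -> %g\n", type, (double)runVariant<float>(type, 2.0f, 3.0f));
    return 0;
}
```

The kernel body is written once, and the only per-variant code is the table entry, so there is no source duplication beyond the extra template instantiations.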
I have used templated kernels invoked via function pointers fairly frequently in my work. The performance gain was often fairly substantial, which is not expected here. But depending on what the context is, even a small speedup may be welcome.
Yes, the functions invoked are small, only a few lines of code (4-5).
In the end I will use templated kernels and invoke them via function pointers from the host.
Anyway, I think the performance gain will be very small, but as you said, even a small speedup may be welcome.