Will a lot of if statements slow down an RTX 5000 significantly compared to a CPU?

I am comparing the performance of a CPU function and a GPU function on an RTX 5000 while processing a dataset of 6000 float arrays. I expected that assigning each data point to a separate thread on the GPU would speed up the program. However, the CPU function, which processes the dataset sequentially, is actually faster. My code contains numerous if statements as well as 4-5 nested for loops. Will those statements impact GPU performance that much?

Maybe, maybe not. The compiler may be able to eliminate many of the branches. The code may be bottlenecked on something else, such as memory throughput. Maybe your measurements included the time to transfer the source data to the GPU. Use the profiler to guide your efforts.

Thanks for your prompt response! I only call the GPU once, and most arrays I pass to the kernel use shared memory. Could memory throughput still be a factor in the slowdown?

The RTX 5000 has 48 SMs, so that means you are using 125 threads per SM? That is on the low side for occupancy, but it should not be too bad.
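The arithmetic behind that estimate can be sketched in a few lines of host-side C++. This is only a rough model: the function and struct names are made up for illustration, it assumes the 6000 threads are spread evenly over all SMs (ignoring block granularity), and the limit of 1024 resident threads per SM is an assumption for this GPU generation that should be checked against the device properties.

```cpp
#include <cassert>

// Rough per-SM load estimate (hypothetical helper, not a CUDA API).
struct SmLoad {
    int threads_per_sm;  // threads landing on one SM
    int warps_per_sm;    // the same, rounded up to whole warps
    double occupancy;    // resident warps / maximum resident warps
};

// Assumes threads are distributed evenly over all SMs.
SmLoad estimate_sm_load(int total_threads, int sm_count, int max_threads_per_sm) {
    assert(total_threads > 0 && sm_count > 0 && max_threads_per_sm >= 32);
    const int warp_size = 32;
    const int threads = (total_threads + sm_count - 1) / sm_count;  // ceiling division
    const int warps = (threads + warp_size - 1) / warp_size;        // ceiling division
    const int max_warps = max_threads_per_sm / warp_size;
    return {threads, warps, static_cast<double>(warps) / max_warps};
}
```

Calling estimate_sm_load(6000, 48, 1024) gives 125 threads and 4 warps per SM, which is an eighth of the assumed 32 resident warps. That leaves few warps to hide latency, which is why the thread count is described as being on the low side.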

And do the float arrays for each of those 125 threads fit in the shared memory of one SM, or do you process them partially sequentially?

Each of your threads can work on different data, but the threads of a warp should reconverge to the same execution path as soon as possible. When a warp diverges on an if/else, it executes both paths one after the other, which roughly halves the computational performance; a multi-option switch case is worse, and nested ones multiply that effect. This is only bad for performance if the threads of a warp actually diverge. If they all take the same path, it is okay. for loops are okay too, but for loops inside a divergent if/else/switch case inherit the slowdown.
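To make the divergence rule concrete, here is a small host-side C++ cost model. It is a deliberate simplification and not how the hardware is programmed: it assumes a warp of 32 lanes runs every branch path that at least one lane takes, one path after another, and the function name and cost numbers are invented for the illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // lanes per warp on NVIDIA GPUs

// Simplified model: the warp pays for every path taken by at least one lane.
// path_of_lane[i] is the branch path chosen by lane i;
// cost_of_path[p] is the instruction count of path p.
int warp_branch_cost(const std::array<int, kWarpSize>& path_of_lane,
                     const std::vector<int>& cost_of_path) {
    std::vector<bool> taken(cost_of_path.size(), false);
    for (int path : path_of_lane) {
        assert(path >= 0 && static_cast<std::size_t>(path) < cost_of_path.size());
        taken[static_cast<std::size_t>(path)] = true;
    }
    int total = 0;
    for (std::size_t p = 0; p < cost_of_path.size(); ++p) {
        if (taken[p]) total += cost_of_path[p];  // paths run one after another
    }
    return total;
}
```

With two paths of 10 instructions each, a warp whose lanes all pick path 0 costs 10, while a warp whose lanes alternate between the paths costs 20. With four equally long paths from a nested if/else and at least one lane on each, the cost is 40, which is the multiplication effect described above.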

For good memory speed, you should index the fastest-varying dimension (the rightmost one in C/C++) by threadIdx.x, so that you avoid bank conflicts when accessing shared memory.
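The effect of the indexing order can be checked with a small host-side C++ model of shared memory banks. It assumes 32 banks that are each one 4-byte word wide, so a float at word index w lives in bank w % 32; lanes reading different words in the same bank are serialized. The helper names and the tile[rows][cols] layout are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>

constexpr std::size_t kWarpSize = 32;  // lanes per warp
constexpr std::size_t kNumBanks = 32;  // assumed: 32 banks, one 4-byte word wide

using WarpWords = std::array<std::size_t, kWarpSize>;

// Largest number of distinct words that fall into a single bank.
// 1 means conflict-free; 32 means the access is fully serialized.
std::size_t conflict_degree(const WarpWords& word_of_lane) {
    std::array<std::set<std::size_t>, kNumBanks> words_in_bank;
    for (std::size_t word : word_of_lane) {
        words_in_bank[word % kNumBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_in_bank) {
        worst = std::max(worst, words.size());
    }
    return worst;
}

// Word indices for tile[k][lane]: the lane indexes the rightmost dimension.
WarpWords lane_indexes_rightmost(std::size_t k, std::size_t cols) {
    WarpWords words{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) words[lane] = k * cols + lane;
    return words;
}

// Word indices for tile[lane][k]: the lane indexes the leftmost dimension.
WarpWords lane_indexes_leftmost(std::size_t k, std::size_t cols) {
    WarpWords words{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) words[lane] = lane * cols + k;
    return words;
}
```

For a 32-column float tile, conflict_degree(lane_indexes_rightmost(3, 32)) is 1, while conflict_degree(lane_indexes_leftmost(3, 32)) is 32, because every lane then hits the same bank with a different word.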

Thanks for the reply. I do have if and switch statements inside a loop of 16,000 iterations. I will try to move them outside the loop if possible. Thank you!

Hi qjin_2000,
that only helps if the loop itself does not end up inside the if/switch statement, or if all 32 threads of a warp take the same branch.
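As a sketch of the two loop shapes in question, here is a plain C++ version with a hypothetical per-thread Mode option and made-up function names. It only shows the code structure; it does not measure anything.

```cpp
#include <cstddef>

enum class Mode { Add, Scale };  // hypothetical per-thread option

// The switch is evaluated in every round of the loop.
float branch_inside_loop(Mode mode, float x, std::size_t rounds) {
    float acc = 1.0f;
    for (std::size_t i = 0; i < rounds; ++i) {
        switch (mode) {
            case Mode::Add:   acc += x; break;
            case Mode::Scale: acc *= x; break;
        }
    }
    return acc;
}

// The switch is hoisted, but each case now holds its own copy of the loop.
float branch_hoisted(Mode mode, float x, std::size_t rounds) {
    float acc = 1.0f;
    switch (mode) {
        case Mode::Add:
            for (std::size_t i = 0; i < rounds; ++i) acc += x;
            break;
        case Mode::Scale:
            for (std::size_t i = 0; i < rounds; ++i) acc *= x;
            break;
    }
    return acc;
}
```

Both functions return the same value. If the threads of a warp hold different mode values, the hoisted form still makes the warp run the Add loop and then the Scale loop in turn, so hoisting by itself does not remove the divergence cost. It pays off when all 32 threads of a warp share the same mode, for example by grouping the data so that each warp handles one mode.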