Hi! I am a PhD student working on particle physics simulations on a GPU. I have recently developed a code that loops over the particles of a system and finds the one with the highest energy. Inside my kernel is the (pseudo)code:
int idx = blockIdx.x * blockDim.x + threadIdx.x;  // formula for thread ID (1D grid)
auto &system = systems[idx];                      // each thread handles one independent system
double highest_energy = 0.0;
int highest_particle = 0;
for (int i = 0; i < system.particles(); i++) {
    double energy = system.particle(i).energy();
    if (energy < 100.0) {   // skip particles below the energy cut
        continue;
    }
    if (energy > highest_energy) {
        highest_energy = energy;
        highest_particle = i;
    }
}
NB: A system is an object containing an array of particles. Each system is completely independent, and threads don’t communicate with each other at all.
Different systems have different numbers of particles. We tried this code on a V100 GPU, and it was really fast. Is this due to Independent Thread Scheduling? Is each thread running its own for loop independently, and is the divergence at the if statement allowed?
You could probably accelerate the code even further:
- Put the member variables of the systems into separate arrays, so that an access like system_particles[idx] is coalesced (see the layout sketch after this list).
- For the for loop, switch the responsibility of the threads from one system per thread to having 32 threads (= one warp) process one system after another in a collaborative way.
If you are happy with the speed, you do not have to try further optimizations.
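To make the first point a bit more concrete, here is a rough sketch of such a structure-of-arrays layout (the names SystemsSoA, num_particles, offsets and energy are made up for illustration, and only the energy member of a particle is shown):

struct SystemsSoA {
    // Per-system members: one flat array each, indexed by the system id, so
    // neighbouring threads reading num_particles[idx] produce one coalesced
    // load instead of many scattered ones.
    int    *num_particles;   // particle count of system s
    int    *offsets;         // index of the first particle of system s in the
                             // flat per-particle arrays below
    // Per-particle members: stored contiguously, system after system.
    double *energy;          // energy[offsets[s] + i] = energy of particle i of system s
    int     num_systems;
};

The idea is simply that data which neighbouring threads (or neighbouring lanes working on the same system) touch at the same time ends up next to each other in memory, so the hardware can combine those accesses into a few transactions.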
Independent Thread Scheduling does not help here at the moment. The threads of a warp diverge in the current code, and the memory accesses are also suboptimal.
(1) Your code appears to perform double-precision (FP64) computation, and the V100 offers full-rate FP64 throughput (1:2 relative to FP32), whereas consumer GPUs offer significantly lower FP64 throughput (reduced to 1/64 of the FP32 rate on the most recent consumer architectures).
(2) The V100 is an (older) high-end GPU with high memory throughput thanks to its specialized HBM2 memory. As far as I am aware, among consumer GPUs only the RTX 3090 and the RTX 4090 offer similar memory performance.
Thank you both for the responses! I’m still a bit confused about the underlying SIMT theory:
Let’s assume a GPU with just two threads/cores.
Let’s say System A has 3 particles and System B has 13.
The for loop will iterate 3 times for system A. For system B, it will iterate 13 times.
How does this work under SIMT logic? Would the first thread do 10 extra/unused iterations for system A?
Or, if I have thousands of these systems, would the first thread move to a new system after system A, while the second thread is still iterating for system B?
In the worst case (which the CUDA compiler/assembler should prevent), all the threads would stay out of sync / diverged after the loop, even if the code after the loop could be executed in a converged way.
What never happens is that A and B execute different program positions at the same time: at any moment only one program position is executed, and the threads not at that position are inactive, but they typically still occupy resources (with some exceptions) as if they were active.
Independent Thread Scheduling prevents deadlocks when syncing between threads. On older GPUs it could happen that A was waiting for B, but B never continued because it was A's turn. Newer GPUs guarantee that B gets execution time at some point while A just waits (or even before - sometimes A proceeds, sometimes B).
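A minimal, made-up illustration of that kind of deadlock (not from your code): two lanes of the same warp handshaking through a flag in global memory.

__global__ void warp_handshake(volatile int *flag, int *result)
{
    if (threadIdx.x == 0) {
        // Lane 0 spins until lane 1 has set the flag.
        // Pre-Volta: the warp executes one branch at a time, so if this branch
        // is picked first, lane 1 never gets to run and the loop spins forever.
        // With Independent Thread Scheduling, lane 1 can still be scheduled and
        // make forward progress, so the wait can finish.
        while (*flag == 0) { }
        *result = 1;
    } else if (threadIdx.x == 1) {
        *flag = 1;   // lane 1 publishes the flag
    }
}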
One last thing to confirm this: the code above has the if statement with the continue, which skips the second half of the loop body when a particle's energy is less than 100. If this condition is true in thread A and false in thread B, would A be inactive while B executes the second half?
You can rearrange your code and algorithm so that finished threads continue with the particles of a different system, which keeps them busy in a useful way.
However, depending on the arithmetic complexity of your calculations in energy(), the bottleneck is probably the memory accesses.
And the best way to further improve the performance would be to have the threads cooperate better to load the systems and their particles instead of letting each thread do it independently (see the sketch below).
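Roughly, a warp-cooperative version could look like the following sketch. It reuses the made-up flat layout from above (an energy array plus an offsets table): one warp handles one system, the lanes stride over its particles with coalesced loads, and a warp shuffle reduction combines the per-lane results.

__global__ void find_highest_energy_warp(const double *energy,
                                         const int *offsets,
                                         int num_systems,
                                         int *highest_particle)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per system
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_systems) return;   // the whole warp exits together

    int begin = offsets[warp_id];
    int end   = offsets[warp_id + 1];

    double best_energy = 0.0;
    int    best_index  = 0;

    // Lanes stride over the particles of this system; neighbouring lanes read
    // neighbouring elements of energy[], so the loads are coalesced.
    for (int i = begin + lane; i < end; i += 32) {
        double e = energy[i];
        if (e < 100.0) {                  // same energy cut as in your code
            continue;
        }
        if (e > best_energy) {
            best_energy = e;
            best_index  = i - begin;
        }
    }

    // Warp shuffle reduction: combine the per-lane maxima without shared memory.
    for (int offset = 16; offset > 0; offset /= 2) {
        double other_e = __shfl_down_sync(0xffffffff, best_energy, offset);
        int    other_i = __shfl_down_sync(0xffffffff, best_index,  offset);
        if (other_e > best_energy) {
            best_energy = other_e;
            best_index  = other_i;
        }
    }

    if (lane == 0) highest_particle[warp_id] = best_index;   // lane 0 writes the result
}

With a launch like <<<(num_systems * 32 + 255) / 256, 256>>>, the different particle counts only cause divergence between warps (different systems), while the 32 lanes of one warp differ by at most one loop iteration.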
Hi! I’m sorry, but I was reading the CUDA C++ Programming Guide, which made things more confusing.
I’ve attached a snapshot of the SIMT architecture description from Section 7.1. It says that threads in a warp start together but are then effectively allowed to branch and execute independently. Doesn’t this completely contradict our conclusion?
What I mean is: if threads A and B can “execute independently”, how is SIMT preserved? Are they running different instructions in this case? It could just be poor wording, too.
EDIT: The paragraphs after this statement go back to talking about active and inactive threads, which is what we discussed. Sorry for the confusion!
That is true, but they are not actually executed at the same time. It is similar to a single-core CPU doing preemptive multitasking: the processes (and threads) on the CPU run independently and have their own instruction address counter and register state, but they are not really running at the same time; the scheduler just switches between them quickly.
Before Independent Thread Scheduling, you could (very roughly) compare it to cooperative (instead of preemptive) multitasking on a CPU, with the possibility of the deadlocks mentioned above: some of the threads were starving (not getting execution time) while the others just waited for them without relinquishing control.