Hi! I am a PhD student working on particle physics simulations on a GPU. I have recently developed a code that loops over the particles of a system and finds the one with the highest energy. Inside my kernel is the (pseudo)code:
int idx = blockIdx.x * blockDim.x + threadIdx.x;  // formula for thread ID (1D grid)
auto &system = systems[idx];                      // each thread handles one independent system
double highest_energy = 0.0;
int highest_particle = 0;
for (int i = 0; i < system.particles(); i++) {
    double energy = system.particle(i).energy();
    if (energy < 100.0) {   // skip particles below the energy cut
        continue;
    }
    if (energy > highest_energy) {
        highest_energy = energy;
        highest_particle = i;
    }
}
NB: A system is an object containing an array of particles. Each system is completely independent, and threads don’t communicate with each other at all.
Different systems have different numbers of particles. We tried this code on a V100 GPU, and it was really fast. Is this due to Independent Thread Scheduling? Is each thread running its own for loop independently, and is the divergence at the if statement allowed?
You could probably accelerate the code even further:
- Put the member variables of the systems into separate arrays, so that an access like system_particles[idx] is coalesced (see the layout sketch after this list).
- For the for loop, switch the responsibility of the threads from one system per thread to having 32 threads (= one warp) process one system after another in a collaborative way.
If you are happy with the speed, you do not have to try further optimizations.
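To make the first point a bit more concrete, here is a rough sketch of such a structure-of-arrays layout (the names SystemsSoA, num_particles, offsets and energy are made up for illustration, and only the energy member of a particle is shown):

struct SystemsSoA {
    // Per-system members: one flat array each, indexed by the system id, so
    // neighbouring threads reading num_particles[idx] produce one coalesced
    // load instead of many scattered ones.
    int    *num_particles;   // particle count of system s
    int    *offsets;         // index of the first particle of system s in the
                             // flat per-particle arrays below
    // Per-particle members: stored contiguously, system after system.
    double *energy;          // energy[offsets[s] + i] = energy of particle i of system s
    int     num_systems;
};

The idea is simply that data which neighbouring threads (or neighbouring lanes working on the same system) touch at the same time ends up next to each other in memory, so the hardware can combine those accesses into a few transactions.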
Independent Thread Scheduling does not help here at the moment. The threads of a warp diverge in the current code, and the memory accesses are also suboptimal.
(1) Your code appears to perform double-precision (FP64) computation, and the V100 offers full-rate FP64 throughput (1:2 relative to FP32), whereas consumer GPUs offer significantly lower FP64 throughput (reduced to 1/64 of the FP32 rate on the most recent consumer architectures).
(2) The V100 is an (older) high-end GPU with high memory throughput thanks to its specialized HBM2 memory. As far as I am aware, among consumer GPUs only the RTX 3090 and the RTX 4090 offer similar memory performance.
Thank you both for the responses! I’m still a bit confused about the underlying SIMT theory:
Let’s assume a GPU with just two threads/cores.
Let’s say System A has 3 particles and System B has 13.
The for loop will iterate 3 times for system A. For system B, it will iterate 13 times.
How does this work under SIMT logic? Would the first thread do 10 extra/unused iterations for system A?
Or, if I have thousands of these systems, would the first thread move to a new system after system A, while the second thread is still iterating for system B?
In the worst case (which the CUDA compiler/assembler should prevent), all the threads would stay out of sync / diverged after the loop, even if the code after the loop could be executed in a converged way.
What never happens is that A and B execute different program positions at the same time: at any moment only one program position is executed, and the threads not at that position are inactive, but they typically still occupy resources (with some exceptions) as if they were active.
Independent Thread Scheduling prevents deadlocks when syncing between threads. On older GPUs it could happen that A was waiting for B, but B never continued because it was A's turn. Newer GPUs guarantee that B gets execution time at some point while A just waits (or even before - sometimes A proceeds, sometimes B).
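A minimal, made-up illustration of that kind of deadlock (not from your code): two lanes of the same warp handshaking through a flag in global memory.

__global__ void warp_handshake(volatile int *flag, int *result)
{
    if (threadIdx.x == 0) {
        // Lane 0 spins until lane 1 has set the flag.
        // Pre-Volta: the warp executes one branch at a time, so if this branch
        // is picked first, lane 1 never gets to run and the loop spins forever.
        // With Independent Thread Scheduling, lane 1 can still be scheduled and
        // make forward progress, so the wait can finish.
        while (*flag == 0) { }
        *result = 1;
    } else if (threadIdx.x == 1) {
        *flag = 1;   // lane 1 publishes the flag
    }
}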
One last thing to confirm this: the code above has the if statement with the continue, which skips the second half of the loop body when a particle's energy is less than 100. If this condition is true in thread A and false in thread B, would A be inactive while B executes the second half?
You can rearrange your code and algorithm so that finished threads continue with the particles of a different system, which keeps them busy in a useful way.
However, depending on the arithmetic complexity of your calculations in energy(), the bottleneck is probably the memory accesses.
And the best way to further improve the performance would be to have the threads cooperate better to load the systems and their particles instead of letting each thread do it independently (see the sketch below).
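Roughly, a warp-cooperative version could look like the following sketch. It reuses the made-up flat layout from above (an energy array plus an offsets table): one warp handles one system, the lanes stride over its particles with coalesced loads, and a warp shuffle reduction combines the per-lane results.

__global__ void find_highest_energy_warp(const double *energy,
                                         const int *offsets,
                                         int num_systems,
                                         int *highest_particle)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per system
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_systems) return;   // the whole warp exits together

    int begin = offsets[warp_id];
    int end   = offsets[warp_id + 1];

    double best_energy = 0.0;
    int    best_index  = 0;

    // Lanes stride over the particles of this system; neighbouring lanes read
    // neighbouring elements of energy[], so the loads are coalesced.
    for (int i = begin + lane; i < end; i += 32) {
        double e = energy[i];
        if (e < 100.0) {                  // same energy cut as in your code
            continue;
        }
        if (e > best_energy) {
            best_energy = e;
            best_index  = i - begin;
        }
    }

    // Warp shuffle reduction: combine the per-lane maxima without shared memory.
    for (int offset = 16; offset > 0; offset /= 2) {
        double other_e = __shfl_down_sync(0xffffffff, best_energy, offset);
        int    other_i = __shfl_down_sync(0xffffffff, best_index,  offset);
        if (other_e > best_energy) {
            best_energy = other_e;
            best_index  = other_i;
        }
    }

    if (lane == 0) highest_particle[warp_id] = best_index;   // lane 0 writes the result
}

With a launch like <<<(num_systems * 32 + 255) / 256, 256>>>, the different particle counts only cause divergence between warps (different systems), while the 32 lanes of one warp differ by at most one loop iteration.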
Hi! I’m sorry, but I was reading the CUDA C++ Programming Guide, which made things more confusing.
I’ve attached a snapshot of the SIMT architecture description from Section 7.1. It says that threads in a warp start together but are then effectively allowed to branch and execute independently. Doesn’t this completely contradict our conclusion?
What I mean is: if threads A and B can “execute independently”, how is SIMT preserved? Are they running different instructions in this case? It could just be poor wording, too.
EDIT: The paragraphs after this statement go back to talking about active and inactive threads, which is what we discussed. Sorry for the confusion!
That is true, but they are not actually executed at the same time. It is similar to a single-core CPU doing preemptive multitasking: the processes (and threads) on the CPU run independently and have their own instruction address counter and register state, but they are not really running at the same time; the scheduler just switches between them quickly.
Before Independent Thread Scheduling, you could (very roughly) compare it to cooperative (instead of preemptive) multitasking on a CPU, with the possibility of the deadlocks mentioned above: some of the threads were starving (not getting execution time) while the others just waited for them without relinquishing control.