I am also very new to CUDA (maybe a couple weeks into it), but I’m one of the best when it comes to working with relational databases, and I’ve found that some things carry over from databases and embedded work to CUDA that are very important to keep in mind.
Hopefully I’m correct on this. I’ve identified two fundamentals with CUDA that are foreign to standard C development:
- It is important to think in sets.
It’s my understanding that one CUDA warp processes 32 threads simultaneously, but not like a processor that is switching between threads, and not like two CPU cores working independently at the same time. The warp has 32 copies of the thread-specific registers, and each instruction applies to every thread at the same time. If you have an if…else block and the condition is true for one thread and false for another, the instructions for both branches are executed (results are simply not written back for the threads on the inactive side of the branch). This is called “warp divergence”. Any execution that is not a multiple of 32 threads leaves processing power on the table, and any execution with a large number of branching if statements can pay a penalty, because the warp executes both sides of every divergent branch. I believe this design is to increase density on the silicon; 32 is likely the Goldilocks number for the warp size.
This guy does a good job explaining this: http://people.maths.ox.ac.uk/~gilesm/cuda/lecs/lec3-2x2.pdf
- Understand #1 and it makes sense why external memory operations are so expensive. I believe, and I’m not yet sure on this, that when a bunch of threads in the same warp access external memory at scattered addresses, not within the same segment, the warp has to issue a separate fetch for each one, and every thread in the warp must wait until all the fetches complete.
I think of a warp like this machine. Each pancake is a thread. If there is no pancake in a position, the machine must still wait the same amount of time, because all threads perform their instruction at the same time.