Threads should be run in groups of 32?

On Wikipedia’s page on CUDA, under “Limitations”, it states that “Threads should be run in groups of at least 32 for best performance.” The Wikipedia page gives no reference for where this comes from.

I’ve been reading the CUDA programming guide v1.1, but haven’t been able to find any explanation as to why this is.
Can anyone help me out?

Each multiprocessor inside an NVIDIA GPU follows the SIMT architecture, in which threads are scheduled and executed together in groups of 32 (a warp). Thus, to get the most out of the GPU, it’s always advisable to size your thread blocks in multiples of 32.
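To make the grouping concrete, here is a small host-side C++ sketch (no GPU needed) of how a block's threads divide into warps, and how many lanes sit idle in the last, partially filled warp. The helper names and the `kWarpSize` constant are made up for illustration; 32 is simply the warp size discussed in this thread, not something the code queries from a device.

```cpp
#include <cstdio>

// Assumed warp size, for illustration only; real code should query the device.
constexpr int kWarpSize = 32;

// Number of warps needed to hold `threads` threads (hypothetical helper).
constexpr int warpsPerBlock(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Lanes left idle in the last warp when the block size is not a multiple of 32.
constexpr int idleLanes(int threads) {
    return warpsPerBlock(threads) * kWarpSize - threads;
}

// Print a one-line summary for a given block size.
void reportBlock(int threads) {
    std::printf("%d threads -> %d warp(s), %d idle lane(s)\n",
                threads, warpsPerBlock(threads), idleLanes(threads));
}
```

Under these assumptions, a block of 100 threads occupies 4 warps and leaves 28 lanes idle, while a block of 128 threads fills its 4 warps exactly.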

For more details, you can refer to the ‘CUDA 2.0 Programming Guide’.
More specifically… Page No. 22, 1st paragraph. :)

Thanks for your answer, teju! Apparently there’s a bit of an architectural difference on this point between CUDA 1.1 and CUDA 2.0. It seems like this specific optimization is only relevant in version 2.0.

Might do some testing on it just to be sure though.

I would always recommend using CUDA 2.0. It provides enormous capabilities compared with its 1.1 counterpart (unless the limitation is the GPU being used).

Look at the scripts in these threads…
It might be of some help to you…

Where in the world did you come up with that idea? “Specific optimizations” related to CUDA have been the same since CUDA version 0.8 and probably earlier. Warp size, divergences, coalescing, texture caches, constant memory, shared memory, etc… are all low level hardware features that have been basically the same for the last 2 years. OK, so memory coalescing gets a little easier from sm10 to sm13, but that is the biggest difference besides new features like atomic operations.

The programming guide has all the CUDA performance tuning and architecture explanations you need to write optimal CUDA apps. Really. That is, unless you want to start hand tuning assembly code where you’ll need the PTX ISA and other information that can be gotten from things such as wumpus’s decuda tool.

Thanks for the links, teju. I’m gonna have to stick with 1.1 though, as the server I’m using for this particular project has the 1.1 drivers installed, and I can’t just go and upgrade them.

MisterAnderson, indeed you’re probably right that these “optimizations” haven’t changed from 1.1 to 2.0. I just don’t see anything in programming guide 1.1 stating that threads should be grouped in 32’s. Also, in the architecture described in programming guide 1.1, I don’t see any particular reason why that should be the case.

Obviously I could easily have missed something :)

What you’ve missed is the warp size. It’s not a new thing in CUDA 2.0; as far as I know, the warp size has always been there.

In programming guide 1.1, I can see that it’s on page 15.

Have a read about active warps.

I’ve got to link my good friend David Kanter’s article on GT200. It’s a super-technical read, but knowing the hardware will give you a much better understanding of why CUDA is set up the way it is. The simplified summary is that GT200 is made up of 30 8-wide vector units. Each vector unit runs at twice the speed of its scheduler, so it absolutely must execute 16 operations at a time (although with predication, some outputs can be disabled; this is the perf hit from divergent branches). Finally, the warp size is defined to be 32 so that if future hardware ever had 16-wide vector units, your current code wouldn’t explode.
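The arithmetic in that summary can be written out as a small host-side C++ sketch. The figures (8 lanes, a 2x clock ratio, a warp of 32) are the ones given in the post; the `VectorUnit` struct and the function names are invented here for illustration and are not part of any CUDA API.

```cpp
// Figures come from the post above; the struct and names are illustrative.
struct VectorUnit {
    int width;       // lanes per vector unit (8 on GT200, per the post)
    int clockRatio;  // vector-unit clocks per scheduler clock (2, per the post)
};

// Operations the unit has to be fed per scheduler clock.
constexpr int opsPerSchedulerClock(VectorUnit u) {
    return u.width * u.clockRatio;
}

// Vector-unit clocks needed to push one whole warp through the unit.
constexpr int unitClocksPerWarp(VectorUnit u, int warpSize) {
    return warpSize / u.width;
}

// True when a warp splits evenly into scheduler-clock-sized batches.
constexpr bool warpFitsEvenly(VectorUnit u, int warpSize) {
    return warpSize % opsPerSchedulerClock(u) == 0;
}
```

With `VectorUnit{8, 2}` this gives 16 operations per scheduler clock and 4 unit clocks per 32-thread warp. With a hypothetical `VectorUnit{16, 2}` it gives 32 operations per scheduler clock, and a warp of 32 still divides evenly, which is the point about future hardware.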

That article also has a very nice overview of the changes in Compute 1.2 and 1.3. Finally, David Kanter is terrifyingly intelligent–this was his first GPU article, and it’s up there with Rys Sommefeldt’s G80 article or Dave Baumann’s Xenos article in terms of quality (both at Beyond3D).

(disclaimer: I’m in the credits for the GT200 article, which is hilarious. second disclaimer: I used to write for Beyond3D.)

It is better to query the device properties, find out the warp size, and then use that to determine your block size. It MAY change in the future. Mark Harris alerted me to this a long time ago.
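A minimal host-side sketch of that advice, in plain C++ so it runs without a GPU. In a real CUDA program the `warpSize` and `maxThreadsPerBlock` arguments would come from the fields of the same names in the `cudaDeviceProp` struct filled in by `cudaGetDeviceProperties`; here they are just parameters, and `chooseBlockSize` is a hypothetical helper name.

```cpp
// Pick a block size that is a multiple of the warp size.
// desiredThreads:     how many threads per block the caller would like
// warpSize:           assumed to be read from the device properties
// maxThreadsPerBlock: assumed to be read from the device properties
// Returns 0 if any input is not positive.
int chooseBlockSize(int desiredThreads, int warpSize, int maxThreadsPerBlock) {
    if (desiredThreads <= 0 || warpSize <= 0 || maxThreadsPerBlock <= 0) {
        return 0;
    }
    // Round up to the next multiple of the warp size.
    int rounded = ((desiredThreads + warpSize - 1) / warpSize) * warpSize;
    // If that exceeds the device limit, fall back to the largest
    // multiple of the warp size that still fits.
    if (rounded > maxThreadsPerBlock) {
        rounded = (maxThreadsPerBlock / warpSize) * warpSize;
    }
    return rounded;
}
```

For instance, `chooseBlockSize(100, 32, 512)` gives 128, and the same call with a warp size of 64 also gives 128, so nothing in the calling code hard-wires the number 32.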