3-layer for-loops


This is my first post, so please don’t be too hard on me…

I’ve got a 3-layer for-loop that I’d like to optimise. (It’s for a speech recogniser.)

for (t=0; t<nFrames; t++){









It’s quite time-consuming and could do with some parallelization.

Could anyone give me some suggestions?

Why post this on the Linux-centric forum?

The key question is what happens inside the loops. If the inner loop is completely independent of all the other iterations, then just launch one thread per inner loop and you’re done. (Careful optimization aside.)
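To make the "one thread per iteration" idea concrete, here is a minimal host-side C++ sketch of the index arithmetic involved. It is not the original poster's loop: only nFrames comes from the post, while nJ, nK, body() and the function names are hypothetical stand-ins. It assumes every innermost iteration is independent, and shows that a flat work-item id (what a GPU thread would get from its global thread index) can be decomposed back into the three loop indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Only nFrames appears in the original post; nJ and nK are hypothetical
// bounds for the two inner loops.
struct Index3 { int t, j, k; };

// Recover the (t, j, k) loop indices from a flat work-item id.
Index3 unflatten(std::size_t id, int nJ, int nK) {
    assert(nJ > 0 && nK > 0);
    Index3 idx;
    idx.k = static_cast<int>(id % nK);
    idx.j = static_cast<int>((id / nK) % nJ);
    idx.t = static_cast<int>(id / (static_cast<std::size_t>(nJ) * nK));
    return idx;
}

// Placeholder for an independent inner-loop body (an assumption).
float body(int t, int j, int k) {
    return static_cast<float>(t * 100 + j * 10 + k);
}

// The nested-loop form, as in the question.
std::vector<float> runNested(int nFrames, int nJ, int nK) {
    std::vector<float> out(static_cast<std::size_t>(nFrames) * nJ * nK);
    for (int t = 0; t < nFrames; t++)
        for (int j = 0; j < nJ; j++)
            for (int k = 0; k < nK; k++)
                out[(static_cast<std::size_t>(t) * nJ + j) * nK + k] = body(t, j, k);
    return out;
}

// The flat form: no iteration reads another's result, so each id
// could be handed to its own thread.
std::vector<float> runFlat(int nFrames, int nJ, int nK) {
    std::vector<float> out(static_cast<std::size_t>(nFrames) * nJ * nK);
    for (std::size_t id = 0; id < out.size(); id++) {
        Index3 i = unflatten(id, nJ, nK);
        out[id] = body(i.t, i.j, i.k);
    }
    return out;
}
```

In a real kernel the loop over id disappears and each thread computes its own id; the decomposition in unflatten() stays the same.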


Thanks for the message.

I put this up in the Linux forum because I’m using an 8800 GPU alongside my quad-core, running Ubuntu Linux with Xfce. Please feel free to move it if this forum is not appropriate: I’m still not 100% familiar with the goings-on round here…

It’s interesting that you mentioned the inner loop needing to be independent. Thanks for articulating that. I understand from browsing through the manual that the data types, and what the code can access, are quite restricted. That’s something I’m going to have to look at, because my inner loop calls functions (relatively simple ones).

I presume it’s very difficult (if not impossible) to reference C++ objects from inside the inner loop?


The C++ support in nvcc is very incomplete and accessing objects on the GPU might not work. (On the other hand, templates and operator overloading do work, and are very valuable.)

Even if it did work, the memory architecture makes parallel object access very slow. You’ll see people mention the term “coalesced memory reads” when talking about CUDA. The memory controller is optimized for large contiguous reads. To get anywhere near the theoretical memory bandwidth, you really need neighbouring threads to read contiguous elements of arrays of 32-, 64- or 128-bit data types. This makes arrays of structs (and basically arrays of objects, if that were supported) inefficient in many cases, whereas a structure of arrays would be better.

I call CUDA kernels from C++ code, and during the setup phase I transform my C++ data structures into a more efficient array layout before moving the data to the GPU.
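As a rough illustration of that setup step, here is a minimal host-side C++ sketch. The Gaussian type, its fields and the function name are hypothetical (not taken from my actual code); the point is only that each field ends up in its own contiguous array, so consecutive threads reading element i, i+1, … of one field touch consecutive memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side object; the field names are illustrative only.
struct Gaussian {
    float mean;
    float variance;
    float weight;
};

// Structure of arrays: one contiguous array per field.
struct GaussianArrays {
    std::vector<float> mean;
    std::vector<float> variance;
    std::vector<float> weight;
};

// Convert an array of structs into a structure of arrays on the host,
// before any copy to the GPU.
GaussianArrays toStructureOfArrays(const std::vector<Gaussian>& in) {
    GaussianArrays out;
    out.mean.reserve(in.size());
    out.variance.reserve(in.size());
    out.weight.reserve(in.size());
    for (const Gaussian& g : in) {
        out.mean.push_back(g.mean);
        out.variance.push_back(g.variance);
        out.weight.push_back(g.weight);
    }
    return out;
}
```

Each of the three vectors can then be copied to the device as a plain float array.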

Feel free to use C++ constructs. Almost everything works except anything that uses function pointers or virtual functions (i.e. polymorphism).

Structs/classes do make it harder to do coalescing sometimes, but don’t worry about it for now. Just study the coalescing rules, and you’ll be able to tell on your own which situations are problematic. (There are many situations where using structures doesn’t affect performance at all, and I don’t think anyone should just abandon these hugely useful things!)

The inner loops don’t have to be 100% independent. You just have to realize that threads will run simultaneously, and if one thread critically depends on the results of another, it will have to wait for them, which is hard to arrange efficiently. Sometimes the inner loop is only loosely dependent on previous iterations, like when it keeps adding to the same variable. In this case the addition is not really a problem, because the order of additions can be rearranged: each thread can keep its own partial sum, and the partial sums can be combined at the end.
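Here is a small C++ sketch of that accumulation case, written on the host for clarity. The function names and the nWorkers parameter are hypothetical; each value of w plays the role of one thread. It uses integers so the two versions agree exactly; with floating-point data the rearranged order can change the result in the last few bits.

```cpp
#include <cstddef>
#include <vector>

// Sequential form: every iteration depends on the running total.
long long sumSequential(const std::vector<int>& data) {
    long long total = 0;
    for (std::size_t i = 0; i < data.size(); i++)
        total += data[i];
    return total;
}

// Rearranged form: each worker adds up its own strided slice without
// looking at anyone else's result; the partial sums are combined last.
long long sumByPartials(const std::vector<int>& data, std::size_t nWorkers) {
    if (nWorkers == 0) nWorkers = 1;  // avoid a zero stride
    std::vector<long long> partial(nWorkers, 0);
    for (std::size_t w = 0; w < nWorkers; w++)          // each w could be one thread
        for (std::size_t i = w; i < data.size(); i += nWorkers)
            partial[w] += data[i];
    long long total = 0;
    for (long long p : partial)                         // final combine step
        total += p;
    return total;
}
```

On the GPU the final combine step is itself usually done in parallel as a tree-shaped reduction, but the principle is the same.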

Seibert, alexdubinsky:

Thanks for the insight, guys. Los Alamos National Lab, eh? Isn’t the net great, that we can access like-minded experts from all over the world?