Modifying CUDA particles (sample code): some questions on how to modify the code efficiently


I am a beginner at CUDA and parallel programming, and I have no formal training in either. I am using a GTX 280 to run the CUDA particles sample code. It compiles and runs wonderfully. I also made some modifications to the interactions, and they work as expected. I need to make more changes, and any help with the following would be greatly appreciated.

As a simple case, if I wanted just one ball to have a different radius and attraction property from the others, what would be the simplest way to do that?

As a general case, I would like the radius and attraction properties to differ from particle to particle. What would be the simplest way to do that? If I were writing this in C++, I would define a particle class with variable attributes and pass an array of that class when performing the calculations. Would something like that work here too? The current particle code passes pointers to the position and velocity vectors separately.

I basically want to make sure when I make these changes, I don’t screw up the performance that the current code provides!


It probably wouldn’t work, or would work very slowly. Class objects tend to be big (comprising many attributes), and the GPU’s memory fetching is optimized for primitive elements (32-bit words, 64-bit double words, or 128-bit quad words).

You would probably be much better off with an additional array of radii that you would read in the kernel.
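
To make the "additional array of radii" idea concrete, here is a minimal sketch. It is not taken from the particles sample: the kernel name overlapsWithFirst, the helper runOverlapTest, and the float4 position layout are all assumptions made for illustration, and error checking is left out for brevity. Each thread reads its own radius from a separate float array (a plain 4-byte read per thread) and uses the sum of the two radii as the contact distance.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: flags every particle that overlaps particle 0.
// Positions are assumed to be float4 (x, y, z, unused); radii live in
// their own array so each thread does a simple 4-byte read.
__global__ void overlapsWithFirst(const float4* pos, const float* radius,
                                  int* overlaps, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 a = pos[0];
    float4 b = pos[i];
    float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    float dist2 = dx * dx + dy * dy + dz * dz;

    // Contact distance is the sum of the two per-particle radii.
    float reach = radius[0] + radius[i];
    overlaps[i] = (i != 0 && dist2 < reach * reach) ? 1 : 0;
}

// Hypothetical host helper: copies data to the GPU, runs the kernel,
// and returns the flags. Assumes both vectors have the same non-zero size.
std::vector<int> runOverlapTest(const std::vector<float4>& hPos,
                                const std::vector<float>& hRadius)
{
    const int n = static_cast<int>(hPos.size());
    float4* dPos = nullptr;
    float* dRadius = nullptr;
    int* dOverlaps = nullptr;
    cudaMalloc(&dPos, n * sizeof(float4));
    cudaMalloc(&dRadius, n * sizeof(float));
    cudaMalloc(&dOverlaps, n * sizeof(int));
    cudaMemcpy(dPos, hPos.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
    cudaMemcpy(dRadius, hRadius.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    overlapsWithFirst<<<blocks, threads>>>(dPos, dRadius, dOverlaps, n);

    std::vector<int> hOverlaps(n);
    cudaMemcpy(hOverlaps.data(), dOverlaps, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dPos);
    cudaFree(dRadius);
    cudaFree(dOverlaps);
    return hOverlaps;
}
```

For the "one special ball" case, the same layout works unchanged: fill the radius array with the common value and overwrite the single entry that should differ. An attraction coefficient could be handled with a second array of the same kind.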

Mark Harris in his slides on optimisation writes:
“Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
Prefer Structures of Arrays over AoS”

You can certainly use structs (and non-polymorphic classes). In fact, as Mark Harris said, reading structures of 4, 8, or 16 bytes will coalesce perfectly ;)
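
As a rough illustration of a struct that stays within the 4, 8, or 16 byte sizes, the sketch below packs the two properties from the question into an 8-byte struct. The names ParticleProps, scaleAttraction, and runScale are hypothetical, and error checking is omitted. The __align__(8) specifier asks the compiler to align the struct to 8 bytes so that each thread can fetch its element in a single 8-byte access.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-particle properties: two floats, exactly 8 bytes.
struct __align__(8) ParticleProps {
    float radius;
    float attraction;
};
static_assert(sizeof(ParticleProps) == 8, "struct must stay at 8 bytes");

// Each thread loads one whole struct, modifies it, and writes it back.
__global__ void scaleAttraction(ParticleProps* props, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        ParticleProps p = props[i];   // one 8-byte element per thread
        p.attraction *= factor;
        props[i] = p;
    }
}

// Hypothetical host helper: round-trips the array through the GPU.
// Assumes the vector is not empty.
std::vector<ParticleProps> runScale(std::vector<ParticleProps> h, float factor)
{
    const int n = static_cast<int>(h.size());
    ParticleProps* d = nullptr;
    cudaMalloc(&d, n * sizeof(ParticleProps));
    cudaMemcpy(d, h.data(), n * sizeof(ParticleProps), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    scaleAttraction<<<blocks, threads>>>(d, factor, n);

    cudaMemcpy(h.data(), d, n * sizeof(ParticleProps), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

If a third float were added, the struct would grow to 12 bytes and fall outside the sizes quoted above, so it would need either padding to 16 bytes or a different loading strategy.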

But it’s true, you may need to take extra precautions with structs. Think about the rules of coalescing and you should be able to work these out intuitively. (Sometimes you may need to reinterpret the array of structs as an ordinary array so that threads can cooperatively load it into shared memory (smem) or whatnot.) But in sum, I’d rather take on the potential headache of dealing with coalescing than throw out structs, classes, member functions, operator overloading, etc.
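
The cooperative load mentioned in parentheses could look something like the sketch below. Everything in it is an assumption for illustration: the 12-byte struct Particle3, its damping field, the kernel sumProps, and the block size of 128. The block treats its slice of the struct array as plain floats, so that consecutive threads read consecutive 4-byte words, stages them in shared memory, and only then does each thread pick out its own struct. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 128  // the kernel must be launched with exactly this many threads per block

// Hypothetical 12-byte struct: not one of the 4, 8, or 16 byte sizes.
struct Particle3 {
    float radius;
    float attraction;
    float damping;
};
static_assert(sizeof(Particle3) == 12, "expected three packed floats");

__global__ void sumProps(const Particle3* particles, float* out, int n)
{
    __shared__ float tile[BLOCK * 3];

    // View the struct array as a flat float array.
    const float* raw = reinterpret_cast<const float*>(particles);
    const int base = blockIdx.x * BLOCK * 3;  // first float of this block's slice
    const int total = n * 3;

    // Consecutive threads read consecutive floats on every pass.
    for (int k = threadIdx.x; k < BLOCK * 3; k += BLOCK) {
        int g = base + k;
        if (g < total) tile[k] = raw[g];
    }
    __syncthreads();

    int i = blockIdx.x * BLOCK + threadIdx.x;
    if (i < n) {
        // Rebuild this thread's struct from shared memory.
        Particle3 p = { tile[3 * threadIdx.x],
                        tile[3 * threadIdx.x + 1],
                        tile[3 * threadIdx.x + 2] };
        out[i] = p.radius + p.attraction + p.damping;
    }
}

// Hypothetical host helper. Assumes the vector is not empty.
std::vector<float> runSum(const std::vector<Particle3>& h)
{
    const int n = static_cast<int>(h.size());
    Particle3* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(Particle3));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(Particle3), cudaMemcpyHostToDevice);

    const int blocks = (n + BLOCK - 1) / BLOCK;
    sumProps<<<blocks, BLOCK>>>(dIn, dOut, n);

    std::vector<float> result(n);
    cudaMemcpy(result.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The per-thread reads from shared memory use a stride of three floats; whether the extra staging step pays off compared with padding the struct to 16 bytes depends on the hardware and is worth measuring rather than assuming.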