Optimization Advice

TL;DR: I have an idea to speed up my program, but I am not sure whether it is possible, and if it is, I don't know how best to implement it. Advice would be greatly appreciated :)

I have written a program that simulates electron microscope images from an input file containing the coordinates of some atoms. At present my code is fast enough for samples up to a few thousand atoms, but I want to take it much further so I could potentially simulate millions of atoms, or run the current simulation at real-time speeds. So far I haven't gone above 4000 atoms because that's the maximum I can hold in constant memory at once. I am fairly sure I can get around this by running my kernel multiple times, swapping the atoms in the constant memory array between runs and combining the results additively. I haven't bothered yet because the program already takes a reasonable length of time at 4000 atoms.

I am fairly certain I know what is limiting the speed of the program, though. Each thread calculates a value for one pixel in the image by summing contributions from nearby atoms only, but to test this "nearby" condition it actually loads the coordinates of every atom from constant memory. Essentially, threads are wasting time checking atoms that are nowhere near their vicinity.
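Roughly, the current setup looks like the following sketch, including the multi-pass constant-memory swap mentioned above (names simplified, and the contribution term is just a placeholder, not my real physics):

```cuda
#include <cuda_runtime.h>

#define MAX_ATOMS 4000                   // the constant-memory limit mentioned above

__constant__ float3 c_atoms[MAX_ATOMS];  // atom coordinates for this pass

// One thread per pixel; every thread scans the ENTIRE atom list just to
// reject the atoms that are nowhere near it.
__global__ void renderNaive(float *image, int width, int height,
                            int numAtoms, float pixelSize, float cutoff2)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    float x = px * pixelSize, y = py * pixelSize;
    float sum = 0.0f;

    for (int i = 0; i < numAtoms; ++i) {
        float dx = c_atoms[i].x - x;
        float dy = c_atoms[i].y - y;
        float r2 = dx * dx + dy * dy;
        if (r2 < cutoff2)
            sum += expf(-r2);            // placeholder contribution term
    }
    image[py * width + px] += sum;       // += so that multiple passes accumulate
}

// Host side: the multi-pass idea. Swap a chunk of atoms into constant memory
// between launches and let the image accumulate (assumes d_image was zeroed
// beforehand).
void renderAll(const float3 *h_atoms, int totalAtoms, float *d_image,
               int width, int height, float pixelSize, float cutoff2)
{
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    for (int off = 0; off < totalAtoms; off += MAX_ATOMS) {
        int n = (totalAtoms - off < MAX_ATOMS) ? totalAtoms - off : MAX_ATOMS;
        cudaMemcpyToSymbol(c_atoms, h_atoms + off, n * sizeof(float3));
        renderNaive<<<grid, block>>>(d_image, width, height, n,
                                     pixelSize, cutoff2);
    }
    cudaDeviceSynchronize();
}
```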

My first plan was essentially to create several bins representing areas within the image, and when importing the atom coordinates, selectively add atoms into each bin. Then each block would only check the atoms within its designated bins, reducing the number of atoms it checks by a factor of N, where N is the number of bins. At least, that's what I think should happen? I suspect there are many problems with this, though, and I'm hoping people better at this than me can give me a few tips or ideas…

In an ideal scenario I would have one bin for every block in the grid of threads. Each block would just load the atom coordinates in the bin matching its block ID, so it would only be checking nearby atoms anyway. However, I have maybe 1000 blocks, and I think this would require creating 1000 arrays on the host and 1000 arrays in constant memory. I am not even sure that is possible, and if it is, it would seemingly require defining all the device constants by hand at the beginning of my .cu file and creating all 1000 bin arrays myself.

Another issue I have thought of: the number of blocks changes depending on the resolution and the shape of the input structure, so the number of constant arrays I would need would also change based on my input. I don't know how, or if, I can handle this.

Otherwise I could just manually set up some much smaller number of bins, say a 5x5 grid. That isn't too much effort, but I think I'm limiting my speedup to a factor of 25.

At this point any ideas would be greatly appreciated…

Also sorry for the wall of text, and I apologise if it doesn’t make any sense.

I thought about it a little more and came up with an alternative plan.

  1. I declare my constant array at the beginning at the maximum size I can; it can hold roughly 4600 atoms.

  2. I work out how many blocks I am going to need in the grid. (Easy, as I know the total dimensions of the input beforehand and I specify the resolution of the image.)

  3. I divide 4600 by the number of blocks to find how many atoms I can hold per block for each iteration of the kernel. Call this NumPerBlock.

  4. Then, as I read through my input file, I test each atom to see which block it belongs to and add it to an array at an index based on which block it belongs to. Once a block's slice is full I don't add any more. (I can keep track of the space left in each block with another array that just holds the count of atoms in each block.)

  5. When I reach the end of the input, or every block's slice is full, I launch the kernel.

  6. Then I go back through the input file again, this time skipping the first NumPerBlock atoms that belong to each block, so that I am only adding atoms I have never added before…

The only downside I can see to this method is if the selection process itself also takes a long time…
Ideally I would only run the selection once, putting the first NumPerBlock atoms for each block in one array, the second batch in another array, and so on. But I don't know beforehand how many arrays I will need, as that depends on the maximum number of atoms in any one block, which is unknown in advance. A rough sketch of what I have in mind for the kernel side is below.
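Something like this is what I mean (placeholder names, and the contribution term is again just a stand-in): constant memory is carved into one NumPerBlock-sized slice per block, a second constant array says how many atoms in each slice are valid this pass, and the image accumulates across launches.

```cuda
#include <cuda_runtime.h>

#define MAX_CONST_ATOMS 4600
#define MAX_BLOCKS      1024

__constant__ float3 c_binAtoms[MAX_CONST_ATOMS];  // NumPerBlock slots per block
__constant__ int    c_binCount[MAX_BLOCKS];       // valid atoms per block, this pass

__global__ void renderPass(float *image, int width, int height,
                           int numPerBlock, float pixelSize, float cutoff2)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    int bin  = blockIdx.y * gridDim.x + blockIdx.x;  // one bin per block
    int base = bin * numPerBlock;

    float x = px * pixelSize, y = py * pixelSize;
    float sum = 0.0f;

    // Only this block's slice of constant memory is scanned.
    for (int i = 0; i < c_binCount[bin]; ++i) {
        float dx = c_binAtoms[base + i].x - x;
        float dy = c_binAtoms[base + i].y - y;
        float r2 = dx * dx + dy * dy;
        if (r2 < cutoff2)
            sum += expf(-r2);                        // placeholder contribution
    }
    image[py * width + px] += sum;                   // accumulates across passes
}
```

One thing this glosses over: an atom near a bin boundary also contributes to pixels in neighbouring bins, so it would probably need to be duplicated into every bin whose pixels fall within the cutoff distance of it.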

Instead of constant memory you can use a texture, which would get rid of the size restriction.
Given your memory access pattern (each thread block reads one contiguous block of memory), I'm not sure there is actually any benefit from caching. So you could just as well use ordinary global memory, I think.
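Something along these lines, for example (hypothetical names; binStart/binCount assume you have already arranged the atoms so that each block's atoms are contiguous, and the `const __restrict__` qualifiers just tell the compiler the arrays are read-only):

```cuda
#include <cuda_runtime.h>

// Atoms in ordinary global memory: no 64 KB limit, so the whole data set
// can be processed in a single launch.
__global__ void renderGlobal(const float3 * __restrict__ atoms,
                             const int    * __restrict__ binStart,
                             const int    * __restrict__ binCount,
                             float *image, int width, int height,
                             float pixelSize, float cutoff2)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    int bin = blockIdx.y * gridDim.x + blockIdx.x;   // one bin per block
    float x = px * pixelSize, y = py * pixelSize;
    float sum = 0.0f;

    // Walk only this block's contiguous slice of the atom array.
    int end = binStart[bin] + binCount[bin];
    for (int i = binStart[bin]; i < end; ++i) {
        float dx = atoms[i].x - x;
        float dy = atoms[i].y - y;
        float r2 = dx * dx + dy * dy;
        if (r2 < cutoff2)
            sum += expf(-r2);                        // placeholder contribution
    }
    image[py * width + px] = sum;
}
```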

I don’t know your exact algorithm for combining the effects of the nearby atoms, but you may want to look at the OptiX API, which can give you ray/sphere intersections and can handle millions of objects and arbitrary queries. In this case you may not be using the rays for direct visualization, but OptiX is versatile enough to let you attach your own accumulators and such onto the sample rays and intersect them with all atoms within a fixed distance of the ray.

Is this the most efficient method for your exact computations? Probably not, but it’s an established toolkit that’s ready to use immediately, so you should at least consider whether it can save you development time.

Instead of binning in screen space you could also use some kind of bounding volume hierarchy (BVH). But the required data structures might eat up more of your precious constant memory. Essentially you would want to read up on ray tracing acceleration structures, I guess. You can find a sample implementation of BVH acceleration structures in CUDA here: http://users.softlab.ece.ntua.gr/~ttsiod/cudarenderer-BVH.html

Could you post some screenshots of the resulting images? I am curious what this looks like.

The BVH stuff certainly seems interesting; I will have to do a bit of reading up on it. Essentially, anything that lets me present only a relevant subset of atoms to each thread or block should get me a massive speedup. I still need to experiment with a few methods to find the fastest way to do this. If I use constant memory it is nice and fast, but I suffer from having to launch the kernel multiple times until I have accounted for all the atoms. Global memory is slower, but I could do it all in one go, or certainly in a lot fewer goes.

Also, I don’t recall where I read it now, but I believe I saw something about using global memory in a similar fashion to constant memory on Fermi devices, which may prove useful to me. Essentially every thread within a block reads the same values, which is why I went with constant memory to start with.

The second part of this article (http://www-hrem.msm.cam.ac.uk/hrem/local/ems/ImageCalculationTechniques.pdf) has a decent explanation of what I am implementing (the multislice method), but most of the implementations around are quite old and have limits on the size of image you can produce or the number of atoms you can simulate. In the fullness of time I am also hoping to simulate reasonably sized models in real time, which is why I want to speed this up a bit.

The SPH algorithm may help you.
SPH is short for Smoothed Particle Hydrodynamics. When calculating a new attribute of a particle, this algorithm searches the neighbouring region of the particle and sums contributions from nearby particles only.
At the moment my implementation can handle 600,000 particles in “real” time, on a GeForce GTX 280.
You could look for an article that explains how to “sort” particles into many bins so that only nearby particles need to be checked.
The article’s author is an engineer at Nvidia named Simon Green.
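The core trick in that approach is to hash each particle to a grid cell and sort by the hash, so each cell’s particles end up contiguous in memory. A rough sketch using Thrust (my own names, not the article’s code, and it assumes non-negative coordinates):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Assign each atom a flat grid-cell id based on its x,y position.
__global__ void computeCellHash(const float3 *atoms, unsigned int *cellId,
                                unsigned int *atomIdx, int numAtoms,
                                float cellSize, int cellsX)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numAtoms) return;
    int cx = (int)(atoms[i].x / cellSize);
    int cy = (int)(atoms[i].y / cellSize);
    cellId[i]  = cy * cellsX + cx;   // flat cell index
    atomIdx[i] = i;                  // remember the original atom index
}

// Sort atom indices by cell id so each cell's atoms are contiguous; a block
// can then walk just its own cell (and its neighbours).
void sortAtomsByCell(thrust::device_vector<unsigned int> &cellId,
                     thrust::device_vector<unsigned int> &atomIdx)
{
    thrust::sort_by_key(cellId.begin(), cellId.end(), atomIdx.begin());
}
```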
I hope this helps you.

Thanks for the advice; I'm going to look at this now as well. My current implementation gets to about 1,000,000 atoms in approximately 2 seconds, which is probably good enough for my purposes, but I am always looking for ways to improve it even if I don't really need to. At the moment I am importing the entire list of atoms, then splitting them into around 2500 different bins based on their x,y position, and also separating them into about 10 bins in the z direction. Then I am uploading 2500 atoms at a time into constant memory (the most it can hold) and repeatedly invoking the kernel, which just combines the results additively. It would probably be better if I didn't even use the constant memory and did it all in one go from global memory, but I am not sure how to get my data from the host as a vector<vector<vector<>>> into device memory and still make sense of how to access the individual elements.
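One way I could probably handle that is to flatten the nested vectors into one contiguous atom array plus per-bin start/count tables before copying to the device. A sketch of what I mean (hypothetical names):

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Atom { float x, y, z; };

// Flatten vector<vector<vector<Atom>>> (z slice -> x,y bin -> atoms) into one
// contiguous array plus per-bin start/count tables, then upload all three.
// The traversal order here defines the flat bin index the kernel must use.
void uploadBins(const std::vector<std::vector<std::vector<Atom>>> &bins,
                Atom **d_atoms, int **d_binStart, int **d_binCount)
{
    std::vector<Atom> flat;
    std::vector<int> start, count;

    for (const auto &slice : bins)          // z slices
        for (const auto &bin : slice) {     // x,y bins within a slice
            start.push_back((int)flat.size());
            count.push_back((int)bin.size());
            flat.insert(flat.end(), bin.begin(), bin.end());
        }

    cudaMalloc((void **)d_atoms,    flat.size()  * sizeof(Atom));
    cudaMalloc((void **)d_binStart, start.size() * sizeof(int));
    cudaMalloc((void **)d_binCount, count.size() * sizeof(int));
    cudaMemcpy(*d_atoms, flat.data(), flat.size() * sizeof(Atom),
               cudaMemcpyHostToDevice);
    cudaMemcpy(*d_binStart, start.data(), start.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(*d_binCount, count.data(), count.size() * sizeof(int),
               cudaMemcpyHostToDevice);
}
```

In the kernel, bin b's atoms would then simply be atoms[binStart[b]] up to atoms[binStart[b] + binCount[b] - 1], so the nested structure never needs to exist on the device.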