TLDR version: I have an idea to speed up my program. I am not sure if it is possible, and if it is, I don't know how best to implement it. Advice would be greatly appreciated :)
I have written a program that simulates electron microscope images based on an input file containing the coordinates of some atoms. At present my code is sufficiently fast for samples of up to a few thousand atoms, but I want to take it much further so I could potentially simulate millions of atoms, or run the current simulation at real-time speeds. At the moment I haven't gone above 4000 atoms because that's the maximum number I can hold in constant memory at once. I am fairly sure I can get around this issue by modifying my kernel to run multiple times, swapping which atoms are in the constant memory array between runs and accumulating the results additively. I haven't bothered yet because the program is already taking a reasonable length of time at 4000 atoms.
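To show what I mean by the multi-pass idea: because the image is a sum over atoms, I could process the atoms in chunks no larger than the constant-memory limit and add each chunk's partial image into the result. Here is a minimal serial C++ sketch of that accumulation (the `Atom` struct, chunk size, and `contribution` falloff are all made up for illustration; in the real code each pass would copy one chunk with `cudaMemcpyToSymbol` and relaunch the kernel):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Atom { float x, y; };  // hypothetical layout of one atom record

// Placeholder for one atom's contribution to one pixel (made-up falloff).
float contribution(const Atom& a, float px, float py) {
    float dx = a.x - px, dy = a.y - py;
    return 1.0f / (1.0f + dx * dx + dy * dy);
}

// Accumulate the whole image over atom chunks of at most maxConstAtoms.
// Each outer iteration stands in for: copy chunk to constant memory,
// launch the kernel, let it add its partial sums into the image.
void renderChunked(const std::vector<Atom>& atoms,
                   std::vector<float>& image, int width, int height,
                   std::size_t maxConstAtoms) {
    for (std::size_t start = 0; start < atoms.size(); start += maxConstAtoms) {
        std::size_t end = std::min(atoms.size(), start + maxConstAtoms);
        for (int p = 0; p < width * height; ++p) {
            float px = float(p % width), py = float(p / width);
            for (std::size_t i = start; i < end; ++i)
                image[p] += contribution(atoms[i], px, py);  // additive across passes
        }
    }
}
```

Since the per-pixel sum is linear in the atoms, splitting it into passes and adding the partial images gives the same result as one pass over all atoms.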
At the moment I am fairly certain I know what is limiting the speed of the program, though. At present each thread calculates a value for one pixel in the image by summing contributions from nearby atoms only. But to test this nearby condition it actually loads the coordinates of every atom from constant memory, so threads are essentially wasting time checking atoms that are nowhere near their vicinity.
My first plan was to create several bins representing areas within the image, and to selectively add atoms into each bin when importing the atom coordinates. Each block would then only check the atoms in its designated bins, reducing the number of atoms it checks by a factor of N, where N is the number of bins. At least, that's what I think should happen. I suspect there are many problems with this, though, and I'm hoping people better at this than me can give me a few tips or ideas…
In an ideal scenario I would have one bin for every block in the grid of threads. Each block would then just load the atom coordinates from the bin matching its block ID, so it would only be checking nearby atoms anyway. However, I have maybe 1000 blocks, and I think this would require creating 1000 arrays on the host and 1000 arrays in constant memory. I am not even sure that is possible, and if it is, it would seem to require defining all my device constants by hand at the beginning of my .cu file and creating all 1000 bin arrays myself.
Another issue I have thought of: the number of blocks changes depending on the resolution and shape of the input structure, so the number of constant arrays I would need would also change based on my input. I don't know how, or if, I can handle this.
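One way I could imagine expressing the bins without N separate arrays is a single flat atom array sorted by bin, plus a small offset table saying where each bin starts (a counting-sort style layout). The bin count is then just a runtime value, so it could change with resolution. A serial host-side sketch under those assumptions (all names are made up; on the device this would presumably be one global-memory atom array plus an offset array, with block b reading the slice `[offsets[b], offsets[b+1])`):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Atom { float x, y; };  // hypothetical atom record

// All atoms grouped by bin, in one flat array, plus per-bin start offsets.
// binsX * binsY can vary at runtime; no fixed number of arrays is needed.
struct BinnedAtoms {
    std::vector<Atom> sorted;   // atoms reordered so each bin is contiguous
    std::vector<int>  offsets;  // bin b owns sorted[offsets[b] .. offsets[b+1])
};

BinnedAtoms buildBins(const std::vector<Atom>& atoms,
                      float width, float height, int binsX, int binsY) {
    auto binOf = [&](const Atom& a) {
        int bx = std::min(binsX - 1, int(a.x / width  * binsX));
        int by = std::min(binsY - 1, int(a.y / height * binsY));
        return by * binsX + bx;
    };
    int nBins = binsX * binsY;
    BinnedAtoms out;
    out.offsets.assign(nBins + 1, 0);
    for (const Atom& a : atoms)                       // count atoms per bin
        out.offsets[binOf(a) + 1]++;
    for (int b = 0; b < nBins; ++b)                   // prefix sum -> start offsets
        out.offsets[b + 1] += out.offsets[b];
    out.sorted.resize(atoms.size());
    std::vector<int> cursor(out.offsets.begin(), out.offsets.end() - 1);
    for (const Atom& a : atoms)                       // scatter into bin order
        out.sorted[cursor[binOf(a)]++] = a;
    return out;
}
```

Each bin then needs only two integers rather than its own array. One caveat I can already see is that a pixel near a bin edge may need atoms from adjacent bins too, if they fall within the cutoff radius.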
Otherwise I could just manually set up some much smaller number of bins, like a 5x5 grid. That isn't too much effort to do, but I think it limits my speedup to a factor of 25.
At this point any ideas would be greatly appreciated…
Also sorry for the wall of text, and I apologise if it doesn’t make any sense.