I can now reproduce results for any number of particles (previously I was limited to 192) after restructuring the code, which required 5 more kernels and moving the controlling loop from the device to the host. However, this has slowed execution considerably. At the moment absolutely all data is in global memory, so I am now looking to make use of the GPU's other memory spaces. I have two questions:
Variable var1 is constant throughout the calculation. How should it be declared so that it resides in constant memory?
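For reference, a minimal sketch of the usual constant-memory pattern: a file-scope `__constant__` declaration that the host fills once with `cudaMemcpyToSymbol` before the iteration loop starts. It assumes var1 is a single float (its real type is not stated above); the kernel `scale_by_var1` and the host helper `run_scale` are hypothetical names used only for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: var1 is a single float. Arrays can also be placed in constant
// memory, as long as the total stays within the 64 KB constant-memory limit.
__constant__ float var1;

// Hypothetical kernel: every thread reads var1 through the constant cache.
__global__ void scale_by_var1(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = var1 * in[i];
}

// Hypothetical host helper: uploads var1 once, then launches the kernel.
std::vector<float> run_scale(const std::vector<float>& h_in, float h_var1)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Copy the host value into the __constant__ symbol. This only needs to
    // happen once, before the host-side iteration loop.
    cudaMemcpyToSymbol(var1, &h_var1, sizeof(float));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scale_by_var1<<<blocks, threads>>>(d_in, d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Constant memory is served fastest when all threads of a warp read the same address, which is the case for a single value like this.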
At each iteration, variable v2 is recalculated for each particle and then used heavily in two subsequent functions in that iteration. However, every thread needs the whole of the array holding v2, not just the elements at the indices of the threads in the block being executed. Is it possible to copy v2 from global to shared (or some other) memory only once, after it has been calculated for each particle, so that all threads can access it much faster during the rest of the iteration?
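One point that shapes any answer here: shared memory is private to a block and does not persist between kernel launches, so a single copy visible to all blocks for the rest of the iteration is not possible with shared memory alone. A common alternative is tiling, where each block stages v2 through shared memory one tile at a time inside the kernel that consumes it. The sketch below shows that pattern under several assumptions: v2 is a float array, the per-pair work is a placeholder product `v2[i] * v2[j]`, and the names `use_all_v2`, `run_use_all_v2` and `TILE` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 128  // assumption: block size and tile length are both 128

// Hypothetical kernel: thread i accumulates a placeholder interaction of
// v2[i] with every v2[j], reading v2[j] from shared memory tile by tile.
__global__ void use_all_v2(const float* v2, float* out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float mine = (i < n) ? v2[i] : 0.0f;
    float acc = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        // Each thread loads one element of the current tile from global memory.
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? v2[j] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        int count = min(TILE, n - base);
        for (int k = 0; k < count; ++k)
            acc += mine * tile[k];  // placeholder for the real per-pair work
        __syncthreads();  // everyone done before the tile is overwritten
    }
    if (i < n) out[i] = acc;
}

// Hypothetical host helper; the kernel must be launched with TILE threads per block.
std::vector<float> run_use_all_v2(const std::vector<float>& h_v2)
{
    int n = static_cast<int>(h_v2.size());
    std::vector<float> h_out(n);
    float *d_v2 = nullptr, *d_out = nullptr;
    cudaMalloc(&d_v2, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_v2, h_v2.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    use_all_v2<<<blocks, TILE>>>(d_v2, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_v2);
    cudaFree(d_out);
    return h_out;
}
```

With this layout each element of v2 is read from global memory once per block rather than once per thread, and threads that fall past the end of the array still take part in the loads and barriers so that `__syncthreads()` is reached by the whole block.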