I have recently been reporting problems with my code when it handles larger numbers of particles. Until now I had been allocating memory on the GPU by declaring the variables as __device__.
But now that I have switched to allocating all GPU memory with cudaMalloc, all my problems seem to have disappeared (except one: execution time has increased). I was having trouble running one medium-sized problem, but now I can run one with four times as many particles.
Why would this be so?
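
For reference, the two allocation styles I am comparing look roughly like the sketch below. It is a minimal illustration, not my actual code: the names N_STATIC, pos_static, pos_dynamic, scale_static and scale_dynamic, the particle counts, and the scaling kernels are all made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N_STATIC 1024  // hypothetical particle count, fixed at compile time

// Style 1: statically sized array in device global memory.
// Its size is baked into the compiled module.
__device__ float pos_static[N_STATIC];

__global__ void scale_static(float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N_STATIC) pos_static[i] *= s;
}

// Style 2: the kernel receives a pointer to memory obtained from cudaMalloc.
__global__ void scale_dynamic(float *pos, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos[i] *= s;
}

int main() {
    int n = 4 * N_STATIC;  // hypothetical: four times as many particles
    float *pos_dynamic = nullptr;

    // Size is chosen at run time, and failure is reported through the return code.
    cudaError_t err = cudaMalloc(&pos_dynamic, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(pos_dynamic, 0, n * sizeof(float));

    scale_static<<<(N_STATIC + 255) / 256, 256>>>(2.0f);
    scale_dynamic<<<(n + 255) / 256, 256>>>(pos_dynamic, n, 2.0f);

    // Wait for both kernels and surface any launch or execution error.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(pos_dynamic);
    return err == cudaSuccess ? 0 : 1;
}
```

In the first style the array size has to be known at compile time; in the second it is picked at run time and every allocation can be checked for errors.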