Thanks for the code example. Things are becoming clearer.
In regard to the code, obviously the position and velocity arrays would be the same size. What is unclear to me are the size restrictions (if any) placed on the different types of memory. I am assuming that storage for the two arrays would be global memory. Is this memory allocated on the GPU, or is it host memory? If it is host memory, is there a large performance hit when passing it to the GPU?
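To make my assumption concrete, here is a minimal sketch of what I imagine the host-side setup looks like (the names `pPosition`, `pVelocity`, the `float3` element type, and the count of 5000 are my guesses, not taken from your code):

```cuda
// Sketch only -- names and sizes are assumptions about the setup.
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int N = 5000;                 // assumed number of particles
    size_t bytes = N * sizeof(float3);

    // Host copies live in ordinary CPU memory.
    float3 *hPosition = (float3 *)malloc(bytes);
    float3 *hVelocity = (float3 *)malloc(bytes);

    // cudaMalloc allocates global memory on the device (GPU) itself.
    float3 *pPosition, *pVelocity;
    cudaMalloc((void **)&pPosition, bytes);
    cudaMalloc((void **)&pVelocity, bytes);

    // Explicit host-to-device copies; as I understand it, this transfer
    // is where the cost of moving data to the GPU is paid.
    cudaMemcpy(pPosition, hPosition, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(pVelocity, hVelocity, bytes, cudaMemcpyHostToDevice);

    /* ... kernel launch would go here ... */

    cudaFree(pPosition);
    cudaFree(pVelocity);
    free(hPosition);
    free(hVelocity);
    return 0;
}
```

Is this roughly the right picture, i.e. the arrays themselves live in device global memory and only the initial `cudaMemcpy` touches host memory?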
Also, I am aware that when calling this function, the following format is used:
propagate <<< Dg, Db, Ns >>> (time, pPosition, pVelocity);
where Dg is the number of blocks, Db is the number of threads per block, and Ns is the number of bytes in shared memory that is dynamically allocated (typically 0). What isn’t clear is how these numbers are chosen.
If I had 5000 positions and velocities to be calculated, would I be correct in assuming that I would need
5000 / 128 threads per block,
which equals 39.06, i.e. 40 blocks after rounding up? Thus, the call would be written
propagate <<< 40, 128, 0 >>> (time, pPosition, pVelocity);
I’ve seen a lot mentioned in the documentation about bank conflicts. Would the launch configuration above cause many bank conflicts?