By my calculations, given the data structures I am using (arrays of floats and ints), I should be able to fit approximately 6 million particles on my device, but I can’t use more than 800K.
What else is the device memory being used for? I thought I could allocate all of the available 1.6 GB of device memory for particle data, but that is obviously not the case.
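To make the arithmetic behind those numbers explicit, here is a rough sketch of the budget calculation. The helper names are hypothetical, and the real per-particle layout is not given in this thread, so the footprint is left as a parameter. The usable byte count is also an assumption: in practice it is better read at runtime with `cudaMemGetInfo` than taken from the card's nominal size.

```cpp
#include <cassert>
#include <cstdint>

// Number of whole particles that fit in `usable_bytes` when each particle
// costs `bytes_per_particle` summed over all of its arrays.
// Hypothetical helper, not part of any CUDA API.
std::uint64_t max_particles(std::uint64_t usable_bytes,
                            std::uint64_t bytes_per_particle) {
    assert(bytes_per_particle > 0);
    return usable_bytes / bytes_per_particle;
}

// Inverse view: the per-particle footprint implied by a particle count,
// rounded down to whole bytes. Hypothetical helper.
std::uint64_t implied_bytes_per_particle(std::uint64_t usable_bytes,
                                         std::uint64_t particle_count) {
    assert(particle_count > 0);
    return usable_bytes / particle_count;
}
```

Taking 1.6 GB as 1,600,000,000 bytes (an assumption), `implied_bytes_per_particle(1600000000ULL, 6000000ULL)` gives 266 bytes per particle for the expected 6 million, while the observed limit of 800K would correspond to 2000 bytes per particle if the whole device really were available. That gap is what the question is about.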
The data of all active thread blocks stays in registers and shared memory. Therefore the number of thread blocks that can be active at once depends on how much shared memory and how many registers your kernel needs.
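As a rough illustration of that dependency, the sketch below estimates how many blocks can be resident on one multiprocessor. The function name is hypothetical. The register file size and the hardware cap on resident blocks are parameters you would need to look up for your device. The model is deliberately simplified: it ignores allocation granularity and the per-multiprocessor thread limit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Simplified estimate of resident blocks per multiprocessor: the smallest of
// the shared-memory limit, the register limit, and the hardware block cap.
// Hypothetical helper; all hardware figures are supplied by the caller.
std::uint32_t active_blocks_per_sm(std::uint32_t shared_bytes_per_sm,
                                   std::uint32_t shared_bytes_per_block,
                                   std::uint32_t regs_per_sm,
                                   std::uint32_t regs_per_thread,
                                   std::uint32_t threads_per_block,
                                   std::uint32_t max_blocks_per_sm) {
    assert(threads_per_block > 0);
    std::uint32_t limit = max_blocks_per_sm;

    // A kernel that uses no shared memory is not limited by it.
    if (shared_bytes_per_block > 0)
        limit = std::min(limit, shared_bytes_per_sm / shared_bytes_per_block);

    // Registers are consumed per thread, so a block needs
    // regs_per_thread * threads_per_block of them.
    std::uint32_t regs_per_block = regs_per_thread * threads_per_block;
    if (regs_per_block > 0)
        limit = std::min(limit, regs_per_sm / regs_per_block);

    return limit;
}
```

For example, with 16 KB of shared memory per multiprocessor, 4 KB used per block, an assumed 8192 registers per multiprocessor, 16 registers per thread, 256 threads per block, and an assumed cap of 8 blocks, `active_blocks_per_sm(16384, 4096, 8192, 16, 256, 8)` returns 2: the register limit is the binding one.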
Block data is kept in shared memory (16 KB per multiprocessor), and (if I understand correctly) code is buffered from constant memory (64 KB centrally and 8 KB per multiprocessor). There is an upper limit of 2 million PTX instructions per kernel, so I am guessing that some global memory is reserved for that, but it cannot be all that large. Certainly not enough to reserve close to half of the total device memory…