What else is the device memory being used for?

By my calculations, given the data structures I am using (arrays of floats and ints), I should be able to fit approximately 6 million particles on my device, but I can't use more than 800K.

What else is the device memory being used for? I thought I could allocate nearly all of the available 1.6 GB of device memory for particle data, but that is obviously not the case.
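A rough back-of-envelope check makes the gap concrete. The ~256 bytes per particle below is a hypothetical figure (the actual layout of the float/int arrays isn't given), chosen only because it roughly reproduces the 6-million estimate:

```python
# Hypothetical per-particle footprint; the poster's real layout is unknown.
BYTES_PER_PARTICLE = 256          # assumed: several float/int arrays per particle
DEVICE_MEMORY = 1.6e9             # 1.6 GB, as stated above

expected = DEVICE_MEMORY / BYTES_PER_PARTICLE
print(f"expected capacity: {expected / 1e6:.2f} M particles")

# What the observed 800K limit would imply about usable memory:
observed = 800_000
implied_usable = observed * BYTES_PER_PARTICLE
print(f"memory implied by 800K particles: {implied_usable / 1e6:.0f} MB")
```

Under that assumption the observed limit corresponds to only about 200 MB of usable memory, which is the puzzle being asked about.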

What card are you using and are you running a display on it?

One Tesla C870, so no display.

Where is the GPU code stored during execution?

Where is one block’s data stored while another block’s data is being loaded into shared?

The data of all active thread blocks stays in registers/shared memory. The number of active thread blocks therefore depends on how much shared memory and how many registers your kernel needs.
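That resource limit can be sketched as a simple calculation. The figures below are the G80-class (Tesla C870) limits of 16 KB shared memory, 8192 registers, and at most 8 resident blocks per multiprocessor; the model is simplified and ignores the allocation granularity that real hardware rounds up to:

```python
# Simplified occupancy model for a compute-capability-1.0 device.
SHARED_PER_SM = 16 * 1024   # bytes of shared memory per multiprocessor
REGS_PER_SM = 8192          # registers per multiprocessor
MAX_BLOCKS_PER_SM = 8       # hardware cap on resident blocks

def active_blocks(shared_per_block, regs_per_thread, threads_per_block):
    """Active blocks per SM = the tightest of the three resource limits."""
    by_shared = (SHARED_PER_SM // shared_per_block
                 if shared_per_block else MAX_BLOCKS_PER_SM)
    by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    return min(by_shared, by_regs, MAX_BLOCKS_PER_SM)

# e.g. a kernel using 4 KB shared memory and 16 registers/thread at 128 threads:
print(active_blocks(4 * 1024, 16, 128))  # limited to 4 blocks per SM
```

Whichever resource runs out first caps how many blocks can be resident at once; blocks beyond that wait until a resident block finishes.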

Do you use pinned or mapped memory?

  1. Is 800K the number of particles, or the data size in bytes?

  2. Find the allocation size in bytes at which cudaMalloc() starts failing.

  3. See if the zero-copy feature can help.

  4. See if you can re-use portions of your memory.
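Step 2 amounts to a binary search for the largest size cudaMalloc() accepts. Here is a language-agnostic sketch; `try_alloc` is a stand-in predicate that, on the device, would call cudaMalloc()/cudaFree() and report whether the allocation succeeded:

```python
def largest_alloc(try_alloc, hi=2 * 1024**3):
    """Binary-search the largest n for which try_alloc(n) succeeds.
    Invariant: allocations of size lo succeed, allocations of size hi fail."""
    lo = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if try_alloc(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Simulate a device where only ~200 MB is actually allocatable:
LIMIT = 200 * 1024**2
print(largest_alloc(lambda n: n <= LIMIT))
```

On real hardware the same loop, with cudaMalloc() in the predicate, pins down exactly where allocation breaks and whether the usable total matches expectations.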

Block data is kept in shared memory (16 KB per multiprocessor), and (if I understand correctly) code is buffered from constant memory (64 KB in total, with an 8 KB cache per multiprocessor). There is an upper limit of 2 million PTX instructions per kernel, so I am guessing that some global memory is reserved for kernel code, but it cannot be all that large. Certainly not enough to account for close to half of the total device memory…