What info can I extract about use of device memory?

I’ve been doing some simple experiments on the GPUs this weekend and found that

  1. when I try to allocate memory for 5 million particles per GPU, the compiler tells me there is insufficient memory.
  2. when I try to allocate memory for 1 million particles per GPU, the compiler does not complain, but when I execute a very simple kernel I get an "unspecified launch failure", which I suspect is due to insufficient resources/memory on the GPU.
  3. when I try the same for 750,000 particles, the same thing happens as with 1 million.
  4. when I try the same for 700,000 particles, the compiler does not complain and the very simple kernel executes.

It appears that the memory requirement is not linear in the number of particles. Why is that? For smaller particle counts, cuMemGetInfo suggested that 5 million particles would fit on each GPU, ignoring kernels.

And how much memory, besides the data itself, does a kernel require to execute? Can't this be calculated at compile time?

How can I get hold of this info, or calculate it myself?
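For the "how can I get hold of this info" part, a sketch of what querying looks like with the runtime API: cudaMemGetInfo (the runtime counterpart of cuMemGetInfo) reports free and total device memory, and every cudaMalloc returns an error code you can check instead of waiting for a kernel launch to fail. The particle layout (4 floats per particle, variable name d_pos) is just an assumption for illustration; untested here since it needs a CUDA-capable device:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);
    printf("free: %zu MB, total: %zu MB\n", free_b >> 20, total_b >> 20);

    /* Check the allocation itself instead of waiting for the kernel to fail.
       Hypothetical layout: 4 floats per particle. */
    float *d_pos;
    size_t n = 5 * 1000 * 1000;  /* particles */
    cudaError_t err = cudaMalloc((void **)&d_pos, n * 4 * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Query again to see how much the allocation actually consumed. */
    cudaMemGetInfo(&free_b, &total_b);
    printf("free after alloc: %zu MB\n", free_b >> 20);
    cudaFree(d_pos);
    return 0;
}
```

Comparing the "free" figure before and after an allocation also reveals how much memory the driver really reserved, which can be more than you asked for.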

The GPU (or CUDA) has very strict memory alignment requirements. This means that, for example, allocating 1 char sixteen times (16 bytes in total) is not the same as allocating 16 chars once (also 16 bytes). I found this out the hard way: I tried to allocate a small amount of memory about 100,000 times, not even 50 MB in total, and received out-of-memory errors on a card that has 512 MB. So try to be smart and allocate in large chunks :)