malloc can't allocate more than 8 MB from a __device__ function, even though 6 GB is available

About 7 to 7.2 MB seems to be OK. 7.5 MB returns NULL.

I am calling cudaMemGetInfo() right before the kernel launch (a __global__ function that calls the __device__ function, which in turn calls malloc).

It gives me the green light: 5.7 GB of available memory (6 GB total).
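For reference, cudaMemGetInfo() reports free *global* memory, while device-side malloc draws from a separate, fixed-size heap (8 MB by default), so the two numbers answer different questions. A minimal sketch that prints both, assuming the current device is the one the kernel runs on; `to_mib` is a hypothetical helper added only for printing:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: convert a byte count to MiB for printing.
double to_mib(size_t bytes) { return static_cast<double>(bytes) / (1024.0 * 1024.0); }

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    // Free/total global memory on the current device (the figure that looked like a green light).
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo: %s\n", cudaGetErrorString(err));
        return 1;
    }

    size_t heap_bytes = 0;
    // Size of the separate heap that device-side malloc() allocates from.
    err = cudaDeviceGetLimit(&heap_bytes, cudaLimitMallocHeapSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("global memory free: %.1f MiB of %.1f MiB\n", to_mib(free_bytes), to_mib(total_bytes));
    printf("device malloc heap: %.1f MiB\n", to_mib(heap_bytes));
    return 0;
}
```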

This malloc is the first memory allocation in the program (!!!). (However, there is one cudaMalloc on the host that allocates approximately 100 MB.)

So 7 MB is OK, but 7.5 MB is too much???

I am a complete beginner. Thanks in advance.

From what I gather, the allocation that fails is one done on/by the device, not the host?

little_jimmy, yes: you can use malloc to allocate global memory (on the device, from the device). By default, though, it draws from a heap limited to 8 MB. You have to raise the limit if you want more.
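The limit is raised from the host with `cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)`, before launching any kernel that calls device-side malloc. A minimal sketch, assuming a single thread makes one allocation; the 64 MiB request and 128 MiB heap are illustrative numbers, not values from this thread, and `mib`/`alloc_kernel` are hypothetical names:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: n mebibytes expressed in bytes.
constexpr size_t mib(size_t n) { return n * 1024 * 1024; }

// One thread allocates `bytes` from the device heap and records whether it worked.
__global__ void alloc_kernel(size_t bytes, int* ok) {
    char* p = static_cast<char*>(malloc(bytes));  // device-side malloc
    *ok = (p != nullptr);
    if (p != nullptr) {
        p[0] = 1;   // touch the memory
        free(p);    // device-side free
    }
}

int main() {
    const size_t heap_bytes = mib(128);  // illustrative heap size
    const size_t request = mib(64);      // illustrative single allocation

    // Raise the device malloc heap BEFORE any kernel that uses malloc() is launched.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int* d_ok = nullptr;
    cudaMalloc(&d_ok, sizeof(int));
    alloc_kernel<<<1, 1>>>(request, d_ok);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int ok = 0;
    cudaMemcpy(&ok, d_ok, sizeof(int), cudaMemcpyDeviceToHost);
    printf("device malloc of %zu bytes %s\n", request, ok ? "succeeded" : "returned NULL");
    cudaFree(d_ok);
    return 0;
}
```

The heap is deliberately sized above the largest request: the allocator keeps some bookkeeping in the same heap, which would plausibly explain why 7.5 MB already fails under the 8 MB default.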

How many threads call/execute the malloc: one or many?
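The thread count matters because the heap has to hold every allocation that is live at the same time, so the per-thread size gets multiplied by the number of threads. A minimal sketch, assuming an illustrative launch of 256 threads that each request 64 KiB (16 MiB in total, already twice the 8 MB default); `heap_demand` and `per_thread_alloc` are hypothetical names:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Heap bytes needed if every thread holds one malloc(bytes_per_thread) at once
// (ignores the allocator's own overhead).
size_t heap_demand(size_t blocks, size_t threads_per_block, size_t bytes_per_thread) {
    return blocks * threads_per_block * bytes_per_thread;
}

// Every thread calls device-side malloc; calls that return NULL are counted.
__global__ void per_thread_alloc(size_t bytes_per_thread, int* failures) {
    char* p = static_cast<char*>(malloc(bytes_per_thread));
    if (p == nullptr) {
        atomicAdd(failures, 1);
    } else {
        p[0] = 1;  // touch the allocation
    }
    __syncthreads();  // keep this block's allocations alive until every thread has tried
    if (p != nullptr) free(p);
}

int main() {
    const unsigned blocks = 1, threads = 256;  // illustrative launch configuration
    const size_t per_thread = 64 * 1024;       // 64 KiB per thread (illustrative)
    printf("heap needed: %zu bytes\n", heap_demand(blocks, threads, per_thread));

    int* d_failures = nullptr;
    cudaMalloc(&d_failures, sizeof(int));
    cudaMemset(d_failures, 0, sizeof(int));
    per_thread_alloc<<<blocks, threads>>>(per_thread, d_failures);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int failures = 0;
    cudaMemcpy(&failures, d_failures, sizeof(int), cudaMemcpyDeviceToHost);
    printf("malloc calls that returned NULL: %d of %u\n", failures, blocks * threads);
    cudaFree(d_failures);
    return 0;
}
```

With the default heap, some of these calls would be expected to return NULL; the exact count depends on the allocator and is not something to rely on.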

Post the device code that issues the failing malloc, if you will.

Yes, the 8 MB limit and the method to raise it are documented:

Programming Guide :: CUDA Toolkit Documentation