Maximum size of page locked memory

Just wondering if there is a maximum size for page-locked memory under Linux? The machine I had been allocating it on had only 1GB of memory, so I assumed it was running out of available space when the code bailed. I have since added another 2GB, but with no apparent effect. Is there some other software limit?


There is a limitation of 4GB for page locked memory.
With CUDA 2.2, this limitation will go away.

Thanks for the confirmation, though clearly this is not what I am hitting so I’ll investigate further.


I’ve never seen a machine that couldn’t allocate 80-90% of its physical memory as page-locked memory for CUDA (up to the 4GB limit pre-2.2), so I really don’t know what you’re hitting. However, I’ve never tried this on 32-bit Linux either, so that might be something to consider.

So, when are >4 GB cards coming? ;)

You will be able to copy a sub-matrix to/from GPU memory from/to a huge matrix in CPU memory.
The sub-matrix is going to be limited by the memory on the card.
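On the host side, the pitched sub-matrix copy that `cudaMemcpy2D` performs can be sketched in plain C. This is only an illustration of the row-by-row stride logic, not the CUDA API itself (note also that the pitches here are in elements, whereas `cudaMemcpy2D` takes them in bytes):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a `rows` x `cols` sub-matrix out of a larger row-major matrix
 * whose rows are `src_pitch` elements wide, into a destination whose
 * rows are `dst_pitch` elements wide. This is the same row-by-row
 * pitched copy that cudaMemcpy2D performs between host and device. */
static void copy_submatrix(float *dst, size_t dst_pitch,
                           const float *src, size_t src_pitch,
                           size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        memcpy(dst + r * dst_pitch, src + r * src_pitch,
               cols * sizeof(float));
}
```

To pull the 2x2 block starting at row 1, column 1 out of a 4x4 host matrix, you would pass `src + 1 * 4 + 1` with `src_pitch = 4`; the sub-matrix copied to the card only needs to fit in device memory, while the full matrix can live in (much larger) host memory.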

Ahh, of course. I should have thought of that; I'm still in the process of understanding block-wise matrix algorithms…

Does that mean nVidia has >4G devices in the oven?

The Tesla cards are already 4GB


Yeah, I know that. They’re quite expensive. >4GB means greater than 4GB.

Ok, I would be surprised to see anything bigger than 4GB, as I believe it is a 32-bit processor, though larger amounts of page-locked memory could be very useful on the host. As for larger commodity cards, I think that will depend on when games etc. have a use for the extra space :-)


On 64-bit hosts, device pointers are already 64-bit. There’s no reason to suspect that CUDA devices can’t address more than 4 GB of memory. I imagine the limitation is more one of market/price and not technology. :)

tsk, tsk… sizeof(CUdeviceptr) == 4.

Wait, what? Then why all the argument about 64 bit pointers slowing down CUDA kernels on 64 bit hosts? [Damn search engine won’t show me where that thread is.]

Yes, it is a 32-bit processor. That’s why supporting anything over 4GB will require a major rework of the architecture. I’m actually impressed by how nVidia managed to put the full 4GB on a 32-bit device.

I agree that more than 4GB of page-locked memory could be helpful… but only on systems with 2 or more S1060s, powered by a Nehalem with triple-channel DDR3 or a dual+ processor system… Anything else simply wouldn’t have enough memory bandwidth to sustain two transfers.

You’re probably referring to a thread I started a while back.

Pointers are treated by the 64-bit nvcc compiler as being 8-byte wide to be consistent with the size of pointers on the host. In very specific cases, this increases the register usage of a kernel vs its 32-bit counterpart to the point that it lowers its occupancy. Thus the kernel becomes slower. I have one such headache-producing kernel.

I’m not sure how CUdeviceptr works, but it’s most likely declared as an unsigned int, not a true pointer like void*. But if you have a float*, for example, that will be treated as 8 bytes wide even within the device on a 64-bit compilation.
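As a quick host-side illustration of the sizes being discussed (assuming CUdeviceptr really is a plain unsigned int underneath, which the post above only guesses at):

```c
#include <stddef.h>

/* Sizes as seen by a 64-bit host compiler: a true pointer type such
 * as float* is 8 bytes wide, while an unsigned int (the presumed
 * underlying type of CUdeviceptr in this CUDA generation) stays at
 * 4 bytes. The extra pointer width is what can drive up register
 * usage in 64-bit-compiled kernels. */
static const size_t host_ptr_size = sizeof(float *);      /* 8 on a 64-bit build */
static const size_t uint_size     = sizeof(unsigned int); /* 4 */
```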

Here’s the thread:…mp;#entry502288

Check the value of ‘max locked memory’ in the output of ‘ulimit -a’ (or run ‘ulimit -l’ to print just that value). It can be overridden with ulimit or by changing ‘memlock’ in /etc/security/limits.conf .
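For reference, the same limit can also be queried programmatically via getrlimit; a minimal Linux-specific sketch (whether the CUDA driver's pinned allocations actually count against this limit may depend on the driver version):

```c
#include <stdio.h>
#include <sys/resource.h>

/* Print the soft RLIMIT_MEMLOCK limit -- the same value that
 * `ulimit -l` reports (in KB) and that the `memlock` entry in
 * /etc/security/limits.conf controls. Returns 0 on success. */
static int print_memlock_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("max locked memory: unlimited\n");
    else
        printf("max locked memory: %llu KB\n",
               (unsigned long long)rl.rlim_cur / 1024);
    return 0;
}
```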