Hi,
Just wondering if there is a maximum size for page-locked memory under Linux? The machine I had been allocating it on had only 1GB of memory, so I assumed it was running out of available space to allocate when the code bailed. I have since added another 2GB, but with no apparent effect. Is there some other software limit?
I’ve never seen a machine that couldn’t allocate 80-90% of its physical memory as page-locked memory for CUDA (up to 4GB in pre-2.2), so I really don’t know what you’re hitting. However, I’ve never tried this on 32-bit Linux either, so that might be something to consider.
You will be able to copy a sub-matrix between GPU memory and a huge matrix in CPU memory.
The sub-matrix is going to be limited by the memory on the card.
Ok, I would be surprised to see anything bigger than 4GB on the device, as I believe it is a 32-bit processor, though larger amounts of page-locked memory could be very useful on the host. As for larger commodity cards, I think that will depend on when games etc. have a use for the extra space :-)
On 64-bit hosts, device pointers are already 64-bit. There’s no reason to suspect that CUDA devices can’t address more than 4 GB of memory. I imagine the limitation is more one of market/price and not technology. :)
Wait, what? Then why all the argument about 64-bit pointers slowing down CUDA kernels on 64-bit hosts? [Damn search engine won’t show me where that thread is.]
Yes, it is a 32-bit processor. That’s why supporting anything over 4GB will require a major rework of the architecture. I’m actually impressed by how nVidia managed to put the full 4GB on a 32-bit device.
I agree that more than 4GB of page-locked memory could be helpful… but only on systems with 2 or more S1060s, powered by a Nehalem with triple-channel DDR3 or a dual+ processor system… Anything else simply wouldn’t have enough memory bandwidth to keep two transfers fed.
You’re probably referring to a thread I started a while back.
Pointers are treated by the 64-bit nvcc compiler as 8 bytes wide, to be consistent with the size of pointers on the host. In very specific cases this increases a kernel’s register usage versus its 32-bit counterpart, to the point that it lowers occupancy and the kernel becomes slower. I have one such headache-producing kernel.
I’m not sure how CUdeviceptr works, but it’s most likely declared as an unsigned int, not a true pointer like void*. A float*, though, will be treated as 8 bytes wide even within device code in a 64-bit compilation.
Check the value of ‘max locked memory’ in the output of ‘ulimit -a’ (or just run ‘ulimit -l’ to print it directly). It can be overridden with ulimit, or by changing ‘memlock’ in /etc/security/limits.conf .
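For reference, checking and raising the limit looks like this (the limits.conf lines are the usual form; substitute your own username):

```shell
# Show the current locked-memory limit (in kB, or "unlimited")
ulimit -l

# Raise it for the current shell only (needs privilege,
# or a hard limit higher than the soft one):
#   ulimit -l unlimited

# Make it permanent in /etc/security/limits.conf (re-login to apply):
#   youruser  soft  memlock  unlimited
#   youruser  hard  memlock  unlimited
```

A stock 32-bit distro often defaults this to something tiny like 32 or 64 kB, which would explain page-locked allocations failing long before physical memory runs out.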