GTX TITAN X and other >4GB VRAM cards

I have a legacy application created with CUDA 2.3 (Runtime API) and I’m trying to run it on a GTX TITAN X. This is the first card I’ve ever used with >4GB VRAM. For large datasets my application uses as much VRAM as it can allocate, mostly via cudaMalloc() in blocks of a few MB at a time. It seems to run fine, except that cudaMalloc() fails to allocate any more memory after the first 4GB. Is there something I need to do in order to enable 64-bit device pointers?
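The allocation path is roughly along these lines (simplified sketch; the chunk size and names are illustrative, the real code is more involved):

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const size_t chunkBytes = 8 * 1024 * 1024;   // a few MB per allocation
    std::vector<void*> chunks;
    size_t total = 0;

    // Keep grabbing fixed-size chunks until cudaMalloc() refuses.
    for (;;) {
        void* p = 0;
        if (cudaMalloc(&p, chunkBytes) != cudaSuccess) {
            break;
        }
        chunks.push_back(p);
        total += chunkBytes;
    }
    printf("Allocated %lu MB before cudaMalloc() failed\n",
           (unsigned long)(total >> 20));

    // Release everything again.
    for (size_t i = 0; i < chunks.size(); ++i) {
        cudaFree(chunks[i]);
    }
    return 0;
}
```

On the TITAN X a loop like this stops just after 4GB, even though the card has 12GB.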

You need to recompile your code as a 64-bit application.

It’s already compiled as a 64-bit application. Is there anything else I need to do? Do I have to enable the unified address space or something?

Are you trying to allocate a single data block larger than 4GB?

No, lots of small blocks. I assume a single >4GB block would also fail. In fact, that was a documented feature at one time wasn’t it?

I suspect it’s something to do with using an ancient CUDA version, but I can’t quite put my finger on the exact problem.

In the early generations of CUDA, there were no devices with > 4 GB of device memory. It is therefore entirely possible that no provisions were made in the CUDA runtime and driver for more than 4 GB of address space at that time. Alternatively, if such support already existed, it may have suffered from bugs that were undetectable due to the lack of large-memory devices. I do not recall exactly how old CUDA 2.3 is, but to my recollection there were no devices with > 4 GB prior to CUDA 3.0.

Since you are running the latest hardware, I would suggest updating to the latest version of CUDA as well.

I guess that is a reasonable response but it does present me with a number of new issues.

I would ideally like to retain compatibility (with no performance regression) with hardware going back to compute capability 1.0, as well as compatibility with 32-bit operating systems. From the release notes this appears to be impossible (on both counts) with CUDA 7.0. Technically the release notes indicate that running on a 32-bit OS should be possible for GeForce products (but not Tesla or Quadro), but they then go on to state that 32-bit cuFFT (which I am using) is not supported.

Is there a version of CUDA that would give me access to all 12GB of VRAM and full performance from the GTX TITAN X while still having the backward compatibility I am looking for? I realize that this depends on whether my particular application requires any sm_52 features to achieve full performance, but for the sake of argument let’s assume it does not.

Has it become common practice for people using CUDA in their applications to maintain several versions built against different CUDA versions?

If I go with CUDA 7.0 alone then I can support hardware going back to sm_20 and running on a 64-bit OS only. Is that correct?

If I accept that limitation then is there anything else I need to worry about? For instance, many GPUs right up to sm_52 have <4GB of device memory. If my kernels are currently using 32-bit pointers then switching to 64-bit pointers would presumably cost extra registers and arithmetic operations. Is there a recommended practice I can follow which will lead to 64-bit pointers being used only where strictly necessary?

You can’t access more than 4GB of GPU memory with a 32-bit system.

64-bit versions of CUDA 6.0 should support cc 1.0 devices. With an appropriate (recent) driver, that same system should be able to “support” a cc 5.x device, although you will be compiling for cc 3.5 with PTX and depending on the driver’s forward JIT-compile mechanism to convert the cc 3.5 PTX into code suitable for running on these newer devices.

Yes, if you only use CUDA 7, then you can only compile for cc 2.0 or newer targets.

There is no mixing or runtime decision-making about 32-bit vs. 64-bit pointers. An app binary is either a 32-bit binary or a 64-bit binary. If it is a 64-bit binary, all pointers will be 64-bit; if it is a 32-bit binary, all pointers will be 32-bit.
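A trivial way to see this (just a sketch): build the same program with -m64 and then with -m32 (device-side printf needs cc 2.0 or newer), and the host and device report the same pointer width in each case.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reportDevicePointerSize()
{
    // Device pointer width follows the bitness of the binary.
    printf("device sizeof(void*) = %u bytes\n", (unsigned)sizeof(void*));
}

int main()
{
    printf("host   sizeof(void*) = %u bytes\n", (unsigned)sizeof(void*));
    reportDevicePointerSize<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```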

Sorry, I’ve probably been a bit unclear about the 32-bit / 64-bit situation.

Yes, I always planned to use 64-bit for the GTX TITAN X. However, I would ideally like to be able to make a 32-bit build from the same source code against the same CUDA version that would run as it does now on any GPU from sm_10 to sm_52 (but subject to a 4GB VRAM limit). Whether I would run a 32-bit build or a 64-bit build would depend upon whether the PC was running a 64-bit OS or not.

It does sound like this may be possible using CUDA 6.0 but I probably need to check that the forward-JIT-compiled code runs as well on the TITAN X as code compiled with the CUDA 7.0 toolkit.

It used to be possible to build a mixed application with 64-bit host code and 32-bit device code. I think this was already unsupported by CUDA 2.3 but the same effect can be achieved by explicitly recasting the pointers. I realize this is slightly evil but it has worked perfectly for many years even on systems with several 4GB VRAM cards. It turns out that using CUDA 2.3 it even works on the TITAN X (apart from not being able to access all 12GB VRAM).
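To be concrete, the recast I’m describing looks roughly like this (a sketch only; the kernel and variable names are made up, and it obviously only works while every allocation happens to land below the 4GB boundary):

```cpp
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// The kernel takes a 32-bit "handle" and rebuilds the device pointer from it.
__global__ void scale(unsigned int dataHandle, int n, float factor)
{
    float* data = reinterpret_cast<float*>(static_cast<uintptr_t>(dataHandle));
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 << 20;
    float* d = 0;
    cudaMalloc((void**)&d, n * sizeof(float));

    // The slightly evil part: squeeze the device pointer into 32 bits.
    // Only valid if the address really is below 4GB.
    uintptr_t addr = reinterpret_cast<uintptr_t>(d);
    if (addr > 0xFFFFFFFFu) {
        printf("Device pointer does not fit in 32 bits\n");
        return 1;
    }
    scale<<<(n + 255) / 256, 256>>>((unsigned int)addr, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```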

Obviously I’m going to have to use proper 64-bit pointers at all times to access all of the memory on the TITAN X but I was hoping to only do so if the GPU in use had >4GB VRAM (or perhaps if the compute capability was above some version where the cost was negligible). I guess that such an approach is unsupported?
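The sort of runtime check I had in mind would be something like this (sketch); the part I’m unsure about is whether there is any supported way to act on it within a single binary:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                     // device 0 for illustration

    bool bigMemory = prop.totalGlobalMem > (4ULL << 30);   // more than 4GB of VRAM?
    printf("%s: %.1f GB, sm_%d%d -> would pick the %s path\n",
           prop.name,
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
           prop.major, prop.minor,
           bigMemory ? "64-bit pointer" : "32-bit pointer");
    return 0;
}
```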

Are there actually any drivers that support both sm_10 and sm_5x devices? The most recent drivers support only architectures >= sm_20, since support for devices < sm_20 was removed from the entire shared driver infrastructure, independent of CUDA use.

Best I know there is no support for forcing use of 32-bit pointers in 64-bit binaries. Use of 64-bit pointers will require additional registers and instructions. The effect on performance can range from negligible to significant, depending on the GPU architecture. The impact of switching to 64-bit was most pronounced on register-starved early GPU architectures. The compiler may be able to optimize pointers by using 32 bits when operating in address spaces that are known to be always < 4 GB, such as shared memory. I have not checked into that, though.

From what I have seen of CUDA-enabled applications, they usually drop support for older platforms well ahead of NVIDIA removing driver support for those platforms. This probably has multiple practical reasons, such as the older platforms lacking required features or performance, or the absence of old hardware to maintain test coverage.

That isn’t necessarily an issue. I already use different drivers for different GPUs. However, it does mean that if I only use one version of CUDA then that version must have shipped with a driver that still supported the oldest GPU I want to support. The latest driver does seem to happily forward-JIT-compile sm_10 code to sm_5x if that was your question.

Yes, it’s those register-starved early GPU architectures I’m worrying about.

A sensible approach, and ultimately it’s probably the one I’ll have to take. Some of my other algorithms definitely benefit from some of the newer features, and old hardware for testing is also an issue.