GTX 690 CUDA shared memory space, TCC, UVA, GPUDirect?

I am running Windows 7 x64 and using a GeForce GTX 690 card for a CUDA project. Because it is a dual-GPU card, deviceQuery recognizes it as two separate GPUs, each with 2 GB of global memory. I would like to share memory between the two GPUs so I can work with a larger memory space, as I am currently limited in the size of my cudaMalloc calls.

I have come across several topics regarding UVA (Unified Virtual Addressing), GPUDirect, and peer-to-peer memcpy. From what I have gathered, I need Windows 7 x64 and TCC mode for this. I have been unsuccessful in my attempts to switch to TCC mode (using nvidia-smi as described at: http://blogs.fau.de/johanneshabich/2010/12/10/windows-and-cuda-enabling-tcc-with-nvidia-smi/).

My question is: Has anyone using the GTX 690 been able to unify the memory space (using TCC or otherwise)? Or will cudaMalloc remain limited to less than 2 GB, with cudaSetDevice calls needed whenever I have to move memory between the GPUs?

Thank you.

Peer-to-peer memory copy will work fine between the GTX 690's dual GPUs without TCC mode. Use the cudaMemcpyPeer() call; there is an async version, cudaMemcpyPeerAsync(), as well. See section 3.2.6.5 of the CUDA C Programming Guide.
You don't need cudaSetDevice() calls to perform the copy itself.
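
To make that concrete, here is a minimal sketch of a peer copy between the two GPUs of the 690. It is not from any particular project: the 256 MB buffer size and the d0/d1 pointer names are placeholders, and cudaSetDevice() appears only to place the two allocations, not for the copy.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 256 * 1024 * 1024;   /* 256 MB, well under 2 GB per GPU */
    float *d0 = NULL, *d1 = NULL;

    cudaSetDevice(0);
    cudaMalloc((void **)&d0, bytes);          /* allocation lives on GPU 0 */

    cudaSetDevice(1);
    cudaMalloc((void **)&d1, bytes);          /* allocation lives on GPU 1 */

    /* Copy from device 0 to device 1. The runtime handles the routing;
       no cudaSetDevice() is needed for the copy itself. */
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    /* Asynchronous variant, tied to a stream:
       cudaMemcpyPeerAsync(d1, 1, d0, 0, bytes, stream); */

    cudaDeviceSynchronize();
    printf("peer copy: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}

If the platform supports it you can also call cudaDeviceEnablePeerAccess() so transfers (and direct loads/stores from kernels) bypass host memory; as far as I know, under WDDM without TCC the copy is staged through the host instead, which is slower but still correct.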

Unified virtual addressing will not work with your GTX 690 under Windows 7. It will work under Linux and under Windows XP64. Windows 7 won't work because UVA requires the TCC driver, which is Tesla-only.
See 3.2.7 in the Guide.

Finally, even with UVA you won't get a “merged memory” that lets you allocate all 4 GB in a single cudaMalloc. UVA means the two memories share one address space, not that they become a single contiguous heap; each allocation still has to fit on one GPU.

The very useful workaround for many of these “I need huge mallocs, or more memory than my GPU holds” problems is to use Linux and allocate large chunks of host memory. The devices can read and write that memory transparently, though at low bandwidth, since every access has to cross the PCIe bus. This used to be called “zero copy” memory, but the UVA concept has generalized the idea.
I’ve used this idea very effectively to give GPUs easy runtime access to a database over 14GB in size.
(Windows XP64 would work too.)
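
For what it's worth, here is a minimal sketch of that zero-copy pattern, assuming a 64-bit Linux box with enough host RAM and a build along the lines of nvcc -arch=sm_30. The 3 GB size, the sum kernel, and all variable names are made up for illustration; the point is simply that the kernel reads a mapped, pinned host buffer larger than the 2 GB on either GPU of the 690.

#include <cuda_runtime.h>
#include <stdio.h>

/* Grid-stride sum over a buffer that lives in mapped (pinned) host memory.
   Every load comes across the PCIe bus; atomicAdd on float needs sm_20+. */
__global__ void sum_kernel(const float *data, size_t n, float *result)
{
    float acc = 0.0f;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        acc += data[i];
    atomicAdd(result, acc);
}

int main(void)
{
    /* Ask for mapped (zero-copy) host allocations before the context is created. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t n = 768UL * 1024 * 1024;   /* 768M floats = 3 GB, more than one GPU holds */
    float *h_data, *h_result, *d_data, *d_result;

    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_result, sizeof(float), cudaHostAllocMapped);

    for (size_t i = 0; i < n; ++i)
        h_data[i] = 1.0f;
    *h_result = 0.0f;

    /* Device-visible pointers for the mapped buffers. With UVA these are the
       same addresses as the host pointers, but the call works either way. */
    cudaHostGetDevicePointer((void **)&d_data, h_data, 0);
    cudaHostGetDevicePointer((void **)&d_result, h_result, 0);

    sum_kernel<<<256, 256>>>(d_data, n, d_result);
    cudaDeviceSynchronize();

    printf("sum ~= %f (%s)\n", *h_result, cudaGetErrorString(cudaGetLastError()));

    cudaFreeHost(h_data);
    cudaFreeHost(h_result);
    return 0;
}

Every access goes over the PCIe bus, so bandwidth is an order of magnitude below device memory; the pattern pays off when the data is touched once or reused only lightly, as in the database case above.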