CUDA 6.5 Unified Memory (cudaMallocManaged)

My question is related to CUDA programming and the implementation of Unified Memory.
I'm using a single NVIDIA Quadro K4000 GPU on a 64-bit Linux system (openSUSE 13.1). I installed the required graphics driver and the CUDA 6.5 Toolkit. Everything works well!
In a project I am working with a particle simulation in DualSPHysics (open source). The GPU code I am trying to modify is written in C++ and CUDA 4.0 (it uses cudaMalloc instead of cudaMallocManaged). Currently the simulations are at the limit of our 3 GB of GPU RAM. Now we want to use the full power of our workstation, so we tried to add Unified Memory support to the simulation tool in order to access the additional 64 GB of system RAM.
I have already compiled and run the "Unified Memory Streams" sample successfully, with no errors or complications.
To modify the simulation code, I replaced every single cudaMalloc with cudaMallocManaged. I know that there are more things to consider (e.g. removing the cudaMemcpy calls), but as far as I can see that is not the main problem.
The code compiles successfully, without complications. When I simulate a case that requires less than the maximum available GPU RAM (3 GB), everything works well. When I simulate a case that requires more than that (i.e. the simulation needs to fall back on Unified Memory), there is no success and the required RAM cannot be allocated. I assume there is a problem with accessing Unified Memory via cudaMallocManaged...!?

Do you have an idea how I can solve this problem?
Is there a tool or something similar with which I can test my Unified Memory setup?
Could this be a hardware restriction?
How much effort would it take to modify the tool successfully?

Please excuse my poor description; programming is not my profession.

Regards from Germany

Unified Memory does not allow you to exceed the device memory (RAM) that is physically present on your GPU. The UM documentation states that the primary purpose of UM is to eliminate the need for explicit cudaMemcpy operations, but data must still be migrated to the processor (host or device) that is using it. This migration means the transfer must still occur, and therefore the device memory must still be large enough to hold the data.
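You can confirm this behavior yourself with a small test. The sketch below (my own illustration, not DualSPHysics code) deliberately asks cudaMallocManaged for more memory than the device has; on a CUDA 6.5 / Kepler system such as your K4000, the call is expected to fail with an allocation error, just as cudaMalloc would:

```cuda
// Minimal sketch: managed allocations on CUDA 6.5 are still backed by
// device memory, so requesting more than the GPU physically has fails.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    printf("Device memory: %zu MB free of %zu MB total\n",
           freeMem >> 20, totalMem >> 20);

    // Deliberately request more than the device total (total + 1 GB).
    float *data = NULL;
    size_t bytes = totalMem + (1ULL << 30);
    cudaError_t err = cudaMallocManaged(&data, bytes);
    if (err != cudaSuccess) {
        // Expected on CUDA 6.5: a managed allocation cannot exceed device RAM.
        printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(data);
    return 0;
}
```

If this fails on your system while smaller allocations succeed, your Unified Memory setup is working as designed and the limit is the 3 GB of device memory.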

Based on your description, there is nothing wrong with your system or Unified Memory setup. The behavior you are experiencing is expected.

In general, GPU programs that need to operate on data sets larger than the GPU's physical RAM use other techniques:

  1. Pipelined access. Break the data into pieces and move each piece to the GPU when the GPU needs to operate on it. This is typically used when the GPU needs to make high-volume accesses to the data.

  2. Zero Copy. Place the data in a zero-copy region (i.e. host memory allocated with cudaHostAlloc, or similar) to give the GPU direct access to it. This has significant bandwidth restrictions, and so is only recommended when the GPU needs occasional or limited access to the data, not for high-volume access.
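Technique 1 can be sketched as follows. This is illustrative only (the kernel, function names, and chunk size are hypothetical, not DualSPHysics code): a single device buffer the size of one chunk is reused while the host array is processed piece by piece.

```cuda
// Sketch of pipelined/chunked access: process a host array larger than
// device memory by staging it through one device-resident chunk buffer.
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-particle computation.
__global__ void process(float *d, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void processLargeArray(float *hostData, size_t totalElems, size_t chunkElems)
{
    float *dChunk = NULL;
    cudaMalloc(&dChunk, chunkElems * sizeof(float));

    for (size_t off = 0; off < totalElems; off += chunkElems) {
        size_t n = (totalElems - off < chunkElems) ? (totalElems - off)
                                                   : chunkElems;
        cudaMemcpy(dChunk, hostData + off, n * sizeof(float),
                   cudaMemcpyHostToDevice);
        process<<<(unsigned)((n + 255) / 256), 256>>>(dChunk, n);
        cudaMemcpy(hostData + off, dChunk, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dChunk);
}
```

A refinement is to use two buffers and two CUDA streams so that the copy of the next chunk overlaps with the kernel working on the current one. Whether chunking is feasible for an SPH simulation depends on how much neighbor data each particle needs, which is why this modification may require significant effort.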
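Technique 2 looks like the following sketch (again illustrative, with hypothetical names). The buffer lives in pinned host memory, and the GPU accesses it directly over PCIe via a mapped device pointer:

```cuda
// Sketch of zero-copy access: pinned, mapped host memory that kernels
// can read and write directly. Every access crosses PCIe, so bandwidth
// is far below device-memory bandwidth.
#include <cuda_runtime.h>

int main()
{
    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const size_t n = 1 << 20;
    float *hData = NULL, *dAlias = NULL;
    cudaHostAlloc(&hData, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dAlias, hData, 0);

    // dAlias can now be passed to kernels in place of a cudaMalloc pointer:
    // kernel<<<blocks, threads>>>(dAlias, n);

    cudaFreeHost(hData);
    return 0;
}
```

For a particle simulation that touches all of its data every time step, zero-copy alone is unlikely to give acceptable performance; it is better suited to large lookup data that is accessed sparsely.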