Paging strategies for global memory: is any paging strategy on the way for CUDA?

I have been doing some graphics programming in CUDA that has to run on a variety of 8800 cards with either 512 or 768 MB of memory. The images, or 3D volumes, are rather big, around 512 MB apiece. This means they can be run on the 768 MB cards but not on the 512 MB cards.

Is there a reason why CUDA does not implement some sort of paging strategy for memory mapping from host memory to device memory? I know it can be expensive in terms of handling page faults, but the alternative means there has to be a separate program for each piece of hardware.

Any ideas for implementing a paging strategy, or perhaps other solutions to the problem?

/Martin

Current GPU hardware doesn’t support any kind of automatic paging from CPU to GPU memory, so CUDA can’t support this.

Couldn’t you subdivide your data set into sub-volumes and process them one at a time? You can even overlap transfers and computation using streams.
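
To make that concrete, here is a minimal sketch of the sub-volume idea, assuming the volume can be split along z into independent slabs and using two streams in ping-pong fashion so the copy of one slab can overlap the kernel working on the previous one (overlap requires a device with a copy engine, i.e. deviceOverlap; on a compute 1.0 8800 the code still works, just without overlap). The kernel name `processSubVolume` and the slab layout are placeholders for your actual processing.

```cpp
#include <cuda_runtime.h>

// Stand-in for the real per-voxel work on one slab.
__global__ void processSubVolume(float *slab, int nElems)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nElems)
        slab[i] *= 2.0f;
}

int main(void)
{
    const int dimX = 512, dimY = 512, dimZ = 512;   // ~512 MB of floats total
    const int slabZ = 32;                           // slab thickness (32 MB slabs)
    const size_t slabElems = (size_t)dimX * dimY * slabZ;
    const size_t slabBytes = slabElems * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *h_volume;
    cudaMallocHost((void **)&h_volume, (size_t)dimX * dimY * dimZ * sizeof(float));

    // Two device slab buffers and two streams, used in ping-pong fashion.
    float *d_slab[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void **)&d_slab[i], slabBytes);
        cudaStreamCreate(&stream[i]);
    }

    int threads = 256;
    int blocks  = (int)((slabElems + threads - 1) / threads);

    // Copy slab in, process it, copy it back -- all queued on alternating streams.
    for (int z = 0, s = 0; z < dimZ; z += slabZ, s ^= 1) {
        float *h_slab = h_volume + (size_t)z * dimX * dimY;
        cudaMemcpyAsync(d_slab[s], h_slab, slabBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        processSubVolume<<<blocks, threads, 0, stream[s]>>>(d_slab[s], (int)slabElems);
        cudaMemcpyAsync(h_slab, d_slab[s], slabBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaFree(d_slab[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFreeHost(h_volume);
    return 0;
}
```

Work queued on the same stream executes in order, so each kernel automatically waits for its own slab's copy, and reusing a slab buffer two iterations later is safe because the earlier work on that stream has already drained.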

One partial solution would be to get a card with more memory (for example Tesla C1060 has 4GB).

I was just thinking out loud. I have been wondering about this for some time, and I cannot help asking why the memory hierarchy, from the constant memory on the device down to host memory, is not in some way automated. I'm not sure, but isn't there some theorem stating that LRU is at most twice as bad as the optimal caching strategy? It would make the programming interface much easier to cope with for new programmers, and make applications a lot simpler to write.
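
For reference, the theorem I am half-remembering seems to be the competitive bound of Sleator and Tarjan (1985) for paging; in its usual textbook form (so, hedged, nothing GPU-specific here):

```latex
% LRU with a cache of k pages incurs at most k/(k-h+1) times the misses of the
% optimal offline policy OPT running with h <= k pages, up to an additive
% constant c that depends only on the initial cache contents.
\[
  \mathrm{LRU}_k(\sigma) \;\le\; \frac{k}{k-h+1}\,\mathrm{OPT}_h(\sigma) + c,
  \qquad\text{so with } k = 2h:\quad
  \mathrm{LRU}_{2h}(\sigma) \;\le\; \frac{2h}{h+1}\,\mathrm{OPT}_h(\sigma) + c
  \;\le\; 2\,\mathrm{OPT}_h(\sigma) + c .
\]
```

In other words, LRU given twice the cache of the optimal offline algorithm suffers at most about twice as many misses, which is the "at most twice as bad" factor.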

Just an idea. It just seems like the CPU people figured out how to cope with caches, so why not use that with GPUs?

/Martin

In my opinion, there are several reasons why such a cache hierarchy doesn't exist for global memory access on current GPUs. The foremost is that CUDA doesn't sell GPUs the way DirectX does, so any non-trivial addition to the chip must be justified in the context of NVIDIA's core business (games). Each of global, constant, and shared memory has a special meaning in a graphics environment, which is why they are there. Linking them up would take overhead that might not make as much sense for (today's) game programmers.

Secondly, I think the fact that these levels of memory are independently accessible and fast (shared memory takes 2 cycles, if I'm not wrong) makes a good case for high-performance computation, where it is often desirable to have explicit control over computational and memory resources for the best optimization of program execution. That said, there will often be situations where a program is strongly data-dependent, and a cache is definitely the best answer there. Fortunately, it is possible in CUDA to write a software cache that does the job without compromising performance: http://portal.acm.org/citation.cfm?id=1375572
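
To make the idea concrete, here is a minimal sketch of the simplest form of such a software cache: using shared memory as an explicitly managed tile so that repeated, overlapping reads hit on-chip memory instead of global memory. The kernel name, the tile size, and the 1D box blur it performs are just placeholders; the paper linked above builds a more general, tag-based scheme on the same principle.

```cpp
#define RADIUS  4
#define TILE    256   // must equal blockDim.x at launch

__global__ void blurKernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2 * RADIUS];   // the "cache": tile plus halo

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + RADIUS;

    // Stage the tile cooperatively: one global read per element instead of
    // (2*RADIUS + 1) global reads per thread.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;
        int right = gid + TILE;
        tile[lid - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[lid + TILE]   = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[lid + k];               // all reuse is served on-chip
        out[gid] = sum / (2 * RADIUS + 1);
    }
}

// Launch, e.g.: blurKernel<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
```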

Caches in CPUs are huge and usually occupy a large proportion of the chip area. One of the reasons CPU designers can afford that choice is that they need to support MS Word and the like. If NVIDIA, on top of its already huge dies, adds any cache with a non-trivial footprint, it will hurt the chip's performance, because it will probably not be able to include as many ALUs.

Furthermore, caching in parallel systems usually comes with the additional headache of a coherence protocol. That is, if warp X writes to address A in its local cache and warp Y subsequently reads the same address, the value provided must be the updated version. CPUs use complex mechanisms to keep their cache hierarchies coherent, and this costs both area and performance.
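
For illustration, here is a hedged sketch of the visibility problem a coherence protocol would otherwise solve in hardware. The block roles and names are hypothetical, and the pattern assumes both blocks are resident on the device at the same time: the producer orders its writes with __threadfence() and both sides use volatile, so the consumer never observes the flag without also observing the data.

```cpp
// One block "publishes" a payload and raises a flag; another block spins on
// the flag. Because there is no hardware coherence between multiprocessors,
// the ordering has to be enforced explicitly by the program.
__device__ volatile int g_data = 0;
__device__ volatile int g_flag = 0;

__global__ void producerConsumer(int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        g_data = 42;          // write the payload...
        __threadfence();      // ...make it visible device-wide...
        g_flag = 1;           // ...and only then raise the flag
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (g_flag == 0)   // spin: assumes both blocks run concurrently
            ;
        __threadfence();      // order the flag read before the data read
        *out = g_data;        // guaranteed to see 42
    }
}

// Launch, e.g.: producerConsumer<<<2, 32>>>(d_out);
```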

However, whether caches make any sense for massively parallel computers remains to be seen (but we won't have to wait much longer).