Paging strategies for global memory: is any paging strategy on the way for CUDA?

I have been doing some graphics programming in CUDA that has to run on a variety of 8800 cards with either 512 or 768 MB of memory. The images, or 3D volumes, are rather big, around 512 MB apiece. This means they can be run on the 768 MB cards but not on the 512 MB cards.

Is there a reason why CUDA does not implement some sort of paging strategy for memory mapping from host memory to device memory? I know it can be expensive in terms of handling page faults, but the alternative means there has to be a separate program for each piece of hardware.

Any ideas for implementing a paging strategy, or perhaps other solutions to the problem?

/Martin

Current GPU hardware doesn’t support any kind of automatic paging from CPU to GPU memory, so CUDA can’t support this.

Couldn’t you subdivide your data set into sub-volumes and process them one at a time? You can even overlap transfers and computation using streams.
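
To make that concrete, here is a minimal sketch of the sub-volume idea, assuming the volume can be split along z into independent slabs and using two streams in ping-pong fashion so the copy of one slab can overlap the kernel working on the previous one (overlap requires a device with a copy engine, i.e. deviceOverlap; on a compute 1.0 8800 the code still works, just without overlap). The kernel name `processSubVolume` and the slab layout are placeholders for your actual processing.

```cpp
#include <cuda_runtime.h>

// Stand-in for the real per-voxel work on one slab.
__global__ void processSubVolume(float *slab, int nElems)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nElems)
        slab[i] *= 2.0f;
}

int main(void)
{
    const int dimX = 512, dimY = 512, dimZ = 512;   // ~512 MB of floats total
    const int slabZ = 32;                           // slab thickness (32 MB slabs)
    const size_t slabElems = (size_t)dimX * dimY * slabZ;
    const size_t slabBytes = slabElems * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *h_volume;
    cudaMallocHost((void **)&h_volume, (size_t)dimX * dimY * dimZ * sizeof(float));

    // Two device slab buffers and two streams, used in ping-pong fashion.
    float *d_slab[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void **)&d_slab[i], slabBytes);
        cudaStreamCreate(&stream[i]);
    }

    int threads = 256;
    int blocks  = (int)((slabElems + threads - 1) / threads);

    // Copy slab in, process it, copy it back -- all queued on alternating streams.
    for (int z = 0, s = 0; z < dimZ; z += slabZ, s ^= 1) {
        float *h_slab = h_volume + (size_t)z * dimX * dimY;
        cudaMemcpyAsync(d_slab[s], h_slab, slabBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        processSubVolume<<<blocks, threads, 0, stream[s]>>>(d_slab[s], (int)slabElems);
        cudaMemcpyAsync(h_slab, d_slab[s], slabBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaFree(d_slab[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFreeHost(h_volume);
    return 0;
}
```

Work queued on the same stream executes in order, so each kernel automatically waits for its own slab's copy, and reusing a slab buffer two iterations later is safe because the earlier work on that stream has already drained.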

One partial solution would be to get a card with more memory (for example Tesla C1060 has 4GB).

I was just thinking out loud. I have been wondering about this for some time, and I cannot help asking why the memory hierarchy, from the constant memory on the device down to host memory, is not in some way automated. I'm not sure, but isn't there some theorem stating that LRU is at most twice as bad as the optimal caching strategy? It would make the programming interface much easier to cope with for new programmers, and make applications a lot simpler to write.
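
For reference, the theorem I am half-remembering seems to be the competitive bound of Sleator and Tarjan (1985) for paging; in its usual textbook form (so, hedged, nothing GPU-specific here):

```latex
% LRU with a cache of k pages incurs at most k/(k-h+1) times the misses of the
% optimal offline policy OPT running with h <= k pages, up to an additive
% constant c that depends only on the initial cache contents.
\[
  \mathrm{LRU}_k(\sigma) \;\le\; \frac{k}{k-h+1}\,\mathrm{OPT}_h(\sigma) + c,
  \qquad\text{so with } k = 2h:\quad
  \mathrm{LRU}_{2h}(\sigma) \;\le\; \frac{2h}{h+1}\,\mathrm{OPT}_h(\sigma) + c
  \;\le\; 2\,\mathrm{OPT}_h(\sigma) + c .
\]
```

In other words, LRU given twice the cache of the optimal offline algorithm suffers at most about twice as many misses, which is the "at most twice as bad" factor.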

Just an idea. It just seems like the CPU people figured out how to cope with caches, so why not use that with GPUs?

/Martin

In my opinion, there are several reasons why such a cache hierarchy doesn't exist for global memory access on current GPUs. The foremost is that CUDA doesn't sell GPUs the way DirectX does, so any non-trivial addition to the chip must be justified in the context of NVIDIA's core business (games). Each of global, constant, and shared memory has a special meaning in a graphics environment, which is why they are there. Linking them up would take overhead that might not make as much sense for (today's) game programmers.

Secondly, I think the fact that these levels of memory are independently accessible and fast (shared memory takes 2 cycles, if I'm not wrong) makes a good case for high-performance computation, where it is often desirable to have explicit control over computational and memory resources for the best optimization of program execution. That said, there will often be situations where a program is strongly data-dependent, and a cache is definitely the best answer there. Fortunately, it is possible in CUDA to write a software cache that does the job without compromising performance: http://portal.acm.org/citation.cfm?id=1375572
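
To make the idea concrete, here is a minimal sketch of the simplest form of such a software cache: using shared memory as an explicitly managed tile so that repeated, overlapping reads hit on-chip memory instead of global memory. The kernel name, the tile size, and the 1D box blur it performs are just placeholders; the paper linked above builds a more general, tag-based scheme on the same principle.

```cpp
#define RADIUS  4
#define TILE    256   // must equal blockDim.x at launch

__global__ void blurKernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2 * RADIUS];   // the "cache": tile plus halo

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + RADIUS;

    // Stage the tile cooperatively: one global read per element instead of
    // (2*RADIUS + 1) global reads per thread.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;
        int right = gid + TILE;
        tile[lid - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[lid + TILE]   = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[lid + k];               // all reuse is served on-chip
        out[gid] = sum / (2 * RADIUS + 1);
    }
}

// Launch, e.g.: blurKernel<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
```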

Caches in CPUs are huge and usually occupy a large proportion of the chip area. One of the reasons CPU designers can afford that choice is that they need to support MS Word and the like. If NVIDIA, on top of its already huge dies, adds any cache with a non-trivial footprint, it will hurt the chip's performance, because it will probably not be able to include as many ALUs.

Furthermore, caching in parallel systems usually comes with the additional headache of a coherence protocol. That is, if warp X writes to address A in its local cache and warp Y subsequently reads the same address, the value provided must be the updated version. CPUs use complex mechanisms to keep their cache hierarchies coherent, and this costs both area and performance.
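
For illustration, here is a hedged sketch of the visibility problem a coherence protocol would otherwise solve in hardware. The block roles and names are hypothetical, and the pattern assumes both blocks are resident on the device at the same time: the producer orders its writes with __threadfence() and both sides use volatile, so the consumer never observes the flag without also observing the data.

```cpp
// One block "publishes" a payload and raises a flag; another block spins on
// the flag. Because there is no hardware coherence between multiprocessors,
// the ordering has to be enforced explicitly by the program.
__device__ volatile int g_data = 0;
__device__ volatile int g_flag = 0;

__global__ void producerConsumer(int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        g_data = 42;          // write the payload...
        __threadfence();      // ...make it visible device-wide...
        g_flag = 1;           // ...and only then raise the flag
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (g_flag == 0)   // spin: assumes both blocks run concurrently
            ;
        __threadfence();      // order the flag read before the data read
        *out = g_data;        // guaranteed to see 42
    }
}

// Launch, e.g.: producerConsumer<<<2, 32>>>(d_out);
```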

However, whether caches make any sense for massively parallel computers remains to be seen (but we won't have to wait much longer).