A few general questions...

mrlinky · October 12, 2009, 12:09am

Hi, I’m kind of new to CUDA. I’ve tried searching through the forums and Google for answers to these questions, so forgive me if they’re basic or already answered.

1. Is there a list somewhere showing which graphics cards support pinned memory, host memory mapping, and write-combined memory? If not, does anyone know whether the 260M supports them? I assume all of the non-mobile 260+ cards do?
2. From what I’ve gathered, the only way to communicate between multiple GPUs is via pinned memory? And a card like the GTX 295 is just the equivalent of two GPUs, i.e. they don’t share global memory, and any memory transfer between them has to go through the host?
3. Apparently statements like if(blockIdx …) are evaluated at compile time, but are statements like if(threadIdx.x …) also evaluated then, or at runtime? If it’s at runtime, which have people found to have more overhead: more blocks with fewer of these if-statements, or fewer blocks with more of them? I’ll probably end up testing both, but I’m curious whether there is a general rule…
4. Discounting block loading overhead, is there any benefit to using more than 32 threads per block? For example, do multiple global memory or texture memory accesses hide the latency of a single access?

Thanks

seibert · October 12, 2009, 3:08am

Pinned memory is supported by all CUDA devices, because it is really just a host-side thing. The CUDA library is just telling the virtual memory subsystem of the OS that it can’t physically relocate that chunk of memory while you are using it. This simplifies the method the driver has to use to DMA the memory to the GPU quite a bit.
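As a rough illustration of the host-side nature of pinned memory, here is a minimal sketch that allocates a page-locked buffer with cudaMallocHost() and round-trips it through the device. The helper name pinnedRoundTrip and the buffer contents are made up for this example; only the runtime calls are real.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: round-trips n floats host -> device -> host using a
// pinned (page-locked) host buffer. Returns true if every step succeeded
// and the data came back unchanged.
bool pinnedRoundTrip(size_t n)
{
    float *h_buf = nullptr, *d_buf = nullptr;
    const size_t bytes = n * sizeof(float);

    // Page-locked allocation: the OS will not relocate or swap these pages,
    // so the driver can DMA directly from them.
    if (cudaMallocHost((void**)&h_buf, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_buf, bytes) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return false;
    }

    for (size_t i = 0; i < n; ++i) h_buf[i] = (float)i;

    bool ok = cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    for (size_t i = 0; i < n; ++i) h_buf[i] = -1.0f;  // clobber the host copy
    ok = ok && cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (size_t i = 0; ok && i < n; ++i) ok = (h_buf[i] == (float)i);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);  // pinned memory must be released with cudaFreeHost()
    return ok;
}
```

The device-side code is the same as it would be with a malloc() buffer; only the host allocation and free calls change.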

As for host memory mapping and write-combining, I’m not sure that I’ve seen an official list. So far, all I know is that the GT200 desktop cards and the 9400M, which uses host memory for the graphics memory, support it. There may be other chips. Hopefully someone else can comment.
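In the absence of an official list, one option is to ask the runtime directly. The sketch below prints the canMapHostMemory and integrated fields of cudaDeviceProp for each device; the function name reportHostMappingSupport is invented for this example. As far as I know, write-combining is requested per allocation (the cudaHostAllocWriteCombined flag) rather than reported as a device property, so it is not queried here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints, for each CUDA device, whether it can map
// host memory into the device address space.
// Returns the number of devices inspected, or -1 on error.
int reportHostMappingSupport()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return -1;
        printf("device %d (%s): canMapHostMemory=%d, integrated=%d\n",
               dev, prop.name, prop.canMapHostMemory, prop.integrated);
    }
    return count;
}
```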

Correct, the only way to move data between cards (including each half of a GTX 295) is to bounce it to the host first. Whether or not you use pinned memory is a performance question. If you use pinned memory, you want to set the cudaHostAllocPortable flag on cudaHostAlloc() so that the allocation is treated as pinned by all CUDA contexts. Otherwise, you only get the speed benefit of pinned memory on one half of the transfer.
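A sketch of that host bounce with a portable pinned staging buffer might look like the following. The helper names allocPortablePinned and copyViaHost are hypothetical. It also assumes a CUDA 4.0 or later runtime, where a single host thread can switch devices with cudaSetDevice(); with the toolkits current when this thread was written, each GPU would need its own host thread instead.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: allocates a pinned host buffer that every CUDA
// context treats as pinned. Returns nullptr on failure; release the
// buffer with cudaFreeHost().
float* allocPortablePinned(size_t n)
{
    void* p = nullptr;
    if (cudaHostAlloc(&p, n * sizeof(float), cudaHostAllocPortable) != cudaSuccess)
        return nullptr;
    return static_cast<float*>(p);
}

// Hypothetical helper: moves n floats from d_src (on device srcDev) to
// d_dst (on device dstDev) by staging them in the host buffer h_stage.
// Assumption: one host thread may switch devices (CUDA 4.0+).
cudaError_t copyViaHost(float* d_dst, int dstDev,
                        const float* d_src, int srcDev,
                        float* h_stage, size_t n)
{
    const size_t bytes = n * sizeof(float);

    cudaError_t err = cudaSetDevice(srcDev);
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(h_stage, d_src, bytes, cudaMemcpyDeviceToHost);  // first half
    if (err != cudaSuccess) return err;

    err = cudaSetDevice(dstDev);
    if (err != cudaSuccess) return err;
    return cudaMemcpy(d_dst, h_stage, bytes, cudaMemcpyHostToDevice); // second half
}
```

Because the staging buffer is portable, both halves of the transfer see it as pinned, not just the half issued from the context that allocated it.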

I’m not sure what you mean about if(blockIdx…) being evaluated at compile time, since that quantity is not known at compile time. It is different for every block in your calculation.

Yes, much of the benefit from CUDA comes from massively oversubscribing the processors so that you can hide global memory access latency. Although there are a few kernels which run best with 32 threads, most perform best with 64 to 256 threads per block. When possible, design your code so you can easily benchmark different block sizes, which is the most straightforward way to figure out the best configuration.
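One way to make block size easy to benchmark is to pass it as a launch parameter and time each configuration with CUDA events. The sketch below does this for a made-up memory-bound kernel (saxpy); the kernel, the helper names, and the list of block sizes are all assumptions chosen for illustration, and a real benchmark would average several launches per size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical memory-bound kernel: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Times one launch of saxpy with the given block size.
// Returns elapsed milliseconds, or -1 if the launch failed.
float timeSaxpy(int n, int threadsPerBlock, const float* d_x, float* d_y)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    cudaEventRecord(start, 0);
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Sweeps a few block sizes, printing a timing for each.
// Returns the fastest block size, or 0 on failure.
int sweepBlockSizes(int n)
{
    float *d_x = nullptr, *d_y = nullptr;
    if (cudaMalloc((void**)&d_x, n * sizeof(float)) != cudaSuccess) return 0;
    if (cudaMalloc((void**)&d_y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_x);
        return 0;
    }
    cudaMemset(d_x, 0, n * sizeof(float));
    cudaMemset(d_y, 0, n * sizeof(float));

    timeSaxpy(n, 128, d_x, d_y);  // warm-up launch, result discarded

    const int sizes[] = {32, 64, 128, 256, 512};
    int best = 0;
    float bestMs = 0.0f;
    for (int k = 0; k < 5; ++k) {
        float ms = timeSaxpy(n, sizes[k], d_x, d_y);
        if (ms < 0.0f) continue;  // skip configurations that failed to launch
        printf("%3d threads/block: %.3f ms\n", sizes[k], ms);
        if (best == 0 || ms < bestMs) { best = sizes[k]; bestMs = ms; }
    }

    cudaFree(d_x);
    cudaFree(d_y);
    return best;
}
```

Calling sweepBlockSizes(1 << 20) prints one line per block size; the winner will vary by GPU and kernel, which is the reason to measure rather than guess.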

mrlinky · October 12, 2009, 5:04am

Thanks for the answers. As for the if(blockIdx…), I misremembered. I should have written if(blockSize…), as explained in http://oz.nthu.edu.tw/~d947207/NVIDIA/redu…n/reduction.pdf That makes more sense. Unfortunately, it means I’ll need to write more versions to test thoroughly =[
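For readers following along, the if(blockSize…) idea can be sketched as below: when the block size is a template parameter, each outer test is a compile-time constant and the compiler can drop the dead branches. This is a simplified per-block sum written for illustration, not the kernel from the linked slides; the names blockSum and launchBlockSum are invented, and it assumes a power-of-two block size between 32 and 512.

```cuda
#include <cuda_runtime.h>

// Sums blockSize consecutive elements per block into out[blockIdx.x].
// Assumption: blockSize is a power of two and matches the launch block size.
template <unsigned int blockSize>
__global__ void blockSum(const float* in, float* out, unsigned int n)
{
    __shared__ float sdata[blockSize];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockSize + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // The outer tests depend only on the template parameter, so they are
    // resolved at compile time; the inner tid tests remain runtime checks.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
    if (blockSize >=  64) { if (tid <  32) sdata[tid] += sdata[tid +  32]; __syncthreads(); }
    if (blockSize >=  32) { if (tid <  16) sdata[tid] += sdata[tid +  16]; __syncthreads(); }
    if (blockSize >=  16) { if (tid <   8) sdata[tid] += sdata[tid +   8]; __syncthreads(); }
    if (blockSize >=   8) { if (tid <   4) sdata[tid] += sdata[tid +   4]; __syncthreads(); }
    if (blockSize >=   4) { if (tid <   2) sdata[tid] += sdata[tid +   2]; __syncthreads(); }
    if (blockSize >=   2) { if (tid <   1) sdata[tid] += sdata[tid +   1]; __syncthreads(); }

    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Maps a runtime block size onto one of the compiled instantiations.
// d_out must hold one float per block.
cudaError_t launchBlockSum(const float* d_in, float* d_out,
                           unsigned int n, unsigned int threads)
{
    unsigned int blocks = (n + threads - 1) / threads;
    switch (threads) {
        case 512: blockSum<512><<<blocks, 512>>>(d_in, d_out, n); break;
        case 256: blockSum<256><<<blocks, 256>>>(d_in, d_out, n); break;
        case 128: blockSum<128><<<blocks, 128>>>(d_in, d_out, n); break;
        case  64: blockSum< 64><<<blocks,  64>>>(d_in, d_out, n); break;
        case  32: blockSum< 32><<<blocks,  32>>>(d_in, d_out, n); break;
        default:  return cudaErrorInvalidValue;  // unsupported block size
    }
    return cudaGetLastError();
}
```

The switch is what keeps the "more versions" manageable: the template generates one kernel per block size, and the host code can still choose among them at runtime when benchmarking.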
