Hi, I’m kind of new to CUDA and I’ve tried searching the forums and Google for these questions, so forgive me if they’re basic or already answered.
Is there somewhere that shows which graphics cards offer pinned memory, host memory mapping, and write-combined memory? If not, does anyone know if the GTX 260M offers these options? I assume any of the non-mobile GTX 260+ cards do?
From what I’ve gathered, the only way to communicate between multiple GPUs is via pinned memory? And a card like the GTX 295 is just the equivalent of two GPUs, i.e. they don’t share global memory, and any memory transfer between them has to go through the host?
Apparently statements like if(blockIdx …) are evaluated at compile time, but are statements like if(threadIdx.x …) also evaluated then, or at runtime? If it’s at runtime, what have people found to have more overhead: more blocks with fewer of these if-statements, or fewer blocks with more of them? I’ll probably end up testing both, but I’m curious whether there’s a general rule…
Discounting block-loading overhead, is there any benefit to using more than 32 threads in one block? For example, can multiple in-flight global memory or texture memory accesses hide each other’s latency?
Pinned memory is supported by all CUDA devices, because it is really a host-side thing. The CUDA library just tells the OS’s virtual memory subsystem that it can’t physically relocate (or page out) that chunk of memory while you are using it, which considerably simplifies the DMA path the driver uses to move the data to the GPU.
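For reference, allocating and using pinned memory with the runtime API looks roughly like this (a minimal sketch; the buffer size is arbitrary):

```
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;  // 1 MB, just as an example

    // Allocate page-locked (pinned) host memory instead of malloc().
    float* h_buf = NULL;
    cudaMallocHost((void**)&h_buf, bytes);

    float* d_buf = NULL;
    cudaMalloc((void**)&d_buf, bytes);

    // Copies from pinned memory can be DMA'd directly, so they are
    // faster than copies from ordinary pageable memory.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);  // pinned memory must be freed with cudaFreeHost()
    return 0;
}
```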
As for host memory mapping and write-combining, I’m not sure I’ve seen an official list. So far, all I know is that the GT200 desktop cards and the 9400M, which uses host memory as its graphics memory, support it. There may be other chips; hopefully someone else can comment.
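One option is to skip the list and just query the devices at runtime; a quick sketch using the runtime API:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // canMapHostMemory is nonzero if the device can map page-locked
        // host memory into its address space (i.e. host memory mapping).
        printf("Device %d (%s): canMapHostMemory = %d\n",
               dev, prop.name, prop.canMapHostMemory);
    }
    return 0;
}
```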
Correct, the only way to move data between cards (including the two halves of a GTX 295) is to bounce it through the host. Whether or not you use pinned memory is a performance question. If you do, you want to pass the cudaHostAllocPortable flag to cudaHostAlloc() so the allocation is marked as pinned for all CUDA contexts; otherwise you only get the speed benefit of pinned memory on one half of the transfer.
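Roughly the allocation side of that (a sketch only; the actual bounce still needs a copy issued from each GPU’s context, which currently means one host thread per GPU):

```
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    // cudaHostAllocPortable marks the buffer as pinned for ALL CUDA
    // contexts, not just the one that allocated it, so both halves of
    // a GPU-to-host-to-GPU bounce get the pinned-memory fast path.
    float* h_staging = NULL;
    cudaHostAlloc((void**)&h_staging, bytes, cudaHostAllocPortable);

    // ... cudaMemcpy device A -> h_staging from A's context, then
    //     h_staging -> device B from B's context ...

    cudaFreeHost(h_staging);
    return 0;
}
```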
I’m not sure what you mean about if(blockIdx…) being evaluated at compile time, since that quantity is not known at compile time. It is different for every block in your calculation.
Yes, much of the benefit from CUDA comes from massively oversubscribing the processors so that you can hide global memory access latency. Although there are a few kernels which run best with 32 threads, most perform best with 64 to 256 threads per block. When possible, design your code so you can easily benchmark different block sizes, which is the most straightforward way to figure out the best configuration.
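For example, a sketch of that kind of benchmark, timing with events (myKernel is just a placeholder; substitute your own kernel and problem size):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel so the sweep below has something to launch.
__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    const int sizes[] = {32, 64, 128, 192, 256};
    for (int s = 0; s < 5; ++s) {
        int block = sizes[s];
        int grid = (n + block - 1) / block;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        myKernel<<<grid, block>>>(d_data, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %3d: %.3f ms\n", block, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

    cudaFree(d_data);
    return 0;
}
```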
Thanks for the answers. As for the if(blockIdx…), I misremembered; I should have written if(blockSize…), as explained in http://oz.nthu.edu.tw/~d947207/NVIDIA/redu…n/reduction.pdf That makes more sense. Unfortunately it means I’ll need to write more versions to thoroughly test =[
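For anyone else who finds this thread: the trick in those slides is that blockSize is a template parameter, so the if(blockSize >= …) branches are resolved at compile time and the dead ones are eliminated; a rough sketch of that pattern:

```
// blockSize is known at compile time, so each "if (blockSize >= N)"
// costs nothing at runtime -- but you need one compiled kernel per
// block size, hence the extra versions to test.
template <unsigned int blockSize>
__global__ void reduce(float* g_idata, float* g_odata) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    sdata[tid] = g_idata[i] + g_idata[i + blockSize];
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    if (tid < 32) {
        // Final warp: the slides unroll this with no __syncthreads(),
        // relying on warp-synchronous execution (volatile prevents
        // the compiler from caching the shared-memory reads).
        volatile float* smem = sdata;
        if (blockSize >= 64) smem[tid] += smem[tid + 32];
        if (blockSize >= 32) smem[tid] += smem[tid + 16];
        if (blockSize >= 16) smem[tid] += smem[tid +  8];
        if (blockSize >=  8) smem[tid] += smem[tid +  4];
        if (blockSize >=  4) smem[tid] += smem[tid +  2];
        if (blockSize >=  2) smem[tid] += smem[tid +  1];
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// At launch time you dispatch on the runtime block size, e.g.:
//   switch (threads) {
//     case 256: reduce<256><<<grid, 256, smemBytes>>>(d_in, d_out); break;
//     case 128: reduce<128><<<grid, 128, smemBytes>>>(d_in, d_out); break;
//     // ...one case per block size you want to benchmark
//   }
```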