You could possibly use streams to speed up your transfer. That entails pinning host memory and doing asynchronous memory copies. Look in the programming guide; the example there is pretty poor (the SDK samples may be better, but they're not great either).
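A rough sketch of the idea (the `scale` kernel, the buffer size, and the stream count are all made up for illustration): the host buffer is allocated with `cudaMallocHost` so it's pinned, and the work is split across streams so copies in one stream can overlap kernels in another.

```cuda
#include <cuda_runtime.h>

#define N (1 << 20)
#define NSTREAMS 4

// Placeholder kernel, just so there's something to overlap with the copies.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main(void) {
    float *h, *d;
    // Host memory must be pinned (page-locked) or cudaMemcpyAsync
    // silently falls back to a synchronous copy.
    cudaMallocHost(&h, N * sizeof(float));
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    // Each stream gets its own chunk: copy in, compute, copy out.
    int chunk = N / NSTREAMS;
    for (int s = 0; s < NSTREAMS; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

The overlap only actually happens on cards with a copy engine (check `deviceOverlap` in the device properties).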
OR maybe you can use some of that good ol' OpenGL interoperability to display your numbers directly onto the screen :)
Do you know the performance difference between regular page-locked memory (cudaMallocHost) and memory allocated with cudaHostAlloc?
As long as your memory is pinned, cudaMemcpy will use the card's DMA engine for the transfer (no matter whether you allocated it with cudaMallocHost or cudaHostAlloc). The latter API, I think, was introduced for zero-copy ==> kernels accessing host memory directly. But I don't think that's what we're talking about here…
Yes, cudaHostAlloc() with the cudaHostAllocMapped flag gives you zero-copy memory, which is something different again: the GPU can directly read and write PC memory over the PCI-e bus. It is handy in some situations (like pushing small amounts of data back from a reduction kernel, for example), but not really in the same vein as pinned versus pageable memory.
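For the reduction-result case, the zero-copy path looks roughly like this (a sketch; the trivial one-thread "reduction" is just a stand-in for a real kernel, and `n` is arbitrary):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Toy stand-in for the last stage of a reduction: one thread writes the
// single result straight into mapped host memory over PCIe (zero copy).
__global__ void reduceResult(const float *in, int n, float *out) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += in[i];
    *out = s;  // lands directly in host memory, no cudaMemcpy needed
}

int main(void) {
    // Must be set before any CUDA allocation to enable mapped memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1024;
    float *d_in;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));  // fill with real data in practice

    // Allocate mapped (zero-copy) host memory and get its device alias.
    float *h_out, *d_out;
    cudaHostAlloc(&h_out, sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_out, h_out, 0);

    reduceResult<<<1, 1>>>(d_in, n, d_out);
    cudaDeviceSynchronize();  // make sure the GPU's write is visible

    printf("result = %f\n", *h_out);  // read the result, no copy-back step

    cudaFreeHost(h_out);
    cudaFree(d_in);
    return 0;
}
```

The win here is skipping the separate cudaMemcpy for a 4-byte result; for anything large, staged transfers through pinned memory are usually faster.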