Using DMA memory transfers

I was under the impression that CUDA uses DMA transfers to copy data from host main memory over the PCI Express bus to the device. If this is the case, is it possible to use the device for computation and transfer data at the same time?

In particular, I want to use two host threads, one controlling the computations and the other transferring data to the device. Can this be implemented, given that page 37 of the programming guide says that only one thread can access a particular context?

CUDA does use DMA to transfer data between the host and the GPU, but these transfers are not currently asynchronous (i.e. computation and transfer cannot happen at the same time).

This may be supported in a future release.
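
To make the serialized behavior concrete, here is a minimal sketch of streaming data through the GPU one chunk at a time using only blocking cudaMemcpy calls. The kernel scale_chunk, the function process_serial, and the block size of 256 are made up for illustration and are not from the programming guide. Each chunk is copied in, processed, and copied back before the next one starts, so transfer and computation never overlap.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical kernel for illustration: doubles every element of one chunk.
__global__ void scale_chunk(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Streams `total` floats through the GPU, `chunk` elements at a time.
// Every step is serialized: copy in, compute, copy out.
cudaError_t process_serial(const float *in, float *out, int total, int chunk)
{
    float *d_buf = NULL;
    cudaError_t err = cudaMalloc((void **)&d_buf, chunk * sizeof(float));
    if (err != cudaSuccess) return err;

    for (int off = 0; off < total; off += chunk) {
        int n = std::min(chunk, total - off);

        // Blocking host-to-device copy of the current chunk.
        err = cudaMemcpy(d_buf, in + off, n * sizeof(float),
                         cudaMemcpyHostToDevice);
        if (err != cudaSuccess) break;

        scale_chunk<<<(n + 255) / 256, 256>>>(d_buf, n);

        // Blocking device-to-host copy; it also waits for the kernel above.
        err = cudaMemcpy(out + off, d_buf, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) break;
    }

    cudaFree(d_buf);
    return err;
}
```

With this pattern the total time is roughly the sum of all copy times and all kernel times, which is the cost of not being able to overlap the two.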

I was wondering how one could use the streaming programming model with CUDA. My application needs to stream a database through the GPU, with the compute units processing this data and producing output only at sparse intervals.

Is it possible to run two host threads, one copying memory chunk A from host to device with cudaMemcpy and the other executing the kernel on the device (on memory chunk B)?
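
For reference, later CUDA releases added a stream API (cudaStreamCreate, cudaMemcpyAsync) that expresses this kind of double buffering from a single host thread, which sidesteps the one-thread-per-context question. The following is only a sketch under those assumptions: the kernel scale_chunk_async and the function process_overlapped are hypothetical names, the host buffers are assumed to be page-locked (allocated with cudaMallocHost), and whether copies and kernels actually overlap depends on the device.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical kernel for illustration: doubles every element of one chunk.
__global__ void scale_chunk_async(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Double-buffered version: chunks alternate between two streams, so the copy
// for one chunk may overlap the kernel for the other on capable devices.
// h_in and h_out are assumed to be page-locked host memory.
cudaError_t process_overlapped(const float *h_in, float *h_out,
                               int total, int chunk)
{
    cudaStream_t stream[2] = {0, 0};
    float *d_buf[2] = {NULL, NULL};
    cudaError_t err = cudaSuccess;

    for (int s = 0; s < 2 && err == cudaSuccess; ++s) {
        err = cudaStreamCreate(&stream[s]);
        if (err == cudaSuccess)
            err = cudaMalloc((void **)&d_buf[s], chunk * sizeof(float));
    }

    int s = 0;
    for (int off = 0; off < total && err == cudaSuccess; off += chunk, s ^= 1) {
        int n = std::min(chunk, total - off);
        size_t bytes = n * sizeof(float);

        // Work queued in one stream runs in order, so reusing d_buf[s] is
        // safe: the earlier copy-out finishes before this copy-in starts.
        err = cudaMemcpyAsync(d_buf[s], h_in + off, bytes,
                              cudaMemcpyHostToDevice, stream[s]);
        if (err != cudaSuccess) break;
        scale_chunk_async<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
        err = cudaMemcpyAsync(h_out + off, d_buf[s], bytes,
                              cudaMemcpyDeviceToHost, stream[s]);
    }

    // Wait for all queued work before the host reads h_out.
    cudaError_t sync_err = cudaDeviceSynchronize();
    if (err == cudaSuccess) err = sync_err;

    for (int k = 0; k < 2; ++k) {
        if (d_buf[k]) cudaFree(d_buf[k]);
        if (stream[k]) cudaStreamDestroy(stream[k]);
    }
    return err;
}
```

The two streams play the role of the two host threads in the question: while one stream is computing on chunk B, the other can be copying chunk A.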