Small, fast device-to-host memory transfer

I am working with a Tesla C2050 on high-speed Markov process simulations. One potential application of the project involves a voltage output adjusted in real time based on the state of the Markov process. From my preliminary tests, however, it looks like the time required for even a small memory transfer rules this application out.

At every simulation time-step I need to transfer about 100 floating-point values from the GPU to the CPU (device to host); alternatively, I might be able to get away with transferring a single floating-point value. I am aiming for time-steps of approximately 20 us. My testing shows that transferring up to 10,000 floating-point values (40 KB) takes more than 100 us, and this figure changes very little as the transfer size varies from 1 KB to 40 KB, suggesting that the majority of the time is spent setting up the DMA transfer. I also tried the transfer with pinned host memory and saw no observable performance gain.

Do these memory transfer times look typical? If so, is there a faster way to transfer a small amount of data from device to host? Finally, is there a separate card or device (I noticed the Quadro SDI Capture cards) for transferring data directly from GPU memory to another PCI-E device for output?
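For reference, here is a minimal sketch of the kind of per-time-step transfer timing I am describing (not my exact benchmark; the buffer size, iteration count, and the use of cudaHostAlloc plus cudaEvent timing are just one way to set it up):

// Sketch of a small device-to-host transfer timing test
// (sizes, iteration count, and pinned-memory usage are illustrative).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t numFloats = 100;                  // ~100 floats per time-step
    const size_t bytes     = numFloats * sizeof(float);
    const int    iters     = 1000;

    float *d_data = NULL;
    float *h_data = NULL;
    cudaMalloc((void**)&d_data, bytes);
    cudaHostAlloc((void**)&h_data, bytes, cudaHostAllocDefault);  // pinned host buffer

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time many small device-to-host copies and report the average per copy.
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average D2H copy: %f us for %zu bytes\n", (ms * 1000.0f) / iters, bytes);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}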
Thanks for your help
Matt Bakalar