How to display directly from the GPU without transferring data back to the CPU

Hello, I'm using a 9800GT and a 9500GT as my GPUs, with Windows as my OS. I read that, starting with CUDA 4.0, data can be transferred from
GPU to GPU, so I was wondering if I could do the math on one GPU and display on the other. The problem is that I'm using the CPUbitmap function provided by
NVIDIA, and it seems that this function is the bottleneck of my program. Is there a way to display directly without transferring data back to the CPU? Or is there a better (faster) way of displaying on the monitor than CPUbitmap? Thank you.

It should be possible to:
a.) Direct-copy data from the compute GPU to the display GPU (as CUDA linear memory)
b.) Copy the linear data into an OpenGL texture mapped as a CUDA array
c.) Render the OpenGL texture
(Never tried it myself)

Or, easier to implement:
a.) Write data from compute GPU to host memory (mapped host memory or memcpy)
b.) upload from host memory to display GPU as texture
c.) render texture
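
For step a.) of the easier route, here is a minimal sketch of the mapped host memory variant. The kernel name `computeImage`, the 512x512 size and the 8-bit grayscale layout are just assumptions for illustration; the kernel writes straight into pinned host memory, so no explicit memcpy back is needed before the upload to the display GPU.

```cuda
// Sketch only: the compute GPU writes its result into mapped (zero-copy)
// host memory. Kernel name, image size and pixel format are assumptions.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void computeImage(unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (unsigned char)(i % 256);  // placeholder for the real math
}

int main()
{
    const int n = 512 * 512;  // assumed image size, one byte per pixel

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory) {
        printf("Device 0 is missing or cannot map host memory\n");
        return 0;
    }

    // Must be set before the CUDA context for this device is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned host allocation that the GPU can address directly.
    unsigned char* hostPixels = NULL;
    if (cudaHostAlloc((void**)&hostPixels, n, cudaHostAllocMapped) != cudaSuccess) {
        printf("cudaHostAlloc failed\n");
        return 1;
    }

    // Device-side alias of the same memory.
    unsigned char* devView = NULL;
    cudaHostGetDevicePointer((void**)&devView, hostPixels, 0);

    computeImage<<<(n + 255) / 256, 256>>>(devView, n);
    cudaDeviceSynchronize();  // results are visible on the host after this

    // hostPixels can now be handed to the display side, e.g. as the
    // data pointer of glTexSubImage2D (steps b and c).
    printf("first pixels: %d %d %d\n", hostPixels[0], hostPixels[1], hostPixels[2]);

    cudaFreeHost(hostPixels);
    return 0;
}
```

Every write from the kernel crosses the PCIe bus in this setup, so it tends to suit kernels that write each output pixel once.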

See the simpleGL SDK example for the OpenGL + CUDA part.

There was an NVIDIA presentation about CUDA <–> OpenGL interoperability a while ago; you can see the slides here. Not sure if this is what you’re looking for, but it might be of interest.

My program works on 2D images. I'm not sure whether OpenGL is a good fit for it, since it is optimized for 3D, right? Does OpenGL work well with
2D images as well? Or are there better options?

OpenGL works fine with 2D images, although it can be a bit unwieldy. Overall, when coupled with CUDA, either OpenGL or DirectX (if you’re only going to be on Windows) is the way to go. Taking advantage of the CUDA ↔ OpenGL/DirectX interoperability allows you to display things from your device memory without copying them back to host memory.
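
To make the interop idea concrete, here is a rough sketch along the lines of what the SDK samples do: a kernel fills an OpenGL pixel buffer object (PBO) that is mapped into CUDA, and the PBO is then drawn as a 2D texture on a full-window quad. It assumes freeglut and GLEW are available and that the CUDA device is the one driving the display; the kernel `fillGradient` and the 512x512 size are made up for the example.

```cuda
// Sketch only: CUDA writes pixels into an OpenGL PBO, OpenGL draws them as a
// 2D image, and the pixels never go back to the host.
// Assumed build line on Linux: nvcc interop.cu -lglut -lGLEW -lGL
#include <GL/glew.h>
#include <GL/freeglut.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cstdio>

const int W = 512, H = 512;            // assumed image size
GLuint pbo = 0, tex = 0;
cudaGraphicsResource* pboRes = NULL;   // CUDA handle for the PBO

__global__ void fillGradient(uchar4* pixels, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)                // placeholder for the real 2D math
        pixels[y * w + x] = make_uchar4(x * 255 / w, y * 255 / h, 128, 255);
}

void display()
{
    // Map the PBO so the kernel can write into it as ordinary device memory.
    uchar4* devPixels = NULL;
    size_t bytes = 0;
    cudaGraphicsMapResources(1, &pboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&devPixels, &bytes, pboRes);
    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    fillGradient<<<grid, block>>>(devPixels, W, H);
    cudaGraphicsUnmapResources(1, &pboRes, 0);  // hand the buffer back to OpenGL

    // PBO -> texture copy; with a PBO bound, the last argument is an offset.
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    // Draw the texture on a quad covering the whole window.
    glClear(GL_COLOR_BUFFER_BIT);
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);
    glTexCoord2f(0, 0); glVertex2f(-1, -1);
    glTexCoord2f(1, 0); glVertex2f( 1, -1);
    glTexCoord2f(1, 1); glVertex2f( 1,  1);
    glTexCoord2f(0, 1); glVertex2f(-1,  1);
    glEnd();
    glutSwapBuffers();
}

int main(int argc, char** argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    glutInitWindowSize(W, H);
    glutCreateWindow("CUDA + OpenGL 2D image");
    if (glewInit() != GLEW_OK) { printf("glewInit failed\n"); return 1; }

    // Pixel buffer that CUDA will write into.
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, W * H * 4, NULL, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    // Texture that receives the PBO contents every frame.
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, W, H, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);

    // Register the PBO with CUDA once; map and unmap it per frame.
    if (cudaGraphicsGLRegisterBuffer(&pboRes, pbo,
            cudaGraphicsRegisterFlagsWriteDiscard) != cudaSuccess) {
        printf("Could not register the PBO with CUDA\n");
        return 1;
    }

    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}
```

The fixed-function quad is only there to keep the sketch short; the map, kernel, unmap, `glTexSubImage2D` sequence is the part that replaces the trip through a CPU bitmap.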

You can’t copy data directly from one GPU to another (i.e., without going through the host) unless you’re using two Fermi-based cards.
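
If you want to check what your own pair of cards supports, something like the sketch below should tell you. The device indices 0 (compute) and 1 (display) are assumptions, so adjust them to your setup. As far as I know, `cudaMemcpyPeer` still succeeds when direct peer access is not available, but the runtime then stages the copy through host memory, so it is no faster than copying via the host yourself.

```cuda
// Sketch only: query peer-to-peer support between two GPUs and copy a buffer
// from one to the other. Device indices and buffer size are assumptions.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) {
        printf("Need two CUDA devices, found %d\n", count);
        return 0;
    }

    const int computeDev = 0, displayDev = 1;  // assumed roles

    // Can the display GPU address memory that lives on the compute GPU?
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, displayDev, computeDev);
    printf("Direct peer access %d -> %d: %s\n", displayDev, computeDev,
           canAccess ? "yes" : "no");

    const size_t bytes = 512 * 512 * 4;  // assumed RGBA image
    void* src = NULL;
    void* dst = NULL;

    cudaSetDevice(computeDev);
    cudaMalloc(&src, bytes);
    cudaMemset(src, 0x7f, bytes);        // stand-in for computed results

    cudaSetDevice(displayDev);
    cudaMalloc(&dst, bytes);

    // GPU-to-GPU copy; expected to go through the host when canAccess is 0.
    cudaError_t err = cudaMemcpyPeer(dst, displayDev, src, computeDev, bytes);
    printf("cudaMemcpyPeer: %s\n", cudaGetErrorString(err));

    cudaFree(dst);
    cudaSetDevice(computeDev);
    cudaFree(src);
    return 0;
}
```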

MasterKitten, have you tried just doing the math + display on a single card, and using OpenGL or DirectX interop to directly display the results of your calculations? If your display card has enough memory for what you need to do, it might be faster than offloading the math to the other card, precisely because of the bottleneck of going through the host (and all the synchronization that needs to take place between the two cards and the host).