I am thinking of replacing my GTX-275 card with a GTX-295 for scientific computing, based on the reasoning below, and I would like advice from someone who has coded on a dual-GPU single-PCB card like the GTX-295.
In general, memory is a crucial aspect and needs to be taken seriously into consideration. In the application I am dealing with there can be 300M data entries that need to be processed through a special data structure, and because of the number of entries this has to be done in stages. So one GPU would run the application while the other dynamically updates the data structure, fetching data from CPU memory and handing it over before the application needs it. My assumption is that on a single PCB the two GPUs can access memory at the same speed, in contrast to an SLI setup, which is very slow.

Someone correct me if I am wrong in this philosophy before I start, since it needs a serious investment in time (redesigning the theory so that it can work with partial data, or simply concluding that this is not possible and choosing an alternative) and of course money.

A small edit here: I do not know whether the two GPU devices can be coordinated, and I believe that is the only thing that needs to be solved. There should be two loops, one for the first device and one for the second, and the devices should query each other so that a cycle of memory input/output can be initiated. There would still be some latencies, but those could be hidden with the right strategy. So the basic question, since I have not coded on multi-GPU, is whether there is a way to coordinate the two devices. Is CUDA capable of this? With a single GPU things can be done sequentially: copy memory in, build the data structure, apply the application kernel to it, copy memory back, then update (or build a new data structure) from partial data, send it back, and so on; finally, return to the CPU's main memory and reconstruct the model, either by zippering the partially reconstructed models together or by dynamically updating a field function to its final form. Anyway, I was wondering whether this process could be optimized with two GPUs on a single PCB.


A GTX295 is no different to having two discrete GTX275s with an SLI link (although the link is irrelevant for CUDA anyway), other than the fact they are on the same physical PCB and sit behind an Nforce 200 PCI-e switch. They have discrete memory and are programmed and operate completely independently from one another.
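To make "programmed and operate completely independently" concrete, here is a minimal sketch of driving both halves of a 295. Note a version caveat: with the modern runtime (CUDA 4.0 and later) one host thread can switch devices with cudaSetDevice as below; under the CUDA 3.x runtime of this era you would instead need one host thread per device. The kernel and buffer names are placeholders:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: scales each element in place.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *dev0 = 0, *dev1 = 0;

    // The two GPUs of a GTX 295 appear as device 0 and device 1,
    // each with its own memory; allocations do not cross over.
    cudaSetDevice(0);
    cudaMalloc(&dev0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(dev0, n);   // queued on GPU 0

    cudaSetDevice(1);
    cudaMalloc(&dev1, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(dev1, n);   // runs on GPU 1, concurrently

    // Synchronize and clean up each device separately -- there is
    // no implicit link between them.
    cudaSetDevice(0);
    cudaDeviceSynchronize();
    cudaFree(dev0);

    cudaSetDevice(1);
    cudaDeviceSynchronize();
    cudaFree(dev1);
    return 0;
}
```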

Avidday is of course completely correct… the 295 just looks like two independent GPUs. There’s almost no difference at all compared to having two separate cards.

But your real question may not be about hardware, but software. I think you see already that it can get complex, but that may still be OK once you get the model in your mind. In particular, there may be an effective strategy using zero-copy memory which makes your multiple GPUs THINK they have the same copy of memory, and that memory will also be visible to the CPU. So for example one GPU might be using your data structures (pulled from the CPU via zero copy) while the second GPU is prepping new structures for the next iteration (also sent to the CPU via zero-copy). This is pretty clean in that the GPU coding is simple… it’s just reading and writing.
It’s not any more efficient than doing manual memory copies and synchronization (*) but it’s cleaner code wise and can even make scaling to 3+ GPUs easier.

(*) footnote: well, it may be more efficient, or it could be much less efficient, it’s usually problem dependent. In general where you read all the data once, it’s about the same.
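A minimal sketch of the zero-copy setup described above, assuming the runtime API: cudaHostAllocMapped makes the pinned buffer addressable from device code, and cudaHostAllocPortable makes it usable from every GPU's context, which is what lets two GPUs see the same host memory. The kernel name is a placeholder:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *hostBuf = 0, *devAlias = 0;

    // Must be set before the CUDA context is created on the device.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned + mapped + portable: one buffer that the CPU and all
    // GPU contexts can see.
    cudaHostAlloc(&hostBuf, n * sizeof(float),
                  cudaHostAllocMapped | cudaHostAllocPortable);

    // Device-side pointer aliasing the same physical host memory.
    cudaHostGetDevicePointer(&devAlias, hostBuf, 0);

    // A kernel given devAlias reads and writes host memory directly
    // over PCIe; no explicit cudaMemcpy is needed:
    // buildStructure<<<blocks, threads>>>(devAlias, n);  // placeholder

    cudaFreeHost(hostBuf);
    return 0;
}
```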


Thanks for the quick answer. So not a multi-core architecture yet; all right, maybe in the future they will decide it is useful. I was really hoping for out-of-kernel device synchronization and some shared memory. I am a bit disappointed, but it certainly saves me some money.


I do not know about the Fermi architecture, though; from what I have read, it lets you run different kernels on the GPU at the same time:

Concurrent Kernel Execution

Fermi supports concurrent kernel execution, where different kernels of the same application context can execute on the GPU at the same time. Concurrent kernel execution allows programs that execute a number of small kernels to utilize the whole GPU. For example, a PhysX program may invoke a fluids solver and a rigid body solver which, if executed sequentially, would use only half of the available thread processors.

I don't know how this concurrency works; is there a report that explains how two different kernels can be invoked? I do not believe, though, that the CUDA compiler lets you synchronize the execution of two parallel kernels. Great, there is still a lot of potential in CUDA.

The Programming Guide explains this. Launching a kernel is an asynchronous operation: the driver queues up the kernel for execution and returns immediately so the CPU can keep processing. CUDA has long had an abstraction called “streams”. A stream is a sequence of CUDA operations (kernel launches and memory copies) that execute in sequence on a CUDA device. By default, if you do not specify a stream, all CUDA operations go onto stream 0.

The driver is allowed to order operations on different streams arbitrarily, or run them concurrently if the hardware supports it. For example, many CUDA devices can overlap a kernel execution on one stream with a memory copy on another stream. Fermi is the first series of GPUs that permit kernels in different streams to also overlap. The original CUDA 3.0 release would allow up to 4 kernels to overlap on a Fermi GPU, and the 3.1 beta increases this limit to 16.
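As a minimal sketch of that, here are two placeholder kernels launched into separate streams; on Fermi the two launches may overlap on the GPU, while on earlier hardware they simply run back to back:

```cuda
#include <cuda_runtime.h>

__global__ void fluidsStep(float *a) { a[threadIdx.x] += 1.0f; }  // placeholder
__global__ void rigidsStep(float *b) { b[threadIdx.x] *= 2.0f; }  // placeholder

int main(void)
{
    float *a = 0, *b = 0;
    cudaMalloc(&a, 256 * sizeof(float));
    cudaMalloc(&b, 256 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Different streams: the driver is free to run these concurrently
    // if the hardware supports it (Fermi and later).
    fluidsStep<<<1, 256, 0, s0>>>(a);
    rigidsStep<<<1, 256, 0, s1>>>(b);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```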

Thanks for the reply; your concise summary cleared up a lot. The stream is a very nice concept, and what you say about the overlap also answers the latency question. Great, thanks for the guidance.


There is also a nice block diagram explaining the significant differences between the old architecture and the new Fermi architecture, which is very promising. I do not yet know (no experience yet, but I certainly will have) how you handle streams on this new architecture. Certainly, though, it gives a significant advantage over other GPGPU APIs, and that answers the question of why one would use CUDA rather than another GPGPU API, a question I have to answer in front of a video camera, so there is much to do. Again, thanks.


It also has to be mentioned that, with streams, what I am describing seems possible, considering the following example, though it needs a little attention.

For instance, the example in the Programming Guide (with the stream setup added; note that inputHost and outputHost must be pinned memory, e.g. from cudaMallocHost, for the async copies to overlap):

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDev + i * size, inputHost + i * size, size, cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDev + i * size, inputDev + i * size, size);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(outputHost + i * size, outputDev + i * size, size, cudaMemcpyDeviceToHost, stream[i]);

This is actually a nice example, and it is exactly what I need, since:

stream[i] will be read and handed to the kernel, while a new transfer is made in stream[i+1] concurrently with the execution of the kernel in stream[i].

Also, stream[i-1] will copy its results back to the host while the kernel in stream[i] executes.

And by querying a stream you can serialize the sequence of kernel executions. Great, I see a large reduction in host-device latency.
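The querying step can be done without blocking the CPU, for instance with an event recorded into a stream. A sketch continuing the Programming Guide example above (MyKernel, stream[], and the buffers are from that example; cudaStreamQuery(stream[0]) would work the same way for the whole stream):

```cuda
cudaEvent_t done;
cudaEventCreate(&done);

MyKernel<<<100, 512, 0, stream[0]>>>(outputDev, inputDev, size);
cudaEventRecord(done, stream[0]);   // marker enqueued after the kernel

// Non-blocking poll: do host-side work (e.g. stage the next data
// chunk) until the kernel in stream[0] has finished.
while (cudaEventQuery(done) == cudaErrorNotReady) {
    /* prepare the next batch on the CPU */
}

cudaEventDestroy(done);
```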

Sounds like a nice plan.