CUDA multicore/MPI

CUDA 2.1 will feature multicore CPUs as a compilation target, but what about implementing MPI code generation from CUDA source?
That is, having CUDA target MPI clusters would be a very powerful tool (the same code could be compiled for MPI, GPU, or multicore CPU).
The implementation shouldn't be too difficult, either; I think the only real problem would be managing device memory, with the correct MPI calls to broadcast the necessary data.

This message is a feature request for future CUDA releases, but if someone feels like writing a small cuda2mpi converter with me, we could have a look at it.
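To make the idea concrete, here is a rough hand-written sketch (purely illustrative; real generated code would look different, and the decomposition is just one possible choice) of how a trivial CUDA kernel could be lowered to MPI: each rank takes the role of a slice of the grid and loops serially over what used to be the thread indices.

/* CUDA original:
 *   __global__ void scale(float *a, float s, int n)
 *   {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) a[i] *= s;
 *   }
 */
#include <mpi.h>
#include <stdlib.h>

/* MPI translation: each rank plays the role of a group of blocks and
 * iterates serially over its slice of the index space. */
static void scale_local(float *a_local, int n_local, float s)
{
    for (int i = 0; i < n_local; ++i)
        a_local[i] *= s;
}

int main(int argc, char **argv)
{
    int rank, size, n = 1 << 20;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n_local = n / size;                   /* assume n divisible by size    */
    float *a_local = malloc(n_local * sizeof(float));
    for (int i = 0; i < n_local; ++i)         /* stand-in for the host->device */
        a_local[i] = (float)(rank * n_local + i);  /* copy / MPI_Scatter       */

    scale_local(a_local, n_local, 2.0f);      /* replaces the kernel launch    */

    /* An MPI_Gather here would play the role of the device->host copy. */
    free(a_local);
    MPI_Finalize();
    return 0;
}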

Sounds like an excellent paper opportunity (not for me, but for a CS person). I’d suggest reading through the MCUDA paper and looking at their approach to transforming the source code; I think it would be very similar for MPI.

Hmm… they didn’t describe how they did it.
However, parsing text files is fairly easy, and there’s always flex/bison if needed (though I’ve never used them).

[url=“http://forums.nvidia.com/index.php?showtopic=62999&hl=”]http://forums.nvidia.com/index.php?showtopic=62999&hl=[/url]

This was discussed in the forum before. The link above might be interesting to you…

In particular, Emerth’s reply on MPI and cluster programming was very very good!

It would be nice to have something developed for distributed computing!

I would like to see a CUDA abstraction layer that abstracts away the physical location of a GPU in the cluster network!

nVidia has a cluster made up of 16 GPUs spread across 8 nodes (dual-core CPUs + 2x Tesla C870 or Quadro FX 5600).

I’m working on it for some scientific simulations.

Actually, my MPI/CUDA idea could be extended in this direction, too. There are several paradigms that could be implemented:

- a cu2mpi code conversion, to generate plain MPI code (without GPUs), useful to standardize some SIMD-style parallel computing by letting your CUDA programs run on a cluster (if you have access to a powerful one, you could tap more power than a single GPU can give you at the moment);

- an abstraction layer.

My main idea was the first point, but I have to say the second one would be great, too.

Do you mean some kind of (perhaps driver-level?) interface that would allow viewing all the GPUs in a cluster as a single one, or just something to simplify handling them (e.g. seeing them as many GPUs on the same node)?

The first solution would be harder, because of the difficult distributed memory management involved, but it would be nice (if communication latency doesn’t hurt performance too much); the second would be easier and, combined with the GPUWorker class, could lead to easy management of the cluster’s resources.

However, this approach would require access to CUDA’s source code (or else parsing .cu files, which could be harder).

Not really… only the physical position of a GPU needs to be abstracted!

Imagine a CUDA agent (a service or daemon) on every node of the cluster!

Think of the CUDA agent as a GPU service provider: before programming any GPU, you ask the CUDA agent to allocate one.

The CUDA agents on all the nodes will keep in touch with each other and maintain up-to-date information on:

  1. how many GPUs each agent controls,

  2. what the hardware capabilities of those GPUs are,

  3. which GPUs are available for a kernel launch,

and so on.
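Just to make the bookkeeping concrete, the state each agent shares might look roughly like this (all names here are invented purely for illustration):

#include <stddef.h>   /* size_t */

/* Hypothetical record a CUDA agent keeps for its node and gossips to peers. */
typedef struct {
    char   hostname[64];        /* node this agent runs on                     */
    int    num_gpus;            /* 1. how many GPUs the agent controls         */
    struct {
        int    compute_major;   /* 2. hardware capability, e.g. 1.1            */
        int    compute_minor;
        size_t global_mem;      /*    total device memory in bytes             */
        int    busy;            /* 3. currently hogged by a kernel?            */
        double est_release;     /*    expected hogging time left, in seconds   */
    } gpu[16];                  /* fixed upper bound, good enough for a sketch */
} AgentInfo;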

Any GPU request from a cluster application that cannot be serviced by the local GPUs will be serviced by GPUs on other nodes!

Besides sharing information, CUDA agents can establish connections to GPUs controlled by other CUDA agents and perform CUDA operations (library calls, kernel launches, etc.) on them on behalf of the application driving them!

Thus the application will look like this:

/* Pseudo-code: RequestGPU / RequestGPUOperation are the hypothetical agent API. */
Conn = establish_connection(/* local CUDA agent */);

/* First try to get a GPU on this node, then fall back to any node in the cluster. */
gpu = RequestGPU(Conn, GPU_Capabilities, expected_hogging_time, LOCAL_ONLY);
if (gpu == NONE)
{
    gpu = RequestGPU(Conn, GPU_Capabilities, expected_hogging_time, ANYWHERE);
    if (gpu == NONE)
    {
        exit_application();
    }
}

/* Every CUDA call is forwarded through the agent that owns the GPU. */
gpuMem = RequestGPUOperation(Conn, cudaMallocToken, n*sizeof(float));
RequestGPUOperation(Conn, cudaMemcpyToken, gpuMem, cpuArray, n*sizeof(float), cudaMemcpyHostToDevice);
RequestGPUOperation(Conn, cudaKernelLaunchToken, cudaKernelCUBINPointerORWhatever); /* this step requires some clarity */
RequestGPUOperation(Conn, cudaThreadSynchronizeToken);

and so on..

The best approach would be to write an application that runs, and synchronizes with itself, on all nodes of the cluster! Such an application would use local GPUs as much as possible, resulting in efficient resource utilization.

Well, I get the idea, but it may actually be difficult to achieve high performance when spawning kernels across the network (also consider that device memory would have to be broadcast efficiently, e.g. by keeping some cache on the other machines).

My idea was to let developers write standard code for a multi-GPU system, seeing every GPU in the local cluster as if it belonged to the same platform.

Also note that the hardware abstraction I was thinking of (the all-GPUs-viewed-as-a-single-one kind) would be quite easy to implement, at least for GPUs physically on the same platform.
Imagine replacing, in the source file, every command starting with cuda with n commands issued through the GPUWorker class, so that the same operation is performed on every GPU in the system. Commands that launch a kernel would also be executed n times, but the blocks would be split across the different GPUs: you could generate n variants of each kernel that add an offset to blockIdx.x / gridDim.x, so that the other GPUs think they are processing the higher blocks (e.g. in the x direction) while each actually executes only its own, smaller share of blocks.
I hope you get what I mean.
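Something like this, purely as an illustration of the block-offset trick (the converter would inject the extra parameter automatically; names are made up):

/* Original single-GPU kernel: */
__global__ void scale(float *a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

/* What could be generated for GPU number g out of nGpus: each device is
 * launched with only 1/nGpus of the blocks, and blockOffset makes it
 * believe it is running the "higher" blocks of the full grid. */
__global__ void scale_split(float *a, float s, int n, int blockOffset)
{
    int virtualBlock = blockIdx.x + blockOffset;
    int i = virtualBlock * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

/* Host side, per device g (e.g. inside its GPUWorker thread):
 *   scale_split<<<blocksPerGpu, threads>>>(d_a, s, n, g * blocksPerGpu);
 */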

This is something I had hoped we would see when the 9800 GX2 came out, but there is an additional tricky step. You need to efficiently merge the global memory writes on the separate devices. With the exception of atomics, this is not extremely complicated, as the write order to global memory in CUDA is undefined between blocks. You can combine the writes at the end of each kernel call, and still ensure correctness.

The only way I could come up with to do this was a three-way (or N-way) merge. Before starting the kernel, you need a copy of the original global memory block. Then after the call, you determine which bytes changed on each GPU and merge them back into a single memory block. This would require slow device-to-host-to-device traffic, or an inter-device copy that doesn’t (yet) exist.
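For what it's worth, a host-side sketch of such an N-way merge, assuming every device's image has already been copied back to the host and that no two devices wrote different values to the same byte:

#include <stddef.h>

/* orig    - snapshot of the global-memory block taken before the launch
 * gpuCopy - nGpus per-device images read back after the launch
 * out     - merged result (may be the same buffer as orig)                */
void merge_gpu_writes(const unsigned char *orig,
                      unsigned char *const *gpuCopy,
                      unsigned char *out, size_t bytes, int nGpus)
{
    for (size_t b = 0; b < bytes; ++b) {
        unsigned char v = orig[b];
        for (int g = 0; g < nGpus; ++g) {
            if (gpuCopy[g][b] != orig[b]) {  /* this device touched the byte */
                v = gpuCopy[g][b];
                break;
            }
        }
        out[b] = v;
    }
}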

This would be much faster if global memory had dirty bits that you could clear and that would be set on write. That would require new hardware, though, so you can’t really wait for that.

[quote]This is something I had hoped we would see when the 9800 GX2 came out, but there is an additional tricky step. You need to efficiently merge the global memory writes on the separate devices. With the exception of atomics, this is not extremely complicated, as the write order to global memory in CUDA is undefined between blocks. You can combine the writes at the end of each kernel call, and still ensure correctness.[/quote]

Yes, I thought about that too (the fact that you can only be sure writes are complete after the kernel has finished executing).

Actually, an even easier implementation would be to just keep all the GPUs’ memories identical (this means a little overhead due to memory copying).
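As an aside, the mirroring itself is simple. Here is a sketch, assuming a runtime where a single host thread may switch devices with cudaSetDevice; with the CUDA 2.x runtime each device needs its own host thread, which is exactly what GPUWorker provides:

#include <cuda_runtime.h>
#include <stddef.h>

/* Mirror a host buffer into every device's memory so all GPUs start a
 * replicated kernel launch with identical data. Sketch only, no error
 * checking. */
void broadcast_to_all_gpus(const float *hostBuf, float **devBuf, size_t n)
{
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc((void **)&devBuf[g], n * sizeof(float));
        cudaMemcpy(devBuf[g], hostBuf, n * sizeof(float),
                   cudaMemcpyHostToDevice);
    }
}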

I was also investigating the possibility of doing this from code, instead of with a source-code parser (which would definitely be less convenient to use).
It still doesn’t look possible, though.

With parsing, though, it would be quite easy to implement (IMO) using the GPUWorker class, but that class may not be so easy to use because of its link-time requirements (e.g. the Boost libraries, which I, for example, don’t usually have on my system).

For sure, this model doesn’t work unless you mirror host-to-device copies to all devices.

However, this doesn’t solve the problem of blocks running on different devices writing to different parts of a single (conceptual) array. You will still need a merge step to resync the array changes after the kernel finishes.

But, yeah, I agree, this sort of “CUDA-SLI” model is hard to do efficiently without some extra hardware & driver support.

Yes.