Not really… only the physical location of a GPU needs to be abstracted!
Imagine a CUDA agent (a service or daemon) on every node of the cluster!
Think of the CUDA agent as a GPU service provider! Before programming any GPU, an application first has to ask the CUDA agent to allocate one.
The CUDA agents on all the nodes will keep in touch with each other and maintain up-to-date information on:
- how many GPUs each agent controls,
- the hardware capabilities of those GPUs,
- which GPUs are currently free for kernel launches,
and so on.
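For concreteness, here is a minimal sketch of the kind of per-GPU record the agents could exchange among themselves. Every name in it (GPUInfo, agent_id, and so on) is hypothetical; the actual wire format would be an implementation detail of the agents.

#include <stddef.h>

/* Hypothetical record a CUDA agent broadcasts to its peers */
typedef struct {
    int    agent_id;         /* which node's agent owns this GPU          */
    int    device_ordinal;   /* CUDA device number on that node           */
    int    compute_major;    /* compute capability major, e.g. 1 in "1.3" */
    int    compute_minor;    /* compute capability minor, e.g. 3 in "1.3" */
    size_t total_mem_bytes;  /* total device memory                       */
    int    busy;             /* nonzero while leased to an application    */
    long   lease_expires;    /* when the current hogging-time lease ends  */
} GPUInfo;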
Any GPU request from a cluster application that cannot be serviced by local GPUs will be serviced by GPUs on other nodes!
Besides sharing information, CUDA agents can establish connections to GPUs controlled by other agents and perform CUDA operations (library calls, kernel launches, etc.) on them on behalf of the application driving them!
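To make the forwarding idea concrete, here is a rough sketch of the dispatch loop an agent might run per client connection. Everything in it (the Request struct, the token names, recv_request/send_reply) is a hypothetical protocol invented for illustration, not an existing API; the one real thing it shows is the agent translating opaque tokens into genuine CUDA runtime calls.

#include <cuda_runtime.h>
#include <stddef.h>

typedef enum { MALLOC_TOKEN, MEMCPY_TOKEN, LAUNCH_TOKEN, SYNC_TOKEN } OpToken;

typedef struct {
    OpToken token;
    void   *dst, *src;   /* for remote clients, src would point into a staging
                            buffer the agent filled from the network */
    size_t  bytes;
    int     copy_kind;   /* mirrors cudaMemcpyKind */
} Request;

/* Hypothetical transport layer */
int  recv_request(int client, Request *r);
void send_reply(int client, const void *payload, size_t n);

void service_client(int client)
{
    Request r;
    while (recv_request(client, &r)) {
        switch (r.token) {
        case MALLOC_TOKEN: {
            void *dev = NULL;
            cudaMalloc(&dev, r.bytes);          /* the agent makes the real call */
            send_reply(client, &dev, sizeof dev);
            break;
        }
        case MEMCPY_TOKEN:
            cudaMemcpy(r.dst, r.src, r.bytes, (enum cudaMemcpyKind)r.copy_kind);
            break;
        case SYNC_TOKEN:
            cudaThreadSynchronize();            /* cudaDeviceSynchronize() on newer toolkits */
            break;
        case LAUNCH_TOKEN:
            /* see the kernel-launch sketch further below */
            break;
        }
    }
}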
On the application side, the whole flow would look something like this:
Conn = establish_connection(local CUDA agent);
gpu = RequestGPU(Conn, GPU_Capabilities, expected_hogging_time, LOCAL_ONLY);
if (gpu == NONE)
{
    /* No free local GPU; fall back to any GPU in the cluster */
    gpu = RequestGPU(Conn, GPU_Capabilities, expected_hogging_time, ANYWHERE);
    if (gpu == NONE)
    {
        exit_application();
    }
}
gpuMem = RequestGPUOperation(Conn, cudaMallocToken, n * sizeof(float));
RequestGPUOperation(Conn, cudaMemcpyToken, gpuMem, cpuArray, n * sizeof(float), cudaMemcpyHostToDevice);
RequestGPUOperation(Conn, cudaKernelLaunchToken, cudaKernelCUBINPointerORWhatever); /* This step requires some clarity */
RequestGPUOperation(Conn, cudaThreadSynchronizeToken);
and so on…
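On the kernel-launch step flagged above as needing clarity, one plausible scheme (a sketch only; the surrounding protocol is hypothetical, though cuModuleLoadData/cuLaunchKernel are real driver-API calls on CUDA 4.0 and later) is for the client to ship the CUBIN image, the kernel name, the launch geometry, and its arguments packed into one flat blob, which the agent then replays through the CUDA driver API:

#include <cuda.h>

/* Hypothetical agent-side handler for cudaKernelLaunchToken.
   Error checking omitted for brevity. */
void launch_shipped_kernel(const void *cubin_image, const char *kernel_name,
                           unsigned gx, unsigned gy, unsigned gz,
                           unsigned bx, unsigned by, unsigned bz,
                           void *arg_blob, size_t arg_blob_size)
{
    CUmodule   mod;
    CUfunction fn;

    cuModuleLoadData(&mod, cubin_image);         /* load the client's CUBIN */
    cuModuleGetFunction(&fn, mod, kernel_name);

    /* CU_LAUNCH_PARAM_BUFFER_POINTER/_SIZE let the arguments stay in the
       single packed buffer they arrived in over the network. */
    void *extra[] = {
        CU_LAUNCH_PARAM_BUFFER_POINTER, arg_blob,
        CU_LAUNCH_PARAM_BUFFER_SIZE,    &arg_blob_size,
        CU_LAUNCH_PARAM_END
    };
    cuLaunchKernel(fn, gx, gy, gz, bx, by, bz,
                   0 /* sharedMemBytes */, 0 /* stream */,
                   NULL /* kernelParams */, extra);
}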
The best approach would be to write an application that runs on all nodes of the cluster and synchronizes among its instances! Such an application would use local GPUs to a large extent, resulting in efficient resource utilization.
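As a final sketch of that idea (assuming MPI for the cross-node part and the hypothetical agent API from the pseudocode above, so Connection, GPUHandle, RequestGPU, etc. are all made-up names), each instance would prefer the GPUs of its own node and fall back to remote ones only when the local agent has nothing free:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical agent API from the sketch above */
    Connection conn = establish_connection(local_cuda_agent);

    /* Prefer local GPUs so most traffic never leaves the node */
    GPUHandle gpu = RequestGPU(conn, GPU_Capabilities, expected_hogging_time, LOCAL_ONLY);
    if (gpu == NONE)
        gpu = RequestGPU(conn, GPU_Capabilities, expected_hogging_time, ANYWHERE);
    if (gpu == NONE)
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);

    /* ... per-instance CUDA work via RequestGPUOperation(conn, ...) ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* the "synchronize among itself" part */
    MPI_Finalize();
    return 0;
}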