Hi there. I’m currently working on scaling a single-GPU application into a multi-GPU one, targeting the 4 devices available: 3 Tesla C1060s and a Quadro FX 1800.
I’ve studied the simpleMultiGPU example provided with the SDK, and its approach has some limitations that become obvious when the same approach is applied to the application I’m developing.
While simpleMultiGPU works on 8192 floats divided among all available devices (the task is split and a quantum is assigned to every device), the task my application performs cannot be divided amongst devices. Using 4 devices is instead meant to increase the depth of the data being processed, creating the kind of effect a streaming approach would give. The single-GPU application processes 2x1.4515136e7 32-bit integers and needs 2x1.4515136e7+2x2.26799e5+1.0368e6 32-bit integers copied to the device. The reason all this data can’t be divided into quanta and assigned to each device is that the 2x2.26799e5+1.0368e6 integers are needed for processing but are not themselves processed, i.e. they store information the kernel never modifies; and because the kernel is iterative, splitting into quanta would require synchronizing data after every iteration, at the cost of massive overhead.
So basically no task division is possible; the multi-GPU approach aims to increase the amount of data processed 4 times (2x1.4515136e7 → 8x1.4515136e7) while parallelizing execution across all four GPUs.
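For scale, here is a quick back-of-the-envelope per-device footprint (my own arithmetic, assuming plain 4-byte integers; the variable names are just illustrative):
[codebox]/* rough per-device footprint, assuming 4-byte ints; names are illustrative only */
size_t processed = 2UL*14515136UL;                     /* 29,030,272 ints the kernel modifies   */
size_t readonly  = 2UL*226799UL + 1036800UL;           /* 1,490,398 ints read but never written */
size_t bytes     = (processed + readonly)*sizeof(int); /* ~122 MB copied to each device         */[/codebox]
so device memory shouldn’t be the constraint here; the PCIe traffic is.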
This is where the problems arise. If the SDK simpleMultiGPU approach is taken, one thread per device plus a main thread that creates the others, there will inevitably be a race for the PCIe bus, as each thread must perform a host-to-device cudaMemcpy; and even if the OS scheduler were good enough to sort out that race, the threads would race for the bus again after kernel execution, this time for the device-to-host cudaMemcpy. The thing is, the OS scheduler won’t help me, so I devised a simple synchronization mechanism that still allows some concurrent kernel execution on the available devices. I suspect that simpleMultiGPU, handling such a pitiful amount of data, completes each host-to-device copy before the next host thread is even created, which eliminates the need for any scheduling.
Mind that the situation depicted in the attached picture is only likely to happen for a small number of kernel iterations; it is more likely that the kernel execution on device 0 overlaps the execution on device 3. It’s not a perfect routine, but it allows some parallelism and it contributes greatly to performance: the throughput penalty for an increasing number of iterations, hence an increasing kernel execution time, is much greater for a single GPU than for a multi-GPU approach like this.
As the simpleMultiGPU example provided with the SDK uses pthread.h through other routines, I devised an OpenMP implementation, which works regardless of OS state as opposed to my pthread.h attempt (my fault, I assume, for not being able to develop a stable multithreaded application with pthread.h). The synchronization mechanism is not far from being fully implemented, but it’s still very faulty.
Here’s a snippet:
[codebox]#define GPUsAVAILABLE 4

/* defined as a global variable, hence shared by the OpenMP threads */
int count=-1;

/* on the main function */
count=0;
omp_set_num_threads(GPUsAVAILABLE);
#pragma omp parallel shared(count)
{
    int thread_id, i;
    /* host ptrs declarations */
    thread_id=omp_get_thread_num();

    //select GPU
    cudaSetDevice(thread_id%GPUsAVAILABLE);
    cudaThreadExit();

    /* allocates resources on host side and initializes data */
    /* device ptrs declarations */
    /* setup grid */
    /* perform cudaMallocs */

    #pragma omp barrier //all threads synchronize here

    while(count!=thread_id)
    {
        usleep(1500); //dunno why a sleep call is needed here, but it was the only way it worked
    }
    //count==thread_id: this thread owns the bus, continue execution
    /* perform the host-to-device cudaMemcpy */
    #pragma omp atomic //unlocks the next thread
    count++;

    kernel<<<grid,threads>>>(/*data ptrs*/);
    for(i=0;i<2*num_iterations;i+=1)
        kernel<<<grid,threads>>>(/*data ptrs*/);
    cudaThreadSynchronize();

    /* thread 0 halts here until the last device has done its host-to-device copy */
    /* the other threads halt until the previous one has done its device-to-host copy */
    while(count!=thread_id+GPUsAVAILABLE)
    {
        usleep(1500);
    }
    //count==thread_id+GPUsAVAILABLE: continue execution
    /* perform the device-to-host cudaMemcpy */
    #pragma omp atomic //unlocks the next thread
    count++;
}[/codebox]
There are several things here that shouldn’t be happening. The host-to-device/device-to-host copies are iteration-independent, it’s always the very same amount of data, yet in the multi-GPU approach their duration increases with the number of iterations, whereas in the single-GPU version it’s constant. Why is a sleeping routine necessary for this to work at all? A plain while(/* some condition on thread_id and count */){} stalls the application. On top of that, the sleep calls throw off the timings given by clock(), so I can’t confirm that execution actually follows the schedule I’m trying to impose.
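(Side note: to work around the clock() issue I’ve been considering timing with wall-clock time instead, something like the minimal sketch below; gettimeofday is POSIX, so this assumes a Linux/Unix host, and wall_time is just a name I made up.)
[codebox]#include <sys/time.h>

/* wall-clock timer; clock() counts CPU time, which the usleep() calls don't consume,
   so wall-clock time is what's needed to check whether the intended schedule is respected */
double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec*1e-6;
}

/* usage: double t0 = wall_time(); ... printf("%.3f s\n", wall_time()-t0); */[/codebox]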
And above all: is this a sensible synchronization routine at all?
It’s not that clever, but eliminating the count, and thus the scheduling, makes the host-to-device/device-to-host copies take as much as 20 times longer. Is there another way to achieve proper synchronization between the host threads and to manage the shared resources, mainly the bus, other than something along these lines?
I appreciate all help and comments, and your patience if you’ve made it this far into the post.