Multiple GPUs: devising a synchronization mechanism for host threads

Hi there. I’m currently working on scaling a single-GPU application into a multi-GPU one. I’m trying to scale to the 4 devices available: 3 Tesla C1060s and a Quadro FX 1800.

I’ve studied the simpleMultiGPU example provided with the SDK, and its approach has some limitations, limitations that become obvious if the same approach is used in the application I’m developing.

simpleMultiGPU works on 8192 floats divided among all available devices: the task is split and a quantum is assigned to each device. The task my application performs, however, cannot be divided among devices. Using 4 devices is meant to increase the depth of data being processed, producing the kind of effect a streaming approach would. The single-GPU application processes 2x1.4515136e7 32-bit integers and needs 2x1.4515136e7 + 2x2.26799e5 + 1.0368e6 32-bit integers copied to the device. This data cannot be divided into quanta and assigned to each device for two reasons: the 2x2.26799e5 + 1.0368e6 integers are needed for processing but are not themselves processed (they hold information the kernel reads but never modifies), and since the kernel is iterative, splitting the data into quanta would require synchronizing it after every iteration, at the cost of massive overhead.
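For scale, here is a quick back-of-the-envelope on those numbers (assuming the 32-bit integers above, i.e. 4 bytes each):

[codebox]#include <stdio.h>

int main(void)
{
	/* data volumes from the figures above, in 32-bit integers */
	long long processed = 2LL * 14515136;           /* 2 x 1.4515136e7 */
	long long read_only = 2LL * 226799 + 1036800;   /* 2 x 2.26799e5 + 1.0368e6 */

	printf("processed : %.1f MiB\n", processed * 4 / 1048576.0);   /* ~110.7 MiB */
	printf("read-only : %.1f MiB\n", read_only * 4 / 1048576.0);   /* ~5.7 MiB   */
	printf("copied to one device: %.1f MiB\n", (processed + read_only) * 4 / 1048576.0);
	return 0;
}[/codebox]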

So basically no task division is possible. The multi-GPU approach aims to increase the amount of data processed 4 times (2x1.4515136e7 → 8x1.4515136e7) while letting all four GPUs execute in parallel.

This is where the problems arise. With the SDK simpleMultiGPU approach, one thread per device plus a main thread that creates the others, there will inevitably be a race for the PCIe bus, since each thread must perform a cudaMemcpy host-to-device; even if the OS scheduler handled that race well, there would be another race for the bus after the kernel execution, on account of the cudaMemcpy device-to-host. The OS scheduler won’t help me here, so I devised a simple synchronization mechanism that allows some concurrent kernel execution across the available devices. I suspect that simpleMultiGPU, handling such a tiny amount of data, manages to complete each host-to-device copy before the next host thread is even created, which eliminates the need for any scheduling.

External Media

Note that the situation shown in the picture is what happens for a small number of kernel iterations; more commonly, the kernel execution on device 0 overlaps the execution on device 3. It’s not a perfect routine, but it allows some parallelism and contributes greatly to performance: the throughput penalty for an increasing number of iterations, and hence kernel execution time, is much greater for a single GPU than for a multi-GPU approach like this.

As the simpleMultiGPU provided with the SDK uses pthread.h through other routines, I devised an OpenMP implementation that works regardless of OS state, as opposed to my pthread.h attempt (my fault, I assume, for not being able to build a stable multithreaded application with pthread.h). The synchronization mechanism is not far from being complete, but it’s still quite faulty.

Here’s a snippet

[codebox]#define GPUsAVAILABLE 4

/* defined as a global variable */
int count = -1;

/* in the main function */
count = 0;
omp_set_num_threads(GPUsAVAILABLE);

#pragma omp parallel shared(count)
{
	int thread_id, i;

	/* host ptr declarations */
	thread_id = omp_get_thread_num();

	// select GPU
	cudaSetDevice(thread_id % GPUsAVAILABLE);
	cudaThreadExit();

	/* allocate resources on the host side and initialize data */
	/* device ptr declarations */
	/* set up grid */
	/* perform cudaMallocs */

	#pragma omp barrier		// all threads synchronize here

	while (count != thread_id)
	{
		usleep(1500);		// don't know why a sleep call is needed, but it was the only way it worked
	}

	// when count == thread_id, execution continues
	/* perform cudaMemcpy host-to-device */

	#pragma omp atomic		// unlock the next thread
	count++;

	kernel<<<grid, threads>>>(/* data ptrs */);

	for (i = 0; i < 2 * num_iterations; i += 1)
		kernel<<<grid, threads>>>(/* data ptrs */);
	cudaThreadSynchronize();

	/* thread 0 halts here until the last device has performed its cudaMemcpy host-to-device */
	/* the other threads halt here until the previous one has performed its cudaMemcpy device-to-host */
	while (count != thread_id + GPUsAVAILABLE)
	{
		usleep(1500);
	}

	// when count == thread_id + GPUsAVAILABLE, execution continues
	/* cudaMemcpy device-to-host */

	#pragma omp atomic		// unlock the next thread
	count++;
}[/codebox]

There are several things about this that shouldn’t be happening. The cudaMemcpy host-to-device / device-to-host is iteration-independent, it is always the same amount of data, yet with the multi-GPU approach its time grows with the number of iterations, whereas on the single GPU it stays constant. Why is a sleep call necessary for this to work? A bare while (/* some condition on thread_id and count */) {} stalls the application. The sleep calls also reset the timestamp given by clock(), so I can’t confirm that execution follows the schedule I’m trying to impose.
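Is an explicit flush perhaps what the usleep() call is accidentally providing? Just a guess on my part, something like:

[codebox]while (count != thread_id)
{
	#pragma omp flush(count)	/* force count to be re-read from memory instead of sleeping */
}

/* cudaMemcpy host-to-device */

#pragma omp atomic			/* unlock the next thread */
count++;
#pragma omp flush(count)		/* make the increment visible to the spinning threads */[/codebox]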

And above all, is this a wise synchronization routine?

It’s not that clever, but eliminating the count, and thus the scheduling, makes the cudaMemcpy host-to-device / device-to-host take up to 20 times longer. Is there another way to achieve proper synchronization between host threads and manage the shared resources, mainly the bus, other than something like this?

I appreciate all the help and comments, and your patience if you made it this far through the post.

Wow… long post :)
I’m not sure I’ve completely understood what you want to do / have done, but here are a few comments…
-. SDK samples are just samples; they are not production-quality code. If you want multi-GPU code you have to write and manage everything yourself (pthreads or OpenMP or…). The basic idea is indeed to open one thread per GPU, call cudaSetDevice on each thread and leave it running as long as possible, feeding data and running kernels on those threads. Killing the thread will force you to re-create another one and call cudaSetDevice again, and that takes time.
-. You can google this newsgroup for Mr Anderson’s GPUWorker code, which implements multi-GPU.
-. The “racing on the bus” will of course happen in any case. If you have an x16 PCIe slot where you have, for example, a GTX295 (which is actually two GPUs), you’ll probably get x8 for each GPU. Nothing you can do about this other than to get a system with as many PCIe lanes as possible…
-. The usleep(1500) is obviously an indication that your code is not working correctly. There should be no such sync between the GPUs, and each HOST thread should work independently from the others.
-. You might want to look at async memory copies and/or pinned memory et al… (quick sketch below)
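Something along these lines, per host thread, after cudaSetDevice (just a sketch, not tested; kernel, grid, threads and N are placeholders for your own names):

[codebox]/* async copies + pinned memory, per host thread */
size_t nbytes = N * sizeof(int);
int *h_buf, *d_buf;
cudaStream_t stream;

cudaMallocHost((void**)&h_buf, nbytes);        /* pinned host memory, required for async copies */
cudaMalloc((void**)&d_buf, nbytes);
cudaStreamCreate(&stream);

/* fill h_buf here */

cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
kernel<<<grid, threads, 0, stream>>>(d_buf);   /* queued after the copy, same stream */
cudaMemcpyAsync(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);                 /* wait only for this GPU's work */[/codebox]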

hope that cleared/helped a bit…

eyal

The bus sharing doesn’t appear to happen that way. The bus appears to be fully dedicated to one thread at a time but time-sliced amongst threads: thread 0 controls the bus while the other threads wait, then thread 1 controls it while the others wait, and so on, which creates overhead whenever the cudaMemcpys are performed concurrently or overlap.

Isn’t GPUWorker just a way to transparently hide thread creation and all its management, not its scheduling, in a similar fashion to OpenMP? The usleep isn’t supposed to be there, but a bare while (/* some condition */) {} stalls the application, so for the time being I’ll stick with it.

Anyway, the sort of problem I’m addressing has a couple of features that generate overhead by themselves. An inter-block synchronization mechanism is a violation of the CUDA execution model, which means that for an iterative kernel, without any fancy mechanism implementing inter-block sync, the kernel has to be launched inside a loop on the host side, and that makes each host thread compete for the bus not only during the cudaMemcpys but throughout the entire application. I get tremendous overhead, and even with only two devices, only the first and last cudaMemcpys are free of bus-contention overhead, i.e. they execute as fast as on a single device.

I don’t believe that is the case, and I have experimental evidence in this thread which demonstrates it. I have multithreaded codes which achieve simultaneous pageable and pinned memory copies to multiple GPUs, where the sum of the transfer throughput exceeds a single-device copy by a good margin; so much so that I can demonstrate the bandwidth of the CPU-to-PCIe-controller link (either HyperTransport or QPI) becoming the bottleneck during large transfers. I tried reading your first post a couple of times, but I really couldn’t follow it, so I can’t really offer any suggestions, other than the simple observation that my experience with multi-GPU and CUDA doesn’t seem to agree with yours at all.
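A stripped-down sketch of the kind of test I mean (not my actual benchmarking code): one OpenMP thread per GPU, one pinned buffer per thread, all copies started together, and the per-GPU throughputs summed by hand afterwards.

[codebox]#include <stdio.h>
#include <omp.h>
#include <cuda_runtime.h>

int main(void)
{
	int num_gpus = 0;
	cudaGetDeviceCount(&num_gpus);

	#pragma omp parallel num_threads(num_gpus)
	{
		int dev = omp_get_thread_num();
		cudaSetDevice(dev);

		size_t nbytes = 256 * 1024 * 1024;              /* 256 MB per GPU */
		float *h, *d;
		cudaMallocHost((void**)&h, nbytes);             /* pinned host buffer */
		cudaMalloc((void**)&d, nbytes);

		#pragma omp barrier                             /* start all copies together */
		double t0 = omp_get_wtime();
		cudaMemcpy(d, h, nbytes, cudaMemcpyHostToDevice);
		double dt = omp_get_wtime() - t0;

		printf("GPU %d: %.2f GB/s\n", dev, nbytes / (dt * 1e9));

		cudaFreeHost(h);
		cudaFree(d);
	}
	return 0;
}[/codebox]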

Then how can one explain the difference in memory copy times: copying to a single device takes about 2.8 ms, but with simultaneous copies to four devices each one takes ~20 ms?

The object of this topic, or at least the answers I need, is not LTI: it seems that I won’t be able to scale my application across several devices if I can’t implement an inter-block sync. The reason seems to be that running the kernel several times inside a loop, concurrently with memory copies on separate devices, causes massive overhead.

Simple - the threading model you are using is really inappropriate for multi-gpu use. Context establishment is expensive, and creating and destroying contexts with every thread operation (which is what that OpenMP code looks to do) is a bad idea. Most of the time you are measuring isn’t copying or kernel execution, it is context establishment overhead.

If you use persistent threads which establish contexts only once and hold them for the lifetime of the application, the performance will be vastly better.
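Schematically, something like this (just a sketch; run_all_gpus, num_gpus and num_iterations are made-up names, the point is that cudaSetDevice and context creation happen once per thread and the context is kept until the very end):

[codebox]#include <omp.h>
#include <cuda_runtime.h>

void run_all_gpus(int num_gpus, int num_iterations)
{
	#pragma omp parallel num_threads(num_gpus)
	{
		/* bind this host thread to one device for the lifetime of the run */
		cudaSetDevice(omp_get_thread_num());
		cudaFree(0);                       /* force context creation once, up front */

		/* allocate device buffers here, once */

		for (int i = 0; i < num_iterations; ++i) {
			/* copy inputs, launch the kernel(s), copy results back */
		}

		cudaThreadSynchronize();           /* drain this GPU before the thread ends */
		/* no cudaThreadExit() in the middle of the run:
		   the context lives as long as the host thread does */
	}
}[/codebox]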

The reason I defined a #pragma omp barrier was to stall all host threads after their creation and the setup of the memory pointers, so as to measure the cudaMemcpy and kernel execution times. It would be dreadfully wrong to time the execution of the host thread rather than the task performed by it, and that isn’t happening. The application is meant to be a benchmark of a min-sum decoding algorithm, which is iterative, hence the loop on the host side, and the host threads last the lifetime of the application as far as I can tell. Programming with POSIX threads has, if anything, proven overly time-consuming compared to OpenMP.

Anyway, overhead arises whenever memcpys and kernel executions are concurrent, so the scheduling posted above won’t be effective unless the iterations can run on the device without calling a kernel more than once. Inter-block synchronization is required. There is a paper called Inter-Block GPU Communication via Fast Barrier Synchronization that describes inter-block sync routines, but it lacks some code details, so I haven’t been able to implement their routines. Any help with inter-block sync?

One of the routines provided is the following

[codebox]
/* g_mutex and goalVal are assumed to be globals in device memory;
   goalVal is the value g_mutex must reach (e.g. the number of blocks) */
__device__ int g_mutex;
__device__ int goalVal;

__device__ void __gpu_sync()
{
	int tid_in_block = threadIdx.x;

	if (tid_in_block == 0) {
		atomicAdd(&g_mutex, 1);

		/* spin until every block has incremented the counter */
		while (g_mutex != goalVal);
	}

	__syncthreads();
}
[/codebox]

And its appropriate usage, according to the authors, is:

[codebox]
__global__ void iterative_kernel(int num_iterations /*, data ptrs */)
{
	for (int i = 0; i < num_iterations; i++)	/* any given number of iterations */
	{
		mykernel();		/* __device__ function doing one iteration's work */

		__gpu_sync();		/* grid-wide barrier between iterations */
	}
}
[/codebox]

Having changed my application to grid-synchronize this way, the effect is null: the results are inconsistent, which means a write-after-read hazard is occurring in my specific case.

I have the same problem: the block synchronization has no effect (“the effect is null”).

If you take a look at the .ptx code, you’ll see that there is no code corresponding to [codebox]while (g_mutex != goalVal);[/codebox]

For example, I do [codebox]int u;

while (g_mutex != goalVal) { u = 777; }[/codebox]

and look at the .ptx: nothing, I don’t see my loop. In NVIDIA_CUDA_Programming_Guide_2.3 I read “…the compiler is free to optimize…” and “This behavior can be changed using the volatile keyword: If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed at any time by another thread and therefore any reference to this variable compiles to an actual memory read instruction.”

Then I do [codebox]volatile int u;

while (g_mutex != goalVal) { u = 777; }[/codebox]

and in the .ptx code I see [ATTACHMENT]

That is why this block synchronization has absolutely no effect.
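So I think the declaration itself has to change, something like this (my assumption of what the programming guide means; I have also made goalVal a parameter here):

[codebox]/* the counter must be volatile so every read in the spin loop becomes a real load */
__device__ volatile int g_mutex;

__device__ void __gpu_sync(int goalVal)
{
	if (threadIdx.x == 0) {
		atomicAdd((int*)&g_mutex, 1);      /* cast away volatile for atomicAdd */
		while (g_mutex != goalVal);        /* now an actual memory read per iteration */
	}
	__syncthreads();
	/* note: all blocks must be resident on the GPU at the same time,
	   otherwise this can never complete */
}[/codebox]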

After that, if I use [codebox]if (tid_in_block == 0)[/codebox] (i.e. only thread 0 of each block does the atomicAdd and the spin), the program does not hang, but the results are inconsistent.

If every thread in all blocks does [codebox]atomicAdd(&g_mutex, 1);

while (g_mutex != goalVal);[/codebox] my program hangs!

Why this happens, I don’t know!