Weird multi-GPU performance: about 10 times slower than single GPU

Hello,

I’m trying to perform several convolutions in parallel on two GPUs. I’m using the same multi-threading API as in the simpleMultiGPU SDK example, so the code looks something like this:

#define MAX_GPU_COUNT 8

prepare_signals_data;
prepare_kernels_data;

TGPUplan  plan[MAX_GPU_COUNT];
CUTThread threadID[MAX_GPU_COUNT];
int GPU_N;

cutilSafeCall(cudaGetDeviceCount(&GPU_N));
if (GPU_N > MAX_GPU_COUNT) GPU_N = MAX_GPU_COUNT;

// initialize plans (set parameters for each convolution)
for (int i = 0; i < GPU_N; ++i)
    initialize_plan(i);

// compute plans, one CPU thread per GPU
for (int i = 0; i < GPU_N; i++)
    threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));

cutWaitForThreads(threadID, GPU_N);

The solverThread routine then looks like this (note that the total number of convolutions is generally larger than GPU_N, but it is a multiple of GPU_N):

static CUT_THREADPROC solverThread( TGPUplan *plan ) {

    cutilSafeCall( cudaSetDevice(plan->device) );

    for (int i = 0; i < CONVOLUTION_N; i++) {
        copy_signal_from_CPU_to_GPU;
        copy_kernel_from_CPU_to_GPU;

        // perform FFTs using CUFFT
        fft(signal);
        fft(kernel);

        // perform convolution and store the result in signal
        pointwise_multiplication(signal, kernel, normalization);
        ifft(signal);

        copy_signal_from_GPU_to_CPU;
    }
}
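In case it helps, the fft / pointwise_multiplication steps are just standard CUFFT calls plus a small kernel, roughly like the sketch below (d_signal / d_kernel, the dimension order in cufftPlan3d and the per-iteration plan creation are only illustrative, not my exact code):

#include <cufft.h>

// element-wise complex multiply, scaled so the inverse FFT comes out normalized
__global__ void pointwiseMulAndScale(cufftComplex *a, const cufftComplex *b,
                                     int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex c;
        c.x = (a[i].x * b[i].x - a[i].y * b[i].y) * scale;
        c.y = (a[i].x * b[i].y + a[i].y * b[i].x) * scale;
        a[i] = c;
    }
}

// body of one convolution (error checking omitted for brevity)
cufftHandle fftPlan;
cufftPlan3d(&fftPlan, sizez, sizey, sizex, CUFFT_C2C);

cufftExecC2C(fftPlan, d_signal, d_signal, CUFFT_FORWARD);
cufftExecC2C(fftPlan, d_kernel, d_kernel, CUFFT_FORWARD);

int n = sizex * sizey * sizez;
pointwiseMulAndScale<<<(n + 255) / 256, 256>>>(d_signal, d_kernel, n, 1.0f / n);

cufftExecC2C(fftPlan, d_signal, d_signal, CUFFT_INVERSE);
cufftDestroy(fftPlan);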

Finally I get the right results (compared against a CPU convolution), but almost ten times slower than when using only a single GPU with no multi-threading API (1800 ms vs. 200 ms). I thought the problem could be the bandwidth of the PCI Express bus (copying data to multiple devices concurrently would lead to low performance), but when I set MAX_GPU_COUNT to 1 the computation time is still about 1800 ms.

For completeness’ sake, some short information about the hardware configuration: Intel Core 2 Quad Q6600, 4 GB RAM, NVIDIA GTX 295, 64-bit Linux, CUDA 2.2.

Have any of you experienced this weird behavior? What can I do to avoid it?

Does the behavior improve if you generate the plan inside each thread?

Maybe I don’t understand you, but how can I generate the plan inside each thread when the plan contains the information telling each thread what to do, e.g. pointers to memory and the parameters of the convolutions? Furthermore, I don’t think generating the plans concurrently can be measurably faster, since it is just setting a few fields of the TGPUplan structure:

typedef struct {
    // device number
    int device;

    // pointers to data
    Complex *h_signal;
    Complex *h_kernel;

    // input parameters
    unsigned int sizex;
    unsigned int sizey;
    unsigned int sizez;
    unsigned int CONVOLUTION_N;
    float normalization;
} TGPUplan;
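Just to illustrate how little it does, initialize_plan is essentially this kind of thing (a sketch only; h_signals, h_kernels, TOTAL_CONVOLUTIONS and the SIGNAL_* sizes are placeholders, not my real code):

void initialize_plan(int i)
{
    unsigned int n = SIGNAL_X * SIGNAL_Y * SIGNAL_Z;    // voxels per volume (placeholder sizes)

    plan[i].device        = i;                          // one plan per GPU
    plan[i].sizex         = SIGNAL_X;
    plan[i].sizey         = SIGNAL_Y;
    plan[i].sizez         = SIGNAL_Z;
    plan[i].CONVOLUTION_N = TOTAL_CONVOLUTIONS / GPU_N; // convolutions handled by this thread
    plan[i].normalization = 1.0f / (float)n;
    plan[i].h_signal      = h_signals + i * plan[i].CONVOLUTION_N * n;  // this thread's slice of host data
    plan[i].h_kernel      = h_kernels + i * plan[i].CONVOLUTION_N * n;
}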

Bad news everyone: the simpleMultiGPU example from the SDK behaves about the same. To be exact, the computation on 2 GPUs is about 50 % slower than on 1 GPU (obtained by setting MAX_GPU_COUNT to 1). Tested on two different machines with a GTX 295, one with CUDA 2.2 and one with CUDA 2.3. So, wtf? :confused:

Hi,

Do you know what takes most of the time in the solver thread?

copy_signal_from_CPU_to_GPU;
copy_kernel_from_CPU_to_GPU;

// perform FFTs using CUFFT
fft(signal);
fft(kernel);

// perform convolution and store the result in signal
pointwise_multiplication(signal, kernel, normalization);
ifft(signal);

copy_signal_from_GPU_to_CPU;

Can you put timers on each of the calls to see what takes most of the time?
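For example, something like this cudaEvent-based sketch around each stage (names are made up, not tied to your code) gives per-stage numbers in milliseconds:

// sketch: timing one stage with CUDA events
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... the stage being measured, e.g. the host-to-device copies or the FFTs ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                 // wait until the stage has really finished
cudaEventElapsedTime(&ms, start, stop);
printf("Device %d: stage took %.2f ms\n", plan->device, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);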

Might be that most of the time is PCI-Express transfer, and the GTX 295 (2 GPUs sharing the same PCIe slot) makes it worse.

eyal

Here we go … the exact output was the following:

Device 1: Initialization: 47.27 ms

Device 0: Initialization: 110.60 ms

Device 1: Data copy: 141.64 ms

Device 0: Data copy: 115.56 ms

Device 1: FFT: 33.90 ms

Device 1: Ptwise multiplication: 1.00 ms

Device 0: FFT: 33.87 ms

Device 0: Ptwise multiplication: 0.98 ms

Device 1: iFFT: 16.88 ms

Device 0: iFFT: 17.16 ms

Device 1: Data backcopy: 39.34 ms

Device 1: Finalization: 0.47 ms

Device 0: Data backcopy: 77.05 ms

Device 0: Finalization: 0.54 ms

So the times spent in the particular stages of solverThread on the two devices are the following:

Stage                  |  Device 0   |  Device 1
-----------------------|-------------|-------------
Initialization         |  110.60 ms  |   47.27 ms
Data copy              |  115.56 ms  |  141.64 ms
FFT                    |   33.87 ms  |   33.90 ms
Ptwise multiplication  |    0.98 ms  |    1.00 ms
iFFT                   |   17.16 ms  |   16.88 ms
Data backcopy          |   77.05 ms  |   39.34 ms
Finalization           |    0.54 ms  |    0.47 ms

However, the total time measured is 2098.3 ms (including initializing the TGPUplan array, starting the threads and waiting for them to finish).

GOT IT! It’s all in cudaGetDeviceCount(&GPU_N): this call alone takes about 2 seconds! :w00t: Is there any faster way to count the available devices?
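I measured it with a plain host timer around the call, roughly like this sketch (the helper name is just for illustration):

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

// wall-clock time in milliseconds
static double wallTimeMs(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
    int GPU_N = 0;
    double t0 = wallTimeMs();
    cudaGetDeviceCount(&GPU_N);   // the call being measured
    double t1 = wallTimeMs();
    printf("cudaGetDeviceCount: %d devices, %.1f ms\n", GPU_N, t1 - t0);
    return 0;
}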

Hi,

Device 0 and 1 are ~ the same, good :) - this type of difference seems reasonable to me, especially since you’re using a GTX 295 sharing the same PCIe lanes.

You can see that most of your time is spent on passing data to the GPU and then back to the CPU (115 + 77 ms) vs. the calculation itself (33.87 + 17.16 ms).

This means you’re mostly bound by PCI-Express bandwidth.

There are a few things to check:

  1. Can you decrease the amount of data going back and forth, e.g. by compacting it?

  2. The GTX is sitting in a PCIe slot - what’s its speed? x16? What generation?

  3. You might see better results for larger datasets, i.e. making the GPU calculate more per byte transferred.

  4. You might want to look at async copies of data to the GPU, or other tricks like zero-copy - see the sketch below.
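For example, an async copy with pinned (page-locked) host memory looks roughly like this (just a sketch, buffer names are made up; even without any overlap, pinned-memory copies are usually noticeably faster than pageable ones):

// sketch: host-to-device copy using pinned memory and a stream
cufftComplex *h_signal_pinned, *d_signal;
size_t bytes = sizex * sizey * sizez * sizeof(cufftComplex);
cudaStream_t stream;

cudaStreamCreate(&stream);
cudaMallocHost((void**)&h_signal_pinned, bytes);   // page-locked host buffer
cudaMalloc((void**)&d_signal, bytes);

// ... fill h_signal_pinned with the next signal on the CPU ...

cudaMemcpyAsync(d_signal, h_signal_pinned, bytes,
                cudaMemcpyHostToDevice, stream);   // returns immediately
// ... GPU work on previously copied data can run here ...
cudaStreamSynchronize(stream);                     // wait before using d_signal

cudaFreeHost(h_signal_pinned);
cudaFree(d_signal);
cudaStreamDestroy(stream);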

edit - btw - since the initialization phase (setDevice) takes time, make sure you call it only once and keep the CPU threads alive, so that setDevice is indeed called only once per thread for the entire program duration.

hope that helps,

eyal

Sorry about the confusion, but I have set GPU_N to 2 directly (without using the cudaGetDeviceCount routine) and the time is about the same, so there is some mess in my time measuring. :( Anyway, to eyalhir74:

  1. Maybe I can reduce the amount of data passed, but it won’t make any big difference.
  2. It’s an x16 slot, generation 1.0.
  3. I think signals of size 256 x 256 x 128 are large enough. ;)
  4. Not in this case.
  5. SetDevice is called only once per thread.
  6. I still cannot figure out the difference between the total time (over 2 seconds) and the total time spent in solverThread (about 300 ms per device).

How big is CONVOLUTION_N?

Are you calculating the total time from the main CPU thread? Is that the one that takes 2 seconds?

LOL, also note that creating a CUDA context on each device will also cost you several tens of milliseconds.
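A common trick is to force context creation with a dummy call right after cudaSetDevice, before you start any timers - something like this sketch:

// sketch: pay the context-creation cost once, outside the timed region
cutilSafeCall( cudaSetDevice(plan->device) );
cudaFree(0);   // dummy call that forces the CUDA context to be created now
// ... start timers, then do the real copies / FFTs ...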

Christian