OpenMP multi-GPU, not getting the expected speedup

Hi,

I’m using OpenMP to launch my code on 2 GPUs. I have timed each GPU separately: GPU 1 takes 19 ms and GPU 2 takes 20 ms. When I run the overall code it takes 40 ms, which is roughly the sum of the two, so the GPUs appear to run one after the other rather than in parallel. I am getting the numerical results I expected, just not the performance. Am I doing something wrong in the OpenMP portion of the code? My code is as follows:

omp_set_num_threads(num_gpus);

double *variable = (double *) malloc(sizeof(double) * 2);  // one result per GPU

#pragma omp parallel
{
    unsigned int cpu_tid = omp_get_thread_num();

    // declare/allocate device memory
    // host-to-device transfer
    // set up dim3 grid and block

    kernel <<< /* grid, block */ >>> (data[cpu_tid], /* device pointers */);

    // device-to-host transfer
    variable[cpu_tid] = /* transferred local result */;

    // free memory
}

To isolate a PCIe issue, try taking the PCIe copies out of the OMP section, leaving only the kernel invocation inside it, and compare the timings.
Other than that, maybe post the real code; there may be something else going on there.
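
Roughly like this, as a sketch (the kernel, sizes, and names below are placeholders, not your actual code; note the per-device cudaSetDevice() calls, without which all work lands on device 0):

#include <omp.h>
#include <cuda_runtime.h>

__global__ void kernel(double *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0;
}

void time_kernels_without_pci(const double *data, int n, int num_gpus)
{
    size_t bytes = n * sizeof(double);
    double *dev_data[2];   // assumes num_gpus <= 2, as in this thread

    // PCIe copies done serially, outside the parallel region:
    for (int dev = 0; dev < num_gpus; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc((void **)&dev_data[dev], bytes);
        cudaMemcpy(dev_data[dev], data, bytes, cudaMemcpyHostToDevice);
    }

    omp_set_num_threads(num_gpus);

    // Only the kernel invocation stays inside the OMP section,
    // so its timing is free of PCIe traffic:
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);
        kernel<<<(n + 255) / 256, 256>>>(dev_data[tid], n);
        cudaDeviceSynchronize();
    }

    for (int dev = 0; dev < num_gpus; ++dev) {
        cudaSetDevice(dev);
        cudaFree(dev_data[dev]);
    }
}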

I think you need to bind different threads to different devices:

unsigned int tid = omp_get_thread_num();
cudaError_t cuda_status = cudaSetDevice(tid);
assert( cudaSuccess == cuda_status );
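
For example, a minimal sketch of the whole idiom (everything except the device binding is elided):

#include <omp.h>
#include <cuda_runtime.h>
#include <assert.h>

void run_one_thread_per_gpu(void)
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);   // one CPU thread per GPU

    #pragma omp parallel
    {
        unsigned int tid = omp_get_thread_num();

        // Bind this thread to its own device before any other CUDA
        // call; otherwise every thread implicitly works on device 0.
        cudaError_t cuda_status = cudaSetDevice(tid);
        assert( cudaSuccess == cuda_status );

        // per-device allocations, copies, and kernel launch go here
    }
}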

Thanks for the reply.

I believe there is an issue with PCIe; here is how I tested it in code:

#pragma omp parallel
{
    unsigned int cpu_tid = omp_get_thread_num();
//  int cpu_tid = 1;
//  CUDA_SAFE_CALL(cudaSetDevice(cpu_tid));
    assert( cudaSuccess == cudaSetDevice(cpu_tid) );

    double *dev_data, *dev_X_plus, *dev_X_minus, *dev_debugBig, *dev_debug;
    double *debug1, *debug2, *debugBig;

    int lengthDevData  = windowSize / 2;
    int lengthDevAlpha = 80;

    // allocate host memory
    debug1   = (double *) malloc(lengthDevData * sizeof(double));
    debug2   = (double *) malloc(lengthDevData * sizeof(double));
    debugBig = (double *) malloc(lengthDevData * lengthDevAlpha * sizeof(double));

    for (int i = 0; i < 10; i++) {
        cout << "Before malloc and memcpy Thread " << cpu_tid << ", i is " << i << endl;
    }

    // allocate device memory
    cudaMalloc( (void **)&dev_data,     sizeof(double) * windowSize );
    cudaMalloc( (void **)&dev_X_plus,   sizeof(double) * lengthDevAlpha * lengthDevData );
    cudaMalloc( (void **)&dev_X_minus,  sizeof(double) * lengthDevAlpha * lengthDevData );
    cudaMalloc( (void **)&dev_debugBig, sizeof(double) * lengthDevAlpha * lengthDevData );
    cudaMalloc( (void **)&dev_debug,    sizeof(double) * lengthDevAlpha );

    // copy data to dev_data, alpha to dev_alpha, and log gamma of alpha to the device
    cudaMemcpy( dev_data, data, sizeof(double) * windowSize, cudaMemcpyHostToDevice );

    dim3 Grid(lengthDevData, lengthDevAlpha);
    dim3 Block(16, 32);

//  kernelOMPCalculateXplusXminus <<<Grid, Block>>> (dev_data, dev_alpha[cpu_tid], dev_X_plus, dev_X_minus, dev_debugBig, dev_debug, cpu_tid*250);

    for (int i = 0; i < 10; i++) {
        cout << "After malloc and memcpy Thread " << cpu_tid << ", i is " << i << endl;
    }
} // end openmp

I put this in a function and used a loop to call it 100 times. In a typical output, the “Before malloc and memcpy” lines from the two threads come out interleaved, while the “After malloc and memcpy” lines come out grouped by thread.

This leads me to believe that before the malloc and memcpy calls the two CPU threads run in parallel, while after them the two threads run sequentially.

Does anyone know why this is happening and how to fix the issue?

Thanks

Because you did not follow Lung Sheng’s advice: the cudaMalloc() and cudaMemcpy() calls will operate on the same device, and as they are synchronous operations, they will be serialized.
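
To make the failure mode concrete, here is a sketch of what happens without the binding (host_buf and bytes are placeholders):

#include <omp.h>
#include <cuda_runtime.h>

void unbound_threads_demo(const double *host_buf, size_t bytes)
{
    #pragma omp parallel   // two CPU threads...
    {
        // ...but no cudaSetDevice(), so both threads implicitly use
        // device 0. cudaMalloc() and cudaMemcpy() are synchronous, so
        // the driver runs them one after the other: the second thread
        // waits for the first, and the region takes roughly the sum
        // of the two per-thread times.
        double *dev_buf;
        cudaMalloc((void **)&dev_buf, bytes);
        cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
        cudaFree(dev_buf);
    }
}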

Thanks Tera. Actually, I wrote too quickly in my last post; more specifically, the for loops I used to judge parallel versus serial execution were too short.

The real problem is that I made a newbie error in my time measurement: I used my single-thread timing function instead of OpenMP’s timing functions. Everything works fine after fixing this error.
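
For reference, a minimal sketch of the corrected measurement using omp_get_wtime(), which returns wall-clock time and therefore captures both GPUs running concurrently (the per-GPU work is elided):

#include <omp.h>
#include <cuda_runtime.h>
#include <cstdio>

void timed_multi_gpu_run(int num_gpus)
{
    omp_set_num_threads(num_gpus);

    double t0 = omp_get_wtime();   // wall-clock start, not per-thread CPU time

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);
        // ... allocate, copy, and launch the kernel on GPU 'tid' ...
        cudaDeviceSynchronize();   // make sure the GPU work has finished
    }

    double t1 = omp_get_wtime();   // wall-clock end
    printf("elapsed: %.3f ms\n", (t1 - t0) * 1e3);
}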

Thanks to everyone that replied.