OpenMP multi-GPU, not getting the expected speedup

Hi,

I’m using OpenMP to launch my code on 2 GPUs. I have timed each GPU separately: GPU 1 takes 19 ms and GPU 2 takes 20 ms. When I run the overall code it takes 40 ms, which is roughly the sum of the two, so the GPUs appear to run one after the other rather than in parallel. I am getting the numerical results I expected, just not the performance. Am I doing something wrong in the OpenMP portion of the code? My code is as follows:

omp_set_num_threads(num_gpus);

double *variable = (double *) malloc(sizeof(double) * 2);  // one result per GPU

#pragma omp parallel
{
    unsigned int cpu_tid = omp_get_thread_num();

    // declare/allocate device memory
    // host-to-device transfer
    // set up dim3 grid and block

    kernel <<< /* grid, block */ >>> (data[cpu_tid], /* device pointers */);

    // device-to-host transfer
    variable[cpu_tid] = /* transferred local result */;

    // free memory
}

To isolate a PCIe issue, try taking the PCIe copies out of the OMP section, leaving only the kernel invocation inside it, and compare the timings.
Other than that, maybe post the real code; there may be something else going on there.
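
Roughly like this, as a sketch (the kernel, sizes, and names below are placeholders, not your actual code; note the per-device cudaSetDevice() calls, without which all work lands on device 0):

#include <omp.h>
#include <cuda_runtime.h>

__global__ void kernel(double *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0;
}

void time_kernels_without_pci(const double *data, int n, int num_gpus)
{
    size_t bytes = n * sizeof(double);
    double *dev_data[2];   // assumes num_gpus <= 2, as in this thread

    // PCIe copies done serially, outside the parallel region:
    for (int dev = 0; dev < num_gpus; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc((void **)&dev_data[dev], bytes);
        cudaMemcpy(dev_data[dev], data, bytes, cudaMemcpyHostToDevice);
    }

    omp_set_num_threads(num_gpus);

    // Only the kernel invocation stays inside the OMP section,
    // so its timing is free of PCIe traffic:
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);
        kernel<<<(n + 255) / 256, 256>>>(dev_data[tid], n);
        cudaDeviceSynchronize();
    }

    for (int dev = 0; dev < num_gpus; ++dev) {
        cudaSetDevice(dev);
        cudaFree(dev_data[dev]);
    }
}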

I think you need to bind different threads to different devices:

unsigned int tid = omp_get_thread_num();
cudaError_t cuda_status = cudaSetDevice(tid);
assert( cudaSuccess == cuda_status );
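
For example, a minimal sketch of the whole idiom (everything except the device binding is elided):

#include <omp.h>
#include <cuda_runtime.h>
#include <assert.h>

void run_one_thread_per_gpu(void)
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    omp_set_num_threads(num_gpus);   // one CPU thread per GPU

    #pragma omp parallel
    {
        unsigned int tid = omp_get_thread_num();

        // Bind this thread to its own device before any other CUDA
        // call; otherwise every thread implicitly works on device 0.
        cudaError_t cuda_status = cudaSetDevice(tid);
        assert( cudaSuccess == cuda_status );

        // per-device allocations, copies, and kernel launch go here
    }
}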

Thanks for the reply.

I believe there is an issue with PCIe; here is how I tested it in code:

#pragma omp parallel
{
    unsigned int cpu_tid = omp_get_thread_num();
//  int cpu_tid = 1;
//  CUDA_SAFE_CALL(cudaSetDevice(cpu_tid));
    assert( cudaSuccess == cudaSetDevice(cpu_tid) );

    double *dev_data, *dev_X_plus, *dev_X_minus, *dev_debugBig, *dev_debug;
    double *debug1, *debug2, *debugBig;

    int lengthDevData  = windowSize / 2;
    int lengthDevAlpha = 80;

    // allocate host memory
    debug1   = (double *) malloc(lengthDevData * sizeof(double));
    debug2   = (double *) malloc(lengthDevData * sizeof(double));
    debugBig = (double *) malloc(lengthDevData * lengthDevAlpha * sizeof(double));

    for (int i = 0; i < 10; i++) {
        cout << "Before malloc and memcpy Thread " << cpu_tid << ", i is " << i << endl;
    }

    // allocate device memory
    cudaMalloc( (void **)&dev_data,     sizeof(double) * windowSize );
    cudaMalloc( (void **)&dev_X_plus,   sizeof(double) * lengthDevAlpha * lengthDevData );
    cudaMalloc( (void **)&dev_X_minus,  sizeof(double) * lengthDevAlpha * lengthDevData );
    cudaMalloc( (void **)&dev_debugBig, sizeof(double) * lengthDevAlpha * lengthDevData );
    cudaMalloc( (void **)&dev_debug,    sizeof(double) * lengthDevAlpha );

    // copy data to dev_data, alpha to dev_alpha, and log gamma of alpha to the device
    cudaMemcpy( dev_data, data, sizeof(double) * windowSize, cudaMemcpyHostToDevice );

    dim3 Grid(lengthDevData, lengthDevAlpha);
    dim3 Block(16, 32);

//  kernelOMPCalculateXplusXminus <<<Grid, Block>>> (dev_data, dev_alpha[cpu_tid], dev_X_plus, dev_X_minus, dev_debugBig, dev_debug, cpu_tid*250);

    for (int i = 0; i < 10; i++) {
        cout << "After malloc and memcpy Thread " << cpu_tid << ", i is " << i << endl;
    }
} // end openmp

I put this in a function and used a loop to call it 100 times. In a typical output, the “Before malloc and memcpy” lines from the two threads come out interleaved, while the “After malloc and memcpy” lines come out grouped by thread.

This leads me to believe that before the malloc and memcpy calls the two CPU threads run in parallel, while after them the two threads run sequentially.

Does anyone know why this is happening and how to fix the issue?

Thanks

Because you did not follow Lung Sheng’s advice: the cudaMalloc() and cudaMemcpy() calls will operate on the same device, and as they are synchronous operations, they will be serialized.
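
To make the failure mode concrete, here is a sketch of what happens without the binding (host_buf and bytes are placeholders):

#include <omp.h>
#include <cuda_runtime.h>

void unbound_threads_demo(const double *host_buf, size_t bytes)
{
    #pragma omp parallel   // two CPU threads...
    {
        // ...but no cudaSetDevice(), so both threads implicitly use
        // device 0. cudaMalloc() and cudaMemcpy() are synchronous, so
        // the driver runs them one after the other: the second thread
        // waits for the first, and the region takes roughly the sum
        // of the two per-thread times.
        double *dev_buf;
        cudaMalloc((void **)&dev_buf, bytes);
        cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
        cudaFree(dev_buf);
    }
}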

Thanks Tera. Actually, I wrote too quickly in my last post; more specifically, the for loops I used to judge parallel versus serial execution were too short.

The real problem is that I made a newbie error in my time measurement: I used my single-thread timing function instead of OpenMP’s timing functions. Everything works fine after fixing this error.
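
For reference, a minimal sketch of the corrected measurement using omp_get_wtime(), which returns wall-clock time and therefore captures both GPUs running concurrently (the per-GPU work is elided):

#include <omp.h>
#include <cuda_runtime.h>
#include <cstdio>

void timed_multi_gpu_run(int num_gpus)
{
    omp_set_num_threads(num_gpus);

    double t0 = omp_get_wtime();   // wall-clock start, not per-thread CPU time

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);
        // ... allocate, copy, and launch the kernel on GPU 'tid' ...
        cudaDeviceSynchronize();   // make sure the GPU work has finished
    }

    double t1 = omp_get_wtime();   // wall-clock end
    printf("elapsed: %.3f ms\n", (t1 - t0) * 1e3);
}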

Thanks to everyone that replied.