How to Avoid Re-Creating CUDA Context with Multi-GPU using OpenMP? Call a function that does omp par

I’m trying to do some multi-gpu work under 2.3 beta (haven’t gotten around to installing the release) on a GTX 295 using OpenMP.

I’m can’t seem to come up with a way to prevent the overhead of context creation when running a function multiple times.

My code flow is similar to that in the cudaOpenMP SDK example: I set up some data, then use “#pragma omp parallel” to process the data on the two GPUs.

After doing the “#pragma omp parallel”, a cudaSetDevice must be done to associate the host thread with one of the GPUs. This call has to be done before the CUDA context can be created.

The above described code has been combined into a single host function which I would like to call multiple times. However, if I do not insert a cudaThreadExit at the end of the function I get an error when calling it a second time

As I understand it, by doing a cudaThreadExit I am destroying the CUDA context and will have to recreate it the next time I call into the function, incurring the undesired overhead.

Is there any way to prevent having to recreate the context every time?

The easiest way to demonstrate this situation would be to simply add a for loop into the cudaOpenMP SDK example (In red below - above initialize data comment):

[codebox]

/*

  • Multi-GPU sample using OpenMP for threading on the CPU side

  • needs a compiler that supports OpenMP 2.0

*/

#include <omp.h>

#include <stdio.h> // stdio functions are used since C++ streams aren’t necessarily thread safe

#include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b

global void kernelAddConstant(int *g_a, const int b)

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

g_a[idx] += b;

}

// a predicate that checks whether each array elemen is set to its index plus b

int correctResult(int *data, const int n, const int b)

{

for(int i = 0; i < n; i++)

	if(data[i] != i + b)

		return 0;

return 1;

}

int main(int argc, char *argv)

{

int num_gpus = 0;	// number of CUDA GPUs

/////////////////////////////////////////////////////////////////

// determine the number of CUDA capable GPUs

//

cudaGetDeviceCount(&num_gpus);

if(num_gpus < 1)

{

	printf("no CUDA capable devices were detected\n");

	return 1;

}

/////////////////////////////////////////////////////////////////

// display CPU and GPU configuration

//

printf("number of host CPUs:\t%d\n", omp_get_num_procs());

printf("number of CUDA devices:\t%d\n", num_gpus);

for(int i = 0; i < num_gpus; i++)

{

    cudaDeviceProp dprop;

    cudaGetDeviceProperties(&dprop, i);

	printf("   %d: %s\n", i, dprop.name);

}

printf("---------------------------\n");

for( int j = 0; j < 2; j++ )

{

	/////////////////////////////////////////////////////////////////

	// initialize data

	//

	unsigned int n = num_gpus * 8192;

	unsigned int nbytes = n * sizeof(int);

	int *a = 0;		// pointer to data on the CPU

	int b = 3;		// value by which the array is incremented

	a = (int*)malloc(nbytes);

	if(0 == a)

	{

		printf("couldn't allocate CPU memory\n");

		return 1;

	}

	for(unsigned int i = 0; i < n; i++)

		a[i] = i;

    

	////////////////////////////////////////////////////////////////

	// run as many CPU threads as there are CUDA devices

	//   each CPU thread controls a different device, processing its

	//   portion of the data.  It's possible to use more CPU threads

	//   than there are CUDA devices, in which case several CPU

	//   threads will be allocating resources and launching kernels

	//   on the same device.  For example, try omp_set_num_threads(2*num_gpus);

	//   Recall that all variables declared inside an "omp parallel" scope are

	//   local to each CPU thread

	//

	omp_set_num_threads(num_gpus);	// create as many CPU threads as there are CUDA devices

	//omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices

#pragma omp parallel

	{		

		unsigned int cpu_thread_id = omp_get_thread_num();

		unsigned int num_cpu_threads = omp_get_num_threads();

		

		// set and check the CUDA device for this CPU thread

		int gpu_id = -1;

		CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));	// "% num_gpus" allows more CPU threads than GPU devices

		CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));

		printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

		int *d_a = 0;	// pointer to memory on the device associated with this CPU thread

		int *sub_a = a + cpu_thread_id * n / num_cpu_threads;	// pointer to this CPU thread's portion of data

		unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;

		dim3 gpu_threads(128);	// 128 threads per block

		dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

		CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));

		kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

		CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));

		CUDA_SAFE_CALL(cudaFree(d_a));

	}

	printf("---------------------------\n");

	

	if(cudaSuccess != cudaGetLastError())

		printf("%s\n", cudaGetErrorString(cudaGetLastError()));

	////////////////////////////////////////////////////////////////

	// check the result

	//

	if(correctResult(a, n, b))

		printf("Test PASSED\n");

	else

		printf("Test FAILED\n");

	free(a);	// free CPU memory

}

cudaThreadExit();

cutilExit(argc, argv);

return 0;

}

[/codebox]

Thank you for the assistance!

It appears that as long as I do the cudaSetDevice once, later iterations will continue to be assigned to the same device. See the colored regions below:

[codebox]/*

  • Multi-GPU sample using OpenMP for threading on the CPU side

  • needs a compiler that supports OpenMP 2.0

*/

#include <omp.h>

#include <stdio.h> // stdio functions are used since C++ streams aren’t necessarily thread safe

#include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b

global void kernelAddConstant(int *g_a, const int b)

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

g_a[idx] += b;

}

// a predicate that checks whether each array elemen is set to its index plus b

int correctResult(int *data, const int n, const int b)

{

for(int i = 0; i < n; i++)

	if(data[i] != i + b)

		return 0;

return 1;

}

int main(int argc, char *argv)

{

int num_gpus = 0;	// number of CUDA GPUs

/////////////////////////////////////////////////////////////////

// determine the number of CUDA capable GPUs

//

cudaGetDeviceCount(&num_gpus);

if(num_gpus < 1)

{

	printf("no CUDA capable devices were detected\n");

	return 1;

}

/////////////////////////////////////////////////////////////////

// display CPU and GPU configuration

//

printf("number of host CPUs:\t%d\n", omp_get_num_procs());

printf("number of CUDA devices:\t%d\n", num_gpus);

for(int i = 0; i < num_gpus; i++)

{

    cudaDeviceProp dprop;

    cudaGetDeviceProperties(&dprop, i);

	printf("   %d: %s\n", i, dprop.name);

}

printf("---------------------------\n");

for( int j = 0; j < 20; j++ )

{

	/////////////////////////////////////////////////////////////////

	// initialize data

	//

	unsigned int n = num_gpus * 8192;

	unsigned int nbytes = n * sizeof(int);

	int *a = 0;		// pointer to data on the CPU

	int b = 3;		// value by which the array is incremented

	a = (int*)malloc(nbytes);

	if(0 == a)

	{

		printf("couldn't allocate CPU memory\n");

		return 1;

	}

	for(unsigned int i = 0; i < n; i++)

		a[i] = i;

    

	////////////////////////////////////////////////////////////////

	// run as many CPU threads as there are CUDA devices

	//   each CPU thread controls a different device, processing its

	//   portion of the data.  It's possible to use more CPU threads

	//   than there are CUDA devices, in which case several CPU

	//   threads will be allocating resources and launching kernels

	//   on the same device.  For example, try omp_set_num_threads(2*num_gpus);

	//   Recall that all variables declared inside an "omp parallel" scope are

	//   local to each CPU thread

	//

	omp_set_num_threads(num_gpus);	// create as many CPU threads as there are CUDA devices

	//omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices

#pragma omp parallel

	{		

		unsigned int cpu_thread_id = omp_get_thread_num();

		unsigned int num_cpu_threads = omp_get_num_threads();

		

		// set and check the CUDA device for this CPU thread

		int gpu_id = -1;

		if( j == 0 )

		{

			CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));	// "% num_gpus" allows more CPU threads than GPU devices

		}

		CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));

		printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

		int *d_a = 0;	// pointer to memory on the device associated with this CPU thread

		int *sub_a = a + cpu_thread_id * n / num_cpu_threads;	// pointer to this CPU thread's portion of data

		unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;

		dim3 gpu_threads(128);	// 128 threads per block

		dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

		CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));

		kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

		CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));

		CUDA_SAFE_CALL(cudaFree(d_a));

	}

	printf("---------------------------\n");

	

	if(cudaSuccess != cudaGetLastError())

		printf("%s\n", cudaGetErrorString(cudaGetLastError()));

	////////////////////////////////////////////////////////////////

	// check the result

	//

	if(correctResult(a, n, b))

		printf("Test PASSED\n");

	else

		printf("Test FAILED\n");

	free(a);	// free CPU memory

}

cudaThreadExit();

cutilExit(argc, argv);

return 0;

}

[/codebox]

Results in:

[codebox]number of host CPUs: 2

number of CUDA devices: 2

0: GeForce GTX 295

1: GeForce GTX 295


CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

Press ENTER to exit…[/codebox]

Is it safe to assume this behavior is reliable?

If this is the case, I can just

  1. create a function called createContext() which will set the devices

  2. Remove the setDevice from my function

  3. call my function as many times as I want

As in:

[codebox]/*

  • Multi-GPU sample using OpenMP for threading on the CPU side

  • needs a compiler that supports OpenMP 2.0

*/

#include <omp.h>

#include <stdio.h> // stdio functions are used since C++ streams aren’t necessarily thread safe

#include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b

global void kernelAddConstant(int *g_a, const int b)

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

g_a[idx] += b;

}

// a predicate that checks whether each array elemen is set to its index plus b

int correctResult(int *data, const int n, const int b)

{

for(int i = 0; i < n; i++)

	if(data[i] != i + b)

		return 0;

return 1;

}

int createContext()

{

int num_gpus = 0;	// number of CUDA GPUs

/////////////////////////////////////////////////////////////////

// determine the number of CUDA capable GPUs

//

cudaGetDeviceCount(&num_gpus);

if(num_gpus < 1)

{

	printf("no CUDA capable devices were detected\n");

	return 1;

}

/////////////////////////////////////////////////////////////////

// display CPU and GPU configuration

//

printf("number of host CPUs:\t%d\n", omp_get_num_procs());

printf("number of CUDA devices:\t%d\n", num_gpus);

for(int i = 0; i < num_gpus; i++)

{

	cudaDeviceProp dprop;

	cudaGetDeviceProperties(&dprop, i);

	printf("   %d: %s\n", i, dprop.name);

}

printf("---------------------------\n");

omp_set_num_threads(num_gpus);	// create as many CPU threads as there are CUDA devices

#pragma omp parallel

{		

	unsigned int cpu_thread_id = omp_get_thread_num();

	CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));	// "% num_gpus" allows more CPU threads than GPU devices

}

return 0;

}

int main(int argc, char *argv)

{

int num_gpus = 0;	// number of CUDA GPUs

createContext();

/////////////////////////////////////////////////////////////////

// determine the number of CUDA capable GPUs

//

cudaGetDeviceCount(&num_gpus);

if(num_gpus < 1)

{

	printf("no CUDA capable devices were detected\n");

	return 1;

}

for( int j = 0; j < 20; j++ )

{

	/////////////////////////////////////////////////////////////////

	// initialize data

	//

	unsigned int n = num_gpus * 8192;

	unsigned int nbytes = n * sizeof(int);

	int *a = 0;		// pointer to data on the CPU

	int b = 3;		// value by which the array is incremented

	a = (int*)malloc(nbytes);

	if(0 == a)

	{

		printf("couldn't allocate CPU memory\n");

		return 1;

	}

	for(unsigned int i = 0; i < n; i++)

		a[i] = i;

    

	////////////////////////////////////////////////////////////////

	// run as many CPU threads as there are CUDA devices

	//   each CPU thread controls a different device, processing its

	//   portion of the data.  It's possible to use more CPU threads

	//   than there are CUDA devices, in which case several CPU

	//   threads will be allocating resources and launching kernels

	//   on the same device.  For example, try omp_set_num_threads(2*num_gpus);

	//   Recall that all variables declared inside an "omp parallel" scope are

	//   local to each CPU thread

	//

	omp_set_num_threads(num_gpus);	// create as many CPU threads as there are CUDA devices

	//omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices

#pragma omp parallel

	{		

		unsigned int cpu_thread_id = omp_get_thread_num();

		unsigned int num_cpu_threads = omp_get_num_threads();

		

		// set and check the CUDA device for this CPU thread

		int gpu_id = -1;

		// CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));	// "% num_gpus" allows more CPU threads than GPU devices

		CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));

		printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

		int *d_a = 0;	// pointer to memory on the device associated with this CPU thread

		int *sub_a = a + cpu_thread_id * n / num_cpu_threads;	// pointer to this CPU thread's portion of data

		unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;

		dim3 gpu_threads(128);	// 128 threads per block

		dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

		CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));

		kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

		CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));

		CUDA_SAFE_CALL(cudaFree(d_a));

	}

	printf("---------------------------\n");

	

	if(cudaSuccess != cudaGetLastError())

		printf("%s\n", cudaGetErrorString(cudaGetLastError()));

	////////////////////////////////////////////////////////////////

	// check the result

	//

	if(correctResult(a, n, b))

		printf("Test PASSED\n");

	else

		printf("Test FAILED\n");

	free(a);	// free CPU memory

	

	

}

cudaThreadExit();

cutilExit(argc, argv);

return 0;

}

[/codebox]

Which has the same result as the earlier code:

[codebox]number of host CPUs: 2

number of CUDA devices: 2

0: GeForce GTX 295

1: GeForce GTX 295


CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1


Test PASSED[/codebox]

Any feedback would be appreciated!

A context is established between a CPU thread and a GPU. Currently, context is established during the first CUDA call that modifies state (think, cudaMalloc, kernel launch, etc.). cudaSetDevice, sets the device for the calling CPU thread’s context, if a context is not yet created, but does not create a context. After a context is created, calling cudaSetDevice will not set a device and should return an error (as you observed). A context is active until the CPU thread calls cudaThreadExit or the thread terminates.

So, going back to your case, you need to call cudaSetDevice once per OpenMP thread, prior to context being created for that thread. Once a thread has a context, make CUDA calls as usual.

Paulius