How to Avoid Re-Creating CUDA Context with Multi-GPU using OpenMP? Call a function that does omp par

jph4599 · July 28, 2009, 8:50pm

Iâ€™m trying to do some multi-gpu work under 2.3 beta (havenâ€™t gotten around to installing the release) on a GTX 295 using OpenMP.

Iâ€™m canâ€™t seem to come up with a way to prevent the overhead of context creation when running a function multiple times.

My code flow is similar to that in the cudaOpenMP SDK example: I set up some data, then use â€œ#pragma omp parallelâ€ to process the data on the two GPUs.

After doing the â€œ#pragma omp parallelâ€, a cudaSetDevice must be done to associate the host thread with one of the GPUs. This call has to be done before the CUDA context can be created.

The above described code has been combined into a single host function which I would like to call multiple times. However, if I do not insert a cudaThreadExit at the end of the function I get an error when calling it a second time

As I understand it, by doing a cudaThreadExit I am destroying the CUDA context and will have to recreate it the next time I call into the function, incurring the undesired overhead.

Is there any way to prevent having to recreate the context every time?

The easiest way to demonstrate this situation would be to simply add a for loop into the cudaOpenMP SDK example (In red below - above initialize data comment):

[codebox]

/*

Multi-GPU sample using OpenMP for threading on the CPU side
needs a compiler that supports OpenMP 2.0

*/

include <omp.h>

include <stdio.h> // stdio functions are used since C++ streams aren’t necessarily thread safe

include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b

global void kernelAddConstant(int *g_a, const int b)

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

g_a[idx] += b;

}

// a predicate that checks whether each array elemen is set to its index plus b

int correctResult(int *data, const int n, const int b)

{

for(int i = 0; i < n; i++)

	if(data[i] != i + b)

		return 0;

return 1;

}

int main(int argc, char *argv)

{

int num_gpus = 0;	// number of CUDA GPUs

/////////////////////////////////////////////////////////////////

// determine the number of CUDA capable GPUs

//

cudaGetDeviceCount(&num_gpus);

if(num_gpus < 1)

{

	printf("no CUDA capable devices were detected\n");

	return 1;

}

/////////////////////////////////////////////////////////////////

// display CPU and GPU configuration

//

printf("number of host CPUs:\t%d\n", omp_get_num_procs());

printf("number of CUDA devices:\t%d\n", num_gpus);

for(int i = 0; i < num_gpus; i++)

{

    cudaDeviceProp dprop;

    cudaGetDeviceProperties(&dprop, i);

	printf("   %d: %s\n", i, dprop.name);

}

printf("---------------------------\n");

for( int j = 0; j < 2; j++ )

{

	/////////////////////////////////////////////////////////////////

	// initialize data

	//

	unsigned int n = num_gpus * 8192;

	unsigned int nbytes = n * sizeof(int);

	int *a = 0;		// pointer to data on the CPU

	int b = 3;		// value by which the array is incremented

	a = (int*)malloc(nbytes);

	if(0 == a)

	{

		printf("couldn't allocate CPU memory\n");

		return 1;

	}

	for(unsigned int i = 0; i < n; i++)

		a[i] = i;

    

	////////////////////////////////////////////////////////////////

	// run as many CPU threads as there are CUDA devices

	//   each CPU thread controls a different device, processing its

	//   portion of the data.  It's possible to use more CPU threads

	//   than there are CUDA devices, in which case several CPU

	//   threads will be allocating resources and launching kernels

	//   on the same device.  For example, try omp_set_num_threads(2*num_gpus);

	//   Recall that all variables declared inside an "omp parallel" scope are

	//   local to each CPU thread

	//

	omp_set_num_threads(num_gpus);	// create as many CPU threads as there are CUDA devices

	//omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices

#pragma omp parallel

	{		

		unsigned int cpu_thread_id = omp_get_thread_num();

		unsigned int num_cpu_threads = omp_get_num_threads();

		

		// set and check the CUDA device for this CPU thread

		int gpu_id = -1;

		CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));	// "% num_gpus" allows more CPU threads than GPU devices

		CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));

		printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

		int *d_a = 0;	// pointer to memory on the device associated with this CPU thread

		int *sub_a = a + cpu_thread_id * n / num_cpu_threads;	// pointer to this CPU thread's portion of data

		unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;

		dim3 gpu_threads(128);	// 128 threads per block

		dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

		CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));

		kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

		CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));

		CUDA_SAFE_CALL(cudaFree(d_a));

	}

	printf("---------------------------\n");

	

	if(cudaSuccess != cudaGetLastError())

		printf("%s\n", cudaGetErrorString(cudaGetLastError()));

	////////////////////////////////////////////////////////////////

	// check the result

	//

	if(correctResult(a, n, b))

		printf("Test PASSED\n");

	else

		printf("Test FAILED\n");

	free(a);	// free CPU memory

}

cudaThreadExit();

cutilExit(argc, argv);

return 0;

}

[/codebox]

Thank you for the assistance!

jph4599 · July 28, 2009, 9:24pm

It appears that as long as I do the cudaSetDevice once, later iterations will continue to be assigned to the same device. See the colored regions below:

[codebox]/*

Multi-GPU sample using OpenMP for threading on the CPU side
needs a compiler that supports OpenMP 2.0

*/

#include <omp.h>

#include <stdio.h> // stdio functions are used since C++ streams aren’t necessarily thread safe

#include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b

global void kernelAddConstant(int *g_a, const int b)

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

g_a[idx] += b;

}

// a predicate that checks whether each array elemen is set to its index plus b

int correctResult(int *data, const int n, const int b)

{

for(int i = 0; i < n; i++)

	if(data[i] != i + b)

		return 0;

return 1;

}

int main(int argc, char *argv)

{

int num_gpus = 0;	// number of CUDA GPUs

/////////////////////////////////////////////////////////////////

// determine the number of CUDA capable GPUs

//

cudaGetDeviceCount(&num_gpus);

if(num_gpus < 1)

{

	printf("no CUDA capable devices were detected\n");

	return 1;

}

/////////////////////////////////////////////////////////////////

// display CPU and GPU configuration

//

printf("number of host CPUs:\t%d\n", omp_get_num_procs());

printf("number of CUDA devices:\t%d\n", num_gpus);

for(int i = 0; i < num_gpus; i++)

{

    cudaDeviceProp dprop;

    cudaGetDeviceProperties(&dprop, i);

	printf("   %d: %s\n", i, dprop.name);

}

printf("---------------------------\n");

for( int j = 0; j < 20; j++ )

{

	/////////////////////////////////////////////////////////////////

	// initialize data

	//

	unsigned int n = num_gpus * 8192;

	unsigned int nbytes = n * sizeof(int);

	int *a = 0;		// pointer to data on the CPU

	int b = 3;		// value by which the array is incremented

	a = (int*)malloc(nbytes);

	if(0 == a)

	{

		printf("couldn't allocate CPU memory\n");

		return 1;

	}

	for(unsigned int i = 0; i < n; i++)

		a[i] = i;

    

	////////////////////////////////////////////////////////////////

	// run as many CPU threads as there are CUDA devices

	//   each CPU thread controls a different device, processing its

	//   portion of the data.  It's possible to use more CPU threads

	//   than there are CUDA devices, in which case several CPU

	//   threads will be allocating resources and launching kernels

	//   on the same device.  For example, try omp_set_num_threads(2*num_gpus);

	//   Recall that all variables declared inside an "omp parallel" scope are

	//   local to each CPU thread

	//

	omp_set_num_threads(num_gpus);	// create as many CPU threads as there are CUDA devices

	//omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices

#pragma omp parallel

	{		

		unsigned int cpu_thread_id = omp_get_thread_num();

		unsigned int num_cpu_threads = omp_get_num_threads();

		

		// set and check the CUDA device for this CPU thread

		int gpu_id = -1;

		if( j == 0 )

		{

			CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));	// "% num_gpus" allows more CPU threads than GPU devices

		}

		CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));

		printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

		int *d_a = 0;	// pointer to memory on the device associated with this CPU thread

		int *sub_a = a + cpu_thread_id * n / num_cpu_threads;	// pointer to this CPU thread's portion of data

		unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;

		dim3 gpu_threads(128);	// 128 threads per block

		dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

		CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));

		kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

		CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));

		CUDA_SAFE_CALL(cudaFree(d_a));

	}

	printf("---------------------------\n");

	

	if(cudaSuccess != cudaGetLastError())

		printf("%s\n", cudaGetErrorString(cudaGetLastError()));

	////////////////////////////////////////////////////////////////

	// check the result

	//

	if(correctResult(a, n, b))

		printf("Test PASSED\n");

	else

		printf("Test FAILED\n");

	free(a);	// free CPU memory

}

cudaThreadExit();

cutilExit(argc, argv);

return 0;

}

[/codebox]

Results in:

[codebox]number of host CPUs: 2

number of CUDA devices: 2

0: GeForce GTX 295

1: GeForce GTX 295

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

Press ENTER to exit…[/codebox]

Is it safe to assume this behavior is reliable?

If this is the case, I can just

create a function called createContext() which will set the devices
Remove the setDevice from my function
call my function as many times as I want

As in:

[codebox]/*

Multi-GPU sample using OpenMP for threading on the CPU side
needs a compiler that supports OpenMP 2.0

*/

#include <omp.h>

#include <stdio.h> // stdio functions are used since C++ streams aren’t necessarily thread safe

#include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b

global void kernelAddConstant(int *g_a, const int b)

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

g_a[idx] += b;

}

// a predicate that checks whether each array elemen is set to its index plus b

int correctResult(int *data, const int n, const int b)

{

for(int i = 0; i < n; i++)

	if(data[i] != i + b)

		return 0;

return 1;

}

int createContext()

{

int num_gpus = 0;	// number of CUDA GPUs

/////////////////////////////////////////////////////////////////

// determine the number of CUDA capable GPUs

//

cudaGetDeviceCount(&num_gpus);

if(num_gpus < 1)

{

	printf("no CUDA capable devices were detected\n");

	return 1;

}

/////////////////////////////////////////////////////////////////

// display CPU and GPU configuration

//

printf("number of host CPUs:\t%d\n", omp_get_num_procs());

printf("number of CUDA devices:\t%d\n", num_gpus);

for(int i = 0; i < num_gpus; i++)

{

	cudaDeviceProp dprop;

	cudaGetDeviceProperties(&dprop, i);

	printf("   %d: %s\n", i, dprop.name);

}

printf("---------------------------\n");

omp_set_num_threads(num_gpus);	// create as many CPU threads as there are CUDA devices

#pragma omp parallel

{		

	unsigned int cpu_thread_id = omp_get_thread_num();

	CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));	// "% num_gpus" allows more CPU threads than GPU devices

}

return 0;

}

int main(int argc, char *argv)

{

int num_gpus = 0;	// number of CUDA GPUs

createContext();

/////////////////////////////////////////////////////////////////

// determine the number of CUDA capable GPUs

//

cudaGetDeviceCount(&num_gpus);

if(num_gpus < 1)

{

	printf("no CUDA capable devices were detected\n");

	return 1;

}

for( int j = 0; j < 20; j++ )

{

	/////////////////////////////////////////////////////////////////

	// initialize data

	//

	unsigned int n = num_gpus * 8192;

	unsigned int nbytes = n * sizeof(int);

	int *a = 0;		// pointer to data on the CPU

	int b = 3;		// value by which the array is incremented

	a = (int*)malloc(nbytes);

	if(0 == a)

	{

		printf("couldn't allocate CPU memory\n");

		return 1;

	}

	for(unsigned int i = 0; i < n; i++)

		a[i] = i;

    

	////////////////////////////////////////////////////////////////

	// run as many CPU threads as there are CUDA devices

	//   each CPU thread controls a different device, processing its

	//   portion of the data.  It's possible to use more CPU threads

	//   than there are CUDA devices, in which case several CPU

	//   threads will be allocating resources and launching kernels

	//   on the same device.  For example, try omp_set_num_threads(2*num_gpus);

	//   Recall that all variables declared inside an "omp parallel" scope are

	//   local to each CPU thread

	//

	omp_set_num_threads(num_gpus);	// create as many CPU threads as there are CUDA devices

	//omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices

#pragma omp parallel

	{		

		unsigned int cpu_thread_id = omp_get_thread_num();

		unsigned int num_cpu_threads = omp_get_num_threads();

		

		// set and check the CUDA device for this CPU thread

		int gpu_id = -1;

		// CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));	// "% num_gpus" allows more CPU threads than GPU devices

		CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));

		printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

		int *d_a = 0;	// pointer to memory on the device associated with this CPU thread

		int *sub_a = a + cpu_thread_id * n / num_cpu_threads;	// pointer to this CPU thread's portion of data

		unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;

		dim3 gpu_threads(128);	// 128 threads per block

		dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

		CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));

		CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));

		kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

		CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));

		CUDA_SAFE_CALL(cudaFree(d_a));

	}

	printf("---------------------------\n");

	

	if(cudaSuccess != cudaGetLastError())

		printf("%s\n", cudaGetErrorString(cudaGetLastError()));

	////////////////////////////////////////////////////////////////

	// check the result

	//

	if(correctResult(a, n, b))

		printf("Test PASSED\n");

	else

		printf("Test FAILED\n");

	free(a);	// free CPU memory

	

	

}

cudaThreadExit();

cutilExit(argc, argv);

return 0;

}

[/codebox]

Which has the same result as the earlier code:

[codebox]number of host CPUs: 2

number of CUDA devices: 2

0: GeForce GTX 295

1: GeForce GTX 295

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

CPU thread 0 (of 2) uses CUDA device 0

CPU thread 1 (of 2) uses CUDA device 1

Test PASSED[/codebox]

Any feedback would be appreciated!

paulius · July 28, 2009, 9:47pm

A context is established between a CPU thread and a GPU. Currently, context is established during the first CUDA call that modifies state (think, cudaMalloc, kernel launch, etc.). cudaSetDevice, sets the device for the calling CPU thread’s context, if a context is not yet created, but does not create a context. After a context is created, calling cudaSetDevice will not set a device and should return an error (as you observed). A context is active until the CPU thread calls cudaThreadExit or the thread terminates.

So, going back to your case, you need to call cudaSetDevice once per OpenMP thread, prior to context being created for that thread. Once a thread has a context, make CUDA calls as usual.

Paulius

Topic		Replies	Views
CUDA & openMP Problem with the SDK sample code CUDA Programming and Performance	11	14029	September 12, 2015
Not answered question? CUDA and OpenMP? CUDA Programming and Performance	28	10909	September 30, 2010
OpenMP + CUDA Multiple Parallel Sections Does GPU to Thread linking persist across multiple parallel CUDA Programming and Performance	11	3603	June 29, 2011
unable to get the cpu and gpu to run in parallel CUDA Programming and Performance	34	23225	October 7, 2010
MultiGPU start help CUDA Programming and Performance	8	10523	August 10, 2010
Cuda code performance CUDA Programming and Performance	14	3173	December 16, 2014
Multi stream multi GPU CUDA Programming and Performance cuda	9	1194	October 6, 2023
GPU Pro Tip: CUDA 7 Streams Simplify Concurrency Technical Blog	51	2157	February 5, 2020
What can't you do in CUDA that you'd like? Requests for the future CUDA Programming and Performance	407	134653	May 26, 2010
How to use OpenMP map directive to map dynamic array inside a struct/class to the GPU? nvc, nvc++ and nvfortran	16	168	January 17, 2025

How to Avoid Re-Creating CUDA Context with Multi-GPU using OpenMP? Call a function that does omp par

Related topics