CUDA & OpenMP problem with the SDK sample code

Hi to everyone,

I was trying to test CUDA with OpenMP using the SDK sample cudaOpenMP. After running the sample code (either directly through the SDK browser or after building it with Microsoft Developer Studio 2008), I get the following results:

number of host CPUs: 8
number of CUDA devices: 2
0: GeForce GTX 295
1: GeForce GTX 295

CPU thread 0 (of 1) uses CUDA device 0

Test PASSED

Press ENTER to exit…

So my question is: why do I get these results? Since I have the GTX 295, which is effectively two GPUs, I am expecting two CPU threads. To be precise, if you take a look at the source code (see below), the number of threads for the OpenMP parallel region is set to the number of GPUs.

However, no matter what value the number of GPUs takes, there is only one CPU thread. During debugging, num_gpus comes out as 2, but num_cpu_threads is 1! Even if I change the value passed to omp_set_num_threads to, let's say, 8, I get exactly the same results.

Has anyone faced the same problem? Any ideas about it?

Any help is appreciated because I am really stuck with this issue.

Also, if anyone else has a GTX 295 card, could he/she post the results of the cudaOpenMP SDK sample test?

Thanks in advance.

[codebox]
/*
 * Copyright 1993-2009 NVIDIA Corporation. All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation and
 * any modifications thereto. Any use, reproduction, disclosure, or distribution
 * of this software and related documentation without an express license
 * agreement from NVIDIA Corporation is strictly prohibited.
 */

/*
 * Multi-GPU sample using OpenMP for threading on the CPU side
 * needs a compiler that supports OpenMP 2.0
 */

#include <omp.h>
#include <stdio.h>      // stdio functions are used since C++ streams aren't necessarily thread safe
#include <stdlib.h>     // for malloc/free; the header name on this line was eaten by the forum, stdlib.h is a guess
#include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b
__global__ void kernelAddConstant(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
}

// a predicate that checks whether each array element is set to its index plus b
int correctResult(int *data, const int n, const int b)
{
    for(int i = 0; i < n; i++)
        if(data[i] != i + b)
            return 0;
    return 1;
}

int main(int argc, char *argv[])
{
    int num_gpus = 0;   // number of CUDA GPUs

    /////////////////////////////////////////////////////////////////
    // determine the number of CUDA capable GPUs
    //
    cudaGetDeviceCount(&num_gpus);
    if(num_gpus < 1)
    {
        printf("no CUDA capable devices were detected\n");
        return 1;
    }

    /////////////////////////////////////////////////////////////////
    // display CPU and GPU configuration
    //
    printf("number of host CPUs:\t%d\n", omp_get_num_procs());
    printf("number of CUDA devices:\t%d\n", num_gpus);
    for(int i = 0; i < num_gpus; i++)
    {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, i);
        printf("   %d: %s\n", i, dprop.name);
    }
    printf("---------------------------\n");

    /////////////////////////////////////////////////////////////////
    // initialize data
    //
    unsigned int n = num_gpus * 8192;
    unsigned int nbytes = n * sizeof(int);
    int *a = 0;     // pointer to data on the CPU
    int b = 3;      // value by which the array is incremented
    a = (int*)malloc(nbytes);
    if(0 == a)
    {
        printf("couldn't allocate CPU memory\n");
        return 1;
    }
    for(unsigned int i = 0; i < n; i++)
        a[i] = i;

    ////////////////////////////////////////////////////////////////
    // run as many CPU threads as there are CUDA devices
    //   each CPU thread controls a different device, processing its
    //   portion of the data.  It's possible to use more CPU threads
    //   than there are CUDA devices, in which case several CPU
    //   threads will be allocating resources and launching kernels
    //   on the same device.  For example, try omp_set_num_threads(2*num_gpus);
    //   Recall that all variables declared inside an "omp parallel" scope are
    //   local to each CPU thread
    //
    omp_set_num_threads(num_gpus);    // create as many CPU threads as there are CUDA devices
    //omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices
    #pragma omp parallel
    {
        unsigned int cpu_thread_id = omp_get_thread_num();
        unsigned int num_cpu_threads = omp_get_num_threads();

        // set and check the CUDA device for this CPU thread
        int gpu_id = -1;
        CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));    // "% num_gpus" allows more CPU threads than GPU devices
        CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));
        printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

        int *d_a = 0;    // pointer to memory on the device associated with this CPU thread
        int *sub_a = a + cpu_thread_id * n / num_cpu_threads;    // pointer to this CPU thread's portion of data
        unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;
        dim3 gpu_threads(128);    // 128 threads per block
        dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

        CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));
        kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);
        CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));
        CUDA_SAFE_CALL(cudaFree(d_a));
    }
    printf("---------------------------\n");

    if(cudaSuccess != cudaGetLastError())
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    ////////////////////////////////////////////////////////////////
    // check the result
    //
    if(correctResult(a, n, b))
        printf("Test PASSED\n");
    else
        printf("Test FAILED\n");

    free(a);    // free CPU memory

    cudaThreadExit();
    cutilExit(argc, argv);

    return 0;
}
[/codebox]

Well, I think I have found why this happened. I had to add /openmp as an extra C++ option in the Hybrid CUDA/C++ options (found by right-clicking the cudaOpenMP.cu source file in the Solution Explorer of Microsoft Developer Studio 2008). Now I get the two CPU threads I expected:

number of host CPUs: 8
number of CUDA devices: 2
0: GeForce GTX 295
1: GeForce GTX 295

CPU thread 0 (of 2) uses CUDA device 0
CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

However, I was expecting the cudaOpenMP project to be ready to use, without the user having to add any extra arguments/options. Also, the prebuilt executable shipped with the SDK (in the folder …\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win32\Release) was apparently not built with the /openmp switch described above, since running it gives the single CPU thread mentioned in my first post.
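In case it helps anyone else, here is a small stand-alone check (my own sketch, not part of the SDK sample) for whether OpenMP support is actually switched on: the _OPENMP macro is only defined when the compiler enables OpenMP, and without it the parallel pragma is silently ignored, which is exactly why only one CPU thread shows up.

[codebox]
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
    // _OPENMP is only defined when OpenMP support is enabled (/openmp on MSVC, -fopenmp on gcc)
    printf("_OPENMP is defined (value %d)\n", _OPENMP);
    #pragma omp parallel
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
#else
    // without the flag, any "#pragma omp parallel" in the code is simply ignored
    printf("_OPENMP is not defined: omp pragmas are being ignored\n");
#endif
    return 0;
}
[/codebox]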

Can anyone else who has multiple GPUs confirm the results for cudaOpenMP?

Hello,

I'm digging up this thread because I had the same problem, and it's still there in 3.0 beta 1. Maybe it's meant to avoid errors if OpenMP is not accessible in VS for some reason; I don't really know…

Check http://msdn.microsoft.com/en-us/library/hs24szh9.aspx

Wrong thread and I suspect a spam attempt but I’ll bite anyway.

  1. In the code you posted, there should be no real penalty for the initial if(idx < length). If it evaluates to a warp-common result there is, as you know, no branching penalty, and that is the case for every warp except at most one (if length is not aligned to the warp size; if it is, there is no divergence at all). Even that single warp won't lose anything: the threads for which idx >= length are simply masked out and hang around for the ride while the remaining ones do the computation. Such conditions are pretty much free; a binary divergence in a single warp out of thousands is a negligible cost.

  2. Your code is actually wrong.

idx = max( idx, 0);

idx = min( idx, length);

So, if we launch 2000 threads and length is only 1000, there will be 1000 threads trying to write to output[1000]. That's a race condition, not to mention that output[1000] is out of bounds; it should be min( idx, length-1) if anything. (See the sketch after this list.)

  3. Branching is not actually that time-consuming anyway. Most kernels (definitely the one from your example) are memory bound, which means the limiting factor in performance is memory access bandwidth or latency. Those kernels won't mind having 10 or 15 additional instructions to chew through due to serialized divergent branching while they wait for memory. This is true on the CPU as well, maybe more so, since CPUs don't have the latency-hiding capabilities GPUs have. So don't worry that much and, most importantly, don't do premature optimization: only hack your conditionals if you've profiled and are absolutely sure they are the bottleneck, and if you actually know what you're doing…
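To make points 1 and 2 concrete, here is a minimal sketch of the guard-style kernel I am talking about (the names output, length, and value are made up for illustration, since the original code isn't quoted in this thread). The early bounds test costs essentially nothing and avoids both the out-of-bounds write and the race on output[length] that the clamping version introduces.

[codebox]
// minimal sketch with hypothetical names: a guard instead of clamping
__global__ void addValueGuarded(float *output, int length, float value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // threads with idx >= length are simply masked out; at most one warp
    // diverges, so the cost of this condition is negligible
    if (idx < length)
        output[idx] += value;
}
[/codebox]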

Not only is it spam/astroturfing (there was an earlier post from the same author that I called out as well), but it links to an article by an old friend who turned “development by forum questions” into an art form…

I have a GTX 295 and a Tesla and also got the same result, i.e. only one CPU thread. But all you have to do is enable OpenMP in your compiler options and then it should work fine…
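For what it's worth, if you build outside Visual Studio, my understanding is that you have to forward the OpenMP switch to the host compiler through nvcc's -Xcompiler option, along these lines (file names are just examples):

[codebox]
REM Windows, cl.exe as host compiler
nvcc -Xcompiler /openmp cudaOpenMP.cu -o cudaOpenMP.exe

# Linux, gcc as host compiler
nvcc -Xcompiler -fopenmp cudaOpenMP.cu -o cudaOpenMP -lgomp
[/codebox]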

Why does it always output

0: GeForce GTX 295
1: GeForce GTX 295

in that order [0, 1] and never [1, 0]?

Dear all, I am a newbie to multi-GPU programming and I am trying to understand what is going on in the cudaOpenMP sample code. In the case of using 2 GPUs with this sample, I just don't understand how each device (GPU) knows which portion of data it is assigned, and I also don't see how the device kernel is invoked twice for the 2 GPUs. Additionally, how are the data coming out of the 2 GPUs combined at the end? Do I need to modify the code to make these things happen? If so, would anyone happen to have sample code they could share showing all the steps? Thanking everyone in advance.


Hello, I work in the field of parallel computing and I have the same problem you mentioned, for which you proposed a solution. Could you help me solve it by explaining it with a simple tutorial?