CUDA & OpenMP problem with the SDK sample code

Hi to everyone,

I was trying to test CUDA with OpenMP using the SDK sample cudaOpenMP. After running the sample code (either directly through the SDK browser or after building it with Microsoft Developer Studio 2008), I get the following results:

number of host CPUs: 8
number of CUDA devices: 2
0: GeForce GTX 295
1: GeForce GTX 295

CPU thread 0 (of 1) uses CUDA device 0

Test PASSED

Press ENTER to exit…

So my question is: why do I get these results? Since I have the GTX 295, which is effectively two GPUs, I am expecting two CPU threads. To be precise, if you take a look at the source code (see below), the number of threads for the OpenMP parallel region is set to the number of GPUs.

However, no matter what value the number of GPUs takes, there is only one CPU thread. During debugging, num_gpus comes out as 2, but num_cpu_threads is 1! Even if I change the value passed to omp_set_num_threads to, let's say, 8, I get exactly the same results.

Has anyone faced the same problem? Any ideas about it?

Any help is appreciated because I am really stuck with this issue.

Also, if anyone else has a GTX 295 card, could he/she post the results of the cudaOpenMP SDK sample test?

Thanks in advance.

[codebox]
/*
 * Copyright 1993-2009 NVIDIA Corporation. All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation and
 * any modifications thereto. Any use, reproduction, disclosure, or distribution
 * of this software and related documentation without an express license
 * agreement from NVIDIA Corporation is strictly prohibited.
 */

/*
 * Multi-GPU sample using OpenMP for threading on the CPU side
 * needs a compiler that supports OpenMP 2.0
 */

#include <omp.h>
#include <stdio.h>      // stdio functions are used since C++ streams aren't necessarily thread safe
#include <stdlib.h>     // for malloc/free; the header name on this line was eaten by the forum, stdlib.h is a guess
#include <cutil_inline.h>

using namespace std;

// a simple kernel that simply increments each array element by b
__global__ void kernelAddConstant(int *g_a, const int b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_a[idx] += b;
}

// a predicate that checks whether each array element is set to its index plus b
int correctResult(int *data, const int n, const int b)
{
    for(int i = 0; i < n; i++)
        if(data[i] != i + b)
            return 0;
    return 1;
}

int main(int argc, char *argv[])
{
    int num_gpus = 0;   // number of CUDA GPUs

    /////////////////////////////////////////////////////////////////
    // determine the number of CUDA capable GPUs
    //
    cudaGetDeviceCount(&num_gpus);
    if(num_gpus < 1)
    {
        printf("no CUDA capable devices were detected\n");
        return 1;
    }

    /////////////////////////////////////////////////////////////////
    // display CPU and GPU configuration
    //
    printf("number of host CPUs:\t%d\n", omp_get_num_procs());
    printf("number of CUDA devices:\t%d\n", num_gpus);
    for(int i = 0; i < num_gpus; i++)
    {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, i);
        printf("   %d: %s\n", i, dprop.name);
    }
    printf("---------------------------\n");

    /////////////////////////////////////////////////////////////////
    // initialize data
    //
    unsigned int n = num_gpus * 8192;
    unsigned int nbytes = n * sizeof(int);
    int *a = 0;     // pointer to data on the CPU
    int b = 3;      // value by which the array is incremented
    a = (int*)malloc(nbytes);
    if(0 == a)
    {
        printf("couldn't allocate CPU memory\n");
        return 1;
    }
    for(unsigned int i = 0; i < n; i++)
        a[i] = i;

    ////////////////////////////////////////////////////////////////
    // run as many CPU threads as there are CUDA devices
    //   each CPU thread controls a different device, processing its
    //   portion of the data.  It's possible to use more CPU threads
    //   than there are CUDA devices, in which case several CPU
    //   threads will be allocating resources and launching kernels
    //   on the same device.  For example, try omp_set_num_threads(2*num_gpus);
    //   Recall that all variables declared inside an "omp parallel" scope are
    //   local to each CPU thread
    //
    omp_set_num_threads(num_gpus);    // create as many CPU threads as there are CUDA devices
    //omp_set_num_threads(2*num_gpus);// create twice as many CPU threads as there are CUDA devices
    #pragma omp parallel
    {
        unsigned int cpu_thread_id = omp_get_thread_num();
        unsigned int num_cpu_threads = omp_get_num_threads();

        // set and check the CUDA device for this CPU thread
        int gpu_id = -1;
        CUDA_SAFE_CALL(cudaSetDevice(cpu_thread_id % num_gpus));    // "% num_gpus" allows more CPU threads than GPU devices
        CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));
        printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

        int *d_a = 0;    // pointer to memory on the device associated with this CPU thread
        int *sub_a = a + cpu_thread_id * n / num_cpu_threads;    // pointer to this CPU thread's portion of data
        unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;
        dim3 gpu_threads(128);    // 128 threads per block
        dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

        CUDA_SAFE_CALL(cudaMalloc((void**)&d_a, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemset(d_a, 0, nbytes_per_kernel));
        CUDA_SAFE_CALL(cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice));
        kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);
        CUDA_SAFE_CALL(cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost));
        CUDA_SAFE_CALL(cudaFree(d_a));
    }
    printf("---------------------------\n");

    if(cudaSuccess != cudaGetLastError())
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    ////////////////////////////////////////////////////////////////
    // check the result
    //
    if(correctResult(a, n, b))
        printf("Test PASSED\n");
    else
        printf("Test FAILED\n");

    free(a);    // free CPU memory

    cudaThreadExit();
    cutilExit(argc, argv);

    return 0;
}
[/codebox]

Well, I think I have found why this happened. I had to add /openmp as an extra C++ option in the Hybrid CUDA/C++ options (found by right-clicking the cudaOpenMP.cu source file in the Solution Explorer of Microsoft Developer Studio 2008). Now I get the two CPU threads I expected:

number of host CPUs: 8
number of CUDA devices: 2
0: GeForce GTX 295
1: GeForce GTX 295

CPU thread 0 (of 2) uses CUDA device 0
CPU thread 1 (of 2) uses CUDA device 1

Test PASSED

However, I was expecting the cudaOpenMP project to be ready to use, without the user having to add any extra arguments/options. Also, the prebuilt executable shipped with the SDK (in the folder …\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win32\Release) was apparently not built with the /openmp switch described above, since running it gives the single CPU thread mentioned in my first post.
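In case it helps anyone else, here is a small stand-alone check (my own sketch, not part of the SDK sample) for whether OpenMP support is actually switched on: the _OPENMP macro is only defined when the compiler enables OpenMP, and without it the parallel pragma is silently ignored, which is exactly why only one CPU thread shows up.

[codebox]
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
    // _OPENMP is only defined when OpenMP support is enabled (/openmp on MSVC, -fopenmp on gcc)
    printf("_OPENMP is defined (value %d)\n", _OPENMP);
    #pragma omp parallel
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
#else
    // without the flag, any "#pragma omp parallel" in the code is simply ignored
    printf("_OPENMP is not defined: omp pragmas are being ignored\n");
#endif
    return 0;
}
[/codebox]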

Can anyone else who has multiple GPUs confirm the results for cudaOpenMP?

Hello,

I'm digging up this thread because I had the same problem, and it's still there in 3.0 beta 1. Maybe it's meant to avoid errors if OpenMP is not accessible in VS for some reason; I don't really know…

Check http://msdn.microsoft.com/en-us/library/hs24szh9.aspx

Wrong thread and I suspect a spam attempt but I’ll bite anyway.

  1. In the code you posted, there should be no real penalty for the initial if(idx < length). If it evaluates to a warp-common result there is, as you know, no branching penalty, and that is the case for every warp except at most one (if length is not aligned to the warp size; if it is, there is no divergence at all). Even that single warp won't lose anything: the threads for which idx >= length are simply masked out and hang around for the ride while the remaining ones do the computation. Such conditions are pretty much free; a binary divergence in a single warp out of thousands is a negligible cost.

  2. Your code is actually wrong.

idx = max( idx, 0);

idx = min( idx, length);

So, if we launch 2000 threads and length is only 1000, there will be 1000 threads trying to write to output[1000]. That's a race condition, not to mention that output[1000] is out of bounds; it should be min( idx, length-1) if anything. (See the sketch after this list.)

  3. Branching is not actually that time-consuming anyway. Most kernels (definitely the one from your example) are memory bound, which means the limiting factor in performance is memory access bandwidth or latency. Those kernels won't mind having 10 or 15 additional instructions to chew through due to serialized divergent branching while they wait for memory. This is true on the CPU as well, maybe more so, since CPUs don't have the latency-hiding capabilities GPUs have. So don't worry that much and, most importantly, don't do premature optimization: only hack your conditionals if you've profiled and are absolutely sure they are the bottleneck, and if you actually know what you're doing…
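To make points 1 and 2 concrete, here is a minimal sketch of the guard-style kernel I am talking about (the names output, length, and value are made up for illustration, since the original code isn't quoted in this thread). The early bounds test costs essentially nothing and avoids both the out-of-bounds write and the race on output[length] that the clamping version introduces.

[codebox]
// minimal sketch with hypothetical names: a guard instead of clamping
__global__ void addValueGuarded(float *output, int length, float value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // threads with idx >= length are simply masked out; at most one warp
    // diverges, so the cost of this condition is negligible
    if (idx < length)
        output[idx] += value;
}
[/codebox]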

Not only is it spam/astroturfing (there was an earlier post from the same author that I called out as well), but it links to an article by an old friend who turned “development by forum questions” into an art form…

I have a GTX 295 and a Tesla and also got the same result, i.e. only one CPU thread. But all you have to do is enable OpenMP in your compiler options and then it should work fine…
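For what it's worth, if you build outside Visual Studio, my understanding is that you have to forward the OpenMP switch to the host compiler through nvcc's -Xcompiler option, along these lines (file names are just examples):

[codebox]
REM Windows, cl.exe as host compiler
nvcc -Xcompiler /openmp cudaOpenMP.cu -o cudaOpenMP.exe

# Linux, gcc as host compiler
nvcc -Xcompiler -fopenmp cudaOpenMP.cu -o cudaOpenMP -lgomp
[/codebox]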

Why does it always output

0: GeForce GTX 295
1: GeForce GTX 295

in that order [0, 1] and never [1, 0]?

Dear all, I am a newbie to multi-GPU programming and I am trying to understand what is going on in the cudaOpenMP sample code. In the case of using 2 GPUs with this sample, I just don't understand how each device (GPU) knows which portion of data it is assigned, and I also don't see how the device kernel is invoked twice for the 2 GPUs. Additionally, how are the data coming out of the 2 GPUs combined at the end? Do I need to modify the code to make these things happen? If so, would anyone happen to have sample code they could share showing all the steps? Thanking everyone in advance.


Hello, I work in the field of parallel computing and I have the same problem you mentioned, for which you proposed a solution. Could you help me solve it by explaining it with a simple tutorial?