OpenMP CPU Thread Affinity loss when using cudaMalloc Problems with Thread Affinity after calling cu

molinero · June 26, 2009, 7:36pm

I’m using OpenMP on an XP 64 system in Visual Studio 2008 Professional to run 4 GeForce GTX 295s on a quad-core Intel i7 CPU 920 @ 2.67 GHz. I am doing image processing with both the GPUs and CPUs, where each CPU has a thread dedicated to one GPU (each thread gets an independent set of images). I’ve been able to process the image data in parallel on the GPUs with no problem, but when I need to use all 4 CPUs to process the image data, only a single CPU does the work and the program can’t keep up with my requirements.

I’ve narrowed down the issue to the cudaMalloc call. I ran a test (see code below) where I use memcpy to copy one array to another for 100,000 iterations. Each CPU gets its own array, and when I run the program, all CPUs run at nearly the same rate and eventually all cap out at 100% usage. I then call cudaMalloc and run the same memcpy routine, and now only the 1st processor is at 100%…the rest are around 0%. I also used SetThreadAffinityMask to choose which CPU to run the threads on. If I choose the thread affinity such that all threads use the 3rd CPU, only the 3rd processor’s usage goes to 100%, with the rest near zero (similar results with other combinations, such as thread affinity for the 2nd and 4th processors → both 2nd and 4th processor usage goes to 100% and the rest stay at 0%). After calling cudaMalloc, CPU usage goes back to 100% for the 1st processor and ~0% for the rest (same result with different combinations of SetThreadAffinityMask). Any ideas or suggestions would be greatly appreciated!

[codebox]//XDIM = 640;
//YDIM = 480;
int *data_dev[4];
static char array_1[4][XDIM][YDIM];
static char array_2[4][XDIM][YDIM];
int num_copies = 100000;
int uiImage;
int uiCamera;

#pragma omp parallel num_threads(4) private(uiCamera, uiImage)
{
    uiCamera = omp_get_thread_num();

    // set thread affinity: mask 15 (binary 1111) lets each thread run on any of the 4 CPUs
    DWORD_PTR mask = 15;
    SetThreadAffinityMask(GetCurrentThread(), mask);
    // note: this prints the index of the current thread, not the thread count
    printf("Num Threads = %d\n", uiCamera);

    unsigned int num_cpu_threads = omp_get_num_threads();
    //unsigned int uiCamera = -1;
    int gpu_id = -1;
    CUDA_SAFE_CALL(cudaSetDevice(uiCamera));
    CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));
    printf("CPU thread %d (of %d) uses CUDA device %d\n", uiCamera, num_cpu_threads, gpu_id);

    // memcpy one array to another (array_2 -> array_1)
    for (uiImage = 0; uiImage < num_copies; uiImage++)
        memcpy(&array_1[uiCamera][0][0], &array_2[uiCamera][0][0], XDIM * YDIM * sizeof(char));

    printf("Cuda Malloc\n");
    cudaMalloc((void**)&data_dev[uiCamera], XDIM * YDIM);

    // memcpy one array to another after cudaMalloc
    for (uiImage = 0; uiImage < num_copies; uiImage++)
        memcpy(&array_1[uiCamera][0][0], &array_2[uiCamera][0][0], XDIM * YDIM * sizeof(char));
}[/codebox]
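
For comparison, here is a minimal sketch of dedicating one CPU to each thread. The mask of 15 in the code above allows every thread to run on any of the four CPUs. A mask with a single bit set restricts a thread to one logical processor. The helper names `single_core_mask` and `pin_current_thread` are hypothetical and not part of the original test. The Windows call is guarded so the sketch also compiles elsewhere, where it does nothing.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: affinity mask with only the bit for `core` set.
// Bit i of an affinity mask corresponds to logical processor i.
std::uintptr_t single_core_mask(unsigned core) {
    assert(core < sizeof(std::uintptr_t) * 8);  // one bit per logical processor
    return static_cast<std::uintptr_t>(1) << core;
}

// Hypothetical helper: pin the calling thread to one logical processor.
// Returns true on success. On non-Windows builds this is a no-op stub.
bool pin_current_thread(unsigned core) {
#ifdef _WIN32
    DWORD_PTR previous = SetThreadAffinityMask(
        GetCurrentThread(), static_cast<DWORD_PTR>(single_core_mask(core)));
    return previous != 0;  // SetThreadAffinityMask returns 0 on failure
#else
    (void)core;  // nothing to do in this sketch outside Windows
    return true;
#endif
}
```

Inside the parallel region this would be called as `pin_current_thread(omp_get_thread_num())`, assuming the thread index maps directly onto the logical processor you want.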

tmurray · June 26, 2009, 8:06pm

Are you using the profiler?

molinero · June 26, 2009, 8:17pm

Yes.

tmurray · June 26, 2009, 8:36pm

The profiler sets the affinity mask in order to handle timestamps correctly (or at least I seem to recall that’s the reason). If you don’t use the profiler, we won’t touch your affinity mask.
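
One way to confirm this kind of change is to read the affinity mask before and after the first CUDA call and compare the two. The sketch below is a rough illustration under some assumptions: it reads the process-level mask with `GetProcessAffinityMask`, since the thread does not say whether the profiler changes the process mask or the per-thread mask, and the helper names (`allowed_cores`, `mask_was_narrowed`, `current_process_mask`) are hypothetical. Outside Windows the read is a stub that returns 0.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: number of logical processors a mask allows.
unsigned allowed_cores(std::uintptr_t mask) {
    unsigned count = 0;
    for (; mask != 0; mask &= mask - 1) {  // clear the lowest set bit each pass
        ++count;
    }
    return count;
}

// Hypothetical helper: true if `after` lost any processor that `before` allowed.
bool mask_was_narrowed(std::uintptr_t before, std::uintptr_t after) {
    return (before & ~after) != 0;
}

// Hypothetical helper: current process affinity mask, or 0 if it cannot be read.
std::uintptr_t current_process_mask() {
#ifdef _WIN32
    DWORD_PTR process_mask = 0;
    DWORD_PTR system_mask = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask)) {
        return 0;
    }
    return static_cast<std::uintptr_t>(process_mask);
#else
    return 0;  // stub: this sketch only reads the mask on Windows
#endif
}
```

Calling `current_process_mask()` once before and once after cudaMalloc and passing both values to `mask_was_narrowed` would show whether processors were dropped in between. If the process mask is unchanged, the per-thread mask could be checked instead: `SetThreadAffinityMask` returns the thread's previous mask on success.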

molinero · June 27, 2009, 12:54am

That fixed it!!! Thanks!!!
