OpenMP CPU thread affinity loss after calling cudaMalloc

I’m using OpenMP on an XP 64 system in Visual Studio 2008 Professional to run 4 GeForce GTX 295s with a quad-core Intel Core i7 920 CPU @ 2.67 GHz. I am doing image processing on both the GPUs and the CPUs, where each CPU core runs a dedicated thread driving a dedicated GPU (each thread gets an independent set of images). I’ve been able to process the image data in parallel on the GPUs with no problem, but when I need all 4 CPU cores for processing, only a single core does any work and the program can’t keep up with my requirements.

I’ve narrowed the issue down to cudaMalloc. I ran a test (see code below) where I use memcpy to copy one array to another for several thousand iterations. Each CPU core gets its own array, and when I run the program all cores run at nearly the same rate and eventually cap out at 100% usage. I then call cudaMalloc and run the same memcpy routine, and now only the 1st core is at 100%; the rest are around 0%. I also used SetThreadAffinityMask to choose which CPU each thread runs on. If I choose the thread affinity so that all threads use the 3rd core, only the 3rd core’s usage goes to 100% with the rest near zero (similar results with other combinations; e.g., affinity for the 2nd and 4th cores → both go to 100% and the rest stay at 0%). But after calling cudaMalloc, usage goes back to 100% on the 1st core and ~0% on the rest, no matter what combination I pass to SetThreadAffinityMask. Any ideas or suggestions would be greatly appreciated!

[codebox]#include <stdio.h>
#include <string.h>
#include <windows.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <cutil.h>   // CUDA_SAFE_CALL

#define XDIM 640
#define YDIM 480

int *data_dev[4];
static char array_1[4][XDIM][YDIM];
static char array_2[4][XDIM][YDIM];

int main(void)
{
    const int num_copies = 100000;

    #pragma omp parallel num_threads(4)
    {
        int uiCamera = omp_get_thread_num();
        int uiImage;

        // Set thread affinity; 15 is binary 1111, i.e. every thread may
        // run on any of the four CPUs.
        DWORD_PTR mask = 15;
        SetThreadAffinityMask(GetCurrentThread(), mask);

        printf("Thread num = %d\n", uiCamera);

        unsigned int num_cpu_threads = omp_get_num_threads();
        int gpu_id = -1;
        CUDA_SAFE_CALL(cudaSetDevice(uiCamera));
        CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));
        printf("CPU thread %d (of %d) uses CUDA device %d\n",
               uiCamera, num_cpu_threads, gpu_id);

        // memcpy one array to another: all four cores reach ~100% here
        for (uiImage = 0; uiImage < num_copies; uiImage++)
            memcpy(&array_1[uiCamera][0][0], &array_2[uiCamera][0][0],
                   XDIM * YDIM * sizeof(char));

        printf("cudaMalloc\n");
        CUDA_SAFE_CALL(cudaMalloc((void **)&data_dev[uiCamera], XDIM * YDIM));

        // Same memcpy loop after cudaMalloc: now only one core is busy
        for (uiImage = 0; uiImage < num_copies; uiImage++)
            memcpy(&array_1[uiCamera][0][0], &array_2[uiCamera][0][0],
                   XDIM * YDIM * sizeof(char));
    }
    return 0;
}[/codebox]

Are you using the profiler?

Yes.

The profiler sets the affinity mask in order to handle timestamps correctly (or at least I seem to recall that’s the reason). If you don’t use the profiler, we won’t touch your affinity mask.

That fixed it!!! Thanks!!!