I’m using OpenMP in Visual Studio 2008 Professional on an XP 64 system to run 4 GeForce GTX 295s on a quad-core Intel i7 920 CPU @ 2.67 GHz. I am doing image processing with both the GPUs and the CPUs, where each CPU has a dedicated thread driving a dedicated GPU (each thread gets an independent set of images). I’ve been able to process the image data in parallel on the GPUs with no problem, but when I need all 4 CPUs to process image data, only a single CPU does the work and the program can’t keep up with my requirements.
I’ve narrowed the issue down to cudaMalloc. I ran a test (see code below) in which I use memcpy to copy one array to another for many iterations. Each CPU thread gets its own array, and when I run the program all CPUs run at nearly the same rate and eventually all reach 100% usage. I then call cudaMalloc and run the same memcpy routine, and now only the 1st processor is at 100% while the rest are around 0%. I also used SetThreadAffinityMask to choose which CPUs the threads run on. Before cudaMalloc this behaves as expected: if I choose the thread affinity so that all threads use only the 3rd CPU, only the 3rd processor’s usage goes to 100% with the rest near zero (other combinations give similar results, e.g. thread affinity for the 2nd and 4th processors → both the 2nd and 4th processors go to 100% and the rest stay at 0%). After calling cudaMalloc, CPU usage goes back to 100% for the 1st processor and ~0% for the rest, regardless of the SetThreadAffinityMask combination. Any ideas or suggestions would be greatly appreciated!
[codebox] //XDIM = 640;
//YDIM = 480;
int *data_dev[4];
static char array_1[4][XDIM][YDIM];
static char array_2[4][XDIM][YDIM];
int num_copies = 100000;
int uiImage;
int uiCamera;

#pragma omp parallel num_threads(4) private(uiCamera, uiImage)
{
    uiCamera = omp_get_thread_num();

    //set thread affinity: 15 = binary 1111, so each thread may run on any of the 4 CPUs
    DWORD_PTR mask = 15;
    SetThreadAffinityMask( GetCurrentThread(), mask );
    printf("Thread ID = %d\n", uiCamera);

    unsigned int num_cpu_threads = omp_get_num_threads();
    int gpu_id = -1;

    //bind this CPU thread to the GPU with the same index
    CUDA_SAFE_CALL(cudaSetDevice(uiCamera));
    CUDA_SAFE_CALL(cudaGetDevice(&gpu_id));
    printf("CPU thread %d (of %u) uses CUDA device %d\n", uiCamera, num_cpu_threads, gpu_id);

    //memcpy array_2 into array_1 (all 4 CPUs reach 100% here)
    for(uiImage = 0; uiImage < num_copies; uiImage++)
        memcpy( &array_1[uiCamera][0][0], &array_2[uiCamera][0][0], XDIM * YDIM * sizeof( char ) );

    printf("Cuda Malloc\n");
    cudaMalloc((void**)&data_dev[uiCamera], XDIM*YDIM);

    //same memcpy after cudaMalloc (only the 1st CPU reaches 100% here)
    for(uiImage = 0; uiImage < num_copies; uiImage++)
        memcpy( &array_1[uiCamera][0][0], &array_2[uiCamera][0][0], XDIM * YDIM * sizeof( char ) );
}[/codebox]
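
For reference, the mask of 15 in the test allows every thread on any of the 4 CPUs. The single-CPU and two-CPU combinations I describe above use masks with fewer bits set, since bit i of the mask stands for logical processor i. Below is a minimal sketch of how such masks could be built and checked. The helper names `single_cpu_mask` and `cpus_allowed` are hypothetical and are not part of my program; the sketch only does the bit arithmetic and does not call the Windows API itself.

```c
#include <stdint.h>

/* Hypothetical helper: affinity mask with only the bit for logical
   processor cpu_index set. cpu_index must be below 64. */
static uint64_t single_cpu_mask(unsigned cpu_index)
{
    return (uint64_t)1 << cpu_index;
}

/* Hypothetical helper: number of logical processors a mask allows,
   i.e. the number of set bits in the mask. */
static unsigned cpus_allowed(uint64_t mask)
{
    unsigned count = 0;
    while (mask != 0) {
        count += (unsigned)(mask & 1u);  /* add the lowest bit */
        mask >>= 1;                      /* move to the next processor */
    }
    return count;
}
```

Inside the parallel region, one thread per CPU would then be `SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)single_cpu_mask(uiCamera))`. Since SetThreadAffinityMask returns the thread's previous mask (or 0 on failure), calling it again right after cudaMalloc and printing the return value should show whether the mask itself was changed.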