Questions for multiple GPUs

BlueKDS · April 15, 2009, 8:34am

Hello all,

I tried to use two GPUs at once.

But I think my code does not work with two GPUs.

I use GTX295.

I have a few questions…

Can a CPU thread use multiple GPUs?

Ex)

for ( int i = 0; i < nGPU; i++ ) {

	cudaSetDevice(i);

	doKernel();

}

Can page-locked memory allocated by calling cudaMallocHost() be shared by multiple GPUs?
How can I check whether my application uses multiple GPUs or not?

Thanks.

Quoc_Vinh · April 15, 2009, 10:01am

You should take a look at the simpleMultiGPU program in the SDK.

I think that you can use cudaMallocHost() (pay attention to the copy addresses and the size of the copied data).

You can check the data returned to host memory to see how many GPUs ran.

:)
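A minimal sketch of the "check the data returned to host memory" idea follows. It is an illustration only: `stampDevice` and `probeDevices` are hypothetical names, and the single-threaded loop over cudaSetDevice() assumes CUDA 4.0 or later, where one host thread may switch between devices. With the CUDA 2.x toolkits discussed in this thread, each device needs its own host thread, as in simpleMultiGPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: writes the ordinal of the device it was launched on.
__global__ void stampDevice(int *out, int deviceId) { *out = deviceId; }

// Hypothetical helper: returns one entry per visible device. Entry i holds
// the value copied back from device i, or -1 if any CUDA call for that
// device failed. Assumes CUDA 4.0+ (one host thread, many devices).
std::vector<int> probeDevices() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n < 1) return std::vector<int>();

    std::vector<int> seen(n, -1);
    for (int i = 0; i < n; ++i) {
        int *d = NULL;
        if (cudaSetDevice(i) != cudaSuccess) continue;
        if (cudaMalloc((void **)&d, sizeof(int)) != cudaSuccess) continue;

        stampDevice<<<1, 1>>>(d, i);

        // The blocking copy also waits for the kernel to finish.
        int value = -1;
        if (cudaMemcpy(&value, d, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess)
            seen[i] = value;
        cudaFree(d);
    }
    return seen;
}
```

Calling `probeDevices()` from main and printing the vector shows which devices actually executed work: every device that ran its kernel reports its own index, and any device that was never reached stays at -1. A GTX 295 exposes its two GPUs as two separate devices, so a working setup should report two entries.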

_Big_Mac · April 15, 2009, 10:33am

IIRC only one device can access pinned memory. At least that’s how I think it was up until recently, I’m not sure if CUDA 2.2 with its zero-copy etc. changes this.

I don’t think you can use multiple devices in the way you described.

Seva · April 15, 2009, 10:42am

Hi,

try this

[codebox]
omp_set_num_threads(num_used_gpu);

#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{ // First section
    unsigned int cpu_thread_id = omp_get_thread_num();
    unsigned int num_cpu_threads = omp_get_num_threads();
    int gpu_id = -1;

    // Bind this CPU thread to the GPU with the same index
    cudaSetDevice(cpu_thread_id);
    cudaGetDevice(&gpu_id);
    printf("M CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

    float *SD, *SH;
    cudaMallocHost((void **)&SH, NpoinT*sizeof(float));
    cudaMalloc((void **)&SD, NpoinT*sizeof(float));

    #pragma omp barrier

    // Time 10000 device-to-host copies into pinned memory.
    // t is local to the section so the two threads do not race on it
    // (assumes cpu_time() returns double, as the %lf format suggests).
    double t = cpu_time();
    for (int tim = 0; tim < 10000; tim++) {
        cudaMemcpy(SH, SD, NpoinT*sizeof(float), cudaMemcpyDeviceToHost);
    } // End cycle
    cudaThreadSynchronize();
    printf("\n---%lf---\n", cpu_time() - t);

    cudaFree(SD);
    cudaFreeHost(SH);
} // End first section

#pragma omp section
{ // Second section
    unsigned int cpu_thread_id = omp_get_thread_num();
    unsigned int num_cpu_threads = omp_get_num_threads();
    int gpu_id = -1;

    cudaSetDevice(cpu_thread_id);
    cudaGetDevice(&gpu_id);
    printf("S CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

    float *SD, *SH;
    cudaMallocHost((void **)&SH, NpoinT*sizeof(float));
    cudaMalloc((void **)&SD, NpoinT*sizeof(float));

    #pragma omp barrier

    double t = cpu_time();
    for (int tim = 0; tim < 10000; tim++) {
        cudaMemcpy(SH, SD, NpoinT*sizeof(float), cudaMemcpyDeviceToHost);
    } // End cycle
    cudaThreadSynchronize();
    printf("\n---%lf---\n", cpu_time() - t);

    cudaFreeHost(SH);
    cudaFree(SD);
} // End second section
} // End sections
} // End parallel
[/codebox]

BlueKDS · April 15, 2009, 12:02pm

With the Intel compiler I use, this gives a compile error.

As far as I know, a barrier enclosed in a section is not allowed in OpenMP.

Am I wrong?
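For reference, the OpenMP specification does state that a barrier may not be closely nested inside a worksharing region, and sections is a worksharing construct. One way around it, sketched here under assumptions, is to drop the sections and let every thread of a plain parallel region run the same code, picking its GPU from its thread id. The name `runOnePerGpu` and the buffer size parameter are hypothetical, and omp_get_wtime() stands in for the cpu_time() helper used above.

```cuda
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical helper: starts one host thread per GPU and returns how many
// threads finished their device-to-host copy without a CUDA error.
int runOnePerGpu(int numGpus, size_t count) {
    if (numGpus < 1) return 0;
    int succeeded = 0;
    omp_set_num_threads(numGpus);

    #pragma omp parallel reduction(+ : succeeded)
    {
        int tid = omp_get_thread_num();
        float *dev = NULL, *host = NULL;
        size_t bytes = count * sizeof(float);

        // Per-thread setup: bind to a device, then allocate on it.
        bool ok = cudaSetDevice(tid) == cudaSuccess
               && cudaMallocHost((void **)&host, bytes) == cudaSuccess
               && cudaMalloc((void **)&dev, bytes) == cudaSuccess;

        // Allowed here: the barrier sits directly in the parallel region
        // and every thread of the team reaches it exactly once.
        #pragma omp barrier

        double t0 = omp_get_wtime();
        if (ok) ok = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        double elapsed = omp_get_wtime() - t0;

        printf("thread %d: copy %s in %f s\n", tid, ok ? "ok" : "failed", elapsed);

        if (dev) cudaFree(dev);
        if (host) cudaFreeHost(host);
        if (ok) succeeded += 1;
    }
    return succeeded;
}
```

Because both threads execute identical code, there is nothing to duplicate per section, and the barrier only serves to start the timed copies together. The host compiler still needs its OpenMP flag, for example -Xcompiler -fopenmp with gcc.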

BlueKDS · April 15, 2009, 12:08pm

Do you mean that a pinned memory space used by device A should be allocated within the context of device A?

If I use two CPU threads and each one allocates its own pinned memory, can I use the two pinned memory spaces asynchronously?

Thanks.

Fugl · April 15, 2009, 12:23pm

Take a look at MrAnderson's GPUWorker - it might be what you are looking for (see http://forums.nvidia.com/index.php?showtopic=66598&st=0).

The page-locked memory allocated by cudaMallocHost can only be used by the GPU context (read: CPU thread) that created it. (My understanding is that the driver keeps per-thread state for the DMA transfers it uses internally for GPU transfers.) Thus a GPU running in a thread can only make use of page-locked memory allocated in the same thread. You are of course still free to use normal CPU operations (memset, memcpy, etc.) on the data of different threads.

Edit: As pointed out below, this isn’t quite true. Only async transfers don’t work.

MisterAnderson42 · April 15, 2009, 12:36pm

Just a small clarification. Pre-CUDA 2.2, you can cudaMemcpy from memory allocated by cudaMallocHost in one context to a GPU in another context. The driver just won’t recognize that the memory is pinned and will copy it through the slow path. It runs at the speed of normal pageable memcpys, but it does work. You will only get errors if you try to use the cudaMemcpyAsync calls.

CUDA 2.2 adds an optional setting you can set before cudaSetDevice() that gives memory allocated in one context full pinned-memory performance on any other GPU, thus removing the performance hit and allowing *Async() calls.
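As a rough illustration of cross-context pinned memory, the sketch below uses cudaHostAlloc() with the cudaHostAllocPortable flag, which the CUDA 2.2 runtime introduced so that all contexts treat an allocation as pinned. Whether this is exactly the setting meant above is an assumption; the helper names `allocPortablePinned` and `asyncUpload` are hypothetical.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: allocates page-locked host memory that every CUDA
// context treats as pinned. Returns NULL on failure.
float *allocPortablePinned(size_t count) {
    void *p = NULL;
    if (cudaHostAlloc(&p, count * sizeof(float), cudaHostAllocPortable) != cudaSuccess)
        return NULL;
    return static_cast<float *>(p);
}

// Hypothetical helper: uploads `count` floats from pinned host memory to a
// scratch buffer on `device` with an asynchronous copy. Returns true if the
// copy completed without error.
bool asyncUpload(int device, const float *hostSrc, size_t count) {
    float *devDst = NULL;
    cudaStream_t stream;
    size_t bytes = count * sizeof(float);

    if (cudaSetDevice(device) != cudaSuccess) return false;
    if (cudaMalloc((void **)&devDst, bytes) != cudaSuccess) return false;
    if (cudaStreamCreate(&stream) != cudaSuccess) { cudaFree(devDst); return false; }

    cudaError_t err = cudaMemcpyAsync(devDst, hostSrc, bytes,
                                      cudaMemcpyHostToDevice, stream);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(devDst);
    return err == cudaSuccess;
}
```

A caller would allocate one buffer with `allocPortablePinned()`, pass it to `asyncUpload()` once per device (from one host thread per device on CUDA 2.x), and release it with cudaFreeHost() when done.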

Seva · April 20, 2009, 9:06am

The barrier must appear in all sections the same number of times.

Also, you must pass the special compiler options for OpenMP,

for instance:

nvcc -Xopencc -mp,-pthread -Xcompiler -fopenmp -Xlinker -lgomp my_prog.cu -O2 -o my_prog --compiler-bindir /share/gnu/gcc-4.3.3/bin