Questions about multiple GPUs

Hello all,

I tried to use two GPUs at once.

But I think my code does not work with two GPUs.

I use a GTX 295.

I have a few questions…

  1. Can a CPU thread use multiple GPUs?

Example:

for ( int i = 0; i < nGPU; i++ ) {

	cudaSetDevice(i);

	doKernel();

}
  2. Can page-locked memory allocated by calling cudaMallocHost() be shared by multiple GPUs?

  3. How can I check whether my application uses multiple GPUs?

Thanks.

You should take a look at the simpleMultiGPU program in the SDK.

I think you can use cudaMallocHost() (pay attention to the addresses used in each memory copy and to the size of the copied data).

You can check the data returned to host memory to see how many GPUs actually ran.

:)
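The check suggested above could look roughly like the sketch below. It is only an illustration, not code from the SDK: the kernel name stampDevice, the buffer size, and the use of one OpenMP host thread per GPU are assumptions made for this example. Each GPU writes its own device index into its slice of a host array, and the host then checks which slices were filled in.

```cuda
#include <cstdio>
#include <vector>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical kernel: every element is set to the index of the device that ran it.
__global__ void stampDevice(int *out, int n, int deviceId)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = deviceId;
}

int main()
{
    int nGPU = 0;
    if (cudaGetDeviceCount(&nGPU) != cudaSuccess || nGPU == 0) {
        printf("no CUDA device found\n");
        return 1;
    }

    const int n = 256;                        // elements per GPU (arbitrary)
    std::vector<int> result(nGPU * n, -1);    // -1 means "never written"

    omp_set_num_threads(nGPU);                // request one host thread per GPU
    #pragma omp parallel
    {
        int dev = omp_get_thread_num();
        if (dev < nGPU) {
            int *d_out = NULL;
            cudaSetDevice(dev);
            cudaMalloc((void **)&d_out, n * sizeof(int));
            stampDevice<<<(n + 127) / 128, 128>>>(d_out, n, dev);
            // Each thread copies into its own slice, so the slices never overlap.
            cudaMemcpy(&result[dev * n], d_out, n * sizeof(int),
                       cudaMemcpyDeviceToHost);
            cudaFree(d_out);
        }
    }

    // A slice that still holds -1 was never written by its GPU.
    for (int dev = 0; dev < nGPU; ++dev) {
        bool ok = true;
        for (int i = 0; i < n; ++i)
            ok = ok && (result[dev * n + i] == dev);
        printf("device %d %s\n", dev, ok ? "ran" : "did NOT run");
    }
    return 0;
}
```

This needs the OpenMP flags passed through nvcc to the host compiler. Error checking on the CUDA calls is left out to keep it short; a failed call simply shows up as a slice that was never written.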

IIRC only one device can access pinned memory. At least that’s how I think it was up until recently, I’m not sure if CUDA 2.2 with its zero-copy etc. changes this.

I don’t think you can use multiple devices in the way you described.

Hi,

try this

[codebox]

omp_set_num_threads(num_used_gpu);

#pragma omp parallel

{

#pragma omp sections

{

{

    unsigned int cpu_thread_id = omp_get_thread_num();

	unsigned int num_cpu_threads = omp_get_num_threads();

	int gpu_id = -1;

	cudaSetDevice(cpu_thread_id);

	cudaGetDevice(&gpu_id);

	printf("M CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

	

float *SD,*SH;

cudaMallocHost((void **)&SH, NpoinT*sizeof(float));

cudaMalloc((void **)&SD, NpoinT*sizeof(float));

 #pragma omp barrier

t=cpu_time();

for(int tim=0;tim<10000;tim++){

cudaMemcpy(SH,SD, NpoinT*sizeof(float), cudaMemcpyDeviceToHost);

} // End loop

cudaThreadSynchronize();

printf("\n---%lf---\n",cpu_time()-t);

cudaFree(SD);

cudaFreeHost(SH);

} // End first section

///////////////////////////////////////////////////////////////////////////////////////////////////////////

/////////////////////////////////////////////////////////////////////////////////////////////////////////

/////////////////////////////////////////////////////////////////////////////////////////////////////////

#pragma omp section

{

unsigned int cpu_thread_id = omp_get_thread_num();

	unsigned int num_cpu_threads = omp_get_num_threads();

	

	int gpu_id = -1;

	cudaSetDevice(cpu_thread_id);

	cudaGetDevice(&gpu_id);

	printf("S CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);

float *SD,*SH;

cudaMallocHost((void **)&SH, NpoinT*sizeof(float));

cudaMalloc((void **)&SD, NpoinT*sizeof(float));

 #pragma omp barrier

t=cpu_time();

for(int tim=0;tim<10000;tim++){

cudaMemcpy(SH,SD, NpoinT*sizeof(float), cudaMemcpyDeviceToHost);

} // End loop

cudaThreadSynchronize();

printf("\n---%lf---\n",cpu_time()-t);

cudaFreeHost(SH);

cudaFree(SD);

} // End second section

//////////////////////////////////////////////////////////////////////////////////////////////////////////////

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

} // End sections

} // End parallel

[/codebox]

With the Intel compiler I use, this gives a compile error.

As far as I know, a barrier enclosed in a section is not allowed in OpenMP.

Am I wrong?
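For what it's worth, the OpenMP rules do forbid a barrier that is closely nested inside a worksharing construct such as sections. One possible way around the error, sketched here under the assumption that each OpenMP thread drives exactly one GPU, is to drop the sections and place the barrier directly in the parallel region. The buffer size and loop count are arbitrary, and omp_get_wtime() stands in for the cpu_time() helper from the posted code.

```cuda
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

int main()
{
    int nGPU = 0;
    if (cudaGetDeviceCount(&nGPU) != cudaSuccess || nGPU == 0) {
        printf("no CUDA device found\n");
        return 1;
    }

    const size_t bytes = (1 << 20) * sizeof(float);   // arbitrary transfer size

    omp_set_num_threads(nGPU);                        // one host thread per GPU
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int gpu_id = -1;
        float *hostBuf = NULL, *devBuf = NULL;

        cudaSetDevice(tid);
        cudaGetDevice(&gpu_id);
        cudaMallocHost((void **)&hostBuf, bytes);     // pinned, owned by this thread
        cudaMalloc((void **)&devBuf, bytes);

        // Legal here: the barrier sits directly in the parallel region,
        // not inside a sections/section block.
        #pragma omp barrier

        double t0 = omp_get_wtime();                  // per-thread timer, no shared variable
        for (int i = 0; i < 100; ++i)
            cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost);
        double elapsed = omp_get_wtime() - t0;

        printf("CPU thread %d uses CUDA device %d: %f s\n", tid, gpu_id, elapsed);

        cudaFree(devBuf);
        cudaFreeHost(hostBuf);
    }
    return 0;
}
```

Because every thread runs the same block, there is no need to duplicate the code per section, and the timer variable is private to each thread instead of shared.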

Do you mean that pinned memory used by device A should be allocated in the context of device A?

If I use two CPU threads and each one allocates its own pinned memory, can I use the two pinned memory regions asynchronously?

Thanks.

Take a look at MrAndersons GPUWorker - it might be what you are looking for (see http://forums.nvidia.com/index.php?showtopic=66598&st=0 ).

The page-locked memory allocated by cudaMallocHost can only be used by the GPU context (read: CPU thread) that created it. (My understanding is that the driver keeps per-thread state for the DMA transfers, which it uses internally for GPU transfers.) Thus a GPU running in a thread can only make use of page-locked memory allocated in the same thread. You are of course still free to use normal CPU operations (memset, memcpy, etc.) between the threads' data.

Edit: As pointed out below, this isn’t quite true. It’s just async transfers that don’t work.

Just a small clarification. Pre-CUDA 2.2, you can cudaMemcpy from memory allocated by cudaMallocHost in one context to a GPU in another context. The driver just won’t recognize that the memory is pinned and will copy it through the slow path. It runs at the speed of normal pageable memcpys, but it does work. You will only get errors if you try to use the cudaMemcpyAsync calls.

CUDA 2.2 adds an optional setting you can set before cudaSetDevice() that gives memory allocated in one context full pinned-memory performance from any other GPU, thus removing the performance hit and allowing the *Async() calls.
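The post does not name the setting. The portable pinned memory feature I know of in the runtime API is the cudaHostAllocPortable flag passed to cudaHostAlloc(); treating that as what is meant here is an assumption. A rough sketch of sharing one such buffer between GPUs, again with one OpenMP host thread per GPU and arbitrary sizes:

```cuda
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

int main()
{
    int nGPU = 0;
    if (cudaGetDeviceCount(&nGPU) != cudaSuccess || nGPU == 0) {
        printf("no CUDA device found\n");
        return 1;
    }

    const int n = 1 << 20;          // floats per GPU (arbitrary)
    float *pinned = NULL;           // one buffer shared by all host threads

    omp_set_num_threads(nGPU);
    #pragma omp parallel
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        // Thread 0 allocates the buffer as portable pinned memory
        // (assumption: this flag is the "optional setting" from the post).
        if (dev == 0)
            cudaHostAlloc((void **)&pinned, (size_t)nGPU * n * sizeof(float),
                          cudaHostAllocPortable);
        #pragma omp barrier         // the pointer is now visible to every thread

        if (pinned != NULL) {
            float *slice = pinned + (size_t)dev * n;   // this thread's part
            for (int i = 0; i < n; ++i) slice[i] = (float)dev;

            float *devBuf = NULL;
            cudaStream_t stream = 0;
            cudaMalloc((void **)&devBuf, n * sizeof(float));
            cudaStreamCreate(&stream);

            // Async copy from memory that another thread allocated.
            cudaError_t err = cudaMemcpyAsync(devBuf, slice, n * sizeof(float),
                                              cudaMemcpyHostToDevice, stream);
            if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
            printf("device %d: async copy from shared pinned buffer: %s\n",
                   dev, cudaGetErrorString(err));

            cudaStreamDestroy(stream);
            cudaFree(devBuf);
        }

        #pragma omp barrier         // nobody frees while others are still copying
        if (dev == 0 && pinned != NULL) cudaFreeHost(pinned);
    }
    return 0;
}
```

The printed status per device shows whether the async copy was accepted for a buffer allocated by a different thread, which is the case the posts above say fails with plain cudaMallocHost before CUDA 2.2.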

barrier must appear in all sections the same number of times

Also, you must pass the special compiler options for OpenMP,

for instance:

nvcc -Xopencc -mp,-pthread -Xcompiler -fopenmp -Xlinker -lgomp my_prog.cu -O2 -o my_prog --compiler-bindir /share/gnu/gcc-4.3.3/bin