How to check that work is done by different GPUs in a multi-GPU environment

darot · June 16, 2009, 6:55am

I have a system with two GTX 285 cards.
I wrote code with two host threads (.NET threads in a cpp file that call CUDA code in a .cu file).
I use the Runtime API.
I call cudaSetDevice in both host threads before launching the same CUDA kernel.
But how can I confirm that the work really runs on two different GPUs?

eyalhir74 · June 16, 2009, 8:06am

With two GPUs the work should hopefully finish about twice as fast.
Try running one thread at a time, once against device 0 and once against device 1, and check that it works in both cases.
Add logging to the GPU code that writes the time, device ID, and so on, with each thread writing to a different log.
Run the Visual Profiler or some other monitoring tool (the temperature will rise on working GPUs…).
Review the code - there is really little room for doubt :)

eyal

AndreiB · June 16, 2009, 8:07am

You pass the ordinal of the device to cudaSetDevice(). If you pass 0 in the first thread and 1 in the second thread, your program will use both devices. The easiest way to make sure that both GPUs are working is to monitor their temperature – it should increase as you start executing kernels on the device.
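
As a complement to watching temperatures, the ordinals can be checked from code. The following is only a minimal sketch, not taken from the thread: it lists every device the runtime reports and confirms that cudaGetDevice() returns the ordinal that was just passed to cudaSetDevice(). It assumes a standalone .cu file built with nvcc; the output depends on the hardware installed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: enumerate CUDA devices and confirm that cudaSetDevice()
// changes the device that cudaGetDevice() reports for this host thread.
int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);  // name, memory size, etc. for this ordinal

        cudaSetDevice(dev);                   // select the device for this host thread
        int current = -1;
        cudaGetDevice(&current);              // should now equal dev

        std::printf("requested %d, current %d: %s\n", dev, current, prop.name);
    }
    return 0;
}
```

On a two-card system this should print two lines, one per ordinal. It only shows that device selection works; it does not prove that a kernel actually ran on each device.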

darot · June 17, 2009, 12:04am

I tried, but two host threads driving two GPUs are no faster than one. This confuses me very much.

Because my cpp code is very long, I will describe how I use it and post my cu code below.

1. The environment is VS 2005 .NET C++/CLR.

2. I have a cpp file for my main entry point and a cu file that implements my CUDA functions.

3. I have SizeX*SizeY*BufferCount source data and want to use two .NET host threads to control two GPUs and share the jobs.

4. I add the odd jobs to the queue of .NET host thread 0 and the even jobs to the queue of .NET host thread 1.

5. I pass ThreadIndex with each call from each thread, and in the CUDA host code I use cudaSetDevice(ThreadIndex) to select the GPU device.

6. I synchronize on the end of all jobs on thread 0 and thread 1.

#include <stdio.h>
#include <cuda_runtime.h>

#define IMUL(a,b) __mul24(a,b)
#define BLOCK_SIZEX 128
#define BLOCK_SIZEY 1

const unsigned int BufferCount=10;
const unsigned int SizeX=8192;
const unsigned int SizeY=5000;

//CUDA Data
unsigned char *ptrcuS_h;
unsigned char *ptrcuD_h;
float elapsedTime;

void CUDAPreProc(int ThreadIndex,int SliceIndex){
	cudaSetDevice(ThreadIndex);
	cudaEvent_t start, stop;
	int DevNum;
	unsigned char *ptrD_d;
	cudaArray *cuArrayuchr;
	cudaChannelFormatDesc chDescUChr = cudaCreateChannelDesc<unsigned char>();
	dim3 dimBlockNorm(BLOCK_SIZEX, BLOCK_SIZEY);
	dim3 dimGridNorm(SizeX/dimBlockNorm.x,SizeY/dimBlockNorm.y);

	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start, 0);

	cudaMallocArray(&cuArrayuchr,&chDescUChr,SizeX, SizeY);
	cudaMalloc((void**) &ptrD_d, sizeof(unsigned char) * SizeX * SizeY);
	//cudaMemcpyToArray(cuArrayuchr,0,0,(ptrcuS_h+SliceIndex*SizeX*SizeY),sizeof(unsigned char) * SizeX * SizeY,cudaMemcpyHostToDevice);
	cudaBindTextureToArray(texUChrConvSource, cuArrayuchr);
	Binarize<<<dimGridNorm,dimBlockNorm>>>(ptrD_d,SizeX,SizeY,100);
	cudaUnbindTexture(texUChrConvSource);
	//cudaMemcpy(ptrcuD_h+SliceIndex*SizeX*SizeY, ptrD_d, sizeof(unsigned char) * SizeX * SizeY, cudaMemcpyDeviceToHost);
	//cudaThreadSynchronize();

	printf ("Job Number %i is fininshed, Total CUDA Calculation Time = %f,Th Number %i.\n",SliceIndex,elapsedTime,ThreadIndex);
	cudaGetDevice(&DevNum);
	printf ("Worker GPU Number is %i\n.",DevNum);

	//cuda free
	cudaFreeArray(cuArrayuchr);
	cudaFree(ptrD_d);

	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&elapsedTime, start, stop);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}

int CUDAINIT(unsigned char **ptrptrcuS_h,unsigned char **ptrptrcuD_h){
	//Allocate cuda pinned memory
	cudaHostAlloc( (void**) ptrptrcuS_h, sizeof(unsigned char) * SizeX*SizeY *BufferCount, cudaHostAllocPortable );
	cudaHostAlloc( (void**) ptrptrcuD_h, sizeof(unsigned char) * SizeX*SizeY *BufferCount, cudaHostAllocPortable );
	ptrcuS_h=*ptrptrcuS_h;
	ptrcuD_h=*ptrptrcuD_h;
	return 0;
}

eyalhir74 · June 17, 2009, 7:09am

#include <stdio.h>
#include <cuda_runtime.h>

#define IMUL(a,b) __mul24(a,b)
#define BLOCK_SIZEX 128
#define BLOCK_SIZEY 1

const unsigned int BufferCount=10;
const unsigned int SizeX=8192;
const unsigned int SizeY=5000;

//CUDA Data
unsigned char *ptrcuS_h;
unsigned char *ptrcuD_h;
float elapsedTime;

void CUDAPreProc(int ThreadIndex,int SliceIndex){
	cudaSetDevice(ThreadIndex);
	cudaEvent_t start, stop;
	int DevNum;
	unsigned char *ptrD_d;
	cudaArray *cuArrayuchr;
	cudaChannelFormatDesc chDescUChr = cudaCreateChannelDesc<unsigned char>();
	dim3 dimBlockNorm(BLOCK_SIZEX, BLOCK_SIZEY);
	dim3 dimGridNorm(SizeX/dimBlockNorm.x,SizeY/dimBlockNorm.y);

	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start, 0);

	cudaMallocArray(&cuArrayuchr,&chDescUChr,SizeX, SizeY);
	cudaMalloc((void**) &ptrD_d, sizeof(unsigned char) * SizeX * SizeY);
	//cudaMemcpyToArray(cuArrayuchr,0,0,(ptrcuS_h+SliceIndex*SizeX*SizeY),sizeof(unsigned char) * SizeX * SizeY,cudaMemcpyHostToDevice);
	cudaBindTextureToArray(texUChrConvSource, cuArrayuchr);

	Binarize<<<dimGridNorm,dimBlockNorm>>>(ptrD_d,SizeX,SizeY,100);
	//++ Kernel invocation is async - control immediately goes back to the CPU and you probably also unbind the texture
	//++ which the kernel probably uses.
	cudaUnbindTexture(texUChrConvSource);

	//++ You've removed the cudaMemcpy which implicitly calls cudaThreadSynchronize - so the CPU code continues right
	//++ through again. You also removed the cudaThreadSynchronize call so nothing is actually done.
	//cudaMemcpy(ptrcuD_h+SliceIndex*SizeX*SizeY, ptrD_d, sizeof(unsigned char) * SizeX * SizeY, cudaMemcpyDeviceToHost);
	//cudaThreadSynchronize();

	printf ("Job Number %i is fininshed, Total CUDA Calculation Time = %f,Th Number %i.\n",SliceIndex,elapsedTime,ThreadIndex);
	cudaGetDevice(&DevNum);
	printf ("Worker GPU Number is %i\n.",DevNum);

	//cuda free
	cudaFreeArray(cuArrayuchr);
	cudaFree(ptrD_d);

	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&elapsedTime, start, stop);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}

int CUDAINIT(unsigned char **ptrptrcuS_h,unsigned char **ptrptrcuD_h){
	//Allocate cuda pinned memory
	cudaHostAlloc( (void**) ptrptrcuS_h, sizeof(unsigned char) * SizeX*SizeY *BufferCount, cudaHostAllocPortable );
	cudaHostAlloc( (void**) ptrptrcuD_h, sizeof(unsigned char) * SizeX*SizeY *BufferCount, cudaHostAllocPortable );
	ptrcuS_h=*ptrptrcuS_h;
	ptrcuD_h=*ptrptrcuD_h;
	return 0;
}

Your code is not correct: you're not measuring time correctly in the first place, and your kernel probably doesn't do anything in any case. Please see my notes, marked //++, in the code box.

I suggest you look at the MultiGPU sample in the SDK, but first maybe go over simpler samples in the SDK to better understand how to make sure your kernel actually ran.

hope that helps…

eyal

darot · June 18, 2009, 1:24am

The example uses cutStartThread to create threads.

But I have to create the threads in the cpp file, not the cu file, so I can only use .NET threads or native C++ threads.

Is there any restriction on thread control for multi-GPU use?

eyalhir74 · June 18, 2009, 6:03am

That has nothing to do with why the code is not working.

You simply don't sync after the kernel call, so the CPU host code continues right away (this is not related to cutStartThread or .NET threads). You must sync after your kernel call, either by explicitly calling cudaThreadSynchronize() or by doing a cudaMemcpy, which will implicitly synchronize for you.

Until you do, your kernel will appear to run in 0 ms and to do nothing, and you'll either get garbage results or your GPU will get stuck.
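
A minimal sketch of the pattern described above, under stated assumptions: thresholdKernel is a hypothetical stand-in (the Binarize kernel from the thread was not posted), and the buffer size is arbitrary. The stop event is recorded directly after the launch and the host waits on it before reading the elapsed time, so the measurement covers the kernel itself. Later toolkits deprecate cudaThreadSynchronize() in favour of cudaDeviceSynchronize(); the sketch uses cudaEventSynchronize(), which exists in both.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: thresholds each byte to 0 or 255.
__global__ void thresholdKernel(const unsigned char* in, unsigned char* out,
                                int n, unsigned char level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (in[i] > level) ? 255 : 0;
}

int main() {
    const int n = 1 << 20;  // arbitrary illustration size
    unsigned char *in_d = 0, *out_d = 0;
    cudaMalloc((void**)&in_d, n);
    cudaMalloc((void**)&out_d, n);
    cudaMemset(in_d, 200, n);  // every input byte is above the threshold

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    thresholdKernel<<<(n + 255) / 256, 256>>>(in_d, out_d, n, 100);
    cudaEventRecord(stop, 0);     // queued behind the kernel in the same stream
    cudaEventSynchronize(stop);   // host blocks until the kernel has finished

    cudaError_t err = cudaGetLastError();  // reports launch or execution errors
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // valid only once stop has completed
    std::printf("kernel time: %f ms\n", ms);

    // Copy one byte back to confirm the kernel really wrote its output.
    unsigned char first = 0;
    cudaMemcpy(&first, out_d, 1, cudaMemcpyDeviceToHost);
    std::printf("out[0] = %d\n", (int)first);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in_d);
    cudaFree(out_d);
    return 0;
}
```

Reading the timer before the stop event has completed, or printing the elapsed time before cudaEventElapsedTime() is called, gives a meaningless number.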

eyal

darot · June 18, 2009, 7:45am

Thanks for your explanation.

In my code, two host threads call the same function in the cu file but select different GPU devices.

Is that OK, or should the function be different for simultaneous use?

eyalhir74 · June 18, 2009, 8:02am

Your kernel should be the same for any GPU - the code is the same (much like it is the same for all the MPs, blocks and threads you use).

The only difference is indeed what you said - the call to cudaSetDevice before running the kernel (or using any other CUDA code, for that matter).
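
To illustrate, here is a rough sketch of one host function shared by several host threads, each selecting its own device. The assumptions are mine, not the thread's: it uses C++11 std::thread rather than .NET threads, and fillKernel and worker are hypothetical names. Each device writes a distinct value, so the printed results show that every device ran the kernel.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: fills a buffer with a single value.
__global__ void fillKernel(int* out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// The same host function runs in every thread; only the device ordinal differs.
void worker(int device, int* result) {
    const int n = 1024;
    cudaSetDevice(device);  // must come before any other CUDA call in this thread

    int* buf_d = 0;
    cudaMalloc((void**)&buf_d, n * sizeof(int));
    fillKernel<<<(n + 255) / 256, 256>>>(buf_d, n, device + 1);

    // Blocking copy of the last element: returns only after the kernel finished.
    cudaMemcpy(result, buf_d + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(buf_d);
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 1) {
        std::printf("no CUDA device available\n");
        return 1;
    }

    std::vector<int> results(count, 0);
    std::vector<std::thread> threads;
    for (int d = 0; d < count; ++d)
        threads.push_back(std::thread(worker, d, &results[d]));
    for (size_t t = 0; t < threads.size(); ++t)
        threads[t].join();

    for (int d = 0; d < count; ++d)
        std::printf("device %d wrote %d (expected %d)\n", d, results[d], d + 1);
    return 0;
}
```

With older nvcc versions this may need -std=c++11. The per-thread cudaSetDevice() call is the only device-specific line; the kernel and the host function are identical for all GPUs.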

eyal
