How to check that work is done by different GPUs in a multi-GPU environment

I have a system with two GTX 285 cards.
I wrote code with two host threads (.NET threads in the cpp file that call CUDA code in a .cu file).
I use the Runtime API.
I call cudaSetDevice on each of the two host threads before calling the same CUDA kernel.
But how can I tell that it is really running on two different GPUs?

  1. With 2 GPUs, the same amount of work should hopefully finish about twice as fast.

  2. Try running one thread at a time, once against device 0 and once against device 1, and check that it works in both cases.

  3. Add logging to the GPU code that writes the time/device id/… for each device to a separate log.

  4. Run the Visual Profiler or some other monitoring tool (the temperature will rise on the working GPUs…).

  5. Review the code - there is really little room for doubt :)
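Suggestions 2 and 3 could be sketched roughly as follows. This is only an illustration, not the poster's code: `scaleKernel`, `runOnDevice` and `checkEachDevice` are hypothetical names, and the kernel is a trivial stand-in for the real work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: doubles every element.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs the kernel once on the given device and logs which GPU did the work.
// Returns true if every CUDA call succeeded.
bool runOnDevice(int device) {
    if (cudaSetDevice(device) != cudaSuccess) return false;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d_data, 0, n * sizeof(float));

    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    // Check the launch itself, then block until the kernel has finished.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    cudaFree(d_data);

    printf("device %d (%s): %s\n", device, prop.name, cudaGetErrorString(err));
    return err == cudaSuccess;
}

// Tries each device in turn; returns how many completed the kernel.
int checkEachDevice() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;
    int ok = 0;
    for (int d = 0; d < count; ++d)
        if (runOnDevice(d)) ++ok;
    return ok;
}
```

Calling `checkEachDevice()` from `main` should print one line per GPU. If a device is missing from the output or reports an error, the multi-threaded version will not be using it either.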


You pass the ordinal of the device to cudaSetDevice(). If you pass 0 in the first thread and 1 in the second thread, your program will use both devices. The easiest way to make sure that both GPUs are working is to monitor their temperature: it should rise once you start executing kernels on a device.
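As a rough sketch of that pattern, each host thread can pass its own ordinal to cudaSetDevice() and then confirm the binding with cudaGetDevice(). The example assumes standard C++ threads (`std::thread`) in place of the .NET threads from the question, and `tagKernel`, `worker` and `runOneThreadPerDevice` are hypothetical names.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: fills the buffer with the tag it was launched with.
__global__ void tagKernel(int *out, int n, int tag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tag;
}

// Body of one host thread: select a device, launch, and verify the result.
void worker(int device, int *ok) {
    *ok = 0;
    if (cudaSetDevice(device) != cudaSuccess) return;

    int current = -1;
    cudaGetDevice(&current);  // device actually bound to this host thread

    const int n = 1 << 16;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return;

    tagKernel<<<(n + 255) / 256, 256>>>(d_out, n, device);

    int last = -1;
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(&last, d_out + (n - 1), sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    printf("thread for device %d: bound to %d, read back %d\n",
           device, current, last);
    *ok = (err == cudaSuccess && current == device && last == device);
}

// Starts one host thread per device; returns how many reported success.
int runOneThreadPerDevice() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;

    std::vector<int> ok(count, 0);
    std::vector<std::thread> threads;
    for (int d = 0; d < count; ++d)
        threads.emplace_back(worker, d, &ok[d]);
    for (auto &t : threads) t.join();

    int good = 0;
    for (int v : ok) good += v;
    return good;
}
```

On a two-GPU machine, `runOneThreadPerDevice()` would be expected to return 2, with each thread reporting a different bound device.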

I tried, but 2 host threads for 2 GPUs are not faster than only one. This confuses me very much.

Because my cpp code is very long, I will describe how I use it and post my cu code below.

1. I work in the VS C++ CLR environment.

2. I have a cpp file for my main entry routine, and a cu file that implements my CUDA function.

3. I have SizeX*SizeY*BufferCount source data and hope to use 2 .NET host threads to control 2 GPUs, and share the

4. I add the odd jobs to the queue of .NET host thread 0 and the even jobs to the queue of .NET host thread 1.

5. I pass ThreadIndex to each call on each thread, and in the CUDA host code I use cudaSetDevice(ThreadIndex) to select the GPU device. I then sync at the end of all jobs on thread 0 and thread 1.


#include <cuda_runtime.h>

#define IMUL(a,b) __mul24(a,b)
#define BLOCK_SIZEX 128
#define BLOCK_SIZEY 1

const unsigned int BufferCount=10;
const unsigned int SizeX=8192;
const unsigned int SizeY=5000;

//CUDA Data
unsigned char *ptrcuS_h;
unsigned char *ptrcuD_h;
float elapsedTime;

void CUDAPreProc(int ThreadIndex,int SliceIndex){

	cudaEvent_t start, stop;
	int DevNum;
	unsigned char *ptrD_d;
	cudaArray *cuArrayuchr;
	cudaChannelFormatDesc chDescUChr = cudaCreateChannelDesc<unsigned char>();
	dim3 dimBlockNorm(BLOCK_SIZEX, BLOCK_SIZEY);
	dim3 dimGridNorm(SizeX/dimBlockNorm.x,SizeY/dimBlockNorm.y);

	cudaEventRecord(start, 0);
	cudaMallocArray(&cuArrayuchr,&chDescUChr,SizeX, SizeY);
	cudaMalloc((void**) &ptrD_d, sizeof(unsigned char) * SizeX * SizeY);
	//cudaMemcpyToArray(cuArrayuchr,0,0,(ptrcuS_h+SliceIndex*SizeX*SizeY),sizeof(unsigned char) * SizeX * 

	cudaBindTextureToArray(texUChrConvSource, cuArrayuchr);

	//cudaMemcpy(ptrcuD_h+SliceIndex*SizeX*SizeY, ptrD_d, sizeof(unsigned char) * SizeX * SizeY, 

	printf ("Job Number %i is fininshed, Total CUDA Calculation Time = %f,Th Number %

	printf ("Worker GPU Number is %i\n.",DevNum);
	//cuda free

	cudaEventRecord(stop, 0);

	cudaEventElapsedTime(&elapsedTime, start, stop);

int CUDAINIT(unsigned char **ptrptrcuS_h,unsigned char **ptrptrcuD_h){

	//Allocate cuda pinned memory
	cudaHostAlloc( (void**) ptrptrcuS_h, sizeof(unsigned char) * SizeX*SizeY *BufferCount, cudaHostAllocPortable 

	cudaHostAlloc( (void**) ptrptrcuD_h, sizeof(unsigned char) * SizeX*SizeY *BufferCount, cudaHostAllocPortable 

	return 0;


Your code is not correct: you’re not measuring time correctly in the first place, and your kernel probably doesn’t do anything in any case. Please see my notes in bold in the code box.

I suggest you look at the MultiGPU sample in the SDK, but first maybe go over simpler samples in the SDK to better understand how to make sure your kernel actually ran.

Hope that helps…


The example uses cutStartThread to create threads.

But if I have to create the threads in a cpp file rather than a cu file, then I can only use .NET threads or native C++ threads.

Is there any restriction on thread control for multi-GPU use?

That has nothing to do with why the code is not working.

You simply don’t sync after the kernel call, so the CPU host code continues right away (this is not related to cutStartThread or .NET threads). You must sync after your kernel call, either explicitly by calling cudaThreadSynchronize() (named cudaDeviceSynchronize() in later CUDA releases) or by doing a cudaMemcpy, which syncs implicitly for you.

Until you do, your kernel will appear to run in 0 ms, and you will either read back garbage results or your GPU will get stuck.
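A minimal sketch of timing a kernel with events and an explicit sync might look like this. It is not the poster's code: `busyKernel` and `timeKernelMs` are hypothetical names, and the kernel only does some filler arithmetic so that there is something to measure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that does a little arithmetic per element.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Returns the kernel time in milliseconds, or a negative value on error.
float timeKernelMs(int device) {
    if (cudaSetDevice(device) != cudaSuccess) return -1.0f;

    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);   // events must be created before being recorded
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);  // recorded right after the work being timed

    // The launch is asynchronous: wait for the stop event before reading it.
    cudaError_t err = cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return err == cudaSuccess ? ms : -1.0f;
}
```

Calling `timeKernelMs(0)` and `timeKernelMs(1)` from the two host threads gives a per-device kernel time that can be compared against the single-GPU run.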


Thanks for your explanation.

In my code, the two host threads call the same function in the cu file but each sets a different GPU device to use.

Is that OK, or should the function be different for simultaneous use?

Your kernel should be the same for any GPU - the code is the same (much like it is the same for all MPs, blocks and threads you use).

The only difference is indeed what you’ve said: the call to cudaSetDevice prior to running the kernel (or using any other CUDA code, for that matter).