Multi-GPU CUBLAS

I would like to use all of my GPUs to run CUBLAS routines concurrently. As an example, I can do something like this:

cublasAlloc(…);

for (i = 0; i < gpuCount; i++)
{
	cuDeviceGet(&dev, i);
	cublasSetVector(…);
	cublasDgemm(…);
}

cublasFree(…);


Now, memory does seem to get allocated on all of the GPUs and cublasDgemm does run on the different GPUs, but all of this happens sequentially rather than in parallel. Is this because I need multiple host threads? If so, what would be the easiest (and best) way to create them, given that I’m working on Linux machines (pthreads?)? Do I create these threads inside the for loop? Thanks.

Each host thread can only hold one GPU context in the runtime API, so what you are trying to do won’t work. On Linux, pthreads (or maybe something like Boost threads if you are working in C++) is the easiest way to go. Context establishment takes time, so I wouldn’t recommend launching a thread per operation; rather, launch one persistent thread per GPU, and then use a condition variable to broadcast or unicast work to each thread.
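
Roughly, the worker side of that pattern looks like this (a minimal sketch only; struct worker, workerMain and doWork are names I have made up, and error checking is omitted):

#include <pthread.h>
#include <cuda_runtime.h>
#include "cublas.h"

/* Per-GPU worker: one persistent thread owning one context. */
struct worker {
	int device;                 /* GPU ordinal this thread owns        */
	int haveWork;               /* set by the host to dispatch a job   */
	int shutdown;               /* set by the host to stop the thread  */
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

extern void doWork(struct worker *w);   /* hypothetical: your CUBLAS calls */

static void *workerMain(void *arg)
{
	struct worker *w = (struct worker *)arg;

	/* Context establishment happens once per thread, not per operation. */
	cudaSetDevice(w->device);
	cublasInit();

	pthread_mutex_lock(&w->lock);
	for (;;) {
		while (!w->haveWork && !w->shutdown)
			pthread_cond_wait(&w->cond, &w->lock);
		if (w->shutdown)
			break;
		doWork(w);              /* e.g. cublasSetVector + cublasDgemm + ... */
		w->haveWork = 0;
	}
	pthread_mutex_unlock(&w->lock);

	cublasShutdown();
	return NULL;
}

The host then dispatches by locking a worker’s mutex, setting haveWork (or shutdown), calling pthread_cond_signal, and unlocking; pthread_cond_broadcast wakes every waiter if several workers share one condition variable.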

So would something like this work?

#include <pthread.h>
#include <stdint.h>

void *gpuFunction(void *arg)
{
	int device = (int)(intptr_t)arg;   /* GPU ordinal for this thread */

	// all CUBLAS related stuff for this GPU
	// (context setup, mem allocation, memcpy, CUBLAS call, memory free)

	return NULL;
}

int main()
{
	pthread_t thread[gpuCount];
	int i;

	for (i = 0; i < gpuCount; i++)
		pthread_create(&thread[i], NULL, gpuFunction, (void *)(intptr_t)i);

	/* wait for the workers, otherwise main exits before they finish */
	for (i = 0; i < gpuCount; i++)
		pthread_join(thread[i], NULL);

	return 0;
}

If you are only ever doing the work inside gpuFunction once per GPU per application run, then yes, that will probably work. If you are planning on calling gpuFunction more than once, there are better ways than launching a new thread and doing all the context establishment every time (which will probably cost you anything up to a second per thread launch).

Following my pseudocode above, whenever I allocate memory on the device, the thread hangs. Do you know why?

Left my crystal ball in the office again, sorry. That suggests you aren’t doing something you should, like calling cublasInit for each thread. But seriously, some actual code would be helpful.

I’m using CUBLAS a lot with multiple GPUs. To do so, I just launch one pthread (and therefore one context) per GPU, then call cublasInit as usual, which seems to create another context stacked on top of the default context. If you don’t play with contexts yourself, it works perfectly. Otherwise, all the usual rules still apply (for instance, you have to call the allocation routines from the thread that holds the context). You cannot share memory directly between GPUs, so if you want to transfer data from one GPU to another, you have to make a copy into main memory in between, and have the two threads that control the two GPUs cooperate.

Cédric
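
(To illustrate the staging pattern described above: a sketch assuming a shared host buffer hostStaging, two device buffers devBuf0 and devBuf1, and some inter-thread synchronisation, all of which are made-up names and elided here.)

/* Thread owning GPU 0: copy the buffer down into shared host memory. */
cudaMemcpy(hostStaging, devBuf0, bytes, cudaMemcpyDeviceToHost);
/* ... signal the GPU 1 thread that hostStaging is ready ... */

/* Thread owning GPU 1: once signalled, copy up onto its own device. */
cudaMemcpy(devBuf1, hostStaging, bytes, cudaMemcpyHostToDevice);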

It is my understanding that you should call cublasInit() at the worker thread level if you aren’t already calling some other runtime API function like cudaSetDevice() first; it initializes the context and some internal CUBLAS state. If that doesn’t happen, subsequent CUBLAS functions will fail.

My per thread initialization looks something like this (you will have to use your imagination as to what the structures look like):

void gpuInitialise(struct gpuThread *g)
{
	char initmsg[maxstring];

	/*
	 * Check whether the device is compute prohibited,
	 * and skip it if it is
	 */
	if (g->deviceCompMode == CU_COMPUTEMODE_PROHIBITED) {
		g->deviceAvail = 0;
		return;
	}

	/*
	 * Check the compute capability and skip the device
	 * if it doesn't report 1.3
	 */
	if ( !((g->deviceCC[0] == 1) && (g->deviceCC[1] == 3)) ) {
		g->deviceAvail = 0;
		return;
	}

	/* Attempt to establish a runtime API context */
	if ( cudaSetDevice(g->deviceNumber) != cudaSuccess) {
		g->deviceAvail = 0;
		return;
	}

	/* Attempt to initialise CUBLAS */
	if ( cublasInit() != CUBLAS_STATUS_SUCCESS ) {
		g->deviceAvail = 0;
		return;
	}

	/*
	 * Query the GPU free memory and allocate as much
	 * memory as possible. Start with the maximum and
	 * work backwards until we find a number that works
	 * (Code due to V.Volkov).
	 */
	gpuAssert( cuMemGetInfo( &g->memReserved, &g->memTotal ) );
	while( cudaMalloc( (void**)&g->memPool, g->memReserved ) != cudaSuccess )
	{
		g->memReserved -= constMb;
		if( g->memReserved < constMb )
		{
			gpuAssert( cublasShutdown() );
			g->deviceAvail = 0;
			return;
		}
	}

	/* Reset error states of both the runtime API and CUBLAS */
	(void)cudaGetLastError();
	(void)cublasGetError();

	g->deviceAvail = 1;
	sprintf(initmsg, "%d %s, Allocated %d Mb",
			g->deviceNumber, g->deviceName, g->memReserved / constMb);
	gpuDiagMsg(stderr, initmsg, __FILE__, __LINE__);
}

OK, fixed the problem. What is the smallest matrix size at which you see an improvement from using the second GPU for cublasDgemm? In my results, for a 500x500 matrix there is virtually no improvement from using two GPUs. For a 1000x1000 matrix, I get a wall time of ~0.76 seconds when calling two cublasDgemm routines on one GPU vs ~0.60 seconds on two GPUs. I suspect that for small matrices (N ≤ 500), the data transfer time between GPU and CPU dominates. Any way to improve on this for small matrices?

I don’t understand what it is you are doing. Are you trying to distribute a single matrix multiply over several GPUs? Or perform several separate multiplies at the same time?

I don’t know what CPU you have at your disposal, but for a single 500x500 dgemm() I don’t bother with the GPU at all, because it is slower than the host. If you want to do several 500x500 dgemm() calls at the same time, then multi-GPU might make sense, but barely. But if you have many to do, computing a [500×500]·[500×(n·500)] dgemm() on one GPU will be much faster. Your times seem rather long for such small matrices; what does the timing actually include? Context establishment, memory allocation, plus transfers? If so, it should be pretty obvious why persistent worker threads are a good idea.
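
To make the batching concrete, here is a sketch using the legacy CUBLAS interface, with made-up names (d_A, d_B, d_C, numMultiplies), that fuses n products C_i = A·B_i sharing the same 500x500 A into one wide call by storing the B_i contiguously, column block after column block:

/* d_B holds B_1|B_2|...|B_n side by side (column-major), so d_C comes
 * back as the matching concatenation C_1|C_2|...|C_n. */
int m = 500, k = 500, cols = 500 * numMultiplies;

cublasDgemm('N', 'N', m, cols, k,
            1.0, d_A, m,     /* the shared 500x500 A           */
                 d_B, k,     /* 500 x (n*500) concatenated Bs  */
            0.0, d_C, m);    /* 500 x (n*500) concatenated Cs  */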

I’ll be doing several (over 500) matrix multiplications with sizes ranging from (small system) 500x500 to (large system) 2000x2000. Some of the multiplies will be A*B = C where (small) A = 500x2000, B = 2000x500 and (large) A = 2000x8000, B = 8000x2000. But for now, I’m just playing around with a single matrix multiplication so I can get comfortable with the concept.

Hi all, I know this is an old thread, but I couldn’t find another one more appropriate to my question.

I’m starting to work with multi-GPU and I’d like to run some tests to learn. My goal is to implement a basic multi-GPU solution with a cuBLAS function (sgemv), but I’m a bit lost. There aren’t many examples or much info around.

My code would be this:

void *calGPU(void *i)
{
  int device = (int)(intptr_t)i;
  int totalVectors = ...;
  int lenVectors = ...;
  long int formatArray = totalVectors * lenVectors;
  float *gpu_vecArr, *gpu_Mat, *gpu_dotProd;

  ....
  ....
  // malloc host arrays (vecArr, dotProd, ...)

  cudaSetDevice(device);    /* select the GPU before any context is created */
  cublasInit();
  cublasAlloc(totalVectors * lenVectors, sizeof(float), (void**)&gpu_vecArr);
  cublasAlloc(lenVectors, sizeof(float), (void**)&gpu_Mat);
  cublasAlloc(totalVectors, sizeof(float), (void**)&gpu_dotProd);

  for (vecI = 0; vecI < lenVectors; vecI++)
  {
    cublasSetVector(lenVectors, sizeof(float), &vecArr[vecI*lenVectors], 1, gpu_Mat, 1);

    cublasSgemv('N', totalVectors, lenVectors, 1.0f, gpu_vecArr, totalVectors,
                gpu_Mat, 1, 0.0f, gpu_dotProd, 1);

    cublasGetVector(totalVectors, sizeof(float), gpu_dotProd, 1, dotProd, 1);
    ..
  }

  ...
  ...
  cublasShutdown();
  return NULL;
}

int main()
{
 ...
  pthread_t threads[gpuCount];
  for (i = 0; i < gpuCount; i++)
    pthread_create(&threads[i], NULL, calGPU, (void*)(intptr_t)i);

  /* wait for the workers to finish */
  for (i = 0; i < gpuCount; i++)
    pthread_join(threads[i], NULL);
}

How can I allocate the memory for the two GPUs? Do I need to create a handle? Would it be possible to make it work? What is the best way to distribute the vectors? Is it automatic? I think I’ve read somewhere that cuBLAS manages that.

Any help will be really appreciated.
Thanks.