Multi-GPU CUBLAS

I would like to use all of my GPUs to run CUBLAS routines concurrently. As an example, I can do something like this:

cublasAlloc(…);

for (i = 0; i < gpuCount; i++)
{
	cuDeviceGet(&dev, i);
	cublasSetVector(…);
	cublasDgemm(…);
}

cublasFree(…);


Now, memory does seem to get allocated on all of the GPUs and cublasDgemm does run on the different GPUs, but all of this happens sequentially rather than in parallel. Is this because I need multiple host threads? If so, what would be the easiest (and best) way to create them, given that I’m working on Linux machines (pthreads?)? Do I create these threads inside the for loop? Thanks.

Each host thread can only hold one GPU context in the runtime API, so what you are trying to do won’t work. On Linux, pthreads (or maybe something like Boost threads if you are working in C++) is the easiest way to go. Context establishment takes time, so I wouldn’t recommend launching a thread per operation; rather, launch one persistent thread per GPU, and then use a condition variable to broadcast or unicast work to each thread.
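
Roughly, the worker side of that pattern looks like this (a minimal sketch only; struct worker, workerMain and doWork are names I have made up, and error checking is omitted):

#include <pthread.h>
#include <cuda_runtime.h>
#include "cublas.h"

/* Per-GPU worker: one persistent thread owning one context. */
struct worker {
	int device;                 /* GPU ordinal this thread owns        */
	int haveWork;               /* set by the host to dispatch a job   */
	int shutdown;               /* set by the host to stop the thread  */
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

extern void doWork(struct worker *w);   /* hypothetical: your CUBLAS calls */

static void *workerMain(void *arg)
{
	struct worker *w = (struct worker *)arg;

	/* Context establishment happens once per thread, not per operation. */
	cudaSetDevice(w->device);
	cublasInit();

	pthread_mutex_lock(&w->lock);
	for (;;) {
		while (!w->haveWork && !w->shutdown)
			pthread_cond_wait(&w->cond, &w->lock);
		if (w->shutdown)
			break;
		doWork(w);              /* e.g. cublasSetVector + cublasDgemm + ... */
		w->haveWork = 0;
	}
	pthread_mutex_unlock(&w->lock);

	cublasShutdown();
	return NULL;
}

The host then dispatches by locking a worker’s mutex, setting haveWork (or shutdown), calling pthread_cond_signal, and unlocking; pthread_cond_broadcast wakes every waiter if several workers share one condition variable.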

So would something like this work?

#include <pthread.h>
#include <stdint.h>

void *gpuFunction(void *arg)
{
	int device = (int)(intptr_t)arg;   /* GPU ordinal for this thread */

	// all CUBLAS related stuff for this GPU
	// (context setup, mem allocation, memcpy, CUBLAS call, memory free)

	return NULL;
}

int main()
{
	pthread_t thread[gpuCount];
	int i;

	for (i = 0; i < gpuCount; i++)
		pthread_create(&thread[i], NULL, gpuFunction, (void *)(intptr_t)i);

	/* wait for the workers, otherwise main exits before they finish */
	for (i = 0; i < gpuCount; i++)
		pthread_join(thread[i], NULL);

	return 0;
}

If you are only ever doing the work inside gpuFunction once per GPU per application run, then yes, that will probably work. If you are planning on calling gpuFunction more than once, there are better ways than launching a new thread and doing all the context establishment every time (which will probably cost you anything up to a second per thread launch).

Following my pseudocode above, whenever I allocate memory on the device, the thread hangs. Do you know why?

Left my crystal ball in the office again, sorry. That suggests you aren’t doing something you should, like calling cublasInit for each thread. But seriously, some actual code would be helpful.

I’m using CUBLAS a lot with multiple GPUs. To do so, I just launch one pthread (and therefore one context) per GPU, then call cublasInit as usual, which seems to create another context stacked on top of the default context. If you don’t play with contexts yourself, it works perfectly. Otherwise, all the usual rules still apply (for instance, you have to call the allocation routines from the thread that holds the context). You cannot share memory directly between GPUs, so if you want to transfer data from one GPU to another, you have to make a copy into main memory in between, and have the two threads that control the two GPUs cooperate.

Cédric
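
(To illustrate the staging pattern described above: a sketch assuming a shared host buffer hostStaging, two device buffers devBuf0 and devBuf1, and some inter-thread synchronisation, all of which are made-up names and elided here.)

/* Thread owning GPU 0: copy the buffer down into shared host memory. */
cudaMemcpy(hostStaging, devBuf0, bytes, cudaMemcpyDeviceToHost);
/* ... signal the GPU 1 thread that hostStaging is ready ... */

/* Thread owning GPU 1: once signalled, copy up onto its own device. */
cudaMemcpy(devBuf1, hostStaging, bytes, cudaMemcpyHostToDevice);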

It is my understanding that you should call cublasInit() at the worker thread level if you aren’t already calling some other runtime API function like cudaSetDevice() first; it initializes the context and some internal CUBLAS state. If that doesn’t happen, subsequent CUBLAS functions will fail.

My per thread initialization looks something like this (you will have to use your imagination as to what the structures look like):

void gpuInitialise(struct gpuThread *g)
{
	char initmsg[maxstring];

	/*
	 * Check whether the device is compute prohibited,
	 * and skip it if it is
	 */
	if (g->deviceCompMode == CU_COMPUTEMODE_PROHIBITED) {
		g->deviceAvail = 0;
		return;
	}

	/*
	 * Check the compute capability and skip the device
	 * if it doesn't report 1.3
	 */
	if ( !((g->deviceCC[0] == 1) && (g->deviceCC[1] == 3)) ) {
		g->deviceAvail = 0;
		return;
	}

	/* Attempt to establish a runtime API context */
	if ( cudaSetDevice(g->deviceNumber) != cudaSuccess) {
		g->deviceAvail = 0;
		return;
	}

	/* Attempt to initialise CUBLAS */
	if ( cublasInit() != CUBLAS_STATUS_SUCCESS ) {
		g->deviceAvail = 0;
		return;
	}

	/*
	 * Query the GPU free memory and allocate as much
	 * memory as possible. Start with the maximum and
	 * work backwards until we find a number that works
	 * (Code due to V.Volkov).
	 */
	gpuAssert( cuMemGetInfo( &g->memReserved, &g->memTotal ) );
	while( cudaMalloc( (void**)&g->memPool, g->memReserved ) != cudaSuccess )
	{
		g->memReserved -= constMb;
		if( g->memReserved < constMb )
		{
			gpuAssert( cublasShutdown() );
			g->deviceAvail = 0;
			return;
		}
	}

	/* Reset error states of both the runtime API and CUBLAS */
	(void)cudaGetLastError();
	(void)cublasGetError();

	g->deviceAvail = 1;
	sprintf(initmsg, "%d %s, Allocated %d Mb",
			g->deviceNumber, g->deviceName, g->memReserved / constMb);
	gpuDiagMsg(stderr, initmsg, __FILE__, __LINE__);
}

OK, fixed the problem. What is the smallest matrix size at which you see an improvement from using the second GPU for cublasDgemm? In my results, for a 500x500 matrix there is virtually no improvement from using two GPUs. For a 1000x1000 matrix, I get a wall time of ~0.76 seconds when calling two cublasDgemm routines on one GPU vs ~0.60 seconds on two GPUs. I suspect that for small matrices (N ≤ 500), the data transfer time between GPU and CPU dominates. Any way to improve on this for small matrices?

I don’t understand what it is you are doing. Are you trying to distribute a single matrix multiply over several GPUs? Or perform several separate multiplies at the same time?

I don’t know what CPU you have at your disposal, but for a single 500x500 dgemm() I don’t bother with the GPU at all, because it is slower than the host. If you want to do several 500x500 dgemm() calls at the same time, then multi-GPU might make sense, but barely. But if you have many to do, computing a [500×500]·[500×(n·500)] dgemm() on one GPU will be much faster. Your times seem rather long for such small matrices; what does the timing actually include? Context establishment, memory allocation, plus transfers? If so, it should be pretty obvious why persistent worker threads are a good idea.
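
To make the batching concrete, here is a sketch using the legacy CUBLAS interface, with made-up names (d_A, d_B, d_C, numMultiplies), that fuses n products C_i = A·B_i sharing the same 500x500 A into one wide call by storing the B_i contiguously, column block after column block:

/* d_B holds B_1|B_2|...|B_n side by side (column-major), so d_C comes
 * back as the matching concatenation C_1|C_2|...|C_n. */
int m = 500, k = 500, cols = 500 * numMultiplies;

cublasDgemm('N', 'N', m, cols, k,
            1.0, d_A, m,     /* the shared 500x500 A           */
                 d_B, k,     /* 500 x (n*500) concatenated Bs  */
            0.0, d_C, m);    /* 500 x (n*500) concatenated Cs  */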

I’ll be doing several (over 500) matrix multiplications with sizes ranging from (small system) 500x500 to (large system) 2000x2000. Some of the multiplies will be A*B = C where (small) A = 500x2000, B = 2000x500 and (large) A = 2000x8000, B = 8000x2000. But for now, I’m just playing around with a single matrix multiplication so I can get comfortable with the concept.

Hi all, I know this is an old thread, but I couldn’t find another one more appropriate to my question.

I’m starting to work with multi-GPU and I’d like to run some tests to learn. My goal is to implement a basic multi-GPU solution with a cuBLAS function (sgemv), but I’m a bit lost. There aren’t many examples or much info around.

My code would be this:

void *calGPU(void *i)
{
  int device = (int)(intptr_t)i;
  int totalVectors = ...;
  int lenVectors = ...;
  long int formatArray = totalVectors * lenVectors;
  float *gpu_vecArr, *gpu_Mat, *gpu_dotProd;

  ....
  ....
  // malloc host arrays (vecArr, dotProd, ...)

  cudaSetDevice(device);    /* select the GPU before any context is created */
  cublasInit();
  cublasAlloc(totalVectors * lenVectors, sizeof(float), (void**)&gpu_vecArr);
  cublasAlloc(lenVectors, sizeof(float), (void**)&gpu_Mat);
  cublasAlloc(totalVectors, sizeof(float), (void**)&gpu_dotProd);

  for (vecI = 0; vecI < lenVectors; vecI++)
  {
    cublasSetVector(lenVectors, sizeof(float), &vecArr[vecI*lenVectors], 1, gpu_Mat, 1);

    cublasSgemv('N', totalVectors, lenVectors, 1.0f, gpu_vecArr, totalVectors,
                gpu_Mat, 1, 0.0f, gpu_dotProd, 1);

    cublasGetVector(totalVectors, sizeof(float), gpu_dotProd, 1, dotProd, 1);
    ..
  }

  ...
  ...
  cublasShutdown();
  return NULL;
}

int main()
{
 ...
  pthread_t threads[gpuCount];
  for (i = 0; i < gpuCount; i++)
    pthread_create(&threads[i], NULL, calGPU, (void*)(intptr_t)i);

  /* wait for the workers to finish */
  for (i = 0; i < gpuCount; i++)
    pthread_join(threads[i], NULL);
}

How can I allocate the memory for the two GPUs? Do I need to create a handle? Would it be possible to make it work? What is the best way to distribute the vectors? Is it automatic? I think I’ve read somewhere that cuBLAS manages that.

Any help will be really appreciated.
Thanks.