Does CUBLAS 4 RC-2 support using multiple contexts from a single host-thread?

I’d like to use the support in CUDA 4.0 for handling multiple contexts (devices) from a single host thread. As I understand it, using the driver API, I can create one context per device, and asynchronously launch kernels and do memory transfers by passing around the right context and stream handles, all within 1 host thread.

However, page 9 of the CUBLAS Library reference says

I can see how this is necessary with the runtime API, which doesn’t explicitly use a context handle. But I’m wondering if CUBLAS really has host-thread-affinity when using the driver API, or can I “cheat” this requirement by calling cuCtxSetCurrent in between CUBLAS calls, like I do for the driver API?

In other words, is it supported to write a single-threaded program resembling this pseudo-code?

c0 = cuCtxCreate(device0)
cuCtxSetCurrent(c0)
s0 = cuStreamCreate()
b0 = cublasCreate()

c1 = cuCtxCreate(device1)
cuCtxSetCurrent(c1)
s1 = cuStreamCreate()
b1 = cublasCreate()

... // upload data to devices

cuCtxSetCurrent(c0)
cublasSetStream(b0, s0)
cublasSgemm(b0, ... )

cuCtxSetCurrent(c1)
cublasSetStream(b1, s1)
cublasSgemm(b1, ... )

... // do other work

cuCtxSetCurrent(c0)
cuStreamSynchronize(s0)

cuCtxSetCurrent(c1)
cuStreamSynchronize(s1)

... // download data and cleanup

The documentation is a bit misphrased, but you've got the right idea: the CUBLAS context (handle) must stay associated with the same CUDA context.

So the sequence that you propose looks fine. But you could also simply use the cudaSetDevice routine. The sequence would then look like this:
cudaSetDevice(0)
s0 = cudaStreamCreate()
b0 = cublasCreate()

cudaSetDevice(1)
s1 = cudaStreamCreate()
b1 = cublasCreate()

... // upload data to devices

cudaSetDevice(0)
cublasSetStream(b0, s0)
cublasSgemm(b0, ... )

cudaSetDevice(1)
cublasSetStream(b1, s1)
cublasSgemm(b1, ... )

... // do other work

Excellent, thank you. It seemed like a case of the documentation just being a bit out of date, but I wanted to be sure. It's nice to see the APIs moving towards being completely stateless and thread-safe.

Hi

We wrote code like this, but it doesn't work. For example, the execution time for 1 context with 1 GPU is 10 s, and for 2 contexts with 2 GPUs it is 20 s. Hence, either the context switching doesn't work or the computations are running synchronously.

Why doesn't it work?

I use CUDA 4.0.13 and a Tesla C2050.

I can’t say without seeing some example code. Are you using non-null streams? Are you delaying all “Synchronize” calls until after you’ve launched all your kernels? Which specific API calls are taking this long?
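
To show what I mean, here is a rough, untested sketch of the pattern I have in mind. The kernel name (myKernel), the buffers, and the stream array s[] are just placeholders; the point is that every copy and launch is queued on a per-device, non-null stream first, and the Synchronize calls all come at the end:

// Hypothetical sketch: h_in/h_out are pinned (cudaMallocHost) host buffers,
// d_in/d_out are device buffers, s[dev] is a non-null stream created on device dev.
for (int dev = 0; dev < numDevices; dev++) {
    cudaSetDevice(dev);
    cudaMemcpyAsync(d_in[dev], h_in[dev], bytes, cudaMemcpyHostToDevice, s[dev]);
    myKernel<<<blocks, threads, 0, s[dev]>>>(d_in[dev], d_out[dev]);
    cudaMemcpyAsync(h_out[dev], d_out[dev], bytes, cudaMemcpyDeviceToHost, s[dev]);
}
// Only now do we wait; synchronizing inside the loop above would serialize the devices.
for (int dev = 0; dev < numDevices; dev++) {
    cudaSetDevice(dev);
    cudaStreamSynchronize(s[dev]);
}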

Sorry, I can get the code only tomorrow. My code is practically identical to that. I'll try to answer your questions: I create the streams (I tried both cudaStreamCreate and cuStreamCreate) in a loop from 0 to 4; iteration 0 uses stream[0], iteration 1 uses stream[1], and so on.

I take the time with MPI_Wtime(). Before MPI_Wtime() I call cudaThreadSynchronize().

My problem is that no matter how many devices I use, the time increases as if I were calling cublasDdot sequentially. Why? I create a cublasHandle and a stream in each device's context, then I set the context and call cublasDdot with that handle, but the time still increases :(

P.S. Sorry for my English.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define kol 1000
#define kol_device 4

#define useMultiGPU
#ifdef useMultiGPU
#define cuSetDevice(dev) cudaSetDevice(dev)
#else
#define cuSetDevice(dev)
#endif

#define iterations 1000

// Variables
double *h_A, *h_B, *h_C[kol_device];
double *d_A[kol_device], *d_B[kol_device], *d_C[kol_device];
double seconds, sum, ddot = 0.0;

int main(int argc, char** argv)
{
cudaStream_t *streams = (cudaStream_t*) malloc(kol_device * sizeof(cudaStream_t));
int N=kol;

//////////////////////////!!!/////////////////
long size = N * sizeof(double);
long size2 = sizeof(double);
//////////////////////////!!!/////////////////
int flag=0;
int n;
double start_mpi = 0,
end_mpi = 0,
sum_seconds = 0.0,
my_gpu_allsum = 0.0,
sum=0.0,
my_gpu_sum[kol_device];
long n_part, size_part;

cublasHandle_t hndl[kol_device];


n_part = N/kol_device+1;
size_part = n_part * sizeof(double);

// Allocate and initialize input vectors h_A and h_B in host memory.
// Allocate kol_device * n_part elements (slightly more than N) so the
// per-device slices copied below never read past the end of the buffers.
h_A = (double*)malloc(kol_device * size_part);
h_B = (double*)malloc(kol_device * size_part);
for (long i = 0; i < (long)kol_device * n_part; i++) { h_A[i] = 1.0; h_B[i] = 1.0; }

for (int number_device=0; number_device<kol_device; number_device++)
{
cuSetDevice(number_device);
cudaMallocHost(&h_C[number_device], size_part);

// Allocate vectors in device memory 
cudaMalloc(&d_A[number_device], size_part);
cudaMalloc(&d_B[number_device], size_part);
cudaMalloc(&d_C[number_device], size_part);

// Copy vectors from host memory to device memory
cudaMemcpy(d_A[number_device], h_A+number_device*n_part, size_part, cudaMemcpyHostToDevice) ;
cudaMemcpy(d_B[number_device], h_B+number_device*n_part, size_part, cudaMemcpyHostToDevice) ;
cudaStreamCreate(&(streams[number_device])) ;
cublasCreate(&hndl[number_device]);

}
MPI_Init(&argc, &argv);
for ( int III=0; III < iterations; III++)
{
n=n_part;
start_mpi = 0, end_mpi = 0;
// ******** Start timer ***************
cudaThreadSynchronize();
start_mpi = MPI_Wtime();
// ******** Eof Start timer ***********

	for (int number_device=0; number_device<kol_device; number_device++)
	{
		cuSetDevice(number_device);

		cublasSetStream(hndl[number_device], streams[number_device]) ;
		cublasDdot(hndl[number_device], n_part, d_A[number_device], 1, d_B[number_device],1, & ddot);
		sum+=ddot;
	} 

// ******** Stop timer *******************
// Synchronize every device (cudaDeviceSynchronize only waits on the current one)
for (int number_device=0; number_device<kol_device; number_device++)
{
    cuSetDevice(number_device);
    cudaDeviceSynchronize();
}
end_mpi = MPI_Wtime();
sum_seconds+=end_mpi - start_mpi;
// ******** Eof Stop timer ***************

}
//printf("result ddot = %f\n", sum);
for (int number_device=0; number_device<kol_device; number_device++)
{
    cuSetDevice(number_device);
    cublasDestroy(hndl[number_device]);
    cudaFree(d_A[number_device]);
    cudaFree(d_B[number_device]);
    cudaFree(d_C[number_device]);
    cudaFreeHost(h_C[number_device]);
    cudaStreamDestroy(streams[number_device]);
}

free(h_A);
free(h_B);
free(streams);

MPI_Finalize();
return 0;
}

I haven’t run your code, but I believe that you’re not seeing a speedup because running ddot with n~1000 elements is already very fast, and the time cost is dominated by launching the kernel and downloading the result, rather than the computation itself. There is a fixed latency for each kernel launch and PCI bus transfer, and I expect that those costs are much larger than your kernel execution time, given how small your dataset is.

For example, a Tesla card runs at hundreds of gigaflops in double precision, and ddot costs about 2 FLOPs per element (1 multiply + 1 accumulate), so the actual kernel execution time for ddot with n = 1000 (roughly 2000 FLOPs) should be well under 1 us (10^-6 sec). But I believe the kernel launch latency and the PCI bus transfer latency are each on the order of 10 us, and those are serialized on the host thread.

Try increasing your problem to something bigger: maybe a DGEMM on 1000 x 1000 matrices, roughly along the lines of the sketch below. I expect you'll start to see device-parallel speedups there.
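
For example, something like this (a sketch only, not tested). It reuses your per-device handles hndl[i] and streams streams[i], and assumes n-by-n matrices dA[i], dB[i], dC[i] have already been allocated and uploaded on each device:

const int    n     = 1000;
const double alpha = 1.0, beta = 0.0;

// Queue one DGEMM per device; cublasDgemm returns as soon as the work is enqueued.
for (int dev = 0; dev < kol_device; dev++) {
    cudaSetDevice(dev);
    cublasSetStream(hndl[dev], streams[dev]);
    cublasDgemm(hndl[dev], CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA[dev], n, dB[dev], n, &beta, dC[dev], n);
}
// Wait only after every GEMM has been queued, so the devices run concurrently.
for (int dev = 0; dev < kol_device; dev++) {
    cudaSetDevice(dev);
    cudaStreamSynchronize(streams[dev]);
}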

Good luck!

Big thanks, I will try.

A question related to this:

My code looks like this:

for(i=0; i<numDevices; i++)
{
    cudaSetDevice(i);
    cublasCreate(&handle[i]);
    cublasDgemm(...);
    cublasDestroy(handle[i]);
}

When I use a single thread to manage multiple contexts for a certain data size (matrix of order 6144), it takes around 240 s, whereas when I use multiple threads it takes 66 s. Why is there such a huge difference when the kernels are launched asynchronously from the same thread? I checked the GPU utilization in the CUDA profiler, and the utilization percentages of the GPUs are roughly {25, 50, 75, 99}, whereas with multiple host threads and multiple contexts they are around {98, 100, 82, 73}.

You should put cublasDestroy in a different loop, because there is a cudaFree/cudaThreadSynchronize inside it (currently all your code is serialized). In general, it is also better to create all your handles once, in a separate loop.

Your code should be like this:

for(i=0; i<numDevices; i++)
{
    cudaSetDevice(i);
    cublasCreate(&handle[i]);
}

for(i=0; i<numDevices; i++)
{
    cudaSetDevice(i);
    cublasDgemm(...);
    cublasDtrsm(...);
    ...
}

for(i=0; i<numDevices; i++)
{
    cudaSetDevice(i);
    cublasDestroy(handle[i]);
}

Many thanks, this did it!