Using CUBLAS with GTX295


I’m practicing with cuBLAS by doing a simple matrix multiplication.

It works on the default device.

Next, I wanted to use two GPUs (the GTX 295 has two GPUs).

Before trying that, I tried to use device 1, which is not the default device.

I just added “cudaSetDevice(1);” at the start of the code.

After adding that call, the program started behaving strangely.

Usually, an unknown error occurs in the cublasSgemm call.

Rarely, a cuBLAS initialization error occurs instead.

The problem doesn’t occur when I use my own matrix multiplication code, or when I use device 0.

Below is the code I used.

Any suggestions for fixing this problem would be very helpful.

Thank you in advance.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cublas_v2.h>

#define M 4096
#define N 4096
#define BLOCK_SIZE 16

// Fill A and B with simple index-based test values.
__global__ void initMatrix(float *A, float *B)
{
	int t = M * (BLOCK_SIZE * blockIdx.x + threadIdx.x) + BLOCK_SIZE * blockIdx.y + threadIdx.y;
	A[t] = BLOCK_SIZE * blockIdx.y + threadIdx.y + 1;
	B[t] = (int)(BLOCK_SIZE * blockIdx.y + threadIdx.y) - (int)(BLOCK_SIZE * blockIdx.x + threadIdx.x);
}

int main()
{
	cudaError_t err;
	float* A;
	float* B;
	float* C;
	float* h_C;
	float a = 1;
	float b = 0;

	err = cudaSetDevice(1);
	cublasStatus_t cublasStatus;
	cublasHandle_t cublasHandle;
	cublasStatus = cublasCreate(&cublasHandle);

	cudaMalloc(&A, M * N * sizeof(float));
	cudaMalloc(&B, M * N * sizeof(float));
	cudaMalloc(&C, M * N * sizeof(float));
	h_C = (float*)malloc(M * N * sizeof(float));

	dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
	dim3 dimGrid(M / BLOCK_SIZE, N / BLOCK_SIZE);
	initMatrix<<<dimGrid, dimBlock>>>(A, B);
	err = cudaDeviceSynchronize();
	if (err != cudaSuccess)
	{
		printf("CUDA INITMATRIX error: %s\n", cudaGetErrorString(err));
	}

	for (int i = 0; i < 100; i++)
	{
		cublasStatus = cublasSgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, M, &a, A, M, B, M, &b, C, M);

		switch (cublasStatus)
		{
		case CUBLAS_STATUS_SUCCESS:
			printf("%d: success\n", i);
			break;
		default:
			printf("stat: %d\n", cublasStatus);
			break;
		}

		err = cudaDeviceSynchronize(); //Error occurred here.
		if (err != cudaSuccess)
		{
			printf("CUDA synchronize error: %s\n", cudaGetErrorString(err));
		}
	}

	err = cudaMemcpy(h_C, C, M * M * sizeof(float), cudaMemcpyDeviceToHost);
	if (err != cudaSuccess)
	{
		printf("CUDA Memcpy error: %s\n", cudaGetErrorString(err));
	}

	return 0;
}


For your code to work on the non-default GPU, you need to establish a context on the GPU you selected with cudaSetDevice() before initializing CUBLAS. Something like this:

err = cudaSetDevice(1);
cudaFree(0); // This will force a context to be created on device 1
cublasStatus_t cublasStatus;
cublasHandle_t cublasHandle;
cublasStatus = cublasCreate(&cublasHandle); // CUBLAS now uses the context on device 1

Any other action that establishes a context (or explicitly calling the driver API context-creation functions) would work equally well.

Thank you for replying.

However, cudaFree(0) doesn’t work either. The CUDA Toolkit Reference Manual says that if devPtr is 0, no operation is performed.

I also tried creating a context explicitly, something like this:


err = cudaSetDevice(1);
CUdevice cuDevice;
cuDeviceGet(&cuDevice, 1);
CUcontext cuContext;
cuCtxCreate(&cuContext, 0, cuDevice);
cublasStatus_t cublasStatus;


This doesn’t work either. I get a similar error as before.

If I change cuDeviceGet(&cuDevice, 1); to cuDeviceGet(&cuDevice, 0);, it works fine on device 0.

What happened to my device 1? Any other suggestions?