CUBLAS matrix-vector multiplication

I’m trying to optimise a simple matrix-vector multiplication… nothing fancy here, but I can’t quite work out CUBLAS. I found the following code in an old thread here, but I can’t get the setup of A, B, and C right.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <cublas.h>

#define hA 10
#define wA 10
#define hB 1
#define wB 10

int main(void)
{
	float *devPtrA , *devPtrB , *devPtrC;

	cublasInit(); // initialization of CUDA application
	cublasAlloc( hA*wA, sizeof(float), (void**) &devPtrA);
	cublasAlloc( wA*wB, sizeof(float), (void**) &devPtrB);
	cublasAlloc( hA*wB, sizeof(float), (void**) &devPtrC);

	// transfer host data to device
	cublasSetMatrix( hA, wA, sizeof(float), A, hA, devPtrA, hA);
	cublasSetMatrix( wA, wB, sizeof(float), B, wA, devPtrB, wA);

	// compute C = A*B in device
	float alpha = 1.0;
	float beta  = 0.0;
	cublasDgemm('N', 'N', hA, wB, wA, alpha, devPtrA, hA,
		devPtrB, wA, beta, devPtrC, hA);

	cublasGetMatrix(hA, wB, sizeof(float), devPtrC, hA, C, hA);
	cublasShutdown();
}

Can someone either point out how to set this up properly, or show me an example of a simple matrix-vector multiplication that’s well optimised?

DGEMM is the BLAS level 3 matrix-matrix product in double precision. You want SGEMV for the equivalent BLAS level 2 single precision matrix-vector product.

I suggest you look at the definitions of hB and wB. In addition, if you really want to do matrix-vector, why not use the sgemv routine?

MMB

OK, still stuck… I don’t quite get the CUBLAS documentation… what do I call cublasSgemv with?

My current code looks like this:

#include <stdio.h>
#include <cublas.h>

#define M 10
#define N 10
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

int main(void)
{
	int i,j;
	float *devPtrA , *devPtrB , *devPtrC;
	float *a, *b;
	a = (float *)malloc (M * N * sizeof (*a));
	b = (float *)malloc (N * sizeof (*b));
	for (j = 0; j < N; j++) {
		b[j] = j;
		for (i = 0; i < M; i++) {
			a[IDX2C(i,j,M)] = i * M + j + 1;
		}
	}

	for (j = 0; j < N; j++) {
		for (i = 0; i < M; i++) {
			printf ("%7.0f", a[IDX2C(i,j,M)]);
		}
		printf ("\n");
	}
	printf ("\n");

	cublasInit(); // initialization of CUDA application
	cublasAlloc( M*N, sizeof(float), (void**) &devPtrA);	//matrix A
	cublasAlloc( N, sizeof(float), (void**) &devPtrB);	//vector B
	cublasAlloc( N, sizeof(float), (void**) &devPtrC);	//vector C
	// transfer host data to device
	cublasSetMatrix( M, N, sizeof(float), a, N, devPtrA, N);
	cublasSetVector( N, sizeof(float), b, N, devPtrB, N);

	// compute C = A*B in device
	float alpha = 1.0;
	float beta  = 0.0;
//	cublasSgemv('N', M, N, alpha, devPtrA, alpha, devPtrB, M, devPtrC);

	cublasGetVector (N, sizeof(float), devPtrC, N, b, N);
	cublasShutdown();
}

It is exactly the same call as a standard netlib BLAS sgemv (or page 67 of the CUBLAS pdf in the toolkit). You are missing the pitch of the matrix (lda) and the spacing of the two vectors (incx and incy):

void
cublasSgemv (char trans, int m, int n, float alpha,
             const float *A, int lda, const float *x,
             int incx, float beta, float *y, int incy)
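
For reference, the parameters map like this (this is standard BLAS semantics, spelled out rather than quoted from the thread):

	// sgemv computes  y = alpha * op(A) * x + beta * y
	// op(A) = A    when trans == 'N'  (A is m-by-n, column major, lda >= m)
	// op(A) = A^T  when trans == 'T'
	// x has n elements, read with stride incx
	// y has m elements, written with stride incy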

Now cublasSgemv is not returning an error, but I’m still failing to understand something here as I’m getting garbage back.

Maybe I’m not getting the result properly??

#include <stdio.h>
#include <cublas.h>

#define M 10
#define N 10
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

int main(void)
{
	int i,j;
	float *devPtrA , *devPtrB , *devPtrC;
	float *a, *b;
	a = (float *)malloc (M * N * sizeof (float));
	b = (float *)malloc (N * sizeof (float));
	for (j = 0; j < N; j++) {
		b[j] = j;
		for (i = 0; i < M; i++) {
			a[IDX2C(i,j,M)] = i * M + j + 1;
		}
	}

	for (j = 0; j < N; j++) {
		for (i = 0; i < M; i++) {
			printf ("%7.0f", a[IDX2C(i,j,M)]);
		}
		printf ("\n");
	}
	printf ("\n");

	cublasInit(); // initialization of CUDA application
	cublasAlloc( M*N, sizeof(float), (void**) &devPtrA);	//matrix A
	cublasAlloc( N, sizeof(float), (void**) &devPtrB);	//vector B
	cublasAlloc( N, sizeof(float), (void**) &devPtrC);	//vector C
	// transfer host data to device
	cublasSetMatrix( M, N, sizeof(float), a, N, devPtrA, N);
	cublasSetVector( N, sizeof(float), b, N, devPtrB, N);

	// compute C = A*B in device
	float alpha = 1.0;
	float beta  = 0.0;
	cublasSgemv('N', M, N, alpha, devPtrA, N, devPtrB, N, beta, devPtrC, N);

	cublasGetVector (N, sizeof(float), devPtrC, N, b, 1);

	for (j = 0; j < N; j++) printf ("%7.0f", b[j]);
	printf ("\n");
	cublasShutdown();
}

incx and incy should both be 1 in this case, and lda should be M (OK, this one is square so it shouldn’t matter, but in a less trivial case it would). Again, this is all in the documentation (or any reference on BLAS).
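
Putting that together for your M x N case, the calls would look roughly like this (an untested sketch following the point above; note the leading dimension M in both the copy and the gemv call, and that the result vector y = A*x has M elements):

	cublasSetMatrix( M, N, sizeof(float), a, M, devPtrA, M);	// lda = number of rows = M
	cublasSetVector( N, sizeof(float), b, 1, devPtrB, 1);
	cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);
	cublasGetVector( M, sizeof(float), devPtrC, 1, b, 1);	// result has M entries, contiguous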

Thank you for your patience with me here, but this is still returning garbage…

cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);

	cublasGetVector (N, sizeof(float), devPtrC, N, b, 1);

	for (j = 0; j < N; j++) printf ("%7.0f", b[j]);

So time to add some error checking. All those CUBLAS helper functions return values, and there is a function for polling the status from the BLAS calls themselves. They should all be asserted or otherwise checked. That at least will confirm whether the sgemv call actually runs or not.
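
For example, with the legacy API something along these lines (a rough sketch; cublasGetError() polls the status of the most recent BLAS call):

	cublasStatus stat;
	stat = cublasSetMatrix( M, N, sizeof(float), a, M, devPtrA, M);
	if (stat != CUBLAS_STATUS_SUCCESS) printf("cublasSetMatrix failed: %d\n", (int)stat);

	cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);
	stat = cublasGetError();	// status of the preceding BLAS call
	if (stat != CUBLAS_STATUS_SUCCESS) printf("cublasSgemv failed: %d\n", (int)stat);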


OK, thanks for your help. I’ve implemented matrix-vector multiplication as a CUDA kernel and am using the Thrust library for scan… I was wondering if CUBLAS would be much faster, as it’s probably better optimised than my code, but I don’t want to commit much time to learning CUBLAS without knowing in advance that it will be that much better.

OK, finally I have it working, though there is a problem if the matrix is not square…

It runs roughly twice as fast as my own CUDA version. Here is the code:

#include <stdio.h>
#include <cublas.h>

#define M 10
#define N 10
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

int main(void)
{
	cudaSetDevice(0);
	int i,j;
	float *devPtrA , *devPtrB , *devPtrC;
	float *a, *b;

	//timer stuff
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	a = (float *)malloc (M * N * sizeof (float));
	b = (float *)malloc (N * sizeof (float));
	for (j = 0; j < N; j++) {
		b[j] = 1;
		for (i = 0; i < M; i++) {
			a[IDX2C(i,j,M)] = i * M + j + 1;
		}
	}

	for (j = 0; j < N; j++) {
		for (i = 0; i < M; i++) {
			printf ("%7.0f", a[IDX2C(i,j,M)]);
		}
		printf ("\n");
	}
	printf ("\n");
	for(i=0;i<N;i++) printf("%f ",b[i]);
	printf ("\n\n");

	cublasInit(); // initialization of CUDA application
	cublasAlloc( M*N, sizeof(float), (void**) &devPtrA);	//matrix A
	cublasAlloc( N, sizeof(float), (void**) &devPtrB);	//vector B
	cublasAlloc( N, sizeof(float), (void**) &devPtrC);	//vector C
	// transfer host data to device
	// note: the leading-dimension arguments here should be M (the row count), not N -
	// with a square matrix it makes no difference, but this is why the non-square case breaks
	cublasSetMatrix( M, N, sizeof(float), a, N, devPtrA, N);
	cublasSetVector( N, sizeof(float), b, 1, devPtrB, 1);

	// compute C = A*B in device
	float alpha = 1.0;
	float beta  = 0.0;

	//start timer
	cudaEventRecord(start, 0);

	cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);

	// block until the device has completed
	cudaThreadSynchronize();
	//end timer
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);

	// the result vector y = A*x has M elements (here M == N)
	cublasGetVector (N, sizeof(float), devPtrC, 1, b, 1);

	for (j = 0; j < N; j++) printf ("%7.0f", b[j]);
	printf("\n\ntime = %f\n\n",elapsedTime);
	cublasShutdown();
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}

Hmm, there are still problems with the CUBLAS code in the previous post, but it works for a square matrix. However, I made a mistake in the timing comparison: that code is for a 10x10 matrix, but I was comparing it against my own, non-optimised CUDA code, which was running a 100x100 matrix and then applying a sigmoid function to the resulting vector. It turns out the CUBLAS implementation is actually slower than my own code, which can’t be right??? Maybe I’m misunderstanding something…

I include my own version of matrix-vector multiplication (followed by sigmoid transformation as the basis of a neural network update).

Can someone verify my timing (adjust M and N to 100 in the CUBLAS version in the previous post)?

Can you also verify the overall program execution time?
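
As an aside on the timing itself: the very first CUBLAS call after cublasInit (and the first kernel launch) pays a one-off setup cost, so a fairer measurement is to do a warm-up call and then average over many repetitions, roughly along these lines (a sketch, reusing the events from the code above):

	// warm-up call - the first launch includes one-off initialisation overhead
	cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);
	cudaThreadSynchronize();

	const int reps = 100;
	float elapsedTime;
	cudaEventRecord(start, 0);
	for (int r = 0; r < reps; ++r)
		cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&elapsedTime, start, stop);
	printf("average time per call = %f ms\n", elapsedTime / reps);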

// incrementArray.cu
#include <stdio.h>
#include <assert.h>
#include <math.h>
#include <cuda.h>

//-----------------------------------------------------------------------------------
void NN_OnHost(float *activity, float *weights, int N)
{
	int i,j;
	float new_activity[N];
	for (i=0;i<N;i++) {
		new_activity[i] = 0;
		for(j=0;j<N;j++) {
			new_activity[i] += activity[j] * weights[(j*N)+i];
		}
	}
	for (i=0; i < N; i++) {
		activity[i] = 1/(1+exp(-new_activity[i]));
	}
}

//-----------------------------------------------------------------------------------
__global__ void NN_OnDevice(float *activity, float *weights, float *new_activity, int N)
{
	int j, idx = threadIdx.x;
	new_activity[idx] = 0;
	for(j=0;j<N;j++) {
		new_activity[idx] += activity[j] * weights[(j*N)+idx];
	}
	__syncthreads();
	activity[idx] = 1/(1+exp(-new_activity[idx]));
}

//-----------------------------------------------------------------------------------
int main(void)
{
	cudaSetDevice(0);
	float *activity_h, *weights_h, *new_activity_h;	// pointers to host memory
	float *activity_d, *weights_d, *new_activity_d;	// pointers to device memory
	int i, j, N = 100;
	size_t size = N*sizeof(float);

	//timer stuff
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// allocate arrays on host
	activity_h = (float *)malloc(size);
	new_activity_h = (float *)malloc(size);
	weights_h = (float *)malloc(N*size);		// N*N weights

	// allocate arrays on device
	cudaMalloc((void **) &activity_d, size);
	cudaMalloc((void **) &new_activity_d, size);
	cudaMalloc((void **) &weights_d, N*size);

	// initialization of host data
	for (i=0; i<N; i++) {
		activity_h[i] = (float(rand() % 100) / 100);
		for(j=0; j<N; j++) {
			weights_h[(j*N)+i] = (float(rand() % 200) / 100) - 1;
		}
	}

	// copy data from host to device
	cudaMemcpy(activity_d, activity_h, sizeof(float)*N, cudaMemcpyHostToDevice);
	cudaMemcpy(weights_d, weights_h, sizeof(float)*N*N, cudaMemcpyHostToDevice);

	// do calculation on host
	NN_OnHost(activity_h, weights_h, N);
	for (i=0; i<10; i++) printf("%f ",activity_h[i]);
	printf("\n");

	//start timer
	cudaEventRecord(start, 0);

	// do calculation on device
	NN_OnDevice <<< 1, N >>> (activity_d, weights_d, new_activity_d, N);
	// block until the device has completed
	cudaThreadSynchronize();
	//end timer
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);

	// retrieve result from device and store in new_activity_h
	cudaMemcpy(new_activity_h, activity_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
	for (i=0; i<10; i++) printf("%f ",new_activity_h[i]);
	printf("\n");
	printf("time = %f\n\n",elapsedTime);

	// cleanup
	free(activity_h); free(weights_h); free(new_activity_h);
	cudaFree(activity_d); cudaFree(weights_d); cudaFree(new_activity_d);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}

I also have a partially optimised version using shared memory but the speed increase is not significant… yet…
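
For reference, the usual shared-memory variant of this kind of kernel stages the activity vector once per block so every thread can reuse it from on-chip memory. A minimal sketch (this is illustrative, not the partially optimised version mentioned above; it assumes one block of N threads and that N floats fit in shared memory):

	__global__ void NN_OnDeviceShared(float *activity, float *weights, float *new_activity, int N)
	{
		extern __shared__ float act_s[];	// N floats, size supplied at launch
		int idx = threadIdx.x;

		act_s[idx] = activity[idx];		// each thread stages one element
		__syncthreads();			// whole vector in shared memory before use

		float sum = 0.0f;
		for (int j = 0; j < N; j++)
			sum += act_s[j] * weights[(j*N)+idx];

		new_activity[idx] = sum;
		activity[idx] = 1.0f/(1.0f + expf(-sum));	// sigmoid, as in NN_OnDevice
	}

	// launched with the shared-memory size as the third launch parameter:
	// NN_OnDeviceShared <<< 1, N, N*sizeof(float) >>> (activity_d, weights_d, new_activity_d, N);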

CUBLAS is optimized for large matrices. A 100x100 * 100x1 gemv call only contains about 20,000 FLOPs (roughly 2*M*N), which is an absolutely trivial amount of work, even for the least powerful CUDA-capable GPUs. You would never bother to do it on the device: an optimised host BLAS will be considerably faster at those sizes on a modern CPU (the whole problem dataset would fit in a typical L1 cache). Add in the PCI-e transfer overheads and it makes no sense to use the GPU.

Is 100x100 really your application’s typical gemv() call size?
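
For a sense of scale, the whole 100x100 gemv is just two short loops on the host (a plain reference sketch, before any host BLAS tuning, only to make the flop count concrete):

	// y = alpha*A*x + beta*y for a column-major m x n matrix: ~2*m*n flops
	void sgemv_ref(int m, int n, float alpha, const float *A, int lda,
	               const float *x, float beta, float *y)
	{
		for (int i = 0; i < m; i++) {
			float sum = 0.0f;
			for (int j = 0; j < n; j++)
				sum += A[j*lda + i] * x[j];
			y[i] = alpha*sum + beta*y[i];
		}
	}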

I have several applications, and so far I have only written test code, but this would implement one update step of a 100-neurone neural network. In practice I’ll be using many networks of this size, so they will run in parallel (probably one per block). I’ll also be running several visual feature extraction algorithms that are more expensive, the outputs of which will feed into the NNs (the vector part). This input will come either from other cards, the CPU, or maybe the same card if I use a ‘fat kernel’. I think I can keep PCI traffic down to just the robot’s sensory data at each time step, which would be two smallish camera images, three microphone .wav files, and a bunch of touch and angle values.

For NNs of this size (around 100 neurones) I’ll stick with my own kernel, but there is the possibility of developing much larger NNs, in which case I’ll look at CUBLAS again.