CUBLAS matrix-vector multiplication

I’m trying to optimise a simple matrix-vector multiplication… nothing fancy here, but I can’t quite work out CUBLAS. I found the following code in an old thread here, but I can’t get the setup of A, B, and C right.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <cublas.h>

#define hA 10
#define wA 10
#define hB 1
#define wB 10

int main(void)
{
	float *devPtrA , *devPtrB , *devPtrC;

	cublasInit(); // initialization of CUDA application
	cublasAlloc( hA*wA, sizeof(float), (void**) &devPtrA);
	cublasAlloc( wA*wB, sizeof(float), (void**) &devPtrB);
	cublasAlloc( hA*wB, sizeof(float), (void**) &devPtrC);

	// transfer host data to device
	cublasSetMatrix( hA, wA, sizeof(float), A, hA, devPtrA, hA);
	cublasSetMatrix( wA, wB, sizeof(float), B, wA, devPtrB, wA);

	// compute C = A*B in device
	float alpha = 1.0;
	float beta  = 0.0;
	cublasDgemm('N', 'N', hA, wB, wA, alpha, devPtrA, hA,
		devPtrB, wA, beta, devPtrC, hA);

	cublasGetMatrix(hA, wB, sizeof(float), devPtrC, hA, C, hA);
	cublasShutdown();
}

Can someone either point out how to set this up properly, or show me an example of a simple matrix-vector multiplication that’s well optimised?

DGEMM is the BLAS level 3 matrix-matrix product in double precision. You want SGEMV for the equivalent BLAS level 2 single precision matrix-vector product.

I suggest you look at the definitions of hB and wB. In addition, if you really want to do matrix-vector, why not use the sgemv routine?

MMB

OK, still stuck… I don’t quite get the CUBLAS documentation… what do I call cublasSgemv with?

My current code looks like this:

#include <stdio.h>
#include <cublas.h>

#define M 10
#define N 10
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

int main(void)
{
	int i,j;
	float *devPtrA , *devPtrB , *devPtrC;
	float *a, *b;
	a = (float *)malloc (M * N * sizeof (*a));
	b = (float *)malloc (N * sizeof (*b));
	for (j = 0; j < N; j++) {
		b[j] = j;
		for (i = 0; i < M; i++) {
			a[IDX2C(i,j,M)] = i * M + j + 1;
		}
	}

	for (j = 0; j < N; j++) {
		for (i = 0; i < M; i++) {
			printf ("%7.0f", a[IDX2C(i,j,M)]);
		}
		printf ("\n");
	}
	printf ("\n");

	cublasInit(); // initialization of CUDA application
	cublasAlloc( M*N, sizeof(float), (void**) &devPtrA);	//matrix A
	cublasAlloc( N, sizeof(float), (void**) &devPtrB);	//vector B
	cublasAlloc( N, sizeof(float), (void**) &devPtrC);	//vector C
	// transfer host data to device
	cublasSetMatrix( M, N, sizeof(float), a, N, devPtrA, N);
	cublasSetVector( N, sizeof(float), b, N, devPtrB, N);

	// compute C = A*B in device
	float alpha = 1.0;
	float beta  = 0.0;
//	cublasSgemv('N', M, N, alpha, devPtrA, alpha, devPtrB, M, devPtrC);

	cublasGetVector (N, sizeof(float), devPtrC, N, b, N);
	cublasShutdown();
}

It is exactly the same call as a standard netlib BLAS sgemv (or page 67 of the CUBLAS pdf in the toolkit). You are missing the pitch of the matrix (lda) and the spacing of the two vectors (incx and incy):

void
cublasSgemv (char trans, int m, int n, float alpha,
             const float *A, int lda, const float *x,
             int incx, float beta, float *y, int incy)
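
For reference, the parameters map like this (this is standard BLAS semantics, spelled out rather than quoted from the thread):

	// sgemv computes  y = alpha * op(A) * x + beta * y
	// op(A) = A    when trans == 'N'  (A is m-by-n, column major, lda >= m)
	// op(A) = A^T  when trans == 'T'
	// x has n elements, read with stride incx
	// y has m elements, written with stride incy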

Now cublasSgemv is not returning an error, but I’m still failing to understand something here as I’m getting garbage back.

Maybe I’m not getting the result properly??

#include <stdio.h>
#include <cublas.h>

#define M 10
#define N 10
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

int main(void)
{
	int i,j;
	float *devPtrA , *devPtrB , *devPtrC;
	float *a, *b;
	a = (float *)malloc (M * N * sizeof (float));
	b = (float *)malloc (N * sizeof (float));
	for (j = 0; j < N; j++) {
		b[j] = j;
		for (i = 0; i < M; i++) {
			a[IDX2C(i,j,M)] = i * M + j + 1;
		}
	}

	for (j = 0; j < N; j++) {
		for (i = 0; i < M; i++) {
			printf ("%7.0f", a[IDX2C(i,j,M)]);
		}
		printf ("\n");
	}
	printf ("\n");

	cublasInit(); // initialization of CUDA application
	cublasAlloc( M*N, sizeof(float), (void**) &devPtrA);	//matrix A
	cublasAlloc( N, sizeof(float), (void**) &devPtrB);	//vector B
	cublasAlloc( N, sizeof(float), (void**) &devPtrC);	//vector C
	// transfer host data to device
	cublasSetMatrix( M, N, sizeof(float), a, N, devPtrA, N);
	cublasSetVector( N, sizeof(float), b, N, devPtrB, N);

	// compute C = A*B in device
	float alpha = 1.0;
	float beta  = 0.0;
	cublasSgemv('N', M, N, alpha, devPtrA, N, devPtrB, N, beta, devPtrC, N);

	cublasGetVector (N, sizeof(float), devPtrC, N, b, 1);

	for (j = 0; j < N; j++) printf ("%7.0f", b[j]);
	printf ("\n");
	cublasShutdown();
}

incx and incy should both be 1 in this case, and lda should be M (OK, this one is square so it shouldn’t matter, but in a less trivial case it would). Again, this is all in the documentation (or any reference on BLAS).
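
Putting that together for your M x N case, the calls would look roughly like this (an untested sketch following the point above; note the leading dimension M in both the copy and the gemv call, and that the result vector y = A*x has M elements):

	cublasSetMatrix( M, N, sizeof(float), a, M, devPtrA, M);	// lda = number of rows = M
	cublasSetVector( N, sizeof(float), b, 1, devPtrB, 1);
	cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);
	cublasGetVector( M, sizeof(float), devPtrC, 1, b, 1);	// result has M entries, contiguous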

Thank you for your patience with me here, but this is still returning garbage…

cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);

	cublasGetVector (N, sizeof(float), devPtrC, N, b, 1);

	for (j = 0; j < N; j++) printf ("%7.0f", b[j]);

So time to add some error checking. All those CUBLAS helper functions return values, and there is a function for polling the status from the BLAS calls themselves. They should all be asserted or otherwise checked. That at least will confirm whether the sgemv call actually runs or not.
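
For example, with the legacy API something along these lines (a rough sketch; cublasGetError() polls the status of the most recent BLAS call):

	cublasStatus stat;
	stat = cublasSetMatrix( M, N, sizeof(float), a, M, devPtrA, M);
	if (stat != CUBLAS_STATUS_SUCCESS) printf("cublasSetMatrix failed: %d\n", (int)stat);

	cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);
	stat = cublasGetError();	// status of the preceding BLAS call
	if (stat != CUBLAS_STATUS_SUCCESS) printf("cublasSgemv failed: %d\n", (int)stat);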


OK, thanks for your help. I’ve implemented matrix-vector multiplication as a CUDA kernel and am using the Thrust library for scan… I was wondering if CUBLAS would be much faster, as it’s probably better optimised than my code, but I don’t want to commit much time to learning CUBLAS without knowing in advance that it will be that much better.

OK, finally I have it working, though there is a problem if the matrix is not square…

It runs roughly twice as fast as my own CUDA version. Here is the code:

#include <stdio.h>
#include <cublas.h>

#define M 10
#define N 10
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

int main(void)
{
	cudaSetDevice(0);
	int i,j;
	float *devPtrA , *devPtrB , *devPtrC;
	float *a, *b;

	//timer stuff
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	a = (float *)malloc (M * N * sizeof (float));
	b = (float *)malloc (N * sizeof (float));
	for (j = 0; j < N; j++) {
		b[j] = 1;
		for (i = 0; i < M; i++) {
			a[IDX2C(i,j,M)] = i * M + j + 1;
		}
	}

	for (j = 0; j < N; j++) {
		for (i = 0; i < M; i++) {
			printf ("%7.0f", a[IDX2C(i,j,M)]);
		}
		printf ("\n");
	}
	printf ("\n");
	for(i=0;i<N;i++) printf("%f ",b[i]);
	printf ("\n\n");

	cublasInit(); // initialization of CUDA application
	cublasAlloc( M*N, sizeof(float), (void**) &devPtrA);	//matrix A
	cublasAlloc( N, sizeof(float), (void**) &devPtrB);	//vector B
	cublasAlloc( N, sizeof(float), (void**) &devPtrC);	//vector C
	// transfer host data to device
	// note: the leading-dimension arguments here should be M (the row count), not N -
	// with a square matrix it makes no difference, but this is why the non-square case breaks
	cublasSetMatrix( M, N, sizeof(float), a, N, devPtrA, N);
	cublasSetVector( N, sizeof(float), b, 1, devPtrB, 1);

	// compute C = A*B in device
	float alpha = 1.0;
	float beta  = 0.0;

	//start timer
	cudaEventRecord(start, 0);

	cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);

	// block until the device has completed
	cudaThreadSynchronize();
	//end timer
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);

	// the result vector y = A*x has M elements (here M == N)
	cublasGetVector (N, sizeof(float), devPtrC, 1, b, 1);

	for (j = 0; j < N; j++) printf ("%7.0f", b[j]);
	printf("\n\ntime = %f\n\n",elapsedTime);
	cublasShutdown();
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}

Hmm, there are still problems with the CUBLAS code in the previous post, but it works for a square matrix. However, I made a mistake in the timing comparison: that code is for a 10x10 matrix, but I was comparing it against my own, non-optimised CUDA code, which was running a 100x100 matrix and then applying a sigmoid function to the resulting vector. It turns out the CUBLAS implementation is actually slower than my own code, which can’t be right??? Maybe I’m misunderstanding something…

I include my own version of matrix-vector multiplication (followed by sigmoid transformation as the basis of a neural network update).

Can someone verify my timing (adjust M and N to 100 in the CUBLAS version in the previous post)?

Can you also verify the overall program execution time?
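
As an aside on the timing itself: the very first CUBLAS call after cublasInit (and the first kernel launch) pays a one-off setup cost, so a fairer measurement is to do a warm-up call and then average over many repetitions, roughly along these lines (a sketch, reusing the events from the code above):

	// warm-up call - the first launch includes one-off initialisation overhead
	cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);
	cudaThreadSynchronize();

	const int reps = 100;
	float elapsedTime;
	cudaEventRecord(start, 0);
	for (int r = 0; r < reps; ++r)
		cublasSgemv('N', M, N, alpha, devPtrA, M, devPtrB, 1, beta, devPtrC, 1);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&elapsedTime, start, stop);
	printf("average time per call = %f ms\n", elapsedTime / reps);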

// incrementArray.cu
#include <stdio.h>
#include <assert.h>
#include <math.h>
#include <cuda.h>

//-----------------------------------------------------------------------------------
void NN_OnHost(float *activity, float *weights, int N)
{
	int i,j;
	float new_activity[N];
	for (i=0;i<N;i++) {
		new_activity[i] = 0;
		for(j=0;j<N;j++) {
			new_activity[i] += activity[j] * weights[(j*N)+i];
		}
	}
	for (i=0; i < N; i++) {
		activity[i] = 1/(1+exp(-new_activity[i]));
	}
}

//-----------------------------------------------------------------------------------
__global__ void NN_OnDevice(float *activity, float *weights, float *new_activity, int N)
{
	int j, idx = threadIdx.x;
	new_activity[idx] = 0;
	for(j=0;j<N;j++) {
		new_activity[idx] += activity[j] * weights[(j*N)+idx];
	}
	__syncthreads();
	activity[idx] = 1/(1+exp(-new_activity[idx]));
}

//-----------------------------------------------------------------------------------
int main(void)
{
	cudaSetDevice(0);
	float *activity_h, *weights_h, *new_activity_h;	// pointers to host memory
	float *activity_d, *weights_d, *new_activity_d;	// pointers to device memory
	int i, j, N = 100;
	size_t size = N*sizeof(float);

	//timer stuff
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// allocate arrays on host
	activity_h = (float *)malloc(size);
	new_activity_h = (float *)malloc(size);
	weights_h = (float *)malloc(N*size);		// N*N weights

	// allocate arrays on device
	cudaMalloc((void **) &activity_d, size);
	cudaMalloc((void **) &new_activity_d, size);
	cudaMalloc((void **) &weights_d, N*size);

	// initialization of host data
	for (i=0; i<N; i++) {
		activity_h[i] = (float(rand() % 100) / 100);
		for(j=0; j<N; j++) {
			weights_h[(j*N)+i] = (float(rand() % 200) / 100) - 1;
		}
	}

	// copy data from host to device
	cudaMemcpy(activity_d, activity_h, sizeof(float)*N, cudaMemcpyHostToDevice);
	cudaMemcpy(weights_d, weights_h, sizeof(float)*N*N, cudaMemcpyHostToDevice);

	// do calculation on host
	NN_OnHost(activity_h, weights_h, N);
	for (i=0; i<10; i++) printf("%f ",activity_h[i]);
	printf("\n");

	//start timer
	cudaEventRecord(start, 0);

	// do calculation on device
	NN_OnDevice <<< 1, N >>> (activity_d, weights_d, new_activity_d, N);
	// block until the device has completed
	cudaThreadSynchronize();
	//end timer
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);

	// retrieve result from device and store in new_activity_h
	cudaMemcpy(new_activity_h, activity_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
	for (i=0; i<10; i++) printf("%f ",new_activity_h[i]);
	printf("\n");
	printf("time = %f\n\n",elapsedTime);

	// cleanup
	free(activity_h); free(weights_h); free(new_activity_h);
	cudaFree(activity_d); cudaFree(weights_d); cudaFree(new_activity_d);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}

I also have a partially optimised version using shared memory but the speed increase is not significant… yet…
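
For reference, the usual shared-memory variant of this kind of kernel stages the activity vector once per block so every thread can reuse it from on-chip memory. A minimal sketch (this is illustrative, not the partially optimised version mentioned above; it assumes one block of N threads and that N floats fit in shared memory):

	__global__ void NN_OnDeviceShared(float *activity, float *weights, float *new_activity, int N)
	{
		extern __shared__ float act_s[];	// N floats, size supplied at launch
		int idx = threadIdx.x;

		act_s[idx] = activity[idx];		// each thread stages one element
		__syncthreads();			// whole vector in shared memory before use

		float sum = 0.0f;
		for (int j = 0; j < N; j++)
			sum += act_s[j] * weights[(j*N)+idx];

		new_activity[idx] = sum;
		activity[idx] = 1.0f/(1.0f + expf(-sum));	// sigmoid, as in NN_OnDevice
	}

	// launched with the shared-memory size as the third launch parameter:
	// NN_OnDeviceShared <<< 1, N, N*sizeof(float) >>> (activity_d, weights_d, new_activity_d, N);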

CUBLAS is optimized for large matrices. A 100x100 * 100x1 gemv call only contains about 20,000 FLOPs (roughly 2*M*N), which is an absolutely trivial amount of work, even for the least powerful CUDA-capable GPUs. You would never bother to do it on the device: an optimised host BLAS will be considerably faster at those sizes on a modern CPU (the whole problem dataset would fit in a typical L1 cache). Add in the PCI-e transfer overheads and it makes no sense to use the GPU.

Is 100x100 really your application’s typical gemv() call size?
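
For a sense of scale, the whole 100x100 gemv is just two short loops on the host (a plain reference sketch, before any host BLAS tuning, only to make the flop count concrete):

	// y = alpha*A*x + beta*y for a column-major m x n matrix: ~2*m*n flops
	void sgemv_ref(int m, int n, float alpha, const float *A, int lda,
	               const float *x, float beta, float *y)
	{
		for (int i = 0; i < m; i++) {
			float sum = 0.0f;
			for (int j = 0; j < n; j++)
				sum += A[j*lda + i] * x[j];
			y[i] = alpha*sum + beta*y[i];
		}
	}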

I have several applications, and so far I have only written test code, but this would implement one update step of a 100-neurone neural network. In practice I’ll be using many networks of this size, so they will run in parallel (probably one per block). I’ll also be running several visual feature extraction algorithms that are more expensive, the outputs of which will feed into the NNs (the vector part). This input will come either from other cards, the CPU, or maybe the same card if I use a ‘fat kernel’. I think I can keep PCI traffic down to just the robot’s sensory data at each time step, which would be two smallish camera images, three microphone .wav files, and a bunch of touch and angle values.

For NNs of this size (around 100 neurones) I’ll stick with my own kernel, but there is the possibility of developing much larger NNs, in which case I’ll look at CUBLAS again.