Using CUBLAS in simple C/C++ code

Hello,

I’m new here and a beginner at programming NVIDIA cards. I read some pages about the CUBLAS implementation and coded a sample to understand how to use CUBLAS, learning from the SDK examples.

Here is the sample code:

#include <cstdio>
#include "cublas.h"

int main(int argc, char *argv[]) {
	int n = 256;
	float alpha = 1.0f;
	float *x = new float[n];
	float *y = new float[n];
	float *xptr;
	float *yptr;

	for (int k = 0; k < n; k++) {
		x[k] = 0.1f;
		y[k] = 0.1f;
	}

	cublasStatus state;

	if (cublasInit() != CUBLAS_STATUS_SUCCESS) {
		printf("CUBLAS init error.\n");
		return -1;
	}

	state = cublasAlloc(n, sizeof(*y), (void**)&yptr);
	if (state != CUBLAS_STATUS_SUCCESS) {
		printf("Error allocating device memory.\n");
		return -1;
	}

	state = cublasAlloc(n, sizeof(*x), (void**)&xptr);
	if (state != CUBLAS_STATUS_SUCCESS) {
		printf("Error allocating device memory.\n");
		return -1;
	}

	state = cublasSetVector(n, sizeof(*x), x, 1, xptr, 1);
	if (state != CUBLAS_STATUS_SUCCESS) {
		printf("Error copying to device memory.\n");
		return -1;
	}

	state = cublasSetVector(n, sizeof(*y), y, 1, yptr, 1);
	if (state != CUBLAS_STATUS_SUCCESS) {
		printf("Error copying to device memory.\n");
		return -1;
	}

	// Call the CUBLAS implementation: y = alpha * x + y
	cublasSaxpy(n, alpha, xptr, 1, yptr, 1);
	state = cublasGetError();
	if (state != CUBLAS_STATUS_SUCCESS) {
		printf("CUBLAS execution error.\n");
		return -1;
	}

	state = cublasGetVector(n, sizeof(*yptr), yptr, 1, y, 1);
	if (state != CUBLAS_STATUS_SUCCESS) {
		printf("Error copying from device memory.\n");
		return -1;
	}

	if (cublasFree(xptr) != CUBLAS_STATUS_SUCCESS) {
		printf("Error freeing device memory.\n");
		return -1;
	}

	if (cublasFree(yptr) != CUBLAS_STATUS_SUCCESS) {
		printf("Error freeing device memory.\n");
		return -1;
	}

	if (cublasShutdown() != CUBLAS_STATUS_SUCCESS) {
		printf("CUBLAS shutdown error.\n");
		return -1;
	}

	delete[] x;
	delete[] y;
	return 0;
}

Here is my question:

Did I understand CUBLAS correctly? Is this the right way to call CUBLAS?

I ask because I benchmarked this subroutine against vecLib on OS X 10.5.2 (MacBook Pro / NV 8600GT 128MB), and CUBLAS seems to be very slow. But I don’t trust my benchmark :D I’m new to the C/C++ world - and beginners often get things wrong… :devil:

Big thanks for any support or suggestions about this post.

Greetings from Germany

Manolo

Hello,

It is okay, but what do you want to do with n = 256? You should choose something like n = 1e6 or even greater. You don’t drive a Ferrari at 20 km/h. :-)

Big thanks for this hint. The n was only an example, but I hadn’t tested with a large n. With n = 1e6 the Ferrari turns on the turbo: I gain about 50-60 GFLOP/s on my 8600GT with 128MB RAM with axpy - nice.

Now, I only have to redesign my application to use 1e6 vectors :whistling:

Greetings from Germany

Manolo

Uhhh, 50-60 GFLOP/s? Are you sure about that? I got something around 12 GFLOP/s with my 8800 GTS. Don’t forget to time correctly. People normally forget to

  • warm up the ferrari … (execute the code several times before measuring)

  • and to put a cudaThreadSynchronize after the computation before stopping the timer, (the device gives the control back to the host before completing the task).

  • (optional, but in my opinion necessary) execute the code several times after the warm-up and calculate the mean computation time.

Otherwise it’s kind of cheating (and that is something Michael Schumacher would never do) :-). Please verify your results.

Okay, I will analyze my code a little more carefully. At first I got values of about 5-15 GFLOP/s with one or two executions, but increasing the number of loop iterations brings out the high performance.

So the pseudocode goes like this:

maxIter = 18
n = 1e6
SetUpDataAndSystem()

// Warm up
cublasSaxpy(...)

begin = Time()
DO 1,...,maxIter
  cublasSaxpy(...)
OD
end = Time()

TearDownDataAndSystem()

I will post the revised code soon…

The value seems plausible to me. In the forum I found posts with sgemm benchmarks of 130 GFLOP/s on an 8800 GTX, so 50 GFLOP/s on an 8600 GT seems realistic…

Hi Manolo, how did you measure the GFLOP/s number?

Is it estimated from the amount of data/instructions?

Greetings

from Germany too

Tron

Hello,

here is the full code. I added the cudaThreadSynchronize() - because MS won’t cheat :haha: Now I get about 1 GFLOP/s, and the benchmark isn’t deterministic: sometimes I get 0.8 and sometimes 2.0. So I think there is a mistake somewhere :/ .

I didn’t know that the cublas routines execute asynchronously (non-blocking). I didn’t find this information in the CUBLAS reference.

Calculating the FLOP/s:

I have two vectors of length n. For each component, axpy performs 1 multiplication (alpha*x[i]) and 1 dependent addition (alpha*x[i] + y[i]), so each call costs 2n FLOP. For 100 iterations the FLOP/s are calculated as:

flops = 100 * 2n / (seconds the iterations need)

Here the code:

#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>
#include <cublas.h>

double timediff(clock_t e, clock_t b)
{
	double ticks = e - b;
	return ticks / CLOCKS_PER_SEC;
}

int main(int argc, char *argv[]) {
	long int maxIter = 1000;
	double speed;
	clock_t begin;
	clock_t end;
	int n = 1e6;
	float alpha = 1.0f;
	float *x = new float[n];
	float *y = new float[n];
	float *xptr;
	float *yptr;

	for (int k = 0; k < n; k++) {
		x[k] = 0.1f;
		y[k] = 0.1f;
	}

	cublasInit();
	cublasAlloc(n, sizeof(*y), (void**)&yptr);
	cublasAlloc(n, sizeof(*x), (void**)&xptr);
	cublasSetVector(n, sizeof(*x), x, 1, xptr, 1);
	cublasSetVector(n, sizeof(*y), y, 1, yptr, 1);

	// Warm up
	cublasSaxpy(n, alpha, xptr, 1, yptr, 1);
	cudaThreadSynchronize();

	// Perform benchmark
	// Note: clock() measures process CPU time, not wall-clock time
	begin = clock();
	for (long int i = 0; i < maxIter; i++) {
		cublasSaxpy(n, alpha, xptr, 1, yptr, 1);
	}
	cudaThreadSynchronize();
	end = clock();
	cublasGetVector(n, sizeof(*yptr), yptr, 1, y, 1);

	speed = timediff(end, begin);
	printf("CUBLAS %.6f Time (s) \n", speed);

	speed = (double)(maxIter * 1E-9 * n * 2) / speed;
	printf("CUBLAS %.3f GFLOP/S \n", speed);

	cublasFree(xptr);
	cublasFree(yptr);
	cublasShutdown();

	delete[] x;
	delete[] y;
	return 0;
}

Big thanks for the replies.

Another question: maybe cublasGetVector() is blocking, because the system must wait until all the data is transferred from GPU memory to host memory? And what about the scalar (alpha)? Maybe I have to use a CUDA type or something else? It seems that every call uploads alpha, and then the performance drops to CPU speed? What do you think about this?

Hi sicb0161,

could you please post your code here, so I can verify your results? I measured gemm at 13 GFLOP/s on an 8600M GT with 128MB and N = M = 1024. Did you test gemm, too?