Help with CUBLAS performance and timing issues, please help...

Hello everyone :).

My system is a regular Intel Core 2 Duo 1.67 GHz based one with a GeForce 8400M GS (a laptop)… currently executing it all under Fedora 10, nvidia-driver v180 beta (latest release compatible with CUDA 2.1 BETA), 2.1 BETA SDK + Toolkit…

I am working, after doing it in a basic, but straightforward way with regular CUDA code (with and without the use of shared-memory and pinned memory allocated on the host side), on a CUBLAS implementation of a normal large matrix * large vector multiply using CUBLAS.

I time my code using the provided timer functions in the CUDA SDK+toolkit and also test all of it against a CPU based implementation of said M*V algorithm:

float temp = 0;

	if ( null == a || null == b || null == c ) return;

	for (int j = 0; j < C; j++) {

	//for (int i = 0; i < R; i++) {

		temp = b[j];

		for (int i = 0; i < R; i++) {

		//for (int j = 0; j < C; j++) {

			if ( j == 0) c[i] = 0;  

			c[i] += MATC(a,i,j,R) * temp;

			//c[i] += MATC(a,i,j,R) * b[i];

		}

	}

For the CUBLAS part I time the complete roundtrip (upload, execution, and download) of data and while pinned memory helps reduce the transfer time… even with large matrices 5-7k x 5-7k the GPU with CUBLAS, including the time it takes to transfer data back and forth, manages to be at most 0.5x faster than the reference CPU code (Release configuration)… the operation is really PCI-Express transfer bound as that takes the greatest share of the time taken by the whole operation:

../../bin/linux/release/MatrixTest 

Initializing data...												 

...allocating CPU memory.											

Matrix is 5376x5376												  

Vector is 5376x1													 

Exec time only on CPU: 53.553001 (ms)

simpleCUBLAS test running..

Transfer + Exec + Readback time on GPU with CUBLAS: 108.306000 (ms)

Transfer to GPU with CUBLAS: 61.438999 (ms)

Transfer from GPU with CUBLAS: 46.792999 (ms)

GPU CUBLAS code is 0.494460x faster than the CPU code [...]

This is how the timer functions are defined:

void start_timer (unsigned int* t) {

	//CUT_SAFE_CALL(cutCreateTimer(&timer_t1));

	CUT_SAFE_CALL(cutCreateTimer(t));

	CUT_SAFE_CALL(cutStartTimer((*t)));

	return;

}

void stop_timer (unsigned int t, float * t_ms) {

	CUT_SAFE_CALL(cutStopTimer(t));

	(*t_ms) = cutGetTimerValue(t);

	CUT_SAFE_CALL(cutDeleteTimer(t));

	return;

}

The cublas function is called this way:

unsigned int timer_toGDDR = 0;

	float t_toGDDR_ms = 0.0f;

	unsigned int timer_toRAM = 0;

	float t_toRAM_ms = 0.0f;

	init_test1_data_CUBLAS (&h_C_CUBLAS);

	init_test1_CUBLAS_alloc (d_A, d_B, d_C1, status);

	unsigned int timer2 = 0;

	float timer2_ms = 0.0f;

	start_timer(&timer2);

	//toGPU

	start_timer(&timer_toGDDR);

	init_test1_CUBLAS_transfer_to_GPU (matA, vecB, d_A, d_B, argc, argv, status);

	stop_timer(timer_toGDDR, &t_toGDDR_ms);

	//data transfered

	////exec

	for (int n = 0; n < STEPS; n++) {

		//init_test1_CUBLAS_transferVec_to_GPU (vecB, d_B,argc, argv, status);

		cublasSgemv('n', ROWS, COLS, 1, d_A, ROWS, d_B, 1, 0, d_C1, 1);

	}

	//fromGPU

	start_timer(&timer_toRAM);

	init_test1_CUBLAS_readback (h_C_CUBLAS, d_C1, argc, argv, status);

	stop_timer(timer_toRAM, &t_toRAM_ms);

	//data transfered

	stop_timer(timer2, &timer2_ms);//Timer stopped

	////

	printf ("\n\nTransfer + Exec + Readback time on GPU with CUBLAS: %f (ms)\n\n", timer2_ms);

	printf ("\n\nTransfer to GPU with CUBLAS: %f (ms)\n\n", t_toGDDR_ms);

	printf ("\n\nTransfer from GPU with CUBLAS: %f (ms)\n\n", t_toRAM_ms);

	printf ("\nGPU CUBLAS code is %fx faster than the CPU code (VS SSE2)...\n\n",

		(timer / timer2_ms));

STEPS is a macro, set in the previous example to 1 (so it was not looping)… the CPU based test multiplication function is also looped the same way as I am looping this CUBLAS function.

What is interesting is how the time it takes to retrieve the data, in that code, is dependent on the value of the macro STEPS… look at how it ballons up when I increase STEPS to 50…

../../bin/linux/release/MatrixTest 

Initializing data...												 

...allocating CPU memory.											

Matrix is 5376x5376												  

Vector is 5376x1													 

Exec time only on CPU: 2400.915039 (ms)

simpleCUBLAS test running..

Transfer + Exec + Readback time on GPU with CUBLAS: 2101.375000 (ms)

Transfer to GPU with CUBLAS: 62.765999 (ms)

Transfer from GPU with CUBLAS: 2038.197021 (ms)

GPU CUBLAS code is 1.142545x faster than the CPU code [...]

why does it happen?.. I call the cublas op STEPS times, but the transfer/readback from the GPU is only launched after all the calls to the cublas op have been made… what does this mean?

P.S.:

this is the main .cu file used:

//

/*

* Copyright 1993-2007 NVIDIA Corporation.  All rights reserved.

*

* NOTICE TO USER:

*

* This source code is subject to NVIDIA ownership rights under U.S. and

* international Copyright laws.  Users and possessors of this source code

* are hereby granted a nonexclusive, royalty-free license to use this code

* in individual and commercial software.

*

* NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE

* CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR

* IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH

* REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF

* MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.

* IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,

* OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS

* OF USE, DATA OR PROFITS,  WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE

* OR OTHER TORTIOUS ACTION,  ARISING OUT OF OR IN CONNECTION WITH THE USE

* OR PERFORMANCE OF THIS SOURCE CODE.

*

* U.S. Government End Users.   This source code is a "commercial item" as

* that term is defined at  48 C.F.R. 2.101 (OCT 1995), consisting  of

* "commercial computer  software"  and "commercial computer software

* documentation" as such terms are  used in 48 C.F.R. 12.212 (SEPT 1995)

* and is provided to the U.S. Government only as a commercial end item.

* Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through

* 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the

* source code with only those rights set forth herein.

*

* Any use of this source code in individual and commercial software must

* include, in the user documentation and internal comments to the code,

* the above Disclaimer and U.S. Government End Users Notice.

*/

#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include <sys/time.h>

#include <string.h>

#include <cutil_inline.h>

#include <cutil_math.h>

#include <vector_types.h>

#include <iostream>

#include "IGGS_CUDA_defines.h"

#include "MatTest_helper.h"

#include "MatTest_kernel.cu"

#include "cublas.h"

/**

* \brief Main program

*

* This program aims to test a standard Matrix * Vector multiplication. The matrix is square and NxN while the vector is Nx1.

* The Matrix is considered as column major and stored as a 1-D array properly indexed through a custom macro

* defined inside the IGGS_CUDA_defines.h file.

*/

int main(int argc, char **argv){

	float* matA, * vecB, * h_C_CPU;

	float *d_A, *d_B;

	float timer = 0.0f;

	matA = null; d_A = null;

	vecB = null; d_B = null;

	h_C_CPU = null;

#if CUDA == 1

	float * d_C = null;

	float * h_C_GPU = null;

#endif

#if CUDA == 0

	cublasStatus status;

	float *h_C_CUBLAS = null;

	float * d_C1 = null;

#endif

	init_test1_data (&matA, &vecB, &h_C_CPU);

	timer = init_test1_CPU (matA, vecB, h_C_CPU);

#if CUDA == 1

	CUT_DEVICE_INIT(argc, argv);

	//init CUDA data and set-up for kernel execution

	init_test1_data_CUDA (&h_C_GPU);

	dim3 dimBlock (BLOCK_SIZE * BLOCK_SIZE);

	dim3 dimGrid (ROWS/TBLOCK); //1 Thread Block per ogni BLOCK_SIZE^2 righe di A

	unsigned int timer1 = 0;

	float timer1_ms = 0.0f;

	start_timer(&timer1);//Timer started

	init_test1_CUDA (matA,vecB, d_A,d_B, d_C, argc, argv); //data uploading...

		unsigned int total_memory;

		unsigned int free_memory;

		CUresult r;

		r = cuInit(0);

		printf("Init result	 : %d\n",r);

		r = cuMemGetInfo(&free_memory,&total_memory);

		printf("Mem info result : %d\n",r);

		printf("Total memory (d_A, d_B, d_C allocated): %f MB\n",(float)total_memory/(1024*1024));

		printf("Free memory (d_A, d_B, d_C allocated): %f MB\n", (float)free_memory/(1024*1024));

	printf("Executing GPU kernel...\n");

	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	////

	//

	//	dimBlock (N*N*N)

	//	MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B, ROWS, COLS);

	//	MatTest<<<dimGrid, N*N*N>>>(d_C, d_A, d_B, ROWS, COLS);

	//

	//usiamo un indice 1-D per i threads, N^3 threads per Thread Block, in entrambe le chiamate del kernel

	//

	////

	//

	//	dimBlock (N*N,*N)

	//	MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B, ROWS, COLS);

	//	MatTest<<<dimGrid, N*N*N>>>(d_C, d_A, d_B, ROWS, COLS);

	//

	//usiamo un indice 2-D per i threads, N^3 threads per Thread Block

	//

	////

	//

	//	dimBlock (N,N,N)

	//	MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B, ROWS, COLS);

	//	MatTest<<<dimGrid, N*N*N>>>(d_C, d_A, d_B, ROWS, COLS);

	//

	//usiamo un indice 3-D per i threads, N^3 threads per Thread Block

	//

	////

#if SHARED_MEM == 1

	printf ("\n\n\"Using Shared Memory...\"\n\n");

#endif

#if SHARED_MEM == 0

	printf ("\n\n\"Not using Shared Memory...\"\n\n");

#endif

	MatTest<<<dimGrid, dimBlock>>>(d_C, d_A, d_B); //Result: :)

	CUT_CHECK_ERROR("MatTest() execution failed\n");

	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	printf("Reading back GPU result...\n\n");

	CUDA_SAFE_CALL( cudaMemcpy(h_C_GPU, d_C, DATA_V, cudaMemcpyDeviceToHost) );

	stop_timer(timer1, &timer1_ms);//Timer stopped

	printf ("\n\nTransfer + Exec + Readback time on GPU with CUDA: %f (ms)\n\n", timer1_ms);

	printf ("\nGPU CUDA code is %fx faster than the CPU code (VS SSE2)...\n\n",

		(timer / timer1_ms));

	printf ("\nC_GPU.x= %f C_GPU.y= %f C_GPU.z= %f C_GPU.w= %f\n", h_C_GPU[0], h_C_GPU[1],

		h_C_GPU[2], h_C_GPU[3]);

	printf ("\nC_CPU.x= %f C_CPU.y= %f C_CPU.z= %f C_CPU.w= %f\n", h_C_CPU[0], h_C_CPU[1],

		h_C_CPU[2], h_C_CPU[3]);

	if (!vectorEQ(h_C_CPU, h_C_GPU, COLS)) printf("\nh_C_CPU != h_C_GPU ... :(.\n");

	else printf("\nh_C_CPU == h_C_GPU... :).\n");

	printf("\nShutting down...\n");

	CUDA_SAFE_CALL( cudaFree(d_C) );

	CUDA_SAFE_CALL( cudaFree(d_B)   );

	CUDA_SAFE_CALL( cudaFree(d_A)   );

	free(h_C_GPU);

#endif

#if CUDA == 0

	unsigned int timer_toGDDR = 0;

	float t_toGDDR_ms = 0.0f;

	unsigned int timer_toRAM = 0;

	float t_toRAM_ms = 0.0f;

	init_test1_data_CUBLAS (&h_C_CUBLAS);

	init_test1_CUBLAS_alloc (d_A, d_B, d_C1, status);

	unsigned int timer2 = 0;

	float timer2_ms = 0.0f;

	start_timer(&timer2);

	//toGPU

	start_timer(&timer_toGDDR);

	init_test1_CUBLAS_transfer_to_GPU (matA, vecB, d_A, d_B, argc, argv, status);

	stop_timer(timer_toGDDR, &t_toGDDR_ms);

	//data transfered

	////exec

	for (int n = 0; n < STEPS; n++) {

		//init_test1_CUBLAS_transferVec_to_GPU (vecB, d_B,argc, argv, status);

		cublasSgemv('n', ROWS, COLS, 1, d_A, ROWS, d_B, 1, 0, d_C1, 1);

	}

	//fromGPU

	start_timer(&timer_toRAM);

	init_test1_CUBLAS_readback (h_C_CUBLAS, d_C1, argc, argv, status);

	stop_timer(timer_toRAM, &t_toRAM_ms);

	//data transfered

	stop_timer(timer2, &timer2_ms);//Timer stopped

	////

	printf ("\n\nTransfer + Exec + Readback time on GPU with CUBLAS: %f (ms)\n\n", timer2_ms);

	printf ("\n\nTransfer to GPU with CUBLAS: %f (ms)\n\n", t_toGDDR_ms);

	printf ("\n\nTransfer from GPU with CUBLAS: %f (ms)\n\n", t_toRAM_ms);

	printf ("\nGPU CUBLAS code is %fx faster than the CPU code (VS SSE2)...\n\n",

		(timer / timer2_ms));

	printf ("Risultati CPU (C/C++):\n");

	printf ("C_CPU.x=%f C_CPU.y=%f C_CPU.z=%f C_CPU.w=%f\n\n", h_C_CPU[0], h_C_CPU[1], h_C_CPU[2], h_C_CPU[3]);

	printf ("Risultati GPU (CUBLAS):\n");

	printf ("C_GPU.x=%f C_GPU.y=%f C_GPU.z=%f C_GPU.w=%f\n", h_C_CUBLAS[0], h_C_CUBLAS[1],

		h_C_CUBLAS[2], h_C_CUBLAS[3]);

	if (!vectorEQ(h_C_CPU, h_C_CUBLAS, COLS)) printf("\nh_C_CPU != h_C_CUBLAS ... :(.\n");

	else printf("\nh_C_CPU == h_C_CUBLAS... :).\n");

	free(h_C_CUBLAS);

#endif

#if PINNED == 1

	cutilSafeCall( cudaFreeHost(vecB));

	cutilSafeCall( cudaFreeHost(matA));

#endif

#if PINNED == 0

	free(vecB);

	free(matA);

#endif

	CUT_EXIT(argc, argv);

}

////////////////////////////////////////////////////////////////////////////////

// Helper function, returning uniformly distributed

// random float in [low, high] range

////////////////////////////////////////////////////////////////////////////////

here is the complete list, in case the uploading of the whole zipped project does not work of helper functions used

#include <cstdio>

#include <cstdlib>

#include <cmath>

#include <sys/time.h>

#include <string.h>

#include "cublas.h"

#include <cutil_inline.h>

#include <cuda.h>

#include <cutil_math.h>

#include <vector_types.h>

#include "IGGS_CUDA_defines.h"

#include "MatTest_helper.h"

////////////////////////////////////////////////////

/////CPU, data initialization

////////////////////////////////////////////////////

/** \brief General data initialization function.

*

*	This function takes one matrix and two vectors as inputs, allocates space for them in RAM

*	and initializes them.\n The matrix is set to an Identity matrix with the first four elements of the 4th column set to

*	the translation vector (1,0,0,1). The source vector is set to all 1's and the result vector is zeroed.

*

*	\param matA reference to the source matrix we want to allocate (RAM/System memory) 

*			and initialize, float**.

*	\param vecB reference to the source vector we want to allocate (RAM/System memory) 

*			and initialize, float**.

*	\param h_C_CPU reference to the result vector (C/C++ generated) we want to allocate (RAM/System memory)

*			and initialize, float**

*	\return void

*	\sa DATA_SZ, DATA_V, N_EL, ROWS, COLS, P_MATC, PRINT_MAT_DIM, PRINT_CVEC_DIM

*/

void init_test1_data (float** matA, float** vecB, float** h_C_CPU) {

	printf("Initializing data...\n");

	printf("...allocating CPU memory.\n");

#if PINNED == 1

	cutilSafeCall( cudaMallocHost ( (void**) matA, DATA_SZ));

	cutilSafeCall( cudaMallocHost ( (void**) vecB, DATA_V));

#endif

#if PINNED == 0

	*matA	 = (float *)malloc(DATA_SZ); //malloc (<n_items> * sizeof(<item type>))

	*vecB	 = (float *)malloc(DATA_V); //DATA_SZ = <n_items> * sizeof(float)

										 //malloc (DATA_SZ)

#endif

	*h_C_CPU = (float *)malloc(DATA_V);

	//PRINT_POINTER (*matA);

	//PRINT_POINTER (*vecB);

	for(int i = 0; i < ROWS; i++){

		for(int j = 0; j < COLS; j++) {

			if (j == i) P_MATC(matA,i,j,COLS) = 1.0f;

			else P_MATC(matA,i,j,COLS) = 0.0f;

			(*vecB)[j] = 1.0f;

			(*h_C_CPU)[j] = 0.0f;

		}

	}

	//let's do a test 4x4 case first in CUDA then with the CUBLAS libs.

	P_MATC(matA,0,3,COLS) = 1.0f;

	P_MATC(matA,1,3,COLS) = 0.0f;

	P_MATC(matA,2,3,COLS) = 0.0f;

	P_MATC(matA,3,3,COLS) = 1.0f;

	//PRINT_POINTER (*matA);

	PRINT_MAT_DIM (ROWS, COLS);

	PRINT_CVEC_DIM (N_EL);

	return;

}					  

/** \brief Times and runs the Mat*Vec test on the CPU (C/C++).

*

*	This function launches STEPS times the mat_vec() function which executes the multiplication

*	of source matrix matA and source vector vecB storing the result in the result vector h_C_CPU.

*

*	\param matA reference to the source matrix (RAM/System memory), 

*			float*.

*	\param vecB reference to the source vector (RAM/System memory), 

*			float*.

*	\param h_C_CPU reference to the result vector (RAM/System memory),

*			float*.

*	\return the time it takes to run this whole test (mat_vec() executed STEPS times), float.

*	\sa STEPS, ROWS, COLS, start_timer(), stop_timer(), mat_vec()

*/

float init_test1_CPU ( float *matA, float *vecB, float *h_C_CPU) 

{

	float time = 0.0f;

	unsigned int timer = 0;

	start_timer (&timer);

	for (int n = 0; n < STEPS; n++) {

		mat_vec(matA, ROWS, COLS, vecB, COLS, h_C_CPU);

	}

	stop_timer (timer, &time);

	printf ("\n\nExec time only on CPU: %f (ms)\n\n", time);

	return time;

}

void mat_vec ( float *a, const int R, const int C, float *b, const int SIZE, float *c) {

	//A is a column major ordered matrix

	float temp = 0;

	if ( null == a || null == b || null == c ) return;

	for (int j = 0; j < C; j++) {

	//for (int i = 0; i < R; i++) {

		temp = b[j];

		for (int i = 0; i < R; i++) {

		//for (int j = 0; j < C; j++) {

			if ( j == 0) c[i] = 0;  

			c[i] += MATC(a,i,j,R) * temp;

			//c[i] += MATC(a,i,j,R) * b[i];

		}

	}

	return;

}

bool vectorEQ (float* a, float* b, const int SIZE) {

	if (null == a || null == b) return false;

	for (int i = 0; i < SIZE; i++) {

		if ( a[i] != b[i] ) {

		

			printf ("\nIndex: %d\n", i);

			printf ("a[%d]: %f , b[%d]: %f\n", i, a[i], i, b[i]);

			return false;

		}

	}

	return true;

}

float RandFloat(float low, float high){

	float t = (float)rand() / (float)RAND_MAX;

	return (1.0f - t) * low + t * high;

}

void printVec (float* v, const int EL) {

	return;

}

void printMat (float* m, const int R, const int C) {

	return;

}

void malloc_MatrixFPv1 ( float *** mat, const int rows, const int columns) {

	int i,index;

	(*mat) = (float **) calloc (rows, sizeof(float *)); //calloc does the same thing as malloc but

	//inits memory to 0

	*(*mat) = (float *) calloc (rows * columns, sizeof (float));

	for (i = 0; i < rows; i++) {

		for (index = 0;; index = 0 + i*columns) {

			*((*mat) + i) = ((**mat) + index);

		}

	}

	return;

}

//se adoperiamo questa v2 della funzione che alloca una matrice dinamicamente allora

//useremo in seguito per accedere alla matrice, implementata con un largo array 1D,

//la seguente macro:

//

//		index(i, j , n_columns) ((j) + ((i) * (n_columns))) //row-major matrix

//

//

//gia' definita in precedenza

void malloc_matrixFPv2 ( float ** mat, const int rows, const int columns) {

	(*mat) = (float *) calloc (rows * columns, sizeof (float));

	return;

}

void start_timer (unsigned int* t) {

	//CUT_SAFE_CALL(cutCreateTimer(&timer_t1));

	CUT_SAFE_CALL(cutCreateTimer(t));

	CUT_SAFE_CALL(cutStartTimer((*t)));

	return;

}

void stop_timer (unsigned int t, float * t_ms) {

	CUT_SAFE_CALL(cutStopTimer(t));

	(*t_ms) = cutGetTimerValue(t);

	CUT_SAFE_CALL(cutDeleteTimer(t));

	return;

}

////////////////////////////////////////////////////

/////CUBLAS

////////////////////////////////////////////////////

void init_test1_data_CUBLAS (float** h_C_CUBLAS)

{

	*h_C_CUBLAS = (float *)malloc(DATA_V);

	

	//for(int i = 0; i < ROWS; i++){

	//	(*h_C_CUBLAS)[i] = 0.0f; 

	//}

}

void init_test1_CUBLAS_alloc (float* &d_A, float* &d_B,

						float* &d_C, cublasStatus &status) 

{

	printf("simpleCUBLAS test running..\n");

	//PRINT_POINTER(matA);

	status = cublasInit();

	if (status != CUBLAS_STATUS_SUCCESS) {

		fprintf (stderr, "!!!! CUBLAS initialization error\n");

		exit( EXIT_FAILURE);

	}

	/* Allocate device memory for the matrices */

	status = cublasAlloc(ROWS*COLS, sizeof(d_A[0]), (void**)&d_A);

	if (status != CUBLAS_STATUS_SUCCESS) {

		fprintf (stderr, "!!!! device memory allocation error (A)\n");

		exit( EXIT_FAILURE);

	}

	status = cublasAlloc(ROWS, sizeof(d_B[0]), (void**)&d_B);

	if (status != CUBLAS_STATUS_SUCCESS) {

		fprintf (stderr, "!!!! device memory allocation error (B)\n");

		exit( EXIT_FAILURE);

	}

	status = cublasAlloc(ROWS, sizeof(d_C[0]), (void**)&d_C);

	if (status != CUBLAS_STATUS_SUCCESS) {

		fprintf (stderr, "!!!! device memory allocation error (C)\n");

		exit( EXIT_FAILURE);

	}

	return;

}

void init_test1_CUBLAS_transfer_to_GPU ( float* matA, float* vecB, float* &d_A, float* &d_B,

								 int argc, char** argv, cublasStatus &status) 

{

	/* Initialize the device matrices with the host matrices */

	status = cublasSetVector(ROWS*COLS, sizeof(matA[0]), matA, 1, d_A, 1);

	if (status != CUBLAS_STATUS_SUCCESS) {

		fprintf (stderr, "!!!! device access error (write A)\n");

		CUT_EXIT(argc,argv);

		exit( EXIT_FAILURE);

	}

	status = cublasSetVector(ROWS, sizeof(vecB[0]), vecB, 1, d_B, 1);

	if (status != CUBLAS_STATUS_SUCCESS) {

		fprintf (stderr, "!!!! device access error (write B)\n");

		CUT_EXIT(argc,argv);

		exit( EXIT_FAILURE);

	}

}

void init_test1_CUBLAS_transferVec_to_GPU (float* vec, float* &d,

								 int argc, char** argv, cublasStatus &status) 

{

	status = cublasSetVector(ROWS, sizeof(vec[0]), vec, 1, d, 1);

	if (status != CUBLAS_STATUS_SUCCESS) {

		fprintf (stderr, "!!!! device access error (write)\n");

		CUT_EXIT(argc,argv);

		exit( EXIT_FAILURE);

	}

}

void init_test1_CUBLAS_readback (float* h_C_CUBLAS, float* &d_C, 

								 int argc, char** argv, cublasStatus &status) 

{

	//status = cublasGetError();

	//if (status != CUBLAS_STATUS_SUCCESS) {

	//	fprintf (stderr, "!!!! kernel execution error.\n");

	//	CUT_EXIT(argc,argv);

		//return EXIT_FAILURE;

	//}

	

	/* Read the result back */

	status = cublasGetVector(COLS, sizeof(h_C_CUBLAS[0]), d_C, 1, h_C_CUBLAS, 1);

	if (status != CUBLAS_STATUS_SUCCESS) {

		fprintf (stderr, "!!!! device access error (read C)\n");

		CUT_EXIT(argc,argv);

		//return EXIT_FAILURE;

	}

	return;

}

////////////////////////////////////////////////////

/////CUDA

////////////////////////////////////////////////////

void init_test1_data_CUDA (float** h_C_GPU)

{

	*h_C_GPU = (float *)malloc(DATA_V); 

	for(int i = 0; i < ROWS; i++){

		(*h_C_GPU)[i] = 0.0f; 

	}

}

void init_test1_CUDA ( float *matA, float *vecB, float * &d_A, float * &d_B,

					  float * &d_C, int argc, char **argv) 

{

	printf("...allocating GPU memory.\n");

	CUDA_SAFE_CALL( cudaMalloc((void **)&d_A, DATA_SZ)   ); //input matrix

	CUDA_SAFE_CALL( cudaMalloc((void **)&d_B, DATA_V)   ); //input vector

	CUDA_SAFE_CALL( cudaMalloc((void **)&d_C, DATA_V)   ); //result vector 

	printf("...copying input data to GPU mem.\n");

	//Copy options data to GPU memory for further processing 

	CUDA_SAFE_CALL( cudaMemcpy(d_A, matA, DATA_SZ, cudaMemcpyHostToDevice) );

	CUDA_SAFE_CALL( cudaMemcpy(d_B, vecB, DATA_V, cudaMemcpyHostToDevice) );

	printf("Data init done.\n");

	return;

}

This is the common defines file:

#ifndef IGGS_CUDA_DEFINES_H

#define IGGS_CUDA_DEFINES_H

//////////

#define null (NULL) /**< \brief I like my Java style defined null pointer :).*/

#define BLOCK_SIZE (16) /**< \brief The number of threads in a Thread Block (CUDA). */

#define AMP (21) /**< amplification factor, the way it is used it helps determine the number of Thread Blocks in a Grid */

#define STEPS (1) /**< \brief number of repetitions. */

#define PINNED (1) /**< \brief PINNED == 0 --> using PAGED memory.\n PINNED == 1 --> using fast PINNED memory. */

#define CUDA (0) /**< \brief CUDA == 0 --> CUBLAS execution path.\n CUDA == 1 --> CUDA execution path. */

#define SHARED_MEM (0) /**< \brief SHARED_MEM == 0 --> CUDA using global memory accesses only.\n

									SARED_MEM == 1 --> CUDA taking advantage of embedded Shared Memory. */

#define TBLOCK (BLOCK_SIZE*BLOCK_SIZE)

#define ROWS (AMP*BLOCK_SIZE*BLOCK_SIZE) /**< \brief Number of Rows in the source matrix and the source column vector.

											 \sa BLOCK_SIZE, AMP*/

#define COLS (AMP*BLOCK_SIZE*BLOCK_SIZE) /**< \brief Number of Columns in the source matrix.

											 \sa BLOCK_SIZE, AMP*/

#define N_EL (AMP*BLOCK_SIZE*BLOCK_SIZE) /**< \brief Number of elements in a column vector.

											 \sa BLOCK_SIZE, AMP*/

#define EQ(n1,n2) (n1 == n2)

#define N_MAT (ROWS * COLS) /**< \brief Number of elements (Rows * Columns) in the source matrix.

											 \sa ROWS, COLS*/

const int DATA_SZ = (N_MAT * sizeof(float)); /**< \brief Amount of bytes used to store the source matrix.

											 \sa N_MAT*/

const int DATA_V = (N_EL * sizeof(float)); /**< \brief Amount of bytes used to store the column vector (source and result).

											 \sa N_EL*/

#define PRINT_POINTER(x) (printf ("\nPointer is: %p\n", (x)))

#define PRINT_MAT_DIM(R,C) (printf ("Matrix is %dx%d\n", (R),(C)))

#define PRINT_CVEC_DIM(R) (printf ("Vector is %dx1\n", (R)))

#define PRINT_RVEC_DIM(C) (printf ("Vector is 1x%d\n", (C)))

#define P_MATR(mat, i, j, NCOLS) (*(mat))[indexR((i),(j),(NCOLS))]

#define MATR(mat, i, j, NCOLS) mat[indexR((i),(j),(NCOLS))]

#define P_MATC(mat, i, j, NROWS) (*(mat))[indexC((i),(j),(NROWS))] /**< \brief This macro dereferences the pointer to the matrix

																				and accesses it.

																		* \sa indexC(), NROWS */

#define MATC(mat, i, j, NROWS) mat[indexC((i),(j),(NROWS))]

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1)) //column major order addressing + 1st element has id#1

#define IDX2C(i,j,ld) (((j)*(ld))+(i)) //column major order addressing + 1st element has id#0

////////////////////////////////////////////////////////////////////////////////

// Helper MACRO to correctly index an 1D array as if it were a 2D one

////////////////////////////////////////////////////////////////////////////////

#define indexR(i, j, n_cols) ((j) + ((i) * (n_cols))) /**< \brief This macro provides row-major order addressing. */

#define indexC(i ,j, n_rows) (((j) * (n_rows)) + (i)) /**< \brief This macro provides column-major order addressing + 1st element has id#0 */

#endif

This is the .cu file storing the CUDA kernel:

/*

 * Copyright 1993-2007 NVIDIA Corporation.  All rights reserved.

 *

 * NOTICE TO USER:

 *

 * This source code is subject to NVIDIA ownership rights under U.S. and

 * international Copyright laws.  Users and possessors of this source code

 * are hereby granted a nonexclusive, royalty-free license to use this code

 * in individual and commercial software.

 *

 * NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE

 * CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR

 * IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH

 * REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF

 * MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.

 * IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,

 * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS

 * OF USE, DATA OR PROFITS,  WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE

 * OR OTHER TORTIOUS ACTION,  ARISING OUT OF OR IN CONNECTION WITH THE USE

 * OR PERFORMANCE OF THIS SOURCE CODE.

 *

 * U.S. Government End Users.   This source code is a "commercial item" as

 * that term is defined at  48 C.F.R. 2.101 (OCT 1995), consisting  of

 * "commercial computer  software"  and "commercial computer software

 * documentation" as such terms are  used in 48 C.F.R. 12.212 (SEPT 1995)

 * and is provided to the U.S. Government only as a commercial end item.

 * Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through

 * 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the

 * source code with only those rights set forth herein.

 *

 * Any use of this source code in individual and commercial software must

 * include, in the user documentation and internal comments to the code,

 * the above Disclaimer and U.S. Government End Users Notice.

 */

#include <cutil.h>

#include <cutil_math.h>

#include <vector_types.h>

#include "IGGS_CUDA_defines.h"

//#include <nvMatrix.h>

using namespace std;

#define CHECK_BANK_CONFLICTS 0

#if CHECK_BANK_CONFLICTS

#define AS(i, j) CUT_BANK_CHECKER(((float*)&As[0][0]), (BLOCK_SIZE * i + j))

#define BS(i, j) CUT_BANK_CHECKER(((float*)&Bs[0][0]), (BLOCK_SIZE * i + j))

#else

#define AS(i, j) As[i][j]

#define BS(i, j) Bs[i][j]

#define BV(id) Bv[id]

#endif

///////////////////////////////////////////////////////////////////////////////

// On G80-class hardware 24-bit multiplication takes 4 clocks per warp

// (the same as for floating point  multiplication and addition),

// whereas full 32-bit multiplication takes 16 clocks per warp.

// So if integer multiplication operands are  guaranteed to fit into 24 bits

// (always lie withtin [-8M, 8M - 1] range in signed case),

// explicit 24-bit multiplication is preferred for performance.

///////////////////////////////////////////////////////////////////////////////

#define IMUL(a, b) __mul24(a, b)

__global__ void MatTest(

	float *d_C,

	float *d_A,

	float *d_B

){

	// Block index

	int bx = blockIdx.x;

	// Thread index

	int tx = threadIdx.x;

	int i = tx + (bx* blockDim.x);

	//printf ("tx: %d\n", tx);

	//printf ("bx: %d\n", bx);

	//printf ("i: %d\n", i);

#if SHARED_MEM == 1

	__shared__ float matA[TBLOCK * COLS];

	//__shared__ float vecB[ROWS];

	//__shared__ float vecC[ROWS];

	int c;

	for (c = 0; c < COLS; c++) {

		matA[indexC(tx,c, TBLOCK)] = d_A[indexC(i,c,ROWS)];

		//vecB[c] = d_B[c];

		//vecC[c] = 0;

	}

	__syncthreads();

	float t = 0;

	for (c = 0; c < COLS; c++) {

		t += matA[indexC(tx,c,TBLOCK)] * d_B[c];

	}

	//__syncthreads();

	d_C[i] = t;

	//this kernel uses  bytes of Shared Memory and  bit registers...

#endif

#if SHARED_MEM == 0

	d_C[i] = 0;

	for (int c = 0; c < COLS; c++) {

		d_C[i] += d_A[indexC(i,c,ROWS)] * d_B[c];

		//this kernel uses  bytes of Shared Memory and  bit registers...

	}

#endif

	//printf("d_C[%d]: %f\n", i, d_C[i]);

}

I was making a quite big mistake in my code, not calling a thread synchronize function after all the cublas function calls were made in the loop with variable STEP factor (STEP == 1 --> no loop)…

the main result of this mistake of mine was that even if STEP == 1 the value of the timer that measured the readback (GPU --> mem) would also contain the execution time in it basically making it seem as if the execution of the cublas function took an incredibly short amount of time… which just was not the case… a 5k x 5k * 5k x 1 matrix-vector multiply on a common 8400M GS can hardly seem credible if it is under 0.1 ms…

Changed code:

unsigned int timer_toGDDR = 0;

	float t_toGDDR_ms = 0.0f;

	unsigned int timer_toRAM = 0;

	float t_toRAM_ms = 0.0f;

	init_test1_data_CUBLAS (&h_C_CUBLAS);

	init_test1_CUBLAS_alloc (d_A, d_B, d_C1, status);

	unsigned int timer2 = 0;

	float timer2_ms = 0.0f;

	start_timer(&timer2);

	//toGPU

	start_timer(&timer_toGDDR);

	init_test1_CUBLAS_transfer_to_GPU (matA, vecB, d_A, d_B, argc, argv, status);

	stop_timer(timer_toGDDR, &t_toGDDR_ms);

	//data transfered

	////exec

	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	for (int n = 0; n < STEPS; n++) {

		//init_test1_CUBLAS_transferVec_to_GPU (vecB, d_B,argc, argv, status);

		cublasSgemv('n', ROWS, COLS, 1, d_A, ROWS, d_B, 1, 0, d_C1, 1);

	}

	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	//fromGPU

	start_timer(&timer_toRAM);

	init_test1_CUBLAS_readback (h_C_CUBLAS, d_C1, argc, argv, status);

	stop_timer(timer_toRAM, &t_toRAM_ms);

	//data transfered

	stop_timer(timer2, &timer2_ms);//Timer stopped

	////

	printf ("\n\nTransfer + Exec + Readback time on GPU with CUBLAS: %f (ms)\n", timer2_ms);

	printf ("\nExecution time on GPU with CUBLAS: %f (ms)\n", (timer2_ms - t_toGDDR_ms - t_toRAM_ms));

	printf ("\nTransfer to GPU with CUBLAS: %f (ms)\n", t_toGDDR_ms);

	printf ("\nTransfer from GPU with CUBLAS: %f (ms)\n", t_toRAM_ms);

	printf ("\nGPU CUBLAS code is %fx faster than the CPU code (VS SSE2)...\n\n",

		(timer / timer2_ms));

Output:

Initializing data...

...allocating CPU memory.

Matrix is 5376x5376

Vector is 5376x1

Exec time only on CPU: 48.166000 (ms)

simpleCUBLAS test running..

Transfer + Exec + Readback time on GPU with CUBLAS: 99.656998 (ms)

Execution time on GPU with CUBLAS: 40.692998 (ms)

Transfer to GPU with CUBLAS: 58.861000 (ms)

Transfer from GPU with CUBLAS: 0.103000 (ms)

GPU CUBLAS code is 0.483318x faster than the CPU code (VS SSE2)...