Cuda matrix multiplication too slow

Hello,

I’m quite new at Cuda programming and I took the example of Cuda matrix multiplication (without using shared memory) from the Programming Guide; the result is right but it is too slow. I use two 1024 x 1024 matrices with 16 x 16 blocks and I get an execution time of 5.39s (both in Debug and Release mode), whereas I get in C alone: 6.64s in Debug mode and 4.09s in Release mode. I use Visual Studio 2005. So in Release mode, Cuda seems no better than C, so I think I must have done something wrong somewhere.

Could you tell me what I did wrong, please?
Thank you for your help.
jcpao

Try CUBLAS - the SDK code is not too efficient - “(without using shared memory)” - not a good idea. You also do not say what kind of card you have. A GTX280 will perform at about 380 GFlops single precision (SGEMM), while a 4-core Xeon, with optimized code using all cores, will reach about 80 GFlops.

Something else has got to be wrong here. The SDK matrix multiply example is about one third the speed of CUBLAS (I was timing this recently, for my own nefarious purposes), and copying 1024^2 matrices to the GPU and back isn’t that slow. What do these times include? Was the CUDA context already established before the timer began? Those times look suspiciously like ‘whole program’ times.

I forgot to say that I use a GeForce 8400 GS.

This is the program I run:

/* CUDA program taken from the NVIDIA CUDA Programming Guide 2.3 (pp. 18-21).
   (The original paste left this comment unclosed, which breaks compilation.) */

#include "stdafx.h"

#include <stdio.h>

#include <cuda.h>

/* Matrix descriptor shared by host and device code.
   The host fills `elements` in row-major order (index width*row + col). */
typedef struct {
    int width;        /* number of columns */
    int height;       /* number of rows */
    float* elements;  /* width * height floats */
} Matrix;

/* Thread-block tile size.
   Matrix dimensions are assumed to be multiples of BLOCK_SIZE. */
#define BLOCK_SIZE 16

/* Square matrix dimension. */
#define MSIZE 1024

/* Kernel forward declaration. A function launched with <<<grid, block>>>
   must carry the __global__ qualifier (the pasted "global" does not compile). */
__global__ void MulMatKernel(const Matrix, const Matrix, Matrix);

/* Abort with a diagnostic if a CUDA runtime call failed. The original code
   ignored every return status, which silently hides allocation, copy and
   launch failures. */
static void checkCuda(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        printf("CUDA error (%s): %s\n", what, cudaGetErrorString(err));
        exit(-1);
    }
}

/* Multiplies two MSIZE x MSIZE random matrices on the GPU, times the kernel
   with CUDA events, sanity-checks the result and dumps all three matrices
   to text files (a_h.txt, b_h.txt, c_h.txt). */
int main(void)
{
    /* Host-side input and output matrices. */
    Matrix a_h, b_h, c_h;

    FILE *fp1, *fp2, *fp3;

    /* Dump files for the two inputs and the result. */
    fp1 = fopen("a_h.txt", "w");
    if (fp1 == NULL) {
        printf("Ouverture du fichier %s impossible\n", "a_h.txt");
        exit(-1);
    }

    fp2 = fopen("b_h.txt", "w");
    if (fp2 == NULL) {
        printf("Ouverture du fichier %s impossible\n", "b_h.txt");
        exit(-1);
    }

    fp3 = fopen("c_h.txt", "w");
    if (fp3 == NULL) {
        printf("Ouverture du fichier %s impossible\n", "c_h.txt");
        exit(-1);
    }

    a_h.width = MSIZE; a_h.height = MSIZE;
    b_h.width = MSIZE; b_h.height = MSIZE;
    c_h.width = MSIZE; c_h.height = MSIZE;

    size_t size = MSIZE * MSIZE * sizeof(float);  /* bytes per matrix */

    /* Host allocations (checked: a failed malloc would crash in the init loops). */
    a_h.elements = (float*)malloc(size);
    b_h.elements = (float*)malloc(size);
    c_h.elements = (float*)malloc(size);
    if (a_h.elements == NULL || b_h.elements == NULL || c_h.elements == NULL) {
        printf("Allocation memoire impossible\n");
        exit(-1);
    }

    fprintf(fp1, "\n");  /* blank line before the numbers */

    /* Fill the host inputs with pseudo-random values in [0, 1]
       (rand() returns 0..RAND_MAX) and dump them as they are generated. */
    for (int j = 0; j < a_h.height; j++)
        for (int i = 0; i < a_h.width; i++) {
            a_h.elements[a_h.width * j + i] = (float)rand() / RAND_MAX;
            fprintf(fp1, "%f\n", a_h.elements[a_h.width * j + i]);
        }

    fprintf(fp2, "\n");  /* blank line before the numbers */

    for (int j = 0; j < b_h.height; j++)
        for (int i = 0; i < b_h.width; i++) {
            b_h.elements[b_h.width * j + i] = (float)rand() / RAND_MAX;
            fprintf(fp2, "%f\n", b_h.elements[b_h.width * j + i]);
        }

    /* Zero the host result buffer. */
    for (int j = 0; j < c_h.height; j++)
        for (int i = 0; i < c_h.width; i++)
            c_h.elements[c_h.width * j + i] = 0.0f;

    /* Device copies of the matrix descriptors; only `elements` lives on the GPU. */
    Matrix d_A, d_B, d_C;
    d_A.width = a_h.width; d_A.height = a_h.height;
    d_B.width = b_h.width; d_B.height = b_h.height;
    d_C.width = c_h.width; d_C.height = c_h.height;

    checkCuda(cudaMalloc((void**)&d_A.elements, size), "cudaMalloc d_A");
    checkCuda(cudaMemcpy(d_A.elements, a_h.elements, size, cudaMemcpyHostToDevice), "copy A to device");
    checkCuda(cudaMalloc((void**)&d_B.elements, size), "cudaMalloc d_B");
    checkCuda(cudaMemcpy(d_B.elements, b_h.elements, size, cudaMemcpyHostToDevice), "copy B to device");
    checkCuda(cudaMalloc((void**)&d_C.elements, size), "cudaMalloc d_C");

    /* Launch configuration: one thread per output element,
       BLOCK_SIZE x BLOCK_SIZE threads per block. Dimensions are assumed
       to be exact multiples of BLOCK_SIZE (MSIZE = 1024, BLOCK_SIZE = 16). */
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(b_h.width / dimBlock.x, a_h.height / dimBlock.y);

    /* Time the kernel alone with CUDA events (GPU-side timing, excludes
       the host<->device copies above/below). */
    cudaEvent_t start, stop;
    float time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);  /* begin */

    MulMatKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    /* Kernel launches return no status directly; a bad configuration only
       surfaces through cudaGetLastError(). */
    checkCuda(cudaGetLastError(), "kernel launch");

    cudaThreadSynchronize();

    cudaEventRecord(stop, 0);   /* end */
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("\n");
    printf("Temps ecoule: %f ms\n", time);

    /* Copy the result back to the host. */
    checkCuda(cudaMemcpy(c_h.elements, d_C.elements, size, cudaMemcpyDeviceToHost), "copy C to host");

    /* Release device memory. */
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);

    fprintf(fp3, "\n");  /* blank line before the numbers */

    /* Sanity check: each element is a sum of MSIZE products of values in
       [0, 1], so it must lie in [0, MSIZE]. The dump file is written with
       index width*j + i (transposed order) so Matlab can read it column-wise. */
    for (int i = 0; i < MSIZE; i++) {
        for (int j = 0; j < MSIZE; j++) {
            if (c_h.elements[i * c_h.width + j] > MSIZE || c_h.elements[i * c_h.width + j] < 0)
                printf("erreur = %f i = %d, j = %d\n", c_h.elements[i * c_h.width + j], i, j);
            fprintf(fp3, "%f\n", c_h.elements[c_h.width * j + i]);
        }
    }

    fclose(fp1);
    fclose(fp2);
    fclose(fp3);

    /* Release host memory. */
    free(a_h.elements);
    free(b_h.elements);
    free(c_h.elements);

    return 0;
}

/* Naive matrix multiply: each thread computes one element of C = A * B by
   accumulating the dot product of one row of A with one column of B in a
   register (Cvalue). Launched as a 2D grid of BLOCK_SIZE x BLOCK_SIZE blocks,
   one thread per output element.
   NOTE: a kernel launched with <<<grid, block>>> must be __global__; the
   pasted "device" qualifier would not compile and mismatches the forward
   declaration at the top of the file. */
__global__ void MulMatKernel(const Matrix A, const Matrix B, Matrix C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    /* Bounds guard: harmless when dimensions divide evenly (as here), and
       prevents out-of-range accesses if they ever do not. */
    if (row >= A.height || col >= B.width)
        return;

    float Cvalue = 0.0f;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];

    /* The result is deliberately stored transposed (index col*width + row)
       so the dump file can be read column-wise by Matlab. */
    C.elements[col * C.width + row] = Cvalue;
}

Thank you for your help.

jcpao

These are the results I get from the profiler (see attached Excel document):
prof_mulmat3.xls (14 KB)

Hello,

I took the example of Cuda matrix multiplication using shared memory from the Programming Guide. I use two 1024 x 1024 matrices with 16 x 16 blocks. The kernel uses 8 registers per thread. I use a GPU 8400 GS with 8 stream processors (1400 MHz).

I get an execution time (for the kernel alone) of 387ms.

Please could you tell me whether it is a slow or a normal execution time?
Thank you for your help. :)
jcpao