Matrix multiplication garbage values :(

Hello,

I wrote the following program that performs matrix multiplication of two square matrices of equal dimensions. The CUDA compiler shows no errors, but when I execute the program on my 9600GT, only the first row of the product matrix (matrix c, c_d on the device and c_h on the host) prints correctly; the rest are garbage values. The kernel is actually the one given here on page 9 of the matrix multiplication example by David Kirk.
This might be a very trivial question, but being new to CUDA I am unable to resolve it. Can someone have a look at the kernel and help me out?

Thanks in advance

/* This is a CUDA program that performs matrix multiplication on square matrices of equal dimensions */



#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <cuda.h>

// kernel that runs on the device
__global__ void matrixmul(int *a, int *b, int *c, int width )
{
int i=threadIdx.x;
int j=threadIdx.y;
int k,m,n,sum;

// sum stores the value computed by the thread
sum=0;
for(k=0;k<width;++k)
{
m= a[j*width+k];
n= b[k*width+i];
sum =sum+(m*n);
}
c[j*width+i]=sum; /* c is the product matrix

}

//main();

int main()
{
int i,j,m,n,k,width,sum;

int *a_h,*b_h,*a_d,*b_d,*c_h,*c_d;
const int N=1000;
size_t size= N*sizeof(int);

//Memory allocation on host and device, a_h, a_d
a_h=(int*)malloc(size);
cudaMalloc((void**)&a_d,size);

//Memory allocation on host and device, b_h, b_d
b_h=(int*)malloc(size);
cudaMalloc((void**)&b_d,size);

//Memory allocation on host and device, c_h, c_d
c_h=(int*)malloc(size);
cudaMalloc((void**)&c_d,size);

//User inputs (note that row=columns)
printf("enter the row & column of the 1st matrix m ");
scanf("%d%d",&width,&width); //row = columns = width
printf("enter the elements of 1st matrix m");
for(i=0;i<(width*width);i++)
{
scanf("%d",&a_h[i]);
}
for(i=0;i<(width*width);i++)
{
printf("\t%d",a_h[i]);
}
printf("\n");

//copying data (a_h) from Host to Device in a_d
cudaMemcpy(a_d,a_h,size,cudaMemcpyHostToDevice);

printf("enter the row & column of 2nd matrix n");
scanf("%d%d",&width,&width); //row = columns
printf("enter the elements of 2nd matrix n");
for(j=0;j<(width*width);j++)
{
scanf("%d",&b_h[j]);
}
for(j=0;j<(width*width);j++)
{
printf("\t%d",b_h[j]);
}
printf("\n");

//copying data (b_h) from Host to Device in b_d
cudaMemcpy(b_d,b_h,size,cudaMemcpyHostToDevice);

//Kernel call
matrixmul<<<1,4>>>(a_d,b_d,c_d,width );

//copying data (c_d) from Device to Host in c_h
cudaMemcpy(c_h,c_d,size,cudaMemcpyDeviceToHost);

//printing the results

for(i=0;i<(width*width);i++)
{
printf("%d",c_h[i]);
}

getch();

free(a_h);
cudaFree(a_d);
free(b_h);
cudaFree(b_d);
free(c_h);
cudaFree(c_d);
}

Take a look at the matrixMul example in the SDK, it’ll save you a lot of time (or use CUBLAS) :)

N.

why are you launching the kernel with 1 block (using only 1 of the XXX MPs on your GPU) and 4 threads per block? is there a reason for it, or does it just happen to be a test figure?

here is one mistake i found so far:

missing "*/"

i hope you don't mind me editing your code

[codebox]/* This is a CUDA program that performs matrix multiplication on square matrices of equal dimensions */

#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <cuda.h>
#include <cutil_inline.h>

// kernel that runs on the device
__global__ void matrixmul(int *a, int *b, int *c, int width) {
int i=threadIdx.x;
int j=threadIdx.y;
int k,m,n,sum;

// sum stores the value computed by the thread
sum=0;
for(k=0;k<width;++k) {
m= a[j*width+k];
n= b[k*width+i];
sum =sum+(m*n);
} // end for

c[j*width+i]=sum; /* c is the product matrix */
} // end matrixmul

//main();

int main( int argc, char **argv ) {
int i,j,m,n,k,width,sum;
int *a_h,*b_h,*a_d,*b_d,*c_h,*c_d;
const int N=1000;
size_t size= N*sizeof(int);

//Memory allocation on host and device, a_h, a_d
a_h=(int*)malloc(size);
cudaMalloc((void**)&a_d,size);

//Memory allocation on host and device, b_h, b_d
b_h=(int*)malloc(size);
cudaMalloc((void**)&b_d,size);

//Memory allocation on host and device, c_h, c_d
c_h=(int*)malloc(size);
cudaMalloc((void**)&c_d,size);

//User inputs (note that rows=columns)
printf("enter the row & column of the 1st matrix m ");
scanf_s("%d%d",&width,&width); //row = columns = width
printf("enter the elements of 1st matrix m");
for(i=0;i<(width*width);i++) scanf_s("%d",&a_h[i]);
for(i=0;i<(width*width);i++) printf("\t%d",a_h[i]);
printf("\n");

//copying data (a_h) from Host to Device in a_d
cudaMemcpy(a_d,a_h,size,cudaMemcpyHostToDevice);

printf("enter the row & column of 2nd matrix n");
scanf_s("%d%d",&width,&width); //row = columns
printf("enter the elements of 2nd matrix n");
for(j=0;j<(width*width);j++) scanf_s("%d",&b_h[j]);
for(j=0;j<(width*width);j++) printf("\t%d",b_h[j]);
printf("\n");

//copying data (b_h) from Host to Device in b_d
cudaMemcpy(b_d,b_h,size,cudaMemcpyHostToDevice);

//Kernel call
matrixmul<<<1,4>>>(a_d,b_d,c_d,width);

//copying data (c_d) from Device to Host in c_h
cudaMemcpy(c_h,c_d,size,cudaMemcpyDeviceToHost);

//printing the results
for(i=0;i<(width*width);i++) printf("%d",c_h[i]);

// getch();

free(a_h); cudaFree(a_d);
free(b_h); cudaFree(b_d);
free(c_h); cudaFree(c_d);

////////////////////////////////////////////////////////////////////////////
// Exit
////////////////////////////////////////////////////////////////////////////
printf( "Shutting down..." );
cutilExit( argc, argv );
}[/codebox]

i really hope you wont start kicking me for this…

build log:

the program doesn't work yet, but i think it is due to the presence of scanf, as these statements were giving me trouble before. automating the matrix initialisation should fix the problem (note: i changed scanf to scanf_s in order to compile; change that back to suit your compiler). by "yet" i mean it works up to the point where you enter the last element of the second matrix, and then it just crashes. i'll play with the program for a bit and see if i can fix it (don't expect much though)…

Thanks a lot, Keither!!

You are free to do anything with my code :), what I want is to get the correct output!!

And yes, "*/" was missing, but actually when I compiled my program in Visual Studio 2005 I did not have any comment there. The compiler was not giving an error but the output was wrong. Please tell me how you are optimising the code. In my opinion the error lies somewhere in the kernel itself, as the output before the kernel invocation is coming out fine.

Also, why did you use cutil_inline.h?

i didn't change the code much, just edited it so that it resembles my programming style a bit more and is generally a bit more readable for me, as i am constantly switching between VS2005 and ConTEXT.

i only added cutil_inline.h so that the exit routine would hold the cmd window open for me until i press "Enter", as i am running the executable like any other program by double-click and not through a proper cmd command call (it is just handy for me that way). and that is how i found cutilExit to be better than a scanf routine.

Keither: I took one block with 4 threads because I was checking the output for a 2 x 2 matrix. The product output was correct only in the first row; the rest was garbage!! And moreover, I do not want to use more than 1 block for this particular program.

I think I found the bug.

You launched the kernel with <<<1, 4>>>, so your block is 1D with 4 threads. So in each thread,
j = threadIdx.y will always be 0,
and i runs from 0 to 3.

And when you compute sum, for i = 0 or 1 it is OK; that is why the first row in c is correct.
For i = 2 it is still OK for accessing matrix a,
however when you access matrix b with n = b[k*width + i], you are accessing b[2] and b[4], and b[4] is an undefined value in your 2x2 matrix.

xintian

Hi Kiran_CUDA,

I believe the bug was found, as pointed out by yangxin.

Short as the program is, it is always hard to decipher it just by reading the code.

I would add the following comments:

In the kernel you have a "for" loop. You have to get rid of it:
" for(k=0;k<width;++k) {"

What you have to do is create as many blocks as "widths" you have and run the kernel in different threads.

Also, "sum =sum+(m*n);" is basically a serial operation waiting for previous results, and all the advantages of parallel computing are lost with it. The algorithm has to be inherently parallel to make it faster.

One way to do the same, eliminating the "for" loop and the serially dependent operations, is to make the algorithm work on partial sums and use reduction.

This is the trickiest part of CUDA programming - to get rid of the loops and create the equivalent code in blocks and threads.

That is why we are developing this THS VCG, a Visual Code Generator for NVIDIA CUDA. You work on the algorithm graphically and the code is generated automatically.

Also, with an interactive graphical environment you can change values interactively, as you do now with "scanf", but much more easily.
You can save the project and maintain it better later.

It is still in development, but you can register on our site: http://www.thstart.com . We are posting a lot of materials every week, and I can notify you when we have something to show to the external public.

Thanks, Thstart!! :)

Well, can you tell me why people "hate" loops in CUDA? Also, could you be kind enough to tell me how to use partial sums and reduction?

I am also exploring THS VCG; I will get back to you soon regarding my queries.

The program runs faster when uninterrupted. Using loops requires the program to change its sequence of instructions, e.g. checking whether the end condition is met, and if not, continuing; if yes, skipping the loop and continuing with the code after it.

For x86 this is important because the processor is prefetching instructions and it is optimal to run uninterrupted.

For GPU computing the idea is to slice your code and make it run in different blocks/threads.

If you have a loop with a parameter running, say, from 0 to 3, you can easily make it parallel by unrolling it:

from this:

for (i=0;i<=3;i++)
{
a[i] = b[i] + c[i];
}

to this:

a[0] = b[0] + c[0];
a[1] = b[1] + c[1];
a[2] = b[2] + c[2];
a[3] = b[3] + c[3];

Now if you have a loop running 10,000 times:

a[0] = b[0] + c[0];
a[1] = b[1] + c[1];
a[2] = b[2] + c[2];
a[3] = b[3] + c[3];
…
a[9999] = b[9999] + c[9999];

If you can make it possible for each one to be in a different thread, you can have them executed in parallel.

This is an oversimplification, but you get the idea of how to get rid of the loop and use parallelization.

Reduction is easier to explain with finding a minimum or maximum.

Say you want to find the maximum of the sequence:

1 24 12 2 34 15 22 4 44 23 55 3

One way to find it quickly is to divide it into groups of 4:

(1 24 12 2) (34 15 22 4) (44 23 55 3)

Step 1. Find the max of each group:

24 34 55

Step 2. The max of the remaining 3 maximums is 55.

Now, Step 1 can process the three groups in parallel, then Step 2 can continue.

In the real world, if you have 10,000 numbers, this algorithm finishes when you have just 1 number left.

The task of slicing the numbers into groups and feeding the blocks/threads in GPU computing is not trivial. You can get fast results, but the coding is not easy. Also, in the real world you can have 20,234 or whatever quantity of numbers to find the maximum of. You get the maximum performance if you fit your data to the actual GPU, and each GPU has different parameters: number of cores, threads, blocks, etc.

The bottom line: it depends on the data and the actual GPU, and it is very time consuming to do it manually. It is an iterative process: instruction selection and reordering; coding; changing; benchmarking; and again.

That is why we are developing the THS VCG: to automate this process as much as possible.