Matrix multiplication errors & a few thoughts on CUDA: basic programming errors need correction

[codebox]__global__ void mul(float *d_a, float *d_b, float *d_c, int size)
{
    int Row = blockIdx.y * blockDim.x + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    for (int k = 0; k < blockDim.x; ++k)
        d_c[Row*Col+k] += d_a[Row*k+Col] * d_b[k*Col+Row];
}[/codebox]

This is my kernel, and this is how I launch it, with the block size set to 16:

[codebox]start = clock();

dim3 dimblock(blocksize, blocksize);
dim3 dimgrid((int)(ms/256 + (ms%256 == 0 ? 0 : 1)),
             (int)(ms/256 + (ms%256 == 0 ? 0 : 1)));

mul<<<dimgrid, dimblock>>>(d_a, d_b, d_c, size);
cudaThreadSynchronize();

end = clock();[/codebox]

I get obscure results from this, and even after thread synchronisation the timing varies a lot. I can't launch anything bigger than a 5000x5000 matrix with this. I have a 9800 GX2, and the cubin file always shows 10 registers used; I don't know what the problem is.

Can someone please answer these questions for me:

1) How is the FSB limitation overcome in CUDA? Intel CPUs bottleneck on multicore performance beyond two cores when all the threads make consecutive memory accesses.

2) Maximum sizes of each dimension of a block: 512 x 512 x 64; maximum sizes of each dimension of a grid: 65535 x 65535 x 1. I wanted to know how these limits were fixed. The dimensions of a block basically mean the number of threads in a block, but that's limited to 512, and the grid dimensions mean the number of blocks.

Hello guys, why don't you just take a look at it once?

The problem is your addressing, e.g. Row*Col+k gives some weird results. This should be replaced by something like Col+Row*size.

You also don't have any bounds checking in your kernel. Do something like if (Row < size && Col < size) { … }.

Your for-loop also seems strange: why are you iterating exactly blockDim.x (i.e. 16) times in each thread and not size times or something like that?

You can greatly reduce your runtime by accumulating into a local variable instead of global memory: use float tmp = 0.f; for (…) { tmp += …; } d_c[…] = tmp;. Putting all of these fixes together gives something like the sketch below.
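For illustration, a minimal corrected version of the naive kernel along those lines (a sketch, assuming square row-major matrices of dimension size; mul_fixed is a hypothetical name, not code from this thread):

[codebox]__global__ void mul_fixed(float *d_a, float *d_b, float *d_c, int size)
{
    // Global row and column this thread is responsible for.
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if (Row < size && Col < size)          // bounds check
    {
        float tmp = 0.f;                   // accumulate in a register
        for (int k = 0; k < size; ++k)     // full dot product: size iterations
            tmp += d_a[Row*size + k] * d_b[k*size + Col];
        d_c[Row*size + Col] = tmp;         // single write to global memory
    }
}[/codebox]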

And don't be so impatient ;-)

You should have a look at the matrix multiplication example in the CUDA programming guide. As was already pointed out, your kernel has issues.

clock() is an extremely imprecise time measurement. Your kernel call is probably completing in ~milliseconds or less. You need a higher-resolution timer, i.e. gettimeofday on Linux or QueryPerformanceCounter on Windows: these are wrapped in a cross-platform way by CUT, which comes with the CUDA SDK. If you want extremely high-precision timing of just the kernel launch, use the timers in the CUDA event API (read the programming guide).
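A minimal sketch of that event-based timing (the event calls are the documented CUDA runtime API; the kernel and its arguments are the ones from the original post):

[codebox]cudaEvent_t start_ev, stop_ev;
float elapsed_ms = 0.f;

cudaEventCreate(&start_ev);
cudaEventCreate(&stop_ev);

cudaEventRecord(start_ev, 0);              // queue start marker on stream 0
mul<<<dimgrid, dimblock>>>(d_a, d_b, d_c, size);
cudaEventRecord(stop_ev, 0);               // queue stop marker after the kernel
cudaEventSynchronize(stop_ev);             // block until the kernel has finished

cudaEventElapsedTime(&elapsed_ms, start_ev, stop_ev);
printf("kernel time: %f ms\n", elapsed_ms);

cudaEventDestroy(start_ev);
cudaEventDestroy(stop_ev);[/codebox]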

What do you mean? There is no FSB on the GPU. The GTX 280, for instance, has a 512-bit memory bus tied directly from the RAM to the memory manager on the GPU. It is capable of feeding the GPU with ~140 GiB/s of bandwidth (read or write).

Because the hardware scheduler was designed that way.

Note that you can’t always run 512 threads in a block depending on the number of registers used in the kernel.

Also, you said that you can't launch a matrix larger than 5000x5000… if this is due to memory restrictions, you could write a kernel that does an "in-place" multiplication (overwriting one of the input matrices with the output to save space). You should get a decent increase in the maximum possible matrix size from doing that; one way to structure it is sketched below.
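Since row i of C = A*B needs all of A's row i, one workable scheme is to stage a single row in a scratch buffer before overwriting it (a sketch under those assumptions; mul_row, mul_inplace, and d_rowA are hypothetical names):

[codebox]// Compute row i of A*B and write it over A's row i.
// d_rowA must already hold a copy of A's original row i, so the
// writes into d_a cannot race with the reads of that row.
__global__ void mul_row(float *d_rowA, float *d_b, float *d_a, int i, int size)
{
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Col < size)
    {
        float tmp = 0.f;
        for (int k = 0; k < size; ++k)
            tmp += d_rowA[k] * d_b[k*size + Col];
        d_a[i*size + Col] = tmp;           // overwrite input with output
    }
}

// Host side: only one extra row (size floats) is needed, not a whole matrix.
void mul_inplace(float *d_a, float *d_b, float *d_rowA, int size)
{
    int threads = 256;
    int blocks  = (size + threads - 1) / threads;
    for (int i = 0; i < size; ++i)
    {
        cudaMemcpy(d_rowA, d_a + i*size, size*sizeof(float),
                   cudaMemcpyDeviceToDevice);
        mul_row<<<blocks, threads>>>(d_rowA, d_b, d_a, i, size);
    }
}[/codebox]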

First of all, THANKS A LOT for replying and addressing my problems, thanks!

I changed my kernel to this one (can you please check it out):

[codebox]__global__ void mul(float *d_a, float *d_b, float *d_c, int size)
{
    int i, j, sz = (int)(size/blockDim.x);

    float temp = 0;

    int Row = blockIdx.y * blockDim.x + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if (Row < sz && Col < sz)
    {
        for (int k = 0; k < (int)(sz/blockDim.y); ++k)
            temp += d_a[Row*k+Col] * d_b[k*Col+Row];
    }

    d_c[Row*sz+Col] = temp;
}[/codebox]

If you look at my for-loop, for (int k = 0; k < (int)(size/(blockDim.x*blockDim.y)); ++k): size is the row/column count of a square matrix and blockDim.x*blockDim.y is the total number of threads per block, so this gives the number of blocks. I'm trying to make this loop iterate over all the existing blocks, or maybe I should have used gridDim; please correct me if I'm wrong.

Yup, thanks for that, I'll definitely take that advice. But I've already posted 3-4 different threads with no reply yet, so I thought I might as well ask again ;-)

PLEASE correct me!!!

Thanks a lot for the reply, thanks again.

1) Hey, the matrix multiplication in the manual involves so much optimisation and all, so I thought I'd write a simple CUDA-enabled MM first and then optimise it.

I tried using the CUT timer functions from cutil, but I got cutil.h problems. They were in the SDK directory and I even gave its path, but there was a linking problem, and I'm not root here so I can't configure the paths. I'll try to make use of the cudaEvent routines, thanks.

2) I used "FSB" because I did not find the corresponding term for the GPU architecture. I thought the quickness of the GPU's multicore operation came from the presence of 16 banks in shared memory, which allows a half-warp to be processed, and a full warp in 4 clock cycles, whereas global memory has a latency of 400-600 clock cycles. Correct me if I'm wrong; I couldn't interpret what these clock cycles meant.

And the GPU has two frequencies, CORE and SHADER; which one should I take into account for FLOPS calculations?

3) Hardware scheduler? Yeah, I accept the 512-thread limitation per block, but can you tell me the limitation on the NUMBER OF BLOCKS AND GRIDS?

Yup, I agree that if I want to create lots of threads I have to compromise on shared memory and registers per thread, these being 16 KB and 8192 respectively.

Hey, I have 1 GB per card; what possible memory restrictions could I have? I allocate 3 float arrays on the GPU, that's 3*5000*5000*sizeof(float) = 300 MB, so I still have lots of space.

[codebox]$ ./a.out 5000

cuda error : Kernel Invocation failed : invalid configuration argument[/codebox]

That could be a problem in the host code where you call your kernel. The part between <<< and >>> is the configuration; a typical way to compute it is sketched below.
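For reference, a common pattern for the launch configuration (a sketch reusing the variable names from the original post; note the original code divides ms by 256 rather than by the block edge length):

[codebox]int blocksize = 16;                        // 16x16 = 256 threads per block (limit is 512 here)
dim3 dimblock(blocksize, blocksize);

// Round up so the grid covers the whole ms x ms matrix.
int nblocks = (ms + blocksize - 1) / blocksize;
dim3 dimgrid(nblocks, nblocks);            // each grid dimension must be <= 65535

// An "invalid configuration argument" means one of those limits was violated.
mul<<<dimgrid, dimblock>>>(d_a, d_b, d_c, size);[/codebox]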

You use the shader clock to calculate FLOPS.
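As a worked example (assuming the commonly quoted G92 figures for the 9800 GX2: 128 SPs per GPU, a ~1.5 GHz shader clock, and 3 FLOPs per SP per clock from the dual-issued MAD+MUL):

[codebox]// Theoretical peak under the assumed figures:
// 128 SPs * 1.5 GHz * 3 FLOPs/clock = 576 GFLOPS per GPU,
// so roughly 1152 GFLOPS for both GPUs of the GX2 combined.
float gflops_per_gpu = 128 * 1.5f * 3;     // = 576[/codebox]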

The maximum number of blocks in a grid is 65535 x 65535 (or maybe the last digit is a 6). Each kernel runs on a single grid. I don't know if there's a limit on how many grids and/or kernels your code can have.
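Rather than hard-coding these numbers, you can query the real limits at runtime; a minimal sketch using the standard cudaGetDeviceProperties call:

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);     // query device 0

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}[/codebox]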

This doesn't change the answer. The hardware scheduler is still responsible for launching all blocks. Think about it: on an MP with 8 blocks running flat out, one of them suddenly completes. The GPU isn't going to go back to the host or anything dumb like that to get another block to run: they are all queued up in the hardware scheduler so that as soon as a block completes it is replaced by another.

With the GX2, each GPU gets half of the memory, so each one can only access 512MB. I would think you would be able to get closer to that limit, but from what I’ve read, there’s a decent amount of memory that you can’t ever access due to other processes using the GPU. Try testing some smaller numbers and see what the cutoff is between where it works and where it doesn’t.

It's the memory reserved for the mode switch; that's around 7-10 MB, just the bits on the screen depending on the resolution. So leaving that aside, I still have 500 MB on one card and 512 MB on the other.

PLEASE answer this one, guys. My GPU hangs frequently while executing this piece of code and gives obscure results. Thanks in advance.

No, from what I've read on the forums, you still won't be able to access the full 500MB. There are other things internal to the video/CUDA driver that use memory as well, reducing the amount of memory that you are able to access. There is a method used by some developers on this forum where they "brute-force" some memory allocation, that is, trying to allocate a huge block, then a slightly smaller one, and so on until the allocation is successful. If you try this as an experiment, you will be able to determine the actual amount of memory available to you for your specific setup.
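A minimal sketch of that brute-force probe (illustrative code, not from this thread):

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t step  = 1 << 20;                // shrink by 1 MB per attempt
    size_t bytes = 1024 * step;            // start by asking for 1 GB
    void  *p     = NULL;

    // Keep shrinking the request until an allocation succeeds.
    while (bytes > 0)
    {
        if (cudaMalloc(&p, bytes) == cudaSuccess)
            break;
        cudaGetLastError();                // clear the error and retry smaller
        bytes -= step;
    }

    printf("largest single allocation: %zu MB\n", bytes / step);
    cudaFree(p);
    return 0;
}[/codebox]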

EXACTLY RIGHT

I'm only able to use 461 MB of memory.

Please, can I have an account of my remaining 52 MB of memory?

Thanks a lot, profquail. I'd be grateful if you could tell me, or link me to, where my 42 MB has gone, assuming 10 MB for the mode switch.