Matrix multiplication ERRORS & a few thoughts on CUDA: basic programming errors need correction

__global__ void mul(float *d_a, float *d_b, float *d_c, int size) {

int Row = blockIdx.y * blockDim.x + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;

for (int k = 0; k < blockDim.x; ++k)
    d_c[Row*Col+k] += d_a[Row*k+Col] * d_b[k*Col+Row];
}


This is my kernel, and this is how I launch it, with block size 16:

dim3 dimblock(blocksize, blocksize);
dim3 dimgrid((int)(ms/256 + (ms%256 == 0 ? 0 : 1)), (int)(ms/256 + (ms%256 == 0 ? 0 : 1)));


I get obscure results from this, and even after thread synchronisation the timing varies a lot. I can't launch more than a 5000x5000 matrix with this.
I have a 9800 GX2, and the cubin file always shows 10 registers used. Dunno what the problem is.

Can someone please answer these for me:

1) How is the FSB limitation overcome in CUDA? Intel CPUs bottleneck on multicore performance beyond 2 cores when there are consecutive memory accesses from all the threads.
2) Maximum sizes of each dimension of a block: 512 x 512 x 64; maximum sizes of each dimension of a grid: 65535 x 65535 x 1. I wanted to know how these limits are fixed.
   The dimensions of a block basically mean the number of threads in a block, but that's limited to 512,

   and grid dimensions would mean the number of blocks.

Hello guys, why don't you just take a look at it once?

the problem is your addressing… e.g. Row*Col+k gives some weird results. this should be replaced by something like Col+Row*size.

you don't have bounds checking in your kernel, either. do something like if(Row<size && Col<size){…}.

your for-loop also seems strange, why are you iterating exactly 256 times with each kernel and not size times or something like that?

you can greatly reduce your runtime by accumulating into a register variable instead of global memory. so use float tmp=0.f; for(…){tmp+=…;} d_c[…]=tmp;.
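Putting those points together (row-major addressing, bounds check, register accumulation), a corrected naive kernel could look roughly like this. This is only a sketch of what I mean, not tested on your card; I'm assuming you want the full dot product, so the loop runs `size` times rather than `blockDim.x` times:

```cuda
// Naive square matrix multiply: d_c = d_a * d_b, row-major storage.
// One thread computes one output element.
__global__ void mul(const float *d_a, const float *d_b, float *d_c, int size)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if (Row < size && Col < size) {        // bounds check for partial blocks
        float tmp = 0.f;                   // accumulate in a register
        for (int k = 0; k < size; ++k)     // full dot product, not 16 steps
            tmp += d_a[Row * size + k] * d_b[k * size + Col];
        d_c[Row * size + Col] = tmp;       // single global write at the end
    }
}
```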

and don’t be so impatient ;-)

You should have a look at the matrix multiplication example in the CUDA programming guide. As was already pointed out, your kernel has issues.

clock() is an extremely imprecise time measurement. Your kernel call is probably completing in ~milliseconds or less. You need a higher resolution timer, e.g. gettimeofday on linux or QueryPerformanceCounter on windoze: these are wrapped in a cross platform way by CUT, which comes with the CUDA SDK. If you want extremely high precision timing of just the kernel launch, use the timers in the CUDA event API (read the programming guide).
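For the event API, a minimal sketch of timing a kernel from host code; I'm assuming `mul`, `dimgrid`, `dimblock`, and the device pointers from your post:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
mul<<<dimgrid, dimblock>>>(d_a, d_b, d_c, size);  // kernel to time
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);        // wait until the kernel has finished

float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```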

What do you mean? There is no FSB on the GPU. The GTX 280, for instance, has a 512-bit memory bus tied directly from the RAM to the memory manager on the GPU. It is capable of feeding the GPU with ~140 GiB/s of bandwidth (read or write).

Because the hardware scheduler was designed that way.

Note that you can’t always run 512 threads in a block depending on the number of registers used in the kernel.

Also, you said that you can’t launch a matrix larger than 5000x5000…if this is due to memory restrictions, you could write a kernel that does an “in-place” multiplication (overwrites one of the input matrices with the output to save space). You should get a decent increase in the maximum possible matrix size from doing that.

First, THANKS A LOT for replying and addressing my problems, thanks!

I changed my kernel to this one (can you please check this out):

[codebox]__global__ void mul(float *d_a, float *d_b, float *d_c, int size) {

int i, j, sz = (int)(size/blockDim.x);

float temp = 0;

int Row = blockIdx.y * blockDim.x + threadIdx.y;

int Col = blockIdx.x * blockDim.x + threadIdx.x;

if (Row < sz && Col < sz)

for (int k = 0; k < (int)(sz/blockDim.y); ++k)

    temp += d_a[Row*k+Col] * d_b[k*Col+Row];
[/codebox]




If you see my for loop, for (int k = 0; k < (int)(size/blockDim.x*blockDim.y); ++k)

where size is the row/col count of a square matrix, and blockDim.x*blockDim.y is the total number of threads, so this gives the number of blocks.

I'm trying to iterate this loop over all the existing blocks, or maybe I should've used gridDim. Correct me please if I'm wrong.

Yup, thanks for that, I'll definitely consider that advice, but I've already posted 3-4 different threads with no reply yet, so I thought I could just as well ask again ;-)

PLEASE correct me!

Thanks a lot for the reply, thanks again.

1) Hey, the matrix multiplication example in the manual involves so much optimisation, so I thought I'd write a simple CUDA-enabled MM first, then optimise it.

Hey, I tried using the CUT timer functions from cutil, but I got cutil.h problems. They were in the SDK directory; I even gave its path, but there was a linking problem, and I'm not root here so I can't configure paths. I'll try to make use of the cudaEvent routines, thanks.

2) I used "FSB" because I didn't find the right term for the GPU architecture. I thought the reason the GPU multicore thing is quick was the presence of 16 banks in shared memory, which allows a half warp to be processed, and a full warp in 4 clock cycles, while global memory has a latency of 400-600 clock cycles. Correct me if I'm wrong; I couldn't interpret what those clock cycles meant.

And the GPU has 2 frequencies, CORE and SHADER; which one do I take into account for FLOPS calculations?

3) Hardware scheduler? Yeah, I accept the 512-thread limitation per block; can you tell me the limitation on the NUMBER OF BLOCKS AND GRIDS?

Yup, I agree that if I have to make lots of threads I need to compromise on shared memory and registers per thread, these being 16 KB and 8192.

Hey, I have 1 GB per card, so what possible memory restrictions could I have? I allocate 3 float arrays on the GPU, that'll be 3*5000*5000*sizeof(float) = 300 MB, so I still have lots of space.

$ ./a.out 5000

cuda error : Kernel Invocation failed : invalid configuration argument

That could be a problem in host code where you call your kernel. The part between <<< and >>> is the configuration.

You use the shader clock to calculate flops.

The maximum number of blocks in a grid is 65535 x 65535. Each kernel runs on a single grid. I don't know if there's a limit on how many grids and/or kernels your code can have.

This doesn't change the answer. The hardware scheduler is still responsible for launching all blocks. Think about it: on one MP with 8 blocks running flat out, one of them suddenly completes. The GPU isn't going to go back to the host or anything dumb like that to get another block to run: they are all queued up in the hardware scheduler so that as soon as a block completes it is replaced by another.

With the GX2, each GPU gets half of the memory, so each one can only access 512MB. I would think you would be able to get closer to that limit, but from what I’ve read, there’s a decent amount of memory that you can’t ever access due to other processes using the GPU. Try testing some smaller numbers and see what the cutoff is between where it works and where it doesn’t.

It's the memory reserved for the mode switch, that's around 7-10 MB, just the bits on the screen depending on the resolution. So leaving that aside, I still have 500 MB on one card and 512 on the other.

Please answer this one, guys. My GPU hangs frequently while executing this piece of code and gives obscure results. Thanks in advance.

No, from what I’ve read on the forums, you still won’t be able to access the full 500MB. There are other things internal to the video/CUDA driver that use memory as well, reducing the amount of memory that you are able to access. There is a method used by some developers on this forum where they “brute-force” some memory allocation, that is, trying to allocate a huge block, then a slightly smaller one, and so on until the allocation is successful. If you try this as an experiment, you will be able to determine the actual amount of memory available to you for your specific setup.
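That brute-force probe might look something like this; just a sketch, and the 1 MB step size is an arbitrary choice of mine:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t step = 1 << 20;     /* shrink the request by 1 MB each try */
    size_t try_bytes  = 512u << 20;  /* start at the full 512 MB per GPU    */
    void *ptr = NULL;

    /* Keep shrinking the request until cudaMalloc succeeds. */
    while (try_bytes > 0 && cudaMalloc(&ptr, try_bytes) != cudaSuccess) {
        cudaGetLastError();          /* clear the error before retrying */
        try_bytes -= step;
    }

    if (try_bytes > 0) {
        printf("largest single allocation: %zu MB\n", try_bytes >> 20);
        cudaFree(ptr);
    }
    return 0;
}
```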


I'm only able to use 461 MB of memory.

Please, can I have an account of my remaining 52 MB of memory?

Thanks a lot profquail. I'd be grateful if you can tell me or link me to where my 42 MB has gone, assuming 10 MB for the mode switch.