Generalized SGEMM

Hi.
Could anyone give me an idea of how I could implement SGEMM for matrices of generalized dimensions?

I’m having quite a time deciding on the tile size, given generalized dimensions for the two input matrices.
The code samples provided in the SDK only cover the convenient case of both input matrices being square.

I understand that such a query has been raised in the past and that the link below was cited as a reference for the approach, but I am not able to access the links to the papers provided there.
The Official NVIDIA Forums | NVIDIA

Help, please.

What about using cublasSgemm? Is there any reason why you would not want to use it?

I am guessing you could look at their source code to figure out how to write one.

I would like to try it out on my own.

Could you direct me to the source code without me having to download the entire ~70MB CUDA Toolkit?

No replies… :(

Ok. Let me break this down and explain the part of the algorithm which is troubling me…

The problem I’m facing is with generalized matrices, say A(m,n) * B(n,p), which results in C(m,p), where m, n and p are the dimensions of the matrices A, B and C. I can’t see how to reconcile the tile sizes with the generalized dimensions m, n and p.

In this case, say I choose a tile size of 2x2 (since every element of the matrix is visited exactly twice to participate in the result of C).
Having chosen 2x2 tiles, the number of blocks in the program becomes MAX(A.width/tile_size, B.width/tile_size) x MAX(A.height/tile_size, B.height/tile_size), and I’ll have one row/column which does not fit into a block. How do I manage the multiplication in this case?

Eg: Say I have A(23,41) and B(41,7). Considering blocks consisting of 2x2 tiles, the number of blocks in the program ought to be MAX(23/2, 41/2) x MAX(41/2, 7/2), which is 20x20 with integer division. So, if I have 20x20 blocks consisting of 2x2 tiles each, how do I manage the elements of the 41st row in B? Do I have to handle them separately once the 20x20 blocks have done their job?

With memory coalescing as an objective, would the above strategy for dividing the matrices into blocks work, or is there a better way to do it?

Perhaps you can try padding the matrices with zeros?
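To make the padding idea concrete, here is a minimal host-side sketch in plain C++. It assumes row-major storage; the names round_up and pad_with_zeros are hypothetical helpers invented for this illustration, not part of the SDK or cuBLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of tile (assumes n >= 0 and tile > 0).
int round_up(int n, int tile) {
    assert(n >= 0 && tile > 0);
    return ((n + tile - 1) / tile) * tile;
}

// Copy a row-major rows x cols matrix into a zero-filled buffer whose
// dimensions are multiples of tile. The padded sizes are returned through
// padded_rows and padded_cols. (Hypothetical helper for illustration.)
std::vector<float> pad_with_zeros(const std::vector<float>& src,
                                  int rows, int cols, int tile,
                                  int& padded_rows, int& padded_cols) {
    assert(static_cast<int>(src.size()) == rows * cols);
    padded_rows = round_up(rows, tile);
    padded_cols = round_up(cols, tile);
    std::vector<float> dst(
        static_cast<std::size_t>(padded_rows) * padded_cols, 0.0f);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            // Source and destination use different row strides.
            dst[r * padded_cols + c] = src[r * cols + c];
        }
    }
    return dst;
}
```

Zeros added along the shared dimension contribute nothing to the dot products, and the extra rows and columns of the padded C can simply be ignored when copying the result back. The cost is the extra memory and the copy.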

Alternatively, you could mask out the unneeded operations using if statements.
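A rough CPU-side sketch of the masking approach, in plain C++ so it can be checked without a GPU. It assumes row-major storage and one block per tile of C. The by/bx loops stand in for the grid of blocks and the ty/tx loops for the threads inside a block; in a real kernel those indices would come from blockIdx and threadIdx. The function name tiled_matmul_masked is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU sketch of C(m,p) = A(m,n) * B(n,p) with tile x tile blocks.
// All three tile counts are rounded up, and if statements mask out the
// work that falls outside the matrices in the partially filled edge tiles.
std::vector<float> tiled_matmul_masked(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       int m, int n, int p, int tile) {
    assert(tile > 0);
    assert(static_cast<int>(A.size()) == m * n);
    assert(static_cast<int>(B.size()) == n * p);
    std::vector<float> C(static_cast<std::size_t>(m) * p, 0.0f);

    const int grid_y  = (m + tile - 1) / tile;  // blocks down the rows of C
    const int grid_x  = (p + tile - 1) / tile;  // blocks across the columns of C
    const int k_steps = (n + tile - 1) / tile;  // tile steps along the shared dimension

    for (int by = 0; by < grid_y; ++by)
    for (int bx = 0; bx < grid_x; ++bx)
    for (int ty = 0; ty < tile; ++ty)
    for (int tx = 0; tx < tile; ++tx) {
        const int row = by * tile + ty;  // row of C handled by this "thread"
        const int col = bx * tile + tx;  // column of C handled by this "thread"
        float acc = 0.0f;
        for (int s = 0; s < k_steps; ++s) {
            for (int kk = 0; kk < tile; ++kk) {
                const int k = s * tile + kk;
                // Mask: skip anything outside A or B.
                if (row < m && col < p && k < n) {
                    acc += A[row * n + k] * B[k * p + col];
                }
            }
        }
        // Mask: only "threads" that map to a real element of C write back.
        if (row < m && col < p) {
            C[row * p + col] = acc;
        }
    }
    return C;
}
```

In a shared-memory kernel the same bounds checks would typically guard the tile loads (storing 0 for out-of-range elements) and the final store to C, so no separate pass over the leftover row or column is needed.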

The important part is that you round the number of blocks up, not down.
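As a small worked example of rounding up, here is a sketch in plain C++. It assumes the common arrangement of one block per tile of the result C, so the grid follows the dimensions of C rather than the MAX of A and B. The names ceil_div, TileCounts and tile_counts are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>

// Integer ceiling division: the smallest q with q * tile >= n.
// Assumes n >= 0 and tile > 0.
int ceil_div(int n, int tile) {
    assert(n >= 0 && tile > 0);
    return (n + tile - 1) / tile;
}

// Hypothetical helper type: tile counts for C(m,p) = A(m,n) * B(n,p).
struct TileCounts {
    int grid_x;   // blocks across the columns of C (dimension p)
    int grid_y;   // blocks down the rows of C (dimension m)
    int k_steps;  // tile steps along the shared dimension n
};

TileCounts tile_counts(int m, int n, int p, int tile) {
    return TileCounts{ceil_div(p, tile), ceil_div(m, tile), ceil_div(n, tile)};
}
```

For A(23,41) and B(41,7) with 2x2 tiles, tile_counts(23, 41, 7, 2) gives 4 x 12 blocks over C and 21 steps along the shared dimension. The 41st row of B then falls into the last, half-empty step instead of being left over, and it is handled there by padding or masking.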