Example of Matrix Multiplication with Shared Memory

Hi! I'm totally new to CUDA. I've read the CUDA C Programming Guide (CUDA 4.0) and found the part (3.2.3) that describes shared memory through matrix multiplication. However, I don't get how to use the stride efficiently.
This was the struct used:

// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;
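
If I understand correctly, the stride field is what lets the kernel work on a BLOCK_SIZE x BLOCK_SIZE sub-matrix in place: the sub-matrix keeps the parent's elements pointer and stride, so the same indexing formula still works. The device helpers in the guide look roughly like this (copied from my reading of section 3.2.3, so the exact wording may differ):

// Read one element, using the stride to step between rows
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Write one element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
    A.elements[row * A.stride + col] = value;
}

// Return the BLOCK_SIZE x BLOCK_SIZE sub-matrix Asub of A that is
// col sub-matrices to the right and row sub-matrices down from the
// upper-left corner of A; no data is copied, only the pointer moves.
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                + BLOCK_SIZE * col];
    return Asub;
}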

Is it possible to find a main that uses this struct and the kernels proposed in that document? The samples provided don't use this struct. Something not too difficult to understand (without safeCall…) and using the CUDA runtime; I'm still a student.

I don't know how to choose the parameters (width, height, BLOCK_SIZE…) well enough to get good performance on the GPU. I've got a GPU with compute capability 2.1. If you need further information, please do not hesitate to ask!
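
For reference, the host code in the guide launches the kernel like this (with BLOCK_SIZE defined as 16), which I assume means width and height have to be multiples of BLOCK_SIZE:

// Launch configuration from the guide's host-side MatMul function
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

Is 16 a sensible block size on a compute capability 2.1 card, or should I pick something else?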

I was trying to understand that example just two weeks ago, I think, and I wrote a main function for it. This code is not a perfect sample for showing off CUDA's performance, but it can help you understand the example. Good luck!

matrix_shared.cu (3.39 KB)
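
In case the attachment is hard to read, the idea of the main is roughly this (a simplified sketch, not the exact file; it assumes the guide's Matrix struct, BLOCK_SIZE, MatMul and MatMulKernel from section 3.2.3 are in the same .cu file, and MakeMatrix is just a helper I made up here):

#include <cstdio>
#include <cstdlib>

// Host wrapper from the programming guide: copies A and B to the device,
// launches MatMulKernel and copies the result back into C.
void MatMul(const Matrix A, const Matrix B, Matrix C);

// Helper (not from the guide): allocate a host matrix filled with one value.
// For the guide's kernel, width and height should be multiples of BLOCK_SIZE.
Matrix MakeMatrix(int width, int height, float value)
{
    Matrix M;
    M.width    = width;
    M.height   = height;
    M.stride   = width;                    // row-major, no padding
    M.elements = (float*)malloc(width * height * sizeof(float));
    for (int i = 0; i < width * height; ++i)
        M.elements[i] = value;
    return M;
}

int main(void)
{
    const int N = 1024;                    // multiple of BLOCK_SIZE

    Matrix A = MakeMatrix(N, N, 1.0f);     // every element 1.0
    Matrix B = MakeMatrix(N, N, 2.0f);     // every element 2.0
    Matrix C = MakeMatrix(N, N, 0.0f);

    MatMul(A, B, C);                       // shared-memory kernel from the guide

    // With these inputs every element of C should equal 2.0 * N.
    printf("C[0] = %f (expected %f)\n", C.elements[0], 2.0f * N);

    free(A.elements);
    free(B.elements);
    free(C.elements);
    return 0;
}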

Thanks!! It helped me a lot!