Hi.
I'm a beginner who recently started learning CUDA, and I'm stuck on a matrix multiplication problem. I've searched for posts about matrix multiplication, but I haven't found the help I need (maybe I should search more, though…).
In the SDK example of matrix multiplication with shared memory, the matrix sizes are fixed at compile time, as in the following source code.
[codebox]// Thread block size
#define BLOCK_SIZE 16
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (3 * BLOCK_SIZE) // Matrix A width
#define HA (5 * BLOCK_SIZE) // Matrix A height
#define WB (8 * BLOCK_SIZE) // Matrix B width
#define HB WA // Matrix B height
#define WC WB // Matrix C width
#define HC HA // Matrix C height[/codebox]
My problem is how to handle matrices whose sizes are only known at run time. I can think of two problematic cases. First, say I choose a 16x16 block size for shared memory, but the matrices turn out to be 5x3 and 3x8. In this case, BLOCK_SIZE is larger than the matrices themselves. Second, the matrix dimensions might not be multiples of BLOCK_SIZE. Then the threads in the partial blocks at the edges would compute row or column indices outside the matrix and might end up accessing the wrong addresses (right?).
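To make the two cases concrete, here is a minimal host-side sketch in plain C++ (no GPU needed) of how I understand the indexing. It assumes the grid is sized by ceiling division so that partial blocks at the edges still get launched, and that each thread maps to one element of C as in the SDK sample. The helper names `blocksNeeded` and `threadsOutOfRange` are my own, not from the SDK.

```cpp
#include <cstdio>

// Hypothetical helper (not from the SDK): number of blocks of width
// blockSize needed to cover n elements. Ceiling division, so a partial
// block at the edge is still counted.
constexpr int blocksNeeded(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical helper: emulates the thread indexing of a 2D launch on the
// CPU and counts how many threads would map to an element outside a
// rows x cols result matrix C.
int threadsOutOfRange(int rows, int cols, int blockSize) {
    const int gridX = blocksNeeded(cols, blockSize);
    const int gridY = blocksNeeded(rows, blockSize);
    int outside = 0;
    for (int by = 0; by < gridY; ++by) {
        for (int bx = 0; bx < gridX; ++bx) {
            for (int ty = 0; ty < blockSize; ++ty) {
                for (int tx = 0; tx < blockSize; ++tx) {
                    // Same mapping a kernel would use:
                    // row = blockIdx.y * BLOCK_SIZE + threadIdx.y, etc.
                    const int row = by * blockSize + ty;
                    const int col = bx * blockSize + tx;
                    if (row >= rows || col >= cols) {
                        ++outside;  // this thread has no valid element of C
                    }
                }
            }
        }
    }
    return outside;
}

// Prints the grid size and the number of out-of-range threads for one case.
void report(int rows, int cols, int blockSize) {
    std::printf("C is %dx%d, block %dx%d: grid %dx%d, %d threads out of range\n",
                rows, cols, blockSize, blockSize,
                blocksNeeded(cols, blockSize), blocksNeeded(rows, blockSize),
                threadsOutOfRange(rows, cols, blockSize));
}
```

For example, calling `report(5, 8, 16)` covers my first case (A is 5x3 and B is 3x8, so C is 5x8), and `report(50, 70, 16)` covers the second. If this picture is right, every out-of-range thread would need some kind of check before it reads or writes memory, but I'm not sure that is the proper way to handle it.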
In short, my questions are:

How do I choose the right BLOCK_SIZE at run time? (The matrix sizes are arbitrary, so the matrices may be rectangular rather than square.)

How do I handle the case where the matrix dimensions are not multiples of BLOCK_SIZE?
Thanks in advance for your help and time!