Hello everyone.
I’m a novice at CUDA programming and I’m trying to write a simple kernel that multiplies two matrices.
The code is as follows:
__global__ void Vector2DMult(float* A, float* B, float* C, int v_dim, size_t pitch) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    if (i < v_dim && j < v_dim) {
        float* rowA = (float*) ((char*) A + i*pitch);
        float* rowB = (float*) ((char*) B + j*pitch);
        C[j] += rowA[j]*rowB[i];
    }
}
My feeling is that the kernel calculates only the last elements in every row and every column (e.g. 2*7=14; 5*8=40; 8*9=72).
So my guess is: does this happen because several threads CANNOT write to one address in global memory simultaneously?
Or is there some other reason why I get such results?
The graphics card does things in parallel. That is, 32 threads in a warp execute exactly the same instruction in the same clock cycle. NB #1
Also, what appears to be one instruction in C is actually several.
So if you break this line down
C[j] += rowA[j]*rowB[i];
it becomes something like
1) fetch C[j], rowA[j] and rowB[i] from global memory
2) work on other warps or blocks until they all arrive (approx. 600 cycles)
3) calculate the result
4) store the result in C[j] (which will take a few hundred cycles to take effect)
The thing is, you can assume that all threads are going to read from C in step 1 before any results are written back to C in step 4, so every thread's += overwrites the others and only one partial product survives.
A different approach will work, and work very well. Have a look at the matrix multiplication example in the Programming Guide and the SDK.
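To see the lost-update problem concretely, here is a hedged sketch of your kernel with the read-modify-write on C[j] made atomic via atomicAdd (float atomicAdd needs compute capability 2.0 or later). This makes the accumulation correct, but it is not the recommended fix, because all threads writing the same element serialize on that address:

```cuda
// Sketch only: atomicAdd performs the read-modify-write of C[j] as one
// indivisible operation, so partial products are no longer lost.
// Correct, but slow: threads targeting the same C[j] serialize.
__global__ void Vector2DMultAtomic(float* A, float* B, float* C,
                                   int v_dim, size_t pitch) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    if (i < v_dim && j < v_dim) {
        float* rowA = (float*)((char*)A + i * pitch);
        float* rowB = (float*)((char*)B + j * pitch);
        atomicAdd(&C[j], rowA[j] * rowB[i]);  // serialized accumulation
    }
}
```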
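For reference, a minimal sketch of the approach used in the Programming Guide example: one thread per output element, with each thread looping over the shared dimension and accumulating its own dot product in a register. No two threads ever write the same address, so no atomics are needed. (Kernel name and the assumption of a square v_dim-by-v_dim matrix with pitched rows are mine; the SDK version also tiles through shared memory for speed.)

```cuda
// Sketch: one thread computes one element of C. The accumulator `sum`
// lives in a register private to the thread, so there is no race.
__global__ void MatMul(const float* A, const float* B, float* C,
                       int v_dim, size_t pitch) {
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    if (row < v_dim && col < v_dim) {
        const float* rowA = (const float*)((const char*)A + row * pitch);
        float sum = 0.0f;
        for (int k = 0; k < v_dim; ++k) {
            // row k of B, element col: B[k][col] in pitched layout
            const float* rowB = (const float*)((const char*)B + k * pitch);
            sum += rowA[k] * rowB[col];
        }
        float* rowC = (float*)((char*)C + row * pitch);
        rowC[col] = sum;  // each thread owns exactly one element of C
    }
}
```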
NB #1
Well, 8 do it in 1 clock cycle, the next 8 in the next… and when all 32 have done that instruction they all move on to the next one, up to a point where the other warps take their turn.
Many thanks, kbam. I’ve read so much stuff, but you solved my “brain deadlock”.
Can you give me some advice on good explanations of coalescing, warps and threads to read, and a more detailed explanation of the kernel execution mechanism?