Need help about global memory access by threads

Hello everyone.
I’m a novice at CUDA programming and I’m trying to write a simple kernel that performs a multiplication of two matrices.
The code is as follows:

__global__ void Vector2DMult(float* A, float* B, float* C, int v_dim, size_t pitch) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    if (i < v_dim && j < v_dim) {
        float* rowA = (float*) ((char*) A + i*pitch);
        float* rowB = (float*) ((char*) B + j*pitch);
        C[j] += rowA[j]*rowB[i];
    }
}

But as a result I get the following:

(0 1 2)   (1 2 3)   (14)              (18)
(3 4 5) * (4 5 6) = (40)  instead of  (66)
(6 7 8)   (7 8 9)   (72)              (132)

My feeling is that the kernel calculates only the last element in every row and every column (2*7 = 14; 5*8 = 40; 8*9 = 72).

So my guess is: does this happen because several threads CANNOT write to one address in global memory simultaneously?
Or is there some other reason why I get such results?

Many thanks in advance.

The graphics card does things in parallel. That is, the 32 threads in a warp execute exactly the same instruction in the same clock cycle. NB #1

Also, what appears to be one instruction in C is actually several.

So if you break this line down

C[j] += rowA[j]*rowB[i];

it becomes something like

1) fetch C[j], rowA[j] and rowB[i] from global memory

2) work on other warps or blocks until they all arrive (approx. 600 cycles)

3) calculate the result

4) store the result in C[j] (which will take a few hundred cycles to take effect)

The thing is that all threads read from C in step 1 before any results are written back to C in step 4, so every thread adds its product to the original (zero) value of C[j], and only the last write survives.

A different approach will work, and work very well. Have a look at the examples in the Programming Guide and the SDK.

NB #1

Well, 8 of them do it in one clock cycle, the next 8 in the next… and when all 32 have executed that instruction they all move on to the next one, up to the point where the other warps take their turn.

Lots of thanks, kbam. I’ve read so much stuff, but you solved my “brain deadlock”.

Can you give me some advice on good explanations of coalescing, warps and threads to read, and a more detailed explanation of the mechanism of kernel execution?

I’d be very grateful for this.

You will get a lot of what you want from the Programming Guide (v3.0), particularly sections 2.1, 2.2, 2.3, 4.1, 4.2 and B.4.

Also the Best Practices Guide (v3.0), section 3.2.

You can find them here:
http://developer.nvidia.com/object/cuda_3_0_downloads.html

Thanks. I’m gonna read these chapters first.