tera

Yes, you’re right. althougt I use

```
result += sdata[i] * M[v2_pos + (v2_size * i)];
```

instead of

```
result += sdata[i] * M[v2_pos * v1_size + i];
```

I think this access is more coalescing. Using it, I need the Matrix to be trasposed.

avidday

Thank you, that thread was very interesting to me.

I’m using it for neural nets, but I calculate “r= A1x1 + A2x2…+Anxn - b” instead of “r=Ax - b”. So the kernels I have calculate “r = r + Ax”. And in the beginning I got another kernel that calculates “r = -b”. The sizes of r (output) and x (input) can be different. I just thought “r = Ax” was a more general kernel to discuss.

One of my kernels to calculate “r = r + Ax” is pretty similar to yours. This one needs the matrix to be transposed but since I call the kernel a lot of times, I can tranpose it just once and keep it in memory this way. So I’m not counting the transposing time.

In the kernel that is similar to yours i have changed

```
float partial_sum = 0.0f;
for(int idx = threadIdx.x; idx < N_; idx += blockDim.x) {
float Aval = A[rowIdx+idx];
float xval = x[idx];
partial_sum += Aval * xval;
}
```

for

```
float partial_sum = 0.0f;
for(int idx = threadIdx.x; idx < N_; idx += blockDim.x) {
partial_sum += A[rowIdx+idx] * x[idx];
}
```

and i get a little improvement, but I don’t understand it.

This new kernel is faster than the one equivalent to yours (note that I don’t count the time I spend tranposing the matrix), but, as you said, the size of the input (x) is limited by the size of the shared memory. However, I can make partitions of x and At (tranposed A) easily and call the kernel how many times as neccesary.

Thank you very much, I thougt that I was wasting my time and my kernels were bad since I was getting better times with an assembly (with SSE2, XMM) than with CUDA. Now I know that my card (a 9500GT) is the one I have to blame for that.

Unfortunately I’m not using the nvidia tools to measure performance and I use just the clock since I’m used to it.