Hi,

I am writing an application that needs to multiply a (row) vector by a matrix (y = x*A). So I wrote this kernel:

```
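// Computes out = in * mat for a row-major tam x tam matrix, one thread per output column.
__global__ void calc(float* mat, float* in, float* out, int tam) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;  // column handled by this thread
    if (ix >= tam) return;                           // guard in case the grid is larger than tam
    float ans = 0.0f;
    int j = 0;                                       // row index, also the index into in
    // Walk down column ix: element (j, ix) is stored at mat[j * tam + ix].
    for (int i = ix; i < tam * tam; i += tam) {
        ans += mat[i] * in[j];
        j++;
    }
    out[ix] = ans;
}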
```

It is correct: I’ve tested it against several matrices and vectors, and the results are all fine.
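
A check of this kind can be sketched on the host as below. This is only an illustration, assuming the matrix is stored row-major; the helper names `reference_vecmat` and `max_abs_diff` are hypothetical and not part of the original code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for y = x * A, with A stored row-major as tam x tam.
// Same indexing as the kernel: element (j, col) is at mat[j * tam + col].
std::vector<float> reference_vecmat(const std::vector<float>& mat,
                                    const std::vector<float>& in, int tam) {
    std::vector<float> out(tam, 0.0f);
    for (int col = 0; col < tam; ++col) {
        float ans = 0.0f;
        for (int j = 0; j < tam; ++j) {
            ans += mat[j * tam + col] * in[j];
        }
        out[col] = ans;
    }
    return out;
}

// Largest absolute difference between two vectors of equal length.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float worst = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k) {
        worst = std::max(worst, std::fabs(a[k] - b[k]));
    }
    return worst;
}
```

The output copied back from the GPU would be compared with `reference_vecmat` using a small tolerance on `max_abs_diff`, since the summation order can differ between host and device.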

The problem is that when I run cudaprof on a small example (64x64), I get the following result: “gld coalesced = 128” and “gld uncoalesced = 2048”. I don’t know where these uncoalesced accesses come from.
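
To make the question concrete, the indexes touched in one loop iteration can be listed on the host. The sketch below is not a diagnosis, only a way to look at the access pattern. It assumes a half-warp of 16 threads (the unit over which early CUDA hardware coalesces loads), and all helper names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

const int kHalfWarp = 16;  // assumption: coalescing is evaluated per 16 threads

// Index into mat read by thread ix at loop iteration j (same arithmetic as the kernel).
int mat_index(int ix, int j, int tam) { return ix + j * tam; }

// Index into in read at iteration j; it does not depend on the thread.
int in_index(int /*ix*/, int j) { return j; }

// Indexes into mat read by the half-warp that starts at first_thread.
std::vector<int> half_warp_mat(int first_thread, int j, int tam) {
    std::vector<int> idx;
    for (int t = 0; t < kHalfWarp; ++t) {
        idx.push_back(mat_index(first_thread + t, j, tam));
    }
    return idx;
}

// Indexes into in read by the half-warp that starts at first_thread.
std::vector<int> half_warp_in(int first_thread, int j) {
    std::vector<int> idx;
    for (int t = 0; t < kHalfWarp; ++t) {
        idx.push_back(in_index(first_thread + t, j));
    }
    return idx;
}

// True when each index is exactly one more than the previous one.
bool is_consecutive(const std::vector<int>& idx) {
    for (std::size_t k = 1; k < idx.size(); ++k) {
        if (idx[k] != idx[k - 1] + 1) return false;
    }
    return true;
}
```

With tam = 64, the threads of a half-warp read consecutive elements of mat in every iteration, while they all read the same element of in.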

Can someone help me?

Thank you,

Daniel.

P.S.: “in” is declared as `__constant__`.