Hi cuda-mates.

I'm currently working on the performance of a recurrent neural network on the GPU, but it looks like I have an inefficient kernel, I guess due to problems with memory coalescing.

The problematic kernel computes the following function:

Hx_iq = -SUM[j=1 to n] (W_ij * Y_jq)

where:

- W = 1D array of n*n elements
- Y = 1D array of n*p elements
- sal = 1D array of n*p elements holding the result of the H function.

I've coded it as follows:

```
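// One block per row i of W (blockIdx.x), one thread per column q of Y (threadIdx.x).
__global__ void H_Allocation_cuda(int *y, int *w, int n, int p, int *sal)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // i*p + q when launched as <<<n,p>>>
    int out = 0;
    for (int j = 0; j < n; j++)
    {
        // Row blockIdx.x of w times column threadIdx.x of y
        out += w[blockIdx.x * n + j] * y[j * p + threadIdx.x];
    }
    sal[idx] = -out;
}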
```

And the kernel launch:

```
H_Allocation_cuda<<<n,p>>>(d_Y,d_W,n,p,d_aux_h);
```
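For reference, a plain CPU version of the same formula can be handy for checking the kernel's output. This is only a minimal sketch: the name `h_reference` is hypothetical (it is not part of my actual code), and it assumes row-major storage, i.e. `W_ij` at `w[i*n+j]` and `Y_jq` at `y[j*p+q]`, which is what the kernel's indexing implies.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for sal[i*p + q] = -sum_j w[i*n + j] * y[j*p + q].
// Assumes row-major 1D storage, matching the kernel's indexing
// (i plays the role of blockIdx.x, q the role of threadIdx.x).
std::vector<int> h_reference(const std::vector<int>& y,
                             const std::vector<int>& w,
                             int n, int p)
{
    std::vector<int> sal(static_cast<std::size_t>(n) * p, 0);
    for (int i = 0; i < n; ++i) {          // row of W
        for (int q = 0; q < p; ++q) {      // column of Y
            int out = 0;
            for (int j = 0; j < n; ++j) {
                out += w[i * n + j] * y[j * p + q];
            }
            sal[i * p + q] = -out;
        }
    }
    return sal;
}
```

Comparing this element by element against `d_aux_h` copied back to the host should confirm whether a rewritten kernel still computes the same values.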

The attached picture shows the problem.

Can anyone give me some advice on how to solve it properly?

Thanks :)