Problem with a memory-coalescing bottleneck in a CUDA kernel

Hi cuda-mates.

I'm currently working on the performance of a recurrent neural network on the GPU, but it looks like I have a slow kernel, I guess due to problems with memory coalescing.

The problematic kernel computes the following function:

Hx_iq = -SUM[j=1 to n] (W_ij * Y_jq)

where:

    W = 1D array of n*n elements (n x n matrix stored row by row)
    Y = 1D array of n*p elements (n x p matrix stored row by row)
    sal = 1D array of n*p elements holding the result of the H function

I've coded it as follows:

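__global__ void H_Allocation_cuda(int *y, int *w, int n, int p, int *sal)
{
	// One block per row i of W (blockIdx.x), one thread per column q of Y (threadIdx.x).
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int out = 0;
	int j;

	for (j = 0; j < n; j++)
	{
		// w: every thread in the block reads the same element;
		// y: consecutive threads read consecutive elements.
		out = out + w[blockIdx.x * n + j] * y[j * p + threadIdx.x];
	}
	sal[idx] = -out;
}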

And the kernel launch:

H_Allocation_cuda<<<n,p>>>(d_Y,d_W,n,p,d_aux_h);
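
One variant that might be worth trying is sketched below. It is only a sketch under assumptions, not a measured fix: the kernel name H_Allocation_tiled, the block size TILE, the helper run_check and the test sizes are all made up for illustration. The idea is to stage a piece of row i of W in shared memory with one coalesced load per tile, and to decouple the block size from p so the launch no longer needs p threads per block. The small host program compares the result against a CPU loop.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 256  // assumed block size; also the width of the staged W tile

// Hypothetical variant: blockIdx.y selects row i of W, and the x dimension
// covers the p output columns in chunks of TILE threads.
__global__ void H_Allocation_tiled(const int *y, const int *w, int n, int p, int *sal)
{
    __shared__ int w_tile[TILE];
    int i = blockIdx.y;                             // output row
    int q = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int out = 0;

    for (int j0 = 0; j0 < n; j0 += TILE) {
        int j = j0 + threadIdx.x;
        // Consecutive threads load consecutive elements of row i of W.
        w_tile[threadIdx.x] = (j < n) ? w[i * n + j] : 0;
        __syncthreads();

        if (q < p) {
            int limit = min(TILE, n - j0);
            for (int k = 0; k < limit; k++)
                // Consecutive q values read consecutive addresses of y.
                out += w_tile[k] * y[(j0 + k) * p + q];
        }
        __syncthreads();  // keep the tile intact until every thread is done with it
    }
    if (q < p) sal[i * p + q] = -out;
}

// Runs the kernel on generated data and returns the number of mismatches
// against a plain CPU loop (0 means the results agree).
int run_check(int n, int p)
{
    std::vector<int> W(n * n), Y(n * p), H(n * p);
    for (int k = 0; k < n * n; k++) W[k] = (k % 7) - 3;
    for (int k = 0; k < n * p; k++) Y[k] = (k % 5) - 2;

    int *d_W, *d_Y, *d_H;
    cudaMalloc(&d_W, W.size() * sizeof(int));
    cudaMalloc(&d_Y, Y.size() * sizeof(int));
    cudaMalloc(&d_H, H.size() * sizeof(int));
    cudaMemcpy(d_W, W.data(), W.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Y, Y.data(), Y.size() * sizeof(int), cudaMemcpyHostToDevice);

    dim3 grid((p + TILE - 1) / TILE, n);  // blockDim.x must equal TILE
    H_Allocation_tiled<<<grid, TILE>>>(d_Y, d_W, n, p, d_H);
    cudaMemcpy(H.data(), d_H, H.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_W); cudaFree(d_Y); cudaFree(d_H);

    int mismatches = 0;
    for (int i = 0; i < n; i++)
        for (int q = 0; q < p; q++) {
            int ref = 0;
            for (int j = 0; j < n; j++) ref += W[i * n + j] * Y[j * p + q];
            if (H[i * p + q] != -ref) mismatches++;
        }
    return mismatches;
}

int main()
{
    int bad = run_check(300, 100);  // n > TILE exercises the tiling loop
    printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

Whether this is actually faster than the original kernel depends on n, p and the GPU, so it would need to be profiled. The bounds checks on q and j let it run when p is not a multiple of the block size.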

In the attached picture you can see the problem.

Can anyone give me some advice on how to solve it properly?

Thanks :)

But will CUBLAS solve the problem of non-coalesced memory access?

I'll have to take a look at this library.
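
For reference, here is a minimal sketch of how the same product could be expressed as a single cuBLAS GEMM call, so the library decides how memory is accessed. It assumes the data is stored as float rather than int (cublasSgemm works on float, so the int arrays above would have to be converted first), and the names h_allocation_cublas and run_cublas_check are hypothetical. cuBLAS uses column-major storage, so the row-major product H = -(W * Y) is computed as H^T = -(Y^T * W^T), which needs no explicit transposes.

```cuda
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper: H = -(W * Y) for row-major float device arrays.
// d_W is n x n, d_Y is n x p, d_H is n x p.
cublasStatus_t h_allocation_cublas(cublasHandle_t handle, const float *d_W,
                                   const float *d_Y, float *d_H, int n, int p)
{
    const float alpha = -1.0f;  // folds the leading minus sign into the GEMM
    const float beta = 0.0f;
    // A row-major n x p array has the same memory layout as a column-major
    // p x n matrix, so pass Y first and W second: H^T = -(Y^T * W^T).
    return cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       p, n, n,
                       &alpha,
                       d_Y, p,
                       d_W, n,
                       &beta,
                       d_H, p);
}

// Runs the helper on a tiny example and returns the number of mismatches
// against a plain CPU loop (0 means the results agree).
int run_cublas_check()
{
    const int n = 3, p = 2;
    std::vector<float> W = {1, 2, 3, 4, 5, 6, 7, 8, 9};  // n x n, row-major
    std::vector<float> Y = {1, 0, 0, 1, 1, 1};           // n x p, row-major
    std::vector<float> H(n * p);

    float *d_W, *d_Y, *d_H;
    cudaMalloc(&d_W, W.size() * sizeof(float));
    cudaMalloc(&d_Y, Y.size() * sizeof(float));
    cudaMalloc(&d_H, H.size() * sizeof(float));
    cudaMemcpy(d_W, W.data(), W.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Y, Y.data(), Y.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasStatus_t status = h_allocation_cublas(handle, d_W, d_Y, d_H, n, p);
    cudaMemcpy(H.data(), d_H, H.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(d_W); cudaFree(d_Y); cudaFree(d_H);
    if (status != CUBLAS_STATUS_SUCCESS) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; i++)
        for (int q = 0; q < p; q++) {
            float ref = 0.0f;
            for (int j = 0; j < n; j++) ref += W[i * n + j] * Y[j * p + q];
            if (H[i * p + q] != -ref) mismatches++;
        }
    return mismatches;
}

int main()
{
    int bad = run_cublas_check();
    printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

This would be built with something like nvcc file.cu -lcublas. Float arithmetic represents integers exactly only up to 2^24, so for large integer sums the result may differ from the int kernel.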