Problem with a memory coalescing bottleneck in a CUDA kernel

Hi cuda-mates.

I'm currently working on the performance of a recurrent neural network on the GPU, but it looks like I have a slow kernel, I guess because its memory accesses are not coalesced.

The problematic kernel computes the following function:

Hx_iq = -SUM[j=1 to n] (W_ij * Y_jq)

where:

    W = 1D array of n*n elements

    Y = 1D array of n*p elements

    sal = 1D array of n*p elements that holds the result of the H function.
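
For reference, the definition above is a matrix product, H = -(W * Y). Here is a minimal sketch of one way to map it to threads so that global memory accesses are coalesced. It assumes row-major storage (w[i*n + j] = W_ij, y[j*p + q] = Y_jq, sal[i*p + q] = H_iq), which the post does not state. The names h_rowmajor_sketch and h_on_gpu are hypothetical, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sketch only. Assumed layout (not stated in the post): row-major,
// w[i*n + j] = W_ij, y[j*p + q] = Y_jq, sal[i*p + q] = H_iq.
__global__ void h_rowmajor_sketch(const int *y, const int *w, int n, int p, int *sal)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;  // column: varies fastest across a warp
    int i = blockIdx.y;                             // row: one row per blockIdx.y
    if (q >= p || i >= n) return;

    int out = 0;
    for (int j = 0; j < n; ++j) {
        // w[i*n + j]: every thread of the warp reads the same address (broadcast).
        // y[j*p + q]: consecutive q means consecutive addresses (coalesced).
        out += w[i * n + j] * y[j * p + q];
    }
    sal[i * p + q] = -out;  // consecutive q again, so the store is coalesced too
}

// Hypothetical host helper: copies the inputs, launches the kernel, returns H.
// Assumes n <= 65535 because rows are mapped to gridDim.y.
std::vector<int> h_on_gpu(const std::vector<int> &w, const std::vector<int> &y, int n, int p)
{
    int *d_w = nullptr, *d_y = nullptr, *d_sal = nullptr;
    cudaMalloc(&d_w, w.size() * sizeof(int));
    cudaMalloc(&d_y, y.size() * sizeof(int));
    cudaMalloc(&d_sal, static_cast<size_t>(n) * p * sizeof(int));
    cudaMemcpy(d_w, w.data(), w.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), y.size() * sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(128);
    dim3 grid((p + block.x - 1) / block.x, n);
    h_rowmajor_sketch<<<grid, block>>>(d_y, d_w, n, p, d_sal);

    std::vector<int> sal(static_cast<size_t>(n) * p);
    // A blocking cudaMemcpy waits for the kernel on the default stream to finish.
    cudaMemcpy(sal.data(), d_sal, sal.size() * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_w);
    cudaFree(d_y);
    cudaFree(d_sal);
    return sal;
}
```

The point of the mapping is that the index that changes between neighbouring threads (q) is also the index that is contiguous in memory. If the arrays are actually stored column-major, the roles of the two indices would have to be swapped.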

I've coded it as follows:

__global__ void H_Allocation_cuda(int *y, int *w, int n, int p, int *sal)


	int idx=blockIdx.x*blockDim.x+threadIdx.x;

	int out=0;

	int j;








And the call


In the attached picture you can see the problem.

Can anyone give me some advice on how to solve it properly?

Thanks :)

But will CUBLAS solve the problem of non-coalesced memory access?

I'll have to take a look at this library.
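
A minimal sketch of what the CUBLAS route might look like. It assumes the data is converted to float (cublasSgemm works on float, while the kernel above uses int) and stored row-major. CUBLAS expects column-major matrices, so the sketch computes the transposed product H^T = -(Y^T * W^T), which has the same memory image as row-major H. The helper name h_with_cublas is hypothetical, and error checking is left out. Memory access patterns are then handled inside the library's GEMM routine.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Sketch only. Assumes row-major float arrays: w is n x n, y is n x p.
// Returns row-major H = -(W * Y) as an n x p array.
std::vector<float> h_with_cublas(const std::vector<float> &w,
                                 const std::vector<float> &y, int n, int p)
{
    float *d_w = nullptr, *d_y = nullptr, *d_h = nullptr;
    cudaMalloc(&d_w, w.size() * sizeof(float));
    cudaMalloc(&d_y, y.size() * sizeof(float));
    cudaMalloc(&d_h, static_cast<size_t>(n) * p * sizeof(float));
    cudaMemcpy(d_w, w.data(), w.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), y.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = -1.0f;  // the minus sign of the formula
    const float beta = 0.0f;    // do not accumulate into the output
    // Column-major view: the y buffer is Y^T (p x n, ld = p), the w buffer is
    // W^T (n x n, ld = n), and the output buffer is H^T (p x n, ld = p).
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                p, n, n,
                &alpha, d_y, p, d_w, n,
                &beta, d_h, p);

    std::vector<float> h(static_cast<size_t>(n) * p);
    cudaMemcpy(h.data(), d_h, h.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_w);
    cudaFree(d_y);
    cudaFree(d_h);
    return h;
}
```

This needs to be linked against the CUBLAS library (for example with -lcublas when building with nvcc).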