Hi cuda-mates.
I'm currently working on the performance of a recurrent neural network on the GPU, but it looks like I have
one slow kernel, due to problems with memory coalescing, I guess.
The problematic kernel computes the following function:
Hx_iq = -SUM[j=1 to n] (W_ij * Y_jq)
where:
W = 1D array of n*n elements
Y = 1D array of n*p elements
sal = 1D array of n*p elements that holds the result of the H function.
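To make the indexing explicit, here is a plain CPU sketch of the same sum. It assumes row-major storage (W_ij at w[i*n+j], Y_jq at y[j*p+q], the result at sal[i*p+q]), which is what the kernel's indexing implies; the name H_reference is made up just for this illustration.

```c
/* CPU reference for H_iq = -SUM[j] (W_ij * Y_jq).
   Assumption: row-major 1D storage, W has n*n elements,
   Y and sal have n*p elements each. */
void H_reference(const int *y, const int *w, int n, int p, int *sal)
{
    for (int i = 0; i < n; i++) {        /* row of W (blockIdx.x in the kernel) */
        for (int q = 0; q < p; q++) {    /* column of Y (threadIdx.x in the kernel) */
            int out = 0;
            for (int j = 0; j < n; j++) {
                out += w[i * n + j] * y[j * p + q];
            }
            sal[i * p + q] = -out;
        }
    }
}
```

Something like this can also be used to check the GPU output on small inputs.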
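I've coded it as follows:
__global__ void H_Allocation_cuda(int *y, int *w, int n, int p, int *sal)
{
    // One block per row i of W (blockIdx.x), one thread per column q of Y (threadIdx.x).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int out = 0;
    int j;
    for (j = 0; j < n; j++)
    {
        // Accumulate W_ij * Y_jq over j.
        out = out + w[blockIdx.x * n + j] * y[j * p + threadIdx.x];
    }
    sal[idx] = -out;
}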
And the call:
H_Allocation_cuda<<<n,p>>>(d_Y,d_W,n,p,d_aux_h);
In the attached picture you can see the problem.
Can anyone give me some advice on how to solve it properly?
Thanks :)