problem with coalesced memory bottleneck in cuda kernel

furi3r · July 4, 2011, 4:58pm

Hi cuda-mates.

I'm currently working on the performance of a recurrent neural network on the GPU, but one kernel is a bottleneck, I guess because of uncoalesced memory accesses.

The problematic kernel computes the following function:

Hx_iq = -SUM[j=1 to n] (W_ij * Y_jq)

where:

W[] = 1D array of n*n elements

Y[] = 1D array of n*p elements

sal[] = 1D array of n*p elements holding the result of the H function.

I've coded it as follows:

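__global__ void H_Allocation_cuda(int *y, int *w, int n, int p, int *sal)
{
	// One block per row i of W (blockIdx.x), one thread per column q of Y (threadIdx.x).
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int out = 0;
	int j;

	for (j = 0; j < n; j++)
	{
		out = out + w[blockIdx.x * n + j] * y[j * p + threadIdx.x];
	}
	sal[idx] = -out;
}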

And the call:

H_Allocation_cuda<<<n,p>>>(d_Y,d_W,n,p,d_aux_h);

The attached picture shows the problem.

Can anyone give me some advice on how to solve this properly?

Thanks :)

LSChien · July 5, 2011, 1:35am

You can use CUBLAS to compute this matrix multiplication.

If you want to implement this GEMM yourself, then your tiling is not good.

Please look at the matrixMul sample in the SDK.
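
To make the tiling suggestion concrete, here is a minimal sketch of a shared-memory tiled version of the same computation, in the spirit of the matrixMul sample (it is not a copy of the SDK code). The names H_tiled, run_H_tiled and TILE are hypothetical and do not appear in the original post. It assumes W is n x n and Y is n x p, both row-major, as the indexing in the original kernel implies. Each block loads a TILE x TILE piece of W and of Y with consecutive threads reading consecutive addresses, then reuses the pieces from shared memory.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // hypothetical tile width, not taken from the thread

// sal[i*p + q] = -sum_j w[i*n + j] * y[j*p + q]
// w is n x n; y and sal are n x p; all row-major.
__global__ void H_tiled(const int *y, const int *w, int n, int p, int *sal)
{
    __shared__ int wTile[TILE][TILE];
    __shared__ int yTile[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // index i
    int col = blockIdx.x * TILE + threadIdx.x;  // index q
    int out = 0;

    for (int t = 0; t < n; t += TILE) {
        int wc = t + threadIdx.x;  // column of W loaded by this thread
        int yr = t + threadIdx.y;  // row of Y loaded by this thread

        // Consecutive threadIdx.x values read consecutive global addresses.
        // Out-of-range elements are padded with 0 so they add nothing.
        wTile[threadIdx.y][threadIdx.x] = (row < n && wc < n) ? w[row * n + wc] : 0;
        yTile[threadIdx.y][threadIdx.x] = (yr < n && col < p) ? y[yr * p + col] : 0;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            out += wTile[threadIdx.y][k] * yTile[k][threadIdx.x];
        __syncthreads();  // finish using the tiles before they are overwritten
    }

    if (row < n && col < p)
        sal[row * p + col] = -out;
}

// Host helper (hypothetical): copies inputs, launches the kernel, returns sal.
// Error checking is omitted to keep the sketch short.
std::vector<int> run_H_tiled(const std::vector<int> &w,
                             const std::vector<int> &y, int n, int p)
{
    int *d_w, *d_y, *d_sal;
    cudaMalloc(&d_w, sizeof(int) * n * n);
    cudaMalloc(&d_y, sizeof(int) * n * p);
    cudaMalloc(&d_sal, sizeof(int) * n * p);
    cudaMemcpy(d_w, w.data(), sizeof(int) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), sizeof(int) * n * p, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((p + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    H_tiled<<<grid, block>>>(d_y, d_w, n, p, d_sal);

    std::vector<int> sal(n * p);
    cudaMemcpy(sal.data(), d_sal, sizeof(int) * n * p, cudaMemcpyDeviceToHost);
    cudaFree(d_w);
    cudaFree(d_y);
    cudaFree(d_sal);
    return sal;
}
```

Unlike the <<<n,p>>> launch, the grid here is derived from n and p, so p is no longer limited by the maximum number of threads per block. Whether this beats the original kernel depends on the GPU and the sizes involved, so it is worth profiling both.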

furi3r · July 5, 2011, 7:36am

But will CUBLAS solve the problem of non-coalesced memory access?

I'll have to take a look at this library.
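
For anyone who wants to try the library route, below is a rough sketch of what the CUBLAS call could look like. With CUBLAS the caller writes no indexing code, so the memory access pattern becomes the library's job. Two assumptions are mine and not from the thread: the data is converted from int to float, because cublasSgemm works on float, and the helper name H_cublas is hypothetical. CUBLAS expects column-major storage, so the row-major product sal = -(W * Y) is obtained by asking for Y^T-by-W^T in column-major terms, which is the same memory layout. The factor -1 goes into alpha.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Computes sal = -(W * Y) for row-major W (n x n) and Y (n x p).
// Assumption: values are stored as float (the original kernel uses int).
// Error checking is omitted to keep the sketch short.
std::vector<float> H_cublas(const std::vector<float> &w,
                            const std::vector<float> &y, int n, int p)
{
    float *d_w, *d_y, *d_sal;
    cudaMalloc(&d_w, sizeof(float) * n * n);
    cudaMalloc(&d_y, sizeof(float) * n * p);
    cudaMalloc(&d_sal, sizeof(float) * n * p);
    cudaMemcpy(d_w, w.data(), sizeof(float) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), sizeof(float) * n * p, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = -1.0f;  // applies the leading minus sign
    const float beta = 0.0f;    // sal is overwritten, not accumulated

    // Row-major sal (n x p) is column-major sal^T (p x n) = Y^T * W^T.
    // A row-major Y seen as column-major is already Y^T (p x n, ld = p),
    // and likewise for W (n x n, ld = n), so no transposes are requested.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                p, n, n,
                &alpha,
                d_y, p,
                d_w, n,
                &beta,
                d_sal, p);

    std::vector<float> sal(n * p);
    cudaMemcpy(sal.data(), d_sal, sizeof(float) * n * p, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_w);
    cudaFree(d_y);
    cudaFree(d_sal);
    return sal;
}
```

The program has to be linked against the library (for example with -lcublas). Note that float results are exact only while the sums stay within the range of integers a float can represent, so the conversion may not suit every network.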
