Device function performance

Hi,

Could anybody explain how to optimize the following code:

__device__ void applyKxTranspose(float *zeta, float fy, int M, int N, float *result, int x, int y, int index)
{
	float zeta_index = zeta[index]/fy;
	if (x==0)
	{
		result[index] = zeta_index;
		return;
	}

	float sum = zeta_index;
	if (x==N-1)
		sum += -zeta_index;

	sum += -zeta[index-1]/fy;
	if (x==N-2)
		sum += zeta[index+1]/fy;

	result[index] = sum;
}

where zeta and result are 640x480 arrays, M = 640, N = 480, and x and y are computed from the current thread and block indices.

I guess shared memory might help, but I don’t have any experience with it.
Maybe there are other things that can be improved?
Could you give me any tips?
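In case it helps to check any optimized version against known-good output, here is a plain CPU reference of the same stencil (a sketch, not from the original post; it assumes row-major storage with index = y * N + x, and the function name is made up):

```cpp
#include <vector>

// CPU reference of the stencil in applyKxTranspose, for validating GPU results.
// Assumes row-major layout: index = y * N + x (an assumption, not stated in the post).
void applyKxTransposeRef(const std::vector<float>& zeta, float fy, int M, int N,
                         std::vector<float>& result)
{
    for (int y = 0; y < M; ++y) {
        for (int x = 0; x < N; ++x) {
            int index = y * N + x;
            float zi = zeta[index] / fy;
            if (x == 0) { result[index] = zi; continue; }
            float sum = zi;
            if (x == N - 1) sum -= zi;               // cancels the diagonal term on the last column
            sum -= zeta[index - 1] / fy;
            if (x == N - 2) sum += zeta[index + 1] / fy;
            result[index] = sum;
        }
    }
}
```

Running any candidate kernel and comparing element-wise against this loop makes it easy to tell an optimization bug from a speedup.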

// fy_rec = 1 / fy
// depending on where fy comes from, and the thread block size used, either calculate fy_rec once, prior to calling the function, or once (by a single thread) within the function, perhaps

float sum;

__shared__ float s_zeta[block_size];

s_zeta[threadIdx.x] = zeta[index];

__syncthreads(); // needed before reading neighbouring threads' elements

sum = s_zeta[threadIdx.x] * fy_rec;

if (x > 0)
{
   if (x == N-1)
   {
      sum = 0;
   }

   sum += -s_zeta[threadIdx.x - 1] * fy_rec;

   if (x == N-2)
   {
      sum += s_zeta[threadIdx.x + 1] * fy_rec;
   }
}

result[index] = sum;

you can probably push zeta[index] * fy_rec into shared, instead of merely zeta[index]
also, my thinking is that zeta would be cached upon the 1st read, hence shared memory can not truly speed up
zeta[index - 1] or zeta[index + 1] by that much
perhaps njuffa, scottgray, and others can deliberate

Thanks for your response!

I’ve tried this approach, but I get an access conflict between blocks for
s_zeta[threadIdx.x-1];
s_zeta[threadIdx.x+1]

Actually, I do not understand how it is possible to solve this type of error.

Moreover, if I add two more shared arrays for
zeta[threadIdx.x-1];
zeta[threadIdx.x+1]

it only makes the code slower.

if the block’s access spills over into adjacent blocks (x != threadIdx.x), then you would need additional reads for the adjacent blocks' elements
shared memory can not speed up those additional reads; shared memory can only allow reuse of that data
your data access seems to be mostly unique, such that there is little room for data reuse via shared memory

nonetheless, if you want to use shared memory, i suppose it would be something like:

__shared__ float s_zeta[block_size + 2];

s_zeta[threadIdx.x + 1] = zeta[index];

if (threadIdx.x == 0)
{
   if (x > 0)
   {
      s_zeta[0] = zeta[index - 1]; // left halo: element just before the block
   }

   if (x + block_size < N)
   {
      s_zeta[block_size + 1] = zeta[index + block_size]; // right halo: element just past the block
   }
}

__syncthreads();
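Putting the halo load and the compute step together, the whole device function might look like the sketch below. This is untested; BLOCK_SIZE, the row-major index = y * N + x, and blocks that tile each row completely (so x == blockIdx.x * BLOCK_SIZE + threadIdx.x within a row) are all assumptions:

```cuda
#define BLOCK_SIZE 256  // assumed 1D block size; must match the launch configuration

__device__ void applyKxTransposeShared(const float *zeta, float fy_rec, int N,
                                       float *result, int x, int index)
{
    __shared__ float s_zeta[BLOCK_SIZE + 2];  // one halo element on each side

    // element at global column x lives in s_zeta[threadIdx.x + 1]
    s_zeta[threadIdx.x + 1] = zeta[index];

    if (threadIdx.x == 0)
    {
        if (x > 0)                        // left halo: element before the block
            s_zeta[0] = zeta[index - 1];
        if (x + BLOCK_SIZE < N)           // right halo: element past the block
            s_zeta[BLOCK_SIZE + 1] = zeta[index + BLOCK_SIZE];
    }
    __syncthreads();                      // all loads must be visible before neighbour reads

    float zi = s_zeta[threadIdx.x + 1] * fy_rec;
    if (x == 0) { result[index] = zi; return; }

    float sum = zi;
    if (x == N - 1) sum -= zi;
    sum -= s_zeta[threadIdx.x] * fy_rec;                       // zeta[index - 1]
    if (x == N - 2) sum += s_zeta[threadIdx.x + 2] * fy_rec;   // zeta[index + 1]

    result[index] = sum;
}
```

Note the guard on the right halo: the thread with threadIdx.x == 0 loads the element one past the block's last thread, which only exists when x + BLOCK_SIZE < N.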

The code looks to be bound by memory bandwidth, so you would want to focus on that. Let the CUDA profiler guide you. Irregular access patterns, as they seem to exist here, are particularly bad for performance on the GPU when the data is stored in global memory.

You are not showing the entire kernel, so it is not clear where the data manipulated by this code is stored. Also, optimizing this code may be what is known as an “XY-problem”. The function name implies transposition. Every time I come across code that explicitly copies or transposes arrays or matrices, or computes matrix inverses, I suspect that there are better ways of accomplishing the top-level task. Whether that is the case here I cannot say, since there is not enough context.