Linear Interpolation

I have a large one-dimensional data array on the GPU with 1024*512 elements of datatype float. This array holds an image (width = 1024 pixels x height = 512 pixels).

The second array is a one-dimensional float array with 1024 elements; it contains 1024 coefficients. Now I want to combine the first row with the coefficient array.

Calculating the first row:

data_array[0] = (data_array[1] - data_array[0]) * coefficient_array[0] --> thread 0 of block 0

data_array[1] = (data_array[2] - data_array[1]) * coefficient_array[1] --> thread 1 of block 0

...

data_array[511] = (data_array[512] - data_array[511]) * coefficient_array[511] --> thread 511 of block 0

data_array[512] = (data_array[513] - data_array[512]) * coefficient_array[512] --> thread 0 of block 1

data_array[513] = (data_array[514] - data_array[513]) * coefficient_array[513] --> thread 1 of block 1

...

data_array[1022] = (data_array[1023] - data_array[1022]) * coefficient_array[1022] --> thread 510 of block 1

data_array[1023] = data_array[1023] --> thread 511 of block 1 (the last element is left unchanged!)

Here is my code:

dim3 dimGrid;

dim3 dimBlock;

dimBlock.x = 512;

dimBlock.y = 1;

dimGrid.x = 2;

dimGrid.y = 512;

kernel_linear_interpolation<<<dimGrid, dimBlock>>>(Data_In, Coefficient_Array, Number_of_Coefficients, Number_of_Rows);

__global__ void kernel_linear_interpolation(float *Data_In, float *Coefficient_Array, int Number_of_Coefficients, int Number_of_Rows)

{

int tidx = threadIdx.x + blockIdx.x*blockDim.x;

int tidy = threadIdx.y + blockIdx.y*blockDim.y;

if( tidx >= Number_of_Coefficients - 1 )    // data_array[1023] = data_array[1023] --> thread 511 of block 1. Is this instruction correct ???

{

	return;

}

if( tidy >= Number_of_Rows )

{

	return;

}

Data_In[tidy * Number_of_Coefficients + tidx] = (Data_In[tidx + 1] - Data_In[tidx]) * Coefficient_Array[tidx];    // I think here is my error



__syncthreads();

}

Can you be more specific as to the nature of the problem you’re seeing?

One problem is that your input and output arrays are the same, so some threads may attempt to update the values while other threads are using them. Use separate input and output arrays to avoid this problem.

Another potential problem: if you want interpolation, you should probably compute dataOut[x] = dataIn[x] + coef[x]*(dataIn[x+1] - dataIn[x]). Just taking the scaled delta, as your code does, gives you a sort of gradient instead.

Another potential problem is that you are not using tidy*Number_of_Coefficients on the input (right-hand side of the =), which means every row reads from row 0 and gets the same values, which is probably not what you want.

The final __syncthreads() at the end is unnecessary, but harmless.

Perhaps try something like this:

__global__ void kernel_linear_interpolation(float *Data_Out, float *Data_In, float *Coefficient_Array, int Number_of_Coefficients, int Number_of_Rows) {

	int tidx = threadIdx.x + blockIdx.x*blockDim.x;

	int tidy = threadIdx.y + blockIdx.y*blockDim.y;

	if( tidy >= Number_of_Rows ) {

		return;

	}

	int w = Number_of_Coefficients;

	if( tidx >= w) {

		return;

	}

	else if (tidx == w - 1) {

		Data_Out[tidy * w + tidx] = Data_In[tidy * w + tidx];

	}

	else {

		Data_Out[tidy * w + tidx] = Data_In[tidy * w + tidx] + (Data_In[tidy * w + tidx + 1] - Data_In[tidy * w + tidx]) * Coefficient_Array[tidx];

	}

}

Thank you for your help

I will test your code.

You may also try using texture linear filtering, which returns tex1D(x) = T[i] * (1 - a) + T[i+1] * a,
where i is the integer part of x and a is the fractional part (see the CUDA programming guide appendix).

This should be advantageous because textures are cached and the linear interpolation is performed in hardware.

Thank you for the tip. At the moment my speed is fast enough, but I will try it. I will look in the manual for an example.