Unexpected algorithm behaviour

I have some lines of code that give a non expected output, and I do not get to understand why this unexpected behaviour occurs. Maybe someone can try to help me uderstand.

Simplified version of the code:

// In another function

dim3 gridSize, gridSize2;
dim3 blockSize, blockSize2;

blockSize.x = 1;                      //Threads per block
blockSize.y = frameHeightP;

gridSize.x = frameWidth;	      // Blocks per grid
gridSize.y = 1;

__global__ void MyKern(//inputs)

    int deltaxSamp;
    int x = blockIdx.x;     // each col
    int y = threadIdx.y;	// each row

    for (int channel_i = 0; channel_i<frameWidth; channel_i++) //each elem
         deltaxSamp = abs(channel_i - x);
         if (x < frameWidth && y < frameHeightP) 
              if( condition1 involves deltaxSamp)
                          tempOutputMatrix[y + (channel_i*frameHeightP)] =  
                          tempOutputMatrix[y + (channel_i*frameHeightP)] + 
                          inputMatrix[delayIndexD[  (channel_i*frameWidth*frameHeightP) + (y*frameWidth) + x ] + x*frameHeightP ] ;

The important line of the code is
“tempOutputMatrix[y + (channel_iframeHeightP)] = tempOutputMatrix[y + (channel_iframeHeightP)] + inputMatrix[delayIndexD[ (channel_iframeWidthframeHeightP) + (yframeWidth) + x ] + xframeHeightP ];”

I would expect the output (a matrix/image) to be STEADY, instead I get TINTLING data. Why?

it may be due to poor synchronization or poor element mapping/ offsetting

it is difficult to conclude these possibilities, as you provide little background information

you seem to have/ use frames and channels…?
could you state in words what the intent of the kernel is?

int x = blockIdx.x; // each col
int y = threadIdx.y; // each row

for (int channel_i = 0; channel_i<frameWidth; channel_i++) //each elem

this seems wrong from a work distribution point of view
but again, it is rather difficult to follow what constitutes work, and how you distribute your work among thread blocks and threads

Yes one of my hypotheses is that poor synchronization could be the problem; what I thought is that the for loop variable (channel_i) might be running at a different pace than x and y (cuda thread variables), but this is very hard to check as even the nvidia debugger does not follow the variables as one would like. But I am not sure this explanation makes sense.

it is because channel_i and others are declared as local variables; local variables are as difficult to track/ check as global variables/ data, in my mind, for a number of reasons

if int x = blockIdx.x; relates to blockSize.x = 1; then i would interpret it as one col, not each col

tempOutputMatrix[y + (channel_iframeHeightP)] =
tempOutputMatrix[y + (channel_i
frameHeightP)] +
inputMatrix[delayIndexD[ (channel_iframeWidthframeHeightP) + (yframeWidth) + x ] + xframeHeightP ] ;

at first glance, it seems that this hardly requires synchronization, as the input and output arrays differ, and the output does not seem to depend on adjacent inputs, meaning writes (output) preceding reads (input) are mostly prevented
but then, i can not place delayIndexD
and the notion of a loop may overthrow this, depending on how the loop progresses

In here I attach a schematic with the different steps, maybe it helps you understand.

As you will see DelayIndexes is a matrix with calculated indexes that will indicate which positions of InputMatrix to access later on.

the schematic is clear

it does not seem like a synchronization problem to prevent writes before reads, as separate thread blocks seem to write separate rows of the output matrix, in a rather independent manner

i presume you have initialized the output array tempOutputMatrix properly, given that you are accumulating (+=)

also, are you sure about the indexing of inputMatrix and delayIndexD?
[ (channel_iframeWidthframeHeightP) + (yframeWidth) + x ] + xframeHeightP ];

otherwise, i do not see fault, and i would direct my attention towards the 2 conditions: condition1; condition2

Yes the initializations are all done.

I already tried on commenting the conditions but it does not really change the tintling part.

Thanks for the help anyway! :)

remembered a few other points:

are you confident that the values are in the arrays (matrices) by the time that the kernel starts accessing them?

a key debug point may actually be as the kernel starts writing to the output matrix
you could easily:
a) reduce the grid dimension such that only 1 row of the output matrix is written, instead of all, as a way to i) improve the ease of debugging, ii) serve as a confirmation
this should be very easy to do
b) temporarily dump the result written out to shared memory, in order to be able to follow the progression; shared memory is easy to track, and you should be able to see how the output develops and when the output goes wrong, and then perhaps better understand why
use a breakpoint and a __syncthreads() to allow the complete shared array to fill

I am confident that the data being used by the kernel is right and arriving at the right time.

What I did up to now is copying the output from the gpu to the cpu and then to a .txt file, then I read in with Matlab to see the data (outputMatrix).

Maybe it is also good to use the shared memory as you indicate, as then you can track the progression “real time”, I will try it.

It is indeed a good idea to take smaller samples/grids to analyze what is going on, up to now I tried to work with smaller samples/few columns, but the same thing happens…

At the end what happened is that I had a racing condition going on, so threads were overwriting itselves.
This was solved using atomic add operations.

The performance was not heavily affected by doing that.