Unexpected algorithm behaviour

iat88 · March 2, 2015, 10:19am

I have some lines of code that give an unexpected output, and I cannot work out why this behaviour occurs. Maybe someone can help me understand.

Simplified version of the code:

// In another function

dim3 gridSize, gridSize2;
dim3 blockSize, blockSize2;

blockSize.x = 1;                      //Threads per block
blockSize.y = frameHeightP;

gridSize.x = frameWidth;	      // Blocks per grid
gridSize.y = 1;

__global__ void MyKern(/* inputs */)
{
    int deltaxSamp;

    int x = blockIdx.x;     // each col
    int y = threadIdx.y;    // each row

    for (int channel_i = 0; channel_i < frameWidth; channel_i++) // each elem
    {
        deltaxSamp = abs(channel_i - x);
        if (x < frameWidth && y < frameHeightP)
        {
            if (condition1 involves deltaxSamp)
            {
                if (condition2)
                {
                    tempOutputMatrix[y + (channel_i*frameHeightP)] =
                        tempOutputMatrix[y + (channel_i*frameHeightP)] +
                        inputMatrix[delayIndexD[(channel_i*frameWidth*frameHeightP) + (y*frameWidth) + x] + x*frameHeightP];
                }
            }
        }
    }
}

The important line of the code is
“tempOutputMatrix[y + (channel_i*frameHeightP)] = tempOutputMatrix[y + (channel_i*frameHeightP)] + inputMatrix[delayIndexD[ (channel_i*frameWidth*frameHeightP) + (y*frameWidth) + x ] + x*frameHeightP ];”

I would expect the output (a matrix/image) to be STEADY, instead I get TINTLING data. Why?

little_jimmy · March 2, 2015, 11:48am

it may be due to poor synchronization or poor element mapping/ offsetting

it is difficult to conclude these possibilities, as you provide little background information

you seem to have/ use frames and channels…?
could you state in words what the intent of the kernel is?

int x = blockIdx.x; // each col
int y = threadIdx.y; // each row

for (int channel_i = 0; channel_i<frameWidth; channel_i++) //each elem

this seems wrong from a work distribution point of view
but again, it is rather difficult to follow what constitutes work, and how you distribute your work among thread blocks and threads

iat88 · March 3, 2015, 7:23am

Yes one of my hypotheses is that poor synchronization could be the problem; what I thought is that the for loop variable (channel_i) might be running at a different pace than x and y (cuda thread variables), but this is very hard to check as even the nvidia debugger does not follow the variables as one would like. But I am not sure this explanation makes sense.

little_jimmy · March 3, 2015, 8:37am

it is because channel_i and others are declared as local variables; local variables are as difficult to track/ check as global variables/ data, in my mind, for a number of reasons

if int x = blockIdx.x; relates to blockSize.x = 1; then i would interpret it as one col, not each col

tempOutputMatrix[y + (channel_i*frameHeightP)] =
tempOutputMatrix[y + (channel_i*frameHeightP)] +
inputMatrix[delayIndexD[ (channel_i*frameWidth*frameHeightP) + (y*frameWidth) + x ] + x*frameHeightP ] ;

at first glance, it seems that this hardly requires synchronization, as the input and output arrays differ, and the output does not seem to depend on adjacent inputs, meaning writes (output) preceding reads (input) are mostly prevented
but then, i can not place delayIndexD
and the notion of a loop may overthrow this, depending on how the loop progresses

iat88 · March 3, 2015, 9:37am

In here I attach a schematic with the different steps, maybe it helps you understand.

As you will see DelayIndexes is a matrix with calculated indexes that will indicate which positions of InputMatrix to access later on.

External Media

little_jimmy · March 3, 2015, 10:30am

the schematic is clear

it does not seem like a synchronization problem to prevent writes before reads, as separate thread blocks seem to write separate rows of the output matrix, in a rather independent manner

i presume you have initialized the output array tempOutputMatrix properly, given that you are accumulating (+=)
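As a minimal sketch of the initialization point above: memory returned by cudaMalloc is not cleared, so an accumulator has to be zeroed before the kernel adds into it. The helper name zeroedBufferIsClean and the float element type are assumptions for illustration; the thread does not state the element type.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper, not from the thread: allocates n floats on the device,
// clears them, and verifies on the host that every element reads back as 0.
bool zeroedBufferIsClean(size_t n)
{
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return false;

    // cudaMalloc leaves the contents undefined; an all-zero byte pattern is 0.0f.
    bool ok = cudaMemset(d_out, 0, n * sizeof(float)) == cudaSuccess;

    // Pre-fill the host copy with a non-zero value so a failed copy is visible.
    std::vector<float> h_out(n, -1.0f);
    ok = ok && cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);

    for (float v : h_out) ok = ok && (v == 0.0f);
    return ok;
}
```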

also, are you sure about the indexing of inputMatrix and delayIndexD?
[ (channel_i*frameWidth*frameHeightP) + (y*frameWidth) + x ] + x*frameHeightP ];

otherwise, i do not see fault, and i would direct my attention towards the 2 conditions: condition1; condition2

iat88 · March 3, 2015, 11:40am

Yes the initializations are all done.

I already tried commenting out the conditions, but it does not really change the tintling part.

Thanks for the help anyway! :)

little_jimmy · March 3, 2015, 12:36pm

remembered a few other points:

are you confident that the values are in the arrays (matrices) by the time that the kernel starts accessing them?

a key debug point may actually be as the kernel starts writing to the output matrix
you could easily:
a) reduce the grid dimension such that only 1 row of the output matrix is written, instead of all, as a way to i) improve the ease of debugging, ii) serve as a confirmation
this should be very easy to do
b) temporarily dump the result written out to shared memory, in order to be able to follow the progression; shared memory is easy to track, and you should be able to see how the output develops and when the output goes wrong, and then perhaps better understand why
use a breakpoint and a __syncthreads() to allow the complete shared array to fill
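One possible reading of suggestion (a) is to launch a single block first and use its result as a baseline. The sketch below uses a stand-in kernel (accumulatePlain) that keeps only the launch shape and the unprotected accumulation from the original post; the conditions and the delayIndexD lookup are left out, and the helper minAfterLaunch and the float element type are assumptions.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stand-in for the thread's kernel: each block's thread y adds 1 to every
// column c of row y. The unprotected read-add-write mirrors the original line.
__global__ void accumulatePlain(float *out, int width, int height)
{
    int y = threadIdx.y;
    if (y >= height) return;
    for (int c = 0; c < width; ++c)
        out[y + c * height] = out[y + c * height] + 1.0f;
}

// Hypothetical helper: runs the kernel with `blocks` blocks and returns the
// smallest accumulated value, or -1 on a CUDA error.
float minAfterLaunch(int blocks, int width, int height)
{
    size_t n = (size_t)width * height;
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_out, 0, n * sizeof(float));

    dim3 block(1, height);   // same shape as in the original post
    dim3 grid(blocks, 1);
    accumulatePlain<<<grid, block>>>(d_out, width, height);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<float> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1.0f;

    float lo = h[0];
    for (float v : h) lo = (v < lo) ? v : lo;
    return lo;
}
```

With blocks = 1, each output element is touched by exactly one thread, so every element ends up as 1. With blocks = width, a race-free kernel would leave width in every element; a smaller minimum would point to updates being lost between blocks.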

iat88 · March 3, 2015, 12:54pm

I am confident that the data being used by the kernel is right and arriving at the right time.

What I have done so far is copy the output from the GPU to the CPU and then to a .txt file, which I then read in with Matlab to see the data (outputMatrix).
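The copy-and-dump step described above might look roughly like this. The helper name dumpToText, the float element type, and the column-major layout (taken from the y + channel_i*frameHeightP output indexing) are assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical helper: copies a height x width float matrix stored
// column-major on the device (element (y, c) at y + c*height) to the host
// and writes it as text, one matrix row per line, values space-separated.
bool dumpToText(const float *d_data, int width, int height, const char *path)
{
    std::vector<float> h((size_t)width * height);
    if (cudaMemcpy(h.data(), d_data, h.size() * sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        return false;

    FILE *f = std::fopen(path, "w");
    if (!f) return false;
    for (int y = 0; y < height; ++y) {
        for (int c = 0; c < width; ++c)
            std::fprintf(f, "%g%c", h[y + (size_t)c * height],
                         c + 1 < width ? ' ' : '\n');
    }
    return std::fclose(f) == 0;
}
```

A whitespace-delimited file of this shape can be read as a numeric matrix in Matlab. Dumping the output of two identical launches and comparing the files is one way to confirm whether the result changes from run to run.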

Maybe it is also good to use the shared memory as you indicate, as then you can track the progression “real time”, I will try it.

It is indeed a good idea to take smaller samples/grids to analyze what is going on, up to now I tried to work with smaller samples/few columns, but the same thing happens…

iat88 · March 20, 2015, 10:16am

In the end, it turned out that I had a race condition: threads were overwriting each other's results.
This was solved using atomic add operations.

The performance was not heavily affected by doing that.
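For readers landing here, a minimal sketch of what such a fix could look like, not the poster's actual code. In the original kernel every block x writes to the same element tempOutputMatrix[y + channel_i*frameHeightP], so the plain read-add-write can lose updates between blocks. The kernel below keeps the launch shape from the first post but simplifies the input indexing (no delayIndexD lookup, no conditions); the names accumulateAtomic and atomicSumMatchesExpected and the float element type are assumptions.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void accumulateAtomic(float *out, const float *in,
                                 int width, int height)
{
    int x = blockIdx.x;    // one block per input column
    int y = threadIdx.y;   // one thread per row
    if (x >= width || y >= height) return;

    for (int c = 0; c < width; ++c) {
        // All `width` blocks add into the same out[y + c*height]; atomicAdd
        // makes each read-modify-write indivisible, so no update is lost.
        atomicAdd(&out[y + c * height], in[y + x * height]);
    }
}

// Hypothetical check: with an all-ones input, every output element should
// end up equal to `width` (one contribution from each block).
bool atomicSumMatchesExpected(int width, int height)
{
    size_t n = (size_t)width * height;
    size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n, 1.0f), h_out(n);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);   // the accumulator must start at zero

    accumulateAtomic<<<dim3(width, 1), dim3(1, height)>>>(d_out, d_in, width, height);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (float v : h_out)
        if (v != (float)width) return false;
    return true;
}
```

Note that floating-point atomicAdd does not fix the order of the additions, so with non-trivial inputs the last bits of the sum can still differ between runs even though no contribution is dropped.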
