Issue with Jetson Nano memory

NVIDIA Jetson Nano 4GB, 4.9.140-tegra, CUDA 10.2 through 11.3

Hi, I am making a droplet animation while learning CUDA programming, so I am a noob.
I am using OpenGLUT in a bash terminal environment. The kernel function is:

__global__ void doeding(int size, float *a, float *b, float *c, float *gloval)
{
    int index  = threadIdx.x;
    int stride = blockDim.x;
    float xidx = 0.0;
    float yidx = 0.0;
    size = ((500 + 2) * (500 + 2));      // overrides the size argument
    for (int lidx = index; lidx < size; lidx += stride)
    {
        xidx = (float)(lidx % 502);
        yidx = (float)(lidx / 502);
        xidx = (xidx == 501) ? 500 : xidx;   // clamp border indices to the interior
        xidx = (xidx == 0)   ? 1   : xidx;
        yidx = (yidx == 501) ? 500 : yidx;
        yidx = (yidx == 0)   ? 1   : yidx;

        c[(int)((yidx * (500 + 2)) + xidx)] = gloval[0] * ( gloval[1] * (
                                              b[(int)((yidx * (500 + 2)) + (xidx - 1))]
                                           +  b[(int)((yidx * (500 + 2)) + (xidx + 1))]
                                           +  b[(int)(((yidx - 1) * (500 + 2)) + xidx)]
                                           +  b[(int)(((yidx + 1) * (500 + 2)) + xidx)])
                                           -  a[(int)((yidx * (500 + 2)) + xidx)]);
    }

    for (int lidx = 1; lidx < size; lidx++)
    {
        a[lidx] = b[lidx];
        b[lidx] = c[lidx];
    }
}

I start with one pixel at [(250 * 502) + 251] = 1.0;
When I call the __global__ function with <<<1, 1>>> or <<<1, 32>>>, everything is fine.
However, as soon as I raise the thread count in steps of 32, for instance <<<1, 64>>>, I get a disturbance at the bottom of the screen:


The bottom should look like the top, left, or right side of the circle.
When I change the number of threads, the same thing happens; the only difference is the speed at which the disturbance manifests itself.
What happens in memory when you change the block size?
How can I prevent this disturbance while still using all the threads and larger block sizes?

Thanks in advance, DO_Ray

Your code expects/requires all of the updates to c to happen:

  c[(int)((yidx * (500 + 2)) + xidx)] = ...

before any of the updates to a and b:

for (int lidx = 1; lidx < size; lidx++)
  {
        a[lidx] = b[lidx];
        b[lidx] = c[lidx];
  }

With a kernel launch of <<<1,1>>> or <<<1,32>>> that is likely to happen, due to the nature of warp execution as well as your use of a grid-stride loop in the first kernel loop. However, when you go beyond one warp, e.g. <<<1,64>>>, it is no longer likely to happen.

For this trivial case, <<<1, XYZ>>>, you can probably “fix” this by inserting a __syncthreads() statement between your two kernel loops, i.e.:

__syncthreads();     // add this line here
for (int lidx = 1; lidx < size; lidx++)
  {
        a[lidx] = b[lidx];
        b[lidx] = c[lidx];
  }
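
For context, __syncthreads() is a barrier for the threads of one block: no thread moves past it until every thread of its block has arrived. Here is a compressed, hypothetical sketch of the two-phase pattern it enables (the names and the stand-in update are illustrative, not your exact stencil):

__global__ void two_phase(int n, const float *in, float *tmp, float *out)
{
    // phase 1: each thread produces its share of tmp
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        tmp[i] = 0.5f * in[i];            // stand-in for the stencil update

    __syncthreads();                      // no thread continues until the whole
                                          // block has finished phase 1

    // phase 2: every element of tmp is now safe to read
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = tmp[i];
}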

However, that will no longer be sufficient if you go to a multi-block kernel launch, <<<W, XYZ>>> with W > 1, because __syncthreads() only synchronizes the threads of a single block.

To understand this fully, as well as to come up with strategies for addressing the multi-block case, you’ll need to understand the fundamental CUDA concept that the CUDA execution model provides no guaranteed order of thread execution unless you as a programmer take explicit steps to order thread execution.

To apply this knowledge, pretend that the CUDA thread whose threadIdx.x value is 0 executes all of your code before any other thread begins. Then the thread whose threadIdx.x is 1, then 2, and so on. You should be able to grasp at that point that your code is broken without additional design features.
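
One standard strategy for the multi-block case is to let the kernel launch boundary do the grid-wide ordering for you: end the kernel after the update to c, and perform the a/b rotation in a second kernel. Launches issued to the same stream execute in order, so all blocks of the first kernel finish before the second begins. A hedged sketch along those lines (the kernel names and border handling are mine, not your code; your clamping just repeats the neighboring interior update at the borders, so the sketch skips the border ring instead):

__global__ void stencil_step(int size, const float *a, const float *b, float *c, const float *gloval)
{
    // grid-stride loop over all blocks, not just one
    int idx    = (blockIdx.x * blockDim.x) + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int lidx = idx; lidx < size; lidx += stride)
    {
        int x = lidx % 502;
        int y = lidx / 502;
        if (x == 0 || x == 501 || y == 0 || y == 501) continue;  // skip border ring
        c[lidx] = gloval[0] * (gloval[1] *
                      (b[lidx - 1] + b[lidx + 1] + b[lidx - 502] + b[lidx + 502])
                  - a[lidx]);
    }
}

__global__ void rotate_buffers(int size, float *a, float *b, const float *c)
{
    int idx    = (blockIdx.x * blockDim.x) + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int lidx = idx; lidx < size; lidx += stride)
    {
        a[lidx] = b[lidx];
        b[lidx] = c[lidx];
    }
}

Called from the host as stencil_step<<<W, XYZ>>>(...); followed by rotate_buffers<<<W, XYZ>>>(...); per animation step. (You could also avoid the copy kernel entirely by cycling the three pointers on the host between launches.)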

As an aside, your first kernel loop has a proper grid-stride design, so that it can handle a flexible grid size. Your second loop does not, and it is probably not doing what you want anyway in the case of more than one thread: every thread runs the whole loop, stepping on the other threads' updates. I think you should convert your second loop to grid-stride as well.
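
A minimal sketch of that conversion, assuming the single-block case so that __syncthreads() is a sufficient barrier (it reuses your existing index and stride variables):

__syncthreads();                          // every write to c is complete first
for (int lidx = index; lidx < size; lidx += stride)
{
    a[lidx] = b[lidx];                    // each lidx is handled by exactly one thread
    b[lidx] = c[lidx];
}

Note this starts at index rather than 1, so element 0 is also rotated; with the grid-stride form no element is touched by more than one thread, which removes the write races your current second loop has whenever more than one thread runs it.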

It’s also good practice to properly format your code when posting here. You can edit your posting: select all the code, click the </> button, then save; that is one simple approach to get better code formatting.


Sorry, that was an editor error; I did not mean to repost the code.
Anyway, thank you for your extensive reply.