Dealing with 3D grids for CFD simulations: strange behaviour of nested loops in CUDA

Hello,

This post might get a bit long, as I want to present the development that led to some strange behaviour of nested loops in CUDA. I hope this does not discourage my readers. For a quick summary, please refer to the last piece of code and the text below it.

I’m currently trying to set up some CFD (computational fluid dynamics) simulations in CUDA. For that I need 3D grids - which is not a big problem. Additionally, I intend to write my kernels in such a way that it is possible to have fewer threads than grid elements, as this enables grids of arbitrary size and may improve performance. So each thread must be able to deal with several grid elements. The most obvious way to do this is to have three nested for loops jumping through the computational grid in multiples of the CUDA grid:

struct Buffer
{
    Real    *ptr;           /* Real is the floating point type I use */
    int2    pitchedSize;
};

#define IDX(i,j,k,buf) \
        (((k) * (buf).pitchedSize.y + (j)) * (buf).pitchedSize.x + (i))

struct Grid
{
    int3    nc,     /* number of cells (including ghost cells) */
            ng;     /* number of ghost cells */
};

__constant__ static struct Grid grid_;

__global__ void checkGrid( Buffer r, dim3 realGridDim )
{
    int i, j, k, bz, nijk;

    /* reconstruct blockIdx.z from the flattened y block index */
    bz = blockIdx.y / realGridDim.y;

    for( k = blockDim.z * bz + threadIdx.z;
            k < grid_.nc.z; k += blockDim.z * realGridDim.z )
        for( j = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y;
                j < grid_.nc.y; j += blockDim.y * realGridDim.y )
            for( i = blockDim.x * blockIdx.x + threadIdx.x;
                    i < grid_.nc.x; i += blockDim.x * gridDim.x )
            {
                nijk = IDX( i, j, k, r );
                r.ptr[nijk] = 42.0;
            }
}

When executing this kernel I first choose a 3D block layout (bx,by,bz) and then launch the kernel with (bx, by*bz, 1) blocks, passing the real block layout as an argument to the kernel. This allows blockIdx.z (‘bz’ in my example) to be reconstructed. To pass the grid variable ‘r’ to the kernel I use my own kind of pitched pointer, which is specific to the datatype I use, so I can easily access the ptr field with an array index. The actual computational grid (as opposed to the CUDA grid) is defined in the constant ‘grid_’ variable.
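For illustration, the host-side launch looks roughly like this (the concrete sizes are just placeholders, not the numbers I actually use):

/* Host-side launch sketch: the desired 3D layout of blocks is flattened to
   a 2D CUDA grid of (gx, gy*gz) blocks; realGridDim is passed so the kernel
   can reconstruct blockIdx.z. 'r' is a Buffer whose ptr points to device
   memory, and grid_ is assumed to have been filled via cudaMemcpyToSymbol
   beforehand (allocation and copies not shown). */
dim3 block( 16, 4, 4 );                         /* threads per block (bx,by,bz) */
dim3 realGridDim( 8, 8, 4 );                    /* desired 3D layout of blocks  */
dim3 launchGrid( realGridDim.x,
                 realGridDim.y * realGridDim.z, 1 );

checkGrid<<< launchGrid, block >>>( r, realGridDim );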

So far everything works fine - ‘r’ is filled with 42.0. But now I would like to introduce a ghost layer, meaning that the boundary layer of the grid is treated differently. In this example it is simply omitted from the assignment. The thickness of the ghost layer is defined in the ng field of ‘grid_’. So first let’s omit the lower boundaries:

for( k = blockDim.z * bz + threadIdx.z;
        k < grid_.nc.z; k += blockDim.z * realGridDim.z )
    for( j = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y;
            j < grid_.nc.y; j += blockDim.y * realGridDim.y )
        for( i = blockDim.x * blockIdx.x + threadIdx.x;
                i < grid_.nc.x; i += blockDim.x * gridDim.x )
            if( (i >= grid_.ng.x) && (j >= grid_.ng.y) && (k >= grid_.ng.z) )
            {
                nijk = IDX( i, j, k, r );
                r.ptr[nijk] = 42.0;
            }

I could have started the for loops with a displacement instead, but I chose an additional if in the loop body, as I expect this to give better performance due to aligned memory access. Still everything works fine.
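For comparison, the displacement variant I decided against would look roughly like this (just a sketch, I did not actually benchmark it):

/* Rejected alternative: offset the loop starts by the ghost layer thickness
   instead of filtering with an if. */
for( k = blockDim.z * bz + threadIdx.z + grid_.ng.z;
        k < grid_.nc.z; k += blockDim.z * realGridDim.z )
    for( j = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y + grid_.ng.y;
            j < grid_.nc.y; j += blockDim.y * realGridDim.y )
        for( i = blockDim.x * blockIdx.x + threadIdx.x + grid_.ng.x;
                i < grid_.nc.x; i += blockDim.x * gridDim.x )
        {
            /* the first element each thread touches is no longer aligned to
               the block boundary, hence my preference for the if variant */
            nijk = IDX( i, j, k, r );
            r.ptr[nijk] = 42.0;
        }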

Now I would also like to omit the upper boundaries:

for( k = blockDim.z * bz + threadIdx.z;
        k < grid_.nc.z-grid_.ng.z; k += blockDim.z * realGridDim.z )
    for( j = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y;
            j < grid_.nc.y-grid_.ng.y; j += blockDim.y * realGridDim.y )
        for( i = blockDim.x * blockIdx.x + threadIdx.x;
                i < grid_.nc.x-grid_.ng.x; i += blockDim.x * gridDim.x )
            if( (i >= grid_.ng.x) && (j >= grid_.ng.y) && (k >= grid_.ng.z) )
            {
                nijk = IDX( i, j, k, r );
                r.ptr[nijk] = 42.0;
            }

Here I just modified the upper loop boundaries. In this case the loop body is never executed (or, more correctly: ‘r’ is not filled with 42.0 - I cannot really check in any other way whether the loop body gets executed). Modifying just one loop boundary gives the expected result, but as soon as I modify more than one loop boundary the loop body is not executed.

I also tried to leave the loop boundaries unmodified and add additional conditions to the if:

for( k = blockDim.z * bz + threadIdx.z;
        k < grid_.nc.z; k += blockDim.z * realGridDim.z )
    for( j = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y;
            j < grid_.nc.y; j += blockDim.y * realGridDim.y )
        for( i = blockDim.x * blockIdx.x + threadIdx.x;
                i < grid_.nc.x; i += blockDim.x * gridDim.x )
            if( (i >= grid_.ng.x) && (i < grid_.nc.x-grid_.ng.x)
                && (j >= grid_.ng.y) && (j < grid_.nc.y-grid_.ng.y)
                && (k >= grid_.ng.z) && (k < grid_.nc.z-grid_.ng.z) )
            {
                nijk = IDX( i, j, k, r );
                r.ptr[nijk] = 42.0;
            }

This works as expected, but the upper grid boundaries are checked twice, which is somewhat unaesthetic.
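For reference, the combined variant I try next - trimmed upper loop bounds plus the full set of if checks, i.e. essentially the last two snippets merged - looks like this:

for( k = blockDim.z * bz + threadIdx.z;
        k < grid_.nc.z-grid_.ng.z; k += blockDim.z * realGridDim.z )
    for( j = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y;
            j < grid_.nc.y-grid_.ng.y; j += blockDim.y * realGridDim.y )
        for( i = blockDim.x * blockIdx.x + threadIdx.x;
                i < grid_.nc.x-grid_.ng.x; i += blockDim.x * gridDim.x )
            if( (i >= grid_.ng.x) && (i < grid_.nc.x-grid_.ng.x)
                && (j >= grid_.ng.y) && (j < grid_.nc.y-grid_.ng.y)
                && (k >= grid_.ng.z) && (k < grid_.nc.z-grid_.ng.z) )
            {
                /* upper bounds are now enforced twice: in the loop condition
                   and again in the if */
                nijk = IDX( i, j, k, r );
                r.ptr[nijk] = 42.0;
            }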

With that combination (modified upper loop boundaries plus all checks in the if) everything still worked fine. So I started removing the upper boundary checks from the if statement again. The result: once the upper loop bounds are modified, at least one upper boundary check has to remain in the if statement for the if body to be executed. The actual value of that upper bound does not matter at all. So the following code still works:

for( k = blockDim.z * bz + threadIdx.z;
        k < grid_.nc.z-grid_.ng.z; k += blockDim.z * realGridDim.z )
    for( j = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y;
            j < grid_.nc.y-grid_.ng.y; j += blockDim.y * realGridDim.y )
        for( i = blockDim.x * blockIdx.x + threadIdx.x;
                i < grid_.nc.x-grid_.ng.x; i += blockDim.x * gridDim.x )
            if( (i >= grid_.ng.x) && (i < INT_MAX)
                && (j >= grid_.ng.y) && (k >= grid_.ng.z) )
            {
                nijk = IDX( i, j, k, r );
                r.ptr[nijk] = 42.0;
            }

In this case the if body gets executed - but as soon as I remove (i < INT_MAX) ‘r’ is left unmodified.

To me this behaviour looks very much like a bug (it would be the second one I have found in two days). But I’m open to being proven wrong.

Regards,

enuhtac

Update:

Whether the loop body gets executed is not only a matter of the exit conditions but also depends on the contents of the loop body. The simple example above gets executed, but as soon as I do some more complex calculations inside the loop body, the body is skipped.