problems with nested loop, again

Hello,

sorry for my third post on this topic. The first one resulted from a really stupid mistake I made, so I deleted it. But then another problem arose from the same piece of code, which I presented in my last post. Strangely, that post wasn’t viewed even once, so I’m posting the same topic again, this time with a different title (maybe “3d grids for cfd” discourages many people) and with far fewer useless details.

I have the following test kernel that should fill the 3D variable ‘r’ (of size ‘nc’) with the value 42.0. Only the interior of the box is to be filled; the boundary layer (defined by ‘ng’) is to be omitted:

__global__ void loopTest( Buffer r, int3 nc, int3 ng, dim3 realGridDim )
{
    int i, j, k, bz, nijk;

    bz = blockIdx.y / realGridDim.y;

    for( k = blockDim.z * bz + threadIdx.z;
            k < nc.z - ng.z; k += blockDim.z * realGridDim.z )
        for( j = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y;
                j < nc.y - ng.y; j += blockDim.y * realGridDim.y )
            for( i = blockDim.x * blockIdx.x + threadIdx.x;
                    i < nc.x - ng.x; i += blockDim.x * gridDim.x )
                if( (i >= ng.x) && (j >= ng.y) && (k >= ng.z) )
                {
                    nijk = IDX( i, j, k, r.pitchedSize );
                    r.ptr[nijk] = 42.0;
                }
}

The struct ‘Buffer’ is my own type of pitched pointer, and the macro IDX turns the three indices i, j, k into a single 1D index for accessing the memory.

If I could assume a 1:1 mapping of threads to array elements, I could of course omit the three nested loops. But this code is designed to work even if there are fewer threads than array elements.
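For comparison, the 1D version of this pattern - which the loops above are meant to generalize to 3D - would look roughly like this (just a sketch, not my actual code):

__global__ void fill1D( Real *r, int n, Real value )
{
    // each thread starts at its global index and then strides by the total
    // number of threads in the grid, so any grid size covers all n elements
    for( int i = blockDim.x * blockIdx.x + threadIdx.x;
            i < n; i += blockDim.x * gridDim.x )
        r[i] = value;
}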

These three nested loops are the cause of a bunch of strange phenomena. First of all, the loop body often does not get executed at all. Whether it gets executed or not depends on the actual code of the loop body and on the number and layout of threads. In some configurations the kernel call even stops with an ‘unspecified launch failure’. I have absolutely no idea what’s going on here.

I found a quite similar but very old thread in this forum: The Official NVIDIA Forums | NVIDIA. It also deals with an unspecified launch failure in a nested loop setting. But the answer is obviously not applicable to my case as I’m neither using Windows nor does this kernel take longer than 5 seconds to execute.

I could modify the kernel to work more reliably (meaning I could not find a configuration that broke its correct execution):

__global__ void loopTest( Buffer r, int3 nc, int3 ng, dim3 realGridDim )
{
    int i, j, k, nijk,
        bz = blockIdx.y / realGridDim.y,
        iStart = blockDim.x * blockIdx.x + threadIdx.x,
        jStart = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y,
        kStart = blockDim.z * bz + threadIdx.z;
    __shared__ int iEnd, jEnd, kEnd, iSkip, jSkip, kSkip;

    if( (threadIdx.x == 0) && (threadIdx.y == 0) && (threadIdx.z == 0) )
    {
        iEnd = nc.x - ng.x,
        jEnd = nc.y - ng.y,
        kEnd = nc.z - ng.z,
        iSkip = blockDim.x * gridDim.x,
        jSkip = blockDim.y * realGridDim.y,
        kSkip = blockDim.z * realGridDim.z;
    }
    __syncthreads();

    for( k = kStart; k < kEnd; k += kSkip )
        for( j = jStart; j < jEnd; j += jSkip )
            for( i = iStart; i < iEnd; i += iSkip )
                if( (i >= ng.x) && (j >= ng.y) && (k >= ng.z) )
                {
                    nijk = IDX( i, j, k, r.pitchedSize );
                    r.ptr[nijk] = 42.0;
                }
}

All I did was move the calculations out of the loop headers into separate variables. I had not done so before in order to save registers - and since I do not expect the loops to be executed many times, I thought the recalculation would not be too costly. Again I’m using shared memory to save registers - but maybe that’s not a good idea; only timing tests could show this.
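If I get around to timing it, I would use the event API, roughly like this (just a sketch, untested; the launch line stands for whatever configuration and arguments the real call uses):

cudaEvent_t start, stop;
float       ms;

cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
loopTest<<< blocks, threads >>>( r, nc, ng, realGridDim );  // the launch to be timed
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

cudaEventElapsedTime( &ms, start, stop );
printf( "loopTest: %f ms\n", ms );

cudaEventDestroy( start );
cudaEventDestroy( stop );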

But still the question remains: why does the first code example behave so strangely? Can I expect the second example to work reliably, or do I have to account for situations in which the second example fails as well?

I’m hoping for some answers :-)

Regards,

enuhtac

update:

I confirmed that the second example also does not work reliably. That’s a pity… All I can do is omit the loops completely. But that in turn means I cannot implement a scalar product or other reduction functions in the standard way… I would be glad for suggestions…
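By ‘the standard way’ I mean the usual grid-stride accumulation followed by a block-level tree reduction, roughly like this (only a sketch, assuming a 1D launch with a power-of-two block size and blockDim.x * sizeof(Real) bytes of dynamic shared memory; the per-block partial sums still have to be combined afterwards):

__global__ void dotProduct( const Real *a, const Real *b, Real *partial, int n )
{
    extern __shared__ Real cache[];   // one element per thread of the block

    Real sum = 0.0f;

    // grid-stride accumulation: works for any number of threads
    for( int i = blockDim.x * blockIdx.x + threadIdx.x;
            i < n; i += blockDim.x * gridDim.x )
        sum += a[i] * b[i];

    cache[threadIdx.x] = sum;
    __syncthreads();

    // tree reduction within the block
    for( int s = blockDim.x / 2; s > 0; s >>= 1 )
    {
        if( threadIdx.x < s )
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if( threadIdx.x == 0 )
        partial[blockIdx.x] = cache[0];
}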

BTW: cuda-memcheck does not report any problems.

enuhtac

Can you give a complete, self-contained example (including the CPU code using the kernel, and the IDX() function/macro)?

Hello tera,

thanks for your answer.

the source of my problem seems to be more subtle than I thought, as I cannot reproduce the failure of the two examples above. I suppose I just made a mistake when evaluating the data returned from the kernel.

Nevertheless, the actual problem remains when the kernel is a little more complex. Here is a kernel and its calling function that show the strange behaviour I described:

__global__ void loopTest(   Buffer r, Buffer x, int dim,
                            int3 nc, int3 ng, dim3 realGridDim )
{
    int i, j, k, bz;

    bz = blockIdx.y / realGridDim.y;

    for( k = blockDim.z * bz + threadIdx.z;
            k < nc.z - ng.z; k += blockDim.z * realGridDim.z )
        for( j = blockDim.y * (blockIdx.y - realGridDim.y*bz) + threadIdx.y;
                j < nc.y - ng.y; j += blockDim.y * realGridDim.y )
            for( i = blockDim.x * blockIdx.x + threadIdx.x;
                    i < nc.x - ng.x; i += blockDim.x * gridDim.x )
                if( (i >= ng.x) && (j >= ng.y) && (k >= ng.z) )
                {
                    int     nijk = IDX( i, j, k, r.pitchedSize ),
                            nijk_n;
                    Real    tmp = 0.0f;

                    if( dim >= 2 )
                    {
                        Real tmp2 = 0.0f;

                        nijk_n = nijk;
                        tmp2 += x.ptr[nijk_n] - x.ptr[nijk];

                        tmp += tmp2;
                    }

                    r.ptr[nijk] = nijk_n;
                }
}

void testInterface( Real *r, Real *a, int nCells[3], int nGhosts[3],
                    int gridDim[3], int blockDim[3] )
{
    struct Buffer   rD, aD;
    dim3            threads( blockDim[0], blockDim[1], blockDim[2] ),
                    realBlocks( gridDim[0], gridDim[1], gridDim[2] ),
                    cudaBlocks( gridDim[0], gridDim[1] * gridDim[2], 1 );
    int3            nc = { nCells[0], nCells[1], nCells[2] },
                    ng = { nGhosts[0], nGhosts[1], nGhosts[2] };

    rD = allocBuffer( nc );
    aD = allocBuffer( nc );

    copyToBuffer( &rD, r, nc );
    copyToBuffer( &aD, a, nc );

    loopTest<<< cudaBlocks, threads >>>( rD, aD, 3, nc, ng, realBlocks );
    CUDA_CALL( cudaThreadSynchronize() )

    copyFromBuffer( r, &rD, nc );
    copyFromBuffer( a, &aD, nc );

    freeBuffer( &rD );
    freeBuffer( &aD );
}

I created this kernel by deleting as much code as possible from my actual kernel while still being able to reproduce the problem.

This is the definition of Buffer with its “member” functions and the IDX macro:

#define IDX(i,j,k,n) (((k) * (n).y + (j)) * (n).x + (i))

struct Buffer
{
    Real    *ptr;
    int2    pitchedSize;
    int     origXSize;
};

struct Buffer bufferFromPitchedPtr( const struct cudaPitchedPtr *cudaPtr )
{
    Buffer result;

    result.ptr = (Real *) cudaPtr->ptr;
    result.pitchedSize.x = cudaPtr->pitch / sizeof( Real );
    result.pitchedSize.y = cudaPtr->ysize;
    result.origXSize = cudaPtr->xsize / sizeof( Real );

    return result;
}

struct cudaPitchedPtr bufferToPitchedPtr( const struct Buffer *b )
{
    return make_cudaPitchedPtr( b->ptr,
            b->pitchedSize.x * sizeof( Real ),
            b->origXSize * sizeof( Real ), b->pitchedSize.y );
}

void copyToBuffer( struct Buffer *dst, const Real *src, int3 size )
{
    cudaMemcpy3DParms parms = { 0 };

    parms.srcPtr = make_cudaPitchedPtr( (void *) src,
            size.x * sizeof( Real ), size.x * sizeof( Real ), size.y );
    parms.dstPtr = bufferToPitchedPtr( dst );
    parms.extent = make_cudaExtent( size.x * sizeof( Real ), size.y, size.z );
    parms.kind = cudaMemcpyHostToDevice;

    CUDA_CALL( cudaMemcpy3D( &parms ) );
}

void copyFromBuffer( Real *dst, const struct Buffer *src, int3 size )
{
    cudaMemcpy3DParms parms = { 0 };

    parms.srcPtr = bufferToPitchedPtr( src );
    parms.dstPtr = make_cudaPitchedPtr( dst,
            size.x * sizeof( Real ), size.x * sizeof( Real ), size.y );
    parms.extent = make_cudaExtent(
            size.x * sizeof( Real ), size.y, size.z );
    parms.kind = cudaMemcpyDeviceToHost;

    CUDA_CALL( cudaMemcpy3D( &parms ) );
}

struct Buffer allocBuffer( int3 size )
{
    cudaPitchedPtr tmp;

    CUDA_CALL( cudaMalloc3D( &tmp, make_cudaExtent(
            size.x * sizeof( Real ), size.y, size.z ) ) )

    return bufferFromPitchedPtr( &tmp );
}

void freeBuffer( struct Buffer *buf )
{
    CUDA_CALL( cudaFree( buf->ptr ) )
}

Real is typedef’ed to be float. CUDA_CALL is the obvious error-checking macro.
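For completeness, CUDA_CALL does roughly this (mine may differ in the details):

#define CUDA_CALL( call )                                            \
{                                                                    \
    cudaError_t err = (call);                                        \
    if( err != cudaSuccess )                                         \
    {                                                                \
        fprintf( stderr, "CUDA error at %s:%d: %s\n",                \
                __FILE__, __LINE__, cudaGetErrorString( err ) );     \
        exit( EXIT_FAILURE );                                        \
    }                                                                \
}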

The function testInterface() is called from main(), which uses the blitz++ library to create the host variables and writes an output file in silo format that can be visualized with the visit software. I won’t post the main function here as I don’t think it would be of help (please ask if you are interested).

In all my current tests I initialize the variables in this way:

nc = { 42, 42, 42 }

ng = { 1, 1, 1 }

blockDim = { 32, 4, 4 }

gridDim = { 2, 11, 11 }
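(For what it’s worth, my own back-of-the-envelope check of these numbers: each block has 32·4·4 = 512 threads, and the grid provides 32·2 = 64 threads in x and 4·11 = 44 in y and z - more than the nc − ng = 41 values the loop indices run over in each direction, so each loop should execute at most one iteration per thread. Please correct me if that arithmetic is wrong.)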

Concerning the platform: I’m using Ubuntu 10.04 64-bit and CUDA toolkit 3.2.16. My graphics card is quite old: a Quadro FX 570 with compute capability 1.1.

Regards,

enuhtac

Hi,

Change

parms.srcPtr = make_cudaPitchedPtr( (void *) src,
        size.x * sizeof( Real ), size.x * sizeof( Real ), size.y );

into

parms.srcPtr = make_cudaPitchedPtr( (void *) src,
        size.x * sizeof( Real ), size.x, size.y );

and try.