Introduction of a 'printf("");' leads to different results

Hello CUDA people,

I have a really strange problem. One of the kernels I use is launched several times within my application (with a lot of other things in between including CudaThreadSynchronize() as well). If I introduce a ‘printf("");’ (see code) the kernel works properly. But if I don’t the kernel crashes with an ‘unspecified launch error’ after several calls! Some people will recognize this code sample as the code in the cuda SDK examples (cuda particles). The kernel looks as follows:

__global__ void reorder(unsigned int *CellStart, unsigned int *CellEnd,

                        float2 *inPos, float2 *outPos, float2 *inVel, float2 *outVel, float2* inForce, float2* outForce,

                        unsigned int *CellId, unsigned int *PId, unsigned int NumberParticles)


    // Shared memory can be seen in the thread block.

    // The qualifier 'extern' provides that the size of the array is determined at launch time.

    // The size has to be put in the kernel call as a third argument

    extern __shared__ unsigned int sharedCellId[];

unsigned int index = blockIdx.x*blockDim.x + threadIdx.x;

    unsigned int cellid;

if (index < NumberParticles)


        // Initialize the shared array with the cellids

        cellid = CellId[index];

        // Shared memory in different blocks are independent!

        sharedCellId[threadIdx.x+1] = cellid;

//there is the printf("");



        // For communication between blocks the first entry is responsible

        // \p index > 0 && \p threadIdx.x == 0 corresponds to the first thread in a block which is not the zeroth

        if (index > 0 && threadIdx.x == 0)


            // each (per block) \p sharedCellId[0] gets the \p CellId of the last

            // thread in the block with a blockindex decremented by one

            sharedCellId[0] = CellId[index-1];



    // now the shared array is complete


if (index < NumberParticles)


        // The first index corresponds to a start

        // if the cellid is unequal to the cellId of the last thread.

        // One has to take the shift in the shared array into account.

        if (index == 0 || cellid != sharedCellId[threadIdx.x])


CellStart[cellid] = index;

            // if the index is greater than 0 a cell end is given by the cellid of the last thread

            if (index > 0)

                CellEnd[sharedCellId[threadIdx.x]] = index;


// the last index corresponds to an end

        if (index == NumberParticles -1)


            CellEnd[cellid] = index + 1;


// Now the device data will be sorted

        unsigned int sortedPId = PId[index];

outPos[index] = inPos[sortedPId];

        outVel[index] = inVel[sortedPId];

        outForce[index] = inForce[sortedPId];



I am absoulty unsure about possible reasons for this behaviour.

Additional question: Does dynamically allocated shared need to be freed? (I don’t think so, because it should have the lifetime of a block.)

Thanks in advance!


as far as I know, dynamically allocated shared memory is freed automatically. However, be carefull that the amount of shared memory allocated dynamically does not exceed the limit of your GPU card. This would probably lead to a launch failure but I have learned to be carefull before making such an assumption.

For your printf problem, I have had similar ones. In my case, with a printf, the kernel would give wrong results! Some of these issues where solved by doing a fresh install of the latest CUDA library/toolkit, but some other still remain… I observed similar weird problems when compiling in debug mode some kernels that use shared memory and the SIMT architecture.