Hello CUDA people,
I have a really strange problem. One of the kernels I use is launched several times within my application (with a lot of other things in between including CudaThreadSynchronize() as well). If I introduce a ‘printf(“”);’ (see code) the kernel works properly. But if I don’t the kernel crashes with an ‘unspecified launch error’ after several calls! Some people will recognize this code sample as the code in the cuda SDK examples (cuda particles). The kernel looks as follows:
__global__ void reorder(unsigned int *CellStart, unsigned int *CellEnd,
float2 *inPos, float2 *outPos, float2 *inVel, float2 *outVel, float2* inForce, float2* outForce,
unsigned int *CellId, unsigned int *PId, unsigned int NumberParticles)
{
// Shared memory can be seen in the thread block.
// The qualifier 'extern' provides that the size of the array is determined at launch time.
// The size has to be put in the kernel call as a third argument
extern __shared__ unsigned int sharedCellId[];
unsigned int index = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int cellid;
if (index < NumberParticles)
{
// Initialize the shared array with the cellids
cellid = CellId[index];
// Shared memory in different blocks are independent!
sharedCellId[threadIdx.x+1] = cellid;
//there is the printf("");
printf("");
// For communication between blocks the first entry is responsible
// \p index > 0 && \p threadIdx.x == 0 corresponds to the first thread in a block which is not the zeroth
if (index > 0 && threadIdx.x == 0)
{
// each (per block) \p sharedCellId[0] gets the \p CellId of the last
// thread in the block with a blockindex decremented by one
sharedCellId[0] = CellId[index-1];
}
}
// now the shared array is complete
__syncthreads();
if (index < NumberParticles)
{
// The first index corresponds to a start
// if the cellid is unequal to the cellId of the last thread.
// One has to take the shift in the shared array into account.
if (index == 0 || cellid != sharedCellId[threadIdx.x])
{
CellStart[cellid] = index;
// if the index is greater than 0 a cell end is given by the cellid of the last thread
if (index > 0)
CellEnd[sharedCellId[threadIdx.x]] = index;
}
// the last index corresponds to an end
if (index == NumberParticles -1)
{
CellEnd[cellid] = index + 1;
}
// Now the device data will be sorted
unsigned int sortedPId = PId[index];
outPos[index] = inPos[sortedPId];
outVel[index] = inVel[sortedPId];
outForce[index] = inForce[sortedPId];
}
}
I am absoulty unsure about possible reasons for this behaviour.
Additional question: Does dynamically allocated shared need to be freed? (I don’t think so, because it should have the lifetime of a block.)
Thanks in advance!