I suppose I deserved to be thrashed for that.
In my defense, I was trying to ask a very narrow question about the kernel call, which you kindly answered.
That said, I am not seeing the behavior you are.
The kernel call looks like:
bool GPUGenerate_segmentsFlat(int maxpoints, int numContours, int *onumElements, int *ocumElements, int *rnumElements, float *oContPerim, bool *oFinishedStatus, GPURec *GlobalMem)
{
    if(maxpoints <= DCE_THREADS_PER_BLOCK)
    {
        gpuDCESegsFlatShared<<<numContours,DCE_THREADS_PER_BLOCK>>>(numContours, onumElements, ocumElements, rnumElements, oContPerim, oFinishedStatus, GlobalMem);
        cudaDeviceSynchronize();
        return(true);
    }
    return(true);
}
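The wrapper above discards the return value of cudaDeviceSynchronize() and never calls cudaGetLastError(), so a failed launch would go unnoticed. Below is a minimal sketch of how the launch could be checked. CHECK_CUDA, dummyKernel and launchChecked are hypothetical names used only for illustration, and dummyKernel merely stands in for gpuDCESegsFlatShared.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original code): print the CUDA error
// with its source location and make the enclosing function return false.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                         cudaGetErrorString(err_));                    \
            return false;                                              \
        }                                                              \
    } while (0)

// Stand-in kernel; the real code would launch gpuDCESegsFlatShared here.
__global__ void dummyKernel(int *out)
{
    if (threadIdx.x == 0)
        out[blockIdx.x] = 1;
}

bool launchChecked(int numBlocks, int threadsPerBlock, int *d_out)
{
    dummyKernel<<<numBlocks, threadsPerBlock>>>(d_out);
    CHECK_CUDA(cudaGetLastError());       // reports launch-configuration errors
    CHECK_CUDA(cudaDeviceSynchronize());  // reports errors raised while the kernel ran
    return true;
}

int main()
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(int)) != cudaSuccess)
        return 1;

    // Same launch shape as in the post: 2 blocks of 1024 threads.
    bool ok = launchChecked(2, 1024, d_out);
    std::printf("launch %s\n", ok ? "succeeded" : "failed");

    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

If the launch itself is rejected (for example, because the block size or the shared-memory request exceeds what the device allows), the cudaGetLastError() check should report it instead of the function silently returning true.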
The kernel itself looks like:
__global__ void gpuDCESegsFlatShared(int numContours, int *onumElements, int *ocumElements, int *rnumElements, float *oContPerim, bool *oFinishedStatus, GPURec *GlobalMem)
{
    // __shared__ GPURec sdata[1024];
    // __shared__ bool fdata[1024];
    // __shared__ int ndata[1024];
    // __shared__ bool globalend;
    int seg1x, seg1y, seg2x, seg2y;
    int tidx = threadIdx.x;
    int bidx = blockIdx.x;
    int gidx = gridDim.x;
    int ttest = 0;  // initialized so the increment below is well defined
    if(gidx != 0)
        ttest++;
    // various bits of initialization code are here
    if(tidx == 0)
    {
        if(bidx != 0)
        {
            fdata[bidx] = false;
            ndata[bidx] = rnumElements[bidx];
        }
        else
        {
            fdata[bidx] = false;
            ndata[bidx] = rnumElements[bidx];
        }
    }
    // real function code is in here
    __syncthreads();
}
Here, I have checked that numContours = 2 and DCE_THREADS_PER_BLOCK = 1024.
When I put breakpoints on either the gidx test or the bidx != 0 check, I never reach them.
As far as I can tell, I never see a non-zero blockIdx.x.
For testing purposes, I commented out the __shared__ declarations to see whether they might have something to do with it. Still no effect.
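One way to check the block indices without relying on breakpoints is to have each block write its blockIdx.x to global memory and read the values back on the host. This can matter because, in optimized builds, a compiler may drop statements with no observable effect (such as an increment of an unused local), leaving nothing for a breakpoint to stop on. The following is only a sketch: recordBlockIdx, seenBlock and seenGrid are hypothetical names, and the launch shape (2 blocks of 1024 threads) is taken from the values stated above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical diagnostic kernel: thread 0 of each block stores the block
// index and the grid size it observes into global memory.
__global__ void recordBlockIdx(int *seenBlock, int *seenGrid)
{
    if (threadIdx.x == 0) {
        seenBlock[blockIdx.x] = blockIdx.x;
        seenGrid[blockIdx.x]  = gridDim.x;
    }
}

int main()
{
    const int numContours = 2;     // value reported in the post
    const int threads     = 1024;  // DCE_THREADS_PER_BLOCK in the post
    const size_t bytes    = numContours * sizeof(int);

    int *dBlock = nullptr, *dGrid = nullptr;
    if (cudaMalloc(&dBlock, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&dGrid, bytes) != cudaSuccess) return 1;

    // Fill every byte with 0xFF so each int reads back as -1 unless a block wrote it.
    cudaMemset(dBlock, 0xFF, bytes);
    cudaMemset(dGrid, 0xFF, bytes);

    recordBlockIdx<<<numContours, threads>>>(dBlock, dGrid);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int hBlock[numContours], hGrid[numContours];
    cudaMemcpy(hBlock, dBlock, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(hGrid, dGrid, bytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < numContours; ++i)
        std::printf("slot %d: blockIdx.x = %d, gridDim.x = %d\n", i, hBlock[i], hGrid[i]);

    cudaFree(dBlock);
    cudaFree(dGrid);
    return 0;
}
```

A slot that still holds -1 after the copy would mean that block never reached the write, which separates "the block did not run" from "the debugger did not stop there".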