Hello,
sorry for my third post on this topic. The first one resulted from a really stupid mistake I made, so I deleted it. But then another problem arose from the same piece of code, which I presented in my last post. Strangely, that post wasn't viewed even once, so I'm posting the same topic again, this time with a different title (maybe "3d grids for cfd" discourages many people) and with far fewer unnecessary details.
I have the following test kernel that should fill the 3d array 'r' (of size 'nc') with the value 42.0. Only the interior of the box is to be filled; the boundary layer (of thickness 'ng') is to be omitted:
__global__ void loopTest( Buffer r, int3 nc, int3 ng, dim3 realGridDim )
{
    int i, j, k;
    int bz = blockIdx.y / realGridDim.y;

    for( k = blockDim.z * bz + threadIdx.z;
         k < nc.z - ng.z; k += blockDim.z * realGridDim.z )
        for( j = blockDim.y * (blockIdx.y - realGridDim.y * bz) + threadIdx.y;
             j < nc.y - ng.y; j += blockDim.y * realGridDim.y )
            for( i = blockDim.x * blockIdx.x + threadIdx.x;
                 i < nc.x - ng.x; i += blockDim.x * gridDim.x )
                if( (i >= ng.x) && (j >= ng.y) && (k >= ng.z) )
                {
                    int nijk = IDX( i, j, k, r );
                    r.ptr[nijk] = 42.0;
                }
}
The struct 'Buffer' is my own type of pitched pointer, and the macro IDX takes the three indices i, j, k and turns them into a single 1d index for accessing the memory.
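The exact definitions shouldn't matter for the problem, but for completeness, here is a sketch of the idea. Please note the field names and the macro body below are just illustrative placeholders, not my real code:

```cuda
// Illustrative sketch only -- my real 'Buffer' / 'IDX' differ in details.
// The idea is a pitched pointer with precomputed strides in elements:
struct Buffer
{
    double *ptr;    // device pointer (e.g. from cudaMalloc3D)
    size_t  pitch;  // row stride in elements
    size_t  slice;  // slice stride in elements (pitch * number of rows)
};

// Map (i,j,k) to a 1d offset into the pitched allocation.
#define IDX( i, j, k, buf ) \
    ( (k) * (buf).slice + (j) * (buf).pitch + (i) )
```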
If I could always assume a 1:1 mapping of threads to array elements, I could of course omit the three nested loops. But this code is designed to work even if there are fewer threads than array elements.
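In 1d this is just the usual grid-stride loop pattern; my kernel is the same idea extended to three dimensions (with the z dimension folded into blockIdx.y, hence 'realGridDim'):

```cuda
// The same idea in 1d: a standard grid-stride loop that produces correct
// results for any launch configuration, even when
// gridDim.x * blockDim.x < n.
__global__ void fill1d( double *r, int n, double value )
{
    for( int i = blockDim.x * blockIdx.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x )
        r[i] = value;
}
```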
These three nested loops are the cause of a bunch of strange phenomena. First of all, the loop body often does not get executed at all. Whether it gets executed depends on the actual code of the loop body and on the number and layout of threads. In some configurations the kernel call even stops with an 'unspecified launch failure'. I have absolutely no idea what's going on here.
I found a quite similar but very old thread in this forum: The Official NVIDIA Forums | NVIDIA. It also deals with an unspecified launch failure in a nested-loop setting. But the answer there is obviously not applicable to my case, as I'm neither using Windows nor does this kernel take longer than 5 seconds to execute.
I could modify the kernel to work more reliably (meaning I could not find a configuration that broke its correct execution):
__global__ void loopTest( Buffer r, int3 nc, int3 ng, dim3 realGridDim )
{
    int i, j, k,
        bz     = blockIdx.y / realGridDim.y,
        iStart = blockDim.x * blockIdx.x + threadIdx.x,
        jStart = blockDim.y * (blockIdx.y - realGridDim.y * bz) + threadIdx.y,
        kStart = blockDim.z * bz + threadIdx.z;
    __shared__ int iEnd, jEnd, kEnd, iSkip, jSkip, kSkip;

    if( (threadIdx.x == 0) && (threadIdx.y == 0) && (threadIdx.z == 0) )
    {
        iEnd  = nc.x - ng.x;
        jEnd  = nc.y - ng.y;
        kEnd  = nc.z - ng.z;
        iSkip = blockDim.x * gridDim.x;
        jSkip = blockDim.y * realGridDim.y;
        kSkip = blockDim.z * realGridDim.z;
    }
    __syncthreads();

    for( k = kStart; k < kEnd; k += kSkip )
        for( j = jStart; j < jEnd; j += jSkip )
            for( i = iStart; i < iEnd; i += iSkip )
                if( (i >= ng.x) && (j >= ng.y) && (k >= ng.z) )
                {
                    int nijk = IDX( i, j, k, r );
                    r.ptr[nijk] = 42.0;
                }
}
All I did was move the calculations out of the loop headers into separate variables. I hadn't done that before in order to save registers; also, I don't expect the loops to iterate many times, so I thought the recalculation wouldn't be costly. Again I'm using shared memory to save registers, but maybe that's not a good idea; only timing tests could tell.
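For completeness, this is roughly how I launch the kernel. The concrete block and grid sizes here are just an example; the point is that the logical y and z block coordinates are folded into the hardware's blockIdx.y, and 'realGridDim' tells the kernel how to unfold them again (bz = blockIdx.y / realGridDim.y):

```cuda
// Example launch (sizes are illustrative): the logical 3d grid
// 'realGridDim' is flattened into a 2d hardware grid by multiplying
// its y and z extents into grid.y.
dim3 block( 8, 8, 8 );
dim3 realGridDim( 4, 4, 4 );                 // logical 3d grid
dim3 grid( realGridDim.x,
           realGridDim.y * realGridDim.z );  // what CUDA actually sees
loopTest<<< grid, block >>>( r, nc, ng, realGridDim );
```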
But still the question remains: why does the first code example behave so strangely? Can I expect the second example to work reliably? Or do I have to take into account that there may be situations in which the second example won't work either?
I’m hoping for some answers :-)
Regards,
enuhtac