Now and then I run into this problem (or a similar one).
I have a piece of code which handles, for instance, two images at the same time. When I compile and run it, it returns zero for both images. But if I comment out some parts of the code (say, the parts dealing with image 1) and then compile and run, the answer is correct for image 2, and if I do it the other way around the answer is correct for image 1.
Below is an example:
texture<float, 2, cudaReadModeElementType> textureInputDataDeformed;
texture<float, 2, cudaReadModeElementType> textureInputDataTarget;
__global__ void DownSampleFactor2(float* deformedImage, float* targetImage, float* filter, DataSize outputDataSize, DataSize dataSizeFilter) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    // Shared memory
    extern __shared__ float filterShared[];
    int numberOfFilterCoefficients = dataSizeFilter.dataWidth * dataSizeFilter.dataHeight;
    // Each thread reads a filter coefficient into shared memory, max BLOCK_SIZE_X * BLOCK_SIZE_Y coefficients per read
    int filterCoefficientOffset = 0;
    while (numberOfFilterCoefficients > 0) {
        if ( (threadIdx.x + threadIdx.y * blockDim.x) < numberOfFilterCoefficients ) {
            filterShared[threadIdx.x + threadIdx.y * blockDim.x + filterCoefficientOffset] = filter[threadIdx.x + threadIdx.y * blockDim.x + filterCoefficientOffset];
        }
        numberOfFilterCoefficients -= BLOCK_SIZE_X * BLOCK_SIZE_Y;
        filterCoefficientOffset += BLOCK_SIZE_X * BLOCK_SIZE_Y;
    }
    __syncthreads();
    if (idx < outputDataSize.dataWidth && idy < outputDataSize.dataHeight) {
        int xOffset;
        int yOffset;
        float sumDeformed = 0.0f;
        float sumTarget = 0.0f;
        xOffset = -(dataSizeFilter.dataWidth - 1)/2;
        for (int filterX = dataSizeFilter.dataWidth - 1; filterX >= 0; filterX--) {
            yOffset = -(dataSizeFilter.dataHeight - 1)/2;
            for (int filterY = dataSizeFilter.dataHeight - 1; filterY >= 0; filterY--) {
                sumDeformed += filterShared[filterX + filterY * dataSizeFilter.dataWidth] * tex2D(textureInputDataDeformed, 2*idx + xOffset + 0.5f, 2*idy + yOffset + 0.5f);
                sumTarget += filterShared[filterX + filterY * dataSizeFilter.dataWidth] * tex2D(textureInputDataTarget, 2*idx + xOffset + 0.5f, 2*idy + yOffset + 0.5f);
                yOffset++;
            }
            xOffset++;
        }
        deformedImage[idx + idy * outputDataSize.dataWidth] = sumDeformed;
        targetImage[idx + idy * outputDataSize.dataWidth] = sumTarget;
    }
}
If I comment out the sumDeformed += … line in the kernel, targetImage is correct, and if I comment out the sumTarget += … line instead, deformedImage is correct.
Any suggestions?
Note that this code runs fine on my desktop with a GTX 285 but not on my laptop with a Quadro FX 770M. I have, however, run into similar issues on the GTX 285 as well.
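For comparing results, a CPU version of the same filter sum might help to tell a wrong value apart from an output buffer that was never written. The sketch below is not the code I actually run: the name downSampleFactor2Reference is hypothetical, DataSize is assumed to be a plain struct with dataWidth and dataHeight, and the texture is assumed to use unnormalized coordinates with clamp addressing and point sampling (the texture setup is not part of the kernel above).

```cpp
#include <algorithm>
#include <vector>

// Assumed layout; the real DataSize definition is not shown in the post.
struct DataSize {
    int dataWidth;
    int dataHeight;
};

// Hypothetical CPU reference for ONE image, mirroring the kernel's loop order.
// Assumes clamp-to-edge addressing and point sampling, which may differ
// from the actual texture configuration.
std::vector<float> downSampleFactor2Reference(const std::vector<float>& input,
                                              DataSize inputSize,
                                              const std::vector<float>& filter,
                                              DataSize filterSize,
                                              DataSize outputSize) {
    std::vector<float> output(outputSize.dataWidth * outputSize.dataHeight, 0.0f);
    for (int idy = 0; idy < outputSize.dataHeight; ++idy) {
        for (int idx = 0; idx < outputSize.dataWidth; ++idx) {
            float sum = 0.0f;
            int xOffset = -(filterSize.dataWidth - 1) / 2;
            for (int filterX = filterSize.dataWidth - 1; filterX >= 0; --filterX) {
                int yOffset = -(filterSize.dataHeight - 1) / 2;
                for (int filterY = filterSize.dataHeight - 1; filterY >= 0; --filterY) {
                    // tex2D(tex, 2*idx + xOffset + 0.5f, ...) reads texel 2*idx + xOffset;
                    // out-of-range coordinates are clamped to the border here (assumption).
                    int x = std::min(std::max(2 * idx + xOffset, 0), inputSize.dataWidth - 1);
                    int y = std::min(std::max(2 * idy + yOffset, 0), inputSize.dataHeight - 1);
                    sum += filter[filterX + filterY * filterSize.dataWidth]
                         * input[x + y * inputSize.dataWidth];
                    ++yOffset;
                }
                ++xOffset;
            }
            output[idx + idy * outputSize.dataWidth] = sum;
        }
    }
    return output;
}
```

Calling it once per image and comparing with deformedImage and targetImage copied back from the device would show whether the GPU values are wrong or simply all zero.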