Code works fine in parts, but not when running it all at the same time

Now and then I run into this problem (or a similar one).

I have a piece of code which handles, for instance, two images at the same time. When I compile and run it, it returns zero for both of the images, but if I comment out some parts of the code (say the parts dealing with image 1) and then compile and run, the answer is correct for image 2, and if I do it the other way around, the answer is correct for image 1.

Below is an example:

texture<float, 2, cudaReadModeElementType> textureInputDataDeformed;
texture<float, 2, cudaReadModeElementType> textureInputDataTarget;

__global__ void DownSampleFactor2(float* deformedImage, float* targetImage, float* filter, DataSize outputDataSize, DataSize dataSizeFilter) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    // Shared memory for the filter coefficients
    extern __shared__ float filterShared[];
    int numberOfFilterCoefficients = dataSizeFilter.dataWidth * dataSizeFilter.dataHeight;

    // Each thread reads a filter coefficient into shared memory, max BLOCK_SIZE_X * BLOCK_SIZE_Y coefficients per read
    int filterCoefficientOffset = 0;
    while (numberOfFilterCoefficients > 0) {
        if ( (threadIdx.x + threadIdx.y * blockDim.x) < numberOfFilterCoefficients ) {
            filterShared[threadIdx.x + threadIdx.y * blockDim.x + filterCoefficientOffset] = filter[threadIdx.x + threadIdx.y * blockDim.x + filterCoefficientOffset];
        }
        numberOfFilterCoefficients -= BLOCK_SIZE_X * BLOCK_SIZE_Y;
        filterCoefficientOffset += BLOCK_SIZE_X * BLOCK_SIZE_Y;
    }
    __syncthreads();

    if (idx < outputDataSize.dataWidth && idy < outputDataSize.dataHeight) {
        int xOffset;
        int yOffset;

        float sumDeformed = 0.0f;
        float sumTarget = 0.0f;

        // Convolve both input textures with the filter, centred on the corresponding position in the full-resolution images
        xOffset = -(dataSizeFilter.dataWidth - 1)/2;
        for (int filterX = dataSizeFilter.dataWidth - 1; filterX >= 0; filterX--) {
            yOffset = -(dataSizeFilter.dataHeight - 1)/2;
            for (int filterY = dataSizeFilter.dataHeight - 1; filterY >= 0; filterY--) {
                sumDeformed += filterShared[filterX + filterY * dataSizeFilter.dataWidth] * tex2D(textureInputDataDeformed, 2*idx + xOffset + 0.5f, 2*idy + yOffset + 0.5f);
                sumTarget += filterShared[filterX + filterY * dataSizeFilter.dataWidth] * tex2D(textureInputDataTarget, 2*idx + xOffset + 0.5f, 2*idy + yOffset + 0.5f);

                yOffset++;
            }
            xOffset++;
        }

        deformedImage[idx + idy * outputDataSize.dataWidth] = sumDeformed;
        targetImage[idx + idy * outputDataSize.dataWidth] = sumTarget;
    }
}

If I comment out the sumDeformed += … row, then targetImage will be correct, and if I comment out the sumTarget += … row, then deformedImage will be OK.

Any suggestions?

It should be noted that this code runs OK on my desktop with a GTX 285 but not on my laptop with a Quadro FX 770M. I have, however, encountered similar issues on my GTX 285 as well.

Returning empty memory is usually a sign of launch failure or the kernel aborting. Do you do error checking after the kernel launch? (cudaGetLastError() straight after the launch should tell you what is going on, and checking the return value of cudaThreadSynchronize() after that will tell you whether the kernel is aborting somehow).
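In case it helps, a minimal error checking sketch would look something like this (the launch configuration and argument names here are just placeholders, not taken from your code):

// Hypothetical launch - gridSize, blockSize, sharedMemBytes and the device pointers are placeholders
DownSampleFactor2<<<gridSize, blockSize, sharedMemBytes>>>(dDeformed, dTarget, dFilter, outSize, filterSize);

cudaError_t launchStatus = cudaGetLastError();      // reports launch failures (bad configuration etc.)
if (launchStatus != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(launchStatus));    // needs <cstdio>

cudaError_t execStatus = cudaThreadSynchronize();   // reports errors that occur while the kernel runs
if (execStatus != cudaSuccess)
    printf("Kernel execution failed: %s\n", cudaGetErrorString(execStatus));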

One possibility (out of many) would be that including the code for both images increases the register usage of the compiled kernel, so that the registers per thread times the block size you are using exceeds the per-block register file size. This would also explain the difference between the laptop and the GTX 285 - the laptop GPU has a limit of 8192 registers per block, the GTX 285 16384. A large block size could conceivably launch and run on the desktop card, but not on the laptop card.
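If you want to check the numbers, compiling with nvcc --ptxas-options=-v will print the registers per thread for each kernel, and you can also query them at run time. A rough sketch (device 0 assumed, register allocation granularity ignored):

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, DownSampleFactor2);    // attr.numRegs = registers per thread

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                  // prop.regsPerBlock = register file per block

int threadsPerBlock = BLOCK_SIZE_X * BLOCK_SIZE_Y;
if (attr.numRegs * threadsPerBlock > prop.regsPerBlock)
    printf("%d threads x %d registers won't fit in the %d register file\n",
           threadsPerBlock, attr.numRegs, prop.regsPerBlock);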

Another possibility would be that the driver watchdog timer is killing the kernel before it finishes. On display devices, there is a limit of about 5 seconds of wallclock time per single kernel launch. The laptop part should be about 10-20 times slower than the desktop part. If this is a long-running kernel, it could be that the GTX 285 finishes inside the watchdog limit most of the time, but the laptop card, which is slower, doesn’t. Removing some code makes everything go faster and takes the driver time limit out of play.
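You can check whether a particular device has the watchdog active from its properties; something like this sketch (again assuming device 0):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.kernelExecTimeoutEnabled)
    printf("%s has a display watchdog - kernels running longer than ~5s will be killed\n", prop.name);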

There are lots of possibilities, but error checking is the first place to start.

Thanks for the help. You were quite right; error handling was the right place to start. It appears that I’ve run out of registers, just as you suggested. I don’t know why, but I thought I had the error handling covered by some other part of my code.

But when it comes to running out of registers, I thought the compiler automatically assigned the remaining variables to local memory.

There is spilling of registers to local memory at compile time if you exceed the maximum per-thread register limit (or a user-imposed limit which is less than that), but that mechanism cannot know about run-time parameters like block size, which is what controls success or failure in this case.
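If you want to trade registers for local memory yourself, you can either cap the register count for the whole compilation unit with nvcc --maxrregcount=N, or (if your toolkit version supports it) tell the compiler the largest block size the kernel will ever be launched with via __launch_bounds__, so it keeps the per-thread register usage low enough. A sketch along those lines, reusing the kernel signature from above:

// The compiler may now spill to local memory rather than exceed the register file
// when the kernel is launched with up to BLOCK_SIZE_X * BLOCK_SIZE_Y threads per block.
__global__ void __launch_bounds__(BLOCK_SIZE_X * BLOCK_SIZE_Y)
DownSampleFactor2(float* deformedImage, float* targetImage, float* filter,
                  DataSize outputDataSize, DataSize dataSizeFilter)
{
    // ... kernel body unchanged ...
}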

Ok, thanks for your help.