CUDA 2D textures and bilinear interpolation at border texels

Hi,

We have a bit of a problem right now using 2D textures in CUDA. Our plan was to use normalized texture coordinates with cudaAddressModeWrap to smoothly interpolate (and periodically wrap around) the values we read along the texture’s X axis.

It seems that whenever the fractional part of our texture x coordinate becomes smaller than 1/(2 * texture dimension X) or larger than 1 - 1/(2 * texture dimension X), the hardware bilinear interpolation blends the texel situated at the border with the border color (which is black).

Can anyone confirm that the hardware interpolation at border texels is done against black (0 values)? If so, we’d have to bound the fractional part of our texture coordinates to lie between 1/(2 * texture dimension X) and 1 - 1/(2 * texture dimension X), so that while wrapping around between the texture’s border texels only point sampling is done.
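
For illustration, here is the kind of clamp we have in mind, as a minimal sketch (the helper name and the width parameter are ours, not from any API):

__device__ float clamp_frac_x(float x, int width)
{
	// Keep the fractional part of x out of the half-texel zones next to the
	// border, so the hardware never blends a border texel with the border color.
	float half_texel = 0.5f / (float)width;
	float fx = x - floorf(x);                                // fractional part
	return fminf(fmaxf(fx, half_texel), 1.0f - half_texel);
}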

Is there a way to change the behavior of the interpolation hardware with respect to the border colors? Ideally it would interpolate between the rightmost and leftmost texel, and not against a black border.

We don’t currently use the hardware interpolation on the Y axis, but I suppose the problem there would be identical.

Christian

Border color is zero: [url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#texture-memory[/url] (see the “addressing mode” section).

Check your wrap mode again.
You should not be able to access the border color when using wrap (repeat) mode at all.
Do the experiment with both coordinates set to wrap.

What’s the exact setup of all fields in your cudaTextureDesc?
[url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#texture-object-api[/url]

What is your GPU hardware and driver version?

We’re using CUDA’s texture objects API.

We’re reading from a half float texture whose internal storage type is ushort2. Each texel holds a complex value (real and imaginary part). To read bilinearly interpolated float values, we use these fields in the cudaTextureDesc:

readMode is cudaReadModeElementType
filterMode is cudaFilterModeLinear
addressMode[0] and [1] is cudaAddressModeWrap
normalizedCoords is 1
The remaining fields are left zero-initialized

We access the texture with tex2D<float2>(), giving us direct float access to the texture’s half float data without requiring additional intrinsics.
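
In kernel code the access looks roughly like this (a minimal sketch; tex stands for a texture object configured as above):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void fetch(cudaTextureObject_t tex, float u, float v)
{
	// the half2 texels are filtered in hardware and returned as float2
	float2 val = tex2D<float2>(tex, u, v);
	printf("%f + i*%f\n", val.x, val.y);
}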

It could be that this particular combination of texture properties is causing trouble.

We’ve just implemented a workaround by prepending and appending two extra texels on the X axis (copied from the border texels on the opposite side of the texture), and then offsetting and re-scaling the fractional part of the texture X coordinate so it never falls onto these additional texels.
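
In outline, the padding step looks like this (a sketch only, with two wrapped texels per side as described; src and padded are hypothetical host buffers of width n and n+4, respectively):

#include <cuda_runtime.h>  // for the ushort2 type

// Pad each row with two wrapped texels per side before uploading.
void pad_rows(const ushort2 *src, ushort2 *padded, int m, int n)
{
	for (int row = 0; row < m; row++) {
		const ushort2 *s = src + row * n;
		ushort2 *p = padded + row * (n + 4);
		p[0] = s[n - 2];                  // prepend the last two source columns
		p[1] = s[n - 1];
		for (int col = 0; col < n; col++)
			p[col + 2] = s[col];
		p[n + 2] = s[0];                  // append the first two source columns
		p[n + 3] = s[1];
	}
}

// When sampling, the fractional x is remapped so it never lands on the padding:
// x_padded = (2.0f + frac(x) * n) / (n + 4.0f)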

I may post a more complete repro code a bit later.

Regarding your question about GPU hardware and driver version.

My colleague is on a GTX 960, I am on a GTX 970 (both Maxwell). I am on driver version 346.46; my colleague who experienced the wraparound problem is on 346.72. Both machines run Ubuntu 12.04 with the latest official hardware enablement stack (kernel 3.13).

Our CUDA toolkit is still at 5.0, as we’re not targeting any Maxwell-specific features at the moment.

Christian

This is supposed to be repro code (compile with nvcc -o test test.cu -arch=compute_30), but on my Windows laptop with a Kepler chip and CUDA 7.5RC it doesn’t exhibit the problem.

The top printout shows the texel center values; the second line shows the values interpolated halfway between 1.0 and 2.0 (giving 1.5), between 2.0 and 4.0 (giving 3.0), and, for the wraparound case, between 4.0 and 1.0 (giving 0.5 * (4.0 + 1.0) = 2.5). If the border color were interpolated against, the value 0.5 * (4.0 + 0.0) = 2.0 would show up in place of 2.5.

Next I’ll try on the Linux machines that exhibited the problem in our production code. If the repro code still doesn’t reproduce the problem there, the problem must lie deeper (it may be some stupid texture initialization bug on our end).

Proper output (expected result):

1.00000000+i* 1.00000000  2.00000000+i* 2.00000000  4.00000000+i* 4.00000000

1.50000000+i* 1.50000000  3.00000000+i* 3.00000000  2.50000000+i* 2.50000000

test.cu

#include <cuda_runtime.h>
#include "device_launch_parameters.h"

#include <stdio.h>
#include <memory.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                              \
	do {                                                                  \
		cudaError_t err = call;                                           \
		if (cudaSuccess != err) {                                         \
			fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
					 __FILE__, __LINE__, cudaGetErrorString(err) );       \
			exit(EXIT_FAILURE);                                           \
		}                                                                 \
	} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                              \
	do {                                                                  \
		/* Check synchronous errors, i.e. pre-launch */                   \
		cudaError_t err = cudaGetLastError();                             \
		if (cudaSuccess != err) {                                         \
			fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
				__FILE__, __LINE__, cudaGetErrorString(err) );       \
			exit(EXIT_FAILURE);                                           \
		}                                                                 \
		/* Check asynchronous errors, i.e. kernel failed (ULF) */         \
		err = cudaDeviceSynchronize();                                    \
		if (cudaSuccess != err) {                                         \
			fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
					 __FILE__, __LINE__, cudaGetErrorString( err) );      \
			exit(EXIT_FAILURE);                                           \
		}                                                                 \
	} while (0)

__global__ void kernel(cudaTextureObject_t tex, int m, int n)
{
	float2 val;

	// texture access at texel center coordinates
	for (int row = 0; row < m; row++) {
		for (int col = 0; col < n; col++) {
			val = tex2D<float2>(tex, (col + 0.5f)/n, (row + 0.5f)/m);
			printf("% 9.8f+i*% 9.8f ", val.x, val.y);
		}
		printf("\n");
	}

	printf("\n");

	// texture access just shy of halfway between texel centers, horizontally;
	// the hardware's 8-bit fractional filtering weights round 0.499 up to 0.5
	for (int row = 0; row < m; row++) {
		for (int col = 0; col < n; col++) {
			val = tex2D<float2>(tex, (col + 0.5f + 0.499f)/n, (row + 0.5f)/m);
			printf("% 9.8f+i*% 9.8f ", val.x, val.y);
		}
		printf("\n");
	}
}

int main(void)
{
	const int m = 1; // height = #rows
	const int n = 3; // width  = #columns
	ushort2 arr[m][n] = {
		{ make_ushort2(0x3c00, 0x3c00), make_ushort2(0x4000, 0x4000), make_ushort2(0x4400, 0x4400) },   // 1.0      , 2.0      , 4.0
	};

	cudaArray_t array;
	cudaChannelFormatDesc channelDesc = cudaCreateChannelDescHalf2();
	CUDA_SAFE_CALL(cudaMallocArray(&array, &channelDesc, n, m, 0));
	CUDA_SAFE_CALL(cudaMemcpy2DToArray(array, 0, 0, arr, n*sizeof(arr[0][0]), n*sizeof(arr[0][0]), m, cudaMemcpyHostToDevice));

	// create resource description
	cudaResourceDesc resDesc;
	memset(&resDesc, 0, sizeof(resDesc));
	resDesc.resType = cudaResourceTypeArray;
	resDesc.res.array.array = array;

	// create texture description
	cudaTextureDesc texDesc;
	memset(&texDesc, 0, sizeof(texDesc));
	texDesc.readMode = cudaReadModeElementType;
	texDesc.filterMode = cudaFilterModeLinear;
	texDesc.addressMode[0] = cudaAddressModeWrap;
	texDesc.addressMode[1] = cudaAddressModeWrap;
	texDesc.normalizedCoords = 1;

	// create texture object
	cudaTextureObject_t tex = 0;
	CUDA_SAFE_CALL(cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL));

	kernel<<<1, 1>>>(tex, m, n);
	CHECK_LAUNCH_ERROR();
	CUDA_SAFE_CALL(cudaDeviceSynchronize());
	// destroy the texture object before freeing the array it references
	CUDA_SAFE_CALL(cudaDestroyTextureObject(tex));
	CUDA_SAFE_CALL(cudaFreeArray(array));
	return EXIT_SUCCESS;
}

There was at least one issue with the edge-case handling of textures in CUDA 5.0. If memory serves, it had to do with unnormalized texture coordinates, where the “border” mode was incorrectly turned into “clamp to edge”. This was discussed in this forum, and I recall filing a bug to have it fixed for 5.5.

The “wrap” mode should work with normalized texture coordinates; given that CUDA 5 had a known bug with texture edge cases, there might have been another one (or the bug that existed back then may have affected multiple modes).

I would recommend upgrading the CUDA version. If CUDA 7.5 seems too risky on account of its recent release, I would suggest trying 6.5, as it has been stable in my personal experience and supports all shipping hardware (make sure to download the version enabled for GTX 9xx!).

As I haven’t been able to get the above code to reproduce the problem on our Linux systems with CUDA 5.0 (neither with pitch linear memory instead of cudaArrays, nor with the Y axis used for interpolation instead of the X axis), I am beginning to think that the problem is something else.

Maybe a mistake in setting up the textures or in the cudaMemcpy, or maybe a memory corruption problem. I’ll find out.

Christian

Incorrectly set texture dimensions in the cudaResourceDesc turned out to be the cause of the strange effects when accessing border texels in our production code.
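
For reference, assuming a pitch-linear resource (the only resource type where the descriptor itself carries width and height), a correctly filled descriptor would look like this sketch; d_data, pitch, n and m are placeholders for values from cudaMallocPitch and the texture shape:

#include <string.h>
#include <cuda_runtime.h>

cudaResourceDesc make_pitch2d_desc(void *d_data, size_t pitch, int n, int m)
{
	cudaResourceDesc resDesc;
	memset(&resDesc, 0, sizeof(resDesc));
	resDesc.resType = cudaResourceTypePitch2D;
	resDesc.res.pitch2D.devPtr = d_data;                      // from cudaMallocPitch
	resDesc.res.pitch2D.desc = cudaCreateChannelDescHalf2();  // half2 texels, as in our case
	resDesc.res.pitch2D.width = n;                            // width in texels, not bytes
	resDesc.res.pitch2D.height = m;
	resDesc.res.pitch2D.pitchInBytes = pitch;                 // pitch from cudaMallocPitch
	return resDesc;
}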

Case closed. Sorry for the confusion.