Problems when reading from 2D Texture

Hello guys, need some help with a (hopefully trivial) problem.

I want to read values from a 2D Texture and updating them by using a device pointer. I use the same approach as in the book “CUDA by example” to achieve this.

float* device_ptr;

texture<float,2> tex;

int main()

{

	const int DIM = 4;

	int size = DIM * DIM * sizeof(float);

	cudaMalloc( (void**)&device_ptr , size );

	cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

	cudaBindTexture2D( NULL , tex , device_ptr , desc , DIM , DIM , sizeof(float) * DIM ) ;

	float temp[16];

	for (int y = 0; y < DIM ; y++)

		for (int x = 0; x < DIM ; x++)

			temp[ x + y * DIM ] = x + y * DIM;

	cudaMemcpy( device_ptr , temp , size , cudaMemcpyHostToDevice );

	dim3 grids(DIM/2,DIM/2);

	dim3 threads(2,2);

	kernel<<<grids,threads>>>(device_ptr);

	cudaUnbindTexture( tex );

	cudaFree( device_ptr );

	return 0;

}

The problem is that somehow the 2D texture doesn’t seem to work properly for me. Instead of getting a 4x4 Texture, I only seem to get a 1x4 Texture. When I try to fetch values from row 2 3 4, it seems I get clamped values from row 1.

I used cuPrintf (to write values in the console) together with following kernel:

__global__ void kernel (float* device_ptr)

{

	int x = threadIdx.x + blockIdx.x * blockDim.x;

	int y = threadIdx.y + blockIdx.y * blockDim.y;

	int offset = x + y * blockDim.x * gridDim.x;

	cuPrintf("Texture value at x:%i y:%i  =  %f\n" , x , y, tex2D( tex , x , y ) );

	cuPrintf("device_ptr value at x:%i y:%i  =  %f \n\n" , x , y , device_ptr[offset] );

}

The output:

[0, 0]: Texture value at x:0 y:0  =  0.000000

[0, 0]: device_ptr value at x:0 y:0  =  0.000000

[0, 1]: Texture value at x:1 y:0  =  1.000000

[0, 1]: device_ptr value at x:1 y:0  =  1.000000

[0, 2]: Texture value at x:0 y:1  =  0.000000

[0, 2]: device_ptr value at x:0 y:1  =  4.000000

[0, 3]: Texture value at x:1 y:1  =  1.000000

[0, 3]: device_ptr value at x:1 y:1  =  5.000000

[1, 0]: Texture value at x:2 y:0  =  2.000000

[1, 0]: device_ptr value at x:2 y:0  =  2.000000

[1, 1]: Texture value at x:3 y:0  =  3.000000

[1, 1]: device_ptr value at x:3 y:0  =  3.000000

[1, 2]: Texture value at x:2 y:1  =  2.000000

[1, 2]: device_ptr value at x:2 y:1  =  6.000000

[1, 3]: Texture value at x:3 y:1  =  3.000000

[1, 3]: device_ptr value at x:3 y:1  =  7.000000

[2, 0]: Texture value at x:0 y:2  =  0.000000

[2, 0]: device_ptr value at x:0 y:2  =  8.000000

[2, 1]: Texture value at x:1 y:2  =  1.000000

[2, 1]: device_ptr value at x:1 y:2  =  9.000000

[2, 2]: Texture value at x:0 y:3  =  0.000000

[2, 2]: device_ptr value at x:0 y:3  =  12.000000

[2, 3]: Texture value at x:1 y:3  =  1.000000

[2, 3]: device_ptr value at x:1 y:3  =  13.000000

[3, 0]: Texture value at x:2 y:2  =  2.000000

[3, 0]: device_ptr value at x:2 y:2  =  10.000000

[3, 1]: Texture value at x:3 y:2  =  3.000000

[3, 1]: device_ptr value at x:3 y:2  =  11.000000

[3, 2]: Texture value at x:2 y:3  =  2.000000

[3, 2]: device_ptr value at x:2 y:3  =  14.000000

[3, 3]: Texture value at x:3 y:3  =  3.000000

[3, 3]: device_ptr value at x:3 y:3  =  15.000000

So, did I miss anything?

Hmm!

If I change

const int DIM = 4;

to

const int DIM = 16;

OR

const int DIM = 32;

OR 

const int DIM = 64;

etc

… it works. Only worked with DIM = 2^x, x > 3 (i.e. 25,26,27 didnt work)

I´m having the same problem, hopefully someone will solve this soon… External Image

Hi,

have you checked cudaMalloc2D and cudaMallocPitch?

–pium

I have read about them but didn’t try to use them. I read about 2d textures in the Cuda by example book and they do exactly as kerp describes.

But i´m am wondering about one thing, the pitch. In the exampels in the book they set the pitch as sizeof(float)*width. That makes me think that the pitch is the size in bytes of each row.

In cudaMalloc2D u have to pass both pitch off the dst and src so what should be passed there?

Api ref on malloc2d

//Chris

cudaMallocPitch() returns you the pitch. The “memory width” is at less the with of your table but generally it is bigger to respect hardware restriction on memory aligment (certainly something around power of 2).
If you are copying from the host, certainly the src pitch is the width (in bytes).

Note I am not familiar with these functions but it is what I understand by reading the doc.

I am not an expert on textures, but to my knowledge, due to layout restrictions, 2D textures cannot generally be bound to linear memory. Either use pitch-linear memory allocated via cudaMallocPitch() or a cudaArray allocated via cudaMallocArray(). Below is a modified app that uses pitch-linear memory. Note that I have changed the numbering of the matrix elements versus the original app, for ease of testing on my side. I am using device-side printf(), which requires a sm_2x platform; simply change back to cuPrintf() if needed.

#include <stdio.h>

#include <stdlib.h>

#define DIM 4

#if (DIM % 2)

#error DIM must be a multiple of 2

#endif

// Macro to catch CUDA errors in CUDA runtime calls

#define CUDA_SAFE_CALL(call)                                          \

do {                                                                  \

    cudaError_t err = call;                                           \

    if (cudaSuccess != err) {                                         \

        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\

                 __FILE__, __LINE__, cudaGetErrorString(err) );       \

        exit(EXIT_FAILURE);                                           \

    }                                                                 \

} while (0)

texture<float,2> tex;

__global__ void kernel (float* device_ptr, int pitch)

{

    int x = threadIdx.x + blockIdx.x * blockDim.x;

    int y = threadIdx.y + blockIdx.y * blockDim.y;

    int offset = x + y * (pitch/sizeof(float));

    printf ("Texture value at x:%i y:%i  =  %f\n" , 

            x , y, tex2D( tex , x+0.5f , y+0.5f ) );   

    printf ("device_ptr value at x:%i y:%i  =  %f\n" , 

            x , y , device_ptr[offset] );

}

int main()

{

    float temp[DIM*DIM];

    size_t tex_ofs = 0;

    size_t pitch = 0;

    float *device_ptr = 0;

    CUDA_SAFE_CALL (cudaMallocPitch ((void**)&device_ptr, &pitch, DIM, DIM));

    CUDA_SAFE_CALL (cudaMemset (device_ptr, 0xff, DIM*pitch));

    for (int y = 0; y < DIM ; y++) {

        for (int x = 0; x < DIM ; x++) {

            temp[ x + y * DIM] = y + x * DIM;

        }

    }

    CUDA_SAFE_CALL (cudaMemcpy2D (device_ptr,

                                  pitch,

                                  temp,

                                  DIM*sizeof(float),

                                  DIM*sizeof(float),

                                  DIM,

                                  cudaMemcpyHostToDevice));

    CUDA_SAFE_CALL (cudaBindTexture2D (&tex_ofs,

                                       &tex,

                                       device_ptr,

                                       &tex.channelDesc,

                                       DIM,

                                       DIM, 

                                       pitch));

    if (tex_ofs != 0) {

        printf ("texture offset is not 0\n");

        exit(EXIT_FAILURE);

    }

    dim3 grids(DIM/2,DIM/2);

    dim3 threads(2,2);

    kernel<<<grids,threads>>>(device_ptr, pitch);

    CUDA_SAFE_CALL (cudaUnbindTexture (tex));

    CUDA_SAFE_CALL (cudaFree (device_ptr));

    return 0;

}

Hey again!

I played around some with your code this morning and found out that it didn’t work for dimensions greater than 2^9 (returning invalid argument in the safe call macro).

I think the error is in the cudaMallocPitch, so just changed

cudaMallocPitch ((void**)&device_ptr, &pitch, DIM, DIM);

to

cudaMallocPitch ((void**)&device_ptr, &pitch, DIM*sizeof(float), DIM);

Lazy as I am, I just googled the cudaMallocPitch function to get a link to the api and got this link that says nothing about width (in bytes)

http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/group__CUDART__MEMORY_g80d689bc903792f906e49be4a0b6d8db.html

and did obviously not notice the /cuda/2_3/ in the url. In 3.2 documentation, width in bytes is mentioned!

http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/online/group__CUDART__MEMORY_g80d689bc903792f906e49be4a0b6d8db.html

This also works for arbitrary dimensions, i.e:

cudaMallocPitch ((void**)&device_ptr, &pitch, DIM_X*sizeof(float), DIM_Y);