How to define texture properly

I have a question on how to define texture memory properly.

In host memory, I have a float array and a 2D array of points, where each point is an (x, y, z, w) struct.

I define the texture as:
texture<float, 1, cudaReadModeElementType> tex_U;
texture<float4, 2, cudaReadModeElementType> tex_p;

In the host function, I define the texture to get a copy of the float array:

cudaChannelFormatDesc channelDescU = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray* cu_arrayU;
CUDA_SAFE_CALL( cudaMallocArray( &cu_arrayU, &channelDescU, n, 1 )); 
CUDA_SAFE_CALL( cudaMemcpyToArray( cu_arrayU, 0, 0, U, U_mem_size, cudaMemcpyHostToDevice));
// set texture parameters
tex_U.addressMode[0] = cudaAddressModeWrap;
tex_U.filterMode = cudaFilterModePoint;
tex_U.normalized = false;    
// Bind the array to the texture
CUDA_SAFE_CALL( cudaBindTextureToArray( tex_U, cu_arrayU, channelDescU));

I also define the texture to get a copy of the point 2D array:
cudaChannelFormatDesc channelDescP = cudaCreateChannelDesc<float4>();
cudaArray* cu_arrayP;
CUDA_SAFE_CALL( cudaMallocArray( &cu_arrayP, &channelDescP, n, m ));
CUDA_SAFE_CALL( cudaMemcpyToArray( cu_arrayP, 0, 0, P, p_mem_size, cudaMemcpyHostToDevice));
// set texture parameters
tex_p.addressMode[0] = cudaAddressModeWrap;
tex_p.addressMode[1] = cudaAddressModeWrap;
tex_p.filterMode = cudaFilterModePoint;
tex_p.normalized = false;
// Bind the array to the texture
CUDA_SAFE_CALL( cudaBindTextureToArray( tex_p, cu_arrayP, channelDescP));

When I try to access these textures in a global function:
tex1D(tex_U, 1.0f);
tex2D(tex_p, 1.0f, 1.0f);

They throw a cudaError and all the threads quit in device emulation mode.

I guess I did not define the textures correctly. How should I define them? Thank you,

What is the error reported by cudaGetErrorString()?

The only error I can see is that you are using addressModeWrap and normalized=false. This is not supported. Wrapping can only be done with normalized texture coords. From your texture reads (reading coords 1.0f), perhaps normalized=true is what you intended after all?
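In other words, a consistent configuration is one of these two (a sketch, reusing the tex_U reference from the original post):

```cuda
// Unnormalized coordinates: only clamp (or border) addressing is legal.
tex_U.normalized = false;
tex_U.addressMode[0] = cudaAddressModeClamp;   // index in [0, width)

// Normalized coordinates: wrap addressing becomes legal.
tex_U.normalized = true;
tex_U.addressMode[0] = cudaAddressModeWrap;    // coord in [0.0, 1.0)
```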

And a tip: You can access the channel descriptor with tex_U.channelDesc for the cudaMallocArray instead of using createChannelDesc. It’s a small thing, but it can help prevent errors if you mistype something in createChannelDesc (which you don’t appear to have done).
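Concretely, the tip looks like this (a sketch, reusing n and CUDA_SAFE_CALL from the code above):

```cuda
// Let the texture reference supply its own channel descriptor,
// so the array format can never disagree with the texture format.
cudaArray* cu_arrayU;
CUDA_SAFE_CALL( cudaMallocArray( &cu_arrayU, &tex_U.channelDesc, n, 1 ));
CUDA_SAFE_CALL( cudaBindTextureToArray( tex_U, cu_arrayU, tex_U.channelDesc ));
```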

Thanks for the reply!

I changed cudaAddressModeWrap to cudaAddressModeClamp and tried to run the program again. The threads quit with the same error.

The error message from cudaGetErrorString is:

cannot fetch from a texture that is not bound

I actually do not want to use normalized textures. I just want to store the data in texture memory so that it is quicker for the global function to access the constant data.

By the way, I saw that every time we want to use texture memory, it needs ‘warm-up’ by running the kernel once first. What is the purpose? Is it to load the texture into cache in the multi-processor?

Any “warm up” runs are for timing purposes. The first call of any kernel is going to cause it to be compiled and downloaded to the device. That takes a while and you don’t want to include that time in any benchmark.
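A typical benchmarking pattern looks like this (a sketch; myKernel, grid, and block are placeholders, and the arguments are elided):

```cuda
// Warm-up launch: absorbs the one-time compile/download cost.
myKernel<<<grid, block>>>(/* args */);
cudaThreadSynchronize();               // wait for the warm-up to finish

// Timed run, measured with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(/* args */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);
```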

As for the error message you are getting: are you by chance calling cudaBindTextureToArray in one .cu file and then calling the kernel from another .cu file that was compiled separately and then linked into the executable?

Yes. My files are organized as follows:

  1. mylib.cuh: definition of the texture tex_U, tex_p, and declaration of global function;

  2. mylib.cu: definition of a global function wherein tex1D() and tex2D() are called;

  3. caller.cu: definition of a host function mycaller(), which initializes CUDA, allocates the CUDA arrays, copies the float array into the CUDA array, and calls the global function defined in mylib.cu;

  4. In the main(), I call mycaller();

What should be corrected? Thank you!

The simplest way to get this working is to create a “main.cu” file which simply includes all of your other .cu files. This “main.cu” is then the only one you compile with nvcc. This is how I do it in my code.
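For example, assuming the file names from this thread, main.cu would just be:

```cuda
// main.cu -- the only file passed to nvcc. Including everything here
// puts the texture references and the kernels that read them into one
// compilation unit, so the binding is visible to the kernel.
#include "mylib.cuh"
#include "mylib.cu"
#include "caller.cu"
```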

You are right. It works now. The simpleTexture example in CUDA SDK actually compiles only one .cu file.

I still have some trouble in the point 2D array.

The point is of a struct:

typedef struct {
    float XX;
    float YY;
    float ZZ;
    float WW;
} WPoint;

P is a 2D array of WPoint. The size of P is n x m.

The corresponding texture is defined as:

cudaChannelFormatDesc channelDescP = cudaCreateChannelDesc<float4>();
cudaArray* cu_arrayP;
CUDA_SAFE_CALL( cudaMallocArray( &cu_arrayP, &channelDescP, n, m ));
unsigned int p_mem_size = sizeof(WPoint) * n * m;
CUDA_SAFE_CALL( cudaMemcpyToArray( cu_arrayP, 0, 0, P, p_mem_size, cudaMemcpyHostToDevice));
// set texture parameters
tex_p.addressMode[0] = cudaAddressModeClamp;
tex_p.addressMode[1] = cudaAddressModeClamp;
tex_p.filterMode = cudaFilterModePoint;
tex_p.normalized = false;
// Bind the array to the texture
CUDA_SAFE_CALL( cudaBindTextureToArray( tex_p, cu_arrayP, channelDescP));

I thought a float4 texture would accept WPoint struct data verbatim. Is that not true? If not, what would be a good workaround? I guess it is not a good idea to use float4 elsewhere in the main C++ program?

By the way, why can’t I read the content of cu_arrayP after I copied the content of P array to cu_arrayP? I had tried to read cu_arrayP[0]. Is there any way to read it?

Thank you!

You need to use float4; it is designed to be usable anywhere in the C++ program. Just include cuda_runtime_api.h to get it. float4 has enforced 16-byte alignment, which is likely the cause of the problem you see converting your point struct to float4.

You cannot read cudaArrays directly. They are stored in a “special” way to improve 2D texture cache lookups. If you need to access the same memory structure in both a coalesced and a textured fashion, you need to keep a “mirror” copy in normal device memory. Updates to this are double buffered, so you can write to the device memory version and then perform a (very fast) device->device copy to update the Array from the device memory.
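The mirror idea, as a sketch (updateKernel, grid, and block are placeholders; cu_arrayP, n, and m are the ones from earlier in the thread):

```cuda
// d_mirror is ordinary linear device memory that kernels read and
// write with coalesced accesses; cu_arrayP backs the texture.
float4* d_mirror;
CUDA_SAFE_CALL( cudaMalloc( (void**)&d_mirror, n * m * sizeof(float4) ));

updateKernel<<<grid, block>>>(d_mirror, n, m);   // write new values

// Fast device-to-device copy to refresh the texture's backing array:
CUDA_SAFE_CALL( cudaMemcpyToArray( cu_arrayP, 0, 0, d_mirror,
                                   n * m * sizeof(float4),
                                   cudaMemcpyDeviceToDevice ));
```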

I took your advice and I think I am almost there. The kernel is working.

There are still two more problems with getting the output correctly. I am using the following code to allocate the output:

float4* d_odata;
size_t pitch;
cudaMallocPitch( (void**)&d_odata, &pitch, n * sizeof(float4), m );

d_odata is the device memory which holds the output of the global function.

The d_odata and pitch are passed as arguments into the global function.

In the global function, I am using the following code to write the output:

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

float4* row = (float4*)( (char*)odata + ty * pitch );
row[tx].x = x;
row[tx].y = y;
row[tx].z = z;
row[tx].w = 1.0f;

The above code was adapted from Section 4.5.2.3 of the CUDA programming guide v1.0.

Here, according to cudaMallocPitch(), the pitch value is 128. However, this pitch value seems to be wrong: the true pitch seems to be 96. Why? How can I query the correct pitch value programmatically?

And why do I need to put (char*) in front of odata?

Another question:

After kernel execution, I want to copy it back to a float4** 2D array (resultP) in the host function by using the following command:

CUDA_SAFE_CALL( cudaMemcpy2D( *resultP, pitch, d_odata, pitch, (n), (m), cudaMemcpyDeviceToHost));

This only gets me a correct result in the first row of resultP. How do I copy d_odata correctly?

Thank you very much,

I guess cudaMallocPitch internally pads each row up to the nearest multiple of 64, per my observation. I don’t know why it is 64. Anyway, cudaMallocPitch doesn’t work for me, so I switched to cudaMalloc and used a 1D array to represent the 2D array. It finally works. Thanks, MisterAnderson42, for the help!

It’s 64 because the alignment on current hardware is 256 bytes and you used a data type of 4 bytes. (Search the API documentation for textureAlignment or textureAlign for more info.)
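For reference, a minimal sketch that queries the alignment at runtime (device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // textureAlignment is reported in bytes; pitches returned by
    // cudaMallocPitch are rounded up to a multiple of the device's
    // alignment requirement.
    printf("texture alignment: %lu bytes\n",
           (unsigned long)prop.textureAlignment);
    return 0;
}
```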

/Pyry