Using Textures

I am trying to map some data to a texture and am having difficulty understanding the API documentation. I basically have a large piece of data I want stored as a 2-D texture, with one 32-bit float element per coordinate. I understand that a texture is a good fit for this since I have 2-D spatial locality.

I have loaded the data onto the device and have a pointer to this memory, and I want to make it a texture using cudaBindTexture. (Is this what I am supposed to do?)

texture<float, 2, cudaReadModeElementType> image;
struct cudaChannelFormatDesc myChanFormDesc =
cudaCreateChannelDesc(32,0,0,0, cudaChannelFormatKindFloat);

cudaBindTexture(image, data, myChanFormDesc, size, 0);

^ This call is what confuses me - where I put “image”, the docs say to use a char* symbol - what is this piece of information? I assume it is not something that I define arbitrarily, but I have not found an API call that assigns one.

I also need to pass the texture reference to the device to use the texture data. I get a compiler complaint when I try to pass “image” as the parameter to the kernel.

I know I am making some simple mistakes, just can’t seem to find what I need in the docs very easily…

Here’s a simple snippet of some old test code I wrote that uses 2-D textures. Ignore everything but the texture stuff since this is a very slow version of the code.


// declare texture reference for the 2-D float4 texture
texture<float4, 2, cudaReadModeElementType> tex;

__global__ static void cenergy(int numatoms, float gridspacing, int z, float *energygrid) {
  int xindex  = (blockIdx.x * blockDim.x) + threadIdx.x;
  int yindex  = (blockIdx.y * blockDim.y) + threadIdx.y;
  int outaddr = (gridDim.x * blockDim.x * yindex) + xindex;

  float3 coor;
  coor.x = gridspacing * xindex;
  coor.y = gridspacing * yindex;
  coor.z = gridspacing * z;

  int atomid, tx, ty;
  float energyval = 0.0f;
  for (atomid = 0, tx = 0, ty = 0; atomid < numatoms; ty++) {
    for (tx = 0; tx < TEXROWSIZE && atomid < numatoms; tx++, atomid++) {
      float4 atominfo = texfetch(tex, tx, ty);
      float dx = coor.x - atominfo.x;
      float dy = coor.y - atominfo.y;
      float dz = coor.z - atominfo.z;
      energyval += atominfo.w / sqrtf(dx*dx + dy*dy + dz*dz);
    }
  }

  energygrid[outaddr] = energyval;
}


int copyatomstotexture(float *atoms, int count, cudaArray **darray) {
  cudaMallocArray(darray, &tex.channelDesc, TEXROWSIZE, (count / TEXROWSIZE));
  cudaMemcpyToArray(*darray, 0, 0, atoms, count*4*sizeof(float), cudaMemcpyHostToDevice);

  // set texture parameters
  tex.addressMode[0] = cudaAddressModeClamp;
  tex.addressMode[1] = cudaAddressModeClamp;
  tex.filterMode = cudaFilterModePoint;
  tex.normalized = false; // do not normalize coordinates

  // Bind the array to the texture
  cudaBindTexture(tex, *darray);

  return 0;
}

Note: The count parameter for cudaMemcpyToArray() appears to be the number of bytes. The docs list it as the number of array elements.

Good catch, thanks. We’ll fix the doc.

I am still curious how I can use the other call to cudaBindTexture that does not use an array (the other one in the doc). How is this call used?

Use cudaMalloc2D or cudaMalloc. In either case, you need to use 1-D addressing in your texfetch() call and declare your texture as 1-D (only CUDA Arrays allocated with cudaMallocArray work as 2-D textures). When you use cudaMalloc2D, you have to use the pitch returned by cudaMalloc2D in the addressing, something like this:

texture<float> input_tex;

__global__ void kernel_texture(float *output_d, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // pitch is in bytes; >>2 converts it to float elements
    output_d[y*(pitch>>2)+x] = 2 * texfetch(input_tex, y*(pitch>>2)+x);
}
That pitch>>2 is important: it is the pitch in elements rather than in bytes. If your elements are floats, they are 4 bytes each, so you shift right by 2 bits; if your elements are shorts, you shift right by only 1 bit.

Note also that, unlike CUDA Array texture references, linear-memory texture references will not benefit as much from the texture cache, due to the memory layout.



According to the doc, 2-D textures can be of integer types. But when I change the float type to unsigned short or unsigned int in the simpleTexture example and launch it in EmuRelease mode, the kernel blocks without doing anything. Even a printf(“here\n”); at the beginning of the kernel is not printed. Have I forgotten something? Can you help me?

Here is the modified code with unsigned short:


// declare texture reference for 2D float texture
texture<unsigned short, 2, cudaReadModeElementType> tex;

//! Transform an image using texture lookups
//! @param g_odata output data in global memory
__global__ void
transformKernel(unsigned short* g_odata, int width, int height, float theta)
{
    // calculate normalized texture coordinates
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    float u = x / (float) width;
    float v = y / (float) height;

    // transform coordinates
    u -= 0.5;
    v -= 0.5;
    float tu = u*cos(theta) - v*sin(theta) + 0.5;
    float tv = v*cos(theta) + u*sin(theta) + 0.5;

    // read from texture and write to global memory
    g_odata[y*width + x] = texfetch(tex, tu, tv);
}

#endif // #ifndef SIMPLETEXTURE_KERNEL_H

and the main code:

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cutil.h>

// includes, kernels
#include <>

char *image_filename = "lena_bw.pgm";
float angle = 0.5f; // angle to rotate image by (in radians)

// declaration, forward
void runTest( int argc, char** argv);

extern "C"
void computeGold( float* reference, float* idata, const unsigned int len);

// Program main
int main( int argc, char** argv)
{
    runTest( argc, argv);

    CUT_EXIT(argc, argv);
}

//! Run a simple test for CUDA
void runTest( int argc, char** argv)
{
    // load image from disk
    unsigned short* h_data = NULL;
    unsigned int width, height;
    char* image_path = cutFindFilePath(image_filename, argv[0]);
    if (image_path == 0)
        exit(EXIT_FAILURE);
    CUT_SAFE_CALL( cutLoadPGMs(image_path, &h_data, &width, &height));

    unsigned int size = width * height * sizeof(unsigned short);
    printf("Loaded '%s', %d x %d pixels\n", image_filename, width, height);

    // allocate device memory for result
    unsigned short* d_data = NULL;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_data, size));

    // allocate array and copy image data
    cudaArray* cu_array;
    CUDA_SAFE_CALL( cudaMallocArray( &cu_array, &tex.channelDesc, width, height ));
    CUDA_SAFE_CALL( cudaMemcpy( cu_array, h_data, size, cudaMemcpyHostToDevice));

    // set texture parameters
    tex.addressMode[0] = cudaAddressModeWrap;
    tex.addressMode[1] = cudaAddressModeWrap;
    tex.filterMode = cudaFilterModeLinear;
    tex.normalized = true;    // access with normalized texture coordinates

    // Bind the array to the texture
    cudaBindTexture( tex, cu_array);

    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    // warmup
    transformKernel<<< dimGrid, dimBlock, 0 >>>( d_data, width, height, angle);

    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer( &timer));
    CUT_SAFE_CALL( cutStartTimer( timer));

    // execute the kernel
    transformKernel<<< dimGrid, dimBlock, 0 >>>( d_data, width, height, angle);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    CUT_SAFE_CALL( cutStopTimer( timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
    printf("%.2f Mpixels/sec\n", (width*height / (cutGetTimerValue( timer) / 1000.0f)) / 1e6);
    CUT_SAFE_CALL( cutDeleteTimer( timer));

    // allocate mem for the result on host side
    unsigned short* h_odata = (unsigned short*) malloc( size);
    // copy result from device to host
    CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_data, size, cudaMemcpyDeviceToHost) );

    // write result to file
    char output_filename[1024];
    strcpy(output_filename, image_path);
    strcpy(output_filename + strlen(image_path) - 4, "_out.pgm");
    CUT_SAFE_CALL( cutSavePGMs( output_filename, h_odata, width, height));
    printf("Wrote '%s'\n", output_filename);

    // write regression file if necessary
    if( cutCheckCmdLineFlag( argc, (const char**) argv, "regression")) {
        // write file for regression test
        //  CUT_SAFE_CALL( cutWriteFilef( "./data/regression.dat", h_odata, width*height, 0.0));
    }

    // cleanup memory
}

Thanks for the help!

Have you tried cudaMemcpyToArray() instead of cudaMemcpy()?

Odd – the SDK’s sample code uses cudaMemcpy() to copy host data to an array, instead of cudaMemcpyToArray()…

Is that allowed? Or recommended? :unsure:

P.S. I for one am still unclear on the usage subtleties of cudaMemcpy() vs. cudaMemcpy2D() vs. cudaMemcpyToArray() vs. cudaMemcpy2DToArray(). The documentation and SDK sample code are pretty terse (and sometimes wrong!) about using these for initializing and using textures. A primer with a few examples (a sticky thread from one of the CUDA development team?) would go a long way toward clarifying the topic. Thanks in advance!

I’ve just tried cudaMemcpyToArray(); the result is the same as with cudaMemcpy() (it works with float but not with unsigned short).

I’ve also tried on the board (release mode), and it’s the same.

Does anybody have an example that works with a texture array of unsigned short or int?


I inserted the line:

// check if kernel execution generated an error

CUT_CHECK_ERROR("Kernel execution failed");

after the call to transformKernel(), and compiled with -D_DEBUG.

I strongly recommend checking for errors after every CUDA call, and compiling with -D_DEBUG. You’ll catch a lot of problems that way.

The resulting error is:

Cuda error: Kernel execution failed in file ‘’ in line 110 : linear filtering not supported for non-float type.

This is mentioned on p. 24 of the CUDA Programming Guide (v0.8): “cudaFilterModeLinear is only valid for returned values of floating-point type.”

Thanks very much. As I can see, error messages are really useful!