I am trying to check different memory performance of digital image processing.
The main idea behind digital filters is a “moving window”, each processed pixel depends on its neighbours (x±1, y±1).
When i use standard, global memory I get a standard execution time. I guess the memory reads and writes are coalesced, because all threads do the same thing, no flow separation takes place.
Texture memory (2D) is … unfortunately twice slower. Very disappointing, since it is stated to be FASTER in such local-memory-dependent applications.
Is it possible or did I probably make a terrible mistake somewhere in the code?
What kind of CUDA device are you using? If it has an L1 cache, it is almost certainly bigger than the texture cache and therefore better at servicing the 1D sliding window pattern you describe.
Edit: Wait, I just realized you were talking about a 2D access pattern. I’m still curious if you have a cache, but my comment about the window is totally wrong.
Also, I assume you loaded your data into a cudaArray? Binding the texture to linear memory does not rearrange the data pattern to optimize for 2D access patterns.
Sometimes texture memory is slower, sometimes it is faster. In case of a sequential access pattern without skips or gaps, global memory should be faster. One thing to keep in mind about texture vs global, is that Fermi has 16 to 48 kb of L1 global memory cache per SM, but only 6 to 8 kb of texture memory cache.
Thank you for your replies. I have a GTX 460M … i know it’s probably not the best card to program CUDA apps, but it’s the best I have. It is a 2.1 device
Unfortunately, I guess i didn’t. I am a newbie at CUDA, so what I did was:
texture<uchar, 2, cudaReadModeElementType> texDevice; //declare a 2D texture (uchar = unsigned char)
cudaMemcpy(device,bitmap, s, cudaMemcpyHostToDevice ); //copy from host memory to device memory
cudaChannelFormatDesc desc = cudaCreateChannelDesc<uchar>(); //whatever ;-)
cudaBindTexture2D(NULL, texDevice, device, desc, w*3,h, w*3); //and the memory to texture
...
//and in the filter kernel, something like this:
rSum += tex2D(texDevice, 3*(x+i) , y+j) * mask->m[tmp]; //fetching etc.
Actually, that should be fairly reasonable for what you are testing. The L1 and L2 cache in that GPU probably helped the global memory version of your code.
Ah, when you bind a texture to a standard device pointer, you are getting the benefit of the texture cache, but no specific improvements for 2D access. (And, as mentioned, the texture cache is way smaller than the L1 and L2 cache on your GPU.) Take a look at the simpleTexture example in the SDK. It shows how to use cudaMallocArray() and cudaMemcpyToArray() to create and fill a cudaArray. CUDA Arrays store their data in a special order (something like this Z-order curve - Wikipedia) which might help speed up things.
Actually, how big are the raw images in memory that you are processing? If they fit entirely in the L2 cache, I’m not sure if textures will be much help at all.
Thank you for reply, i will look into cuda arrays right away.
The image sizes vary, from small 512x512 bitmaps (~800kB) to huge megapixel-heavy images.
Well, I’ve tried using textures in a different way
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindSigned);
//cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<uchar>(); // i use a (RGB)(RGB)(RGB) form, each 3 bytes is one pixel, row-major order
cudaArray* cu_array;
cudaMallocArray( &cu_array, &channelDesc, w*3, h); //w*3 because R,G and B
cudaMemcpyToArray( cu_array, 0, 0, bitmap, s, cudaMemcpyHostToDevice); //s is size, w*h*3..., bitmap is a host pointer
// set texture parameters
texDevice.addressMode[0] = cudaAddressModeClamp;
texDevice.addressMode[1] = cudaAddressModeClamp;
texDevice.filterMode = cudaFilterModePoint;
texDevice.normalized = 0; // access with standard coords
// Bind the array to the texture
cudaBindTextureToArray( texDevice, cu_array);
Unfortunately, texture is still 2x slower than standard global memory… Any ideas what could I improve?