Memory performance in image processing example

Hello!

I am trying to compare the performance of different memory types for digital image processing.

The main idea behind digital filters is a “moving window”: each processed pixel depends on its neighbours (x±1, y±1).

When I use standard global memory I get a baseline execution time. I guess the memory reads and writes are coalesced, because all threads do the same thing and no flow divergence takes place.

Texture memory (2D) is … unfortunately twice as slow. Very disappointing, since it is stated to be FASTER in such locality-dependent applications.
Is that possible, or did I make a terrible mistake somewhere in the code?

Regards,
Conicall

What kind of CUDA device are you using? If it has an L1 cache, it is almost certainly bigger than the texture cache and therefore better at servicing the 1D sliding window pattern you describe.

Edit: Wait, I just realized you were talking about a 2D access pattern. I’m still curious if you have a cache, but my comment about the window is totally wrong.

Also, I assume you loaded your data into a cudaArray? Binding the texture to linear memory does not rearrange the data layout to optimize for 2D access patterns.

What GPU do you use? On compute capability 2.x devices global memory is cached as well.

A commonly used technique is to load the moving window into shared memory so you have full control over the reuse of data loaded from global memory.
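As a sketch of that technique (the kernel name, tile size and single-channel assumption are mine, not from the original code), each block loads its tile plus a one-pixel apron into shared memory once, then every thread reads its 3×3 neighbourhood from the fast shared tile:

```cpp
#define BLOCK 16

// Sketch of a 3x3 filter using a shared-memory tile,
// assuming a single-channel unsigned char image of size w x h
// and a launch with BLOCK x BLOCK threads per block.
__global__ void filter3x3(const unsigned char* in, unsigned char* out,
                          int w, int h)
{
    // tile with a 1-pixel apron on each side
    __shared__ unsigned char tile[BLOCK + 2][BLOCK + 2];

    int x = blockIdx.x * BLOCK + threadIdx.x;
    int y = blockIdx.y * BLOCK + threadIdx.y;

    // cooperatively load the tile plus apron, clamping at the borders
    for (int dy = threadIdx.y; dy < BLOCK + 2; dy += BLOCK)
        for (int dx = threadIdx.x; dx < BLOCK + 2; dx += BLOCK) {
            int gx = min(max(blockIdx.x * BLOCK + dx - 1, 0), w - 1);
            int gy = min(max(blockIdx.y * BLOCK + dy - 1, 0), h - 1);
            tile[dy][dx] = in[gy * w + gx];
        }
    __syncthreads();

    if (x < w && y < h) {
        int sum = 0;
        for (int j = 0; j < 3; ++j)
            for (int i = 0; i < 3; ++i)
                sum += tile[threadIdx.y + j][threadIdx.x + i];
        out[y * w + x] = (unsigned char)(sum / 9);
    }
}
```

Each pixel is then loaded from global memory roughly once per block instead of up to nine times per neighbourhood.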

EDIT: Seibert was quicker…

Sometimes texture memory is slower, sometimes it is faster. In the case of a sequential access pattern without skips or gaps, global memory should be faster. One thing to keep in mind about texture vs global: Fermi has 16 or 48 KB of L1 global memory cache per SM, but only 6 to 8 KB of texture cache.

Thank you for your replies. I have a GTX 460M … I know it’s probably not the best card for CUDA apps, but it’s the best I have. It is a compute capability 2.1 device.

Unfortunately, I guess I didn’t. I am a newbie at CUDA, so what I did was:

texture<uchar, 2, cudaReadModeElementType> texDevice; // declare a 2D texture (uchar = unsigned char)

cudaMemcpy(device, bitmap, s, cudaMemcpyHostToDevice); // copy from host memory to device memory

cudaChannelFormatDesc desc = cudaCreateChannelDesc<uchar>(); // channel format descriptor for uchar

cudaBindTexture2D(NULL, texDevice, device, desc, w*3, h, w*3); // bind the memory to the texture
// width = w*3 and pitch = w*3 bytes, because each pixel is 3 interleaved R,G,B bytes
// (note: the pitch should really satisfy the device's texture pitch alignment,
// e.g. by allocating with cudaMallocPitch)

...

// and in the filter kernel, something like this:

rSum += tex2D(texDevice, 3*(x+i), y+j) * mask->m[tmp]; // fetching etc.

Regards,

Conicall

Actually, that should be fairly reasonable for what you are testing. The L1 and L2 cache in that GPU probably helped the global memory version of your code.

Ah, when you bind a texture to a standard device pointer, you get the benefit of the texture cache, but no specific improvements for 2D access. (And, as mentioned, the texture cache is much smaller than the L1 and L2 caches on your GPU.) Take a look at the simpleTexture example in the SDK. It shows how to use cudaMallocArray() and cudaMemcpyToArray() to create and fill a cudaArray. CUDA arrays store their data in a special order (something like a Z-order curve; see the Wikipedia article) which might help speed things up.

Actually, how big are the raw images in memory that you are processing? If they fit entirely in the L2 cache, I’m not sure if textures will be much help at all.

Thank you for the reply, I will look into CUDA arrays right away.
The image sizes vary, from small 512x512 bitmaps (~800 kB) to huge megapixel-heavy images.


Well, I’ve tried using textures in a different way:

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned); // uchar is unsigned, so the kind must be Unsigned, not Signed
//cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<uchar>(); // equivalent; I use an (RGB)(RGB)(RGB) layout, 3 bytes per pixel, row-major order

cudaArray* cu_array;
cudaMallocArray(&cu_array, &channelDesc, w*3, h); // w*3 because of the R, G and B channels
cudaMemcpyToArray(cu_array, 0, 0, bitmap, s, cudaMemcpyHostToDevice); // s is the size in bytes (w*h*3); bitmap is a host pointer

// set texture parameters
texDevice.addressMode[0] = cudaAddressModeClamp;
texDevice.addressMode[1] = cudaAddressModeClamp;
texDevice.filterMode = cudaFilterModePoint;
texDevice.normalized = 0; // access with unnormalized (texel) coordinates

// bind the array to the texture
cudaBindTextureToArray(texDevice, cu_array);

Unfortunately, the texture is still 2x slower than standard global memory… Any ideas what I could improve?