Why texture/constant memory under FERMI architecture

qianmi · November 3, 2010, 4:13pm

Hello:

I am using a C2050 and I suppose it has the FERMI architecture.

My question is: under the FERMI architecture, why do we still need texture and constant memory? My understanding is that we benefit from them (texture/constant memory) because they cache data. But the FERMI architecture already has cache capabilities on the GPU, so what do we still need texture and constant memory for?

Thanks for any of your insights,

Ming Qian

AlexanderMalishev · November 3, 2010, 5:26pm

Constant memory can sometimes be a bit faster, because it doesn’t require a separate load instruction (sometimes!).

A random load from texture can be a lot faster than a load from global memory. The FERMI L1 cache to L2 datapath is 128 bytes wide, so with random access you cannot use all of the L1-to-L2 (and memory) bandwidth. The FERMI L1 texture cache to L2 datapath is 32 bytes wide (my guess!), so random loads are not a problem.

Global memory accesses can also be 32 bytes wide if they bypass the L1 cache altogether (compiler switch). With the texture cache you get both: L1 caching and small load granularity.

You have a C2050, so you could try to get an official answer from NVIDIA. Please don’t forget to repost the answer here :)

See also http://forums.nvidia.com/index.php?showtopic=184388&pid=1140929&start=&st=#entry1140929
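
The granularity argument in the post above can be put into rough numbers. What follows is a minimal host-side C++ sketch (a model, not device code and not a measurement). It assumes a 32-thread warp, one 4-byte load per thread, and the 128-byte and 32-byte granularities quoted in this thread; the function names are hypothetical and cache hits are ignored.

```cpp
#include <cassert>

// Assumptions for this sketch: 32 threads per warp, one 4-byte load each.
const unsigned kWarpSize = 32;
const unsigned kLoadBytes = 4;

// Bytes the warp actually needs.
unsigned useful_bytes() { return kWarpSize * kLoadBytes; }

// Bytes moved when the 32 loads are contiguous and aligned:
// the request is rounded up to whole segments of `granularity` bytes.
unsigned coalesced_bytes(unsigned granularity) {
    assert(granularity > 0);
    unsigned segments = (useful_bytes() + granularity - 1) / granularity;
    return segments * granularity;
}

// Bytes moved in the worst scattered case: every thread's load falls
// into a different segment, so each load drags in one full segment.
unsigned scattered_bytes(unsigned granularity) {
    assert(granularity >= kLoadBytes);
    return kWarpSize * granularity;
}

// Fraction of the moved bytes that the warp actually uses.
double efficiency(unsigned moved_bytes) {
    assert(moved_bytes >= useful_bytes());
    return static_cast<double>(useful_bytes()) / moved_bytes;
}
```

Under these assumptions a contiguous warp moves 128 bytes at either granularity. A fully scattered warp moves 4096 bytes with 128-byte lines but only 1024 bytes with 32-byte segments, which is a quarter of the traffic for the same 128 useful bytes.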

qianmi · November 3, 2010, 5:52pm

Alex:

Thanks for the reply. I am doing 5x5 Gaussian image filtering, and each consecutive thread calculates one pixel. I suppose this is not random access?

As you said, under the random-access assumption texture memory could perform better. I am very new to this type of processing; can you give me a simple example of random access?

Thanks,

Ming

qianmi · November 3, 2010, 6:10pm

> Texture load could be faster than normal global memory. Global memory load granularity is 128 bytes. If you load 4 bytes, hardware fetches 124 neighbors too. Texture load granularity is 128 bytes at L1 level, but only 32 bytes at L2 and memory level

In my case, I believe the global memory load is better, because my next thread needs the next 4 bytes, which, as you suggested, should already have been fetched…

The same goes for the constants: I have 25 filter coefficients that are used by all the threads. They only need to be read from global memory once, and they will then stay in the cache for all the upcoming threads.

For random access it might be a different story. As stated in your example, the 124 neighbors won’t be useful because the next thread won’t use them…

Correct me if my understanding is wrong.

Thanks,

Ming Qian

AlexanderMalishev · November 3, 2010, 6:43pm

You are right, L1 cache is quite good here.

Maybe the constant cache would be slightly better for the coefficients. Not sure it is worth the trouble.

By the way, there is a CUDA SDK example, “convolutionSeparable”, that covers exactly your problem. It uses shared memory to emulate a cache on old devices.

happyjack272 · November 3, 2010, 6:53pm

random access isn’t necessarily “random”, but it’s in any case not linear. neither is it square nor cubic. “random” in this context really only means non-regular, and more specifically access that doesn’t line up with the memory architecture.

in your case the access is pretty regular so it shouldn’t be too difficult to make it line up pretty well. since you have a 5x5 filter, presumably you’re reading each pixel 5x5=25 times. in that case i’d recommend reading sections of the image into shared memory and working with it there. you could get something like a 25x speedup.

in that case store the 5x5 filter in constant memory, which should be as fast as shared memory, without using any shared memory.

and i believe the speedup of texture memory (besides free interpolation) comes when you access it linearly or in a blockwise (2-d) fashion. if you’re storing to shared memory you can definitely read it in linearly. maybe in each block read in (x-2,y-2) to (x+w+2,y+h+2) and then run the filter in parallel from (x,y) to (x+w,y+h).

an advantage of this is that varying your filter size will have a fairly negligible effect on your i/o utilization. and it’s deliberate and exact, unlike caching, which is more heuristic and might make the wrong decisions.
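
The reuse argument in the post above can be checked with a little arithmetic. This is a minimal host-side C++ sketch, assuming a thread block that produces a w x h patch of output pixels with a k x k filter (k = 5 in this thread). The function names are hypothetical. It counts global-memory reads only and says nothing about run time, since the L1 cache already captures part of this reuse.

```cpp
#include <cassert>

// Global-memory reads per block if every thread fetches all k*k taps itself.
long long naive_reads(int w, int h, int k) {
    return 1LL * w * h * k * k;
}

// Reads per block if the block first stages the output patch plus its halo,
// a (w + k - 1) x (h + k - 1) tile, and then filters out of that tile.
long long tiled_reads(int w, int h, int k) {
    return 1LL * (w + k - 1) * (h + k - 1);
}

// How many times fewer global reads the staged version issues.
double reuse_factor(int w, int h, int k) {
    assert(w > 0 && h > 0 && k > 0 && k % 2 == 1);
    return static_cast<double>(naive_reads(w, h, k)) / tiled_reads(w, h, k);
}
```

For a 16x16 block and a 5x5 filter this gives 6400 reads without staging versus 400 reads for the 20x20 tile, a factor of 16. The factor only approaches 25 as the tile grows large relative to the 2-pixel halo, so 25x is best read as an upper bound on the reduction in global reads.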

happyjack272 · November 3, 2010, 6:53pm

random access isn’t neccessarily “random”, but it’s in any case not linear. neither is it square nor cubic. “random” in this context really only means non-regular, and more specifically access that doesn’t line up with the memory architecture.

in your case the access is pretty regular so it shouldn’t be too difficult to make it line up pretty well. since you have a 5x5 filter presumably you’re reading each pixel 5x5=25 times. in that case i’d recommend reading sections of the image into shared memory and working with it there. you could get something like a 25x speedup.

in that case store the 5x5 filter in constant memory, which should be as fast as shared memory, without using any shared memory.
and i believe the speedup of texture memory (besides free interpolation) is when you access it linearly or in a blockwise fashion (2-d). if you’re storing to shared memory you can definitely read it in linearly. maybe in each block read in (x-2,y-2) to (x+w+2,y+h+2) and then run the filter in parallel from (x,y) to (x+w,y+h).

an advantage of this is that varying your filter size will have a fairly neglible effect on your i/o utilization. and it’s deliberate and exact instead of cacheing which is more heuristic and might make the wrong decisions.

AlexanderMalishev · November 3, 2010, 6:55pm

Yes, it is not random access. It is really a perfect memory access pattern for the GPU: the i-th thread reads element a[i+const].

Random access (or something like it):

int offset = random()%SIZE;

int value = array[offset];
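
To make the contrast between the two patterns concrete, here is a small host-side C++ sketch (a model, not device code). It assumes a 32-thread warp and 128-byte cache lines that each hold 32 four-byte ints, and it uses a multiplicative hash as a reproducible stand-in for random(). The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

const std::uint32_t kWarpSize = 32;      // threads per warp (assumed)
const std::uint32_t kElemsPerLine = 32;  // 128-byte line / 4-byte int (assumed)

// Distinct cache lines touched when thread i of a warp reads a[i + offset].
std::size_t lines_for_shifted_access(std::uint32_t offset) {
    std::set<std::uint32_t> lines;
    for (std::uint32_t i = 0; i < kWarpSize; ++i) {
        lines.insert((i + offset) / kElemsPerLine);
    }
    return lines.size();
}

// Distinct cache lines touched when each thread reads a scattered index.
// The hash below only imitates random()%SIZE so that runs are repeatable.
std::size_t lines_for_scattered_access(std::uint32_t size) {
    assert(size > 0);
    std::set<std::uint32_t> lines;
    for (std::uint32_t i = 0; i < kWarpSize; ++i) {
        std::uint32_t index = (i * 2654435761u) % size;  // wraps modulo 2^32
        lines.insert(index / kElemsPerLine);
    }
    return lines.size();
}
```

With these assumptions, a[i + 0] touches a single line and a[i + 2] touches two, because the warp straddles one line boundary. The scattered pattern lands on lines spread across the whole array, so most of each fetched line goes unused.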

AlexanderMalishev · November 3, 2010, 7:03pm

L1 has the same bandwidth and the same size as shared memory, so I think there will be no difference between them.

SPWorley · November 3, 2010, 8:16pm

Actually, Alexander, you’re the one who showed me the bandwidths were not the same. Probably because of L1 lookup overhead.

Shared memory is universally much faster than L1 cache-hit reads on GF100. On GF104 they’re about the same if the accesses fall in the same cache line, but L1 is slower if different cache lines are read simultaneously.

AlexanderMalishev · November 3, 2010, 8:45pm

Oh, yes. I forgot about GF100. So the CUDA SDK code is the fastest solution.

By the way, does anybody know why even shared memory is so slow (~400 GB/s real vs. 1000 GB/s advertised peak)? http://www.beyond3d.com/images/reviews/Slimer-arch/SharedMemBandwidth-big.jpg
