Why texture/constant memory under FERMI architecture

Hello:

I am using a C2050 and I suppose it has the FERMI architecture.

My question is: under the Fermi architecture, why do we still need texture and constant memory? My understanding is that we benefit from them because they can cache data. But Fermi already has general cache hardware on the GPU, so what do we still need texture and constant memory for?

Thanks for any of your insights,

Ming Qian

Constant memory can sometimes be a bit faster, because it doesn’t always require a separate load instruction (the constant can be read directly as an instruction operand).

Random loads through the texture path can be a lot faster than loads from global memory. The Fermi L1-cache-to-L2 datapath is 128 bytes wide, so with random access you cannot use all of the L1-to-L2 (and memory) bandwidth. The texture-cache-to-L2 datapath is 32 bytes wide (my guess!), so random loads are not a problem there.

Global memory accesses can also be 32 bytes wide if they bypass the L1 cache entirely (there is a compiler switch for that). With the texture cache you get both: an L1-level cache and small load granularity.
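
For reference, I believe the switch in question is the PTX assembler option that controls the default load caching mode: “cg” bypasses L1 so global loads become 32-byte L2 transactions, while “ca” (the default) caches in both L1 and L2 with 128-byte lines.

nvcc -Xptxas -dlcm=cg kernel.cu   (cache global loads in L2 only, 32-byte granularity)
nvcc -Xptxas -dlcm=ca kernel.cu   (default: cache in L1 and L2, 128-byte lines)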

You have a C2050, so you could try to get an official answer from NVIDIA. Please don’t forget to repost the answer here :)

See also http://forums.nvidia.com/index.php?showtopic=184388&pid=1140929&start=&st=#entry1140929

Alex:

Thanks for the reply. I am doing 5x5 Gaussian image filtering, and each consecutive thread calculates one pixel. I suppose this is not random access?

As you said, texture memory could perform better under a random-access pattern. I am very new to this type of processing; can you give me a simple example of random access?

Thanks,

Ming

> Texture load could be faster than normal global memory. Global memory load granularity is 128 bytes. If you load 4 bytes, hardware fetches 124 neighbors too. Texture load granularity is 128 bytes at L1 level, but only 32 bytes at L2 and memory level.

In my case, I believe the plain global memory load is better, because the next thread will access the next 4 bytes, which, as you suggested, should already have been fetched along with the rest of the 128-byte line…

The same goes for constant memory: I have 25 filter coefficients that are used by all threads. I only need to read them from global memory once, and they will stay in the cache for all the upcoming threads.
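
Something like this (just a sketch; the names are mine) is what I have in mind for the coefficients: declare them in constant memory and upload them once from the host, so every thread reads them through the constant cache.

__constant__ float d_coeffs[25];   // 5x5 Gaussian coefficients, visible to all threads

void uploadCoefficients(const float *h_coeffs)
{
    // one host-to-device copy; all later kernel launches reuse the cached values
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, 25 * sizeof(float));
}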

For random access it might be a different story. As stated in your example, the 124 neighboring bytes won’t be useful because the next thread won’t use them…

Correct me if my understanding is wrong.

Thanks,

Ming Qian

You are right, L1 cache is quite good here.

Maybe the constant cache for the coefficients would be slightly better. Not sure if it is worth the trouble.

By the way, there is a CUDA SDK example, “convolutionSeparable”, that is exactly your problem. It uses shared memory to emulate a cache on older devices.

Random access isn’t necessarily “random”, but in any case it is not linear, nor is it square or cubic. “Random” in this context really only means non-regular, and more specifically, access that doesn’t line up with the memory architecture.

In your case the access is pretty regular, so it shouldn’t be too difficult to make it line up well. Since you have a 5x5 filter, presumably you’re reading each pixel 5x5 = 25 times. In that case I’d recommend reading sections of the image into shared memory and working with them there; you could get something like a 25x speedup.

In that case, store the 5x5 filter in constant memory, which should be about as fast as shared memory, without consuming any shared memory.
I believe the benefit of texture memory (besides free interpolation) comes when you access it linearly or in a blockwise (2-D) fashion. If you’re staging through shared memory, you can definitely read the image in linearly: maybe in each block read in (x-2, y-2) to (x+w+2, y+h+2) and then run the filter in parallel over (x, y) to (x+w, y+h), as in the sketch below.

An advantage of this is that varying your filter size will have a fairly negligible effect on your I/O utilization. And it’s deliberate and exact, instead of caching, which is more heuristic and might make the wrong decisions.
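
Here is a rough, untested sketch of what I mean (names and tile size are made up, borders just clamp to the edge, and the block is launched with blockDim = (TILE_W, TILE_H)): each block loads its padded window into shared memory once, then every thread filters one pixel from there, with the coefficients in constant memory.

#define TILE_W 16
#define TILE_H 16
#define R 2                                  // radius of a 5x5 filter

__constant__ float c_kernel[25];             // 5x5 coefficients

__global__ void gauss5x5(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE_H + 2*R][TILE_W + 2*R];

    int baseX = blockIdx.x * TILE_W;         // top-left of this block's output tile
    int baseY = blockIdx.y * TILE_H;
    int x = baseX + threadIdx.x;             // output pixel owned by this thread
    int y = baseY + threadIdx.y;

    // cooperatively load the (TILE_W+4) x (TILE_H+4) window, clamping at the borders
    for (int ty = threadIdx.y; ty < TILE_H + 2*R; ty += TILE_H)
        for (int tx = threadIdx.x; tx < TILE_W + 2*R; tx += TILE_W) {
            int gx = min(max(baseX + tx - R, 0), width  - 1);
            int gy = min(max(baseY + ty - R, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x < width && y < height) {
        float sum = 0.0f;
        for (int ky = 0; ky < 5; ++ky)
            for (int kx = 0; kx < 5; ++kx)
                sum += c_kernel[ky * 5 + kx] * tile[threadIdx.y + ky][threadIdx.x + kx];
        out[y * width + x] = sum;
    }
}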

Yes, it is not random access. It is actually the perfect memory access pattern for the GPU: the i-th thread reads the a[i + const] element.

Random access (or something like it):

int offset = random() % SIZE;
int value = array[offset];
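
In kernel terms, the difference looks something like this (just a sketch; the names are made up):

// regular pattern: the i-th thread reads a[i + shift], so neighbors share 128-byte lines
// (assumes a has at least n + shift elements)
__global__ void regularRead(const float *a, float *out, int n, int shift)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i + shift];
}

// "random" access: each thread gathers from an arbitrary index, so neighbors rarely share a line
__global__ void randomRead(const float *a, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[idx[i]];
}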

L1 has the same bandwidth and the same size as shared memory, so I think there will be no difference between them.

Actually, Alexander, you’re the one who showed me the bandwidths were not the same, probably because of L1 lookup overhead.

Shared memory is universally much faster than L1 cache-hit reads on GF100. On GF104 they’re about the same when the accesses hit the same cache line at once, but L1 is slower when different cache lines are read simultaneously.

Oh, yes. I forgot about GF100. So the CUDA SDK code is the fastest solution.

By the way, does anybody know why even shared memory is so slow (~400 GB/s measured vs. ~1000 GB/s advertised peak)? http://www.beyond3d.com/images/reviews/Slimer-arch/SharedMemBandwidth-big.jpg
