I am using a C2050, which I believe is based on the Fermi architecture.
My question is: under the Fermi architecture, why do we still need texture and constant memory? My understanding is that we benefit from them because they cache data. But Fermi already has cache hardware on the GPU, so what do we still need texture and constant memory for?
Constant memory can sometimes be a bit faster, because a constant operand doesn’t require a separate load instruction (sometimes!).
A random load from texture can be a lot faster than a load from global memory. The Fermi L1-to-L2 datapath is 128 bytes wide, so with random access you cannot use all of the L1-to-L2 (and memory) bandwidth. The Fermi texture-L1-to-L2 datapath is 32 bytes wide (my guess!), so random loads are not a problem there.
Global memory accesses can also be 32 bytes wide if they bypass the L1 cache entirely (there is a compiler switch for that). With the texture cache you get both: an L1-level cache and small load granularity.
You have a C2050, so you could try to get an official answer from NVIDIA. Please don’t forget to repost the answer here :)
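If it helps, here is a minimal sketch of reading through the texture path on Fermi, using the old texture reference API of that era (texSrc, copyThroughTexture, and n are placeholder names of mine). And the compiler switch I mean above is, if I remember right, `-Xptxas -dlcm=cg`, which makes ordinary global loads bypass L1 and use 32-byte transactions.

```
#include <cuda_runtime.h>

// Fermi-era texture reference, declared at file scope.
texture<float, 1, cudaReadModeElementType> texSrc;

__global__ void copyThroughTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texSrc, i);  // this load is served by the texture cache
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Bind a plain linear device buffer to the texture reference.
    cudaBindTexture(NULL, texSrc, d_in, n * sizeof(float));

    copyThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaDeviceSynchronize();

    cudaUnbindTexture(texSrc);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```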
Thanks for the reply. I am doing 5x5 Gaussian image filtering, and each consecutive thread calculates one pixel. I suppose this is not random access?
As you said, texture memory could perform better under random access. I am very new to this type of processing; can you give me a simple example of random access?
> Texture load could be faster than normal global memory. Global memory load granularity is 128 bytes. If you load 4 bytes, hardware fetches the 124 neighbors too. Texture load granularity is 128 bytes at the L1 level, but only 32 bytes at the L2 and memory level.
In my case, I believe the plain global memory load is better, because the next thread will access the next 4 bytes, which, as you suggested, should already have been fetched as part of the same 128-byte line…
The same goes for constant memory: I have 25 filter coefficients that are used by all the threads. They only need to be read from global memory once, and then they stay in the cache for all the upcoming threads.
For random access, it might be a different story. As in your example, the 124 neighbors won’t be useful, because the next thread won’t use them…
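To make that concrete, here is roughly what I have in mind for the coefficients (h_coeffs, d_coeffs, scaleByFirstCoeff, and the values are placeholders I made up, not my real filter):

```
#include <cuda_runtime.h>

__constant__ float d_coeffs[25];  // 5x5 filter coefficients, served by the constant cache

__global__ void scaleByFirstCoeff(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= d_coeffs[0];   // every thread reads the same address: one broadcast
}

int main()
{
    float h_coeffs[25];
    for (int k = 0; k < 25; ++k)
        h_coeffs[k] = 1.0f / 25.0f;                            // stand-in values
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));  // written once from the host

    float *d_data;
    const int n = 1024;
    cudaMalloc(&d_data, n * sizeof(float));
    scaleByFirstCoeff<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```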
Random access isn’t necessarily “random”, but in any case it is not linear, nor square, nor cubic. “Random” in this context really only means non-regular, and more specifically access that doesn’t line up with the memory architecture.
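A simple example is a gather through an index array, where the address each thread touches depends on data (a minimal sketch; gather and the parameter names are mine):

```
__global__ void gather(const float *src, const int *idx, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];  // adjacent threads may land on far-apart cache lines
}
```

If idx holds, say, a random permutation, neighboring threads hit unrelated 128-byte lines and most of each fetch is wasted.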
In your case the access is pretty regular, so it shouldn’t be too difficult to make it line up well. Since you have a 5x5 filter, presumably you’re reading each pixel 5x5 = 25 times. In that case I’d recommend reading sections of the image into shared memory and working with them there; you could cut global memory reads by something like a factor of 25 (see the sketch at the end of this post).
In that case, store the 5x5 filter coefficients in constant memory, which should be about as fast as shared memory, without using up any shared memory.
And I believe the speedup from texture memory (besides free interpolation) comes when you access it linearly or in a blockwise (2-D) fashion. If you’re staging into shared memory, you can definitely read it in linearly: in each block, read in (x-2, y-2) to (x+w+2, y+h+2), and then run the filter in parallel over (x, y) to (x+w, y+h).
An advantage of this is that varying your filter size will have a fairly negligible effect on your I/O utilization. And it is deliberate and exact, instead of caching, which is more heuristic and might make the wrong decisions.
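Here is a rough sketch of that tiling (TILE, RADIUS, gauss5x5 etc. are names I made up; borders are handled by simply clamping, and the coefficients sit in constant memory as suggested above):

```
#include <cuda_runtime.h>

#define TILE   16
#define RADIUS 2                      // 5x5 filter -> 2-pixel halo

__constant__ float d_filter[25];      // filter coefficients

__global__ void gauss5x5(const float *in, float *out, int w, int h)
{
    // Shared tile including the halo on all four sides.
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int x = (int)blockIdx.x * TILE + threadIdx.x;  // output pixel of this thread
    int y = (int)blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load the (TILE+4)^2 tile; each thread loads a few pixels,
    // and the loads are linear within each row.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += blockDim.y)
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += blockDim.x) {
            int gx = min(max((int)blockIdx.x * TILE + dx - RADIUS, 0), w - 1);
            int gy = min(max((int)blockIdx.y * TILE + dy - RADIUS, 0), h - 1);
            tile[dy][dx] = in[gy * w + gx];
        }
    __syncthreads();

    if (x >= w || y >= h) return;

    // All 25 taps now come from shared memory instead of global memory.
    float acc = 0.0f;
    for (int fy = 0; fy < 5; ++fy)
        for (int fx = 0; fx < 5; ++fx)
            acc += d_filter[fy * 5 + fx] * tile[threadIdx.y + fy][threadIdx.x + fx];
    out[y * w + x] = acc;
}
```

Launch it with a TILE x TILE block, e.g. dim3 block(TILE, TILE); dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);.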
Shared memory is universally much faster than L1 cache hits on GF100. On GF104 they are about the same when the accesses in a warp hit the same cache line, but L1 is slower when different cache lines are read simultaneously.