GTX 470 performance gains too low? (texture operations)

Anybody from NVIDIA, could you please comment on this?

We tried our application on a GTX 470 (with CUDA 3.0) and observed only a 20% speedup over a GTX 285 (CUDA 2.3).
We were expecting a larger performance gain with the new card, so we don’t understand whether this is what we should expect or whether we are missing something.

Our application is compute bound: the computations are dominated by 2D texture interpolation operations (tex2D) on large matrices.
The kernel configuration is tuned to the best number of threads per block for each of the two cards under comparison.

Any comments or suggestions?

Are you sure it is still compute bound on the GTX 470?

The peak floating point performance of the GTX 470 should be 50% higher than the GTX 285, but the theoretical peak memory bandwidth is about 16% lower on the GTX 470 compared to the GTX 285 (133.9 vs 159.0 GB/s). I saw some evidence (that I can’t test until I get a card) in the GTX 470/480 benchmarking thread that this shift in the compute/bandwidth ratio was turning some of my compute-bound test kernels into more bandwidth-limited kernels.

Do you use interpolation in the 2D texture? You could try converting it to ordinary (global) memory to make use of the L1 cache; it is much bigger than the texture cache and can be configured to 48 KB. By the way, do you know the number of texture units per multiprocessor or per multiprocessor cluster? Also, can we discuss it here, or are you waiting for an official answer?

Not only that: the Fermi performance tuning guide states that the L1 cache is faster than the texture cache.

I vaguely remember reading somewhere that texture is going to under-perform on Fermi… but it’s a very distant memory… cross-check…

Sarnath is right. From the tuning guide:

I only have a GTX 480 to play with, and I’m seeing pretty good speedups in my texture-heavy and bandwidth-limited code: a 60% performance boost over a GTX 285. Not bad when the theoretical bandwidth only went up about 10% (from 159 to 177 GB/s).

I’m hoping for more when I convert from texture fetches to L1 cache reads. I think the OP’s mistake is in comparing a GTX 285 (top of the line, the fastest single-GPU G200 card) to a GTX 470, which is one step down from the fastest. A comparison of the GTX 275 and the GTX 470 would make more sense.

Thanks,

Could you please give me a reference for this number, “GTX 470 should be 50% higher than the GTX 285”?

I am not sure how to determine exactly whether my kernel is compute bound in the case of texture fetching operations. Should I assume the whole texture is transferred to the cache once? Then it would not be memory-transfer bound…

Thanks for your reply,

Yes, I am using 2D interpolation, so it seems I am bound to texture fetching… There is no way I can obtain similar performance if I do the 2D interpolations “manually” (more than 10 FLOPs per texture fetch).

Do you know any details about the texture units and how they may affect performance in a texture-fetching application? Please share if you do…

Thanks!

Thanks,

We shall try GTX 480.

Unfortunately, we have to use texture fetching since we need 2D interpolation.

Is there any workaround to do L1 cache reads with interpolation? I suppose we cannot get anywhere close to tex2D performance if we do the interpolations manually (more than 10 FLOPs per fetch). Is this correct, or am I missing something?

"more than 10 FLOPs per 1 fetch). "

If your program is memory bound… 10 flops is nothing

How do you do that? Just remove the texture reference and access the underlying device memory, even if the access is not perfectly coalesced?

eyal

Exactly. The idea is that we have a full L1/L2 cache hierarchy now for global memory reads, so let’s use it! With random patterns like this, calling cudaFuncSetCacheConfig and going to 48 KB of L1 is essential. Of course, a new set of challenges arises, and we now need to think about loading 128-byte cache lines instead of coalescing… I don’t want to derail this thread any further: I’ll start a new thread on the cache after I’ve had some time to get a feel for what it is capable of.
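For anyone who hasn’t seen it yet, the call itself is a one-liner in host code. A minimal sketch, with myKernel, grid, block, d_in, and d_out as placeholder names (not anyone’s actual code):

__global__ void myKernel(const float *in, float *out);  // hypothetical kernel

// In host code, before the launch: request the 48 KB L1 / 16 KB shared
// split (the Fermi default is 16 KB L1 / 48 KB shared).
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
myKernel<<<grid, block>>>(d_in, d_out);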

Get the numbers straight from NVIDIA: http://www.nvidia.com/object/product_geforce_gtx_470_us.html

http://www.nvidia.com/object/product_geforce_gtx_285_us.html

Peak FLOPS = # CUDA cores * flops/clock * clock rate
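Plugging in the published specs (448 cores at a 1215 MHz shader clock for the GTX 470, 240 cores at 1476 MHz for the GTX 285, counting a multiply-add as 2 flops per clock):

GTX 470: 448 * 2 * 1.215 GHz ≈ 1089 GFLOPS
GTX 285: 240 * 2 * 1.476 GHz ≈ 708 GFLOPS
=> 1089 / 708 ≈ 1.54, i.e. roughly 50% higher peak

(Counting the G200’s extra dual-issue MUL as a third flop per clock would raise the GTX 285 number; the ~50% figure comes from the MAD-only count.)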

Also look at memory bandwidth: 159.0 / 133.9 = 1.187 => the GTX 285 has about 19% higher memory bandwidth than the GTX 470. You should consider yourself lucky to get 20% faster performance on a card with ~16% less bandwidth!

Try making straight texture reads and interpolating yourself. You will be surprised. One typically needs 40+ flops per memory operation to become compute bound.
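To make that concrete, “interpolating yourself” from plain global memory looks roughly like the sketch below. This is a minimal sketch, assuming a row-major float image; the names (d_data, pitch, width, height) are mine, pitch is in elements, and it only approximates tex2D with linear filtering and clamped addressing (the hardware uses low-precision 9-bit weights, so results won’t match bit-for-bit):

__device__ float bilinear(const float *d_data, int pitch,
                          int width, int height, float x, float y)
{
    // tex2D with linear filtering treats texel centers as (i + 0.5, j + 0.5)
    x -= 0.5f;
    y -= 0.5f;
    int x0 = __float2int_rd(x);            // floor(x)
    int y0 = __float2int_rd(y);
    float fx = x - (float)x0;              // fractional weights in [0, 1)
    float fy = y - (float)y0;
    int x1 = x0 + 1, y1 = y0 + 1;
    // clamp all four indices to the image border (cudaAddressModeClamp)
    x0 = min(max(x0, 0), width - 1);
    x1 = min(max(x1, 0), width - 1);
    y0 = min(max(y0, 0), height - 1);
    y1 = min(max(y1, 0), height - 1);
    // four reads instead of one tex2D fetch; pitch is in elements
    float v00 = d_data[y0 * pitch + x0];
    float v01 = d_data[y0 * pitch + x1];
    float v10 = d_data[y1 * pitch + x0];
    float v11 = d_data[y1 * pitch + x1];
    // three lerps: on the order of 10 flops per interpolated sample
    float top    = v00 + fx * (v01 - v00);
    float bottom = v10 + fx * (v11 - v10);
    return top + fy * (bottom - top);
}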

Also, with Fermi’s L1 cache: try just reading straight from the pointer and see what kind of benefit you can get. It is impossible to know a priori which configuration will give your kernel the best performance; you just have to benchmark them all!
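The usual way to benchmark each variant is with CUDA events. A minimal sketch (myKernel and its launch configuration are placeholders again):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_in, d_out);   // the variant under test
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);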

You didn’t really expect me to leave this issue??? ;)

Any educated guess as to how to make sure that hundreds of thousands of threads across a lot of blocks will not pollute the L1 cache, and how to enforce the 128-byte cache lines? How would that be possible to do?

thanks

eyal

I’ve only had a few hours to benchmark the new GTX 480. You can’t expect someone to have already squeezed all of the performance out of the new cache in that amount of time and to already have all the answers. Maybe I shouldn’t start that new thread, and should keep all the new cache tricks I find to myself?

Regarding the cache pollution, there is only one thing to say to that: 48 KB (not to mention the 768 KB L2 cache). Sticking to L1: cache pollution was an issue on G200, with only 8 KB per MP. If each thread reads 16 bytes from the cache and occupancy is 100%, then a full hypothetical round-robin pass over the threads reads 1024 * 16 bytes = 16 KB from the cache. The cache is thus polluted before you get back to thread 1, and all temporal locality is lost.

On Fermi, the magic 48 KB number is the difference. 100% occupancy is now 1536 threads per MP, but 1536 * 16 bytes = 24 KB => the cache is not fully polluted right away, and we can start to get temporal locality from it!

The 768 KB L2 is also a godsend, at least for my work. The resident set of data that my app typically accesses randomly ranges from 100 KB to 1000 KB. I just wish that I could configure which reads are cached in L1 and which aren’t from C++, instead of an all-or-nothing change at compile time. Then I could prevent other once-only reads from polluting any of that 48 KB cache.
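(For reference, I believe the all-or-nothing compile-time switch is the ptxas load-cache modifier, along these lines:

nvcc -Xptxas -dlcm=ca ...   (the default: global loads cached in both L1 and L2)
nvcc -Xptxas -dlcm=cg ...   (global loads go to L2 only, bypassing L1)

so the choice applies to every global load in the compilation unit at once.)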

Ooh MisterAnderson, judging from your past assistance, I’m sure you’re going to give us all the red pill and show us how deep the Fermi-rabbit-hole goes… ;)

eyal

Texture cache latency (at least on GT200, and presumably on Fermi too) is about ten times higher than Fermi’s L1/L2 cache latency, and the texture cache is smaller, so it makes sense to rely more on global memory accesses.