GTX 470 performance gains too low? (texture operations)

Anybody from NVIDIA, could you please comment on this?

We tried our application on a GTX 470 (with CUDA 3.0) and observed only a 20% speedup over a GTX 285 (CUDA 2.3).
We were expecting a larger performance gain with the new card, so we don’t understand whether this is what we should expect or whether we are missing something.

Our application is compute bound: the computations are dominated by 2D texture interpolation operations (tex2D) on large matrices.
The kernel configuration is tuned to the best number of threads per block for each of the two cards under comparison.

Any comments or suggestions?

Are you sure it is still compute bound on the GTX 470?

The peak floating point performance of the GTX 470 should be 50% higher than the GTX 285, but the theoretical peak memory bandwidth is about 16% lower on the GTX 470 compared to the GTX 285 (133.9 vs 159.0 GB/s). I saw some evidence (that I can’t test until I get a card) in the GTX 470/480 benchmarking thread that this shift in the compute/bandwidth ratio was turning some of my compute-bound test kernels into more bandwidth-limited kernels.

Do you use interpolation in the 2D texture? You could try converting it to ordinary (global) memory to make use of the L1 cache; it is much bigger than the texture cache and can be configured to 48 KB. By the way, do you know the number of texture units per multiprocessor or per multiprocessor cluster? Also, can we discuss it here, or are you waiting for an official answer?

Not only that: the Fermi performance tuning guide states that the L1 cache is faster than the texture cache.

I vaguely remember reading somewhere that texture is going to under-perform on Fermi… but it’s a very distant memory… cross-check…

Sarnath is right. From the tuning guide:

I only have a GTX 480 to play with, and I’m seeing pretty good speedups in my texture-heavy and bandwidth-limited code: a 60% performance boost over a GTX 285. Not bad when the theoretical bandwidth only went up about 10% (from 159 to 177 GB/s).

I’m hoping for more when I convert from texture fetches to L1 cache reads. I think the OP’s mistake is in comparing a GTX 285 (top of the line, the fastest single-GPU G200 card) to a GTX 470, which is one step down from the fastest. A comparison of the GTX 275 and the GTX 470 would make more sense.

Thanks,

Could you please give me a reference for this number, “GTX 470 should be 50% higher than the GTX 285”?

I am not sure how to determine exactly whether my kernel is compute bound in the case of texture fetching operations. Should I assume the whole texture is transferred to the cache once? Then it would not be memory-transfer bound…

Thanks for your reply,

Yes, I am using 2D interpolation, so it seems I am bound to texture fetching… There is no way I can obtain similar performance if I do the 2D interpolations “manually” (more than 10 FLOPs per texture fetch).

Do you know any details about the texture units and how they may affect performance in a texture-fetching application? Please share if you do…

Thanks!

Thanks,

We shall try GTX 480.

Unfortunately, we have to use texture fetching since we need 2D interpolation.

Is there any workaround to do L1 cache reads with interpolation? I suppose we cannot get anywhere close to tex2D performance if we do the interpolations manually (more than 10 FLOPs per fetch). Is this correct, or am I missing something?

"more than 10 FLOPs per 1 fetch). "

If your program is memory bound… 10 flops is nothing

How do you do that? Just remove the texture reference and access the underlying device memory, even if the access is not perfectly coalesced?

eyal

Exactly. The idea is that we have a full L1/L2 cache hierarchy now for global memory reads, so let’s use it! With random patterns like this, calling cudaFuncSetCacheConfig and going to 48 KB of L1 is essential. Of course, a new set of challenges arises, and we now need to think about loading 128-byte cache lines instead of coalescing… I don’t want to derail this thread any further: I’ll start a new thread on the cache after I’ve had some time to get a feel for what it is capable of.
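For anyone who hasn’t seen it yet, the call itself is a one-liner in host code. A minimal sketch, with myKernel, grid, block, d_in, and d_out as placeholder names (not anyone’s actual code):

__global__ void myKernel(const float *in, float *out);  // hypothetical kernel

// In host code, before the launch: request the 48 KB L1 / 16 KB shared
// split (the Fermi default is 16 KB L1 / 48 KB shared).
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
myKernel<<<grid, block>>>(d_in, d_out);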

Get the numbers straight from NVIDIA: http://www.nvidia.com/object/product_geforce_gtx_470_us.html

http://www.nvidia.com/object/product_geforce_gtx_285_us.html

Peak FLOPS = # CUDA cores * flops/clock * clock rate
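Plugging in the published specs (448 cores at a 1215 MHz shader clock for the GTX 470, 240 cores at 1476 MHz for the GTX 285, counting a multiply-add as 2 flops per clock):

GTX 470: 448 * 2 * 1.215 GHz ≈ 1089 GFLOPS
GTX 285: 240 * 2 * 1.476 GHz ≈ 708 GFLOPS
=> 1089 / 708 ≈ 1.54, i.e. roughly 50% higher peak

(Counting the G200’s extra dual-issue MUL as a third flop per clock would raise the GTX 285 number; the ~50% figure comes from the MAD-only count.)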

Also look at memory bandwidth: 159.0 / 133.9 = 1.187 => the GTX 285 has about 19% higher memory bandwidth than the GTX 470. You should consider yourself lucky to get 20% faster performance on a card with ~16% less bandwidth!

Try making straight texture reads and interpolating yourself. You will be surprised. One typically needs 40+ flops per memory operation to become compute bound.
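To make that concrete, “interpolating yourself” from plain global memory looks roughly like the sketch below. This is a minimal sketch, assuming a row-major float image; the names (d_data, pitch, width, height) are mine, pitch is in elements, and it only approximates tex2D with linear filtering and clamped addressing (the hardware uses low-precision 9-bit weights, so results won’t match bit-for-bit):

__device__ float bilinear(const float *d_data, int pitch,
                          int width, int height, float x, float y)
{
    // tex2D with linear filtering treats texel centers as (i + 0.5, j + 0.5)
    x -= 0.5f;
    y -= 0.5f;
    int x0 = __float2int_rd(x);            // floor(x)
    int y0 = __float2int_rd(y);
    float fx = x - (float)x0;              // fractional weights in [0, 1)
    float fy = y - (float)y0;
    int x1 = x0 + 1, y1 = y0 + 1;
    // clamp all four indices to the image border (cudaAddressModeClamp)
    x0 = min(max(x0, 0), width - 1);
    x1 = min(max(x1, 0), width - 1);
    y0 = min(max(y0, 0), height - 1);
    y1 = min(max(y1, 0), height - 1);
    // four reads instead of one tex2D fetch; pitch is in elements
    float v00 = d_data[y0 * pitch + x0];
    float v01 = d_data[y0 * pitch + x1];
    float v10 = d_data[y1 * pitch + x0];
    float v11 = d_data[y1 * pitch + x1];
    // three lerps: on the order of 10 flops per interpolated sample
    float top    = v00 + fx * (v01 - v00);
    float bottom = v10 + fx * (v11 - v10);
    return top + fy * (bottom - top);
}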

Also, with Fermi’s L1 cache: try just reading straight from the pointer and see what kind of benefit you can get. It is impossible to know a priori which configuration will give your kernel the best performance; you just have to benchmark them all!
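The usual way to benchmark each variant is with CUDA events. A minimal sketch (myKernel and its launch configuration are placeholders again):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_in, d_out);   // the variant under test
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);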

You didn’t really expect me to leave this issue??? ;)

Any educated guess as to how to make sure that hundreds of thousands of threads across a lot of blocks will not pollute the L1 cache, and how to enforce the 128-byte cache lines? How would that be possible to do?

thanks

eyal

I’ve only had a few hours to benchmark the new GTX 480. You can’t expect someone to have already squeezed all of the performance out of the new cache in that amount of time and to already have all the answers. Maybe I shouldn’t start that new thread, and should keep all the new cache tricks I find to myself?

Regarding the cache pollution, there is only one thing to say to that: 48 KB (not to mention the 768 KB L2 cache). Sticking to L1: cache pollution was an issue on G200, with only 8 KB per MP. If each thread reads 16 bytes from the cache and occupancy is 100%, then a full hypothetical round-robin pass over the threads reads 1024 * 16 bytes = 16 KB from the cache. The cache is thus polluted before you get back to thread 1, and all temporal locality is lost.

On Fermi, the magic 48 KB number is the difference. 100% occupancy is now 1536 threads per MP, but 1536 * 16 bytes = 24 KB => the cache is not fully polluted right away, and we can start to get temporal locality from it!

The 768 KB L2 is also a godsend, at least for my work. The resident set of data that my app typically accesses randomly ranges from 100 KB to 1000 KB. I just wish that I could configure which reads are cached in L1 and which aren’t from C++, instead of an all-or-nothing change at compile time. Then I could prevent other once-only reads from polluting any of that 48 KB cache.
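(For reference, I believe the all-or-nothing compile-time switch is the ptxas load-cache modifier, along these lines:

nvcc -Xptxas -dlcm=ca ...   (the default: global loads cached in both L1 and L2)
nvcc -Xptxas -dlcm=cg ...   (global loads go to L2 only, bypassing L1)

so the choice applies to every global load in the compilation unit at once.)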

Ooh MisterAnderson, judging from your past assistance, I’m sure you’re going to give us all the red pill and show us how deep the Fermi-rabbit-hole goes… ;)

eyal

Texture cache latency (at least on GT200, and presumably on Fermi too) is about ten times higher than Fermi’s L1/L2 cache latency, and the texture cache is smaller, so it makes sense to rely more on global memory accesses.