GTX 470 slower than GTX 280

I have been using a GTX 280 for some time now and recently got a GTX 470. One of the kernels in my application uses texture fetches (tex1Dfetch, so no interpolation or filtering). The application compiled without an arch flag runs slightly faster on the GTX 280 than the same application compiled with -arch=sm_20 does on the GTX 470. I tried replacing the texture fetches with regular global memory loads, with the L1 cache configured to 48 KB and the compile option -Xptxas -dlcm=ca, but performance degraded further. I expected that with the L1 and L2 caches kicking in, global loads would beat the texture cache. I also tried constant memory, but that was slower still…

I am also using -use_fast_math -ftz=true -prec-sqrt=false -prec-div=false

Could anyone please tell me why texture fetches are working better than global memory reads? Any help is greatly appreciated… (Sorry, this is my first post, so I might be missing a lot of information.)

Texture usage is probably not the bottleneck here. Check the occupancy on the GTX 470 first. What are your shared memory usage, register usage and block size? The GTX 470 needs its own tuning.

Thanks for the reply…
Block size is (15x7)…I do know this selection is not very good, but my application needs it this way…
Shared memory: sizeof(float)*15*7
Register usage: 10 registers per thread…

Also, if I comment out the statement that does the texture fetch (or the fetch from global / constant memory), the program speeds up drastically (nearly 2x)…
Do let me know if you need more information…
Thanks again…

I have a similar problem, but without texture fetches. I use global memory reads, and a GTX 260 (192 cores) is faster than a GTX 470. It looks like GT200 copes better with lots of very scattered reads than Fermi does.

I don't know how reliable this is, but according to GPU-Z the GTX 470 had 80% memory controller load while the GTX 260, despite being faster, had 50% memory controller load :| After removing the code that reads this memory, the GTX 470 was 80% faster than the GTX 260.

GF100 uses more precise floating point arithmetic by default. See this thread.

What about memory performance? Am I doing something wrong, or is it possible that the GTX 260 really is faster with scattered memory reads?

Block size is (15x7)…

That is not a good size. You simply do not get good occupancy on Fermi with it. Fermi allows at most 8 resident blocks per multiprocessor, the same as GT200, so with blocks this small you end up with the same number of resident threads on Fermi as on GT200, even though Fermi has room for more. You need bigger blocks on Fermi; for example, you can merge the work of two blocks into one, which is simple to do.
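To put rough numbers on the block-limit argument, here is a minimal host-side sketch of the occupancy arithmetic. It models only two limits per multiprocessor, resident warps (32 on GT200, 48 on Fermi) and resident blocks (8 on both), and assumes registers and shared memory are not the limiting factor, which seems plausible at 10 registers per thread but is an assumption. The names `SmLimits`, `kGt200`, `kFermi`, `warps_per_block`, `resident_warps` and `occupancy` are made up for this illustration; the CUDA Occupancy Calculator remains the authoritative tool.

```cpp
#include <algorithm>

// Per-multiprocessor limits. Only the warp and block limits are modelled;
// register and shared memory limits are deliberately ignored (assumption).
struct SmLimits {
    int max_warps;   // maximum resident warps per multiprocessor
    int max_blocks;  // maximum resident blocks per multiprocessor
};

constexpr SmLimits kGt200{32, 8};  // compute capability 1.3 (GTX 260 / GTX 280)
constexpr SmLimits kFermi{48, 8};  // compute capability 2.0 (GTX 470)

constexpr int kWarpSize = 32;

// A partially filled warp still occupies a whole warp slot,
// so a 15x7 = 105 thread block costs 4 warps.
int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Resident warps per multiprocessor, limited by whichever of the
// block limit or the warp limit is hit first.
int resident_warps(int threads_per_block, SmLimits sm) {
    const int wpb = warps_per_block(threads_per_block);
    const int blocks = std::min(sm.max_blocks, sm.max_warps / wpb);
    return blocks * wpb;
}

// Occupancy as the fraction of warp slots that are in use.
double occupancy(int threads_per_block, SmLimits sm) {
    return static_cast<double>(resident_warps(threads_per_block, sm)) / sm.max_warps;
}
```

Under these assumptions a 105-thread block gives 32 resident warps on both chips: that fills GT200 completely but only two thirds of Fermi's 48 warp slots. A 256-thread block reaches 48 resident warps on Fermi (6 blocks of 8 warps), so the block limit no longer binds.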

"I have a similar problem, but without texture fetches. I use global memory reads, and a GTX 260 (192 cores) is faster than a GTX 470. It looks like GT200 copes better with lots of very scattered reads than Fermi does.

I don't know how reliable this is, but according to GPU-Z the GTX 470 had 80% memory controller load while the GTX 260, despite being faster, had 50% memory controller load :| After removing the code that reads this memory, the GTX 470 was 80% faster than the GTX 260."

Most likely you have the same problem, or something like it: low occupancy on Fermi. Check the occupancy calculator. Remember, Fermi has the same maximum number of resident blocks per multiprocessor as GT200.

I use a block size of 128, and the Visual Profiler shows > 0.5 occupancy, so it is not that low. Wouldn’t a kernel with low occupancy show a lower memory controller load in GPU-Z?

Yes, in that case something else is wrong and my suggestion was off. By the way, how much does performance change if you remove the memory loads on the GTX 260? And how much faster is the GTX 260: a few percent, or twice as fast?

If I remember right, removing the memory loads improves performance from 17 fps → 44 fps on the GTX 260 and from 15 fps → 77 fps on the GTX 470.

I could rewrite this code to recalculate the values instead of loading them, and that would probably resolve it. But it is strange that the GTX 260 does better than the GTX 470 in this case. Maybe the more complex memory system results in longer reads on a cache miss than on the GTX 260?
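For what it's worth, the frame rates quoted above can be turned into a rough per-frame cost of the loads. The sketch below assumes the costs simply add up, i.e. no overlap between memory latency and computation, and that removing the loads did not also let the compiler drop the work that depended on them; both are assumptions, and the fps figures are quoted from memory. The helper names `frame_ms` and `load_cost_ms` are made up for this illustration.

```cpp
// Frame time in milliseconds for a given frame rate.
double frame_ms(double fps) {
    return 1000.0 / fps;
}

// Per-frame time that disappears when the memory loads are removed.
// Assumes load time and compute time are additive (no latency hiding).
double load_cost_ms(double fps_with_loads, double fps_without_loads) {
    return frame_ms(fps_with_loads) - frame_ms(fps_without_loads);
}
```

With the numbers above, `load_cost_ms(17, 44)` comes to about 36 ms per frame on the GTX 260 and `load_cost_ms(15, 77)` to about 54 ms on the GTX 470. Read this way, the scattered loads cost roughly 1.5x more on the GTX 470, while the rest of the frame runs about 1.75x faster there (about 23 ms versus 13 ms).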

OK… I changed the block size to 256. (I know the jump from 15x7 to 256x1 is weird, but I had to restructure the computation inside the thread.) I have not compared the results with the GTX 280 yet, but I did get an increase in performance. I again tried both texture fetches and global memory loads (the global memory accesses are scattered), and texture fetches still outperform global memory loads. So for now I am going to stick with texture fetches, although I did hope that the L1 and L2 caches would increase the speed… I will try to post the performance difference between the GTX 280 and the GTX 470…
Thanks for your help