gtx 465 performance

Hi everybody,

I ran into a strange issue. I developed a CUDA app on a GTX 260, bought a new GTX 465 a few days ago, and the performance was the same. I checked the samples from the SDK, and some of them, like scan and eigenvalues, give even worse results.

GTX 260:

SCAN
scan-Large, Throughput = 205.5178 MElements/s, Time = 0.00128 s, Size = 262144 Elements, NumDevsUsed = 1, Workgroup = 256

EIGENVALUES
Iterations to be timed: 100
Result filename: ‘eigenvalues.dat’
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 15.531688 ms
Average time step 2, one intervals: 4.953684 ms
Average time step 2, mult intervals: 0.017448 ms
Average time TOTAL: 20.831189 ms

GTX 465:

SCAN
scan-Large, Throughput = 203.7271 MElements/s, Time = 0.00129 s, Size = 262144 Elements, NumDevsUsed = 1, Workgroup = 256

EIGENVALUES
Result filename: ‘eigenvalues.dat’
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 37.504597 ms
Average time step 2, one intervals: 12.808976 ms
Average time step 2, mult intervals: 0.019804 ms
Average time TOTAL: 50.474590 ms

I use the latest version of the drivers; the only difference is the OS: the GTX 465 machine runs Windows Server 2003 64-bit.
Can someone help me with this? Could it be that the 465 has fewer SMs?
The deviceQuery sample from the SDK reports only 11 SMs, compared with 27 on the GTX 260.

Best regards,

Not sure I have an answer for you, but…

First thing: are both cards in the same machine, and if so, are you sure which card the results are coming from?

Next: I have 285’s, 470’s and 1060’s, and I don’t always see a significant difference between the cards; it depends very much on what CUDA code I run. I’m a novice here, so someone will probably tell me why I’m wrong, but my experience has been that running lots of blocks with fewer threads each, rather than matching the number of blocks to the number of SMs, gives the Fermi card a massive speedup over the others, double the speed in some cases. I don’t know why this is and won’t pretend to understand it, but that’s the result I get. Play around with your own code and see what happens.

As far as I know, with more threads per block the chip is able to hide memory latency. The 465 also has more shared memory, so it should achieve better occupancy, at least in my application.

Can you please run those two samples I mentioned on the 470 and the 285?

I have two machines, so there is no way to mix up the numbers.

I don’t think the SDK examples have been reoptimized for Fermi yet.

The number of CUDA cores is almost double. Maybe there is something I missed and should double-check, but I’d expect something more.

Yes, I know, but for example shared memory has 32 banks on Fermi versus 16 on the older cards. If the code isn’t changed for this, you get a lot of bank conflicts, which hurt performance.

In all likelihood, the scan example is memory-bandwidth limited. The GTX 465 and GTX 260 have essentially the same memory bandwidth (both should be slightly over 100 GB/s), so it follows that performance in memory-bandwidth-limited code should be pretty much the same.

I see similar results for the eigenvalues example comparing a GTX 275 and a GTX 470, so there is probably something specific about that example that needs tuning. On my own codes I see roughly a 2x speedup with the GTX 470 compared to the GTX 275.

Have a look at this (bandwidthTest.exe, part of the SDK):

Running on…
Device 0: GeForce GTX 465
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1797.6

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1727.6

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                73305.7

[bandwidthTest] - Test results:
PASSED

and

Running on…
Device 0: GeForce GTX 260
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                551.0

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                752.5

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                113172.1

[bandwidthTest] - Test results:
PASSED

The 465’s device-to-device bandwidth is lower; if it uses GDDR5, how is this possible? I’ll check that sample more closely. Maybe that’s the root of my problem…

Those device to host and host to device numbers look really low, even for paged memory. What computer/OS/NVIDIA driver are you using?

Also, GDDR5 vs. GDDR3 is only one factor in memory bandwidth. The GTX 260 has a 448-bit wide memory bus, whereas the GTX 465 has a 256-bit wide memory bus.

The OS is Windows Server 2003 64-bit, and I installed the latest drivers from the NVIDIA site.

I’m also testing my code on different devices, and I found that compiling my kernel with sm_12 on my GTX 465 yields code more than 2.2 times faster than compiling it with sm_20.