gtx 465 performance

Hi everybody,

I ran into a strange issue. I developed a CUDA app on a GTX 260, bought a new GTX 465 a few days ago, and the performance was the same. I checked the samples from the SDK, and some of them, like scan and eigenvalues, give even worse results.

GTX 260:

SCAN
scan-Large, Throughput = 205.5178 MElements/s, Time = 0.00128 s, Size = 262144 Elements, NumDevsUsed = 1, Workgroup = 256

EIGENVALUES
Iterations to be timed: 100
Result filename: ‘eigenvalues.dat’
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 15.531688 ms
Average time step 2, one intervals: 4.953684 ms
Average time step 2, mult intervals: 0.017448 ms
Average time TOTAL: 20.831189 ms

GTX 465:

SCAN
scan-Large, Throughput = 203.7271 MElements/s, Time = 0.00129 s, Size = 262144 Elements, NumDevsUsed = 1, Workgroup = 256

EIGENVALUES
Result filename: ‘eigenvalues.dat’
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 37.504597 ms
Average time step 2, one intervals: 12.808976 ms
Average time step 2, mult intervals: 0.019804 ms
Average time TOTAL: 50.474590 ms

I use the latest version of the drivers; the only difference is the OS: the GTX 465 machine runs Windows Server 2003 64-bit.
Can someone help me with this? Could it be that the 465 has fewer SMs?
The deviceQuery sample from the SDK reports only 11 SMs, compared with 27 on the GTX 260.

Best regards,

Not sure I have an answer for you, but…

First thing: are both cards in the same machine, and if so, are you sure which card the results are coming from?

Next: I have 285’s, 470’s and 1060’s, and I don’t always see a significant difference between the cards; it depends very much on what CUDA code I run. I’m a novice here, so someone will probably tell me why I’m wrong, but my experience has been that running lots of blocks with fewer threads each, rather than matching the number of blocks to the number of SMs, gives the Fermi card a massive speedup over the others, double the speed in some cases. I don’t know why this is and won’t pretend to understand it, but that’s the result I get. Play around with your own code and see what happens.

As far as I know, with more threads per block the chip is able to hide memory latency. The 465 also has more shared memory, so it should achieve better occupancy, at least in my application.

Can you please run those two samples I mentioned on the 470 and the 285?

I have two machines, so there is no way to mix up the numbers.

I don’t think the SDK examples have been reoptimized for Fermi yet.

The number of CUDA cores is almost double. Maybe there is something I missed and should double-check, but I’d expect something more.

Yes, I know, but for example shared memory has 32 banks on Fermi versus 16 on the older cards. If the code isn’t changed for this, you get a lot of bank conflicts, which hurt performance.

In all likelihood, the scan example is memory-bandwidth limited. The GTX 465 and GTX 260 have essentially the same memory bandwidth (both should be slightly over 100 GB/s), so it follows that performance in memory-bandwidth-limited code should be pretty much the same.

I see similar results for the eigenvalues example comparing a GTX 275 and a GTX 470, so there is probably something specific about that example that needs tuning. On my own codes I see roughly a 2x speedup with the GTX 470 compared to the GTX 275.

Have a look at this (bandwidthTest.exe, part of the SDK):

Running on…
Device 0: GeForce GTX 465
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1797.6

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1727.6

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                73305.7

[bandwidthTest] - Test results:
PASSED

and

Running on…
Device 0: GeForce GTX 260
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                551.0

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                752.5

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                113172.1

[bandwidthTest] - Test results:
PASSED

The 465’s device-to-device bandwidth is lower; if it uses GDDR5, how is this possible? I’ll check that sample more closely. Maybe that’s the root of my problem…

Those device to host and host to device numbers look really low, even for paged memory. What computer/OS/NVIDIA driver are you using?

Also, GDDR5 vs. GDDR3 is only one factor in memory bandwidth. The GTX 260 has a 448-bit wide memory bus, whereas the GTX 465 has a 256-bit wide memory bus.

The OS is Windows Server 2003 64-bit, and I installed the latest drivers from the NVIDIA site.

I’m also testing my code on different devices, and I found that compiling my kernel with sm_12 on my GTX 465 yields code more than 2.2 times faster than compiling it with sm_20.