Perplexing CUDA performance experiment: GTX 560 (1GB) vs GTX 1050 Ti (4GB)

The setup:

  • GTX 560, 1GB GDDR5: desktop i3-2100 3.1GHz with 8GB RAM, Windows 7 64-bit
  • GTX 1050 Ti, 4GB GDDR5: notebook i7-7700HQ 2.8GHz with 16GB RAM, Windows 10 64-bit
  • Both machines use Visual Studio 2015
  • CUDA 8.0 on the GTX 560 machine, for CC 2.1 compatibility
  • CUDA 9.1 on the GTX 1050 Ti machine, compiled with the CC 6.1 setting

The test:

  • Two arrays A and B of 3 million double-precision numbers each were randomly generated, then copied to device arrays
  • An arbitrary calculation was performed on each pair of elements and the result stored to array C
  • A block size of 256 threads was used, since it gives 100% occupancy on both CC 2.1 and CC 6.1
  • The calculation load was split across 256 threads (1 block) on the 1st run, then 768 threads (3 blocks) on the 2nd, and so on, up to roughly 1 million threads (4097 blocks)
  • cudaDeviceSynchronize() was used when recording the computation time for each run (a minimal sketch of the harness follows this list)
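
A minimal sketch of the kind of harness described above follows; the kernel body, the names, and the initialization are placeholders rather than the actual test code:

#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

__global__ void calcKernel(const double* A, const double* B, double* C, int n)
{
    // grid-stride loop so any total thread count covers all n elements
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        C[i] = A[i] * B[i] + sin(A[i]);          // stand-in for the "arbitrary calculations"
}

int main()
{
    const int N = 3 * 1000 * 1000;               // 3 million doubles per array
    double *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(double));
    cudaMalloc(&dB, N * sizeof(double));
    cudaMalloc(&dC, N * sizeof(double));
    // (random host-side generation and cudaMemcpy of A and B omitted for brevity)

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int blocks = 1; blocks <= 4097; blocks += 2)    // 1, 3, 5, ... blocks of 256 threads
    {
        cudaEventRecord(start);
        calcKernel<<<blocks, 256>>>(dA, dB, dC, N);
        cudaEventRecord(stop);
        cudaDeviceSynchronize();                         // wait for the kernel before reading the timer
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%8d threads: %.3f ms\n", blocks * 256, ms);
    }
    return 0;
}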

Here is the result:

  • As expected, as the total thread count increases at first, the computation time drops quickly and drastically
  • The 1050 stabilizes at its best performance at around 315K threads
  • The 560 stabilizes at around 185K threads
  • Interestingly, as the thread count continues to increase, both cards show a very small but growing performance loss
  • Biggest surprise: the best stabilized time for the 1050 was around 34.5ms to 35ms; for the 560 it was around 35ms to 35.5ms

The 1050 has more than double the CUDA cores of the 560, not to mention other spec advantages. With this setup I expected the 1050 to outperform the 560 by an obvious margin, but that is definitely not what the results show. I would appreciate any explanation for the outcome of this experiment.

  • Your code is probably bandwidth bound. In that case the core counts and most other spec differences between the two GPUs don’t matter; only the memory bandwidth difference does. Run bandwidthTest on each to get a relative performance estimate
  • You might be building debug projects. That would be a mistake. Never benchmark debug code.
  • You might be making an error in timing measurement. Use a profiler to accurately measure kernel runtime.

My best guess is the first bullet above.

GTX 560 memory bandwidth: 128GB/s:

https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-560/specifications

GTX 1050m memory bandwidth: 112GB/s:

https://www.notebookcheck.net/NVIDIA-GeForce-GTX-1050-Notebook.178614.0.html

Even though the 1050 uses a much newer/faster memory technology, its bus width (128-bit) is only half that of the GTX 560 (256-bit).

Appreciate the response. To confirm a couple of things:

  • The test programs were always executed in Release mode
  • The timing measurement was done with cudaEventRecord

Following the theory that memory bandwidth is the cause, I made the calculation of (A[i], B[i]) → C[i] more complicated, on the assumption that memory bandwidth would then play less of a role in the timing (an illustrative sketch of that kind of change follows the list below). Strangely enough:

  • The GTX 560 actually outperformed the GTX 1050 in best stabilized execution time
  • While the test program was running, the GTX 560 desktop experienced significant lag in all other programs; in fact, Windows 7 prompted me to lower the Windows color setting
  • On the other hand, the GTX 1050 notebook ran perfectly smoothly while the test program was running
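
(For illustration only, not the actual modified kernel: a heavier per-element calculation along these lines keeps each thread busy with register-only arithmetic between the two loads and the one store.)

__global__ void heavyCalcKernel(const double* A, const double* B, double* C, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
    {
        double x = A[i], y = B[i], acc = 0.0;
        for (int k = 1; k <= 100; ++k)           // extra arithmetic that touches only registers
            acc += sin(x * k) * cos(y + k);
        C[i] = acc;                              // memory traffic per element is unchanged
    }
}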

It almost seems like the 1050 is not working at full capacity for some reason.
Any thoughts?

nvidia-smi actually shows 100% GPU utilization for the 1050 during testing.
Too bad the GTX 560 does not support nvidia-smi utilization monitoring.

Also, the lag on the GTX 560 desktop could be due to overheating.

For the bandwidth test:

Device: GeForce GTX 560
Transfer size (MB): 16

Pageable transfers
Host to Device bandwidth (GB/s): 3.282812
Device to Host bandwidth (GB/s): 3.109435

Pinned transfers
Host to Device bandwidth (GB/s): 6.394224
Device to Host bandwidth (GB/s): 6.212090

Device: GeForce GTX 1050 Ti
Transfer size (MB): 16

Pageable transfers
Host to Device bandwidth (GB/s): 4.326380
Device to Host bandwidth (GB/s): 4.687840

Pinned transfers
Host to Device bandwidth (GB/s): 6.395004
Device to Host bandwidth (GB/s): 6.468220

I wouldn’t expect a simple/casual modification to the calculations to convert the memory bound code to a compute bound code unless you have some expertise with this. Run a comparison with sgemm from cublas if you wish to do that.
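
For what it is worth, a minimal sketch of such a comparison might look like the following; timeSgemm and the surrounding setup are illustrative, and it assumes square n x n single-precision matrices already filled on the device:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: time one SGEMM as a compute-bound reference point.
void timeSgemm(const float* dA, const float* dB, float* dC, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // roughly 2*n^3 floating-point operations in an n x n matrix multiply
    printf("SGEMM %d x %d: %.3f ms (%.1f GFLOP/s)\n",
           n, n, ms, 2.0e-6 * n * (double)n * n / ms);

    cublasDestroy(handle);
}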

The bandwidthTest number we are interested in here is the device-to-device bandwidth, which you haven’t shown. We want a measure of the memory bandwidth. What you have shown is PCIE bandwidth. Also the default bandwidthTest transfer size is 32MB not 16MB, and the output looks different than what you are showing here, so I’m not really sure what you are doing.

The bandwidth test program was taken from https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/
I modified it so it tests device-to-device speed (a rough sketch of the timing approach appears after the results):

Device: GeForce GTX 560
Transfer size (MB): 32
Device to Device (GB/s): 53.764859
Device to Device (GB/s): 53.972411

Device: GeForce GTX 1050 Ti
Transfer size (MB): 32
Device to Device (GB/s): 35.045989
Device to Device (GB/s): 35.083511
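
(For reference, a device-to-device measurement of this kind boils down to timing a cudaMemcpy with the cudaMemcpyDeviceToDevice kind; a rough sketch, not the exact modified code:)

#include <cuda_runtime.h>

// Rough sketch of a device-to-device bandwidth measurement.
double measureD2D(size_t bytes)
{
    void *d_src, *d_dst;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaFree(d_src);
    cudaFree(d_dst);
    // each byte is read once and written once, so the traffic is counted twice
    return 2.0 * bytes / ms / 1.0e6;             // GB/s
}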

You were definitely right about memory speed.
So if this is the case, does it mean the only real advantage the GTX 1050 Ti has over the 560 is in computation-heavy tasks, or in work that requires more than 1GB of RAM?

Pascal has

- faster atomic operations
- warp shuffle (a brief illustration appears below)
- more advanced and bigger L1/L2 caches
- a bigger register file
- better performance per Watt

Overall, the GTX 560 Fermi card is based on an obsolete architecture (support for it may already be deprecated in the latest CUDA 9 toolkits).
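
For instance, the warp shuffle intrinsics (available from CC 3.0 on, so on the 1050 Ti but not on the 560) let threads in a warp exchange register values directly; a minimal, illustrative warp-sum helper:

// Sum a value across the 32 threads of a warp without touching shared memory.
// Uses the *_sync variant of the shuffle intrinsic introduced with CUDA 9.
__device__ double warpSum(double val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;                                  // lane 0 ends up holding the sum of all 32 lanes
}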

I was referring to the CUDA sample codes, which get installed with the CUDA toolkit. The sample codes include one called bandwidthTest which will test device-to-device memory bandwidth without any modification.
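
(If I remember the sample's switches correctly, the device-to-device measurement can also be selected on its own by running something like: bandwidthTest.exe --dtod)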

Here you go:

Device 0: GeForce GTX 560
Quick Mode

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes): 33554432
Bandwidth(MB/s): 104957.8

Device 0: GeForce GTX 1050 Ti
Quick Mode

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes): 33554432
Bandwidth(MB/s): 68132.0

You are right about it being deprecated in CUDA 9.
It seems the biggest lesson here is to minimize reads and writes to global (device) memory while making the best use of shared memory or constant memory (a small illustration follows below).
Obviously I am pretty new to CUDA programming. This has been quite helpful.
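
As a small illustration of that lesson (shared memory only helps where threads actually reuse each other's data; a purely element-wise calculation like the test above has no reuse to exploit), a hypothetical kernel with neighbour reuse can stage a tile in shared memory so each input value is read from global memory roughly once per block:

// Simplified example: a 1-D three-point average that stages a tile of the input in
// shared memory instead of reading each neighbour from global memory separately.
// Assumes a block size of 256; out-of-range neighbours are treated as zero for brevity.
__global__ void smoothKernel(const double* in, double* out, int n)
{
    __shared__ double tile[256 + 2];                       // block plus one halo cell on each side
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;

    tile[t] = (i < n) ? in[i] : 0.0;
    if (threadIdx.x == 0)              tile[0]     = (i > 0)     ? in[i - 1] : 0.0;
    if (threadIdx.x == blockDim.x - 1) tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0;
    __syncthreads();

    if (i < n)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0;
}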

Also, your 10xx card will be locked into the P2 power state, not P0, during compute tasks.
Check nvidia-smi for proof (an example query is below).
Then go get nvidiaprofileinspector (Release v3.5.0.0 · DeadManWalkingTO/NVidiaProfileInspectorDmW · GitHub), disable the “Force P2” garbage, and reboot.
Then enjoy all sorts of performance, because it will run in P0 (P2 usually means underclocked memory, meant for “super accuracy” rather than speed).
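
(Something along the lines of nvidia-smi --query-gpu=pstate,clocks.current.memory --format=csv while a kernel is running should show the performance state and the current memory clock.)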

Earlier cards did not have P2 locking silliness.

That should get your memory bandwidth back up, at least, but the two cards may still just about tie.