Perplexing CUDA performance experiment: GTX 560 (1GB) vs GTX 1050 Ti (4GB)

The setup:

  • GTX 560, 1GB GDDR5: desktop i3-2100 3.1GHz with 8GB RAM, Windows 7 64-bit
  • GTX 1050 Ti, 4GB GDDR5: notebook i7-7700HQ 2.8GHz with 16GB RAM, Windows 10 64-bit
  • Both machines use Visual Studio 2015
  • CUDA 8.0 on the GTX 560 machine, for CC 2.1 compatibility
  • CUDA 9.1 on the GTX 1050 Ti machine, compiled with the CC 6.1 setting

The test:

  • Two arrays A and B of 3 million double-precision numbers each were randomly generated, then copied to device arrays
  • An arbitrary calculation was performed on each pair of elements and the result stored to array C
  • A block size of 256 threads was used, since it gives 100% occupancy on both CC 2.1 and CC 6.1
  • The calculation load was split across 256 threads (1 block) on the 1st run, then 768 threads (3 blocks) on the 2nd, and so on, up to roughly 1 million threads (4097 blocks)
  • cudaDeviceSynchronize() was used when recording the computation time for each run (a minimal sketch of the harness follows this list)
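
A minimal sketch of the kind of harness described above follows; the kernel body, the names, and the initialization are placeholders rather than the actual test code:

#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

__global__ void calcKernel(const double* A, const double* B, double* C, int n)
{
    // grid-stride loop so any total thread count covers all n elements
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        C[i] = A[i] * B[i] + sin(A[i]);          // stand-in for the "arbitrary calculations"
}

int main()
{
    const int N = 3 * 1000 * 1000;               // 3 million doubles per array
    double *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(double));
    cudaMalloc(&dB, N * sizeof(double));
    cudaMalloc(&dC, N * sizeof(double));
    // (random host-side generation and cudaMemcpy of A and B omitted for brevity)

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int blocks = 1; blocks <= 4097; blocks += 2)    // 1, 3, 5, ... blocks of 256 threads
    {
        cudaEventRecord(start);
        calcKernel<<<blocks, 256>>>(dA, dB, dC, N);
        cudaEventRecord(stop);
        cudaDeviceSynchronize();                         // wait for the kernel before reading the timer
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%8d threads: %.3f ms\n", blocks * 256, ms);
    }
    return 0;
}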

Here is the result:

  • As expected, as the total thread count increases at first, the computation time drops quickly and drastically
  • The 1050 stabilizes at its best performance at around 315K threads
  • The 560 stabilizes at around 185K threads
  • Interestingly, as the thread count continues to increase, both cards show a very small but growing performance loss
  • Biggest surprise: the best stabilized time for the 1050 was around 34.5ms to 35ms; for the 560 it was around 35ms to 35.5ms

The 1050 has more than double the CUDA cores of the 560, not to mention other spec advantages. With this setup I expected the 1050 to outperform the 560 by an obvious margin, but that is definitely not what the results show. I would appreciate any explanation for the outcome of this experiment.

  • Your code is probably bandwidth bound. In that case the core counts and most other spec differences between the two GPUs don’t matter; only the memory bandwidth difference does. Run bandwidthTest on each to get a relative performance estimate
  • You might be building debug projects. That would be a mistake. Never benchmark debug code.
  • You might be making an error in timing measurement. Use a profiler to accurately measure kernel runtime.

My best guess is the first bullet above.

GTX 560 memory bandwidth: 128GB/s:

https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-560/specifications

GTX 1050m memory bandwidth: 112GB/s:

https://www.notebookcheck.net/NVIDIA-GeForce-GTX-1050-Notebook.178614.0.html

Even though the 1050 uses a much newer/faster memory technology, its bus width (128-bit) is only half that of the GTX 560 (256-bit).

Appreciate the response. To confirm a couple of things:

  • The test programs were always executed in Release mode
  • The timing measurement was done with cudaEventRecord

Following the theory that memory bandwidth is the cause, I made the calculation of (A[i], B[i]) → C[i] more complicated, on the assumption that memory bandwidth would then play less of a role in the timing (an illustrative sketch of that kind of change follows the list below). Strangely enough:

  • The GTX 560 actually outperformed the GTX 1050 in best stabilized execution time
  • While the test program was running, the GTX 560 desktop experienced significant lag in all other programs; in fact, Windows 7 prompted me to lower the Windows color setting
  • On the other hand, the GTX 1050 notebook ran perfectly smoothly while the test program was running
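
(For illustration only, not the actual modified kernel: a heavier per-element calculation along these lines keeps each thread busy with register-only arithmetic between the two loads and the one store.)

__global__ void heavyCalcKernel(const double* A, const double* B, double* C, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
    {
        double x = A[i], y = B[i], acc = 0.0;
        for (int k = 1; k <= 100; ++k)           // extra arithmetic that touches only registers
            acc += sin(x * k) * cos(y + k);
        C[i] = acc;                              // memory traffic per element is unchanged
    }
}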

It almost seems like the 1050 is not working at full capacity for some reason.
Any thoughts?

nvidia-smi actually shows 100% GPU utilization for the 1050 during testing.
Too bad the GTX 560 does not support nvidia-smi utilization monitoring.

Also, the lag on the GTX 560 desktop could be due to overheating.

For the bandwidth test:

Device: GeForce GTX 560
Transfer size (MB): 16

Pageable transfers
Host to Device bandwidth (GB/s): 3.282812
Device to Host bandwidth (GB/s): 3.109435

Pinned transfers
Host to Device bandwidth (GB/s): 6.394224
Device to Host bandwidth (GB/s): 6.212090

Device: GeForce GTX 1050 Ti
Transfer size (MB): 16

Pageable transfers
Host to Device bandwidth (GB/s): 4.326380
Device to Host bandwidth (GB/s): 4.687840

Pinned transfers
Host to Device bandwidth (GB/s): 6.395004
Device to Host bandwidth (GB/s): 6.468220

I wouldn’t expect a simple/casual modification to the calculations to convert the memory bound code to a compute bound code unless you have some expertise with this. Run a comparison with sgemm from cublas if you wish to do that.
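
For what it is worth, a minimal sketch of such a comparison might look like the following; timeSgemm and the surrounding setup are illustrative, and it assumes square n x n single-precision matrices already filled on the device:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: time one SGEMM as a compute-bound reference point.
void timeSgemm(const float* dA, const float* dB, float* dC, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // roughly 2*n^3 floating-point operations in an n x n matrix multiply
    printf("SGEMM %d x %d: %.3f ms (%.1f GFLOP/s)\n",
           n, n, ms, 2.0e-6 * n * (double)n * n / ms);

    cublasDestroy(handle);
}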

The bandwidthTest number we are interested in here is the device-to-device bandwidth, which you haven’t shown. We want a measure of the memory bandwidth. What you have shown is PCIE bandwidth. Also the default bandwidthTest transfer size is 32MB not 16MB, and the output looks different than what you are showing here, so I’m not really sure what you are doing.

The bandwidth test program was taken from https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/
I modified it so it tests device-to-device speed (a rough sketch of the timing approach appears after the results):

Device: GeForce GTX 560
Transfer size (MB): 32
Device to Device (GB/s): 53.764859
Device to Device (GB/s): 53.972411

Device: GeForce GTX 1050 Ti
Transfer size (MB): 32
Device to Device (GB/s): 35.045989
Device to Device (GB/s): 35.083511
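
(For reference, a device-to-device measurement of this kind boils down to timing a cudaMemcpy with the cudaMemcpyDeviceToDevice kind; a rough sketch, not the exact modified code:)

#include <cuda_runtime.h>

// Rough sketch of a device-to-device bandwidth measurement.
double measureD2D(size_t bytes)
{
    void *d_src, *d_dst;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaFree(d_src);
    cudaFree(d_dst);
    // each byte is read once and written once, so the traffic is counted twice
    return 2.0 * bytes / ms / 1.0e6;             // GB/s
}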

You were definitely right about memory speed.
So if this is the case, does it mean the only real advantage the GTX 1050 Ti has over the 560 is in computation-heavy tasks, or in work that requires more than 1GB of RAM?

Pascal has

- faster atomic operations
- warp shuffle (a brief illustration appears below)
- more advanced and bigger L1/L2 caches
- a bigger register file
- better performance per Watt

Overall, the GTX 560 Fermi card is based on an obsolete architecture (support for it may already be deprecated in the latest CUDA 9 toolkits).
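
For instance, the warp shuffle intrinsics (available from CC 3.0 on, so on the 1050 Ti but not on the 560) let threads in a warp exchange register values directly; a minimal, illustrative warp-sum helper:

// Sum a value across the 32 threads of a warp without touching shared memory.
// Uses the *_sync variant of the shuffle intrinsic introduced with CUDA 9.
__device__ double warpSum(double val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;                                  // lane 0 ends up holding the sum of all 32 lanes
}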

I was referring to the CUDA sample codes, which get installed with the CUDA toolkit. The sample codes include one called bandwidthTest which will test device-to-device memory bandwidth without any modification.
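
(If I remember the sample's switches correctly, the device-to-device measurement can also be selected on its own by running something like: bandwidthTest.exe --dtod)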

Here you go:

Device 0: GeForce GTX 560
Quick Mode

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes): 33554432
Bandwidth(MB/s): 104957.8

Device 0: GeForce GTX 1050 Ti
Quick Mode

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes): 33554432
Bandwidth(MB/s): 68132.0

You are right about it being deprecated in CUDA 9.
It seems the biggest lesson here is to minimize reads and writes to global (device) memory while making the best use of shared memory or constant memory (a small illustration follows below).
Obviously I am pretty new to CUDA programming. This has been quite helpful.
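
As a small illustration of that lesson (shared memory only helps where threads actually reuse each other's data; a purely element-wise calculation like the test above has no reuse to exploit), a hypothetical kernel with neighbour reuse can stage a tile in shared memory so each input value is read from global memory roughly once per block:

// Simplified example: a 1-D three-point average that stages a tile of the input in
// shared memory instead of reading each neighbour from global memory separately.
// Assumes a block size of 256; out-of-range neighbours are treated as zero for brevity.
__global__ void smoothKernel(const double* in, double* out, int n)
{
    __shared__ double tile[256 + 2];                       // block plus one halo cell on each side
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;

    tile[t] = (i < n) ? in[i] : 0.0;
    if (threadIdx.x == 0)              tile[0]     = (i > 0)     ? in[i - 1] : 0.0;
    if (threadIdx.x == blockDim.x - 1) tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0;
    __syncthreads();

    if (i < n)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0;
}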

Also, your 10xx card will be locked into the P2 power state, not P0, during compute tasks.
Check nvidia-smi for proof (an example query is below).
Then go get nvidiaprofileinspector (Release v3.5.0.0 · DeadManWalkingTO/NVidiaProfileInspectorDmW · GitHub), disable the “Force P2” garbage, and reboot.
Then enjoy all sorts of performance, because it will run in P0 (P2 usually means underclocked memory, meant for “super accuracy” rather than speed).
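
(Something along the lines of nvidia-smi --query-gpu=pstate,clocks.current.memory --format=csv while a kernel is running should show the performance state and the current memory clock.)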

Earlier cards did not have P2 locking silliness.

That should get your memory bandwidth back up, at least, but the two cards may still just about tie.