GTX 560 1G GDDR: Desktop i3-2100 3.1GHz with 8GB RAM, Windows 7 64bit
GTX 1050 TI 4G GDDR: Notebook i7-7700HQ 2.8GHz with 16GB RAM, Windows 10 64bit
Both machines use Visual Studio 2015
CUDA 8.0 for GTX 560 machine for CC 2.1 compatibility
CUDA 9.1 for GTX 1050 machine, compiled with CC 6.1 setting
The test:
Two arrays A and B of 3 million doubles were randomly generated, then copied to CUDA arrays
Arbitrary calculations were performed for each pair of elements then stored to array C
A block size of 256 threads was used, since it gives 100% occupancy on both CC 2.1 and 6.1
The calculation load was split between 256 threads (1 block) on the 1st run, then 768 (3 blocks) on the 2nd run, then... until 1 million threads or 4097 blocks
cudaDeviceSynchronize() was used to record the computation time for each run
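The 100% occupancy claim for 256-thread blocks can be checked with a little arithmetic. The sketch below is only a host-side estimate, not the poster's code: the names SmLimits and threadLimitedOccupancy are made up for illustration, and it considers only the per-SM thread and block limits (1536 threads and 8 blocks for CC 2.x, 2048 threads and 32 blocks for CC 6.1, per the CUDA programming guide). Register and shared memory usage can lower the real figure.

```cpp
#include <algorithm>

// Per-multiprocessor residency limits for one compute capability.
// (Hypothetical helper type, introduced for this illustration only.)
struct SmLimits {
    int maxThreadsPerSm;  // maximum resident threads per multiprocessor
    int maxBlocksPerSm;   // maximum resident blocks per multiprocessor
};

// Thread-limited occupancy: the fraction of an SM's thread slots that can be
// filled with blocks of the given size. Assumption: registers and shared
// memory are not the limiting resource, so they are ignored here.
double threadLimitedOccupancy(const SmLimits& sm, int blockSize) {
    int residentBlocks = std::min(sm.maxBlocksPerSm, sm.maxThreadsPerSm / blockSize);
    return static_cast<double>(residentBlocks * blockSize) / sm.maxThreadsPerSm;
}
```

Calling threadLimitedOccupancy(SmLimits{1536, 8}, 256) for CC 2.1 and threadLimitedOccupancy(SmLimits{2048, 32}, 256) for CC 6.1 both give 1.0, which agrees with the choice of 256 threads per block. A very small block size such as 64 would be capped by the 8-block limit on CC 2.1.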
Here is the result:
As expected, the computation time dropped quickly and drastically as the total thread count first increased
1050 stabilizes to its best performance with around 315K threads
560 stabilizes at around 185K threads
Interestingly, as the thread count continued to increase, both cards experienced a (very small but increasing) performance loss
Biggest surprise: best stabilized performance for 1050 was around 34.5ms to 35ms; for 560 was around 35ms to 35.5ms
The 1050 has more than double the CUDA cores of the 560, not to mention the other specs. With these setups, I expected the 1050 to outperform the 560 by an obvious margin. But this is definitely not what the result shows. I would appreciate any explanation for the result of this experiment.
The code is probably bandwidth bound, so core count and most other spec differences between the two GPUs don’t matter; only the memory bandwidth difference does. Run bandwidthTest on each to get a relative perf estimate.
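One way to reason about a bandwidth-bound kernel is to work out the least time the memory traffic alone could take. The sketch below is a rough host-side estimate under stated assumptions, not a measurement: it assumes every element of A and B is read once and every element of C is written once (three arrays touched), with 3 million elements per array, and the function name minTransferMs is made up for illustration.

```cpp
#include <cstddef>

// Lower bound on kernel time, in milliseconds, if device-memory traffic were
// the only cost. bandwidthGBs is in GB/s with 1 GB = 1e9 bytes.
// Assumption: each element of each array is transferred exactly once.
double minTransferMs(std::size_t elements, std::size_t bytesPerElement,
                     int arraysTouched, double bandwidthGBs) {
    double bytes = static_cast<double>(elements) * bytesPerElement * arraysTouched;
    return bytes / (bandwidthGBs * 1e9) * 1e3;  // seconds -> milliseconds
}
```

For example, minTransferMs(3000000, sizeof(double), 3, bw) takes the device-to-device figure reported by bandwidthTest as bw. Comparing a measured kernel time against this floor shows how close the kernel is to the bandwidth limit.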
You might be building debug projects. That would be a mistake. Never benchmark debug code.
You might be making an error in timing measurement. Use a profiler to accurately measure kernel runtime.
Appreciate the response. To confirm a couple of things:
The test programs were always executed in Release mode
The timing measurement was done with cudaEventRecord
To test the theory that memory bandwidth is the cause, I made the calculation of (A[i] and B[i]) → C[i] more complicated, on the assumption that memory bandwidth would then play less of a role in the timing. Strangely enough:
GTX560 actually outperformed GTX1050 with the best stabilized execution time
While the test program was running, the GTX560 desktop experienced significant lag in all other programs. In fact, Windows 7 prompted me to lower the Windows Color Setting.
On the other hand, the GTX1050 notebook ran perfectly smoothly while the test program was running.
It almost seems like 1050 is not working at full capacity for some reason.
Any thoughts?
I wouldn’t expect a simple/casual modification to the calculations to convert the memory bound code to a compute bound code unless you have some expertise with this. Run a comparison with sgemm from cublas if you wish to do that.
The bandwidthTest number we are interested in here is the device-to-device bandwidth, which you haven’t shown. We want a measure of the memory bandwidth. What you have shown is PCIE bandwidth. Also the default bandwidthTest transfer size is 32MB not 16MB, and the output looks different than what you are showing here, so I’m not really sure what you are doing.
Device: GeForce GTX 560
Transfer size (MB): 32
Device to Device (GB/s): 53.764859
Device to Device (GB/s): 53.972411
Device: GeForce GTX 1050 Ti
Transfer size (MB): 32
Device to Device (GB/s): 35.045989
Device to Device (GB/s): 35.083511
You were definitely right about memory speed.
So if this is the case, does it mean the only real advantage the GTX 1050 Ti has over the 560 is in computation-heavy tasks? Or in certain work that requires more than 1GB of RAM?
I was referring to the CUDA sample codes, which get installed with the CUDA toolkit. The sample codes include one called bandwidthTest which will test device-to-device memory bandwidth without any modification.
Device 0: GeForce GTX 1050 Ti
Quick Mode
…
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes): 33554432
Bandwidth(MB/s): 68132.0
You are right about the deprecation in CUDA 9.
It seems the biggest lesson here is to minimize read/write to RAM while making the best use of shared memory or constant memory.
Obviously I am pretty new to CUDA programming. This has been quite helpful.
Also your 10xx will be locked into P2 mode not P0 during compute tasks
check nvidia-smi for proof
then go get nvidiaprofileinspector (Release v3.5.0.0 · DeadManWalkingTO/NVidiaProfileInspectorDmW · GitHub) and disable the “Force P2” garbage and reboot
then enjoy all sorts of performance because it will run P0 (P2 is usually underclocked memory / for “super accuracy” not speed)
Earlier cards did not have P2 locking silliness.
That should get your memory bandwidth back up at least but they may still just about tie.