Had the opportunity through Nvidia and their partner AMAX to test out the Tesla K40x for a day.
Other than the obvious 2x increase in host-device bandwidth (PCI-E 3.0) and the increase in device memory from 5 GB to 12 GB, here are some tests I ran on my home machine (K20c) vs. the AMAX K40x setup:
Comparison of Nvidia Tesla K20c vs. Nvidia Tesla K40x (calculation speed)
Windows 7 64-bit with TCC driver for Tesla, Intel i7-3770K 3.9 GHz
NOTE: All times include device memory allocations, host-device copies, device-device copies, device-host copies and de-allocations
Test                                          K20c         K40x         speedup

sorting (thrust 32 bit floats)
  134,217,728 elements array 32 bit floats    0.51 sec     0.341 sec    33%
  268,435,456 elements array 32 bit floats    1.01 sec     0.684 sec    32%

sorting (thrust 64 bit doubles)
  16,777,216 elements array 64 bit doubles    0.14 sec     0.102 sec    27%
  33,554,432 elements array 64 bit doubles    0.27 sec     0.2 sec      26%
  67,108,864 elements array 64 bit doubles    0.552 sec    0.397 sec    28%

brute force all permutations in local memory, then evaluation, optimization, scan, and reduction of best answer and respective permutation
  13! fact in memory, evaluate, scan, reduce  23.6 sec     20.0 sec     18%
  14! fact in memory, evaluate, scan, reduce  415.5 sec    327.48 sec   21%
  15! fact in memory, evaluate, scan, reduce  6374 sec     5546.4 sec   13%

32 bit All-Pairs-Shortest-Path (with full path reconstruction using two adjacency matrices)
apx 25% non-zeros in full dense matrix form (32 bit integer)
  Floyd Warshall dense graph 4000 vertices    5.09 sec     4.17 sec     18%
  Floyd Warshall dense graph 8000 vertices    39.1 sec     32.09 sec    18%
  Floyd Warshall dense graph 10000 vertices   77.8 sec     64.38 sec    17.2%
  Floyd Warshall dense graph 11111 vertices   108.01 sec   88.55 sec    18%
  Floyd Warshall dense graph 15000 vertices   NA           211.4 sec    NA
  Floyd Warshall dense graph 20000 vertices   NA           490.4 sec    NA
note: NA because not enough CPU (host) memory to test on that PC

32 bit multi-step dynamic programming problem (hybrid cpu/gpu solution)
  Dynamic Problem MEGA (50 x 20)              93 ms        83 ms        9%
  Dynamic Problem MEGA (50 x 22)              405 ms       365 ms       9.8%
  Dynamic Problem MEGA (50 x 23)              849 ms       765 ms       9.8%

64 bit dynamic programming problem (hybrid cpu/gpu solution)
  Cities=100, K=60, Fans=100                  101 ms       85 ms        15.8%
  Cities=120, K=80, Fans=120                  200 ms       188 ms       6%

ADMM Group Lasso (32 bit floats) convex optimization
  m=1536, n=4096, K(num_blocks)=20            62 ms        49 ms        21%
  m=8192, n=1024, K(num_blocks)=8             40 ms        33 ms        17%
  m=16384, n=1024, K(num_blocks)=8            61 ms        51 ms        16%
  m=512, n=8192, K(num_blocks)=60             31 ms        25 ms        19%
  m=512, n=16384, K(num_blocks)=120           64 ms        58 ms        9%
  m=768, n=32768, K(num_blocks)=250           147 ms       111 ms       24%
Please excuse the above formatting blunders.
In general, the Tesla K40x (using the Windows 7 TCC driver) was about 20-30% faster than the K20c. The largest difference was in sorting 32-bit floats (32-33%). cuBLAS was about 25% faster on average, which can be seen in the group lasso times.
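For anyone who wants to reproduce the sorting rows on their own card, here is a minimal sketch of how such a timing could be taken so that, as in the note above the table, it covers the device allocation, both copies, the sort and the de-allocation. This is not the code behind the numbers in this post; `timed_gpu_sort` is a made-up name, and wall-clock timing via `std::chrono` is an assumption about how one might measure it.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

#include <chrono>
#include <vector>

// Hypothetical helper: sorts `host` in place via the GPU and returns the
// elapsed wall-clock seconds for the whole round trip.
double timed_gpu_sort(std::vector<float>& host) {
    auto t0 = std::chrono::steady_clock::now();
    {
        // Device allocation plus host-to-device copy.
        thrust::device_vector<float> dev(host.begin(), host.end());
        // Sort on the device.
        thrust::sort(dev.begin(), dev.end());
        // Device-to-host copy; this blocks until the sort has finished.
        thrust::copy(dev.begin(), dev.end(), host.begin());
    }   // dev goes out of scope here, so the de-allocation is timed too.
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

The very first CUDA call in a process also pays for context creation, so it is worth calling the helper once on a small array before taking the measurement you care about.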
When it came down to the dynamic programming problems, there was less of a speedup, mainly because the host CPU clock speed of my home PC is faster than the remote machine's. Since up to 33% of that type of work is done on the host, the overall speedup was not as great.
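A rough back-of-the-envelope model shows why the hybrid numbers come out lower. The sketch below is host-only code (it compiles as plain C++ or inside a .cu file). The 33% host share comes from the paragraph above and the 25% GPU saving is picked from the 20-30% range quoted earlier; the `host_scale` of 1.1 (a host that is 10% slower) is purely a made-up value for illustration, as is the function name `overall_saving`.

```cuda
// Fraction of the total runtime saved when only the GPU part gets faster.
//   host_share : fraction of the baseline runtime spent on the CPU
//   gpu_saving : fractional time reduction of the GPU part (0.25 = 25% less time)
//   host_scale : CPU time on the new machine relative to the baseline
//                (1.0 = same CPU, 1.1 = 10% slower; hypothetical input)
double overall_saving(double host_share, double gpu_saving, double host_scale) {
    double new_time = host_share * host_scale
                    + (1.0 - host_share) * (1.0 - gpu_saving);
    return 1.0 - new_time;   // baseline runtime is normalised to 1.0
}
```

With these inputs, `overall_saving(0.33, 0.25, 1.0)` gives 0.1675 and `overall_saving(0.33, 0.25, 1.1)` gives 0.1345, so a 25% gain on the GPU side shrinks to roughly 13-17% overall once a third of the work sits on the host.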
My applications do not have frequent host-device data transfers; an application that does should see an even larger gain from the K40x setup, given the higher PCI-E 3.0 bandwidth. Also, the 12 GB of memory is useful for large combinatorial/brute-force problems.
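To put a number on the memory side: Floyd-Warshall with path reconstruction, as in the table, works on two N x N 32-bit matrices. If both are kept on the device (an assumption, since the layout is not shown here), that is 8 * N * N bytes, or about 3.2 GB at N = 20000. The sketch below is one straightforward way to write it and is not the code behind the timings; `fw_step`, `floyd_warshall` and `reconstruct_path` are made-up names, the second matrix is taken to be a next-hop matrix, and edge weights are assumed non-negative with a zero diagonal.

```cuda
#include <cuda_runtime.h>

#include <climits>
#include <vector>

// "No edge" marker; halved so that INF + INF still fits in an int.
constexpr int INF = INT_MAX / 2;

// One Floyd-Warshall relaxation for a fixed intermediate vertex k.
// dist[i*n+j]: best known cost i -> j. next[i*n+j]: first hop on that path, -1 if none.
// Row k and column k do not change during step k (dist[k][k] == 0, no negative
// cycles), so threads may read them while other entries are being updated.
__global__ void fw_step(int* dist, int* next, int n, int k) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n) return;
    int via = dist[i * n + k] + dist[k * n + j];
    if (via < dist[i * n + j]) {
        dist[i * n + j] = via;
        next[i * n + j] = next[i * n + k];
    }
}

// Runs all n steps on the GPU; dist and next are n*n row-major, updated in place.
// Error checking is omitted to keep the sketch short.
void floyd_warshall(std::vector<int>& dist, std::vector<int>& next, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(int);
    int* d_dist = nullptr;
    int* d_next = nullptr;
    cudaMalloc(&d_dist, bytes);
    cudaMalloc(&d_next, bytes);
    cudaMemcpy(d_dist, dist.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_next, next.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    for (int k = 0; k < n; ++k)   // launches on the same stream run in order
        fw_step<<<grid, block>>>(d_dist, d_next, n, k);
    // These blocking copies also wait for the last kernel to finish.
    cudaMemcpy(dist.data(), d_dist, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(next.data(), d_next, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_dist);
    cudaFree(d_next);
}

// Follows next-hop entries from u to v; returns an empty vector if v is unreachable.
std::vector<int> reconstruct_path(const std::vector<int>& next, int n, int u, int v) {
    std::vector<int> path;
    if (next[u * n + v] < 0) return path;
    path.push_back(u);
    while (u != v) {
        u = next[u * n + v];
        path.push_back(u);
    }
    return path;
}
```

The caller initialises `dist[i][j]` to the edge weight (or `INF`), `dist[i][i]` to 0, and `next[i][j]` to `j` wherever an edge exists (`next[i][i] = i`, otherwise -1). A tiled, shared-memory version would be the usual next step for speed; this one only shows the data flow.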
Much of the code uses newer features like __shfl() and I believe those times are rather good, especially for Windows.
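For readers who have not used the shuffle intrinsics, here is a minimal sketch of a warp-level sum that passes values between registers instead of going through shared memory. It is an illustration only, not the reduction used for the timings above. One caveat: the code in this post was written against `__shfl()`, while the sketch uses `__shfl_down_sync()`, the form required since CUDA 9. The names `warp_sum`, `sum_kernel` and `device_sum` are made up, and error checking is left out.

```cuda
#include <cuda_runtime.h>

#include <algorithm>
#include <vector>

// Adds up `v` across the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Grid-stride partial sums per thread, then one atomicAdd per warp.
// blockDim.x must be a multiple of 32 so that every warp is full.
__global__ void sum_kernel(const float* in, float* out, int n) {
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];
    v = warp_sum(v);
    if ((threadIdx.x & 31) == 0) atomicAdd(out, v);
}

// Hypothetical host wrapper: copies the data over, reduces it, returns the sum.
float device_sum(const std::vector<float>& host) {
    int n = static_cast<int>(host.size());
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, std::max(n, 1) * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));   // all-zero bytes is 0.0f
    const int threads = 256;
    int blocks = std::max(1, std::min((n + threads - 1) / threads, 1024));
    sum_kernel<<<blocks, threads>>>(d_in, d_out, n);
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The same pattern carries over to a "best answer plus its index" reduction by shuffling both the value and the index and keeping the better pair at each step.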
Will be posting a more detailed blog post later this week, with source code.