Had the opportunity through Nvidia and their partner AMAX to test out the Tesla K40x for a day.
Other than the obvious 2x increase in host-device bandwidth (PCI-E 3.0) and the increase in memory from 5 GB to 12 GB, here are some tests I ran on my home machine (K20c) vs. the AMAX K40x setup:
Comparison of Nvidia Tesla K20c vs. Nvidia Tesla K40x (calculation speed)
Windows 7 64-bit with TCC driver for Tesla, Intel i7-3770K 3.9 GHz
NOTE: All times include device memory allocations, host-device copies, device-device copies, device-host copies and de-allocations.

Test                                            K20c         K40x          speedup

Sorting (Thrust, 32-bit floats)
  134,217,728 elements                          0.51 sec     0.341 sec     33%
  268,435,456 elements                          1.01 sec     0.684 sec     32%

Sorting (Thrust, 64-bit doubles)
  16,777,216 elements                           0.14 sec     0.102 sec     27%
  33,554,432 elements                           0.27 sec     0.2 sec       26%
  67,108,864 elements                           0.552 sec    0.397 sec     28%

Brute force: all permutations in local memory, then evaluation, optimization, scan, and reduction of best answer and respective permutation
  13! in memory, evaluate, scan, reduce         23.6 sec     20.0 sec      18%
  14! in memory, evaluate, scan, reduce         415.5 sec    327.48 sec    21%
  15! in memory, evaluate, scan, reduce         6374 sec     5546.4 sec    13%

32-bit All-Pairs-Shortest-Path (with full path reconstruction using two adjacency matrices), approx. 25% non-zeros in full dense matrix form (32-bit integer)
  Floyd Warshall dense graph 4000 vertices      5.09 sec     4.17 sec      18%
  Floyd Warshall dense graph 8000 vertices      39.1 sec     32.09 sec     18%
  Floyd Warshall dense graph 10000 vertices     77.8 sec     64.38 sec     17.2%
  Floyd Warshall dense graph 11111 vertices     108.01 sec   88.55 sec     18%
  Floyd Warshall dense graph 15000 vertices     NA           211.4 sec     NA
  Floyd Warshall dense graph 20000 vertices     NA           490.4 sec     NA
  note: NA because not enough CPU (host) memory to test on that PC

32-bit multi-step dynamic programming problem (hybrid CPU/GPU solution)
  Dynamic Problem MEGA (50 x 20)                93 ms        83 ms         9%
  Dynamic Problem MEGA (50 x 22)                405 ms       365 ms        9.8%
  Dynamic Problem MEGA (50 x 23)                849 ms       765 ms        9.8%

64-bit dynamic programming problem (hybrid CPU/GPU solution)
  Cities=100, K=60, Fans=100                    101 ms       85 ms         15.8%
  Cities=120, K=80, Fans=120                    200 ms       188 ms        6%

ADMM Group Lasso (32-bit floats), convex optimization
  m=1536, n=4096, K(num_blocks)=20              62 ms        49 ms         21%
  m=8192, n=1024, K(num_blocks)=8               40 ms        33 ms         17%
  m=16384, n=1024, K(num_blocks)=8              61 ms        51 ms         16%
  m=512, n=8192, K(num_blocks)=60               31 ms        25 ms         19%
  m=512, n=16384, K(num_blocks)=120             64 ms        58 ms         9%
  m=768, n=32768, K(num_blocks)=250             147 ms       111 ms        24%
Please excuse the above formatting blunders.
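For readers unfamiliar with the All-Pairs-Shortest-Path test, here is a minimal CPU reference sketch of Floyd-Warshall with path reconstruction using two n x n matrices (distances and next hops). This is not the GPU code that produced the timings. The names (ApspResult, floydWarshall, reconstructPath) are made up for illustration, and it assumes positive 32-bit integer weights where 0 in the dense input means "no edge".

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical names; CPU reference only, not the benchmarked GPU code.
constexpr int32_t kInf = std::numeric_limits<int32_t>::max();

struct ApspResult {
    int n;
    std::vector<int32_t> dist;  // n*n row-major; kInf means unreachable
    std::vector<int32_t> next;  // n*n row-major; next hop on i->j, -1 means none
};

// w: dense n*n row-major weights. Assumption: off-diagonal 0 means "no edge"
// and all real edge weights are positive.
ApspResult floydWarshall(const std::vector<int32_t>& w, int n) {
    const std::size_t N = static_cast<std::size_t>(n);
    ApspResult r{n, std::vector<int32_t>(N * N, kInf),
                 std::vector<int32_t>(N * N, -1)};
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            if (i == j) {
                r.dist[i * N + j] = 0;
                r.next[i * N + j] = static_cast<int32_t>(i);
            } else if (w[i * N + j] != 0) {
                r.dist[i * N + j] = w[i * N + j];
                r.next[i * N + j] = static_cast<int32_t>(j);
            }
        }
    }
    for (std::size_t k = 0; k < N; ++k) {
        for (std::size_t i = 0; i < N; ++i) {
            const int32_t dik = r.dist[i * N + k];
            if (dik == kInf) continue;  // no path i->k, nothing to relax
            for (std::size_t j = 0; j < N; ++j) {
                const int32_t dkj = r.dist[k * N + j];
                if (dkj == kInf) continue;
                // 64-bit sum so two large 32-bit distances cannot overflow
                const int64_t cand = static_cast<int64_t>(dik) + dkj;
                if (cand < r.dist[i * N + j]) {
                    r.dist[i * N + j] = static_cast<int32_t>(cand);
                    r.next[i * N + j] = r.next[i * N + k];  // go toward k first
                }
            }
        }
    }
    return r;
}

// Walks the next-hop matrix from src to dst; empty vector if unreachable.
std::vector<int> reconstructPath(const ApspResult& r, int src, int dst) {
    const std::size_t N = static_cast<std::size_t>(r.n);
    std::vector<int> path;
    if (r.next[src * N + dst] == -1) return path;
    path.push_back(src);
    while (src != dst) {
        src = r.next[src * N + dst];
        path.push_back(src);
    }
    return path;
}
```

The two matrices account for the memory pressure: at 20000 vertices each one holds 400 million 32-bit entries, about 1.6 GB apiece.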
In general the Tesla K40x (using Windows 7 TCC driver) was about 20-30% faster than the K20c. The greatest degree of difference was with sorting 32-bit floats. cuBLAS was about 25% faster on average, which can be seen in the group lasso times.
When it came down to the dynamic programming problems, there was less of a speedup, mainly because the host CPU clock speed of my home PC is faster than that of the remote machine. Since up to 33% of that type of work is done on the host, the overall speedup was not as great.
My applications do not make frequent host-device data transfers; for an application that does, the K40x's advantage should be even larger because of the higher PCI-E 3.0 bandwidth. The 12 GB of memory is also useful for large combinatorial/brute-force problems.
Much of the code uses newer features like __shfl() and I believe those times are rather good, especially for Windows.
Will be posting a more detailed blog later this week with source code.