Early comparison of Tesla K20c vs. Tesla K40x

I had the opportunity, through Nvidia and their partner AMAX, to test out the Tesla K40x for a day.

Other than the obvious 2x increase in host-device bandwidth (PCI-E 3.0) and the increase in device memory from 5 GB to 12 GB, here are some tests I ran on my home machine (K20c) vs. the AMAX K40x setup:

Comparison of Nvidia Tesla K20c vs. Nvidia Tesla K40x (calculation speed)

Windows 7 64-bit with the TCC driver for Tesla, Intel Core i7-3770K @ 3.9 GHz

NOTE: All times include device memory allocations, host-device copies, device-device copies, device-host copies and de-allocations
Test case                                       K20c         K40x         Speedup

Sorting (Thrust, 32-bit floats)
  134,217,728-element array                     0.51 sec     0.341 sec    33%
  268,435,456-element array                     1.01 sec     0.684 sec    32%

Sorting (Thrust, 64-bit doubles)
  16,777,216-element array                      0.14 sec     0.102 sec    27%
  33,554,432-element array                      0.27 sec     0.2 sec      26%
  67,108,864-element array                      0.552 sec    0.397 sec    28%

Brute force of all permutations in local memory, then evaluation, optimization,
scan, and reduction of the best answer and its permutation
  13! in memory, evaluate, scan, reduce         23.6 sec     20.0 sec     18%
  14! in memory, evaluate, scan, reduce         415.5 sec    327.48 sec   21%
  15! in memory, evaluate, scan, reduce         6374 sec     5546.4 sec   13%

32-bit All-Pairs Shortest Path with full path reconstruction using two adjacency
matrices (approx. 25% non-zeros in full dense matrix form, 32-bit integers)
  Floyd-Warshall, dense graph, 4000 vertices    5.09 sec     4.17 sec     18%
  Floyd-Warshall, dense graph, 8000 vertices    39.1 sec     32.09 sec    18%
  Floyd-Warshall, dense graph, 10000 vertices   77.8 sec     64.38 sec    17.2%
  Floyd-Warshall, dense graph, 11111 vertices   108.01 sec   88.55 sec    18%
  Floyd-Warshall, dense graph, 15000 vertices   N/A          211.4 sec    N/A
  Floyd-Warshall, dense graph, 20000 vertices   N/A          490.4 sec    N/A
  (N/A: not enough CPU (host) memory on that PC to run those sizes)

32-bit multi-step dynamic programming problem (hybrid CPU/GPU solution)
  Dynamic Problem MEGA (50 x 20)                93 ms        83 ms        9%
  Dynamic Problem MEGA (50 x 22)                405 ms       365 ms       9.8%
  Dynamic Problem MEGA (50 x 23)                849 ms       765 ms       9.8%

64-bit dynamic programming problem (hybrid CPU/GPU solution)
  Cities=100, K=60, Fans=100                    101 ms       85 ms        15.8%
  Cities=120, K=80, Fans=120                    200 ms       188 ms       6%

ADMM Group Lasso (32-bit floats), convex optimization
  m=1536, n=4096, K(num_blocks)=20              62 ms        49 ms        21%
  m=8192, n=1024, K(num_blocks)=8               40 ms        33 ms        17%
  m=16384, n=1024, K(num_blocks)=8              61 ms        51 ms        16%
  m=512, n=8192, K(num_blocks)=60               31 ms        25 ms        19%
  m=512, n=16384, K(num_blocks)=120             64 ms        58 ms        9%
  m=768, n=32768, K(num_blocks)=250             147 ms       111 ms       24%
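For anyone unfamiliar with the group lasso rows, the problem and the ADMM splitting have the usual textbook form below (shown only to indicate what m, n, and K(num_blocks) refer to; these are the standard updates, not lifted verbatim from my code):

\min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \sum_{i=1}^{K} \|x_i\|_2,
\qquad A \in \mathbb{R}^{m \times n},\ x \text{ split into } K \text{ blocks } x_i

x^{k+1} = (A^{T}A + \rho I)^{-1}\,\bigl(A^{T}b + \rho\,(z^{k} - u^{k})\bigr)
\qquad \text{(dense linear algebra)}

z_i^{k+1} = \max\!\Bigl(0,\ 1 - \tfrac{\lambda/\rho}{\|x_i^{k+1} + u_i^{k}\|_2}\Bigr)\,
\bigl(x_i^{k+1} + u_i^{k}\bigr)
\qquad \text{(block soft-thresholding)}

u^{k+1} = u^{k} + x^{k+1} - z^{k+1}

The x-update is where the dense cuBLAS work is concentrated, so that is where a cuBLAS speedup would show up in those rows.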


In general, the Tesla K40x (using the Windows 7 TCC driver) was about 20-30% faster than the K20c. The largest difference was in sorting 32-bit floats. cuBLAS was about 25% faster on average, which can be seen in the group lasso times.

When it came down to the dynamic programming problems, there was less of a speedup, mainly because the host CPU clock speed of my home PC is higher than that of the remote machine; since up to 33% of that type of work is done on the host, the overall speedup was not as great.

My applications do not make frequent host-device data transfers; for an application that does, the K40x's performance advantage would be even greater. Also, the 12 GB of device memory is useful for large combinatorial/brute-force problems.

Much of the code uses newer features such as __shfl(), and I believe those times are rather good, especially for Windows.
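As a rough illustration of the kind of __shfl() use involved (a minimal warp-sum sketch, not the exact kernels behind the numbers above):

// Warp-level sum with the Kepler shuffle intrinsic: no shared memory needed.
// Sketch only; the actual reduction kernels do more work per thread.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);   // add the value from the lane 'offset' positions above
    return val;                            // lane 0 ends up with the warp total
}

Compared to a shared-memory reduction, this avoids the shared-memory traffic and the __syncthreads() calls within a warp.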

I will be posting a more detailed blog post, with source code, later this week.

I should add that the non-Thrust code run on both Tesla devices was identical, and I am not sure whether there would be any reason to adjust the parameters/code to fit the K40x. I did not have enough time for optimizations on that machine, but there might be some that could have changed the overall results.

Curious to see how the K40x compares to the Quadro K6000.

You may also want to play with application clocks (settable via nvidia-smi), as these guys did:

http://blog.xcelerit.com/benchmarks-nvidia-tesla-k40-vs-k20x-gpu/
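For example (the exact supported clock pairs depend on the board, so query them first; the 3004,875 pair below is the K40's highest pair, as I understand it):

nvidia-smi -q -d SUPPORTED_CLOCKS     (list the supported memory,graphics clock pairs)
nvidia-smi -ac 3004,875               (set application clocks to a memory,graphics pair)
nvidia-smi -rac                       (reset application clocks to the default)

Note that changing application clocks may require administrator/root privileges.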

I’ll be happy to run your benchmarks on the Quadro K6000 I have, to see how it compares to the Tesla K40.

Even if you could test just the first three, that would be very helpful.

The sorting was done via Thrust using a raw device pointer (not thrust::device_vector).
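In other words, roughly like this (a sketch of the approach; the times in the table also include the allocations and copies around the sort):

#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Sort N floats already resident in device memory through a raw pointer,
// by wrapping it in a thrust::device_ptr (sketch of the approach).
void sort_on_device(float *d_data, size_t N)
{
    thrust::device_ptr<float> dp = thrust::device_pointer_cast(d_data);
    thrust::sort(dp, dp + N);   // ascending sort, runs on the GPU
}

The device_ptr wrapper is what tells Thrust the data is already on the device, so the sort call itself involves no host/device vector copies.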

The Floyd-Warshall code I used was this:

https://github.com/OlegKonings/CUDA_Floyd_Warshall_/blob/master/WikiGraphCuda/WikiGraphCuda/WGCmain.cu

You need to adjust RANDOM_GSIZE to match the number of vertices.

The large permutation code I used:

https://github.com/OlegKonings/CUDA_permutations_large/blob/master/EXP3/EXP3/EXP3.cu

NUM_ELEMENTS is the key variable there.

The bool variable test_raw determines whether the generated permutations should be evaluated (scanned, reduced, etc.). I tested with test_raw=false, but the code default is set to true.
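For anyone curious how each thread can build its own permutation independently, the general trick is the factorial number system: decompose the thread's global index into "digits" that select the next unused element. A rough sketch of the idea (not copied from the repo, which has its own implementation details):

// Sketch: build the idx-th permutation of {0,...,n-1} via the factorial
// number system, so each CUDA thread can construct its own permutation in
// local memory independently of the others.
__device__ void nth_permutation(unsigned long long idx, int n, int *perm)
{
    int avail[16];                               // assumes n <= 16 (13..15 in the tests above)
    for (int i = 0; i < n; ++i) avail[i] = i;

    unsigned long long f = 1;
    for (int i = 2; i < n; ++i) f *= i;          // f = (n-1)!

    for (int i = 0; i < n - 1; ++i) {
        int d = (int)(idx / f);                  // pick the d-th remaining element
        idx %= f;
        perm[i] = avail[d];
        for (int j = d; j < n - 1 - i; ++j)      // remove it from the pool
            avail[j] = avail[j + 1];
        f /= (n - 1 - i);
    }
    perm[n - 1] = avail[0];
}

Note that with 13! already being about 6.2 billion permutations, the index has to be a 64-bit integer, which the sketch reflects.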

I would imagine the Linux times would be a bit better than the Windows ones. You will need to remove the Windows timing code, but other than that it should run fine on Linux.
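One portable option is to replace the OS-specific timers with CUDA events, e.g. (a sketch, not part of the posted code):

#include <cuda_runtime.h>

// Portable GPU timing with CUDA events (sketch); wrap whatever work is being
// measured (allocations, copies, kernels) between the two records.
float time_gpu_work()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // ... the work being timed goes here ...
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);            // wait until the recorded work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;                             // elapsed time in milliseconds
}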

A while ago I borrowed a K40c for a day, and the speedup for various applications versus the K20c was pretty much exactly in the range stated in post #1 of this thread. This was with default clocks and ECC enabled on both GPUs. I did not have time to test with application clocks on the K40, which should provide some additional boost for apps that are not completely memory bound.

Quadro K6000, stock BIOS/clocks, Windows 8 x64, i7-4930K @ 3.4GHz, TCC, ECC disabled, CUDA 5.5 (compiled for SM 3.0 & x64), VS2012, NVIDIA driver 332.21, MSI X79A-GD45 (8D) motherboard (12.4 BIOS, PCI-E 3.0 enabled)

Floyd Warshall dense graph 4000 vertices: 5.75 sec
Floyd Warshall dense graph 8000 vertices: 38.01 sec
Floyd Warshall dense graph 10000 vertices: 73.41 sec
Floyd Warshall dense graph 11111 vertices: 100.06 sec
Floyd Warshall dense graph 15000 vertices: 235.68 sec
Floyd Warshall dense graph 20000 vertices: 534.45 sec

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.5\Bin\win64\Release>bandwidthTest.exe
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro K6000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10765.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10764.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     217819.8

Result = PASS

I wonder why these results come out slower than the K40’s. I was under the impression that, besides the extra 10 W of maximum TDP, the two cards are pretty much the same. I’ll see if I can run the other benchmarks later on.

vacaloca,

Thanks! I also would imagine those 2 GPUs would have similar performance.

Keep in mind that for this Floyd-Warshall code, the CPU clock speed matters quite a bit (the outer k loop is done on the host, and the other two loops on the GPU).
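For reference, the structure is roughly the following (a stripped-down sketch of that host/device split, with the path-reconstruction matrix omitted):

#include <cuda_runtime.h>

// One Floyd-Warshall relaxation step over all (i, j) pairs for a fixed k.
// Assumes "unreachable" entries use a large value that is still safe to add
// without integer overflow.
__global__ void fw_step(int *D, int n, int k)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        int via_k = D[i * n + k] + D[k * n + j];
        if (via_k < D[i * n + j])
            D[i * n + j] = via_k;
    }
}

// The outer k loop stays on the host CPU and issues n kernel launches,
// which is why the host clock speed shows up in these timings.
void floyd_warshall(int *d_D, int n)
{
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    for (int k = 0; k < n; ++k)
        fw_step<<<grid, block>>>(d_D, n, k);
    cudaDeviceSynchronize();
}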

What CPU setup did you use? That might explain the difference.

No problem. I updated the post above: I am running a Core i7-4930K @ 3.4 GHz, so perhaps that explains the difference if you were running at 3.9 GHz. I’ll try out the others.

Not sure why, but with my configuration I cannot compile the large permutation code without using the -G option in nvcc. Otherwise ptxas.exe keeps running for a long time at roughly 8% CPU usage and never completes the compilation. If you have a machine where you can try VS2012 with CUDA 5.5, see if you experience the same issue.

Hmm, I did not have that problem with VS2012. It is fairly simple code, but I will look at it again.

Ok, I think the problem has to do with the use of templates in the ‘raw’ version. There was a warning generated for that, but it still ran on my PC.

The warning was related to the A and B arrays, which hold the permutation index values. Since that version of the code never evaluates the values in A, the compiler warned about that. There might have been a large amount of templated code generated.

Here is a version which generates all the permutations of the arrays in memory and evaluates/scans/reduces them. It will take longer than the raw version because it has to evaluate, cache, optimize, copy to global memory, copy back to the host, etc. Those are the numbers I used in the table anyway.

http://pastebin.com/5fY1nF7z

The answer for that code should be 1610 for 13! given the const dependency and values arrays.

Yup, I got those warnings, but it just didn’t finish compiling for me… must be an issue with generating a lot of templated code, like you mentioned.

The one in the previous post compiled just fine in Release mode for me. Here are the numbers:

Quadro K6000, stock BIOS/clocks, Windows 8 x64, i7-4930K @ 3.4GHz, TCC, ECC disabled, CUDA 5.5 (compiled for SM 3.0 & x64), VS2012, NVIDIA driver 332.21, MSI X79A-GD45 (8D) motherboard (12.4 BIOS, PCI-E 3.0 enabled)

Brute force of all permutations in local memory, then evaluation, optimization, scan, and reduction of the best answer and its permutation

13! fact in memory, evaluate, scan, reduce: 17.28 sec (1610 answer)
14! fact in memory, evaluate, scan, reduce: 272.29 sec (1721 answer)
15! fact in memory, evaluate, scan, reduce: 4608.83 sec (1722 answer)

These numbers make more sense, given that the Quadro K6000 boosts to (and almost always stays at) a max clock of 901 MHz from a 797 MHz stock clock. It would seem the 10 W max TDP increase on the K40 is the only ‘plus’ over the Quadro K6000.

Wow, that is impressive! Thanks for the test.