Computational speed of high-end and mid-range Nvidia Tesla cards

Dear Community,
this is Max from Italy.
As part of my University research activities, I am working on software for RF propagation simulations. The code, developed by another research group at my University, makes extensive use of GPU computing (with the cards running in TCC driver mode) and CUDA for parallelization.

For my activities, I have two workstations with nearly identical specs in terms of RAM and Xeon CPU. The only difference is that one has a Tesla P100 installed and the other a Tesla K40. Generally speaking, I would expect the P100 to outperform the K40 in terms of computational speed. In my experience this holds as long as the simulation environment is “demanding”, i.e. a challenging urban scenario with many propagation rays to be calculated and managed. On the contrary, if the scenario is “simple”, i.e. few buildings and not so many rays, the P100 is not as efficient as the K40, which runs the simulation faster. I am not an expert in CUDA programming or in the hardware specs of these two Tesla cards, but can any of you provide a reasonable explanation of this behaviour? How come a P100 is slower than a K40 for small, simple scenarios? Is there some kind of “overhead” to be aware of?

Do you run the same binaries on both cards? Have the binaries been built specifically to support the respective architectures (Pascal and Kepler)?

If you build this software from source, make sure you’re targeting the right architectures with the nvcc flags (-gencode and -arch). Also try out different CUDA toolkit versions, e.g. 7.5, 8.0 and 9.0, to see which one runs fastest on each architecture; note that the Pascal target requires CUDA 8.0 or later.
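For example, a build line covering both cards could look roughly like this (just a sketch; the source file and binary names are placeholders, sm_35 is the K40/Kepler target and sm_60 the P100/Pascal target):

    nvcc -O3 \
         -gencode arch=compute_35,code=sm_35 \
         -gencode arch=compute_60,code=sm_60 \
         -o rf_sim main.cu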

Maybe the software itself isn’t tuned for simpler scenarios. If the scenario is very basic, it might only partially utilize the available GPU resources: it could end up launching a low number of threads and blocks, which can run much less efficiently on one card than on another (depending on how many multiprocessors the respective card offers in total; the P100 has 56 SMs versus 15 on the K40, so a small launch leaves a larger fraction of the P100 idle).
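As a quick sanity check you could compare the number of blocks a “simple” scenario launches against the multiprocessor count reported by the runtime. A minimal sketch (not taken from your simulator; the grid size is a made-up number):

    // Sketch: compare a hypothetical launch size against the number of SMs.
    // The K40 reports 15 multiprocessors, the P100 reports 56, so the same
    // small grid covers a much smaller fraction of the P100.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blocksLaunched = 30;  // hypothetical grid size for a simple scenario
        printf("GPU: %s, multiprocessors: %d\n", prop.name, prop.multiProcessorCount);
        printf("blocks launched: %d (%.1f per SM)\n",
               blocksLaunched, (double)blocksLaunched / prop.multiProcessorCount);
        return 0;
    }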

Different problem sizes may also interact differently with the L1 and L2 caches of each GPU architecture. The basic scenario (e.g. the geometric building descriptions) might just fit into the cache on one card, but not so well on the other.
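If you want to check that, the runtime can tell you the L2 size of each card. Another small sketch (the scene size below is a placeholder, not your actual data):

    // Sketch: does a hypothetical working set fit into the card's L2 cache?
    // The K40 has 1.5 MB of L2, the P100 has 4 MB.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        size_t sceneBytes = 2u * 1024 * 1024;  // placeholder size of the building geometry
        printf("L2 cache: %d bytes, scene data: %zu bytes -> %s\n",
               prop.l2CacheSize, sceneBytes,
               (size_t)prop.l2CacheSize >= sceneBytes ? "fits" : "does not fit");
        return 0;
    }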

Overall it’s hard to pinpoint efficiency problems without running a profiling session. It’s pure guesswork before one has seen the NVIDIA CUDA profiler output.
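For a first look, nvprof (the command-line front end to the Visual Profiler, shipped with those CUDA versions) already tells you a lot; something along these lines, with ./rf_sim standing in for the actual simulator binary:

    nvprof --metrics achieved_occupancy,sm_efficiency,gld_efficiency ./rf_sim

Comparing those metrics for the “simple” scenario on the K40 and on the P100 should show whether it is an occupancy/utilization problem or something else entirely.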