TX2 performance improvement

wtiandong March 8, 2018, 10:37am 1

Hi all,
We are working on improving deep learning performance on the TX2. We are currently profiling our code with nvvp, and we noticed that the most time-consuming function is fermiPlusSgemmLDS64_batch, whose memory efficiency is low. But what is fermiPlusSgemmLDS64_batch? We didn't write this function.

BR,
Tiandong
