TX2 performance improvement

Hi all,
We are working on improving deep learning performance on the TX2. We profiled our code with nvvp and noticed that the most time-consuming function is fermiPlusSgemmLDS64_batch, and that its memory efficiency is low. But what is fermiPlusSgemmLDS64_batch? We didn't write this function.

BR,
Tiandong