Hello, NVIDIA experts. I have a problem with the RTX A4000. I tested its performance through cuBLAS and found that I cannot reach the peak performance. The results are as follows: [image] It never reaches the peak value, and I guess the bandwidth of the A4000’s global memory cannot support th…

About performance profile of RTX A4000

jmarusarz August 3, 2023, 8:31pm 2

I see that you got a reply on this thread: Why the performance of tf32 tensor_core is poor? - #7 by Shaquille. I think that’s the best place to continue the discussion, since there are more details there already.
