nvprof shows DRAM throughput greater than theoretically possible

pyotr777 · December 27, 2017, 2:10am

I am measuring application performance with nvprof on an M60 (actually on an Amazon g3.4xlarge instance, which has only one GPU, that is, half of an M60 board).
I have CUDA 9 installed.
The command I used for profiling is like the following:

nvprof --replay-mode application --csv --log-file nvprof_dram_write_throughput.log --metrics dram_write_throughput python tf_cnn_benchmarks.py <some arguments>

I ran a similar command for dram_read_throughput. Each command produces a log file in CSV format.
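
The two profiling runs differ only in the metric name, so they can be generated from one template. The following is a minimal sketch, not part of the original post: the helper name `build_nvprof_command` and the `log_prefix` parameter are hypothetical, and the benchmark arguments stay empty because the post elides them. Only nvprof flags that appear in the command above are used.

```python
import shlex


def build_nvprof_command(metric, app_args, log_prefix="nvprof_"):
    """Return the nvprof argument list that logs one metric to a CSV file."""
    return [
        "nvprof",
        "--replay-mode", "application",
        "--csv",
        "--log-file", f"{log_prefix}{metric}.log",
        "--metrics", metric,
        "python", "tf_cnn_benchmarks.py",
        *app_args,
    ]


if __name__ == "__main__":
    # The benchmark arguments are elided in the post; fill in your own.
    app_args = []
    for metric in ("dram_write_throughput", "dram_read_throughput"):
        cmd = build_nvprof_command(metric, app_args)
        # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so no shell quoting is needed if it is later passed to `subprocess.run`.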

The data I see in these log files confuses me.

The theoretical DRAM throughput of the M60 is about 160GB/s per GPU. In the log files, however, I see that for some kernels the reported throughput is on the order of TB/s.
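
For reference, the roughly 160GB/s figure can be reproduced from the memory interface parameters. This is a sketch under assumptions that are not stated in the thread: a 256-bit GDDR5 interface per GPU running at an effective data rate of about 5 GT/s.

```python
def peak_dram_bandwidth_gb_s(bus_width_bits, effective_rate_gt_s):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second (GT/s)."""
    return bus_width_bits / 8 * effective_rate_gt_s


# Assumed per-GPU figures for a Tesla M60 (not given in the thread):
# 256-bit memory interface, about 5 GT/s effective data rate.
M60_PEAK_GB_S = peak_dram_bandwidth_gb_s(256, 5.0)
print(M60_PEAK_GB_S)  # 160.0
```

Under these assumptions, any DRAM throughput value far above 160GB/s cannot come from device memory traffic alone.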

Does this mean that L1 or L2 caches are used?

Here are some lines from the log files:

"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",13,"dram_write_throughput","Device Memory Write Throughput",108.548296MB/s,9597.024243GB/s,2209.103356GB/s
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)",6,"dram_read_throughput","Device Memory Read Throughput",1.493997GB/s,6236.837731GB/s,2210.343396GB/s

By the way, if I don’t use --replay-mode application option, profiling a program that runs less than a minute takes hours.
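
To scan log files like the ones quoted above for physically implausible values, something along these lines could be used. It is a sketch, not an nvprof feature: the function names are hypothetical, the unit prefixes are assumed to be decimal (1 GB/s = 1e9 B/s), and any lines before the CSV header are assumed not to start with a double quote. The sample rows use shortened kernel names, and the second row is made up for illustration.

```python
import csv
import re

# Assumption: nvprof unit prefixes are decimal (1 GB/s = 1e9 B/s).
UNIT_FACTORS = {"B/s": 1.0, "KB/s": 1e3, "MB/s": 1e6, "GB/s": 1e9, "TB/s": 1e12}
VALUE_RE = re.compile(r"^([0-9.]+(?:[eE][+-]?[0-9]+)?)\s*([KMGT]?B/s)$")


def to_bytes_per_second(text):
    """Convert a value such as '108.548296MB/s' to bytes per second."""
    match = VALUE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized throughput value: {text!r}")
    return float(match.group(1)) * UNIT_FACTORS[match.group(2)]


def implausible_rows(lines, peak_bytes_per_second):
    """Yield (kernel, metric, max value) for rows whose Max exceeds the peak."""
    # Keep only the CSV header and data rows, which start with a double quote.
    rows = [line for line in lines if line.startswith('"')]
    for row in csv.DictReader(rows):
        max_bps = to_bytes_per_second(row["Max"])
        if max_bps > peak_bytes_per_second:
            yield row["Kernel"], row["Metric Name"], max_bps


# Shortened sample: the first row uses values from the log above,
# the second row is hypothetical.
sample = '''"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"
"Tesla M60 (0)","kernel_a(float, int=1)",13,"dram_write_throughput","Device Memory Write Throughput",108.548296MB/s,9597.024243GB/s,2209.103356GB/s
"Tesla M60 (0)","kernel_b(float)",6,"dram_read_throughput","Device Memory Read Throughput",1.493997GB/s,120.5GB/s,60.0GB/s
'''

for kernel, metric, max_bps in implausible_rows(sample.splitlines(), 160e9):
    print(f"{metric}: {kernel} max {max_bps / 1e9:.1f} GB/s")
```

Only the first sample row is flagged, because its maximum is far above the assumed 160GB/s peak.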

veraj · December 28, 2017, 3:32am

Hi, peterbryz

Thanks for reporting this.

Regarding the metric value, this is an issue.
It would help if you could provide us with your app (in this case, is python tf_cnn_benchmarks.py also the command to run it?). If you agree, I can send you a link for uploading the related files.

Regarding the problem below:
By the way, if I don’t use --replay-mode application option, profiling a program that runs less than a minute takes hours.

One potential reason is a large GPU memory footprint, since we need to save and restore device memory for each kernel replay. In such cases application replay performs better. These details are documented on the docs portal as:
“In “application replay” mode, nvprof re-runs the whole application instead of replaying each kernel, in order to collect all events/metrics. In some cases this mode can be faster than kernel replay mode if the application allocates large amount of device memory.”

pyotr777 · December 30, 2017, 12:18pm

Hi, Veraj,

Thank you for your reply.
I am profiling the latest HPCG benchmark (http://www.hpcg-benchmark.org/software/index.html) and the Tensorflow HP benchmark (https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks).

How many metrics can nvprof collect in one run without replaying?

veraj · January 2, 2018, 2:47am

Hi, pyotr777

Thanks for the info.
We’ll check if we can reproduce on our side.

I will let you know if there is any update.

Best Regards

VeraJ

veraj · January 2, 2018, 10:43am

Hi, pyotr777

I have set up a Tesla M60 with CUDA 9.0.176 and downloaded http://www.hpcg-benchmark.org/software/view.html?id=254

But I failed to run it.

root@devtools-qa72:~/hpcg-3.1_cuda9_ompi1.10.2_gcc485_sm_35_sm_50_sm_60_sm_70_ver_10_8_17# LD_LIBRARY_PATH=/opt/pgi/linux86-64/2017/mpi/openmpi/lib:$LD_LIBRARY_PATH ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

start of application (8 OMP threads)…
2018-01-02 18:40:01.531

Problem setup…
Setup time: 0.608166 sec
Killed

If I use a GV100, the sample runs. So which command are you using on the Tesla M60?

pyotr777 · January 5, 2018, 6:25am

Hi, Veraj,

For HPCG, try using a smaller problem size:

cp hpcg.dat_128x128x128_60 hpcg.dat

To run HPCG without profiling it should be possible just to run the executable like you did.
For profiling I use something like this:

$ nvprof --metrics dram_read_throughput,dram_utilization,dram_write_throughput ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

You can use installation scripts for HPCG on a new (cloud) Ubuntu machine:

Please, do check the Tensorflow HP benchmark also, as profiling works even worse for it.

You could try the following command:

~/benchmarks/scripts/tf_cnn_benchmarks$ nvprof  --metrics dram_read_throughput,dram_utilization,dram_write_throughput python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64  --model=resnet50

veraj · January 8, 2018, 10:49am

Hi, pyotr777

I can reproduce the issue using the Tensorflow HP benchmark:

"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",452,"dram_write_throughput","Device Memory Write Throughput",0.000000B/s,43576.681300GB/s,1020.850923GB/s

I will report this to the developers and have them look into it.

Thanks again for reporting this!

pyotr777 · January 9, 2018, 1:35am

Hi, Veraj,

Thank you!
May I expect an update to nvprof soon?

Peter

veraj · January 9, 2018, 2:49am

Hi, pyotr777

What do you mean by an update to nvprof? Do you want us to share a fixed nvprof with you separately?

I’m afraid the developers will fix this in a later toolkit release, not backport it to 9.0.

pyotr777 · January 11, 2018, 3:04am

Hi, Veraj,

What do you mean by an update to nvprof? Do you want us to share a fixed nvprof with you separately?

Nope. I’m looking forward to an updated version of nvprof.

Peter

veraj · January 11, 2018, 3:13am

Oh, that depends on the CUDA toolkit release schedule.
I’m sorry I do not have the exact info.
