nvprof dram_write_throughput, dram_read_throughput

I am measuring application performance with nvprof on an M60 (more precisely, on an Amazon g3.4xlarge instance, which exposes a single GPU, i.e. half of an M60 board).
I have CUDA 9 installed.
The command I used for profiling looks like this:

nvprof --replay-mode application --csv --log-file nvprof_dram_write_throughput.log --metrics dram_write_throughput python tf_cnn_benchmarks.py <some command arguments>

I used a similar command for dram_read_throughput. Each command produces a log file in CSV format, with one line per kernel.

The data I see in these log files confuses me.

The theoretical DRAM throughput of the M60 is about 160 GB/s. In the log files, however, some kernels show throughput on the order of TB/s.
Here are some lines from the log files:

"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",13,"dram_write_throughput","Device Memory Write Throughput",108.548296MB/s,9597.024243GB/s,2209.103356GB/s
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)",6,"dram_read_throughput","Device Memory Read Throughput",1.493997GB/s,6236.837731GB/s,2210.343396GB/s
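To list every kernel whose reported maximum exceeds the theoretical figure, a small script along these lines could be run over the log files. This is only a sketch under several assumptions: the column names match the header shown above, the unit suffixes are decimal (1 GB/s = 1000 MB/s), and any nvprof banner lines in the file start with "==". The names `to_gbps`, `implausible_rows`, and `PEAK_GBPS` are made up for illustration, and the kernel names in the sample are shortened placeholders, not the real ones.

```python
import csv
import re

# Theoretical DRAM throughput quoted in the post, in GB/s.
PEAK_GBPS = 160.0

# Assumption: nvprof's unit suffixes are decimal multiples.
_UNIT_TO_GBPS = {"B/s": 1e-9, "KB/s": 1e-6, "MB/s": 1e-3, "GB/s": 1.0, "TB/s": 1e3}


def to_gbps(text):
    """Convert a value such as '108.548296MB/s' to a float in GB/s."""
    match = re.fullmatch(r"([0-9.]+)([KMGT]?B/s)", text.strip())
    if match is None:
        raise ValueError("unrecognised throughput value: %r" % text)
    return float(match.group(1)) * _UNIT_TO_GBPS[match.group(2)]


def implausible_rows(lines, peak_gbps=PEAK_GBPS):
    """Return (kernel, metric, max GB/s) for rows whose Max exceeds the peak."""
    # Assumption: banner lines begin with "==" and are not part of the CSV data.
    data = [ln for ln in lines if ln.strip() and not ln.startswith("==")]
    flagged = []
    for row in csv.DictReader(data):
        max_gbps = to_gbps(row["Max"])
        if max_gbps > peak_gbps:
            flagged.append((row["Kernel"], row["Metric Name"], max_gbps))
    return flagged


# Values copied from the two rows above; kernel names are placeholders.
SAMPLE = '''"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"
"Tesla M60 (0)","kernel_a (placeholder)",13,"dram_write_throughput","Device Memory Write Throughput",108.548296MB/s,9597.024243GB/s,2209.103356GB/s
"Tesla M60 (0)","kernel_b (placeholder)",6,"dram_read_throughput","Device Memory Read Throughput",1.493997GB/s,6236.837731GB/s,2210.343396GB/s
'''

if __name__ == "__main__":
    for kernel, metric, max_gbps in implausible_rows(SAMPLE.splitlines()):
        print("%s %s: max %.1f GB/s (%.0fx the peak)"
              % (kernel, metric, max_gbps, max_gbps / PEAK_GBPS))
```

For a real log, the list passed to `implausible_rows` would come from `open("nvprof_dram_write_throughput.log").read().splitlines()` instead of `SAMPLE`.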

Can anyone shed light on what’s going on here?

By the way, if I don’t use the --replay-mode application option, profiling a program that normally runs in less than a minute takes hours.