I am measuring application performance with nvprof on a Tesla M60 (actually on an Amazon g3.4xlarge instance, which has only one GPU, i.e. half of an M60 board).

I have CUDA 9 installed.

The command I use for profiling looks like this:

```
nvprof --replay-mode application --csv --log-file nvprof_dram_write_throughput.log --metrics dram_write_throughput python tf_cnn_benchmarks.py <some arguments>
```

I run a similar command for dram_read_throughput. Each command produces a log file in CSV format.

The data I see in these log files confuses me.

The theoretical DRAM throughput of the M60 is about 160 GB/s. In the log files, however, I see that for some kernels the reported throughput is on the order of **TB/s**.

Does this mean that the **L1 or L2 caches** are being used?

Here are some lines from the log files:

```
"Device","Kernel","Invocations","Metric Name","Metric Description","Min","Max","Avg"
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)",13,"dram_write_throughput","Device Memory Write Throughput",108.548296MB/s,9597.024243GB/s,2209.103356GB/s
"Tesla M60 (0)","void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)",6,"dram_read_throughput","Device Memory Read Throughput",1.493997GB/s,6236.837731GB/s,2210.343396GB/s
```
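To see how many rows are affected, here is a minimal Python sketch that scans a log like the one above and lists the kernels whose reported maximum exceeds the ~160 GB/s figure. The helper names (`to_gbps`, `rows_above_peak`) are my own, and the sketch makes several assumptions: the units are decimal (1 GB/s = 1000 MB/s), the values always carry a `B/s`-style suffix as in the sample, and any nvprof banner lines start with `==`.

```python
import csv
import re

# Approximate theoretical peak quoted in the question (assumption: decimal GB/s).
PEAK_GBPS = 160.0

# Conversion factors from the unit suffix to GB/s (assumes decimal prefixes).
UNIT_TO_GBPS = {"B/s": 1e-9, "KB/s": 1e-6, "MB/s": 1e-3, "GB/s": 1.0, "TB/s": 1e3}

_VALUE_RE = re.compile(r"^([0-9.]+)\s*([KMGT]?B/s)$")


def to_gbps(text):
    """Convert a string such as '108.548296MB/s' to a float in GB/s."""
    match = _VALUE_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognized throughput value: %r" % text)
    return float(match.group(1)) * UNIT_TO_GBPS[match.group(2)]


def rows_above_peak(lines, peak_gbps=PEAK_GBPS):
    """Return (metric, kernel prefix, max GB/s) for rows whose Max exceeds the peak."""
    # Drop blank lines and banner lines (assumed to start with '==').
    data = [ln for ln in lines if ln.strip() and not ln.startswith("==")]
    flagged = []
    for row in csv.DictReader(data):
        max_gbps = to_gbps(row["Max"])
        if max_gbps > peak_gbps:
            # Kernel names are very long, so keep only a short prefix.
            flagged.append((row["Metric Name"], row["Kernel"][:60], max_gbps))
    return flagged
```

It can be run against a log with `with open("nvprof_dram_write_throughput.log") as f: print(rows_above_peak(f))`. For the first sample row this reports a maximum of about 9597 GB/s, roughly 60 times the theoretical figure.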

By the way, if I don’t use the --replay-mode application option, profiling a program that normally runs in less than a minute takes hours.