Nonsense results from memory bandwidth profile on TX1

The nvvp shipped with CUDA 7 appears to report incorrect memory bandwidth statistics when profiling the TX1 (see attached image, or http://imgur.com/WsPTOfK).

It’s claiming an L2 cache rate of 14 PB/s and a unified cache rate of 99 PB/s - and I don’t believe TX1 is that fast :-). The kernel is trivial: it just copies one 65536-element int array to another. Code is below. I’m using host CUDA 7 from cuda-repo-ubuntu1404-7-0-local_7.0-71_amd64.deb and device CUDA 7 installed by JetPack.

#!/usr/bin/env python
import pycuda.autoinit
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray
import numpy as np

src = """
__global__ void stuff(const int * __restrict__ data, int * __restrict__ out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = data[idx];
}
"""

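# Compile the kernel and look up its entry point.
module = pycuda.compiler.SourceModule(src)
stuff = module.get_function('stuff')
blocks = 256
blockdim = 256
N = blocks * blockdim  # 65536 elements, one per thread
# Device buffers are left uninitialized; only the memory traffic matters here.
data = pycuda.gpuarray.GPUArray(N, np.int32)
out = pycuda.gpuarray.GPUArray(N, np.int32)
stuff(data, out, block=(blockdim, 1, 1), grid=(blocks, 1))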
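
For what it's worth, a quick back-of-the-envelope check shows how far off these figures are. The sketch below is plain Python (no GPU needed) and works out how long the kernel would have to run for the reported rates to be true. The 25.6 GB/s DRAM peak is an assumption based on the commonly quoted TX1 spec, not something I measured, and `implied_seconds` is just a helper name I made up for this post. Cache throughput can legitimately exceed DRAM bandwidth, so treat the comparison as order-of-magnitude only.

```python
# Sanity check on the rates nvvp reports for the copy kernel above.
N = 256 * 256            # elements copied (blocks * blockdim)
BYTES_PER_ELEM = 4       # np.int32

# The kernel reads N ints and writes N ints.
total_bytes = 2 * N * BYTES_PER_ELEM

PB = 1e15
reported_l2 = 14 * PB        # L2 rate shown by nvvp
reported_unified = 99 * PB   # unified cache rate shown by nvvp

# Assumption: commonly quoted TX1 peak DRAM bandwidth, not measured here.
ASSUMED_DRAM_PEAK = 25.6e9   # bytes per second


def implied_seconds(nbytes, rate):
    """Time needed to move nbytes at the given rate (bytes per second)."""
    return nbytes / rate


print("bytes moved by the kernel:", total_bytes)
print("kernel time implied by L2 rate (s):",
      implied_seconds(total_bytes, reported_l2))
print("kernel time implied by unified cache rate (s):",
      implied_seconds(total_bytes, reported_unified))
print("minimum kernel time at assumed DRAM peak (s):",
      implied_seconds(total_bytes, ASSUMED_DRAM_PEAK))
print("reported L2 rate / assumed DRAM peak:",
      reported_l2 / ASSUMED_DRAM_PEAK)
```

The kernel only moves 512 KiB in total, so the reported rates would imply a run time of a few tens of picoseconds, which is shorter than a single GPU clock cycle. That points at a unit or counter problem in the profiler rather than anything the kernel is doing.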

You may wish to file a bug at developer.nvidia.com.