Weird profiling results from Nsight Systems

I have a simple CUDA program from DLI that adds two vectors into a third, with the memory prefetched.
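A program of this kind typically looks something like the sketch below. It is not the exact DLI code: the kernel names, the vector size `N`, the launch configuration, and the choice to initialize the vectors on the GPU are all assumptions made for illustration. The `cudaMemPrefetchAsync` calls use the four-argument form from the CUDA 11.x runtime.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical helper kernel: fill an array with a constant (grid-stride loop).
__global__ void initWith(float value, float *a, int n)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        a[i] = value;
}

// Hypothetical kernel: result[i] = a[i] + b[i] (grid-stride loop).
__global__ void addVectorsInto(float *result, const float *a, const float *b, int n)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        result[i] = a[i] + b[i];
}

int main()
{
    const int    N    = 1 << 24;            // assumed problem size
    const size_t size = N * sizeof(float);

    int deviceId = 0;
    int numberOfSMs = 0;
    cudaGetDevice(&deviceId);
    cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

    // Unified (managed) memory: pages migrate between host and device on demand.
    float *a = nullptr, *b = nullptr, *c = nullptr;
    cudaMallocManaged(&a, size);
    cudaMallocManaged(&b, size);
    cudaMallocManaged(&c, size);

    // Ask the driver to move the pages to the GPU before any kernel touches them.
    cudaMemPrefetchAsync(a, size, deviceId);
    cudaMemPrefetchAsync(b, size, deviceId);
    cudaMemPrefetchAsync(c, size, deviceId);

    const int threadsPerBlock = 256;
    const int numberOfBlocks  = 32 * numberOfSMs;

    initWith<<<numberOfBlocks, threadsPerBlock>>>(3.0f, a, N);
    initWith<<<numberOfBlocks, threadsPerBlock>>>(4.0f, b, N);
    initWith<<<numberOfBlocks, threadsPerBlock>>>(0.0f, c, N);
    addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

    // Prefetch the result back to the host before the CPU reads it.
    cudaMemPrefetchAsync(c, size, cudaCpuDeviceId);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Verify on the host: every element should be 3 + 4 = 7.
    float maxError = 0.0f;
    for (int i = 0; i < N; ++i)
        maxError = fmaxf(maxError, fabsf(c[i] - 7.0f));
    printf("max error: %f\n", maxError);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

In a program shaped like this, the profiler timeline would be expected to show the host-to-device memory activity at or after the `cudaMemPrefetchAsync` calls, which is why transfers that appear much earlier look suspicious.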

But when I profile it with Nsight Systems, the result looks like this:
results.zip (413.8 KB)
This is weird, because the memory transfers happen much earlier than the prefetch is called.

Then I tried again with the NVIDIA Visual Profiler, and the result looks reasonable. I am really confused about this.

My System Info
Ubuntu 18.04.4 LTS with a Ryzen 7 1700

nvidia-smi
Sat Aug 15 18:10:21 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:26:00.0  On |                  N/A |
| 38%   32C    P8    10W / 160W |   1162MiB /  5926MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

nsys --version
NVIDIA Nsight Systems version 2020.3.2.6-87e152c

NVIDIA Visual Profiler
Version: 11.0
I run nvvp with “nvvp -vm /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java”