Compute time with two streams is not noticeably shorter than with one stream

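Hello, when launching my kernel, I tried replacing my original single-stream code with two streams. With one stream, I used 32 blocks of 32 threads each. With two streams, each stream launches 16 blocks of 32 threads each. However, when I look at the results in nvvp, the total time with two streams is not cut in half. Instead, the kernel in each of the two streams takes nearly as long as the single-stream kernel did. What causes this? The nvvp results are below.
One stream: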


Two streams:

Hi,
Please run sudo nvpmodel -m 0 and sudo jetson_clocks. This runs the GPU at its maximum clock. Please check whether it improves the timing.

You can also run sudo tegrastats to check the GPU load. It may already be at 100%.

Hi,
I did as you suggested and the run time dropped by 3 ms, but I can't tell whether the GPU was 100% loaded.

tegrastats:

Hi,
It looks like your app is running on the CPU. One CPU core is at 100% and there is no load on the GPU (GR3D_FREQ). Please check whether the app is actually using the GPU. Perhaps a required setting is not enabled.
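
For reference, the check described above (one busy CPU core, GR3D_FREQ at 0%) can be automated with a small script. This is only a rough sketch: the sample line is made up for illustration and is not the poster's output, the helper names and thresholds are hypothetical, and the exact tegrastats field layout differs between Jetson boards and JetPack releases, so the regular expressions may need adjusting.

```python
import re


def parse_tegrastats_line(line):
    """Return (cpu_loads, gpu_load) parsed from one line of tegrastats output.

    cpu_loads is a list with one entry per core (None for cores shown as "off").
    gpu_load is the GR3D_FREQ percentage, or None if the field is missing.
    """
    cpu_loads = []
    cpu_match = re.search(r"CPU \[([^\]]*)\]", line)
    if cpu_match:
        for core in cpu_match.group(1).split(","):
            m = re.match(r"\s*(\d+)%", core)
            # Powered-down cores are printed as "off"; record them as None.
            cpu_loads.append(int(m.group(1)) if m else None)

    gpu_match = re.search(r"GR3D_FREQ (\d+)%", line)
    gpu_load = int(gpu_match.group(1)) if gpu_match else None
    return cpu_loads, gpu_load


def looks_cpu_bound(cpu_loads, gpu_load, busy=90, idle=5):
    """Heuristic: at least one core is busy while the GPU is essentially idle.

    The busy/idle thresholds are arbitrary values chosen for this sketch.
    """
    active = [c for c in cpu_loads if c is not None]
    return (
        gpu_load is not None
        and gpu_load <= idle
        and any(c >= busy for c in active)
    )


# Illustrative sample line (assumed format, not taken from the thread).
sample = (
    "RAM 2100/7772MB (lfb 4x4MB) "
    "CPU [100%@2035,0%@2035,off,off,3%@2035,1%@2035] "
    "EMC_FREQ 1% GR3D_FREQ 0% GPU@35C"
)

cpu, gpu = parse_tegrastats_line(sample)
print("CPU loads:", cpu)
print("GPU load:", gpu)
print("Looks CPU-bound:", looks_cpu_bound(cpu, gpu))
```

To use it on a real board, capture a few lines of sudo tegrastats output while the app is running and pass each line to parse_tegrastats_line. If GR3D_FREQ stays near 0% for the whole run, the kernels are probably not executing on the GPU at all.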