bandwidthTest reports D2D bandwidth higher than the official specification

Using the bandwidthTest tool (/usr/local/cuda/samples/1_Utilities/bandwidthTest/), I measured a D2D bandwidth of 864.3 GB/s, which is higher than the official bandwidth of the NVIDIA GeForce RTX 3080 (760 GB/s). Is this reasonable, and if so, why?
Also, why is the result multiplied by 2.0 when the bandwidth is calculated? The test code is as follows:
// calculate bandwidth in GB/s
float time_s = elapsedTimeInMs / (float)1e3;
bandwidthInGBs = (2.0f * memSize * (float)MEMCOPY_ITERATIONS) / (float)1e9;
bandwidthInGBs = bandwidthInGBs / time_s;

The test data is as follows:
bandwidthTest-D2D, Bandwidth = 0.8 GB/s, Time = 0.00000 s, Size = 1000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 0.7 GB/s, Time = 0.00000 s, Size = 2000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 1.4 GB/s, Time = 0.00000 s, Size = 3000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 3.1 GB/s, Time = 0.00000 s, Size = 4000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 3.8 GB/s, Time = 0.00000 s, Size = 5000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 4.5 GB/s, Time = 0.00000 s, Size = 6000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 5.3 GB/s, Time = 0.00000 s, Size = 7000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 6.2 GB/s, Time = 0.00000 s, Size = 8000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 7.1 GB/s, Time = 0.00000 s, Size = 9000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 7.7 GB/s, Time = 0.00000 s, Size = 10000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 8.6 GB/s, Time = 0.00000 s, Size = 11000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 9.3 GB/s, Time = 0.00000 s, Size = 12000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 10.3 GB/s, Time = 0.00000 s, Size = 13000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 10.9 GB/s, Time = 0.00000 s, Size = 14000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 11.7 GB/s, Time = 0.00000 s, Size = 15000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 12.4 GB/s, Time = 0.00000 s, Size = 16000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 13.2 GB/s, Time = 0.00000 s, Size = 17000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 14.0 GB/s, Time = 0.00000 s, Size = 18000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 14.9 GB/s, Time = 0.00000 s, Size = 19000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 15.5 GB/s, Time = 0.00000 s, Size = 20000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 17.2 GB/s, Time = 0.00000 s, Size = 22000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 18.6 GB/s, Time = 0.00000 s, Size = 24000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 20.3 GB/s, Time = 0.00000 s, Size = 26000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 21.9 GB/s, Time = 0.00000 s, Size = 28000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 23.6 GB/s, Time = 0.00000 s, Size = 30000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 24.7 GB/s, Time = 0.00000 s, Size = 32000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 26.3 GB/s, Time = 0.00000 s, Size = 34000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 28.1 GB/s, Time = 0.00000 s, Size = 36000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 29.5 GB/s, Time = 0.00000 s, Size = 38000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 30.9 GB/s, Time = 0.00000 s, Size = 40000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 33.1 GB/s, Time = 0.00000 s, Size = 42000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 34.0 GB/s, Time = 0.00000 s, Size = 44000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 35.9 GB/s, Time = 0.00000 s, Size = 46000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 37.0 GB/s, Time = 0.00000 s, Size = 48000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 39.0 GB/s, Time = 0.00000 s, Size = 50000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 46.7 GB/s, Time = 0.00000 s, Size = 60000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 55.1 GB/s, Time = 0.00000 s, Size = 70000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 63.4 GB/s, Time = 0.00000 s, Size = 80000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 70.3 GB/s, Time = 0.00000 s, Size = 90000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 78.9 GB/s, Time = 0.00000 s, Size = 100000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 157.6 GB/s, Time = 0.00000 s, Size = 200000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 236.1 GB/s, Time = 0.00000 s, Size = 300000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 314.2 GB/s, Time = 0.00000 s, Size = 400000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 395.9 GB/s, Time = 0.00000 s, Size = 500000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 473.6 GB/s, Time = 0.00000 s, Size = 600000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 557.8 GB/s, Time = 0.00000 s, Size = 700000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 637.8 GB/s, Time = 0.00000 s, Size = 800000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 716.3 GB/s, Time = 0.00000 s, Size = 900000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 796.3 GB/s, Time = 0.00000 s, Size = 1000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 864.3 GB/s, Time = 0.00000 s, Size = 2000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 589.9 GB/s, Time = 0.00001 s, Size = 3000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 526.0 GB/s, Time = 0.00001 s, Size = 4000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 514.1 GB/s, Time = 0.00001 s, Size = 5000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 533.1 GB/s, Time = 0.00001 s, Size = 6000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 544.8 GB/s, Time = 0.00001 s, Size = 7000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 560.1 GB/s, Time = 0.00001 s, Size = 8000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 577.2 GB/s, Time = 0.00002 s, Size = 9000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 586.3 GB/s, Time = 0.00002 s, Size = 10000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 604.1 GB/s, Time = 0.00002 s, Size = 11000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 609.4 GB/s, Time = 0.00002 s, Size = 12000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 606.8 GB/s, Time = 0.00002 s, Size = 13000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 611.0 GB/s, Time = 0.00002 s, Size = 14000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 619.5 GB/s, Time = 0.00002 s, Size = 15000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 623.6 GB/s, Time = 0.00003 s, Size = 16000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 626.4 GB/s, Time = 0.00003 s, Size = 18000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 634.0 GB/s, Time = 0.00003 s, Size = 20000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 637.9 GB/s, Time = 0.00003 s, Size = 22000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 644.1 GB/s, Time = 0.00004 s, Size = 24000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 645.0 GB/s, Time = 0.00004 s, Size = 26000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 649.1 GB/s, Time = 0.00004 s, Size = 28000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 649.9 GB/s, Time = 0.00005 s, Size = 30000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 653.9 GB/s, Time = 0.00005 s, Size = 32000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 655.3 GB/s, Time = 0.00005 s, Size = 36000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 659.1 GB/s, Time = 0.00006 s, Size = 40000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 661.9 GB/s, Time = 0.00007 s, Size = 44000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 664.1 GB/s, Time = 0.00007 s, Size = 48000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 665.6 GB/s, Time = 0.00008 s, Size = 52000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 666.9 GB/s, Time = 0.00008 s, Size = 56000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 668.4 GB/s, Time = 0.00009 s, Size = 60000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 676.1 GB/s, Time = 0.00009 s, Size = 64000000 bytes, NumDevsUsed = 1
bandwidthTest-D2D, Bandwidth = 670.5 GB/s, Time = 0.00010 s, Size = 68000000 bytes, NumDevsUsed = 1

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: “NVIDIA GeForce RTX 3080”
CUDA Driver Version / Runtime Version 11.7 / 11.7
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 10018 MBytes (10504437760 bytes)
(068) Multiprocessors, (128) CUDA Cores/MP: 8704 CUDA Cores
GPU Max Clock rate: 1710 MHz (1.71 GHz)
Memory Clock rate: 9501 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 5242880 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 101 / 0

Is the card overclocked?

Each data movement operation involves one read and one write. Bandwidth is measured in bytes read and written per second. Size is the number of bytes copied, but each byte copied requires one read and one write.

That could be an artifact of the timing methodology. The elapsed time for the highlighted log entry is on the order of a couple of microseconds. Looking at the SDK timer used for the example programs, it seems the underlying operating system timers should be able to support sub-microsecond granularity. However, I notice that the SDK timer keeps some data as float, which may add additional numerical artifacts that increase clock jitter as seen by the user, but I have not reviewed this in detail. For that reason I keep timing data as double in my own work.

NVIDIA may want to review their SDK timing methodology with regard to (1) the reliability of measuring events of extremely short duration, possibly changing float data items to double, and (2) avoiding the measurement of events of extremely short duration, possibly by increasing the minimum sizes used in the various shmoo tests.

Also, I believe the numbers NVIDIA commonly reports for these types of specs use GB = 2^30 bytes, whereas the GB used by bandwidthTest is 1 billion bytes. That could account for a ~7% difference, although it would not fully account for the discrepancy.

For bandwidth tests it is customary to use SI prefixes in their ordinary meaning, so GB/s means 10^9 bytes/second. The usage in this CUDA program is consistent with that. I assume that the speeds & feeds marketing numbers (here: 760 GB/s) quoted for NVIDIA GPUs make use of the same metric, because in my experience marketing will always quote the highest number available.

Only in measurements of capacity should the alternate binary prefixes be used, such as MiB (mebibytes) and GiB (gibibytes). Unless you are a mass storage vendor, but I digress …

Sorry, you are correct. Please disregard my comment.

Thank you for your answer.

No overclocking.

Using double data items and the Linux clock_gettime nanosecond timer, the problem no longer reproduces on the NVIDIA GeForce RTX 3080.

However, when I run the same bandwidth test (shmoo mode) on Orin, also with double and clock_gettime, the peak bandwidth is 234 GB/s, while the maximum listed in the Orin spec is 204.8 GB/s. This happens mainly at sizes of 800000, 900000, and 1000000 bytes.
The test data is as follows:

Device 0: “Orin”
CUDA Driver Version / Runtime Version 11.4 / 11.4
CUDA Capability Major/Minor version number: 8.7
Total amount of global memory: 30623 MBytes (32110190592 bytes)
(016) Multiprocessors, (128) CUDA Cores/MP: 2048 CUDA Cores
GPU Max Clock rate: 1300 MHz (1.30 GHz)
Memory Clock rate: 1300 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 167936 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

bandwidthTest-D2D, Bandwidth = 187.679636)GB/s, Time = 0.000699(0.000746)) s, Size = 700000 bytes
bandwidthTest-D2D, Bandwidth = 191.094974)GB/s, Time = 0.000786(0.000837)) s, Size = 800000 bytes
bandwidthTest-D2D, Bandwidth = 216.229722)GB/s, Time = 0.000776(0.000832)) s, Size = 900000 bytes
bandwidthTest-D2D, Bandwidth = 234.389649)GB/s, Time = 0.000794(0.000853)) s, Size = 1000000 bytes

RAM 5938/30623MB (lfb 4094x4MB) SWAP 0/15311MB (cached 0MB) CPU [0%/2192,0%/2193,0%/2190,0%/2191,0%/2191,65%/2189,1%/2237,0%/2191,0%/2192,0%/2295,0%/2372,0%/2192] EMC_FREQ 125%/3199 GR3D_FREQ 99%/1295 GR3D2_FREQ 99%/1294 NVJPG1 729 VIC_FREQ 729 APE 233 CV0/-256C CPU/51.093C Tdiode/40.5C SOC2/47.093C SOC0/47.187C CV1/-256C GPU/48.625C SOC1/47.875C CV2/-256C VDD_GPU_SOC 21216mW/20820mW VDD_CPU_CV 2001mW/2001mW VIN_SYS_5V0 11022mW/10819mW NC 0mW/0mW VDDQ_VDD2_1V8AO 4334mW/4183mW NC 0mW/0mW