50GB/s Transfer Speeds on 2x Titan RTX with ASUS ROG NVLink

I’m looking for specialized support for my Titan RTX cards on Ubuntu 18.04.2 LTS. Currently I’m achieving only 50GB/s transfer speed over NVLink between my two cards, while I should be getting up to 100GB/s.

As far as the output of the commands below goes, it seems to me that I’m missing half of the links, since Links 0, 1, 2 and 3 should be displayed but I’m only seeing Links 0 and 1.

nvidia-smi nvlink -s
GPU 0: TITAN RTX
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
GPU 1: TITAN RTX
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s

nvidia-smi nvlink -c
GPU 0: TITAN RTX
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
GPU 1: TITAN RTX
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false

nvidia-smi nvlink -p
GPU 0: TITAN RTX
Link 0: 00000000:82:00.0
Link 1: 00000000:82:00.0
GPU 1: TITAN RTX
Link 0: 00000000:02:00.0
Link 1: 00000000:02:00.0

nvidia-smi nvlink --status -i 0
GPU 0: TITAN RTX
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
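
The per-link figures above can be totaled with a small parser — a sketch, assuming the nvidia-smi nvlink -s output format shown in this post:

```python
import re

# Sample taken verbatim from the `nvidia-smi nvlink -s` output above.
output = """\
GPU 0: TITAN RTX
    Link 0: 25.781 GB/s
    Link 1: 25.781 GB/s
GPU 1: TITAN RTX
    Link 0: 25.781 GB/s
    Link 1: 25.781 GB/s
"""

totals = {}      # GPU label -> aggregate one-way bandwidth in GB/s
current = None
for line in output.splitlines():
    if line.startswith("GPU"):
        current = line.strip()
        totals[current] = 0.0
    else:
        m = re.search(r"Link \d+:\s*([\d.]+)\s*GB/s", line)
        if m and current:
            totals[current] += float(m.group(1))

for gpu, total in totals.items():
    # 51.562 GB/s per GPU: both of the card's links are up.
    print(f"{gpu}: {total:.3f} GB/s one-way")
```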

NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1

What can I do to improve my transfer speeds? I’m led to believe that this is a limitation of the ASUS ROG NVLink bridge. Is it?
nvidia-bug-report.log.gz (1.82 MB)

This is perfectly fine. The Titan RTX only has two NVLink links at 25GB/s each, i.e. 50GB/s per direction; 100GB/s is the bidirectional transfer speed. Use the bandwidthTest CUDA sample to get the combined numbers.
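
A quick sanity check of that arithmetic, using the 25.781 GB/s per-link rate reported by nvidia-smi in the question:

```python
# Per-link rate reported by `nvidia-smi nvlink -s` in the question above.
per_link_gbs = 25.781

# A Titan RTX exposes two NVLink links.
links = 2

unidirectional = per_link_gbs * links   # aggregate, one direction at a time
bidirectional = unidirectional * 2      # both directions driven simultaneously

print(f"unidirectional: {unidirectional:.3f} GB/s")  # ~50 GB/s, matching the observation
print(f"bidirectional:  {bidirectional:.3f} GB/s")   # ~100 GB/s, the quoted spec figure
```

So the ~50GB/s being observed is the expected one-way aggregate; the 100GB/s figure counts traffic in both directions at once.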

I ran the following command as requested. Top bandwidth was 1827704.6 MB/s, or 1784.87 GB/s… ?!

bandwidthTest --device=all --mode=shmoo
[CUDA Bandwidth Test] - Starting…

!!!Cumulative Bandwidth to be computed from all the devices !!!

Running on…

Device 0: TITAN RTX
Device 1: TITAN RTX
Shmoo Mode


Host to Device Bandwidth, 2 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
1024 1128.5
2048 691.0
3072 1100.9
4096 1149.7
5120 2183.8
6144 3201.8
7168 2792.1
8192 2519.7
9216 3737.2
10240 5005.2
11264 6649.4
12288 4695.6
13312 6158.2
14336 3934.8
15360 8375.1
16384 7688.0
17408 6324.7
18432 6227.4
19456 8411.5
20480 2753.6
22528 7088.8
24576 9928.4
26624 10495.9
28672 11147.0
30720 12462.3
32768 12655.4
34816 12822.8
36864 12901.5
38912 13482.1
40960 11830.7
43008 14284.1
45056 12590.1
47104 15353.8
49152 14955.8
51200 15764.3
61440 16157.8
71680 15786.0
81920 17062.9
92160 18273.1
102400 17363.9
204800 20327.9
307200 18959.3
409600 20657.5
512000 21155.4
614400 21296.8
716800 20965.8
819200 21429.0
921600 21691.3
1024000 22082.1
1126400 22179.5
2174976 22496.4
3223552 21968.4
4272128 21824.3
5320704 22525.2
6369280 22322.7
7417856 22559.0
8466432 21588.7
9515008 22278.6
10563584 22561.0
11612160 22344.7
12660736 22239.5
13709312 22541.3
14757888 22292.6
15806464 22181.1
16855040 22540.9
18952192 22620.2
21049344 21965.4
23146496 22362.1
25243648 22335.4
27340800 22209.8
29437952 22311.8
31535104 22148.0
33632256 22445.4
37826560 22082.6
42020864 22039.1
46215168 22319.8
50409472 22465.9
54603776 22365.3
58798080 22195.2
62992384 22332.1
67186688 22557.7


Device to Host Bandwidth, 2 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
1024 583.2
2048 2250.2
3072 3403.6
4096 4112.3
5120 5191.7
6144 6915.1
7168 6425.8
8192 7971.6
9216 9255.8
10240 8398.6
11264 9496.4
12288 9851.9
13312 10127.6
14336 10152.5
15360 11661.2
16384 13115.9
17408 12168.8
18432 12110.8
19456 14054.4
20480 12520.0
22528 14245.5
24576 13404.1
26624 14119.4
28672 9885.4
30720 14549.1
32768 16769.2
34816 17089.5
36864 17281.6
38912 17369.8
40960 17578.7
43008 18122.2
45056 17810.9
47104 18412.1
49152 17030.4
51200 18886.5
61440 16968.6
71680 20414.4
81920 19467.0
92160 16841.1
102400 21540.1
204800 21549.6
307200 22262.6
409600 23848.8
512000 24112.5
614400 25001.5
716800 25009.4
819200 25046.3
921600 24139.4
1024000 24332.8
1126400 24497.2
2174976 25056.5
3223552 24957.6
4272128 25062.5
5320704 24566.7
6369280 24432.1
7417856 24404.2
8466432 25194.9
9515008 25104.6
10563584 25058.6
11612160 24967.3
12660736 24472.9
13709312 25177.7
14757888 24762.2
15806464 25167.6
16855040 23798.4
18952192 24530.6
21049344 24900.1
23146496 25200.7
25243648 24884.5
27340800 25006.6
29437952 24883.6
31535104 24945.4
33632256 24983.4
37826560 24894.2
42020864 25198.5
46215168 25033.9
50409472 24620.1
54603776 25192.2
58798080 24968.1
62992384 25208.2
67186688 25114.7


Device to Device Bandwidth, 2 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
1024 1315.7
2048 3310.4
3072 5360.5
4096 6640.4
5120 8754.9
6144 10006.4
7168 8051.3
8192 13693.3
9216 15517.3
10240 16125.3
11264 17827.7
12288 20273.2
13312 20553.6
14336 24116.0
15360 24679.4
16384 26764.6
17408 29494.2
18432 29469.9
19456 30110.2
20480 32616.2
22528 36503.1
24576 39388.4
26624 42696.0
28672 44332.5
30720 49777.7
32768 56980.1
34816 52166.2
36864 64367.5
38912 60948.0
40960 69801.5
43008 72355.3
45056 74724.4
47104 74563.2
49152 79690.8
51200 87536.2
61440 106933.3
71680 45414.0
81920 52857.2
92160 52215.1
102400 67327.6
204800 132324.1
307200 304278.4
409600 721073.0
512000 564592.5
614400 682991.1
716800 1019975.3
819200 1151493.6
921600 1203293.8
1024000 1260949.9
1126400 1504254.1
2174976 1827704.6
3223552 1295196.3
4272128 896354.9
5320704 916636.4
6369280 944489.5
7417856 959990.2
8466432 975654.4
9515008 743652.4
10563584 794903.3
11612160 521704.0
12660736 589057.5
13709312 908203.6
14757888 1018320.7
15806464 894900.0
16855040 329853.6
18952192 1022189.8
21049344 721870.4
23146496 900466.4
25243648 1043503.9
27340800 1020354.0
29437952 1043258.0
31535104 945771.5
33632256 1033674.7
37826560 1046557.2
42020864 1002398.0
46215168 1051364.2
50409472 1047868.0
54603776 1064306.0
58798080 715770.4
62992384 936505.5
67186688 861046.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
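
For reference, the unit conversion behind the 1784.87 GB/s figure quoted above, and why it cannot be a GPU-to-GPU rate — a sketch; the ~672 GB/s memory-bandwidth figure for the Titan RTX is my assumption, not from this thread:

```python
# Top device-to-device figure from the shmoo run above.
peak_mb_s = 1827704.6

# Convert using 1 GB = 1024 MB, matching the figure quoted in the post.
peak_gb_s = peak_mb_s / 1024
print(f"{peak_gb_s:.2f} GB/s")  # ≈ 1784.87 GB/s

# That is far above the ~51.56 GB/s NVLink aggregate, and even above the
# ~672 GB/s local GDDR6 bandwidth of a single Titan RTX (assumed spec
# figure). So this cannot be a transfer between the two GPUs: with
# --device=all, bandwidthTest's device-to-device mode copies within each
# device's own memory and sums the results across devices (as the
# "Cumulative Bandwidth" banner in the output states), and mid-sized
# transfers are likely also served partly from cache.
```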

Sorry, wrong test, I really meant
p2pBandwidthLatencyTest