Extremely bad availability of China CDN

xkszltl · August 25, 2020, 4:03am

When downloading packages from https://developer.download.nvidia.com/compute/cuda and https://developer.download.nvidia.com/compute/machine-learning in China (which are redirected to developer.download.nvidia.cn), more than 90% of my connections either fail to establish or return “Failed to connect to origin, please retry” as the response.
This means:

Downloading a package is extremely hard: the failure rate is far too high and effectively rules out any automation.
Systems using yum-axelget do not work, because some connections return the correct binary data while others return the text “Failed to connect to origin, please retry”, which invalidates the entire download and amplifies retry traffic by more than 100x.
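
One way automation could guard against this failure mode is to validate every response body before accepting it, since the error arrives as plain text rather than as an HTTP error status. The following is a minimal sketch in Python, assuming the expected SHA-256 digest of the package is known from a trusted source. The function names are hypothetical; the marker string is the error text quoted above.

```python
import hashlib

# Error text the CDN returns in the response body, as quoted in this thread.
CDN_ERROR_MARKER = b"Failed to connect to origin, please retry"


def is_cdn_error_body(body: bytes) -> bool:
    """Return True if the body looks like the CDN's plain-text error message."""
    # Compare case-insensitively and only inspect the first kilobyte,
    # because the error message is short and appears at the start.
    return CDN_ERROR_MARKER.lower() in body[:1024].lower()


def is_valid_download(body: bytes, expected_sha256: str) -> bool:
    """Accept a download only if it is not an error body and its digest matches."""
    if is_cdn_error_body(body):
        return False
    return hashlib.sha256(body).hexdigest() == expected_sha256.lower()
```

A segmented downloader would need to apply the marker check to each segment separately, because a single bad segment is enough to corrupt the assembled file.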

This issue has existed for years without a fix, and it causes us a lot of pain every time we update NVIDIA packages.
We even ended up manually creating a local mirror of the repo just to make automation (e.g. deployment scripts, Docker builds, …) possible.

njuffa · August 25, 2020, 7:13pm

This reply is speculative but based on observation. Over the past several (5+, if memory serves) years various users from within the PRC have reported issues accessing various portions of NVIDIA’s web site, including downloads of machine-learning related materials. Often these issues would manifest as “strange” error messages.

In all cases that I remember, these issues magically disappeared when those users switched to access via VPN (via Hong Kong?). This suggests that these access issues were related to the Great Firewall of China. My understanding is that (1) use of VPN in the PRC is highly regulated; (2) the Great Firewall has been enhanced over the years to be largely impervious to attempts to bypass it with VPN.

I note that at this time there may also be a potentially generic issue with international access to NVIDIA’s site that seems tentatively associated with the CDN, as discussed in this thread: Online Documentation for old CUDA Versions

xkszltl · August 26, 2020, 6:42am

I doubt the GFW is the cause.

A GFW-related issue (i.e. the GFW standing between the user and the website gateway) would show up as broken SSL connections, not strange messages.

First, I’m not seeing any man-in-the-middle attack, because the cert is valid.

Second, the “failed to connect to origin, please retry” we get is not an error code; it is plain text in the HTTP response body.

It probably means the CDN gateway has trouble accessing the upstream server.

Using a VPN can sometimes resolve this issue, but not because you jump over the GFW.
It is just because you show up in a different geographic location and connect to a different gateway.

The curl trace of the redirections in China is shown below.
As you can see:

We call developer.download.nvidia.com
301 to .cn
301 to .com again
301 to .cn again
200 from .cn
All of these seem to be handled by krill.zenlogic.net.
With a VPN, the NVIDIA gateway simply gives us a 200 from .com; that is what solves the issue, not “jumping over the GFW”.
Interestingly, we usually get errors when talking to the .cn server (200 but timing out with a partial response, 200 with an error message in the body, “Failed to ssl_handshake: closed”, …).
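
The redirect sequence described above can be pulled out of a saved verbose trace automatically. This is a minimal sketch, assuming the trace keeps curl's `<` prefix on response header lines; the function name `redirect_chain` is hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Response status lines look like "< HTTP/2 301" in curl's verbose output.
STATUS_RE = re.compile(r"^<\s*HTTP/\S+\s+(\d{3})")
# Anchored so that headers such as "x-edge-location" are not mistaken for Location.
LOCATION_RE = re.compile(r"^<\s*location:\s*(\S+)", re.IGNORECASE)


def redirect_chain(verbose_log: str) -> List[Tuple[int, Optional[str]]]:
    """Extract (status, Location) pairs from curl verbose output, in order."""
    chain: List[Tuple[int, Optional[str]]] = []
    for line in verbose_log.splitlines():
        line = line.strip()
        status = STATUS_RE.match(line)
        if status:
            # Start a new hop; its Location header, if any, follows later.
            chain.append((int(status.group(1)), None))
            continue
        location = LOCATION_RE.match(line)
        if location and chain:
            # Attach the redirect target to the most recent hop.
            chain[-1] = (chain[-1][0], location.group(1))
    return chain
```

curl writes its verbose output to stderr, so the trace has to be captured from there before being passed to this function. A chain that alternates between the .com and .cn hosts before reaching a 200 corresponds to the loop listed above.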

$ curl -Lv https://developer.download.nvidia.com/compute/cuda/repos

Trying 129.227.66.139…
TCP_NODELAY set
Connected to developer.download.nvidia.com (129.227.66.139) port 443 (#0)
ALPN, offering h2
ALPN, offering http/1.1
successfully set certificate verify locations:
CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
TLSv1.3 (OUT), TLS handshake, Client hello (1):
TLSv1.3 (IN), TLS handshake, Server hello (2):
TLSv1.3 (OUT), TLS change cipher, Client hello (1):
TLSv1.3 (OUT), TLS handshake, Client hello (1):
TLSv1.3 (IN), TLS handshake, Server hello (2):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Unknown (8):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Certificate (11):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, CERT verify (15):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Finished (20):
TLSv1.3 (OUT), TLS Unknown, Certificate Status (22):
TLSv1.3 (OUT), TLS handshake, Finished (20):
SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
ALPN, server accepted to use h2
Server certificate:
start date: Jul 24 00:00:00 2020 GMT
expire date: Jul 25 12:00:00 2021 GMT
subjectAltName: host “developer.download.nvidia.com” matched cert’s “developer.download.nvidia.com”
issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=RapidSSL TLS RSA CA G1
SSL certificate verify ok.
Using HTTP2, server supports multi-use
Connection state changed (HTTP/2 confirmed)
Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
TLSv1.3 (OUT), TLS Unknown, Unknown (23):
TLSv1.3 (OUT), TLS Unknown, Unknown (23):
TLSv1.3 (OUT), TLS Unknown, Unknown (23):
Using Stream ID: 1 (easy handle 0x7fffbc98f580)
TLSv1.3 (OUT), TLS Unknown, Unknown (23):

GET /compute/cuda/repos HTTP/2
Host: developer.download.nvidia.com
User-Agent: curl/7.58.0
Accept: */*

TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
TLSv1.3 (IN), TLS Unknown, Unknown (23):
Connection state changed (MAX_CONCURRENT_STREAMS updated)!
TLSv1.3 (OUT), TLS Unknown, Unknown (23):
TLSv1.3 (IN), TLS Unknown, Unknown (23):
< HTTP/2 301
< server: nginx/1.17.9
< date: Wed, 26 Aug 2020 06:19:09 GMT
< content-type: text/html
< content-length: 169
< location: https://developer.download.nvidia.cn/compute/cuda/repos
< strict-transport-security: max-age=31536000; preload
< x-orca-edge-logic-revision: NV-20200628-1
< x-edge-location: sin
< x-orca-accelerator: from s08.mul.sin01.sg.krill.zenlogic.net
< x-cache: from s08.mul.sin01.sg.krill.zenlogic.net
<
Ignoring the response-body
Connection #0 to host developer.download.nvidia.com left intact
Issue another request to this URL: ‘https://developer.download.nvidia.cn/compute/cuda/repos’
Trying 61.155.167.2…
TCP_NODELAY set
Connected to developer.download.nvidia.cn (61.155.167.2) port 443 (#1)
ALPN, offering h2
ALPN, offering http/1.1
successfully set certificate verify locations:
CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
TLSv1.3 (OUT), TLS handshake, Client hello (1):
TLSv1.3 (IN), TLS handshake, Server hello (2):
TLSv1.3 (OUT), TLS change cipher, Client hello (1):
TLSv1.3 (OUT), TLS handshake, Client hello (1):
TLSv1.3 (IN), TLS handshake, Server hello (2):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Unknown (8):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Certificate (11):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, CERT verify (15):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Finished (20):
TLSv1.3 (OUT), TLS Unknown, Certificate Status (22):
TLSv1.3 (OUT), TLS handshake, Finished (20):
SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
ALPN, server accepted to use h2
Server certificate:
start date: Dec 13 00:00:00 2018 GMT
expire date: Dec 13 12:00:00 2020 GMT
subjectAltName: host “developer.download.nvidia.cn” matched cert’s “*.download.nvidia.cn”
issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=RapidSSL TLS RSA CA G1
SSL certificate verify ok.
Using HTTP2, server supports multi-use
Connection state changed (HTTP/2 confirmed)
Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
TLSv1.3 (OUT), TLS Unknown, Unknown (23):
TLSv1.3 (OUT), TLS Unknown, Unknown (23):
TLSv1.3 (OUT), TLS Unknown, Unknown (23):
Using Stream ID: 1 (easy handle 0x7fffbc98f580)
TLSv1.3 (OUT), TLS Unknown, Unknown (23):

GET /compute/cuda/repos HTTP/2
Host: developer.download.nvidia.cn
User-Agent: curl/7.58.0
Accept: */*

TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
TLSv1.3 (IN), TLS Unknown, Unknown (23):
Connection state changed (MAX_CONCURRENT_STREAMS updated)!
TLSv1.3 (OUT), TLS Unknown, Unknown (23):
TLSv1.3 (IN), TLS Unknown, Unknown (23):
TLSv1.3 (IN), TLS Unknown, Unknown (23):
< HTTP/2 301
< server: nginx/1.17.9
< date: Wed, 26 Aug 2020 06:19:16 GMT
< content-type: text/html
< content-length: 169
< location: https://developer.download.nvidia.com/compute/cuda/repos/
< cache-control: max-age=0
< expries: -1
< strict-transport-security: max-age=31536000; preload
< x-orca-cache-key-suffix: 301
< x-orca-edge-logic-revision: NV-20200628-1
< x-edge-location: szt
< x-orca-accelerator: from k02.chn.szt01.cn.krill.zenlogic.net
< x-cache: from k02.chn.szt01.cn.krill.zenlogic.net
<
Ignoring the response-body
Connection #1 to host developer.download.nvidia.cn left intact
Issue another request to this URL: ‘https://developer.download.nvidia.com/compute/cuda/repos/’
Found bundle for host developer.download.nvidia.com: 0x7fffbc98d360 [can multiplex]
Re-using existing connection! (#0) with host developer.download.nvidia.com
Connected to developer.download.nvidia.com (129.227.66.139) port 443 (#0)
Using Stream ID: 3 (easy handle 0x7fffbc98f580)
TLSv1.3 (OUT), TLS Unknown, Unknown (23):

GET /compute/cuda/repos/ HTTP/2
Host: developer.download.nvidia.com
User-Agent: curl/7.58.0
Accept: */*

TLSv1.3 (IN), TLS Unknown, Unknown (23):
< HTTP/2 301
< server: nginx/1.17.9
< date: Wed, 26 Aug 2020 06:19:16 GMT
< content-type: text/html
< content-length: 169
< location: https://developer.download.nvidia.cn/compute/cuda/repos/
< strict-transport-security: max-age=31536000; preload
< x-orca-edge-logic-revision: NV-20200628-1
< x-edge-location: sin
< x-orca-accelerator: from s08.mul.sin01.sg.krill.zenlogic.net
< x-cache: from s08.mul.sin01.sg.krill.zenlogic.net
<
Ignoring the response-body
Connection #0 to host developer.download.nvidia.com left intact
Issue another request to this URL: ‘https://developer.download.nvidia.cn/compute/cuda/repos/’
Found bundle for host developer.download.nvidia.cn: 0x7fffbca777f0 [can multiplex]
Re-using existing connection! (#1) with host developer.download.nvidia.cn
Connected to developer.download.nvidia.cn (61.155.167.2) port 443 (#1)
Using Stream ID: 3 (easy handle 0x7fffbc98f580)
TLSv1.3 (OUT), TLS Unknown, Unknown (23):

GET /compute/cuda/repos/ HTTP/2
Host: developer.download.nvidia.cn
User-Agent: curl/7.58.0
Accept: */*

TLSv1.3 (IN), TLS Unknown, Unknown (23):
< HTTP/2 200
< date: Wed, 26 Aug 2020 06:19:24 GMT
< content-type: text/html
< content-length: 3904
< age: 481073
< cache-control: max-age=604800
< etag: “3077532576+ident”
< expires: Wed, 02 Sep 2020 06:19:23 GMT
< last-modified: Thu, 06 Aug 2020 16:05:53 GMT
< server: ECAcc (tka/899E)
< vary: Accept-Encoding
< strict-transport-security: max-age=31536000; preload
< x-orca-cache-key-suffix: 200
< x-orca-edge-logic-revision: NV-20200628-1
< x-edge-location: hnd
< x-orca-accelerator: MISS from k01.mul.hnd01.jp.krill.zenlogic.net
< x-edge-location: szt
< x-orca-accelerator: MISS from k02.chn.szt01.cn.krill.zenlogic.net
< x-cache: MISS from k02.chn.szt01.cn.krill.zenlogic.net
<

<!doctype html>

Index of /compute/cuda/repos/

..
GPGKEY 4.0KB 2014-05-09 01:12
fedora18/
fedora19/
fedora20/
fedora21/
fedora23/
fedora25/
fedora27/
fedora29/
opensuse15/
opensuse122/
opensuse123/
opensuse131/
opensuse132/
opensuse422/
opensuse423/
rhel6/
rhel7/
rhel8/
sles11/
sles12/
sles15/
sles113/
sles114/
sles122/
sles123/
sles124/
ubuntu1204/
ubuntu1210/
ubuntu1304/
ubuntu1404/
ubuntu1410/
ubuntu1504/
ubuntu1604/
ubuntu1704/
ubuntu1710/
ubuntu1804/
ubuntu1810/
ubuntu2004/

* Connection #1 to host developer.download.nvidia.cn left intact

xkszltl · September 20, 2020, 12:26pm

Today we hit a new round of “Failed to ssl_handshake: timeout” errors (not always, but in more than 90% of requests).
This is what Safari got for https://developer.download.nvidia.cn/compute/machine-learning/repos (redirected from https://developer.download.nvidia.com/compute/machine-learning/repos)

Occasionally I can get this when lucky:

xkszltl · September 26, 2020, 6:10pm

There are more corrupted cases as well:
For example, I randomly get different “valid” responses.

This example shows that results returned via different ISPs (China Telecom and China Unicom, the two largest ISPs in China) can differ: the China Unicom result does not list Fedora 32 (the response was not truncated; it simply doesn’t have that entry in the list).

This example shows that the result from China Unicom is sometimes good, but only at random.

The result returned via CERNET (another major ISP in China, mainly serving universities) seems to be good.

xkszltl · September 29, 2020, 12:27am

Some packages also keep coming and going.
For example, only a small portion of responses include cuda-runtime-11-1; the majority only go up to 11-0, as below.
This is purely random: every curl returns a different answer.
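
To quantify this kind of inconsistency, repeated fetches of the same listing can be compared entry by entry. The sketch below assumes each response has already been parsed into a set of entry names (the parsing step is not shown), and the function name `unstable_entries` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def unstable_entries(listings: Iterable[Set[str]]) -> Dict[str, int]:
    """Map each entry missing from at least one listing to the number of listings containing it."""
    listings = list(listings)
    # Each listing is a set, so an entry is counted at most once per response.
    counts = Counter(entry for listing in listings for entry in listing)
    # A consistent CDN would return every entry in every response.
    return {entry: n for entry, n in counts.items() if n < len(listings)}
```

An empty result means all sampled responses agreed. An entry such as cuda-runtime-11-1 appearing with a low count would match the behavior described here.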
