Extremely bad availability of China CDN

When downloading packages from https://developer.download.nvidia.com/compute/cuda and https://developer.download.nvidia.com/compute/machine-learning in China (which is redirected to developer.download.nvidia.cn), more than 90% of my connections either fail to establish or return “Failed to connect to origin, please retry” as the response.
This means:

  1. It is extremely hard to download packages; the failure rate is far too high and effectively rules out any automation.
  2. Systems with yum-axelget do not work: when some connections return the correct binary while others return the text “Failed to connect to origin, please retry”, the entire download is invalidated, and retries amplify traffic by more than 100x.
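
To make the amplification in item 2 concrete, here is a rough back-of-the-envelope sketch. It assumes that each parallel connection succeeds independently with the same probability and that one bad connection invalidates the whole download; the connection count is a hypothetical parameter, not a known yum-axelget default.

```python
def expected_attempts(per_connection_success: float, connections: int) -> float:
    """Expected number of whole-download attempts until every connection succeeds.

    Assumes independent connections with identical success probability, so one
    attempt succeeds with probability p ** connections (geometric distribution).
    """
    if not 0.0 < per_connection_success <= 1.0:
        raise ValueError("per_connection_success must be in (0, 1]")
    if connections < 1:
        raise ValueError("connections must be at least 1")
    return 1.0 / (per_connection_success ** connections)
```

Under these assumptions, a 10% per-connection success rate with just two parallel connections already gives roughly 100 expected attempts, which is in line with the amplification described above.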

This issue has existed for years without a fix, and it causes us a lot of pain every time we update NVIDIA packages.
We even ended up creating a local mirror of the repo manually just to make automation (e.g. deployment scripts, docker builds, …) possible.

This reply is speculative but based on observation. Over the past several (5+, if memory serves) years various users from within the PRC have reported issues accessing various portions of NVIDIA’s web site, including downloads of machine-learning related materials. Often these issues would manifest as “strange” error messages.

In all cases that I remember, these issues magically disappeared when those users switched to access via VPN (via Hong Kong?). This suggests that these access issues were related to the Great Firewall of China. My understanding is that (1) use of VPNs in the PRC is highly regulated; (2) the Great Firewall has been enhanced over the years to be largely impervious to attempts to bypass it with a VPN.

I note that at this time there may also be a potentially generic issue with international access to NVIDIA’s site that seems tentatively associated with the CDN, as discussed in this thread: Online Documentation for old CUDA Versions

I doubt GFW is the cause.

A GFW-related issue (i.e. the GFW standing between the user and the website gateway) would show up as broken SSL connections, not strange messages.

First, I’m not seeing any man-in-the-middle attack, because the certificate is valid.

Second, the “Failed to connect to origin, please retry” we get is not an error code; it is plain text in the HTTP response body.

It probably means the CDN gateway has trouble reaching the upstream server.
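
Because the error arrives as body text rather than as an error status, a client has to inspect the payload itself. A minimal sketch, assuming the message appears at the start of the body and that the expected size is known from elsewhere (for example repo metadata); `is_cdn_error_body` and `response_is_usable` are hypothetical helper names, not part of any existing tool:

```python
from typing import Optional

# Error text as quoted in this thread; matched case-insensitively.
CDN_ERROR_TEXT = b"failed to connect to origin, please retry"


def is_cdn_error_body(body: bytes) -> bool:
    """Return True if the body is the plain-text CDN error rather than real content."""
    # Only inspect the head of the body so large package payloads are not copied.
    head = body[:256].lstrip().lower()
    return head.startswith(CDN_ERROR_TEXT)


def response_is_usable(status: int, body: bytes,
                       expected_length: Optional[int] = None) -> bool:
    """Accept a response only if it is a 200, not the error text, and complete."""
    if status != 200:
        return False
    if is_cdn_error_body(body):
        return False
    if expected_length is not None and len(body) != expected_length:
        return False  # truncated or padded payload
    return True
```

A downloader could call `response_is_usable` on each segment before merging, so that one bad segment is refetched instead of invalidating the whole file.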

Yes, using a VPN can sometimes resolve this issue, but not because you jump over the GFW.
It is simply because you show up in another geographic location and connect to a different gateway.

The curl trace of the redirections in China follows this summary.
As you can see:

  1. We call developer.download.nvidia.com
  2. 301 to .cn
  3. 301 to .com again
  4. 301 to .cn again
  5. 200 from .cn

All of these seem to be handled by krill.zenlogic.net.
With a VPN, the NVIDIA gateway simply gives us a 200 from .com; that is what solves the issue, not “jumping over the GFW”.
Interestingly, we usually get errors when talking to the .cn server (200 but a timeout with a partial response, 200 with an error message in the body, “Failed to ssl_handshake: closed”, …).

$ curl -Lv https://developer.download.nvidia.com/compute/cuda/repos

  • Trying 129.227.66.139…
  • TCP_NODELAY set
  • Connected to developer.download.nvidia.com (129.227.66.139) port 443 (#0)
  • ALPN, offering h2
  • ALPN, offering http/1.1
  • successfully set certificate verify locations:
  • CAfile: /etc/ssl/certs/ca-certificates.crt
    CApath: /etc/ssl/certs
  • TLSv1.3 (OUT), TLS handshake, Client hello (1):
  • TLSv1.3 (IN), TLS handshake, Server hello (2):
  • TLSv1.3 (OUT), TLS change cipher, Client hello (1):
  • TLSv1.3 (OUT), TLS handshake, Client hello (1):
  • TLSv1.3 (IN), TLS handshake, Server hello (2):
  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, Unknown (8):
  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, Certificate (11):
  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, CERT verify (15):
  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, Finished (20):
  • TLSv1.3 (OUT), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (OUT), TLS handshake, Finished (20):
  • SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
  • ALPN, server accepted to use h2
  • Server certificate:
  • start date: Jul 24 00:00:00 2020 GMT
  • expire date: Jul 25 12:00:00 2021 GMT
  • subjectAltName: host “developer.download.nvidia.com” matched cert’s “developer.download.nvidia.com”
  • issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=RapidSSL TLS RSA CA G1
  • SSL certificate verify ok.
  • Using HTTP2, server supports multi-use
  • Connection state changed (HTTP/2 confirmed)
  • Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
  • TLSv1.3 (OUT), TLS Unknown, Unknown (23):
  • TLSv1.3 (OUT), TLS Unknown, Unknown (23):
  • TLSv1.3 (OUT), TLS Unknown, Unknown (23):
  • Using Stream ID: 1 (easy handle 0x7fffbc98f580)
  • TLSv1.3 (OUT), TLS Unknown, Unknown (23):

GET /compute/cuda/repos HTTP/2
Host: developer.download.nvidia.com
User-Agent: curl/7.58.0
Accept: */*

  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
  • TLSv1.3 (IN), TLS Unknown, Unknown (23):
  • Connection state changed (MAX_CONCURRENT_STREAMS updated)!
  • TLSv1.3 (OUT), TLS Unknown, Unknown (23):
  • TLSv1.3 (IN), TLS Unknown, Unknown (23):
    < HTTP/2 301
    < server: nginx/1.17.9
    < date: Wed, 26 Aug 2020 06:19:09 GMT
    < content-type: text/html
    < content-length: 169
    < location: https://developer.download.nvidia.cn/compute/cuda/repos
    < strict-transport-security: max-age=31536000; preload
    < x-orca-edge-logic-revision: NV-20200628-1
    < x-edge-location: sin
    < x-orca-accelerator: from s08.mul.sin01.sg.krill.zenlogic.net
    < x-cache: from s08.mul.sin01.sg.krill.zenlogic.net
    <
  • Ignoring the response-body
  • Connection #0 to host developer.download.nvidia.com left intact
  • Issue another request to this URL: ‘https://developer.download.nvidia.cn/compute/cuda/repos’
  • Trying 61.155.167.2…
  • TCP_NODELAY set
  • Connected to developer.download.nvidia.cn (61.155.167.2) port 443 (#1)
  • ALPN, offering h2
  • ALPN, offering http/1.1
  • successfully set certificate verify locations:
  • CAfile: /etc/ssl/certs/ca-certificates.crt
    CApath: /etc/ssl/certs
  • TLSv1.3 (OUT), TLS handshake, Client hello (1):
  • TLSv1.3 (IN), TLS handshake, Server hello (2):
  • TLSv1.3 (OUT), TLS change cipher, Client hello (1):
  • TLSv1.3 (OUT), TLS handshake, Client hello (1):
  • TLSv1.3 (IN), TLS handshake, Server hello (2):
  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, Unknown (8):
  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, Certificate (11):
  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, CERT verify (15):
  • TLSv1.3 (IN), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (IN), TLS handshake, Finished (20):
  • TLSv1.3 (OUT), TLS Unknown, Certificate Status (22):
  • TLSv1.3 (OUT), TLS handshake, Finished (20):
  • SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
  • ALPN, server accepted to use h2
  • Server certificate:
  • start date: Dec 13 00:00:00 2018 GMT
  • expire date: Dec 13 12:00:00 2020 GMT
  • subjectAltName: host “developer.download.nvidia.cn” matched cert’s “*.download.nvidia.cn”
  • issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=RapidSSL TLS RSA CA G1
  • SSL certificate verify ok.
  • Using HTTP2, server supports multi-use
  • Connection state changed (HTTP/2 confirmed)
  • Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
  • TLSv1.3 (OUT), TLS Unknown, Unknown (23):
  • TLSv1.3 (OUT), TLS Unknown, Unknown (23):
  • TLSv1.3 (OUT), TLS Unknown, Unknown (23):
  • Using Stream ID: 1 (easy handle 0x7fffbc98f580)
  • TLSv1.3 (OUT), TLS Unknown, Unknown (23):

GET /compute/cuda/repos HTTP/2
Host: developer.download.nvidia.cn
User-Agent: curl/7.58.0
Accept: */*

GET /compute/cuda/repos/ HTTP/2
Host: developer.download.nvidia.com
User-Agent: curl/7.58.0
Accept: */*

GET /compute/cuda/repos/ HTTP/2
Host: developer.download.nvidia.cn
User-Agent: curl/7.58.0
Accept: */*

  • TLSv1.3 (IN), TLS Unknown, Unknown (23):
    < HTTP/2 200
    < date: Wed, 26 Aug 2020 06:19:24 GMT
    < content-type: text/html
    < content-length: 3904
    < age: 481073
    < cache-control: max-age=604800
    < etag: “3077532576+ident”
    < expires: Wed, 02 Sep 2020 06:19:23 GMT
    < last-modified: Thu, 06 Aug 2020 16:05:53 GMT
    < server: ECAcc (tka/899E)
    < vary: Accept-Encoding
    < strict-transport-security: max-age=31536000; preload
    < x-orca-cache-key-suffix: 200
    < x-orca-edge-logic-revision: NV-20200628-1
    < x-edge-location: hnd
    < x-orca-accelerator: MISS from k01.mul.hnd01.jp.krill.zenlogic.net
    < x-edge-location: szt
    < x-orca-accelerator: MISS from k02.chn.szt01.cn.krill.zenlogic.net
    < x-cache: MISS from k02.chn.szt01.cn.krill.zenlogic.net
    <

<!doctype html>

Index of /compute/cuda/repos/

Index of /compute/cuda/repos/

  • Connection #1 to host developer.download.nvidia.cn left intact
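
For anyone who wants to compare redirect chains across many runs, the hops can be pulled out of saved `curl -Lv` output mechanically. A minimal sketch, assuming response header lines keep their `<` prefix as in the trace above; `redirect_hops` is a hypothetical helper name:

```python
import re
from typing import List, Optional, Tuple

# Response status lines look like "< HTTP/2 301"; location headers like "< location: URL".
STATUS_RE = re.compile(r"^\s*<\s*HTTP/\S+\s+(\d{3})")
LOCATION_RE = re.compile(r"^\s*<\s*location:\s*(\S+)", re.IGNORECASE)


def redirect_hops(trace: str) -> List[Tuple[int, Optional[str]]]:
    """Return (status, location) for each response found in verbose curl output."""
    hops: List[List] = []
    for line in trace.splitlines():
        status = STATUS_RE.match(line)
        if status:
            # A new response starts; its location (if any) follows in later lines.
            hops.append([int(status.group(1)), None])
            continue
        location = LOCATION_RE.match(line)
        if location and hops:
            hops[-1][1] = location.group(1)
    return [(code, target) for code, target in hops]
```

Feeding it the stderr of a `curl -Lv` run gives one tuple per response, so a chain that bounces between .com and .cn shows up as a sequence of 301 entries followed by the final status.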

Today we hit a new round of “Failed to ssl_handshake: timeout” (not always, but in more than 90% of attempts).
This is what Safari got for https://developer.download.nvidia.cn/compute/machine-learning/repos (redirected from https://developer.download.nvidia.com/compute/machine-learning/repos)

Occasionally I can get this when lucky:

And there are more corrupted cases:
For example, I can randomly get different “valid” responses.

This example shows that the results returned via different ISPs (China Telecom and China Unicom, the two largest ISPs in China) can differ: the China Unicom result does not list Fedora 32 (the response was not truncated; it simply does not have that entry in the list).

This example shows that the result from China Unicom can sometimes be good, but only at random.

The result returned via CERNET (another major ISP in China, mainly serving universities) seems to be good.

Some packages also keep coming and going.
For example, only a small portion of responses include cuda-runtime-11-1; the majority only go up to 11-0, as below.
This is purely random: every curl gets a different answer.
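
One way to quantify this kind of inconsistency is to save several responses and diff the entries they list. A minimal sketch, assuming the index page exposes its entries as `href` attributes (an assumption about the page markup, not something verified here); `listing_entries` and `inconsistent_entries` are hypothetical helper names:

```python
import re
from typing import Dict, List, Set

HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def listing_entries(html: str) -> Set[str]:
    """Collect relative link targets from an index page, skipping parent and sort links."""
    return {
        href for href in HREF_RE.findall(html)
        if not href.startswith(("?", "/", ".."))
    }


def inconsistent_entries(responses: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each entry missing from at least one response to the labels lacking it."""
    per_label = {label: listing_entries(body) for label, body in responses.items()}
    all_entries = set().union(*per_label.values())
    report: Dict[str, List[str]] = {}
    for entry in sorted(all_entries):
        missing = sorted(label for label, seen in per_label.items() if entry not in seen)
        if missing:
            report[entry] = missing
    return report
```

The keys of `responses` can be any label for a saved body, such as the ISP or a timestamp; an empty result means all saved responses listed the same entries.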