Managed Google Cloud A2 VM issues after installing CUDA drivers

I am leading the first Google A2 deployment of an ML application at a financial giant. I am facing issues with the managed Google Cloud A2 VM instances (these are modified for enterprise login and a VPN private cloud).

I’m wondering if anyone has had any success working with Google Cloud A2 VMs in this type of environment. If so, please let me know:

  • What VM image did you use?
  • Were you able to install the CUDA drivers and the containers, and if so, how?

The client IT team has diagnosed this as a GCP issue, and the client GCP team has been informed, but they do not seem to have had much success resolving it.

The issue: when I install the CUDA drivers, the VM becomes unavailable after a reboot and cannot be pinged. I’m using the recommended Ubuntu 16.x VM image that Google Cloud has provided.

Here is the startup trace as the A2 VM comes back up after the reboot:

=========================================
[ 6.969070] cloud-init[1270]: Can not apply stage config, no datasource found! Likely bad things to come!
[  OK  ] Stopped System Logger Daemon.
[ 6.996371] cloud-init[1270]: ------------------------------------------------------------
[FAILED] Failed to listen on Syslog Socket.
See 'systemctl status syslog.socket' for details.
[ 7.028347] cloud-init[1270]: Traceback (most recent call last):
Starting System Logger Daemon…
[ 7.064373] cloud-init[1270]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 485, in main_modules
[  OK  ] Started System Security Services Daemon.
[ 7.096244] cloud-init[1270]: init.fetch(existing="trust")
[  OK  ] Started Message of the Day.
[ 7.132251] cloud-init[1270]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 350, in fetch
[FAILED] Failed to start Apply the settings specified in cloud-config.
See 'systemctl status cloud-config.service' for details.
[ 7.168430] cloud-init[1270]: return self._get_data_source(existing=existing)
[FAILED] Failed to start Google Compute Engine Instance Setup.
See 'systemctl status google-instance-setup.service' for details.
[ 7.204237] cloud-init[1270]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 260, in _get_data_source
[FAILED] Failed to start System Logger Daemon.
[ 7.244206] cloud-init[1270]: pkg_list, self.reporter)
See 'systemctl status syslog-ng.service' for details.
[ 7.264137] cloud-init[1270]: File "/usr/lib/python3/dist-packages/cloudinit/sources/init.py", line 780, in find_source
Starting OpenBSD Secure Shell server…
[ 7.288216] cloud-init[1270]: raise DataSourceNotFoundException(msg)
[  OK  ] Started Google Compute Engine Network Daemon.
[ 7.320242] cloud-init[1270]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: ()
[  OK  ] Started Google Compute Engine Accounts Daemon.
[ 7.356294] cloud-init[1270]: ------------------------------------------------------------
Starting Google Compute Engine Shutdown Scripts…
[ 7.390305] google_instance_setup[1385]: Traceback (most recent call last):
[  OK  ] Started Google Compute Engine Clock Skew Daemon.
[ 7.420151] google_instance_setup[1385]: File "/usr/lib/python3.5/logging/handlers.py", line 823, in _connect_unixsocket
[  OK  ] Reached target User and Group Name Lookups.
[ 7.448220] google_instance_setup[1385]: self.socket.connect(address)
Starting Accounts Service…
[ 7.468212] google_instance_setup[1385]: ConnectionRefusedError: [Errno 111] Connection refused
Starting Login Service…
[ 7.488159] google_instance_setup[1385]: During handling of the above exception, another exception occurred:
[  OK  ] Started Google Compute Engine Shutdown Scripts.
[ 7.512166] google_instance_setup[1385]: Traceback (most recent call last):
[  OK  ] Started OpenBSD Secure Shell server.
[ 7.536251] google_instance_setup[1385]: File "/usr/bin/google_instance_setup", line 9, in
[  OK  ] Started Login to default iSCSI targets.
[ 7.572173] google_instance_setup[1385]: load_entry_point('google-compute-engine==20190801.0', 'console_scripts', 'google_instance_setup')()
[  OK  ] Started Login Service.
[ 7.612164] google_instance_setup[1385]: File "/usr/lib/python3/dist-packages/google_compute_engine/instance_setup/instance_setup.py", line 254, in main
[  OK  ] Started Unattended Upgrades Shutdown.
[ 7.648316] google_instance_setup[1385]: InstanceSetup(debug=bool(options.debug))
Starting Authenticate and Authorize Users to Run Privileged Tasks…
[ 7.680345] google_instance_setup[1385]: File "/usr/lib/python3/dist-packages/google_compute_engine/instance_setup/instance_setup.py", line 58, in init
[  OK  ] Stopped System Logger Daemon.
[ 7.720542] google_instance_setup[1385]: name='instance-setup', debug=self.debug, facility=facility)
[FAILED] Failed to listen on Syslog Socket.
See 'systemctl status syslog.socket' for details.
[ 7.756314] google_instance_setup[1385]: File "/usr/lib/python3/dist-packages/google_compute_engine/logger.py", line 50, in Logger
Starting System Logger Daemon…
[  OK  ] Reached target Remote File Systems (Pre).
[ 7.796313] google_instance_setup[1385]: address=constants.SYSLOG_SOCKET, facility=facility)
[  OK  ] Reached target Remote File Systems.
[ 7.832275] google_instance_setup[1385]: File "/usr/lib/python3.5/logging/handlers.py", line 806, in init
Starting LSB: Postfix Mail Transport Agent…
[ 7.868165] google_instance_setup[1385]: self._connect_unixsocket(address)
Starting LSB: automatic crash report generation…
[ 7.900157] google_instance_setup[1385]: File "/usr/lib/python3.5/logging/handlers.py", line 834, in _connect_unixsocket
Starting LSB: Load kernel image with kexec…
[ 7.936143] google_instance_setup[1385]: self.socket.connect(address)
Starting LSB: Mount debugfs on /sys/kernel/debug…
[ 7.956245] google_instance_setup[1385]: ConnectionRefusedError: [Errno 111] Connection refused
Starting LSB: Start NTP daemon…
[ 7.993418] google_network_daemon[1659]: Traceback (most recent call last):
Starting Permit User Sessions…
[ 8.020224] google_network_daemon[1659]: File "/usr/lib/python3.5/logging/handlers.py", line 823, in _connect_unixsocket
Starting LSB: Set the CPU Frequency Scaling governor to “ondemand”…
[ 8.052237] google_network_daemon[1659]: self.socket.connect(address)
Starting LSB: Start/stop sysstat’s sadc…
[ 8.084142] google_network_daemon[1659]: ConnectionRefusedError: [Errno 111] Connection refused
[FAILED] Failed to start System Logger Daemon.
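
Because the systemd status lines and the Python tracebacks are interleaved in the trace above, it can be hard to see at a glance which units failed and which exceptions were raised. Below is a minimal Python sketch for summarizing a saved copy of the console output. It is my own helper, not part of any Google or cloud-init tooling: the names `summarize_console`, `FAILED_RE`, `EXCEPTION_RE`, and `SAMPLE` are hypothetical, and the patterns assume lines shaped like the ones in this trace.

```python
import re
from collections import Counter

# "[FAILED] Failed to start X." with or without leftover colour-code debris
# such as "[[0;1;31mFAILED[0m]" (assumed shape, taken from the trace above).
FAILED_RE = re.compile(r"FAILED\S*\]\s*(Failed to .+?)\.?\s*$")

# Last line of a Python traceback as logged by a service, e.g.
# "name[pid]: pkg.SomeError: message". Only names ending in Error or
# Exception are recognized.
EXCEPTION_RE = re.compile(r"\]:\s+([\w.]*(?:Error|Exception))(?::|$)")


def summarize_console(text):
    """Return the distinct failed-unit messages and a count per exception type."""
    failed_units = []
    exceptions = Counter()
    for line in text.splitlines():
        failed = FAILED_RE.search(line)
        if failed and failed.group(1) not in failed_units:
            failed_units.append(failed.group(1))  # keep first-seen order
        raised = EXCEPTION_RE.search(line)
        if raised:
            exceptions[raised.group(1)] += 1
    return {"failed_units": failed_units, "exceptions": dict(exceptions)}


# A few lines copied from the trace above, used only as a demonstration.
SAMPLE = """\
[[0;1;31mFAILED[0m] Failed to listen on Syslog Socket.
[ 7.320242] cloud-init[1270]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: ()
[ 7.468212] google_instance_setup[1385]: ConnectionRefusedError: [Errno 111] Connection refused
[FAILED] Failed to start System Logger Daemon.
[ 7.956245] google_instance_setup[1385]: ConnectionRefusedError: [Errno 111] Connection refused
"""

summary = summarize_console(SAMPLE)
for unit in summary["failed_units"]:
    print("failed:", unit)
for name, count in summary["exceptions"].items():
    print("exception:", name, "x", count)
```

Run against the full trace, this should make it easier to separate the cloud-init datasource failure from the syslog socket refusals. It only summarizes what the console printed and does not diagnose the cause.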

If anyone from Google Cloud monitors this group, please ping me. I’d appreciate some help resolving this issue, as it is gating our development progress.