Amazon Ubuntu 16.04 P3 instances only run a kernel once, then crash the server

Hi,

I installed CUDA on a fresh Ubuntu 16.04 install.

Now I can run a test kernel once, e.g. deviceQuery, but any subsequent execution crashes the server, so badly that AWS takes many minutes to reboot it.

I have seen this repeatedly, across multiple fresh installations.

I am at a loss. Is anyone else able to use P3 servers at the moment? Windows seems not to work at all. Maybe P3 is just broken right now.

ubuntu@ip-172-31-6-241:~/NVIDIA_CUDA-9.1_Samples/bin/x86_64/linux/release$ sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla V100-SXM2-16GB"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 16152 MBytes (16936861696 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1530 MHz (1.53 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 30
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 1
Result = PASS

Repeat it:

ubuntu@ip-172-31-6-241:~/NVIDIA_CUDA-9.1_Samples/bin/x86_64/linux/release$ sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Server crashes!

Many people use P3 instances successfully; they are a core platform for the NVIDIA NGC product. If you have to run deviceQuery with sudo, there is something broken about your install.

I would suggest trying the NVIDIA Volta AMI (for linux) on a P3 instance:

http://docs.nvidia.com/ngc/ngc-aws-setup-guide/launching-vm-instance-from-console.html#selecting-nv-volta-deeplearning-ami
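
As a quick sanity check on a plain install (independent of the AMI), something along these lines should confirm the driver module is loaded and that deviceQuery runs as a regular user. This is just a rough sketch; it assumes the driver utilities are on your PATH and the samples were built in the location shown in your paste:

$ nvidia-smi                          # driver should load and list the V100
$ cat /proc/driver/nvidia/version     # exact installed driver version
$ ls -l /dev/nvidia*                  # device nodes should be readable/writable by all users
$ cd ~/NVIDIA_CUDA-9.1_Samples/bin/x86_64/linux/release
$ ./deviceQuery                       # should PASS without sudo on a healthy install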

I should have deleted the sudo; it probably threw you off.

No need to be defensive though. It's AWS; they have always had poor GPU compute support over the 6 years or so I've been using them. Their Windows servers don't work with P3 either.

Thank you for the Volta AMI suggestion; hopefully they ironed out all the quirks in that. I'll give it a go.

It's frustrating for a professional, because I just want to turn it on, follow the installation guides, pay my money and do the work! Impossible so far.

Thanks for pointing me in the right direction.

It requires NGC registration, whatever the hell that is.

Just let me run CUDA on linux! No fuss, haha.

It supports spot instances.

I used it just last week on a spot instance.

Use of the AMI itself should not require NGC registration. Yes, using NGC requires NGC registration. But you should not need to use NGC just to select a P3 instance type and spin up that AMI on it.

And I’m pretty certain you can just run CUDA on linux. I pointed out the AMI because setting up CUDA on linux can have various pitfalls, and the AMI avoids some of them.

edit: I'll have another look, thanks for the clarification.

OK, sorry, I had forgotten about the start-up request for an NGC key. NGC is free, though. You might want to try it.

Otherwise, I refer you to the CUDA linux install guide:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
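
The step people most often miss in that guide is the post-install environment setup. Roughly, assuming the toolkit landed in the default /usr/local/cuda-9.1 location:

$ export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}
$ export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
$ nvcc --version                      # should report release 9.1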

Nope, that AMI does not have the CUDA 9.1 toolkit, a current NVIDIA driver, or the samples installed.

I tried installing them the standard way, and hit the same issues as on every other Ubuntu 16.04 AMI I've tried.

I gave up trying to set up CUDA on P3. If it's working for anyone, please let me know how you installed CUDA 9.1.

Same issue on a different distro:

https://devtalk.nvidia.com/default/topic/1028018/cuda-setup-and-installation/amazon-p3-doesnt-work-on-linux-17-04/

The issue now appears to be understood and affects all currently available r387 drivers.

It should be fixed in a future r387 driver that is 387.41 or later.

As an interim workaround, use CUDA 9.0/r384 drivers on AWS P3 instances.
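
For example, on Ubuntu 16.04 the CUDA 9.0 toolkit and its bundled r384 driver can be pulled from the NVIDIA network repo roughly like this (a sketch only; double-check the exact .deb file name against the download page):

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
$ sudo apt-get update
$ sudo apt-get install cuda-9-0       # installs the toolkit plus the matching r384 driver
$ sudo reboot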

There is a 390.12 beta driver, available since January 4th, that is believed to fix this issue.

Future 390.xx drivers should also be posted. It appears now that there may not be any further 387.xx drivers posted, but any 390.xx driver should fix this issue.
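
After installing the 390.12 beta (or any later 390.xx driver), you can confirm the instance is actually running it with the standard driver query tools, e.g.:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # should print 390.12 or later
$ cat /proc/driver/nvidia/version                               # same information from the kernel module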