DIGITS settings for AWS p2.16xlarge instance

Hello,

I’m about to run a p2.16xlarge instance on AWS. I was wondering if anyone had any insight into the best settings for object detection.

I’m following the object-detection example in the NVIDIA/DIGITS GitHub repository (DIGITS/examples/object-detection on the master branch) to build a vehicle detection model.

However, I think the settings in that example only run on a single GPU.

The p2.16xlarge includes:

16 NVIDIA K80 GPUs
64 vCPUs
732 GiB RAM
192 GB GPU memory

I want to get this trained as fast as possible. What settings would really take advantage of that hardware?

You should be able to select the number of GPUs you want to use near the bottom of the “New Object Detection Model” web page (see the select-gpus.jpg screenshot in the NVIDIA/DIGITS GitHub repository).

Speedups in training can vary based on a few factors, such as the number of samples processed per GPU, the algorithm, the dataset, and memory. You will have to experiment with your dataset to find out what works best for your setup.

Here are two links on detection using DIGITS that you might find useful if you are getting started:

https://devblogs.nvidia.com/parallelforall/detectnet-deep-neural-network-object-detection-digits/

https://devblogs.nvidia.com/parallelforall/exploring-spacenet-dataset-using-digits/

If you are experimenting with different configurations on AWS, you may want to consider using the NVIDIA Volta Deep Learning AMI and NVIDIA GPU Cloud (NGC). This AMI is only supported on Volta instances (p3) and comes with our driver, Docker, and nvidia-docker installed. NGC has a catalog of optimized deep learning containers, one of which is DIGITS.

allygray,

Hey, thanks for the references. I’ll dig deeper into them for more information.

I was able to get all 16 GPUs to run. I had to adjust the batch size and learning rate because I kept getting a balance error: it wanted the batch size to be divisible by the total number of samples, or something along those lines. I don’t remember exactly now.
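
For anyone who hits a similar error, here is a rough Python sketch of the arithmetic I mean. It assumes the constraint is that the total batch size has to split evenly across the GPUs, which may not be exactly what the error was checking. The function names and the starting values (batch size 2, learning rate 0.0001) are made up for illustration, and scaling the learning rate linearly with the batch size is only a common heuristic, not something DIGITS requires.

```python
def fit_batch_size(requested: int, num_gpus: int) -> int:
    """Round `requested` down to a multiple of `num_gpus`.

    Never returns less than `num_gpus`, so each GPU gets at least one sample.
    Hypothetical helper, not part of DIGITS.
    """
    if requested < 1 or num_gpus < 1:
        raise ValueError("requested and num_gpus must be positive")
    return max(num_gpus, (requested // num_gpus) * num_gpus)


def scale_learning_rate(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling heuristic: keep learning_rate / batch_size constant."""
    return base_lr * new_batch / base_batch


num_gpus = 16  # GPUs on a p2.16xlarge

# Illustrative numbers only: ask for a batch of 40, get the nearest fit.
batch = fit_batch_size(requested=40, num_gpus=num_gpus)
print("total batch size:", batch)
print("samples per GPU:", batch // num_gpus)

# Starting point for the learning rate, to be tuned from there.
print("scaled learning rate:", scale_learning_rate(0.0001, 2, batch))
```

I would treat the scaled learning rate as a first guess and still watch the loss curve, since the best value depends on the dataset and network.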

So there are no real standards; it seems to depend on several factors. Hopefully your resources will give me a better idea of how to optimize, rather than just blindly throwing numbers at it.

Again, thanks for the help.

Michael