Slow startup and model loading time

System information

  • OS Platform and Distribution: Linux Ubuntu 18.04, 5.4.0-66-generic
  • Device: MPG B550I GAMING EDGE MAX WIFI (MS-7C92)
    • CPU: AMD Ryzen 9 5950X 16-Core Processor
    • GPU: GeForce RTX 3090, 24265MiB
  • NVIDIA Driver: 460.39

Current behavior
I was experiencing very slow speeds when loading TensorFlow models into GPU memory (about 1 MiB/s when watching nvidia-smi), and I tried to investigate this separately by using the tensorflow/tensorflow:2.1.0-gpu-py3 Docker image to run this command:
time python -c "import tensorflow as tf; tf.test.is_gpu_available()"

and it prints the following logs:

2021-03-25 07:39:11.387959: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2021-03-25 07:39:11.388827: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
WARNING:tensorflow:From <string>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-03-25 07:39:11.747442: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-25 07:39:11.771337: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3399945000 Hz
2021-03-25 07:39:11.772645: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47819f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-25 07:39:11.772667: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-25 07:39:11.774848: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-25 07:39:11.863115: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-25 07:39:11.863658: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4783c30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-25 07:39:11.863670: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 3090, Compute Capability 8.6
2021-03-25 07:39:11.863767: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-25 07:39:11.864237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: pciBusID: 0000:2b:00.0 name: GeForce RTX 3090 computeCapability: 8.6 coreClock: 1.785GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-03-25 07:39:11.864264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-03-25 07:39:11.864281: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-03-25 07:39:11.865203: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-03-25 07:39:11.865342: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-03-25 07:39:11.866162: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-03-25 07:39:11.866588: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-03-25 07:39:11.866606: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-25 07:39:11.866644: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-25 07:39:11.867128: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-25 07:39:11.867588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-25 07:39:11.867606: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1

after which it prints nothing for about 2.5 minutes, and then prints this:

2021-03-25 07:41:58.768624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-25 07:41:58.768648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2021-03-25 07:41:58.768653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2021-03-25 07:41:58.768811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-25 07:41:58.769347: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-25 07:41:58.769954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/device:GPU:0 with 22243 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:2b:00.0, compute capability: 8.6)

real 2m48.205s
user 2m47.844s
sys 0m2.368s

Running the same command again within the same Docker container finishes in about 2 seconds, but starting a new container results in the slow loading time again.
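
For example, a fresh container can be started with something like the following (assuming the NVIDIA Container Toolkit is installed; the exact invocation may differ from what I used):

docker run --rm --gpus all tensorflow/tensorflow:2.1.0-gpu-py3 \
  bash -c 'time python -c "import tensorflow as tf; tf.test.is_gpu_available()"'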

Expected behavior
There should not be such a long startup and model loading time. I am not entirely sure whether this issue lies with TensorFlow, the driver, or the hardware itself. I would greatly appreciate any help I can get.


I am facing a similar problem as well, with a Zen 3 CPU and an RTX 3090.
We have production software running on TensorFlow 1.15 where we also experience this issue.
Could this be an issue with the Linux kernel or the NVIDIA driver? Is there anything we can do to change these for compatibility?

Are you using the tf.data API for data processing?

No, but I don’t think that should matter, because the stalling happens during the step where the model is loaded into GPU memory, before any inference or data processing takes place.

As an update to this post, I managed to partially solve the above issue for TensorFlow 2.x models by upgrading to TensorFlow 2.4. However, I still face the same issue for TensorFlow 1.x models and have no easy way to update all of them to the newer version.

I struggled with similar problems and eventually discovered a couple of hacks. They are somewhat unrelated, but may help you in the future.

  1. Try using Horovod even if you have only 1 GPU (see the sketch after this list). There is something buggy in the TensorFlow model-loading mechanism that leads to a linear increase in loading time with the number of GPUs or the size of the model (the latter is understandable, but Horovod is somehow almost free of the former delay).

  2. There is an almost undiscussed and poorly documented feature in the tf.data API: the buffer_size parameter of tf.data.TFRecordDataset, as in

dataset_train = tf.data.TFRecordDataset(files_train, buffer_size=10000, num_parallel_reads=tf.data.experimental.AUTOTUNE)

Increasing this value helped me decrease epoch time by an order of magnitude(!) on a server with 16 GPUs.
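
For reference, a minimal single-process Horovod setup for TF 2.x looks roughly like this (a sketch, not the exact code from my setup):

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # initialize Horovod; works even with a single process / single GPU

# Pin this process to one GPU so TensorFlow only initializes that device.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

The script is then launched with horovodrun -np 1 python train.py.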
