Scaling Keras Model Training to Multiple GPUs

Originally published at:

Keras is a powerful deep learning meta-framework which sits on top of existing frameworks such as TensorFlow and Theano. Keras is highly productive for developers; it often requires 50% less code to define a model than native APIs of deep learning frameworks require (here’s an example of LeNet-5 trained on MNIST data in Keras (71 lines) and TensorFlow…

The following code works for me. Simply drop the code in and it's good to go. As far as I can tell it only scales on a single box, not a distributed cluster.

@nor_he:disqus Yes, these approaches do work. In fact, I did mention this in the blog post. Recall the following sentence:

"Keras’s official blog also demonstrates that by breaking the backend-independent abstraction and exposing TensorFlow’s multi-GPU primitives, it’s possible to get Keras to scale. "

There was also a hyperlink associated with the above:

The main problem I see here is that this requires the user to mix the interface (Keras) with the implementation ("with tf.device()" wrappers), which effectively makes the code non-portable to other backends.

The simplest analogy that I can see here is the SLF4J logging wrapper (Simple Logging Facade for Java). The user only needs to concern themselves with writing API calls using the facade, and can keep swapping backends (log4j, java.util.logging, logback) at will. Also, thinking in terms of object-oriented programming, having an interface shouldn't require the user to look into the implementation, which "with tf.device()" requires us to do.
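
To illustrate the facade idea in Python terms, here is a minimal sketch. Every class and method name below is hypothetical and made up for this comment; none of it is Keras, TensorFlow, or MXNet API. The point is only that the user-facing call stays identical while the backend behind it changes.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical backend interface (illustration only, not a real API)."""

    @abstractmethod
    def place(self, num_gpus):
        """Return labels describing where work is placed on num_gpus devices."""


class BackendA(Backend):
    def place(self, num_gpus):
        return ["A:gpu:%d" % i for i in range(num_gpus)]


class BackendB(Backend):
    def place(self, num_gpus):
        return ["B/device-%d" % i for i in range(num_gpus)]


class Facade:
    """User-facing layer: callers pass a GPU count, never backend-specific calls."""

    def __init__(self, backend):
        self._backend = backend

    def compile(self, num_gpus=1):
        # Device placement is delegated to whichever backend was plugged in.
        return self._backend.place(num_gpus)


def user_code(facade):
    # This function is identical no matter which backend sits behind the facade.
    return facade.compile(num_gpus=2)


placement_a = user_code(Facade(BackendA()))
placement_b = user_code(Facade(BackendB()))
```

Swapping BackendA for BackendB changes the placement labels but not a single line of user_code, which is the property that "with tf.device()" wrappers give up.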

Granted, the GPU list provided to Keras's model.compile() only works in the MxNet Keras fork now, but I would argue that this is the right approach. The change has been made at the interface level, which will hopefully soon be absorbed into mainstream Keras, and it's the Keras backends' job to determine how to make multi-GPU data parallelism happen. That way one can have one abstraction that's stable, and can swap out the backends while maintaining a portable multi-GPU solution.
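
For readers unfamiliar with what "multi-GPU data parallelism" asks of a backend, here is a rough pure-Python sketch under simplifying assumptions: a scalar linear model, a mean-squared-error loss, and "devices" simulated by list shards. The function names are invented for illustration and no framework API is implied. It shows that splitting a batch across devices and combining the per-shard gradients (weighted by shard size) reproduces the full-batch gradient.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split_batch(items, num_devices):
    """Split a batch into num_devices contiguous shards of near-equal size."""
    base, extra = divmod(len(items), num_devices)
    shards, start = [], 0
    for i in range(num_devices):
        size = base + (1 if i < extra else 0)
        shards.append(items[start:start + size])
        start += size
    return shards


def data_parallel_gradient(w, xs, ys, num_devices):
    """Compute one gradient per shard, then combine them weighted by shard size."""
    total = 0.0
    for x_shard, y_shard in zip(split_batch(xs, num_devices),
                                split_batch(ys, num_devices)):
        if x_shard:  # skip empty shards when the batch is smaller than num_devices
            total += mse_gradient(w, x_shard, y_shard) * len(x_shard)
    return total / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # synthetic targets for y = 2x

full_batch = mse_gradient(0.5, xs, ys)
sharded = data_parallel_gradient(0.5, xs, ys, num_devices=4)
```

In a real backend each shard's gradient would be computed on a separate GPU and the combine step would be an all-reduce or a parameter-server update; how that happens is exactly the detail the interface should hide.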

The second issue I mentioned in my post is that TensorFlow performance varies widely, e.g. the difference between this ResNet-50 tutorial implementation and this performance-optimized implementation is about 2x on a DGX-1. Since TensorFlow has many ways of doing the same thing, it can't always be expected that a highly abstract implementation will be fast, even if the backend can support that. MxNet's approach is usually to have one "mainstream" way of doing a particular thing, so it is easier to map the Keras frontend to the MxNet backend while keeping performance reasonably close to that of the native API.

I encourage you to try both backends and see what performance differences you get.

What I didn't really get:

Is the performance issue of Keras for single GPUs mostly based on the need for efficient data pipelines (could also just use Dataflow from Tensorpack then) or is it primarily based on how Keras builds a network internally?

Framework-native Keras input is currently also in the works:

It's actually both a pipeline and a backend issue.

The pipeline is certainly quite important, but it probably matters more for multiple GPUs. Single-GPU performance may still be affected by the pipeline, particularly in the case of large GPUs (e.g. a Tesla GP100, as opposed to, say, a GeForce GTX 1070): one needs to provide the data quickly enough that the GPU does not stall waiting on it.
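
As a rough illustration of keeping the consumer fed, here is a minimal prefetching sketch using only the Python standard library. It is not how Keras or TensorFlow implement their input pipelines; load_batches is a hypothetical stand-in for slow loading and preprocessing, and the "training step" is just a sum.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches`, loading up to buffer_size ahead in a background thread."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buf.put(batch)  # blocks while the buffer is full
        buf.put(done)

    # Note: errors raised inside the producer are not propagated in this sketch.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()  # blocks only if the producer has fallen behind
        if item is done:
            return
        yield item


def load_batches(n):
    """Hypothetical stand-in for slow data loading and preprocessing."""
    for i in range(n):
        yield [i, i + 1]


# The consumer (standing in for the GPU training step) overlaps with loading.
consumed = [sum(batch) for batch in prefetch(load_batches(4))]
```

While the consumer works on one batch, the producer thread is already preparing the next ones, so the consumer waits only when loading is slower than compute.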

Regarding the backend, Keras constructs the TF graph very differently from a highly optimized TF graph. It seems to me that the Keras backend for TF uses high-level ops that are not particularly efficient, and to get full perf in TF, one has to use low-level ops such as the "staging area" node, which essentially does explicit data prefetch (which is implicit in other frameworks, rather than part of the graph definition). See here, for example. I provided links to the "unoptimized" and "optimized" TF ResNet-50 implementations above; those should be helpful. The interested reader can run these examples and note the significant perf difference, even using TF's native API. Since Keras is not tuned to any particular network but is rather a general abstraction, perhaps using all of the optimized ops is non-trivial (unless the graph is optimized by TF after construction, during optimization passes). That said, some aspects such as handling data prefetch are probably common to all models.

The backend issue involves many details, such as the fact that the original Keras was based on Theano, which had the NCHW image layout, while TF's native layout is NHWC. cuDNN convolution kernels are NCHW (up to Volta), and if the layout conversion requires many transposes, that will come at a significant cost (transposes are very quick and bandwidth-bound, but there is a kernel launch overhead to perform them, unless the transpose is fused with some other kernel that will use the transposed data).
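
To make the layout difference concrete, here is a small pure-Python sketch of what an NHWC-to-NCHW transpose does to a flat buffer. The helper names are made up for illustration; a real framework would do this with a GPU kernel rather than Python loops.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) when channels come before height/width."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) when channels are the innermost dimension."""
    return ((n * H + h) * W + w) * C + c


def nhwc_to_nchw(flat, N, C, H, W):
    """Copy every element of an NHWC buffer to its position in an NCHW buffer."""
    out = [None] * len(flat)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = nhwc_offset(n, c, h, w, C, H, W)
                    dst = nchw_offset(n, c, h, w, C, H, W)
                    out[dst] = flat[src]
    return out


# One image, two channels (r, g), one row, three pixels.
nhwc = ["r0", "g0", "r1", "g1", "r2", "g2"]  # channels interleaved per pixel
nchw = nhwc_to_nchw(nhwc, N=1, C=2, H=1, W=3)  # one full plane per channel
```

Every element is read once and written once, which is why the operation is bandwidth-bound; the cost discussed above comes from doing this as many separate kernel launches between layers.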

IMHO the TF backend for Keras can definitely be improved to be competitive. I heard that this is already in the works, beyond the framework-native input pipeline component.

Unfortunately, Keras is quite slow in terms of single-GPU training and inference time (regardless of the backend). It is also hard to get it to work on multiple GPUs without breaking its framework-independent abstraction.