Storage Performance Basics for Deep Learning

Originally published at:

Introduction When production systems are not delivering expected levels of performance, root-causing the issue(s) can be a challenging and time-consuming task. This is especially true in today’s complex environments, where the workload comprises many software components, libraries, etc., and relies on virtually all of the underlying hardware subsystems (CPU, memory, disk IO, network IO)…

Great write-up! I enjoyed reading that...

Thanks for the article, it was a nice read.

For CUDA developers that need very low latency disk access and do not require a file system, I have made a library for creating CUDA storage applications:

I've also made a synthetic benchmark for it, comparing it to, among other things, memory mapping a file. It's still very much a work in progress, so don't expect too much from it, but it shows some interesting concepts, like directly accessing a disk using GPUDirect RDMA/Async.

Thanks very much Tim. Much more to come!

Thanks very much. Having a look at your code this afternoon - very interesting!

On a related note, applying some basic sanity checking on several white-box storage nodes we have in our lab is time well spent. These nodes each have 6 NVMe SSDs, and on one of the storage nodes, one of the NVMe devices gets less than half the random 4k read IOPS of the other five. I have not yet root-caused this, but it's one of those things that would potentially cause a lot of hair-pulling once in production. The healthy NVMe SSDs are getting near 500k random 4k read IOPS, but the 'bad' NVMe SSD sustains less than 200k. Huge difference, and something that would have dragged a RAID group down for sure.
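
For anyone wanting to automate that kind of check, here is a minimal sketch. It assumes the per-device random 4k read IOPS have already been measured with whatever benchmark tool you use. The function name, device names, numbers, and the 50% threshold are all illustrative assumptions, not the actual lab measurements.

```python
from statistics import median


def flag_slow_devices(iops_by_device, threshold=0.5):
    """Return the devices whose IOPS fall below threshold * group median.

    The median is used as the baseline so that a single slow device
    does not pull the reference value down with it.
    """
    baseline = median(iops_by_device.values())
    return {
        dev: iops
        for dev, iops in iops_by_device.items()
        if iops < threshold * baseline
    }


# Hypothetical measurements for one storage node with six NVMe SSDs.
measured = {
    "nvme0n1": 495_000,
    "nvme1n1": 498_000,
    "nvme2n1": 500_000,
    "nvme3n1": 497_000,
    "nvme4n1": 496_000,
    "nvme5n1": 190_000,  # the 'bad' device in this made-up example
}

slow = flag_slow_devices(measured)
for dev, iops in slow.items():
    print(f"{dev}: {iops} IOPS is well below the group median")
```

Running this per node before building a RAID group would surface a device like the one described above early, since the group tends to be held back by its slowest member.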

Thanks James. This really helps us in our testing of NVMe drives.