After upgrading to version 384.66 of the NVIDIA Linux kernel driver (kernel 3.10.0-693.2.2.el7.x86_64 on RHEL 7.4), we observe cudaMalloc() latency that scales roughly linearly with allocation size: doubling the block size roughly doubles the time. To measure how long allocations of various sizes take, we use the program at http://lpaste.net/2039592059479785472. On a machine with a GTX 780Ti and driver version 384.66, the last three lines of its output are:
134217728 bytes; average: 8144us; min: 4120us; max: 8208us
268435456 bytes; average: 16248us; min: 8201us; max: 16346us
536870912 bytes; average: 32461us; min: 16390us; max: 32738us
On an otherwise identical machine running driver version 375.66, we obtain:
134217728 bytes; average: 254us; min: 248us; max: 301us
268435456 bytes; average: 372us; min: 356us; max: 426us
536870912 bytes; average: 600us; min: 587us; max: 645us
While there is still mild sensitivity to the allocation size (which is understandable), allocations under 375.66 run much faster overall: roughly 50x faster at 512 MiB (600us vs. 32461us average). We observe the same slowdown in other, non-contrived programs, to the degree that much of our CUDA code is now bottlenecked by the time it takes to allocate memory.
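For reference, since the lpaste link may no longer resolve, below is a minimal sketch of the kind of measurement loop we mean. It is an approximation rather than the original program; the warm-up call, the iteration count, and the file name alloc_bench.cu are our own choices here:

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    // Warm-up: force CUDA context creation so its cost is not charged
    // to the first timed cudaMalloc().
    cudaFree(0);
    const int iterations = 10;
    // Double the block size from 1 KiB up to 512 MiB, the largest size
    // reported in the output above.
    for (size_t bytes = 1024; bytes <= ((size_t)512 << 20); bytes <<= 1) {
        long long total = 0, min_us = -1, max_us = 0;
        for (int i = 0; i < iterations; ++i) {
            void *p = nullptr;
            auto t0 = std::chrono::high_resolution_clock::now();
            if (cudaMalloc(&p, bytes) != cudaSuccess) {
                fprintf(stderr, "cudaMalloc(%zu) failed\n", bytes);
                return 1;
            }
            auto t1 = std::chrono::high_resolution_clock::now();
            cudaFree(p);  // free outside the timed region
            long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            total += us;
            if (min_us < 0 || us < min_us) min_us = us;
            if (us > max_us) max_us = us;
        }
        printf("%zu bytes; average: %lldus; min: %lldus; max: %lldus\n",
               bytes, total / iterations, min_us, max_us);
    }
    return 0;
}

Compiled with nvcc -std=c++11 alloc_bench.cu, this prints one line per size in the same format as the output quoted above.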