Abysmal performance with Unified Memory and CUBLAS

I thought I’d try using Unified Memory with CUBLAS, but I’m seeing extremely poor performance with it, as the code below illustrates.

The code just multiplies 4096x4096 matrices R=100 times. With Unified Memory this takes 56s on a GTX980SC, while with explicit memory management it takes 5s. Moving the copies out of the loop would reduce the explicit time further, to 3s.

Am I doing something wrong here, or do I just have unreasonable expectations for UM?

To compile the code on Linux, run

nvcc -O3 -arch=sm_35 cublas_um.cpp -lcublas

or, for explicit memory management,

nvcc -O3 -arch=sm_35 -DEXPLICIT cublas_um.cpp -lcublas
#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define N 4096
#define R 100
#define S (N*N*sizeof(float))
#define CHECK(x) do { if((x) != 0) { printf("%s %d\n", __FILE__, __LINE__); exit(1);} } while(false)

void cp(const float* src, float* dest) { 
    CHECK(cudaMemcpy(dest, src, S, cudaMemcpyDefault));
}

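/* Allocate a host-accessible buffer: pinned host memory for the explicit
   path, managed (unified) memory otherwise; zero it in both cases. */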
void alloc_host(float** p) {
#ifdef EXPLICIT
    CHECK(cudaMallocHost(p, S));
#else
    CHECK(cudaMallocManaged(p, S));
#endif
    CHECK(cudaMemset(*p, 0, S));
}

void alloc_device(float** p) {
    CHECK(cudaMalloc(p, S));
    CHECK(cudaMemset(*p, 0, S));
}

int main()
{
    cublasHandle_t handle;
    float *a, *b, *c, *ad, *bd, *cd, alpha=1, beta=0;

    CHECK(cublasCreate(&handle));

    alloc_host(&a); alloc_device(&ad);
    alloc_host(&b); alloc_device(&bd);
    alloc_host(&c); alloc_device(&cd);

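    /* R back-to-back SGEMMs: the explicit path stages A and B into device
       buffers and copies C back on every iteration, while the UM path hands
       the managed pointers directly to cuBLAS. */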
    for(int i=0; i<R; ++i) {
#ifdef EXPLICIT
        cp(a, ad);
        cp(b, bd);
        CHECK(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                          &alpha, ad, N, bd, N, &beta, cd, N));
        cp(cd, c);
#else
        CHECK(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                          &alpha, a, N, b, N, &beta, c, N));
#endif
    }
    CHECK(cudaDeviceSynchronize());
}

Note up front: I have zero hands-on experience with unified memory or Maxwell, so the following is purely speculative. It would be great to get an authoritative reply from someone at NVIDIA.

Using the CUDA profiler, can you determine the bandwidth requirements for SGEMM when running on your GPU? They may differ somewhat based on transpose mode, so a thorough experiment should probably look at all four modes (NN, NT, TN, TT).

From the specifications published by NVIDIA, a stock GTX 980 at base clocks provides 4612 GFLOPS SP and 224 GB/sec memory bandwidth:

[url]http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980/specifications[/url]

By my back-of-the-envelope calculation, running SGEMM on large matrices requires about 110 GB/sec of memory bandwidth. However, I am not at all certain about this, as I know neither the SGEMM efficiency of the GTX 980 nor the blocking factor used by the Maxwell SGEMM code. It would be highly desirable to use measured data. Obviously a GTX 980 SC will have slightly different specifications.
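One way to arrive at a number in that ballpark is sketched below; note that the 128x128 blocking factor and the 75% SGEMM efficiency are assumptions on my part, not measured values.

/* Rough back-of-the-envelope estimate of the DRAM bandwidth a blocked SGEMM
   needs to stay compute bound. For a b x b output tile, the kernel performs
   2*b*b*N FLOPs while loading 2*b*N floats (a row strip of A and a column
   strip of B), i.e. 8*b*N bytes, so the arithmetic intensity is b/4 FLOP/byte.
   The tile size and efficiency below are assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    const double peak_gflops = 4612.0;  /* stock GTX 980 at base clocks */
    const double efficiency  = 0.75;    /* assumed SGEMM efficiency     */
    const double tile        = 128.0;   /* assumed blocking factor      */

    double flop_per_byte = tile / 4.0;                               /* = 32 */
    double gb_per_sec    = peak_gflops * efficiency / flop_per_byte;
    printf("estimated bandwidth demand: ~%.0f GB/sec\n", gb_per_sec); /* ~108 */
    return 0;
}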

Assuming my 110 GB/sec estimate is accurate, that bandwidth requirement is easily satisfied by the GPU’s onboard memory, and SGEMM is therefore computation bound. If we further assume that the use of unified memory means the data is instead continuously supplied from the host’s memory via a 12 GB/sec PCIe gen3 link, a slowdown by about a factor of 10 (roughly 110/12 ≈ 9) seems plausible, as the code would be completely memory bound.

On the other hand, by my understanding of unified memory, that scenario (continuous streaming of host data via PCIe) should not occur in the case of considerable re-use (here 100x) of the data. So I am not sure what to expect.

I don’t think you’re doing anything wrong here. I just think managed memory still has a lot of maturing to do before it’s useful for many things. It’s clear that it’s not keeping memory on the device that it needs to reuse. For that size matrix (does not fit in L2), 128x128 blocked sgemm is compute bound when reading at device memory speeds, and memory bound when reading over the pci bus.

Do you have multiple CUDA GPUs in that box? Or is the GTX980SC the only CUDA GPU in the box?

My guess is you do have multiple CUDA GPUs. If so, as an experiment, let’s suppose that deviceQuery enumerates your GTX980SC as device 0. Then do the following:

export CUDA_VISIBLE_DEVICES="0"

and then re-run your test. (or modify the “0” above to whatever number deviceQuery enumerates your GTX980SC as)

If this makes a difference (in my case, it makes a huge difference), then you may want to read the multi-GPU section of the UM programming guide:

Programming Guide :: CUDA Toolkit Documentation

“If peer mappings are not supported between the GPUs in the system, then the managed memory pages are placed in CPU system memory (“zero-copy” memory), and all GPUs will experience PCIe bandwidth restrictions.”

I have a system with multiple GPUs. When I don’t specify the CUDA_VISIBLE_DEVICES env variable, I get extremely long timings for the UM case, but ~5 sec for the explicit case.

When I specify that environment variable, both cases come in at about 5-6 sec.

Thanks for the replies, everyone! Txbob is right, as usual. With the environment variable set, the time drops down to 3s.

I must say that I don’t really understand the whole “peer access” business. For me, it appears to be enabled when there are multiple cards attached to the same socket/NUMA node. However, when the same cards are attached to different NUMA nodes, they no longer have peer access. This seems strange, because the communication still happens via the relatively slow PCIe, and cudaMemcpy bandwidth is the same.

Peer access is not enabled when the GPUs are on separate Intel PCIE root complexes (effectively, for current Intel CPUs, separate CPU sockets). This is due to the fact that PCIE peer-to-peer uses a particular protocol which is supported by the PCIE fabric (the PCIE root complex and all devices and switches attached to that root complex), but not supported on the current Intel socket-to-socket interconnect, i.e. QPI.

Simple unidirectional bandwidth measurements appear the same, because QPI has approximately the same bandwidth as PCIE.
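For reference, you can query directly which device pairs support peer mappings; a minimal sketch (error checking omitted):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev)
        for (int peer = 0; peer < count; ++peer) {
            if (dev == peer) continue;
            int can = 0;
            /* can == 1 if 'dev' is able to map memory residing on 'peer' */
            cudaDeviceCanAccessPeer(&can, dev, peer);
            printf("device %d -> device %d: peer access %s\n",
                   dev, peer, can ? "supported" : "not supported");
        }
    return 0;
}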

[Sorry for the redundancy, I forgot to F5 before replying and therefore missed txbob’s update]

GPU peer-to-peer communication is based on PCIe bus mastering. For two GPUs to communicate in this fashion requires them to reside on the same PCIe root complex. In a multi-socket system each socket has its own I/O hub and thus PCIe root complex and CPU sockets are linked by QPI. PCIe bus mastering cannot happen across QPI.

Unified memory tries to migrate data to the device where it is currently used and provides a peer-to-peer mapping to the other devices. However, if a peer-to-peer mapping is unavailable, the only place where all devices can “see” the data is host memory, so that is where it puts the data in such a case.

Perfectly logical once one thinks about it and reads the documentation carefully (which I did not). NVIDIA may want to consider highlighting the paragraph quoted by txbob: it seems easy enough to miss and the performance implications are significant when one is affected by this scenario.

But why doesn’t it copy the data from the host to the GPU where it’s used? Why does it have to be GPU-to-GPU?

In any case, between the two cards where I have Peer Access, I see the same cudaMemcpy bandwidth as everywhere else, so Peer Access does not seem very useful. Why build something (UM) that relies on it, without the ability to fall back on the usual PCIe protocol?

By the way, while the manual says that Peer Access works only with Teslas, I see it with GTX980SC.

For the memory space to be “unified”, every entity (that is, all of the host’s CPUs and all attached GPUs) has to be able to see it. If you have two GPUs on two different PCIe root complexes, that is not possible: there is no mapping that provides GPU A with access to data stored in the local memory of GPU B. In that case the only place where the data can be seen by all devices is the host’s system memory. If the CPUs and GPUs were all connected by the same cache-coherent link, this problem would not exist, so this is a limitation of current hardware that uses PCIe as the link.

GPU peer-to-peer communication pre-dates unified memory by several years. The benefit of peer-to-peer communication is that two devices can “talk” to each other directly across the PCIe link, instead of having to go through the host’s system memory. This improves latency and also avoids getting bottlenecked by the host memory bandwidth (currently about 25 GB/sec for a two-channel DIMM configuration and 50 GB/sec for a four-channel DIMM configuration; I read that the latest DDR4-based Haswell platforms achieve slightly higher throughput).

In a machine with a multi-GPU configuration, peer-to-peer communication leads to improved scaling as the number of GPUs is increased. Dual-GPU solutions like the Tesla K10 sport their own internal PCIe gen 3 bridge, so communication between the two GPUs on the card never needs to leave the physical card.
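For reference, the basic API sequence for a direct peer-to-peer copy looks roughly like this; the device numbering is assumed and error checking is omitted:

#include <cuda_runtime.h>

int main(void)
{
    float *buf0, *buf1;
    size_t bytes = 1 << 20;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        /* let device 0 map device 1's memory */
    cudaMalloc((void**)&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);        /* and vice versa                     */
    cudaMalloc((void**)&buf1, bytes);

    /* with peer access enabled, this copy travels directly over PCIe
       instead of being staged through host system memory */
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();
    return 0;
}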

I cannot speak to which GPUs currently support peer-to-peer communication, as I have worked exclusively with non-consumer GPUs, i.e. Teslas and Quadros, except for some very early CUDA work on the G80. Maybe txbob can clarify this point.

As I understand it, you can do peer-to-peer with consumer GPUs in the same system. But if you want to use RDMA and talk directly to, say, an InfiniBand card, then you need the pro cards.

You seem to be talking about UVAS, which I thought I already had. From the programming guide:

Anyways, as an application-level developer, I probably don’t need to know why Nvidia’s Unified Memory is so inefficient on NUMA systems, even though the bandwidth is there.

@alexgg: I was attempting to answer this question of yours: “But why doesn’t it copy the data from the host to the GPU where it’s used”

Unified memory is more than a unified virtual address space. It also means that any data in that address space is directly and transparently accessible from any of the connected devices. Your system does provide you with functional unified memory, but because there are two GPUs, each on a different PCIe root complex, it has to default back to a low-performance implementation where data continues to reside on the host instead of migrating to the GPU where it is used.

The documentation quoted by txbob describes this low-performance mode very clearly: “the managed memory pages are placed in CPU system memory […] and all GPUs will experience PCIe bandwidth restrictions”.

I am not sure what you mean by “why Nvidia’s Unified Memory is so inefficient on NUMA systems, even though the bandwidth is there.” I haven’t seen any evidence that the current implementation of unified memory is not making the best possible use of the available bandwidth, within the functional limitations imposed by the hardware. Obviously these restrictions may cause the current implementation of unified memory to be unattractive for many use cases, and I would not expect CUDA’s traditional copy-based programming paradigm to go away anytime soon. I anticipate the introduction of NVLink as the next step in moving towards more widespread use of unified memory with CUDA.

But why wouldn’t the memory manager recognize that there’s plenty of spare device memory available and treat it like a read-only L3 cache while that continues to be the case? Or even reserve a section of device memory for that purpose? It just seems really silly to be sending out repeated requests for the same data over the pci bus while device memory sits idle.

Only NVIDIA can answer that question. I have zero insight into the design process, but speculate that the reason is insufficient bang for the buck, as most data needs to be read and written and a software-managed cache would add complexity. The benchmarking scenario, multiple calls of GEMM on unchanged source data, is not what happens in any realistic application setting I am familiar with.

Instead of investing in additional (complex and error-prone) software magic the right step forward would seem to be to evolve the hardware. The current unified memory implementation allows CUDA programmers to familiarize themselves with the new paradigm, although applicability to shipping applications may be very limited. In some sense this is similar to double-precision support in consumer cards which is functional and thus allows for software development, but does not provide enough performance for deployment of applications heavily reliant on double-precision computation.

None of the suggestions here (“device memory L3 caching”, or “cudaMemcpy across QPI”) provide coherency for multi-device access, AFAICT (given lack of a functional spec for these proposals…).

Programming Guide :: CUDA Toolkit Documentation

Multi-device coherency was an important design goal:

Programming Guide :: CUDA Toolkit Documentation

“There are no constraints on concurrent inter-GPU access of managed memory, other than those that apply to multi-GPU access of non-managed memory.”

Having a single repository for the data and enforcing peer access from other devices provides a level of multi-device coherency; zero-copy does also. With this level of multi-device coherency, I could have one kernel on one device update a global value and have it consumed by another kernel running elsewhere, with appropriate interlocks but no explicit management of the data movement. The other methods, AFAICT, would not support such activity that depends on a level of coherence.

Host/device coherency is a separate topic and is discussed at some length in the UM programming guide.

Programming Guide :: CUDA Toolkit Documentation

If you don’t like some of the trappings that come along with these design choices, you can, to some degree, opt out of them and revert to manual control, for example by using non-UM methods, restricting UM activity to one device, or, for some use cases, via stream association methods:

Programming Guide :: CUDA Toolkit Documentation
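As a rough illustration of the stream-association route applied to a single managed buffer (a sketch only; I have not verified that it avoids the zero-copy fallback discussed earlier):

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Sketch: tie a managed allocation to a single stream so that it does not
   have to be made visible to every GPU in the system. Untested. */
int main(void)
{
    const size_t bytes = 4096UL * 4096UL * sizeof(float);
    cublasHandle_t handle;
    cudaStream_t stream;
    float *a;

    cublasCreate(&handle);
    cudaStreamCreate(&stream);

    cudaMallocManaged((void**)&a, bytes, cudaMemAttachHost);     /* host-only at first      */
    cudaStreamAttachMemAsync(stream, a, 0, cudaMemAttachSingle); /* bind to this stream     */
    cudaStreamSynchronize(stream);                               /* attachment takes effect */

    cublasSetStream(handle, stream);  /* subsequent cuBLAS calls would run in 'stream' */

    /* ... SGEMM calls as in the original benchmark would go here ... */

    cudaFree(a);
    cublasDestroy(handle);
    return 0;
}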

The UM designers had a lot of considerations to balance; UM is very much a work in progress, and I’m not able to articulate most aspects of it with any level of expertise.

I understand the need for coherency for some applications, but the chip provides a non-coherent L1 cache (texture cache) for those applications that don’t need it. In fact sgemm explicitly requests its memory through that cache (LDG.E.CI). It would make sense to use device memory to extend that non-coherent cache for these pci requests.