Unified memory has slow bandwidth over NVLink 2.0 for large data sizes

Hi,

When measuring the bandwidth of Unified Memory over NVLink 2.0 in a microbenchmark, I get unexpectedly low bandwidth. I am measuring a host->device transfer on an IBM AC922 with Tesla V100 GPUs.

My expectation is that in a host -> device transfer, the measured bandwidth should be around 63 GiB/s. Instead, I measure ~2 GiB/s for data larger than the device’s memory capacity, e.g., 32 GiB.

For smaller data sizes < 16 GiB, bandwidth improves to 33 GiB/s when setting cudaMemAdviseSetAccessedBy.

What doesn’t work:

  • Setting memAdvise options either cause extreme measurement variance between 0.5 and 37 GiB/s (cudaMemAdviseSetAccessedBy) or regress bandwidth to ~1 GiB/s (cudaMemAdviseSetReadMostly).
  • Adjusting grid and block dimensions have almost no effect, as long as they are "large enough".

The GPU kernel is very simple:

__global__ void read_kernel(int *data, uint64_t len, int *result) {
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned gstride = blockDim.x * gridDim.x;

    int counter = 0;
    for (uint64_t i = gid; i < len; i += gstride) {
        counter += data[i];
    }

    atomicAdd(result, counter);
}

A minimal, reproducible example is available here: https://github.com/LutzCle/unified-memory-debugging
In the repository’s README, I explain the problem in more detail using measurements.

I would be grateful for any help in resolving this issue.

The maximum measurable link bandwidth - ~70GB/s (not peak theoretical - 75GB/s) is achievable when doing a bulk transfer, e.g. cudaMemcpy or cudaMemPrefetchAsync.

It is generally not achievable in a purely demand-paged scenario. The limitation here is not the design of your kernel code, but the overhead associated with demand paging. This is also the reason why your “small” test case (16GB) also does not saturate link bandwidth.

Your larger-than-device-memory case is even slower because pages have to be both registered and un-registered on the GPU (eventually) by the managed memory system. This is referred to as the “eviction” scenario.

You may wish to review these resources:
https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9727-memory+management+on+modern+gpu+architectures

https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

A read-only touch-once case like this may be able to approach saturation of link bandwidth as a pinned allocation. You can approximate this behavior in managed memory by setting the advice as cudaMemAdviseSetPreferredLocation to cudaCpuDeviceId.

The following modification to your code demonstrates this:

$ cat t1.cu
/*
 * Copyright (c) 2019, German Research Center for Artificial Intelligence (DFKI)
 * Author: Clemens Lutz <clemens.lutz@dfki.de>
 *
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *     * Redistributions of source code must retain the above copyright
 *       notice, this list of conditions and the following disclaimer.
 *     * Redistributions in binary form must reproduce the above copyright
 *       notice, this list of conditions and the following disclaimer in the
 *       documentation and/or other materials provided with the distribution.
 *     * Neither the name of the <organization> nor the
 *       names of its contributors may be used to endorse or promote products
 *       derived from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
 * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

// If defined, use CUDA managed memory, otherwise allocate with malloc
#define USE_MANAGED
#define ADVISE_PREFERRED_LOCATION_CPU
// If defined, then set the memAdviseSetAccessedBy flag
// #define ADVISE_ACCESSED_BY

// If defined, then set the memAdviseSetReadMostly flag
// #define ADVISE_READ_MOSTLY

// If defined, then touch the data on the host between kernel launches to avoid
// device-side caching
// #define TOUCH_ON_HOST

// If defined, then read data on GPU, else write data on GPU
#define OP_READ

// 32 GiB of data
constexpr unsigned long long SIZE = 32 * 1024 * 1024 * (1024 / sizeof(int));

// Number of runs
constexpr unsigned RUNS = 5;

// Device
constexpr int DEVICE_ID = 0;

#ifndef USE_MANAGED
// NUMA node
constexpr int NUMA_NODE = 0;
#endif
#ifndef USE_MANAGED
#include <cstdlib>
#endif

#include <iostream>
#include <cuda_runtime.h>
#include <cstdint>
#ifndef USE_MANAGED
#include <numa.h>
#endif

#define CHECK_CUDA(ans) check_cuda((ans), __FILE__, __LINE__)
void check_cuda(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
      std::cerr
          << "Exit with code "
          << cudaGetErrorString(code)
          << " (" << code << ") "
          << "in file " << file << ":" << line
          << std::endl;
      std::exit(1);
    }
}

__global__ void read_kernel(int *data, uint64_t len, int *result) {
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned gstride = blockDim.x * gridDim.x;

    int counter = 0;
    for (uint64_t i = gid; i < len; i += gstride) {
        counter += data[i];
    }

    atomicAdd(result, counter);
}

__global__ void write_kernel(int *data, uint64_t len) {
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned gstride = blockDim.x * gridDim.x;

    for (uint64_t i = gid; i < len; i += gstride) {
        data[i] = i;
    }
}

int main() {
    // Kernel launch parameters
    int sm_count = 0;
    CHECK_CUDA(cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, DEVICE_ID));
    int warp_size = 0;
    CHECK_CUDA(cudaDeviceGetAttribute(&warp_size, cudaDevAttrWarpSize, DEVICE_ID));
    const unsigned GRID_DIM = sm_count * 2;
    const unsigned BLOCK_DIM = warp_size * 4;

    std::cout
        << "Running on device " << DEVICE_ID
        << " with grid dim " << GRID_DIM
        << " and block dim " << BLOCK_DIM
        << std::endl;

    // Set CUDA device
    CHECK_CUDA(cudaSetDevice(DEVICE_ID));

    // Set NUMA node
#ifndef USE_MANAGED
    numa_run_on_node(NUMA_NODE);
#endif
    // Allocate managed memory
    int *data = nullptr;
#ifdef USE_MANAGED
    CHECK_CUDA(cudaMallocManaged(&data, SIZE * sizeof(int)));
    std::cout << "Managed memory enabled" << std::endl;

#ifdef ADVISE_READ_MOSTLY
    CHECK_CUDA(cudaMemAdvise(data, SIZE * sizeof(int), cudaMemAdviseSetReadMostly, DEVICE_ID));
    std::cout << "cudaMemAdviseSetReadMostly enabled" << std::endl;
#endif

#ifdef ADVISE_ACCESSED_BY
    CHECK_CUDA(cudaMemAdvise(data, SIZE * sizeof(int), cudaMemAdviseSetAccessedBy, DEVICE_ID));
    std::cout << "cudaMemAdviseSetAccessedBy enabled" << std::endl;
#endif

#ifdef ADVISE_PREFERRED_LOCATION_CPU
    CHECK_CUDA(cudaMemAdvise(data, SIZE * sizeof(int), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));
    std::cout << "cudaMemAdviseSetPreferredLocation CPU enabled" << std::endl;
#endif

#else
    data = (int*) numa_alloc_onnode(SIZE * sizeof(int), NUMA_NODE);
    std::cout << "System memory enabled" << std::endl;
#endif

#ifdef TOUCH_ON_HOST
    std::cout << "Touch on host between runs enabled" << std::endl;
#endif

    // Fill data array
    for (uint64_t i = 0; i < SIZE; ++i) {
        data[i] = i;
    }

    // Allocate result
    int *result = nullptr;
    CHECK_CUDA(cudaMalloc(&result, sizeof(int)));

    // Setup events
    cudaEvent_t start_timer, end_timer;
    CHECK_CUDA(cudaEventCreate(&start_timer));
    CHECK_CUDA(cudaEventCreate(&end_timer));

#ifdef OP_READ
    std::cout << "Running read kernel" << std::endl;
#else
    std::cout << "Running write kernel" << std::endl;
#endif
        read_kernel<<<GRID_DIM, BLOCK_DIM>>>(data, SIZE, result); // warm-up

    for (unsigned run = 0; run < RUNS; ++run) {
        // Launch kernel and measure time
        CHECK_CUDA(cudaEventRecord(start_timer));
#ifdef OP_READ
        read_kernel<<<GRID_DIM, BLOCK_DIM>>>(data, SIZE, result);
#else
        write_kernel<<<GRID_DIM, BLOCK_DIM>>>(data, SIZE);
#endif
        CHECK_CUDA(cudaEventRecord(end_timer));

        // Wait for kernel completion
        CHECK_CUDA(cudaDeviceSynchronize());

        // Compute and print throughput in GiB/s
        uint64_t size_GiB = (SIZE * sizeof(int)) / 1024 / 1024 / 1024;
        float time_ms = 0.0;
        CHECK_CUDA(cudaEventElapsedTime(&time_ms, start_timer, end_timer));
        double tput = ((double)size_GiB) / time_ms * 1000.0;
        std::cout << "Throughput: " << tput << " GiB/s" << std::endl;

#ifdef TOUCH_ON_HOST
        for (uint64_t i = 0; i < SIZE; ++i) {
            data[i] = run + i;
        }
#endif
    }

    // Cleanup
    CHECK_CUDA(cudaEventDestroy(start_timer));
    CHECK_CUDA(cudaEventDestroy(end_timer));
#ifdef USE_MANAGED
    CHECK_CUDA(cudaFree(data));
#else
    numa_free(data, SIZE * sizeof(int));
#endif
    CHECK_CUDA(cudaFree(result));
}
$ nvcc -arch=sm_70 t1.cu -o t1 -std=c++11
$ ./t1
Running on device 0 with grid dim 160 and block dim 128
Managed memory enabled
cudaMemAdviseSetPreferredLocation CPU enabled
Running read kernel
Throughput: 38.6654 GiB/s
Throughput: 38.6786 GiB/s
Throughput: 38.6936 GiB/s
Throughput: 38.6895 GiB/s
Throughput: 38.6908 GiB/s
$

You may wish to study the code for the changes. For example I have also introduced a “warm-up” run. This establishes page-mappings for the GPU to the managed allocation, which improves performance for subsequent runs.

Here we are running on a variant of AC922 that has 6 V100 GPUs instead of 4 (or fewer) as in your node. This means that the various NVLink connections are dual-link connections rather than triple-link connections as they are in your system. This means that rather than having a peak theoretical bandwidth (per direction) of 75GB/s, it is 50GB/s in this system. Therefore this modification allows your code to achieve 38.7/50 = 77% of peak theoretical link bandwidth, on this particular setup.

This is still not the most efficient way to transfer data. As I said in the beginning, that would be a bulk transfer. But this method comes close (>80%), for this read-only touch-once pattern. As a point of reference, here are the bandwidthTest results on my system:

$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla V100-SXM2-16GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     39933.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     33362.4

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     731162.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$

So we see that the 38.7 number reported by your code is quite close to the 39.9 number reported by bandwidthTest.

Note that the number reported by bandwidthTest may be further improved (thus increasing the gap between the code performance and peak measureable) with careful tuning of parameters to bandwidthTest. The above test is for a transfer size of 32MB, much smaller than the 32GB involved in the test code. Here’s another example, probably more representative of “peak achievable”:

$ numactl -c 0 -m 0 ./bandwidthTest -mode=range -start=1073741824 -end=1073741824 -increment=1
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla V100-SXM2-16GB
 Range Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   1073741824                   45924.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   1073741824                   45951.5

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   1073741824                   767839.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$

I can get even better performance by replacing the warm-up kernel above with a cudaMemPrefetchAsync

This appears to have 2 benefits:

  1. cudaMemPrefetchAsync appears to be a quicker method to establish the page mappings
  2. Even though the preferred location is on the CPU, because I have prefetched the data, some of it is resident on the GPU. Therefore the kernel runs faster.
$ cat t1.cu
/*
 * Copyright (c) 2019, German Research Center for Artificial Intelligence (DFKI)
 * Author: Clemens Lutz <clemens.lutz@dfki.de>
 *
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *     * Redistributions of source code must retain the above copyright
 *       notice, this list of conditions and the following disclaimer.
 *     * Redistributions in binary form must reproduce the above copyright
 *       notice, this list of conditions and the following disclaimer in the
 *       documentation and/or other materials provided with the distribution.
 *     * Neither the name of the <organization> nor the
 *       names of its contributors may be used to endorse or promote products
 *       derived from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
 * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

// If defined, use CUDA managed memory, otherwise allocate with malloc
#define USE_MANAGED
#define ADVISE_PREFERRED_LOCATION_CPU
// If defined, then set the memAdviseSetAccessedBy flag
// #define ADVISE_ACCESSED_BY

// If defined, then set the memAdviseSetReadMostly flag
// #define ADVISE_READ_MOSTLY

// If defined, then touch the data on the host between kernel launches to avoid
// device-side caching
// #define TOUCH_ON_HOST

// If defined, then read data on GPU, else write data on GPU
#define OP_READ

// 32 GiB of data
constexpr unsigned long long SIZE = 32 * 1024 * 1024 * (1024 / sizeof(int));

// Number of runs
constexpr unsigned RUNS = 5;

// Device
constexpr int DEVICE_ID = 0;

#ifndef USE_MANAGED
// NUMA node
constexpr int NUMA_NODE = 0;
#endif
#ifndef USE_MANAGED
#include <cstdlib>
#endif

#include <iostream>
#include <cuda_runtime.h>
#include <cstdint>
#ifndef USE_MANAGED
#include <numa.h>
#endif

#define CHECK_CUDA(ans) check_cuda((ans), __FILE__, __LINE__)
void check_cuda(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
      std::cerr
          << "Exit with code "
          << cudaGetErrorString(code)
          << " (" << code << ") "
          << "in file " << file << ":" << line
          << std::endl;
      std::exit(1);
    }
}

__global__ void read_kernel(int *data, uint64_t len, int *result) {
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned gstride = blockDim.x * gridDim.x;

    int counter = 0;
    for (uint64_t i = gid; i < len; i += gstride) {
        counter += data[i];
    }

    atomicAdd(result, counter);
}

__global__ void write_kernel(int *data, uint64_t len) {
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned gstride = blockDim.x * gridDim.x;

    for (uint64_t i = gid; i < len; i += gstride) {
        data[i] = i;
    }
}

int main() {
    // Kernel launch parameters
    int sm_count = 0;
    CHECK_CUDA(cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, DEVICE_ID));
    int warp_size = 0;
    CHECK_CUDA(cudaDeviceGetAttribute(&warp_size, cudaDevAttrWarpSize, DEVICE_ID));
    const unsigned GRID_DIM = sm_count * 2;
    const unsigned BLOCK_DIM = warp_size * 4;

    std::cout
        << "Running on device " << DEVICE_ID
        << " with grid dim " << GRID_DIM
        << " and block dim " << BLOCK_DIM
        << std::endl;

    // Set CUDA device
    CHECK_CUDA(cudaSetDevice(DEVICE_ID));

    // Set NUMA node
#ifndef USE_MANAGED
    numa_run_on_node(NUMA_NODE);
#endif
    // Allocate managed memory
    int *data = nullptr;
#ifdef USE_MANAGED
    CHECK_CUDA(cudaMallocManaged(&data, SIZE * sizeof(int)));
    std::cout << "Managed memory enabled" << std::endl;

#ifdef ADVISE_READ_MOSTLY
    CHECK_CUDA(cudaMemAdvise(data, SIZE * sizeof(int), cudaMemAdviseSetReadMostly, DEVICE_ID));
    std::cout << "cudaMemAdviseSetReadMostly enabled" << std::endl;
#endif

#ifdef ADVISE_ACCESSED_BY
    CHECK_CUDA(cudaMemAdvise(data, SIZE * sizeof(int), cudaMemAdviseSetAccessedBy, DEVICE_ID));
    std::cout << "cudaMemAdviseSetAccessedBy enabled" << std::endl;
#endif

#ifdef ADVISE_PREFERRED_LOCATION_CPU
    CHECK_CUDA(cudaMemAdvise(data, SIZE * sizeof(int), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));
    std::cout << "cudaMemAdviseSetPreferredLocation CPU enabled" << std::endl;
#endif


#else
    data = (int*) numa_alloc_onnode(SIZE * sizeof(int), NUMA_NODE);
    std::cout << "System memory enabled" << std::endl;
#endif

#ifdef TOUCH_ON_HOST
    std::cout << "Touch on host between runs enabled" << std::endl;
#endif

    // Fill data array
    for (uint64_t i = 0; i < SIZE; ++i) {
        data[i] = i;
    }

    // Allocate result
    int *result = nullptr;
    CHECK_CUDA(cudaMalloc(&result, sizeof(int)));

    // Setup events
    cudaEvent_t start_timer, end_timer;
    CHECK_CUDA(cudaEventCreate(&start_timer));
    CHECK_CUDA(cudaEventCreate(&end_timer));

#ifdef OP_READ
    std::cout << "Running read kernel" << std::endl;
#else
    std::cout << "Running write kernel" << std::endl;
#endif
//        read_kernel<<<GRID_DIM, BLOCK_DIM>>>(data, SIZE, result); // warm-up
    CHECK_CUDA(cudaMemPrefetchAsync(data, SIZE*sizeof(int), 0));
    CHECK_CUDA(cudaDeviceSynchronize());
    for (unsigned run = 0; run < RUNS; ++run) {
        // Launch kernel and measure time
        CHECK_CUDA(cudaEventRecord(start_timer));
#ifdef OP_READ
        read_kernel<<<GRID_DIM, BLOCK_DIM>>>(data, SIZE, result);
#else
        write_kernel<<<GRID_DIM, BLOCK_DIM>>>(data, SIZE);
#endif
        CHECK_CUDA(cudaEventRecord(end_timer));

        // Wait for kernel completion
        CHECK_CUDA(cudaDeviceSynchronize());

        // Compute and print throughput in GiB/s
        uint64_t size_GiB = (SIZE * sizeof(int)) / 1024 / 1024 / 1024;
        float time_ms = 0.0;
        CHECK_CUDA(cudaEventElapsedTime(&time_ms, start_timer, end_timer));
        double tput = ((double)size_GiB) / time_ms * 1000.0;
        std::cout << "Throughput: " << tput << " GiB/s" << std::endl;

#ifdef TOUCH_ON_HOST
        for (uint64_t i = 0; i < SIZE; ++i) {
            data[i] = run + i;
        }
#endif
    }

    // Cleanup
    CHECK_CUDA(cudaEventDestroy(start_timer));
    CHECK_CUDA(cudaEventDestroy(end_timer));
#ifdef USE_MANAGED
    CHECK_CUDA(cudaFree(data));
#else
    numa_free(data, SIZE * sizeof(int));
#endif
    CHECK_CUDA(cudaFree(result));
}
$ nvcc -arch=sm_70 t1.cu -o t1 -std=c++11
$ ./t1
Running on device 0 with grid dim 160 and block dim 128
Managed memory enabled
cudaMemAdviseSetPreferredLocation CPU enabled
Running read kernel
Throughput: 63.6912 GiB/s
Throughput: 63.7076 GiB/s
Throughput: 63.7231 GiB/s
Throughput: 63.7404 GiB/s
Throughput: 63.7362 GiB/s
$

Hi Robert,

Thank you for your quick and informative response! I’m still trying to form an understanding of what I’m measuring. To this end, I’d like to expand on my previous post, as there are still some gaps in my understanding. For comparison to what we’ve discussed so far, I’ve added measurements on x86-64, and also using prefetching on POWER9, which I’ll go into below.

First off, thanks for pointing me in the direction of cudaMemAdviseSetPreferredLocation! I wasn’t aware that it would improve performance like this. The updated measurements on GitHub show exactly what you suggest. In fact, by setting both cudaMemAdviseSetPreferredLocation and cudaMemAdviseSetAccessedBy, a warm-up pass becomes unnecessary. Am I correct to assume that this is because the pages are pre-mapped before they’re accessed?

Sorry for not mentioning this before, but I’m aware of the demand paging mechanism that is explained on the DevBlog, and in research such as Zheng et al. “Towards High Performance Paged Memory for GPUs”: https://ieeexplore.ieee.org/abstract/document/7446077. Even so, I’m still unclear what effects I’m measuring.

It intuitively makes sense to me that demand paging in software is slower than the NVLink’s address translation service. It seems that cudaMemAdviseSetPreferredLocation basically uses ATS, which would explain why the measurement results are the same.

Compared to x86-64, POWER9 is actually slower when using “vanilla” Unified Memory without additional flags. This is counter-intuitive to me, given the theoretical bandwidth delta between NVLink and PCI-e. Perhaps it could be explained by more effort in optimizing the x86-64 driver code specifically for demand paging than on POWER9?

Strangely enough, prefetching doesn’t seem to improve bandwidth on POWER9, even though it reaches the full bandwidth on x86-64. To double-check my code, I also tried using the literal code from the DevBlog post that you referenced (see the “devblog_prefetch” branch), but that yields the same results. Setting cudaMemAdviseSetAccessedBy doesn’t change these results. Am I missing something?

Thank you again for your help.

I’m not aware of anything else that could explain it.

I’m not able to follow your logic. It doesn’t seem that way to me. The SetPreferredLocation example I gave clearly shows a difference in behavior when I use a warm up kernel, or no warm up kernel, or a prefetch. That shouldn’t be true if the underlying mechanism was ATS. I believe the reason the measurement results are the same is because the CPU-GPU link is the dominant limiter in both scenarios.

For the remaining 2 paragraphs, my suggestion would be that you provide complete test cases here in this thread (follow the example I gave of pasting a complete standalone code, followed by a compilation, and a test run - you need to do this anyway, to actually run the test, right?), and point out the specific things you are comparing.