The maximum measurable link bandwidth, roughly 70GB/s (as opposed to the peak theoretical 75GB/s), is achievable when doing a bulk transfer, e.g. with cudaMemcpy or cudaMemPrefetchAsync.
It is generally not achievable in a purely demand-paged scenario. The limitation here is not the design of your kernel code, but the overhead associated with demand paging. That is also why your “small” test case (16GB) does not saturate link bandwidth.
Your larger-than-device-memory case is even slower because pages eventually have to be both registered and unregistered on the GPU by the managed memory system. This is referred to as the “eviction” scenario.
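For reference, a bulk transfer of a managed allocation can be requested explicitly with cudaMemPrefetchAsync, roughly as in this minimal sketch (the buffer name and size are illustrative, not taken from your code, and error checking is omitted):

int dev = 0;
int *buf = nullptr;
size_t bytes = (size_t)1 << 30;         // 1 GiB, for illustration only
cudaMallocManaged(&buf, bytes);
// ... initialize buf on the host ...
cudaMemPrefetchAsync(buf, bytes, dev);  // bulk transfer of the pages to the GPU
cudaDeviceSynchronize();
// a kernel launched now finds the pages already resident on the GPU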
You may wish to review these resources:
https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9727-memory+management+on+modern+gpu+architectures
https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/
A read-only, touch-once case like this may be able to approach saturation of link bandwidth when the data is in a pinned allocation. You can approximate that behavior with managed memory by applying cudaMemAdviseSetPreferredLocation advice with cudaCpuDeviceId.
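In isolation, the relevant change amounts to one extra call right after the managed allocation (a minimal sketch; error checking omitted, and the size shown simply matches the 32 GiB used below):

int *data = nullptr;
size_t bytes = (size_t)32 << 30;  // 32 GiB, matching the test below
cudaMallocManaged(&data, bytes);
// Keep the pages resident in host memory; the GPU then reads them over the
// link (similar to pinned memory) instead of migrating them on demand.
cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);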
The following complete modification to your code demonstrates this:
$ cat t1.cu
/*
* Copyright (c) 2019, German Research Center for Artificial Intelligence (DFKI)
* Author: Clemens Lutz <clemens.lutz@dfki.de>
*
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* * Neither the name of the <organization> nor the
* names of its contributors may be used to endorse or promote products
* derived from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
// If defined, use CUDA managed memory, otherwise allocate with malloc
#define USE_MANAGED
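// If defined, then set the memAdviseSetPreferredLocation flag with the CPU as the preferred location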
#define ADVISE_PREFERRED_LOCATION_CPU
// If defined, then set the memAdviseSetAccessedBy flag
// #define ADVISE_ACCESSED_BY
// If defined, then set the memAdviseSetReadMostly flag
// #define ADVISE_READ_MOSTLY
// If defined, then touch the data on the host between kernel launches to avoid
// device-side caching
// #define TOUCH_ON_HOST
// If defined, then read data on GPU, else write data on GPU
#define OP_READ
// 32 GiB of data
constexpr unsigned long long SIZE = 32 * 1024 * 1024 * (1024 / sizeof(int));
// Number of runs
constexpr unsigned RUNS = 5;
// Device
constexpr int DEVICE_ID = 0;
#ifndef USE_MANAGED
// NUMA node
constexpr int NUMA_NODE = 0;
#endif
#include <cstdlib> // needed unconditionally for std::exit in check_cuda
#include <iostream>
#include <cuda_runtime.h>
#include <cstdint>
#ifndef USE_MANAGED
#include <numa.h>
#endif
#define CHECK_CUDA(ans) check_cuda((ans), __FILE__, __LINE__)
void check_cuda(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        std::cerr
            << "Exit with code "
            << cudaGetErrorString(code)
            << " (" << code << ") "
            << "in file " << file << ":" << line
            << std::endl;
        std::exit(1);
    }
}
__global__ void read_kernel(int *data, uint64_t len, int *result) {
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned gstride = blockDim.x * gridDim.x;
    int counter = 0;
    for (uint64_t i = gid; i < len; i += gstride) {
        counter += data[i];
    }
    atomicAdd(result, counter);
}
__global__ void write_kernel(int *data, uint64_t len) {
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned gstride = blockDim.x * gridDim.x;
    for (uint64_t i = gid; i < len; i += gstride) {
        data[i] = i;
    }
}
int main() {
    // Kernel launch parameters
    int sm_count = 0;
    CHECK_CUDA(cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, DEVICE_ID));
    int warp_size = 0;
    CHECK_CUDA(cudaDeviceGetAttribute(&warp_size, cudaDevAttrWarpSize, DEVICE_ID));
    const unsigned GRID_DIM = sm_count * 2;
    const unsigned BLOCK_DIM = warp_size * 4;
    std::cout
        << "Running on device " << DEVICE_ID
        << " with grid dim " << GRID_DIM
        << " and block dim " << BLOCK_DIM
        << std::endl;
    // Set CUDA device
    CHECK_CUDA(cudaSetDevice(DEVICE_ID));
    // Set NUMA node
#ifndef USE_MANAGED
    numa_run_on_node(NUMA_NODE);
#endif
    // Allocate managed memory
    int *data = nullptr;
#ifdef USE_MANAGED
    CHECK_CUDA(cudaMallocManaged(&data, SIZE * sizeof(int)));
    std::cout << "Managed memory enabled" << std::endl;
#ifdef ADVISE_READ_MOSTLY
    CHECK_CUDA(cudaMemAdvise(data, SIZE * sizeof(int), cudaMemAdviseSetReadMostly, DEVICE_ID));
    std::cout << "cudaMemAdviseSetReadMostly enabled" << std::endl;
#endif
#ifdef ADVISE_ACCESSED_BY
    CHECK_CUDA(cudaMemAdvise(data, SIZE * sizeof(int), cudaMemAdviseSetAccessedBy, DEVICE_ID));
    std::cout << "cudaMemAdviseSetAccessedBy enabled" << std::endl;
#endif
#ifdef ADVISE_PREFERRED_LOCATION_CPU
    CHECK_CUDA(cudaMemAdvise(data, SIZE * sizeof(int), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));
    std::cout << "cudaMemAdviseSetPreferredLocation CPU enabled" << std::endl;
#endif
#else
    data = (int*) numa_alloc_onnode(SIZE * sizeof(int), NUMA_NODE);
    std::cout << "System memory enabled" << std::endl;
#endif
#ifdef TOUCH_ON_HOST
    std::cout << "Touch on host between runs enabled" << std::endl;
#endif
    // Fill data array
    for (uint64_t i = 0; i < SIZE; ++i) {
        data[i] = i;
    }
    // Allocate result
    int *result = nullptr;
    CHECK_CUDA(cudaMalloc(&result, sizeof(int)));
    // Setup events
    cudaEvent_t start_timer, end_timer;
    CHECK_CUDA(cudaEventCreate(&start_timer));
    CHECK_CUDA(cudaEventCreate(&end_timer));
#ifdef OP_READ
    std::cout << "Running read kernel" << std::endl;
#else
    std::cout << "Running write kernel" << std::endl;
#endif
    read_kernel<<<GRID_DIM, BLOCK_DIM>>>(data, SIZE, result); // warm-up
    for (unsigned run = 0; run < RUNS; ++run) {
        // Launch kernel and measure time
        CHECK_CUDA(cudaEventRecord(start_timer));
#ifdef OP_READ
        read_kernel<<<GRID_DIM, BLOCK_DIM>>>(data, SIZE, result);
#else
        write_kernel<<<GRID_DIM, BLOCK_DIM>>>(data, SIZE);
#endif
        CHECK_CUDA(cudaEventRecord(end_timer));
        // Wait for kernel completion
        CHECK_CUDA(cudaDeviceSynchronize());
        // Compute and print throughput in GiB/s
        uint64_t size_GiB = (SIZE * sizeof(int)) / 1024 / 1024 / 1024;
        float time_ms = 0.0;
        CHECK_CUDA(cudaEventElapsedTime(&time_ms, start_timer, end_timer));
        double tput = ((double)size_GiB) / time_ms * 1000.0;
        std::cout << "Throughput: " << tput << " GiB/s" << std::endl;
#ifdef TOUCH_ON_HOST
        for (uint64_t i = 0; i < SIZE; ++i) {
            data[i] = run + i;
        }
#endif
    }
    // Cleanup
    CHECK_CUDA(cudaEventDestroy(start_timer));
    CHECK_CUDA(cudaEventDestroy(end_timer));
#ifdef USE_MANAGED
    CHECK_CUDA(cudaFree(data));
#else
    numa_free(data, SIZE * sizeof(int));
#endif
    CHECK_CUDA(cudaFree(result));
}
$ nvcc -arch=sm_70 t1.cu -o t1 -std=c++11
$ ./t1
Running on device 0 with grid dim 160 and block dim 128
Managed memory enabled
cudaMemAdviseSetPreferredLocation CPU enabled
Running read kernel
Throughput: 38.6654 GiB/s
Throughput: 38.6786 GiB/s
Throughput: 38.6936 GiB/s
Throughput: 38.6895 GiB/s
Throughput: 38.6908 GiB/s
$
You may wish to study the code for the changes. For example, I have also introduced a “warm-up” run. This establishes the GPU’s page mappings for the managed allocation, which improves performance on subsequent runs.
Here we are running on a variant of the AC922 that has 6 V100 GPUs instead of the 4 (or fewer) in your node. As a result, the CPU-GPU NVLink connections here are dual-link rather than the triple-link connections in your system, so the peak theoretical bandwidth (per direction) is 50GB/s rather than 75GB/s. This modification therefore allows your code to achieve 38.7/50 = 77% of peak theoretical link bandwidth on this particular setup.
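To spell out the arithmetic (assuming the usual 25GB/s per direction per NVLink 2.0 link, which is consistent with the 50GB/s and 75GB/s figures above):

2 links x 25GB/s = 50GB/s per direction (6-GPU AC922, this system)
3 links x 25GB/s = 75GB/s per direction (4-GPU AC922, your system)
38.7 / 50 = ~0.77, i.e. about 77% of peak theoretical on this system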
This is still not the most efficient way to transfer data. As I said at the beginning, that would be a bulk transfer. But this method comes close (>80%) for this read-only, touch-once pattern. As a point of reference, here are the bandwidthTest results on my system:
$ ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla V100-SXM2-16GB
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 39933.2
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 33362.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 731162.2
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
So we see that the 38.7GiB/s figure reported by your code is quite close to the ~39.9GB/s (39933.2MB/s) figure reported by bandwidthTest.
Note that the number reported by bandwidthTest may be further improved (thus increasing the gap between the code’s performance and peak measurable bandwidth) with careful tuning of bandwidthTest’s parameters. The above test is for a transfer size of 32MB, much smaller than the 32GB involved in the test code. Here’s another example, probably more representative of “peak achievable”:
$ numactl -c 0 -m 0 ./bandwidthTest -mode=range -start=1073741824 -end=1073741824 -increment=1
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla V100-SXM2-16GB
Range Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
1073741824 45924.7
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
1073741824 45951.5
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
1073741824 767839.1
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$