I am using unified memory to store a big array that fully occupies GPU memory, and I want the array to be stored in GPU memory only (just for experiments; later on I need to store the array across two GPUs).
First I get the available GPU memory via cudaMemGetInfo(), then I use cudaMallocManaged() to allocate unified memory for the array.
Then I initialize the array by launching a kernel with 640 blocks of 1024 threads each.
The problem is:
- The first several executions of the application behave the same. After a few launches, according to nvprof and nvvp, there are lots of GPU page faults, and the page faults that occur in the later part of the execution are due to oversubscription (in nvvp it says the page migrations are due to eviction).
- There are lots of GPU page faults due to writes from the GPU. Before the kernel launch I tried giving advice (cudaMemAdvise) and prefetching (even though there is no data to fetch yet, I was desperate), but neither changed the behavior; a sketch of what I tried is below.
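Roughly, the hints I tried before the launch looked like this (a sketch, not the exact code; the advice flags varied between attempts):

// hints tried before launching the kernel (flags varied between runs)
cudaMemAdvise(d_q, q_size * sizeof(inttype), cudaMemAdviseSetPreferredLocation, 0);
cudaMemAdvise(d_q, q_size * sizeof(inttype), cudaMemAdviseSetAccessedBy, 0);
cudaMemPrefetchAsync(d_q, q_size * sizeof(inttype), 0);  // device 0, default stream
init_queue<<<640, 1024>>>(d_q, q_size);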
OS: Linux Ubuntu 16.04 64-bit
GPU: GTX 1080
Compiler flag: -arch=sm_61 -o
CUDA & toolkit version: 9.1
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <cuda_runtime_api.h>

#define inttype uint32_t

static const inttype EMPTYVECT32 = 0x7FFFFFFF;

// grid-stride loop: each thread fills elements spaced nthreads apart
__global__ void init_queue(inttype *d_q, inttype n_elem) {
    inttype nthreads = blockDim.x * gridDim.x;
    inttype i = (blockIdx.x * blockDim.x) + threadIdx.x;
    for (; i < n_elem; i += nthreads) {
        d_q[i] = (inttype) EMPTYVECT32;
    }
}
int main(int argc, char const *argv[]) {
    inttype *d_q;
    size_t q_size;
    size_t available, total;

    cudaMemGetInfo(&available, &total);
    printf("available: %lu\n", available);

    // size the array to (nominally) fill the reported free device memory
    q_size = available / sizeof(inttype);
    while(cudaMallocManaged((void**)&d_q, q_size * sizeof(inttype)) == cudaErrorMemoryAllocation);
    printf("q_size: %lu = %f GB\n", q_size, (float) q_size / 1024.0 / 1024.0 / 1024.0 * sizeof(inttype));

    init_queue<<<640, 1024>>>(d_q, q_size);
    cudaDeviceSynchronize();

    cudaFree(d_q);
    return 0;
}
This is very strange:
while(cudaMallocManaged((void**)&d_q, q_size * sizeof(inttype)) == cudaErrorMemoryAllocation);
It is not guaranteed that all of the memory listed as “available” by cudaMemGetInfo can actually be allocated in a single allocation. Therefore there is no guarantee that your allocation will be entirely device-resident.
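If you want to get close to the limit, one more robust pattern (just a sketch; the headroom fraction here is an arbitrary choice) is to leave some headroom and actually check the error returned by cudaMallocManaged instead of spinning in a while loop:

// sketch: leave some headroom for runtime/context overhead and check the error
cudaMemGetInfo(&available, &total);
size_t bytes = (available / 20) * 19;   // request ~95% of the reported free memory (arbitrary)
cudaError_t err = cudaMallocManaged((void**)&d_q, bytes);
if (err != cudaSuccess) {
    printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
    return 1;
}
q_size = bytes / sizeof(inttype);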
I’ve changed the strange “while” into a normal call:
cudaMallocManaged((void**)&d_q, q_size * sizeof(inttype));
And the results are the same. Even though the allocation may not be entirely device-resident, according to the profiling results from nvvp the page faults (due to writes) almost fill up the “GPU Page Faults” row during the kernel execution. It seems that most of the pages are not device-resident. Is there a way to fix this?
Apart from the page faults due to writes, I still cannot figure out why there are page faults due to oversubscription. Those page faults start to happen in the later half of the kernel execution. I thought that limiting the array size to the available memory would avoid oversubscription, but apparently that is not the case. Why is this happening?
Use cudaMemPrefetchAsync.
The cudaMemGetInfo call returns an “available” number that may be higher than what can actually be allocated on the device.
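For reference, cudaMemPrefetchAsync takes the managed pointer itself (not its address), a byte count, and a destination device, roughly like this (sketch):

// prefetch the whole managed allocation to device 0 before launching the kernel
cudaMemPrefetchAsync(d_q, q_size * sizeof(inttype), 0);
init_queue<<<640, 1024>>>(d_q, q_size);
cudaDeviceSynchronize();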
Thanks, I will decrease the size of the array. However, cudaMemPrefetchAsync() makes no difference. The number of page faults is still the same.
cudaMemPrefetchAsync certainly seems to make a difference according to my testing:
$ cat t69.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define inttype uint32_t

static const inttype EMPTYVECT32 = 0x7FFFFFFF;

__global__ void init_queue(inttype *d_q, inttype n_elem) {
    inttype nthreads = blockDim.x * gridDim.x;
    inttype i = (blockIdx.x * blockDim.x) + threadIdx.x;
    for (; i < n_elem; i += nthreads) {
        d_q[i] = (inttype) EMPTYVECT32;
    }
}

int main(int argc, char const *argv[]) {
    inttype *d_q;
    size_t q_size;
    size_t available, total;
    cudaMemGetInfo(&available, &total);
    printf("available: %lu\n", available);
    q_size = available / (4*sizeof(inttype));
    cudaMallocManaged((void**)&d_q, q_size * sizeof(inttype));
#ifdef USE_PREFETCH
    cudaMemPrefetchAsync(d_q, q_size*sizeof(inttype), 0);
#endif
    printf("q_size: %lu = %f GB\n", q_size, (float) q_size / 1024.0 / 1024.0 / 1024.0 * sizeof(inttype));
    init_queue<<<640, 1024>>>(d_q, q_size);
    cudaDeviceSynchronize();
    cudaFree(d_q);
    return 0;
}
$ nvcc -arch=sm_60 -o t69 t69.cu
$ CUDA_VISIBLE_DEVICES="0" nvprof ./t69
==11329== NVPROF is profiling process 11329, command: ./t69
available: 16726163456
q_size: 1045385216 = 3.894363 GB
==11329== Profiling application: ./t69
==11329== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 590.98ms 1 590.98ms 590.98ms 590.98ms init_queue(unsigned int*, unsigned int)
API calls: 50.82% 590.99ms 1 590.99ms 590.99ms 590.99ms cudaDeviceSynchronize
30.78% 357.96ms 1 357.96ms 357.96ms 357.96ms cudaMemGetInfo
16.43% 191.05ms 1 191.05ms 191.05ms 191.05ms cudaFree
1.79% 20.773ms 1 20.773ms 20.773ms 20.773ms cudaMallocManaged
0.10% 1.1447ms 94 12.178us 414ns 483.79us cuDeviceGetAttribute
0.06% 704.27us 1 704.27us 704.27us 704.27us cuDeviceTotalMem
0.01% 123.01us 1 123.01us 123.01us 123.01us cuDeviceGetName
0.01% 101.63us 1 101.63us 101.63us 101.63us cudaLaunch
0.00% 5.9790us 2 2.9890us 266ns 5.7130us cudaSetupArgument
0.00% 4.8860us 3 1.6280us 435ns 2.8640us cuDeviceGetCount
0.00% 2.7650us 1 2.7650us 2.7650us 2.7650us cudaConfigureCall
0.00% 2.5320us 2 1.2660us 541ns 1.9910us cuDeviceGet
==11329== Unified Memory profiling result:
Device "Tesla P100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
6022 - - - - 746.5049ms Gpu page fault groups
$ nvcc -arch=sm_60 -o t69 t69.cu -DUSE_PREFETCH
$ CUDA_VISIBLE_DEVICES="0" nvprof ./t69
==11390== NVPROF is profiling process 11390, command: ./t69
available: 16726163456
q_size: 1045385216 = 3.894363 GB
==11390== Profiling application: ./t69
==11390== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 9.3382ms 1 9.3382ms 9.3382ms 9.3382ms init_queue(unsigned int*, unsigned int)
API calls: 53.03% 402.06ms 1 402.06ms 402.06ms 402.06ms cudaMemGetInfo
22.85% 173.22ms 1 173.22ms 173.22ms 173.22ms cudaFree
19.89% 150.80ms 1 150.80ms 150.80ms 150.80ms cudaMemPrefetchAsync
2.70% 20.435ms 1 20.435ms 20.435ms 20.435ms cudaMallocManaged
1.23% 9.3560ms 1 9.3560ms 9.3560ms 9.3560ms cudaDeviceSynchronize
0.15% 1.1204ms 94 11.919us 400ns 461.95us cuDeviceGetAttribute
0.09% 706.76us 1 706.76us 706.76us 706.76us cuDeviceTotalMem
0.04% 338.89us 1 338.89us 338.89us 338.89us cudaLaunch
0.02% 135.69us 1 135.69us 135.69us 135.69us cuDeviceGetName
0.00% 13.094us 2 6.5470us 366ns 12.728us cudaSetupArgument
0.00% 6.5940us 3 2.1980us 564ns 3.2730us cuDeviceGetCount
0.00% 3.9600us 1 3.9600us 3.9600us 3.9600us cudaConfigureCall
0.00% 2.6640us 2 1.3320us 821ns 1.8430us cuDeviceGet
$
- Note that I am using CUDA_VISIBLE_DEVICES to limit the CUDA runtime footprint to a single device.
- Without -DUSE_PREFETCH, nvprof reports ~6000 GPU page fault groups. With -DUSE_PREFETCH, nvprof does not report any page faults (and kernel execution time is dramatically shorter).
Thank you very much! I noticed that I provided (void**)&d_q rather than d_q as the first parameter to cudaMemPrefetchAsync().
Now I have no page faults.
Any time you are having trouble with a CUDA code, it's good practice to do proper CUDA error checking. cudaMemPrefetchAsync returns an error code (invalid argument) in that scenario, i.e. if you pass an incorrect pointer.
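A minimal error-checking sketch along those lines (a common pattern, not tied to any particular helper library):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime_api.h>

// wrap every CUDA runtime call; print file/line and the error string on failure
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err__ = (call);                                           \
        if (err__ != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                       \
                    cudaGetErrorString(err__), __FILE__, __LINE__);           \
            exit(1);                                                          \
        }                                                                     \
    } while (0)

// with the macro, the incorrect pointer would have been caught immediately:
// CUDA_CHECK(cudaMemPrefetchAsync(d_q, q_size * sizeof(inttype), 0));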