Unified Memory in Pascal

liorka1313 · September 20, 2018, 7:02pm

I followed this tutorial: Unified Memory for CUDA Beginners | NVIDIA Technical Blog
According to the tutorial, since I have a GTX 1060 GPU, which uses the Pascal architecture, I should see some page faults.
But nvprof reports none:

nvprof --unified-memory-profiling per-process-device .\cuda_learnig.exe
==3560== NVPROF is profiling process 3560, command: .\cuda_learnig.exe
Max error: 0
==3560== Profiling application: .\cuda_learnig.exe
==3560== Warning: Found 53 invalid records in the result.
==3560== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==3560== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  83.584us         1  83.584us  83.584us  83.584us  add(int, float*, float*)
      API calls:   72.72%  205.41ms         2  102.71ms  1.1742ms  204.24ms  cudaMallocManaged
                   15.42%  43.542ms         1  43.542ms  43.542ms  43.542ms  cuDevicePrimaryCtxRelease
                   10.55%  29.813ms         1  29.813ms  29.813ms  29.813ms  cudaLaunchKernel
                    0.69%  1.9627ms         2  981.33us  699.08us  1.2636ms  cudaFree
                    0.41%  1.1502ms        44  26.140us     364ns  596.60us  cuDeviceGetAttribute
                    0.10%  287.00us         2  143.50us  120.34us  166.66us  cuModuleUnload
                    0.08%  229.74us         1  229.74us  229.74us  229.74us  cudaDeviceSynchronize
                    0.02%  51.783us         1  51.783us  51.783us  51.783us  cuDeviceTotalMem
                    0.00%  9.4820us         1  9.4820us  9.4820us  9.4820us  cuDeviceGetPCIBusId
                    0.00%  1.4580us         2     729ns     364ns  1.0940us  cuDeviceGetCount
                    0.00%  1.0950us         2     547ns     365ns     730ns  cuDeviceGet
                    0.00%     730ns         1     730ns     730ns     730ns  cuDeviceGetName

==3560== Unified Memory profiling result:
Device "GeForce GTX 1060 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    2048  4.0000KB  4.0000KB  4.0000KB  8.000000MB  17.63118ms  Host To Device
     146  84.164KB  32.000KB  1.0000MB  12.00000MB  3.055590ms  Device To Host

Can you explain why?

saulocpp · September 22, 2018, 7:41pm

Are you running the code in the article you referenced?
What exactly are you trying to achieve?

liorka1313 · September 22, 2018, 9:35pm

Yes, the same code.
I'm just trying to learn how to program in CUDA and get a better understanding of what happens. From this article I understood that my code could run faster (but it seems that the code runs fine and there are 0 page faults, in contrast to what the article says).

saulocpp · September 22, 2018, 10:50pm

I get the same result as you, also running on a Pascal.
The profiler doesn’t actually say there were 0 page faults, so their absence from the output doesn’t mean there weren’t any. My first guess would be that a different SDK and driver were used when the article was published (19 June 2017).
I also tried various nvprof options to see whether page faults are hidden by default, but I couldn’t find anything related (unified memory profiling is on by default). I would say don’t worry if this information doesn’t show up on screen. In the end you can see that data was migrated between device and host.
But someone who knows better might want to step in and provide a more solid explanation.

cudapop1 · September 24, 2018, 8:34pm

It’s because you’re running on Windows. Pascal page-faulting under Unified Memory currently only applies to Linux.

See Appendix K on “Unified Memory” in the “CUDA C Programming Guide”:

GPUs with SM architecture 6.x or higher (Pascal class or newer) provide additional Unified Memory features such as on-demand page migration and GPU memory oversubscription that are outlined throughout this document. Note that currently these features are only supported on Linux operating systems. Applications running on Windows (whether in TCC or WDDM mode) or macOS will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.
…
GPU architectures of compute capability lower than 6.x do not support fine-grained movement of the managed data to GPU on-demand. Whenever a GPU kernel is launched all managed memory generally has to be transferred to GPU memory to avoid faulting on memory access.
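
If you want to check which Unified Memory model your own setup is using, one option is to query the device attributes at runtime. The following is only a minimal sketch, not something from the article: it assumes device 0 is the GPU of interest, and the helper unifiedMemoryMode is a hypothetical name introduced here for illustration. As far as I understand, cudaDevAttrConcurrentManagedAccess is the attribute that reflects whether on-demand page migration is available, so I would expect it to be 0 on Windows even on a Pascal card.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): turns the value of
// cudaDevAttrConcurrentManagedAccess into a human-readable label.
static const char* unifiedMemoryMode(int concurrentManagedAccess) {
    return concurrentManagedAccess
        ? "on-demand page migration (GPU page faults possible)"
        : "basic model (managed data migrated at kernel launch)";
}

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    int managed = 0;
    int concurrent = 0;

    // Does the device support managed (unified) memory at all?
    cudaError_t err = cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Can the CPU and GPU access managed memory concurrently?
    err = cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Managed memory supported: %s\n", managed ? "yes" : "no");
    printf("Unified Memory mode:      %s\n", unifiedMemoryMode(concurrent));
    return 0;
}
```

Build it with nvcc and run it on the machine in question. If it reports the basic model, then seeing only the "Host To Device" and "Device To Host" lines in nvprof, with no page fault rows, is consistent with the passage quoted above.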

saulocpp · September 24, 2018, 8:44pm

Cudapop1, thanks for clarifying this. As I currently don’t have a Linux dev environment, I was unable to fully reproduce that MH article. Appreciate you stepping in.