NVPROF is causing system instability and requiring a reboot

Hello,

I created a simple CUDA C program to run on the GPU. It works great when run via the .exe; however, whenever I run it under nvprof, my system becomes unstable (the screen freezes every x seconds) and requires a reboot. Any insight would be greatly appreciated.

I have played around and found:
1.) specifying 256 threads per block causes nvprof to NEVER return
2.) specifying 32 threads per block works, but causes the system instability
3.) the array size seems to have no effect
4.) memory and GPU usage of both cards are very low while profiling

I DO get this warning when starting nvprof. Might this be the issue, and how do I correct it? I have identical cards, as shown in the info below:

==7596== NVPROF is profiling process 7596, command: .\whatup.exe
==7596== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: Programming Guide :: CUDA Toolkit Documentation

System Specs:
Windows 10 Pro x64
i7-6800K
64 GB DDR4
2 x EVGA 980Ti (SLI)

NVIDIA Drivers:
Display Driver: 390.65
Geforce Experience: 3.12.0.84
3D Vision Controller Driver: 390.41
3D Vision: 390.65
CUDA v9.1

Here is the program output:
Number of blocks: 1
Number of threads per block: 32
value 0.000000
value 2.000000
value 4.000000
value 6.000000
value 8.000000
value 10.000000
value 12.000000
value 14.000000
value 16.000000
value 18.000000
COMPLETED****** ← GPU process completes

Here is output from cudaGetDeviceProperties(…)

Device Count: 2

Device 0: GeForce GTX 980 Ti
Device 0, MaxThreadsPerBlock: 1024
Device 0, TotalGlobalMem: 6442450944
Device 0, SharedMemPerBlock: 49152
Device 0, Major: 5
Device 0, Minor: 2
Device 0, ClockRate: 1190000
Device 0, ECCEnabled: 0
Device 0, TccDriver: 0
Device 0, ComputeMode: 0

Device 1: GeForce GTX 980 Ti
Device 1, MaxThreadsPerBlock: 1024
Device 1, TotalGlobalMem: 6442450944
Device 1, SharedMemPerBlock: 49152
Device 1, Major: 5
Device 1, Minor: 2
Device 1, ClockRate: 1190000
Device 1, ECCEnabled: 0
Device 1, TccDriver: 0
Device 1, ComputeMode: 0

1.) SLI is enabled and working (at least via NVIDIA control panel, and EVGA Precision)
2.) The code runs fine on the GPU; I just cannot profile it without it killing my machine and requiring a reboot (until I reboot, the screen freezes every x seconds)
3.) I have tried a ‘clean’ install of the NVIDIA drivers

… and finally, here is the CUDA C code, which I compile via: nvcc myfile.cu -o thefinal.exe

#include <stdio.h>
#include <cuda_runtime.h>

// Grid-stride vector add: each thread handles elements index, index+stride, index+2*stride, ...
__global__
void add_gpu(float *a, float *b, float *c, int n){
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for(int i=index; i<n; i+=stride) {
        c[i] = a[i] + b[i];
    }
}

void call_GPU() {

    int SIZE = 1<<20;

    float *a, *b, *c;
    cudaMallocManaged(&a, SIZE * sizeof(float));  // wow, if this file is .c, this complains it needs a 3rd (flags) param, and if used it crashes! (The default, cudaMemAttachGlobal, exists only in the C++ API.)
    cudaMallocManaged(&b, SIZE * sizeof(float));
    cudaMallocManaged(&c, SIZE * sizeof(float));

    // Initialize the managed arrays on the host
    for(int i=0; i<SIZE; i++) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }

    int blocks = 1;
    int threads_per_block = 32;
    printf("Number of blocks: %d\n", blocks);
    printf("Number of threads per block: %d\n", threads_per_block);

    add_gpu<<<blocks, threads_per_block>>>(a, b, c, SIZE);
    cudaDeviceSynchronize();  // wait for the kernel before reading the results on the host

    for(int i=0; i<10; i++) {
        printf("value %f\n", c[i]);
    }

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
}

int main() {

    printf("Ok, running on GPU\n");
    call_GPU();
    printf("COMPLETED******\n");

    return 0;
}
  1. Turn off SLI.
  2. Run your setup with the environment variable CUDA_VISIBLE_DEVICES="0".

Thank you txbob, I will try that.

  • Can you clarify, though: are you saying that I cannot use SLI?
  • Is this ONLY while profiling? If so, how can I ever test with both cards and verify a speed increase?
  • Will I still be able to utilize both cards when running my CUDA code, to fully utilize all of my cores, without SLI?

Sorry for all of the questions. I don't know if you are offering something to ‘just try’, or if this is the official way to solve it (and I won't be able to utilize both cards, even though they are identical).

CUDA and SLI are mostly orthogonal. CUDA doesn’t use SLI for any performance or functionality. If you want to use both cards in CUDA, use standard CUDA multi-GPU programming methods.

Programming Guide :: CUDA Toolkit Documentation
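
A rough sketch of what that can look like for the vector add you posted (illustrative only; no error checking, and it assumes two visible devices): each device gets its own allocations and its own kernel launch.

#include <stdio.h>
#include <cuda_runtime.h>

// Same grid-stride vector add as in the posted code
__global__ void add_gpu(float *a, float *b, float *c, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        c[i] = a[i] + b[i];
}

int main() {
    const int SIZE = 1 << 20;
    const int half = SIZE / 2;                  // each GPU handles half of the array

    float *ha = new float[SIZE];
    float *hb = new float[SIZE];
    float *hc = new float[SIZE];
    for (int i = 0; i < SIZE; i++) { ha[i] = (float)i; hb[i] = (float)i; hc[i] = 0.0f; }

    float *da[2], *db[2], *dc[2];
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);                     // subsequent calls target this device
        cudaMalloc(&da[dev], half * sizeof(float));
        cudaMalloc(&db[dev], half * sizeof(float));
        cudaMalloc(&dc[dev], half * sizeof(float));
        cudaMemcpy(da[dev], ha + dev * half, half * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(db[dev], hb + dev * half, half * sizeof(float), cudaMemcpyHostToDevice);
        add_gpu<<<256, 256>>>(da[dev], db[dev], dc[dev], half);   // launch is asynchronous
    }

    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();                // wait for this device's kernel to finish
        cudaMemcpy(hc + dev * half, dc[dev], half * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(da[dev]); cudaFree(db[dev]); cudaFree(dc[dev]);
    }

    printf("c[0] = %f, c[%d] = %f\n", hc[0], SIZE - 1, hc[SIZE - 1]);
    delete [] ha; delete [] hb; delete [] hc;
    return 0;
}

No SLI and no peer mappings are involved here; each device only touches its own allocations and its own half of the host data.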

Thank you for clarifying, txbob. I will try again without SLI.

Disabling SLI does resolve the instability issue.

If I set no environment variable, I still get the warning, but the system does not become unstable.

If I set CUDA_VISIBLE_DEVICES=x, then it selects the right card and works.

However, this seems to be the only way to get the code to run on the proper device. Using cudaSetDevice(x) seems to have no effect?

Thanks if you can provide any further details on device selection.
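
To show what I mean, here is a minimal sketch of the kind of device-selection check I am doing (illustrative only, not my exact code):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int count = 0, current = -1;
    cudaGetDeviceCount(&count);          // how many devices does the runtime see?
    cudaSetDevice(count > 1 ? 1 : 0);    // try to select the second card, if it is visible
    cudaGetDevice(&current);             // which device is actually current?

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, current);
    printf("Visible devices: %d, current device: %d (%s)\n", count, current, prop.name);
    return 0;
}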

I see why now: cudaGetDeviceCount() was reporting only 1 card.

Setting CUDA_VISIBLE_DEVICES=0,1 fixes this

… but returns with the warning:
==1016== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: Programming Guide :: CUDA Toolkit Documentation

At least the system instability issue seems resolved (by disabling SLI).

Why does it seem like I am not able to use 2 identical cards for CUDA?

Your code as posted uses managed memory:

cudaMallocManaged(&b, SIZE * sizeof(float));
^^^^^^^^^^^^^^^^^

You might want to start by reading the entire section on managed memory in the CUDA programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

Managed memory on Windows with CUDA 9 or 9.1 will use the “pre-Pascal” managed memory regime:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

In this regime, in a multi-device scenario, where the devices are not able to be in a P2P arrangement with each other, managed memory will instantiate managed allocations in pinned, mapped host memory. This will generally appear to give low performance for a wide variety of CUDA codes.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-multi-gpu
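
As an aside, you can query the P2P capability directly. A minimal sketch, assuming two visible devices (on your configuration, the profiler warning implies both directions will report 0):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    // Query peer-to-peer accessibility in both directions between device 0 and device 1
    int p2p01 = 0, p2p10 = 0;
    cudaDeviceCanAccessPeer(&p2p01, 0, 1);   // can device 0 map device 1's memory?
    cudaDeviceCanAccessPeer(&p2p10, 1, 0);   // can device 1 map device 0's memory?
    printf("P2P 0->1: %d, P2P 1->0: %d\n", p2p01, p2p10);
    return 0;
}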

You should be able to run your code on either device, using cudaSetDevice() and without the environment variable setting. However in this setup:

  1. Managed allocations will be established in mapped, pinned host memory. (whether you are actually using 2 devices or not)
  2. The profiler will issue the warning. Note that the warning does not indicate that something is broken, that the code has thrown an error, or that any kind of profiling is impossible. What it is indicating is that unified memory profiling is not available, meaning you won't be able to see activity (transfers, allocations, faults) associated with the managed allocations. You could still observe kernel execution activity, for example.

Personally, I would not typically use managed memory this way, in such a setting. If you don't want to either, you have a few options:

  1. Use the environment variable to restrict the number of devices seen by the CUDA runtime to a single device. When the CUDA runtime sees only a single device, it does not need to worry about managed allocations being accessible from other devices, and so it can use the “more typical” allocation method of single-device usage. The allocation is usable on both host and device, and the runtime will migrate the allocation where needed, as appropriate, according to the rules of the pre-Pascal UM regime. In this case, since the runtime only “sees” the indicated GPU, cudaSetDevice indeed has no effect.

  2. Don't use managed allocations. Instead, use ordinary cudaMalloc allocations. This can be done on each of your 2 devices, if you had code that comprehended that, and you could use both devices to launch kernels without the aforementioned complications (see the sketch after this list).

  3. Switch to a hardware and software configuration that allows the GPUs in question to be in a P2P-enabled clique.
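
Here is the sketch referenced in option 2: ordinary cudaMalloc allocations plus explicit copies, in place of cudaMallocManaged. It is illustrative only and assumes the same add_gpu kernel and includes as the code posted above; the buffer names are made up.

// Option 2 sketch: plain device allocations plus explicit host<->device copies
void call_GPU_no_managed() {
    int SIZE = 1 << 20;
    size_t bytes = SIZE * sizeof(float);

    // Ordinary host buffers
    float *ha = new float[SIZE];
    float *hb = new float[SIZE];
    float *hc = new float[SIZE];
    for (int i = 0; i < SIZE; i++) { ha[i] = (float)i; hb[i] = (float)i; hc[i] = 0.0f; }

    // Ordinary device buffers; no managed memory involved
    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);

    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    add_gpu<<<1, 32>>>(da, db, dc, SIZE);
    cudaDeviceSynchronize();

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) printf("value %f\n", hc[i]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete [] ha; delete [] hb; delete [] hc;
}

Which device this runs on is then controlled by cudaSetDevice() or CUDA_VISIBLE_DEVICES, as discussed above.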

Note that your code as posted does not use cudaSetDevice(), nor does it appear to comprehend how to “test with both cards and ensure a speed increase” (your words), so my comments here are somewhat abstract when referring to using multiple devices.

… let me add, this warning is REAL.

Here are the profiling results with:
CUDA_VISIBLE_DEVICES=1

Time(%) Time Calls Avg Min Max Name
100.00% 47.489us 1 47.489us 47.489us 47.489us add_gpu(float*, float*, float*, int)

Here are the profiling results with setting both (identical) cards visible:
CUDA_VISIBLE_DEVICES=0,1

Time(%) Time Calls Avg Min Max Name
100.00% 1.3972ms 1 1.3972ms 1.3972ms 1.3972ms add_gpu(float*, float*, float*, int)

A VERY big difference.

I really want/need/must utilize both cards, but why can't I?