nvprof: Internal profiling error 4277:5 on Tesla P100, but not on GTX 1070

Hi,

I am trying to profile my application from the command line with nvprof on Linux. Running nvprof without any parameters for the general analysis works fine, and cuda-memcheck does not identify any errors.

However, nvprof (as well as nvvp and Nsight Compute) consistently fails to profile my application on the target system with a Tesla P100-PCIE-16GB as soon as I enable a more detailed analysis such as --analysis-metrics. It aborts right when the first larger kernel is launched:

== Error: Internal profiling error 4277:5

I strongly suspect this is not due to an error in my application's code, as nvprof with --analysis-metrics runs just fine on my development system with a GeForce GTX 1070.

Both systems run a vanilla Ubuntu 18.04 with CUDA installed from Ubuntu’s repositories, the package version is nvidia-cuda-toolkit/bionic,now 9.1.85-3ubuntu1.

Sadly, I do not have a minimal non-working example, as the application is quite large and I have no idea where the profiler error could stem from.
I have already tried the following, without success:

  • using only a single stream (no parallel kernels)
  • reducing the number of threads
  • not using shared memory in the kernel
  • not using any custom events (for application level speed analysis)

Can anyone provide more information about the error code?
Any insight into why profiling could fail on the Tesla but succeed on the GTX would also be very helpful.

Thanks!

While trying to track this error down a bit further, I tried to profile some of my unit tests. The tests themselves run fine (including under cuda-memcheck), but for most of them the profiler also fails, including a simple test for multi-precision subtraction. The relevant device code is listed below.

#define LIMBS 4 /* as an example */

typedef uint32_t mp_limb;
typedef mp_limb mp_t[LIMBS];

#define ASM asm __volatile__
#define __ASM_SIZE "32"
#define __ASM_CONSTRAINT "r"

#define __sub_cc(r, a, b) ASM ("sub.cc.u" __ASM_SIZE " %0, %1, %2;": "=" __ASM_CONSTRAINT (r): __ASM_CONSTRAINT (a), __ASM_CONSTRAINT (b))
#define __subc_cc(r, a, b) ASM ("subc.cc.u" __ASM_SIZE " %0, %1, %2;": "=" __ASM_CONSTRAINT (r): __ASM_CONSTRAINT (a), __ASM_CONSTRAINT (b))

#define __addcy(carry) ASM ("addc.u" __ASM_SIZE " %0, 0, 0;": "=" __ASM_CONSTRAINT (carry))
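For readers less familiar with PTX carry chains: with LIMBS = 4, the device branch of mp_sub below expands (modulo whitespace) to a sub.cc / subc.cc sequence followed by a carry read-out, roughly:

```cuda
/* macro expansion for LIMBS = 4 (sketch, written out by hand) */
asm __volatile__ ("sub.cc.u32 %0, %1, %2;"  : "=r"(r[0]) : "r"(a[0]), "r"(b[0])); /* subtract, set borrow */
asm __volatile__ ("subc.cc.u32 %0, %1, %2;" : "=r"(r[1]) : "r"(a[1]), "r"(b[1])); /* subtract with borrow */
asm __volatile__ ("subc.cc.u32 %0, %1, %2;" : "=r"(r[2]) : "r"(a[2]), "r"(b[2]));
asm __volatile__ ("subc.cc.u32 %0, %1, %2;" : "=r"(r[3]) : "r"(a[3]), "r"(b[3]));
asm __volatile__ ("addc.u32 %0, 0, 0;"      : "=r"(carry)); /* materialize the final carry */
```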

__device__
mp_limb mp_sub(mp_t r, const mp_t a, const mp_t b) {
        mp_limb carry = 0;
#ifdef __CUDA_ARCH__
        __sub_cc(r[0], a[0], b[0]);
#pragma unroll
        for (size_t i = 1; i < LIMBS; i++) {
                __subc_cc(r[i], a[i], b[i]);
        }

        __addcy(carry);
#else
#pragma unroll
        for (size_t i = 0; i < LIMBS; i++) {
                r[i] = a[i] - b[i] - carry;
                carry = r[i] > a[i];
        }
#endif
        return carry;
}

__global__
void cuda_mp_sub(mp_t r, const mp_t a, const mp_t b) {
        mp_sub(r, a, b);
        return;
}

The kernel is called with three correctly allocated device arrays, i.e.

mp_limb *dev_a, *dev_b, *dev_r;
cudaMalloc((void **) &dev_a, LIMBS * sizeof(mp_limb));
cudaMalloc((void **) &dev_b, LIMBS * sizeof(mp_limb));
cudaMalloc((void **) &dev_r, LIMBS * sizeof(mp_limb));

cuda_mp_sub<<< 1, 1 >>> (dev_r, dev_a, dev_b);

I do not see any error in this.
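For completeness, the CUDA calls can be wrapped in the usual status-check macro (a generic pattern, not specific to this code) to rule out silent allocation or launch failures:

```cuda
#include <cstdio>
#include <cstdlib>

/* generic CUDA status check -- a common pattern, shown here as a sketch */
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

/* usage with the calls above: */
CUDA_CHECK(cudaMalloc((void **) &dev_a, LIMBS * sizeof(mp_limb)));
cuda_mp_sub<<< 1, 1 >>> (dev_r, dev_a, dev_b);
CUDA_CHECK(cudaGetLastError());       /* catches launch errors */
CUDA_CHECK(cudaDeviceSynchronize());  /* catches errors during kernel execution */
```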

I’m sorry to say that P100 is not a supported GPU for profiling.

See https://developer.nvidia.com/nsight-compute under “System Requirements” and “Supported GPU architectures”

  • Pascal: GP10x (excluding GP100)
  • Volta: GV100
  • Turing: TU10x

…we should provide you with a better error message though :/

Thanks rbischof for that information. I completely missed that.
I guess this applies not only to Nsight Compute but also to nvprof? I could not find any information regarding excluded architectures for nvprof.

Well, yes, I guess that would help. But with this thread, the error code can at least be googled now ;)

nvprof supports profiling on Tesla P100.

Regarding the nvprof “Internal profiling error”: we need to narrow down which metric is causing the issue.

“--analysis-metrics” collects several metrics, including source-level metrics.

This error can be due to the source-level metrics.

You can check whether:
a) “--metrics all” works
b) there is an issue with any of the “--source-level-analysis” options (global_access, shared_access, branch, instruction_execution, pc_sampling)
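Concretely, those two checks boil down to something like the following (the binary name is a placeholder):

```shell
# 1) all metrics collected together
nvprof --metrics all ./my_app

# 2) each source-level analysis mode on its own
for mode in global_access shared_access branch instruction_execution pc_sampling; do
    echo "== $mode =="
    nvprof --source-level-analysis "$mode" ./my_app
done
```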

Good to hear.

I checked those on the simple subtraction example from above.
“nvprof --metrics all” still gives the same error 4277:5, whereas “--source-level-analysis” with any of the options works fine.

Anything else I can do to help track this down?

As the error apparently stems from collecting all the metrics together, I used some (quite ugly) shell-fu to gather each of those metrics separately:

$ nvprof --query-metrics | cut -f1 -d":" | awk "NF" | awk '{$1=$1};1' | tail -n +4 > metrics
$ while read l; do echo $l; nvprof --metrics $l -o /tmp/test.nvprof -f ./bin/test_gpu_mp_sub; done < metrics

Quite surprisingly, nvprof does not throw an error on any metric.
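For anyone wanting to reproduce the sweep, a slightly tidier version of the same loop (assuming metric names are identifiers followed by a colon, as in the --query-metrics output, and writing one output file per metric) might be:

```shell
# enumerate supported metric names, then profile one metric per run
nvprof --query-metrics | sed -n 's/^ *\([a-z0-9_]*\):.*/\1/p' > metrics
while read -r m; do
    echo "== $m =="
    nvprof --metrics "$m" -o /tmp/test_"$m".nvprof -f ./bin/test_gpu_mp_sub
done < metrics
```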

As a workaround, is it possible to join the output files together, so that all metrics are available at the same time in nvvp?

Well, this looks like an nvprof metric collection issue when multiple metrics are collected together. It would be great if you could try to share a reduced version of your code which reproduces the issue.

As a workaround, is it possible to join the output files together, so that all metrics are available at the same time in nvvp?
You can import multiple nvprof-generated metric data files. Refer to https://docs.nvidia.com/cuda/profiler-users-guide/index.html#import-nvprof-session
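As a sketch of that workflow (metric choices and file names here are arbitrary): generate one data file per run, plus a timeline file, and then combine them in the import dialog.

```shell
# one output file per metric run
nvprof --metrics inst_per_warp     -o metrics1.nvprof -f ./nvproferr
nvprof --metrics branch_efficiency -o metrics2.nvprof -f ./nvproferr
# a timeline file is also needed for the import
nvprof -o timeline.nvprof -f ./nvproferr
# in nvvp: File > Import > Nvprof > Single process, then select
# timeline.nvprof as the timeline and the metrics*.nvprof files as event/metric data
```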

Thanks for the hint that I can simply import multiple files. I hadn't thought of that.

As a reduced version of the code, I glued together what I posted above into a complete source file, nvproferr.cu:

#include <stdint.h>
#include <stdio.h>

#define LIMBS 4 /* as an example */

typedef uint32_t mp_limb;
typedef mp_limb mp_t[LIMBS];

#define ASM asm __volatile__
#define __ASM_SIZE "32"
#define __ASM_CONSTRAINT "r"

#define __sub_cc(r, a, b) ASM ("sub.cc.u" __ASM_SIZE " %0, %1, %2;": "=" __ASM_CONSTRAINT (r): __ASM_CONSTRAINT (a), __ASM_CONSTRAINT (b))
#define __subc_cc(r, a, b) ASM ("subc.cc.u" __ASM_SIZE " %0, %1, %2;": "=" __ASM_CONSTRAINT (r): __ASM_CONSTRAINT (a), __ASM_CONSTRAINT (b))

#define __addcy(carry) ASM ("addc.u" __ASM_SIZE " %0, 0, 0;": "=" __ASM_CONSTRAINT (carry))

__device__
mp_limb mp_sub(mp_t r, const mp_t a, const mp_t b) {
        mp_limb carry = 0;
#ifdef __CUDA_ARCH__
        __sub_cc(r[0], a[0], b[0]);
#pragma unroll
        for (size_t i = 1; i < LIMBS; i++) {
                __subc_cc(r[i], a[i], b[i]);
        }

        __addcy(carry);
#else
#pragma unroll
        for (size_t i = 0; i < LIMBS; i++) {
                r[i] = a[i] - b[i] - carry;
                carry = r[i] > a[i];
        }
#endif
        return carry;
}

__global__
void cuda_mp_sub(mp_t r, const mp_t a, const mp_t b) {
        mp_sub(r, a, b);
        return;
}

int main(int argc, char *argv[]){
    mp_t r;
    mp_limb *dev_a, *dev_b, *dev_r;
    cudaMalloc((void **) &dev_a, LIMBS * sizeof(mp_limb));
    cudaMalloc((void **) &dev_b, LIMBS * sizeof(mp_limb));
    cudaMalloc((void **) &dev_r, LIMBS * sizeof(mp_limb));

    cuda_mp_sub<<< 1, 1 >>> (dev_r, dev_a, dev_b);

    cudaMemcpy(r, dev_r, LIMBS * sizeof(mp_limb), cudaMemcpyDeviceToHost);

    for (int i = 0; i < LIMBS; i++)
        printf(" 0x%x", r[i]);
    printf("\n");

    return 0;
}

reproduces the problem for me, as can be seen from the following session:

$ nvcc ./nvproferr.cu -o ./nvproferr

$ ./nvproferr 
 0x0 0x0 0x0 0x0

$ cuda-memcheck ./nvproferr
========= CUDA-MEMCHECK
 0x0 0x0 0x0 0x0
========= ERROR SUMMARY: 0 errors

$ nvprof ./nvproferr
==20529== NVPROF is profiling process 20529, command: ./nvproferr
 0x0 0x0 0x0 0x0
==20529== Profiling application: ./nvproferr
==20529== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   77.46%  4.2880us         1  4.2880us  4.2880us  4.2880us  cuda_mp_sub(unsigned int*, unsigned int const *, unsigned int const *)
                   22.54%  1.2480us         1  1.2480us  1.2480us  1.2480us  [CUDA memcpy DtoH]
      API calls:   99.68%  204.51ms         3  68.169ms  6.1100us  204.49ms  cudaMalloc
                    0.16%  334.11us        94  3.5540us     359ns  125.91us  cuDeviceGetAttribute
                    0.11%  225.90us         1  225.90us  225.90us  225.90us  cuDeviceTotalMem
                    0.02%  31.336us         1  31.336us  31.336us  31.336us  cuDeviceGetName
                    0.01%  30.009us         1  30.009us  30.009us  30.009us  cudaLaunch
                    0.01%  22.073us         1  22.073us  22.073us  22.073us  cudaMemcpy
                    0.00%  2.9890us         3     996ns     351ns  1.8310us  cuDeviceGetCount
                    0.00%  1.3630us         2     681ns     457ns     906ns  cuDeviceGet
                    0.00%  1.1160us         3     372ns     169ns     614ns  cudaSetupArgument
                    0.00%     840ns         1     840ns     840ns     840ns  cudaConfigureCall

$ nvprof --metrics all ./nvproferr
==20554== NVPROF is profiling process 20554, command: ./nvproferr
==20554== Error:  0x0 0x0 0x0 0x0
Internal profiling error 4277:5.
======== Error: CUDA profiling error.

Can you also reproduce this on a Tesla P100?

I guess no answer is a yes?

I retried this with CUDA 10 from NVIDIA's repositories for Ubuntu 18.04, and I am not seeing the error anymore.

$ /usr/local/cuda-10.0/bin/nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2018 NVIDIA Corporation
Release version 10.0.130 (21)

$ /usr/local/cuda-10.0/bin/nvprof --metrics all ./nvproferr
==31620== NVPROF is profiling process 31620, command: ./nvproferr
==31620== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "cuda_mp_sub(unsigned int*, unsigned int const *, unsigned int const *)" (done)
 0x0 0x0 0x0 0x0al events
==31620== Profiling application: ./nvproferr
==31620== Profiling result:
==31620== Metric result:
Invocations                               Metric Name                                                    Metric Description         Min         Max         Avg
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: cuda_mp_sub(unsigned int*, unsigned int const *, unsigned int const *)
          1                             inst_per_warp                                                 Instructions per warp   27.000000   27.000000   27.000000
          1                         branch_efficiency                                                     Branch Efficiency     100.00%     100.00%     100.00%
          1                 warp_execution_efficiency                                             Warp Execution Efficiency       3.24%       3.24%       3.24%
...

Nice, very nice!

Sorry for a delayed response.

Thanks for sharing all the details and a reduced version of the code.

Good to know that the issue is fixed in CUDA 10.