Nsight Compute doesn't profile some metrics on SM_75

I am trying to run a profiling job for LAMMPS on an RTX 2080 Ti. As you can see below, the run command terminates successfully. Note that LAMMPS is indeed using the 2080 Ti for acceleration.

$ ~/lammps/lammps-7Aug19/src/lmp_mpi -sf gpu -in in.lj
--------------------------------------------------------------------------
[[13347,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: fury0

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
LAMMPS (7 Aug 2019)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (67.1838 67.1838 67.1838)
  1 by 1 by 1 MPI processor grid
Created 256000 atoms
  create_atoms CPU = 0.0726344 secs

--------------------------------------------------------------------------
- Using acceleration for lj/cut:
-  with 1 proc(s) per device.
--------------------------------------------------------------------------
Device 0: GeForce RTX 2080 Ti, 68 CUs, 9.5/11 GB, 1.5 GHZ (Single Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.

Setting up Verlet run ...
  Unit style    : lj
  Current step  : 0
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 46.55 | 46.55 | 46.55 Mbytes
Step Temp E_pair E_mol TotEng Press
       0         1.44   -6.7733685            0   -4.6133769   -5.0196739
    5000   0.69714777   -5.6663289            0   -4.6206113   0.74515797
Loop time of 42.8073 on 1 procs for 5000 steps with 256000 atoms

Performance: 50458.698 tau/day, 116.803 timesteps/s
96.8% CPU use with 1 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 15.967     | 15.967     | 15.967     |   0.0 | 37.30
Neigh   | 0.00023741 | 0.00023741 | 0.00023741 |   0.0 |  0.00
Comm    | 6.7688     | 6.7688     | 6.7688     |   0.0 | 15.81
Output  | 0.0011637  | 0.0011637  | 0.0011637  |   0.0 |  0.00
Modify  | 16.313     | 16.313     | 16.313     |   0.0 | 38.11
Other   |            | 3.757      |            |       |  8.78

Nlocal:    256000 ave 256000 max 256000 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:    69509 ave 69509 max 69509 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:    0 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 250
Dangerous builds not checked


---------------------------------------------------------------------
      Device Time Info (average):
---------------------------------------------------------------------
Data Transfer:   3.8269 s.
Data Cast/Pack:  12.0120 s.
Neighbor copy:   0.0005 s.
Neighbor build:  0.5015 s.
Force calc:      2.3973 s.
Device Overhead: 0.2641 s.
Average split:   1.0000.
Threads / atom:  4.
Max Mem / Proc:  348.89 MB.
CPU Driver_Time: 0.2776 s.
CPU Idle_Time:   2.9574 s.
---------------------------------------------------------------------


Please see the log.cite file for references relevant to this simulation

Total wall time: 0:00:45

However, the following Nsight Compute command fails.

$ ~/cuda-10.1.168/NsightCompute-2019.3/nv-nsight-cu-cli --quiet --metrics smsp__cycles_active.avg.pct_of_peak_sustained_elapsed -f -o 2080ti.perf4  ~/lammps/lammps-7Aug19/src/lmp_mpi -sf gpu -in in.lj
--------------------------------------------------------------------------
[[13317,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: fury0

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
LAMMPS (7 Aug 2019)
ERROR: GPU library not compiled for this accelerator (../gpu_extra.h:40)
Last command: package gpu 1
Cuda driver error 4 in call at file 'geryon/nvd_device.h' in line 135.
==ERROR== The application returned an error code (1)
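
For reference, one way to sanity-check that the metric name itself is valid for this GPU is to ask the tool which metrics it knows about (a hypothetical check, assuming this CLI version supports the --query-metrics option; it was not part of the original session):

$ ~/cuda-10.1.168/NsightCompute-2019.3/nv-nsight-cu-cli --query-metrics | grep smsp__cycles_active
# If the metric is supported on the installed device, its base name should
# appear in this list.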

I checked the GPU package library with cuobjdump, and it seems that sm_75 is indeed the architecture used for GPU acceleration (a shorter way to extract just the architecture lines is sketched after the dump below).

$ cuobjdump ../lammps-7Aug19/lib/gpu/libgpu.a

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_atom.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_ans.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_neighbor.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_neighbor_shared.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_device.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_base_atomic.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_base_charge.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_base_ellipsoid.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_base_dipole.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_base_three.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_base_dpd.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_pppm.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_pppm_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_gayberne.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_gayberne_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_re_squared.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_re_squared_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj96.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj96_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_expand.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_expand_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_coul.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_coul_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_coul_long.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_coul_long_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_dsf.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_dsf_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_class2_long.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_class2_long_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul_long.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul_long_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_morse.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_morse_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_charmm_long.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_charmm_long_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_sdk.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_sdk_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_sdk_long.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_sdk_long_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_eam.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_eam_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_eam_fs_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_eam_alloy_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_buck.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_buck_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_buck_coul.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_buck_coul_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_buck_coul_long.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_buck_coul_long_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_table.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_table_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_yukawa.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_yukawa_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born_coul_wolf.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born_coul_wolf_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born_coul_long.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born_coul_long_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_dipole_lj.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_dipole_lj_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_dipole_lj_sf.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_dipole_lj_sf_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_colloid.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_colloid_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_gauss.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_gauss_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_yukawa_colloid.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_yukawa_colloid_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_coul_debye.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_coul_debye_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul_dsf.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul_dsf_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_sw.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_sw_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_vashishta.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_vashishta_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_beck.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_beck_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_mie.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_mie_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_soft.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_soft_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_coul_msm.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_coul_msm_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_gromacs.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_gromacs_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_dpd.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_dpd_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_tersoff.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_tersoff_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_tersoff_zbl.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_tersoff_zbl_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_tersoff_mod.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_tersoff_mod_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul_debye.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul_debye_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_zbl.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_zbl_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_cubic.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_cubic_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_ufm.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_ufm_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_dipole_long_lj.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_dipole_long_lj_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_expand_coul_long.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_lj_expand_coul_long_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul_long_cs.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_coul_long_cs_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born_coul_long_cs.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born_coul_long_cs_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born_coul_wolf_cs.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:lal_born_coul_wolf_cs_ext.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:cudpp.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:cudpp_plan.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:cudpp_maximal_launch.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:cudpp_plan_manager.o:

member ../lammps-7Aug19/lib/gpu/libgpu.a:radixsort_app.cu_o:

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_75
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

member ../lammps-7Aug19/lib/gpu/libgpu.a:scan_app.cu_o:

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_75
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
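
As an aside, a more compact way to pull just the architecture lines out of that dump (a hypothetical one-liner, not part of the original post) is:

$ cuobjdump ../lammps-7Aug19/lib/gpu/libgpu.a | grep 'arch =' | sort -u
# Prints only the distinct 'arch = sm_XX' lines embedded in the library;
# given the dump above, this should report sm_75 only.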

Any idea about the error? Why am I not able to profile that metric with Nsight Compute?

We will try to look into this issue. In the meantime, could you please provide us with some additional info, specifically:

  • the exact version of OpenMPI you are using
  • the exact version of LAMMPS (if possible, how you compiled it, for easier reproduction on our end)
  • the OS you are using

Also, it would be helpful if you could check whether or not the issue still exists with Nsight Compute 2019.4. You can download that individually, or as part of the latest CUDA 10.1 Update 2 toolkit.
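
If it helps, a quick way to confirm which CLI build is actually being picked up (a hypothetical check, assuming the standard --version option) is:

$ nv-nsight-cu-cli --version
# Prints the Nsight Compute version string, so you can verify that the
# 2019.4 binary, rather than the 2019.3 one, is first on your PATH.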

OK, I tested with Nsight Compute 2019.4 and it works, at least with that metric. I will continue and let you know if there are any further issues.

Thanks for the help.