nvprof with --analysis-metrics freezes Jetson TX2 after replaying kernel.

I’m having problems generating analysis metrics on the Jetson TX2 to display in the Visual Profiler on my other CentOS machine. I can compile and run my kernel fine on the Jetson board, but when I use --analysis-metrics, nvprof reports “processing 0 of 32” (the amount of progress varies) and then waits without appearing to do anything. If I then try to move the mouse, the pointer moves a bit and stops, and I can no longer interact with the device. After waiting 30 minutes nothing had happened, and I was forced to shut the device down. The profile files are blank when loaded into the Visual Profiler on CentOS. I also tried the same thing on my CentOS machine with a 1070. It doesn’t show this behavior and doesn’t lock me out of interacting with the system, though the terminal is unresponsive. The process takes about a minute to complete on the CentOS machine.

All I do is run nvprof --analysis-metrics -o filename.nvprof ./appname

OK, I’m in the process of trying to figure out which metrics cause the entire system to stall. I created the script below, which lets me do a sort of binary search over the metrics (taken from --query-metrics) to find the ones that cause the issue. There are 120 metrics listed in total. Every metric up to the 49th works, but the metric arguments

flop_count_sp
flop_count_sp_add
flop_count_sp_fma
flop_count_sp_mul
flop_count_sp_special

all freeze my system. I suspect the double-precision versions don’t only because I don’t use doubles in my program.

shared_efficiency also stalls.

It looks like the driver can’t handle in-code profile analysis? How am I supposed to profile my code if the Tegra just crashes when I try? I’m installing with cuda-repo-l4t-8-0-local-8.0.34-1_arm64.deb, by the way.

#!/bin/bash

ARRAY=()
ARRAY+=(inst_per_warp)
ARRAY+=(branch_efficiency)
ARRAY+=(warp_execution_efficiency)
ARRAY+=(warp_nonpred_execution_efficiency)
ARRAY+=(inst_replay_overhead)
ARRAY+=(shared_load_transactions_per_request)
ARRAY+=(shared_store_transactions_per_request)
ARRAY+=(local_load_transactions_per_request)
ARRAY+=(local_store_transactions_per_request)
ARRAY+=(gld_transactions_per_request)
ARRAY+=(gst_transactions_per_request)
ARRAY+=(shared_store_transactions)
ARRAY+=(shared_load_transactions)
ARRAY+=(local_load_transactions)
ARRAY+=(local_store_transactions)
ARRAY+=(gld_transactions)
ARRAY+=(gst_transactions)
ARRAY+=(sysmem_read_transactions)
ARRAY+=(sysmem_write_transactions)
ARRAY+=(l2_read_transactions)
ARRAY+=(l2_write_transactions)
ARRAY+=(global_hit_rate)
ARRAY+=(local_hit_rate)
ARRAY+=(gld_requested_throughput)
ARRAY+=(gst_requested_throughput)
ARRAY+=(gld_throughput)
ARRAY+=(gst_throughput)
ARRAY+=(local_memory_overhead)
ARRAY+=(tex_cache_hit_rate)
ARRAY+=(l2_tex_read_hit_rate)
ARRAY+=(l2_tex_write_hit_rate)
ARRAY+=(tex_cache_throughput)
ARRAY+=(l2_tex_read_throughput)
ARRAY+=(l2_tex_write_throughput)
ARRAY+=(l2_read_throughput)
ARRAY+=(l2_write_throughput)
ARRAY+=(sysmem_read_throughput)
ARRAY+=(sysmem_write_throughput)
ARRAY+=(local_load_throughput)
ARRAY+=(local_store_throughput)
ARRAY+=(shared_load_throughput)
ARRAY+=(shared_store_throughput)
ARRAY+=(gld_efficiency)
ARRAY+=(gst_efficiency)
ARRAY+=(tex_cache_transactions)
ARRAY+=(flop_count_dp)
ARRAY+=(flop_count_dp_add)
ARRAY+=(flop_count_dp_fma)
ARRAY+=(flop_count_dp_mul)
ARRAY+=(flop_count_sp)
ARRAY+=(flop_count_sp_add)
ARRAY+=(flop_count_sp_fma)
ARRAY+=(flop_count_sp_mul)
ARRAY+=(flop_count_sp_special)
ARRAY+=(inst_executed)
ARRAY+=(inst_issued)
ARRAY+=(sysmem_utilization)
ARRAY+=(stall_inst_fetch)
ARRAY+=(stall_exec_dependency)
ARRAY+=(stall_memory_dependency)
ARRAY+=(stall_texture)
ARRAY+=(stall_sync)
ARRAY+=(stall_other)
ARRAY+=(stall_constant_memory_dependency)
ARRAY+=(stall_pipe_busy)
ARRAY+=(shared_efficiency)
ARRAY+=(inst_fp_32)
ARRAY+=(inst_fp_64)
ARRAY+=(inst_integer)
ARRAY+=(inst_bit_convert)
ARRAY+=(inst_control)
ARRAY+=(inst_compute_ld_st)
ARRAY+=(inst_misc)
ARRAY+=(inst_inter_thread_communication)
ARRAY+=(issue_slots)
ARRAY+=(cf_issued)
ARRAY+=(cf_executed)
ARRAY+=(ldst_issued)
ARRAY+=(ldst_executed)
ARRAY+=(atomic_transactions)
ARRAY+=(atomic_transactions_per_request)
ARRAY+=(l2_atomic_throughput)
ARRAY+=(l2_atomic_transactions)
ARRAY+=(l2_tex_read_transactions)
ARRAY+=(stall_memory_throttle)
ARRAY+=(stall_not_selected)
ARRAY+=(l2_tex_write_transactions)
ARRAY+=(flop_count_hp)
ARRAY+=(flop_count_hp_add)
ARRAY+=(flop_count_hp_mul)
ARRAY+=(flop_count_hp_fma)
ARRAY+=(inst_fp_16)
ARRAY+=(sysmem_read_utilization)
ARRAY+=(sysmem_write_utilization)
ARRAY+=(sm_activity)
ARRAY+=(achieved_occupancy)
ARRAY+=(executed_ipc)
ARRAY+=(issued_ipc)
ARRAY+=(issue_slot_utilization)
ARRAY+=(eligible_warps_per_cycle)
ARRAY+=(tex_utilization)
ARRAY+=(l2_utilization)
ARRAY+=(shared_utilization)
ARRAY+=(ldst_fu_utilization)
ARRAY+=(cf_fu_utilization)
ARRAY+=(special_fu_utilization)
ARRAY+=(tex_fu_utilization)
ARRAY+=(single_precision_fu_utilization)
ARRAY+=(double_precision_fu_utilization)
ARRAY+=(flop_hp_efficiency)
ARRAY+=(flop_sp_efficiency)
ARRAY+=(flop_dp_efficiency)
ARRAY+=(dram_read_transactions)
ARRAY+=(dram_write_transactions)
ARRAY+=(dram_read_throughput)
ARRAY+=(dram_write_throughput)
ARRAY+=(dram_utilization)
ARRAY+=(half_precision_fu_utilization)
ARRAY+=(ecc_transactions)
ARRAY+=(ecc_throughput)

PROFILENAME=$1
EXENAME=$2
ARRAYLENGTH=${#ARRAY[@]}

# Optional 3rd argument: index of the first metric to profile (default 0).
MINIDX=${3:-0}
# Optional 4th argument: end index (exclusive), clamped to the number of metrics.
MAXIDX=${4:-$ARRAYLENGTH}
if [ "$MAXIDX" -gt "$ARRAYLENGTH" ]
    then
        MAXIDX=$ARRAYLENGTH
fi

# Build a comma-separated list of the selected metrics (no trailing comma).
METRICLIST=""

for ((i=MINIDX; i<MAXIDX; i++))
do
  echo "${ARRAY[$i]}"
  METRICLIST+="${METRICLIST:+,}${ARRAY[$i]}"
done

/usr/local/cuda-8.0/bin/nvprof -f --metrics "$METRICLIST" -o "$PROFILENAME" "./$EXENAME"
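
A possible refinement of the search above, offered only as a sketch: rather than bisecting index ranges by hand, each metric could be collected in its own nvprof run, with a line written to a log file before the run starts. The function name probe_metrics, its argument order, and the log format are hypothetical and not part of nvprof. The sketch assumes GNU coreutils (timeout, sync) is available on the board, and the only nvprof option it uses is --metrics, as in the script above. If the whole board locks up, the timeout may never fire, which is why the metric name is written and flushed to disk first: after a forced reboot, a START line with no matching END line should point at the metric that was being collected.

```shell
#!/bin/bash
# Sketch only: probe_metrics, its arguments, and the log format are
# hypothetical names chosen for illustration.
#
# Usage: probe_metrics LOGFILE TIMEOUT_SECONDS PROFILER APP METRIC...
probe_metrics() {
    local log=$1 limit=$2 profiler=$3 app=$4
    shift 4
    local metric status
    for metric in "$@"
    do
        # Record the metric and flush to disk before the run, so the
        # entry survives a hard reset of the board.
        echo "START $metric" >> "$log"
        sync
        status=0
        # GNU timeout exits with status 124 if the command ran too long.
        timeout "$limit" "$profiler" --metrics "$metric" "$app" \
            > /dev/null 2>&1 || status=$?
        echo "END $metric status=$status" >> "$log"
    done
}

# Example call (paths and the 120 second limit are assumptions):
# probe_metrics probe.log 120 /usr/local/cuda-8.0/bin/nvprof ./appname \
#     flop_count_sp shared_efficiency
```

Running one metric per invocation is slower than passing a comma-separated list, but it separates a metric that merely times out (status=124) from one that takes the whole system down (no END line at all).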

snb4y4,
Thanks for your post. We aren’t aware of the issue you reported, but that doesn’t mean no such issue exists. Is it possible to trim your code down to the bare minimum that still reproduces the issue you observed, so we can try it here? Thanks again!

Hi, snb4y4

We tried to reproduce this issue in our environment but could not.
Could you share the complete set of options you used for nvprof?

Thanks.