How to get per-instruction counts from nv-nsight-cu-cli on the command line

Hi,
How can I get the count of each instruction type (such as FFMA or XMAD) from nv-nsight-cu-cli, the way nv-nsight-cu shows it?

Thanks

There is no good way to get this data from the CLI yet using standard output methods. You can, however, write a new python rule that extracts the information for each profiled kernel, and then forwards it to your output method of choice. In my example, I simply print it to stdout.

Create a new file called InstPrintout.py in your Nsight Compute 2019.5 installation’s section directory with the following content:

import NvRules

def get_identifier():
    return "InstPrintout"

def get_name():
    return "Instructions Per Opcode"

def get_description():
    return "Print executed instructions per opcode"

def get_section_identifier():
    return "InstructionStats"

def apply(handle):
    ctx = NvRules.get_context(handle)
    action = ctx.range_by_idx(0).action_by_idx(0)
    fe = ctx.frontend()

    # Per-opcode executed-instruction counts; one metric instance per opcode
    inst_per_opcode = action.metric_by_name("sass__inst_executed_per_opcode")
    num_opcodes = inst_per_opcode.num_instances()
    opcodes = inst_per_opcode.correlation_ids()
    for i in range(num_opcodes):
        op = opcodes.as_string(i)
        num = inst_per_opcode.as_uint64(i)
        print("{}: {}".format(op, num))
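The loop prints opcodes in whatever order the metric instances appear. If you prefer the output sorted by count, the formatting can be kept separate from the NvRules plumbing; a minimal sketch as plain Python (the helper name is my own, not part of the NvRules API):

```python
# Hypothetical helper: sort (opcode, count) pairs by count, descending,
# and format them the way the rule prints them. Pure Python, so it can
# be called from apply() or tested on its own.
def format_opcode_counts(pairs):
    lines = []
    for op, num in sorted(pairs, key=lambda p: p[1], reverse=True):
        lines.append("{}: {}".format(op, num))
    return lines

# Example:
for line in format_opcode_counts([("MOV", 1568), ("IMAD", 6262), ("STG", 1563)]):
    print(line)
```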

Run this as follows:
nv-nsight-cu-cli --apply-rules --section InstructionStats <app>

The output should look similar to this:

IMAD: 6262
S2R: 3136
ISETP: 3136
LDG: 3126
BSYNC: 1569
MOV: 1568
EXIT: 1568
BSSY: 1568
BRA: 1568
BMOV: 1568
STG: 1563
FADD: 1563
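If you want a single total to compare against other metrics, you can sum the per-opcode lines. A small stand-alone sketch, using the sample output above as input:

```python
# Sum the per-opcode counts printed by the rule; the text below is
# just the sample output shown above.
report = """\
IMAD: 6262
S2R: 3136
ISETP: 3136
LDG: 3126
BSYNC: 1569
MOV: 1568
EXIT: 1568
BSSY: 1568
BRA: 1568
BMOV: 1568
STG: 1563
FADD: 1563
"""
total = sum(int(line.split(": ")[1]) for line in report.strip().splitlines())
print(total)  # → 28195
```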

Thank you very much.
It works.

Hello, I have a new question about this.
I used the vectorAdd sample code and set numElements = 1:

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>
/**
 * CUDA Kernel Device code
 *
 * Copies vector A into C (the sample's addition has been reduced to a
 * copy here). The vectors have numElements elements.
 */
__global__ void
vectorAdd(const float *A, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i];
    }
}

/**
 * Host main routine
 */
int
main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 1;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
    {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Done\n");
    return 0;
}

  • 1. The Executed Instructions metric reports 54.
  • 2. Summing the per-opcode counts printed by the rule gives a total of 46.

Why are they not equal?

The two metrics are collected from different sources. The first one, Executed Instructions, is a hardware metric, i.e. a HW counter is incremented whenever the GPU executes an instruction. The second is a SASS metric, i.e. it is collected by the tool patching the SASS assembly of your kernel and recording a value whenever a patched instruction is executed.

The HW metric includes any instruction executed by the HW, including those in syscalls or driver code. The SASS metric only collects data for the patched instructions of your kernel code.
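As a concrete illustration with the numbers from the vectorAdd run above (assuming 54 is the hardware count and 46 the summed SASS count):

```python
# Hypothetical numbers from the vectorAdd run above:
hw_executed = 54    # Executed Instructions (hardware counter)
sass_executed = 46  # sum of the per-opcode SASS counts
# The gap is instructions the hardware executed outside the patched
# user SASS, e.g. in syscall or driver code:
print(hw_executed - sass_executed)  # → 8
```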
