How to get per-instruction counts from nv-nsight-cu-cli on the command line

Hi,
How can I get the count of each instruction type (such as FFMA or XMAD) from nv-nsight-cu-cli, the way nv-nsight-cu shows it?

Thanks

There is no good way to get this data from the CLI yet using standard output methods. You can, however, write a new python rule that extracts the information for each profiled kernel, and then forwards it to your output method of choice. In my example, I simply print it to stdout.

Create a new file called InstPrintout.py in your Nsight Compute 2019.5 installation’s section directory with the following content:

import NvRules

def get_identifier():
    return "InstPrintout"

def get_name():
    return "Instructions Per Opcode"

def get_description():
    return "Print executed instructions per opcode"

def get_section_identifier():
    return "InstructionStats"

def apply(handle):
    ctx = NvRules.get_context(handle)
    action = ctx.range_by_idx(0).action_by_idx(0)
    fe = ctx.frontend()

    # Per-opcode executed-instruction counts; one metric instance per opcode
    inst_per_opcode = action.metric_by_name("sass__inst_executed_per_opcode")
    num_opcodes = inst_per_opcode.num_instances()
    opcodes = inst_per_opcode.correlation_ids()
    for i in range(num_opcodes):
        op = opcodes.as_string(i)
        num = inst_per_opcode.as_uint64(i)
        print("{}: {}".format(op, num))
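The loop prints opcodes in whatever order the metric instances appear. If you prefer the output sorted by count, the formatting can be kept separate from the NvRules plumbing; a minimal sketch as plain Python (the helper name is my own, not part of the NvRules API):

```python
# Hypothetical helper: sort (opcode, count) pairs by count, descending,
# and format them the way the rule prints them. Pure Python, so it can
# be called from apply() or tested on its own.
def format_opcode_counts(pairs):
    lines = []
    for op, num in sorted(pairs, key=lambda p: p[1], reverse=True):
        lines.append("{}: {}".format(op, num))
    return lines

# Example:
for line in format_opcode_counts([("MOV", 1568), ("IMAD", 6262), ("STG", 1563)]):
    print(line)
```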

Run this as follows:
nv-nsight-cu-cli --apply-rules --section InstructionStats <app>

The output should look similar to this:

IMAD: 6262
S2R: 3136
ISETP: 3136
LDG: 3126
BSYNC: 1569
MOV: 1568
EXIT: 1568
BSSY: 1568
BRA: 1568
BMOV: 1568
STG: 1563
FADD: 1563
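If you want a single total to compare against other metrics, you can sum the per-opcode lines. A small stand-alone sketch, using the sample output above as input:

```python
# Sum the per-opcode counts printed by the rule; the text below is
# just the sample output shown above.
report = """\
IMAD: 6262
S2R: 3136
ISETP: 3136
LDG: 3126
BSYNC: 1569
MOV: 1568
EXIT: 1568
BSSY: 1568
BRA: 1568
BMOV: 1568
STG: 1563
FADD: 1563
"""
total = sum(int(line.split(": ")[1]) for line in report.strip().splitlines())
print(total)  # → 28195
```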

Thank you very much.
It works.

Hello, I have a new question about this.
I used the vectorAdd sample code and set numElements = 1:

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>
/**
 * CUDA Kernel Device code
 *
 * Copies vector A into C (the sample's addition has been reduced to a
 * copy here). The vectors have numElements elements.
 */
__global__ void
vectorAdd(const float *A, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i];
    }
}

/**
 * Host main routine
 */
int
main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 1;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
    {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Done\n");
    return 0;
}

  • 1. The Executed Instructions metric reports 54.
  • 2. Summing the per-opcode counts printed by the rule gives a total of 46.

Why are they not equal?

The two metrics are collected from different sources. The first one, Executed Instructions, is a hardware metric, i.e. a HW counter is incremented whenever the GPU executes an instruction. The second is a SASS metric, i.e. it is collected by the tool patching the SASS assembly of your kernel and recording a value whenever a patched instruction is executed.

The HW metric includes any instruction executed by the HW, including those in syscalls or driver code. The SASS metric only collects data for the patched instructions of your kernel code.
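As a concrete illustration with the numbers from the vectorAdd run above (assuming 54 is the hardware count and 46 the summed SASS count):

```python
# Hypothetical numbers from the vectorAdd run above:
hw_executed = 54    # Executed Instructions (hardware counter)
sass_executed = 46  # sum of the per-opcode SASS counts
# The gap is instructions the hardware executed outside the patched
# user SASS, e.g. in syscall or driver code:
print(hw_executed - sass_executed)  # → 8
```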
