Profiling on Xavier not working correctly

Hi,

I have two Xavier boards set up as follows:
#1 has the older JetPack version
#2 has the newer JetPack version

I’m trying to execute kernels concurrently using streams. To visualize this, I intended to run NVVP on a separate host machine and profile remotely. The problem is that this works on Xavier #1 but not on Xavier #2. The error is the following:

“Data collection for 1 analysis stages failed
Metric 22741001 not found
Metric 22741001 not found”

If I instead profile with nvprof and then import the result file into NVVP, I get a serialized execution on Xavier #2.

Any ideas?
thanks

Hi,

Does this issue occur only in the concurrent-streams use case?

A possible cause is that the CUDA profiler requires root privileges from version 10.0 onward.
Have you logged in to the remote device as root?
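
To rule out basic differences between the two boards, a small check along these lines could be run on each Xavier. It is only a sketch, assuming a Linux target (for geteuid) and device 0: it prints the CUDA runtime and driver versions, whether the device reports concurrent-kernel support, and whether the process is running as root. The helpers version_major and version_minor are made up for this example.

```cuda
#include <cstdio>
#include <unistd.h>        // geteuid (assumes a Linux target such as Jetson)
#include <cuda_runtime.h>

// CUDA encodes versions as 1000 * major + 10 * minor (e.g. 10020 is 10.2).
// These two helpers are hypothetical names used only in this sketch.
static int version_major(int v) { return v / 1000; }
static int version_minor(int v) { return (v % 1000) / 10; }

int main() {
    int runtime_version = 0, driver_version = 0;
    if (cudaRuntimeGetVersion(&runtime_version) != cudaSuccess ||
        cudaDriverGetVersion(&driver_version) != cudaSuccess) {
        fprintf(stderr, "could not query CUDA versions\n");
        return 1;
    }

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("CUDA runtime %d.%d, driver %d.%d\n",
           version_major(runtime_version), version_minor(runtime_version),
           version_major(driver_version), version_minor(driver_version));
    printf("device: %s\n", prop.name);
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    printf("asyncEngineCount: %d\n", prop.asyncEngineCount);
    printf("running as root: %s\n", geteuid() == 0 ? "yes" : "no");
    return 0;
}
```

Comparing the output from both boards shows whether they differ in CUDA version or reported capabilities before looking at the profiler itself.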

Thanks.

Hi,

The error also happens when running kernels that do not execute concurrently. In addition, the timeline for functions such as cudaMemcpy does not appear.

I do access the remote device as root.

The older JetPack version is 4.1.1 Developer Preview, and the JetPack version on the other AGX Xavier is 4.2.3.

P.S.: Sorry for the incorrect post title; I have already changed it.

Thanks.

Hi,

Thanks for the details.

To reproduce this issue in our environment, could you share source code that triggers it?
It can be a simple kernel or just a CUDA sample that can reproduce this issue.

Thanks.

Hi,

The code is below:

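#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <cuda_profiler_api.h>

// Each thread repeatedly accumulates x + y into z, which keeps the kernel
// running long enough to be visible on the profiler timeline.
// N is passed in but not used; the launch is sized to exactly N threads.
__global__
void vector_addition_loop(int N, float *d_x, float *d_y, float *d_z){
        int index=(blockIdx.x*blockDim.x) + threadIdx.x;
        for(int i=0; i<1000000; ++i){
                d_z[index]+=d_x[index]+d_y[index];
        }
}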

int main(int argc, char *argv[]){

        int N = 32;
        int nB = 1;
        const int num_streams = 4;
        cudaStream_t streams[num_streams];


        float *x[num_streams], *y[num_streams], *z[num_streams], *d_x[num_streams], *d_y[num_streams], *d_z[num_streams];

        for(int j=0; j<num_streams; ++j){
                x[j] = (float*)malloc(nB*N*sizeof(float));
                y[j] = (float*)malloc(nB*N*sizeof(float));
                z[j] = (float*)malloc(nB*N*sizeof(float));
        }

        for(int j=0; j<num_streams; ++j){
                for(int i=0; i<N*nB; i++){
                        x[j][i]=0.01f;
                        y[j][i]=0.02f;
                }
        }

        for(int i=0; i<num_streams; ++i){
                cudaStreamCreate(&streams[i]);
                cudaMalloc(&d_x[i], nB*N*sizeof(float));
                cudaMalloc(&d_y[i], nB*N*sizeof(float));
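                cudaMalloc(&d_z[i], nB*N*sizeof(float));
                // The kernel accumulates into d_z, so zero it before use
                cudaMemset(d_z[i], 0, nB*N*sizeof(float));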

                cudaMemcpy(d_x[i],x[i],N*nB*sizeof(float), cudaMemcpyHostToDevice);
                cudaMemcpy(d_y[i],y[i],N*nB*sizeof(float), cudaMemcpyHostToDevice);

                vector_addition_loop<<<nB, N, 0, streams[i]>>>(N,d_x[i],d_y[i], d_z[i]);
        }
        cudaDeviceSynchronize();
        cudaProfilerStop();

        // Release device buffers, host buffers and streams
        for(int i=0; i<num_streams; ++i){
                cudaFree(d_x[i]);
                cudaFree(d_y[i]);
                cudaFree(d_z[i]);
                free(x[i]);
                free(y[i]);
                free(z[i]);
                cudaStreamDestroy(streams[i]);
        }
        cudaDeviceReset();
        return 0;
}
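
One thing worth noting about the code above, independent of the profiler: cudaMemcpy is issued in the legacy default stream, which synchronizes with streams created by cudaStreamCreate. The copies for one stream can therefore wait for the kernel already launched in the previous stream, and that alone can produce a serialized timeline. Below is a minimal sketch of a variant that keeps all work in the per-stream queues, assuming pinned host buffers and non-blocking streams. The names busy_add, blocks_for and CHECK are illustrative and not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Leave main() with an error message if a CUDA runtime call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            return 1;                                               \
        }                                                           \
    } while (0)

// Number of blocks needed to cover n elements (ceiling division).
static int blocks_for(int n, int threads) { return (n + threads - 1) / threads; }

// Busy kernel: accumulates x + y into z many times so it shows up on a timeline.
__global__ void busy_add(int n, int iters, const float *x, const float *y, float *z) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float acc = z[idx];
    for (int i = 0; i < iters; ++i) acc += x[idx] + y[idx];
    z[idx] = acc;
}

int main() {
    const int n = 32, threads = 32, iters = 1000000, num_streams = 4;
    const size_t bytes = n * sizeof(float);
    cudaStream_t streams[num_streams];
    float *h_x[num_streams], *h_y[num_streams];
    float *d_x[num_streams], *d_y[num_streams], *d_z[num_streams];

    // Phase 1: create streams and allocate everything before any kernel launch.
    for (int s = 0; s < num_streams; ++s) {
        CHECK(cudaStreamCreateWithFlags(&streams[s], cudaStreamNonBlocking));
        CHECK(cudaMallocHost((void **)&h_x[s], bytes));  // pinned host memory
        CHECK(cudaMallocHost((void **)&h_y[s], bytes));
        CHECK(cudaMalloc((void **)&d_x[s], bytes));
        CHECK(cudaMalloc((void **)&d_y[s], bytes));
        CHECK(cudaMalloc((void **)&d_z[s], bytes));
        for (int i = 0; i < n; ++i) { h_x[s][i] = 0.01f; h_y[s][i] = 0.02f; }
    }

    // Phase 2: enqueue copies, memset and kernel in each stream.
    // Nothing here is issued in the default stream.
    for (int s = 0; s < num_streams; ++s) {
        CHECK(cudaMemcpyAsync(d_x[s], h_x[s], bytes, cudaMemcpyHostToDevice, streams[s]));
        CHECK(cudaMemcpyAsync(d_y[s], h_y[s], bytes, cudaMemcpyHostToDevice, streams[s]));
        CHECK(cudaMemsetAsync(d_z[s], 0, bytes, streams[s]));
        busy_add<<<blocks_for(n, threads), threads, 0, streams[s]>>>(
            n, iters, d_x[s], d_y[s], d_z[s]);
        CHECK(cudaGetLastError());
    }
    CHECK(cudaDeviceSynchronize());

    // Phase 3: release resources.
    for (int s = 0; s < num_streams; ++s) {
        CHECK(cudaFree(d_x[s]));
        CHECK(cudaFree(d_y[s]));
        CHECK(cudaFree(d_z[s]));
        CHECK(cudaFreeHost(h_x[s]));
        CHECK(cudaFreeHost(h_y[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    return 0;
}
```

If the timeline still shows serialized kernels with this variant, the cause is more likely on the profiler or driver side than in the launch pattern.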

Thanks

Were you able to reproduce the issue? Do you need anything else?

Kind regards

Hi,

Sorry for keeping you waiting.

We found a possible issue, and it is fixed in JetPack 4.3.
Would you mind giving our latest Nsight profiler a try?

Really sorry for the late update and thanks for your patience.