Kernel performance when switching compute capability from 3.0 to 6.2 on Jetson TX2

Hi all,

I am trying to optimize a kernel for maximum performance on my Jetson TX2, using JetPack 4.3 (CUDA Toolkit 10.0).
The kernel uses shared memory, float-to-int calculations and conversions, and a few for loops.

I compiled it (with fast math enabled on nvcc) and measured roughly 300 ms per run with nvprof.
I then realized that I hadn't specified any compute capability, so my Nsight Eclipse project was defaulting to compute capability 3.0 and PTX/SASS 2.0 compilation.
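
(For reference, selecting compute capability 6.2 in the project settings should correspond to something like the following nvcc flags; the file names here are just placeholders:)

$ nvcc --use_fast_math -gencode arch=compute_62,code=sm_62 my_kernel.cu -o my_app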

When I switched to compute capability 6.2, the kernel slowed down by almost 400%, averaging 1.3 seconds under the same conditions as before.

I would have expected an improvement instead.

I am struggling to find an answer to this. Does anybody have any suggestions, pointers, or general optimization advice?

The code of my kernel is the following (it’s an interpolation and modulation of float input values to uint8 output):



/******************************************
Function Call:
******************************************/
int block_1d = 256;                                  // threads per block
int n_cols = 256*205;                                // one column per thread
int n_pre_rows = 13;                                 // input rows (samples) per column
int n_post_rows = 3105;                              // output rows per column (output stride)
int grid_1d = (n_cols + block_1d - 1) / block_1d;    // enough blocks to cover all columns

// dynamic shared memory: n_pre_rows floats per thread
my_kernel<<<grid_1d, block_1d, block_1d*n_pre_rows*sizeof(float), my_stream>>>(
    (uint8_t*)ptr_u8, (const float*)ptr_f,
    n_pre_rows, n_post_rows, n_cols, (float)downsamplingFactor, (1.f/(float)downsamplingFactor), samplingFactor, subf);

/******************************************/


__global__ void my_kernel(
    uint8_t* __restrict__ out_buff, const float* __restrict__ input_buff,
    const uint32_t n_pre_rows, const uint32_t n_post_rows, const uint32_t n_cols, 
    const float downsamplingFactor, const float downsamplingFactorMul, const float samplingFactor, const float subf)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    extern __shared__ uint8_t s[];
    float m;
    float * pre_data = (float*) s;
    uint32_t shared_idx = threadIdx.x*n_pre_rows;
    uint32_t accumulation;
    int increments;
    int lastData;
    uint32_t col_idx;
    uint32_t ds_factor_u16;
    uint32_t subloop_idx;
    uint32_t post_data_init_0;
    uint32_t post_data_init_1;
    uint32_t post_data_current;
    const float round_factor = 0.5f;

    if (x < (n_cols)){
        ds_factor_u16 = (uint16_t) (downsamplingFactor);
        col_idx = x*n_post_rows;
        pre_data[shared_idx] =  samplingFactor * (input_buff[x*n_pre_rows] - subf);   // stage the first scaled input sample in shared memory
        for (int i=0; i<n_pre_rows-1; i++){
            subloop_idx = i*ds_factor_u16;
            pre_data[shared_idx+i+1] =  (samplingFactor * (input_buff[x*n_pre_rows+i+1] - subf));
            m = (pre_data[shared_idx+i+1] - pre_data[shared_idx+i]) * downsamplingFactorMul;   // per-step slope of the linear interpolation
            if (i==0){
                post_data_init_0 = floorf(pre_data[shared_idx] + round_factor);
                post_data_init_1 = floorf(pre_data[shared_idx]+ m + round_factor);
                increments = 2*(post_data_init_1 - post_data_init_0);
                accumulation = post_data_init_0;
                lastData = -1;
            }
            for (int j = (i==0); j<downsamplingFactor; j++){
                post_data_current = floorf(pre_data[shared_idx+i]+ (float)j* m + round_factor);
                accumulation += (increments >> 1);
                if (increments > 128) {
                    lastData = -1;
                } else {
                    if (post_data_current > accumulation) {
                        lastData = -1;
                    } else if (post_data_current < accumulation) {
                        lastData = 1;
                    }
                }
                out_buff[col_idx + subloop_idx + j-1] = (lastData == 1);
                increments += lastData;
            }
        }
    }
}

Thanks,
Andrea

Hi,

Could you wrap the sample code into a standalone app so we can try it internally?
Thanks.

Hi AastaLLL,

of course, here is the code; you should be able to compile it just by including the common/inc/ directory of the CUDA samples (example build and run commands are shown below the listing):

// system includes
#include <cstdio>
#include <ctime>
#include <vector>
#include <algorithm>
#include <stdlib.h>

// utilities from samples
#include <helper_cuda.h>

cudaStream_t *g_streams; 


__global__ void my_kernel(
    uint8_t* __restrict__ out_buff, const float* __restrict__ input_buff,
    const uint32_t n_pre_i_depths, const uint32_t n_post_i_depths, const uint32_t n_cols, 
    const float downsamplingFactor, const float downsamplingFactorMul, const float samplingFactorQuant, const float mindelayf);

int main(int argc, char** argv){
    int columns = 256*205;
    int out_d = 3105;
    int bs_d = (out_d - 1) & (~31);
    int in_d = 13;
    float samplingFactor = 20e6*16.f;
    uint32_t downsamplingFactor = 256;
    float mindelayf = 0.f;
    float r;

    float *in_buff;
    uint8_t *out_buff;

    int in_size = columns*in_d*sizeof(float);
    int out_size = columns*bs_d*sizeof(uint8_t);

    int ret = 0;
    int block_1d = 256;
    int grid_btstr_1d = (columns + block_1d -1) / block_1d;

    // init
    g_streams = new cudaStream_t[2];
    for (int i=0; i<2; i++) {
        checkCudaErrors(cudaStreamCreate(&g_streams[i]));
    }
    srand (static_cast <unsigned> (time(0)));

    checkCudaErrors(cudaMallocManaged(&in_buff, in_size));
    checkCudaErrors(cudaMallocManaged(&out_buff, out_size));
    // attach the managed buffers to the host so the CPU can initialize them
    checkCudaErrors(cudaStreamAttachMemAsync(g_streams[0], in_buff, 0, cudaMemAttachHost));
    checkCudaErrors(cudaStreamAttachMemAsync(g_streams[0], out_buff, 0, cudaMemAttachHost));
    checkCudaErrors(cudaDeviceSynchronize());

    mindelayf = 1.f;
    for (int i=0; i<columns; i++){
        for (int j=0; j<in_d; j++){
            r = (static_cast <float> (rand()) / static_cast <float> (RAND_MAX)) / 100.f;
            in_buff [i*in_d +j]= 1.f + (float)i*1e-7 + (float)j*1e-8 + r;
        }
    }

    printf("floats:\n");
    for (int i=0; i<10; i++){
        for (int j=0; j<10; j++){
            printf("%e  ", in_buff [i*in_d +j]);
        }
        printf("\n");
    }

    // ------------------------------- //

    // re-attach the buffers to the compute stream before launching the kernel
    checkCudaErrors(cudaStreamAttachMemAsync(g_streams[1], in_buff, 0, cudaMemAttachSingle));
    checkCudaErrors(cudaStreamAttachMemAsync(g_streams[1], out_buff, 0, cudaMemAttachSingle));
    checkCudaErrors(cudaDeviceSynchronize());

    my_kernel<<<grid_btstr_1d, block_1d, block_1d*(in_d*sizeof(float) + sizeof(uint32_t)), g_streams[1]>>>(
        (uint8_t*)out_buff,(const float*)in_buff,
        in_d, bs_d, columns, (float)downsamplingFactor, (1.f/(float)downsamplingFactor), samplingFactor, mindelayf);

    // attach the output back to the host to read the results
    checkCudaErrors(cudaStreamAttachMemAsync(g_streams[0], out_buff, 0, cudaMemAttachHost));
    checkCudaErrors(cudaDeviceSynchronize());

    // ------------------------------- //

    printf("bitstream:\n");
    for (int i=0; i<10; i++){
        for (int j=0; j<10; j++){
            printf("0x%x  ", out_buff [i*bs_d +j*downsamplingFactor]);
        }
        printf("\n");
    }

    return 0;
}

/******************************************/


__global__ void my_kernel(
    uint8_t* __restrict__ out_buff, const float* __restrict__ input_buff,
    const uint32_t n_pre_rows, const uint32_t n_post_rows, const uint32_t n_cols, 
    const float downsamplingFactor, const float downsamplingFactorMul, const float samplingFactor, const float subf)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    extern __shared__ uint8_t s[];
    float m;
    float * pre_data = (float*) s;
    uint32_t shared_idx = threadIdx.x*n_pre_rows;
    uint32_t accumulation;
    int increments;
    int lastData;
    uint32_t col_idx;
    uint32_t ds_factor_u16;
    uint32_t subloop_idx;
    uint32_t post_data_init_0;
    uint32_t post_data_init_1;
    uint32_t post_data_current;
    const float round_factor = 0.5f;

    if (x < (n_cols)){
        ds_factor_u16 = (uint16_t) (downsamplingFactor);
        col_idx = x*n_post_rows;
        pre_data[shared_idx] =  samplingFactor * (input_buff[x*n_pre_rows] - subf);
        for (int i=0; i<n_pre_rows-1; i++){
            subloop_idx = i*ds_factor_u16;
            pre_data[shared_idx+i+1] =  (samplingFactor * (input_buff[x*n_pre_rows+i+1] - subf));
            m = (pre_data[shared_idx+i+1] - pre_data[shared_idx+i]) * downsamplingFactorMul;
            if (i==0){
                post_data_init_0 = floorf(pre_data[shared_idx] + round_factor);
                post_data_init_1 = floorf(pre_data[shared_idx]+ m + round_factor);
                increments = 2*(post_data_init_1 - post_data_init_0);
                accumulation = post_data_init_0;
                lastData = -1;
            }
            for (int j = (i==0); j<downsamplingFactor; j++){
                post_data_current = floorf(pre_data[shared_idx+i]+ (float)j* m + round_factor);
                accumulation += (increments >> 1);
                if (increments > 128) {
                    lastData = -1;
                } else {
                    if (post_data_current > accumulation) {
                        lastData = -1;
                    } else if (post_data_current < accumulation) {
                        lastData = 1;
                    }
                }
                out_buff[col_idx + subloop_idx + j-1] = (lastData == 1);
                increments += lastData;
            }
        }
    }
}

This is an example result from nvprof when testing with nvpmodel 0:
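
(For reference, roughly how I built and ran it; names and paths are placeholders:)

$ nvcc test.cu -o test -I/usr/local/cuda/samples/common/inc
$ sudo nvpmodel -m 0
$ nvprof ./test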

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  332.98ms         1  332.98ms  332.98ms  332.98ms  my_kernel(unsigned char*, float const *, unsigned int, unsigned int, unsigned int, float, float, float, float)

Thanks,
Andrea

Hi,

Have you maximized the clock?
We have tried your sample on TX2 + JetPack 4.6.3 and were not able to reproduce the issue.

Default:

$ nvcc test.cu -o default -I/usr/local/cuda/samples/common/inc
==14679== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  181.54ms         1  181.54ms  181.54ms  181.54ms  my_kernel(unsigned char*, float const *, unsigned int, unsigned int, unsigned int, float, float, float, float)
      API calls:   57.44%  262.07ms         2  131.04ms  4.3200us  262.07ms  cudaStreamCreate
                   40.19%  183.36ms         3  61.119ms  415.20us  182.22ms  cudaDeviceSynchronize
                    2.31%  10.521ms         2  5.2603ms  658.27us  9.8624ms  cudaMallocManaged
                    0.03%  127.97us         1  127.97us  127.97us  127.97us  cudaLaunchKernel
                    0.02%  92.736us        97     956ns     448ns  28.608us  cuDeviceGetAttribute
                    0.01%  53.824us         5  10.764us  3.5840us  24.896us  cudaStreamAttachMemAsync
                    0.00%  10.752us         1  10.752us  10.752us  10.752us  cuDeviceTotalMem
                    0.00%  5.2800us         3  1.7600us     896ns  2.5600us  cuDeviceGetCount
                    0.00%  2.0800us         2  1.0400us     864ns  1.2160us  cuDeviceGet
                    0.00%  1.5040us         1  1.5040us  1.5040us  1.5040us  cuDeviceGetName
                    0.00%     704ns         1     704ns     704ns     704ns  cuDeviceGetUuid

Specified sm=62:

$ nvcc test.cu -o gen62 -I/usr/local/cuda/samples/common/inc -gencode arch=compute_62,code=sm_62
==14734== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  181.08ms         1  181.08ms  181.08ms  181.08ms  my_kernel(unsigned char*, float const *, unsigned int, unsigned int, unsigned int, float, float, float, float)
      API calls:   56.54%  252.09ms         2  126.05ms  4.4160us  252.09ms  cudaStreamCreate
                   41.02%  182.91ms         3  60.970ms  439.71us  181.76ms  cudaDeviceSynchronize
                    2.37%  10.585ms         2  5.2927ms  621.92us  9.9635ms  cudaMallocManaged
                    0.03%  126.11us         1  126.11us  126.11us  126.11us  cudaLaunchKernel
                    0.02%  96.480us        97     994ns     448ns  30.592us  cuDeviceGetAttribute
                    0.01%  53.280us         5  10.656us  3.2000us  24.608us  cudaStreamAttachMemAsync
                    0.00%  11.072us         1  11.072us  11.072us  11.072us  cuDeviceTotalMem
                    0.00%  5.5040us         3  1.8340us     896ns  2.9760us  cuDeviceGetCount
                    0.00%  2.1120us         2  1.0560us     864ns  1.2480us  cuDeviceGet
                    0.00%  1.6000us         1  1.6000us  1.6000us  1.6000us  cuDeviceGetName
                    0.00%     704ns         1     704ns     704ns     704ns  cuDeviceGetUuid

Thanks.


Hi,

I thought that setting nvpmodel 0 would already enable the maximum clock frequencies, but running jetson_clocks actually delivers better results.
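
(For reference, the commands in question are roughly:)

$ sudo nvpmodel -m 0    # select the MAXN power model
$ sudo jetson_clocks    # lock the clocks at their maximum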

With the clocks maximized I was able to replicate your experiment (CC 3.0 and CC 6.2 now give the same result).
I wonder why this wasn't the case with the non-maximized clocks.

Now I don't know if this is off-topic, but:
Do you have any advice to further reduce the GPU execution time?
For example, should I use shared memory to hold my local variables (even when not strictly needed)?
I noticed better results with that compared to keeping them in a small set of registers.

Thanks a lot for your support.
Andrea

Hi,

The default setting uses a dynamic clock.
This means the clock rate is adjusted automatically according to the GPU load.
Execution times will usually vary from run to run because the clock rate is not fixed at its maximum.

It's recommended to profile your program with Nsight Compute.
It reports occupancy and memory-usage metrics that can help you improve the code further.
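
(A typical command-line invocation looks something like the following; the CLI binary is ncu in recent releases and nv-nsight-cu-cli in older ones, and the report name is just an example:)

$ ncu --set full -o my_kernel_report ./test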

Thanks.


Hi,

thanks for your answers, I certainly will.

One last thing:
I'm not sure I can install Nsight Compute with JetPack 4.3: the download page redirects to the SDK Manager, and the SDK Manager doesn't list Nsight Compute for this JetPack version.

I do have the old nvvp visual profiler, though.
Should I get Nsight Compute from another JetPack version?

Thanks,
Andrea

Hi,

Yes, you can find it in JetPack 4.6:

https://docs.nvidia.com/jetson/archives/jetpack-archived/jetpack-46/release-notes/index.html#developer-tools

Thanks.
