I’m working on a CUDA kernel and am hoping others can confirm the execution numbers I’m seeing and suggest ways to optimize the kernel operations.
I’ve got a benchmarking application where I’m separating simulated RGBA camera data into three R/G/B planes (the alpha channel is skipped) and converting that planar data to floating point. The camera resolution is 1920x1080 with 32 bits per pixel and 8 bits per R/G/B/A channel. The application will eventually be extended to handle a total of 4 cameras, so I’ll be converting 1920x1080*4 pixels; reducing the amount of time these operations take is going to be very important.
The kernel I’ve defined combines the channel separation and floating-point conversion:
__global__
void swizzle_channels_and_convert_u8_to_f(int pixel_count, int *srcRGBA, float *dstRGBFloatPlanes)
{
    int pixel_index = blockIdx.x * blockDim.x + threadIdx.x;
    int pixel_stride = blockDim.x * gridDim.x;
    int in_step;
    int out_step;

    for (int pixel = pixel_index; pixel < pixel_count; pixel += pixel_stride)
    {
        in_step = pixel;
        out_step = pixel;
        dstRGBFloatPlanes[out_step + pixel_count * 0] = __int2float_rd((srcRGBA[in_step] & 0x0000FF) >> 0);  // r
        dstRGBFloatPlanes[out_step + pixel_count * 1] = __int2float_rd((srcRGBA[in_step] & 0x00FF00) >> 8);  // g
        dstRGBFloatPlanes[out_step + pixel_count * 2] = __int2float_rd((srcRGBA[in_step] & 0xFF0000) >> 16); // b
    }
}
I’m using unified memory and allocate both the RGBA data and the floating-point planar data with cudaMallocManaged(). I invoke the kernel like this:
int blockSize = 512;
int numBlocks = (width*height + blockSize - 1) / blockSize;
/* ...other setup code here (random simulated camera data)... eliminated for brevity... */
swizzle_channels_and_convert_u8_to_f<<<numBlocks,blockSize>>>(width*height,(int *)RGBAChannelData,RGBFloatPlaneData);
cudaDeviceSynchronize(); // wait for operation to complete
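For completeness, the allocations (part of the setup code elided above) are essentially the following sketch. Only RGBAChannelData and RGBFloatPlaneData appear in the call above; the other names and exact types here are just for illustration:

size_t pixelCount = (size_t)width * height;                                    // 1920 * 1080

unsigned char *RGBAChannelData = NULL;  // interleaved 8-bit R/G/B/A, 4 bytes per pixel
float *RGBFloatPlaneData = NULL;        // 3 contiguous float planes: all R, then all G, then all B

cudaMallocManaged(&RGBAChannelData, pixelCount * 4 * sizeof(unsigned char));   // ~8.3 MB
cudaMallocManaged(&RGBFloatPlaneData, pixelCount * 3 * sizeof(float));         // ~24.9 MB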
When I run this on the Xavier using nvprof, I get the following execution time results:
nvidia@tegra-ubuntu:~/projects/cuda$ sudo /usr/local/cuda/bin/nvprof ./rgb-planar --skipnpp --count 10
counts: 10
Block size: 512, block count: 4050
==27469== NVPROF is profiling process 27469, command: ./rgb-planar --skipnpp --count 10
==27469== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
CUDA kernel loop...
2560,2462,2022,2058,2458,2427,2384,2070,2310,2443
Elapsed time. Min = 2022, Max = 2560, Avg = 2319, Trimmed Mean = 2326
Done!
==27469== Profiling application: ./rgb-planar --skipnpp --count 10
==27469== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 11.355ms 10 1.1355ms 1.1243ms 1.1919ms swizzle_channels_and_convert_u8_to_f(int, int*, float*)
API calls: 92.43% 322.91ms 2 161.45ms 2.1241ms 320.78ms cudaMallocManaged
4.81% 16.821ms 10 1.6821ms 1.3583ms 1.9267ms cudaDeviceSynchronize
1.69% 5.8868ms 10 588.68us 532.09us 635.58us cudaLaunchKernel
0.91% 3.1758ms 2 1.5879ms 1.4845ms 1.6912ms cudaFree
0.14% 480.31us 96 5.0030us 2.2080us 119.43us cuDeviceGetAttribute
0.01% 29.026us 1 29.026us 29.026us 29.026us cuDeviceTotalMem
0.01% 24.993us 3 8.3310us 3.0720us 12.704us cuDeviceGetCount
0.00% 7.2960us 2 3.6480us 2.8160us 4.4800us cuDeviceGet
0.00% 6.3680us 1 6.3680us 6.3680us 6.3680us cuDeviceGetName
0.00% 3.2650us 1 3.2650us 3.2650us 3.2650us cuDeviceGetUuid
The profiling shows that I’m averaging 1.14ms to convert 1920x1080 RGBA 32-bit pixel values into three 1920x1080 R/G/B floating-point planes. Assuming I’m already maximizing the parallelism, additional cameras would increase this linearly, so at 4 cameras I’m looking at ~4.6ms. There are going to be subsequent image corrections after this step, so I really need to keep all of this as small as possible.
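For context, my back-of-the-envelope numbers for the data the kernel has to move per frame (treating it as purely bandwidth-bound) look like this:

input:   1920 * 1080 * 4 bytes       = ~8.3 MB   (interleaved RGBA, 8 bits per channel)
output:  1920 * 1080 * 3 * 4 bytes   = ~24.9 MB  (three float planes)
total:                                 ~33.2 MB per frame
33.2 MB / 1.1355 ms                  = ~29 GB/s effective bandwidth

So even taking the kernel time at face value, I’m only getting somewhere around 29GB/s of effective bandwidth, which seems well under what the memory should be able to deliver.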
I get the following throughput results with nvprof:
nvidia@tegra-ubuntu:~/projects/cuda$ sudo /usr/local/cuda/bin/nvprof --metrics gld_throughput,gst_throughput ./rgb-planar --skipnpp --count 10
counts: 10
Block size: 512, block count: 4050
==26554== NVPROF is profiling process 26554, command: ./rgb-planar --skipnpp --count 10
CUDA kernel loop...
==26554== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (2 of 2)...
2 internal events
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
411384,308067,303048,302804,300768,303385,300472,306879,302784,303794
Elapsed time. Min = 300472, Max = 411384, Avg = 314338, Trimmed Mean = 303941
Done!
==26554== Profiling application: ./rgb-planar --skipnpp --count 10
==26554== Profiling result:
==26554== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Xavier (0)"
Kernel: swizzle_channels_and_convert_u8_to_f(int, int*, float*)
10 gld_throughput Global Load Throughput 20.224GB/s 20.542GB/s 20.481GB/s
10 gst_throughput Global Store Throughput 20.224GB/s 20.542GB/s 20.481GB/s
The global load and store throughput of roughly 20GB/s seems relatively low. The specs on the Xavier’s LPDDR4x memory indicate 137GB/s, so I’m off by roughly a factor of 7. I’m not sure what’s going on under the hood when data goes from the CPU to the GPU with unified memory. Is nvprof reporting the right order of magnitude here? Is there anything that can be done to improve the throughput I’m seeing?
I’m wondering if I should be partitioning my input and output memory differently, or if I should somehow constrain how much data the individual threads end up working on. I tried separating my kernel into 3 different kernels, each operating against a single plane, but that did not seem to have any appreciable effect (it actually ended up being slightly slower).
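One other variant I’ve sketched but haven’t benchmarked yet has each thread handle 4 pixels per loop iteration, using a single 128-bit uint4 load and float4 stores, on the theory that wider transactions might help the effective bandwidth. This is untested code, and it assumes pixel_count is a multiple of 4 (1920x1080 is), which also keeps the float4 plane pointers 16-byte aligned:

__global__
void swizzle_channels_and_convert_u8_to_f_vec4(int pixel_count, const uint4 *srcRGBA, float *dstRGBFloatPlanes)
{
    // Each thread processes 4 pixels per iteration: one 128-bit load, three 128-bit stores.
    int quad_count  = pixel_count / 4;
    int quad_index  = blockIdx.x * blockDim.x + threadIdx.x;
    int quad_stride = blockDim.x * gridDim.x;

    float4 *dstR = reinterpret_cast<float4 *>(dstRGBFloatPlanes + pixel_count * 0);
    float4 *dstG = reinterpret_cast<float4 *>(dstRGBFloatPlanes + pixel_count * 1);
    float4 *dstB = reinterpret_cast<float4 *>(dstRGBFloatPlanes + pixel_count * 2);

    for (int quad = quad_index; quad < quad_count; quad += quad_stride)
    {
        uint4 px = srcRGBA[quad];   // 4 consecutive RGBA pixels
        // 0-255 values convert to float exactly, so a plain cast matches __int2float_rd here.
        dstR[quad] = make_float4((float)( px.x        & 0xFF), (float)( px.y        & 0xFF),
                                 (float)( px.z        & 0xFF), (float)( px.w        & 0xFF));
        dstG[quad] = make_float4((float)((px.x >>  8) & 0xFF), (float)((px.y >>  8) & 0xFF),
                                 (float)((px.z >>  8) & 0xFF), (float)((px.w >>  8) & 0xFF));
        dstB[quad] = make_float4((float)((px.x >> 16) & 0xFF), (float)((px.y >> 16) & 0xFF),
                                 (float)((px.z >> 16) & 0xFF), (float)((px.w >> 16) & 0xFF));
    }
}

The launch would compute numBlocks from pixel_count/4 instead of pixel_count. If anyone knows whether this kind of vectorization actually buys anything on the Xavier’s integrated GPU with managed memory, or whether the time is dominated by something else entirely, I’d appreciate the insight.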
Any hints or help will be greatly appreciated!
Thanks,
–tim
rgb-planar.tar (20 KB)