CUDA kernel and Xavier performance

I’m working on a CUDA kernel and am hoping others can confirm the execution numbers I’m seeing and suggest ways to optimize my kernel operations.

I’ve got a benchmarking application where I’m separating simulated RGBA camera data into 3 R/G/B planes (the alpha channel is skipped) and converting the planar data to floating point. The camera resolution is 1920x1080 with 32 bits per pixel and 8 bits per R/G/B/A channel. The application will eventually be extended to handle a total of 4 cameras, so I’ll be converting 1920x1080*4 pixels – reducing the amount of time these operations take is going to be very important.

The kernel I’ve defined combines the channel separation and floating-point conversion:

__global__
void swizzle_channels_and_convert_u8_to_f(int pixel_count, int *srcRGBA, float *dstRGBFloatPlanes)
{
    // Grid-stride loop: each thread walks the image in steps of the total thread count.
    int pixel_index = blockIdx.x * blockDim.x + threadIdx.x;
    int pixel_stride = blockDim.x * gridDim.x;

    for (int pixel = pixel_index; pixel < pixel_count; pixel += pixel_stride)
    {
        // One packed 32-bit RGBA pixel: R in the low byte, then G, then B;
        // the high (alpha) byte is discarded.
        int rgba = srcRGBA[pixel];

        dstRGBFloatPlanes[pixel + pixel_count*0] = __int2float_rd((rgba & 0x0000FF) >> 0);  // r
        dstRGBFloatPlanes[pixel + pixel_count*1] = __int2float_rd((rgba & 0x00FF00) >> 8);  // g
        dstRGBFloatPlanes[pixel + pixel_count*2] = __int2float_rd((rgba & 0xFF0000) >> 16); // b
    }
}

I’m using unified memory, declaring both the RGBA input and the floating-point planar output with cudaMallocManaged().
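The allocations look roughly like this (a sketch; the exact code is in the attached tarball, and the buffer names match the launch snippet below):

int width = 1920, height = 1080;
int pixel_count = width * height;

unsigned char *RGBAChannelData; // packed 8-bit RGBA input, 4 bytes per pixel
float *RGBFloatPlaneData;       // output: 3 contiguous float planes (R, then G, then B)

cudaMallocManaged(&RGBAChannelData, (size_t)pixel_count * 4);
cudaMallocManaged(&RGBFloatPlaneData, (size_t)pixel_count * 3 * sizeof(float));

I invoke the kernel in this way: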

int blockSize = 512;
int numBlocks = (width*height + blockSize - 1) / blockSize; // round up so every pixel is covered

/* ...other setup code here (random simulated camera data)... eliminated for brevity... */

swizzle_channels_and_convert_u8_to_f<<<numBlocks,blockSize>>>(width*height, (int *)RGBAChannelData, RGBFloatPlaneData);
cudaDeviceSynchronize(); // wait for the kernel to complete
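For context, the “Elapsed time” numbers my app prints below are wall-clock microseconds measured around the launch + synchronize pair, roughly like this (a sketch; the real harness is in the attached tarball):

#include <chrono>
#include <cstdio>

auto t0 = std::chrono::steady_clock::now();
swizzle_channels_and_convert_u8_to_f<<<numBlocks,blockSize>>>(width*height, (int *)RGBAChannelData, RGBFloatPlaneData);
cudaDeviceSynchronize();
auto t1 = std::chrono::steady_clock::now();
printf("%lld", (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());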

When I run this on the Xavier using nvprof, I get the following execution time results:

nvidia@tegra-ubuntu:~/projects/cuda$ sudo /usr/local/cuda/bin/nvprof ./rgb-planar --skipnpp --count 10
counts: 10
Block size: 512, block count: 4050
==27469== NVPROF is profiling process 27469, command: ./rgb-planar --skipnpp --count 10
==27469== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
CUDA kernel loop...


2560,2462,2022,2058,2458,2427,2384,2070,2310,2443
Elapsed time.  Min = 2022, Max = 2560, Avg =  2319, Trimmed Mean = 2326
Done!
==27469== Profiling application: ./rgb-planar --skipnpp --count 10
==27469== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  11.355ms        10  1.1355ms  1.1243ms  1.1919ms  swizzle_channels_and_convert_u8_to_f(int, int*, float*)
      API calls:   92.43%  322.91ms         2  161.45ms  2.1241ms  320.78ms  cudaMallocManaged
                    4.81%  16.821ms        10  1.6821ms  1.3583ms  1.9267ms  cudaDeviceSynchronize
                    1.69%  5.8868ms        10  588.68us  532.09us  635.58us  cudaLaunchKernel
                    0.91%  3.1758ms         2  1.5879ms  1.4845ms  1.6912ms  cudaFree
                    0.14%  480.31us        96  5.0030us  2.2080us  119.43us  cuDeviceGetAttribute
                    0.01%  29.026us         1  29.026us  29.026us  29.026us  cuDeviceTotalMem
                    0.01%  24.993us         3  8.3310us  3.0720us  12.704us  cuDeviceGetCount
                    0.00%  7.2960us         2  3.6480us  2.8160us  4.4800us  cuDeviceGet
                    0.00%  6.3680us         1  6.3680us  6.3680us  6.3680us  cuDeviceGetName
                    0.00%  3.2650us         1  3.2650us  3.2650us  3.2650us  cuDeviceGetUuid

The profiling shows that I’m averaging 1.14ms to convert 1920x1080 32-bit R/G/B/A pixel values into 1920x1080*3 R/G/B floating-point planes. Assuming I’m already maximizing the parallel operations, additional cameras will scale this linearly, so at 4 cameras I’m looking at ~4.6ms. There are going to be subsequent image corrections as well, so I really need to keep all of this as small as possible.

I get the following throughput results with nvprof:

nvidia@tegra-ubuntu:~/projects/cuda$ sudo /usr/local/cuda/bin/nvprof --metrics gld_throughput,gst_throughput ./rgb-planar --skipnpp --count 10
counts: 10
Block size: 512, block count: 4050
==26554== NVPROF is profiling process 26554, command: ./rgb-planar --skipnpp --count 10
CUDA kernel loop...
==26554== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (2 of 2)...
        2 internal events
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
411384,308067,303048,302804,300768,303385,300472,306879,302784,303794
Elapsed time.  Min = 300472, Max = 411384, Avg =  314338, Trimmed Mean = 303941
Done!
==26554== Profiling application: ./rgb-planar --skipnpp --count 10
==26554== Profiling result:
==26554== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Xavier (0)"
    Kernel: swizzle_channels_and_convert_u8_to_f(int, int*, float*)
         10                            gld_throughput                    Global Load Throughput  20.224GB/s  20.542GB/s  20.481GB/s
         10                            gst_throughput                   Global Store Throughput  20.224GB/s  20.542GB/s  20.481GB/s

The global load and store throughput of ~20.5GB/s seems relatively low. The specs on the Xavier’s LPDDR4x memory indicate 137GB/s, so I’m off by roughly a factor of 7. I’m not sure what’s going on under the hood when data moves between the CPU and GPU – is nvprof returning the right order of magnitude here? Is there anything that can be done to improve the throughput I’m seeing?
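As a sanity check on the order of magnitude, here’s my back-of-the-envelope arithmetic for what the kernel actually moves per frame (my own numbers, assuming the 1.1355ms nvprof average above):

// Rough effective-bandwidth estimate per 1920x1080 frame (plain arithmetic, not nvprof's counters):
double bytes_loaded = 1920.0 * 1080.0 * 4.0;       // packed RGBA in:     ~8.3 MB
double bytes_stored = 1920.0 * 1080.0 * 3.0 * 4.0; // 3 float planes out: ~24.9 MB
double seconds      = 1.1355e-3;                   // avg kernel time from nvprof
double aggregate_gbps = (bytes_loaded + bytes_stored) / seconds / 1e9; // ~29 GB/s

That comes out to ~33MB per frame at roughly 29GB/s aggregate – not the same number nvprof reports, but the same conclusion: well under the 137GB/s spec.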

I’m wondering if I should be partitioning my input and output memory differently, or if I should constrain how much data each thread works on. I tried splitting my kernel into 3 separate kernels, each operating on a single plane, but that had no appreciable effect (it ended up slightly slower).
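For illustration, one variation I’ve been considering (untested – just to make “partitioning differently” concrete) reads each packed pixel as a uchar4 so the masks and shifts go away:

// Hypothetical variant, untested: same grid-stride loop, but each packed
// pixel is read as a uchar4 and the bytes are converted directly.
__global__
void swizzle_u8_to_f_uchar4(int pixel_count, const uchar4 *srcRGBA, float *dstRGBFloatPlanes)
{
    int stride = blockDim.x * gridDim.x;
    for (int p = blockIdx.x * blockDim.x + threadIdx.x; p < pixel_count; p += stride)
    {
        uchar4 px = srcRGBA[p];                             // one coalesced 32-bit load
        dstRGBFloatPlanes[p + pixel_count*0] = (float)px.x; // r
        dstRGBFloatPlanes[p + pixel_count*1] = (float)px.y; // g
        dstRGBFloatPlanes[p + pixel_count*2] = (float)px.z; // b (px.w / alpha skipped)
    }
}

I have no idea yet whether that would make a measurable difference.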

Any hints or help will be greatly appreciated!

Thanks,
–tim
rgb-planar.tar (20 KB)

Hi,

Just in case you haven’t tried it yet: have you maximized the device performance first?

sudo nvpmodel -m 0
sudo jetson_clocks

Thanks.

Thanks for the response! Yes, I’ve maxed the performance via nvpmodel and jetson_clocks. I haven’t seen any appreciable difference in bandwidth from doing that.

–tim

Hi,

We tested your application, but the performance we see is much better than what you reported.

nvidia@xavier:~/topic_1072384$ ./rgb-planar --skipnpp --count 10
counts: 10 
Block size: 512, block count: 4050
CUDA kernel loop...


1286,1136,1113,1096,1091,1078,1110,1099,1135,1091
Elapsed time.  Min = 1078, Max = 1286, Avg =  1123, Trimmed Mean = 1108
Done!

Are we missing anything?
Could you try the sample again after a clean reboot?

For reference, we are using JetPack 4.3.
Thanks.

Thanks for going the extra mile and giving it a try!

Curious that your numbers are different from mine. I’m running JetPack 4.2. I’ll re-run things and confirm, and I’ll also update to the latest JetPack version to see what difference that makes.

Can you tell me what ‘uname -a’ reports for you? I’d also like to know what kernel version you’re running.

Thanks again!
–tim

Hi,

We reflashed the system and installed all the packages from JetPack 4.3.

nvidia@xavier:~$ uname -a
Linux xavier 4.9.140-tegra #1 SMP PREEMPT Mon Dec 9 22:52:02 PST 2019 aarch64 aarch64 aarch64 GNU/Linux

Thanks.

Using JetPack 4.2 and confirming via “nvpmodel -q”, I was able to determine that I was actually operating at 30W (mode 3), rather than MAXN as I’d thought. Also, it turns out the order of execution is critical: after running “nvpmodel”, you’ll want to re-run “jetson_clocks”.

See below for my latest results:

nvidia@tegra-ubuntu:~$ sudo nvpmodel -q
NV Power Mode: MODE_30W_ALL
3
nvidia@tegra-ubuntu:~$ cd projects/rgb-planar/
nvidia@tegra-ubuntu:~/projects/rgb-planar$ ./rgb-planar --skipnpp --count 10
counts: 10 
Block size: 512, block count: 4050
CUDA kernel loop...


2544,2433,2316,2404,2387,2384,2460,2384,2478,2400
Elapsed time.  Min = 2316, Max = 2544, Avg =  2419, Trimmed Mean = 2416
Done!
nvidia@tegra-ubuntu:~/projects/rgb-planar$ sudo nvpmodel -m 0
nvidia@tegra-ubuntu:~/projects/rgb-planar$ ./rgb-planar --skipnpp --count 10
counts: 10 
Block size: 512, block count: 4050
CUDA kernel loop...


2012,1946,1911,1946,1936,1918,1930,1935,1944,1933
Elapsed time.  Min = 1911, Max = 2012, Avg =  1941, Trimmed Mean = 1936
Done!
nvidia@tegra-ubuntu:~/projects/rgb-planar$ sudo jetson_clocks 
nvidia@tegra-ubuntu:~/projects/rgb-planar$ ./rgb-planar --skipnpp --count 10
counts: 10 
Block size: 512, block count: 4050
CUDA kernel loop...


1013,908,933,875,915,886,902,939,877,899
Elapsed time.  Min = 875, Max = 1013, Avg =  914, Trimmed Mean = 907
Done!
nvidia@tegra-ubuntu:~/projects/rgb-planar$ sudo nvpmodel -m 3
nvidia@tegra-ubuntu:~/projects/rgb-planar$ ./rgb-planar --skipnpp --count 10
counts: 10 
Block size: 512, block count: 4050
CUDA kernel loop...


2143,3430,2080,2120,2042,2011,2051,2089,2180,2125
Elapsed time.  Min = 2011, Max = 3430, Avg =  2227, Trimmed Mean = 2103
Done!
nvidia@tegra-ubuntu:~/projects/rgb-planar$ sudo jetson_clocks 
nvidia@tegra-ubuntu:~/projects/rgb-planar$ ./rgb-planar --skipnpp --count 10
counts: 10 
Block size: 512, block count: 4050
CUDA kernel loop...


1394,1254,1182,1309,1213,1223,1179,1167,1276,1261
Elapsed time.  Min = 1167, Max = 1394, Avg =  1245, Trimmed Mean = 1237
Done!
nvidia@tegra-ubuntu:~/projects/rgb-planar$

Thanks for the help and making me realize I wasn’t running optimally!

–tim

Hi,

Yes. By default, nvpmodel leaves the GPU clock dynamic,
so you will need to pin it at the maximum with jetson_clocks.
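You can confirm what the clocks are actually set to with:

sudo jetson_clocks --show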

Here is a related topic for your reference:
https://devtalk.nvidia.com/default/topic/1030506/jetson-tx2/nvpmodel-and-jetson_clocks/

Thanks.