I’m working on a CUDA kernel and am hoping others can confirm the execution numbers I’m seeing and suggest ways to optimize the kernel operations.
I’ve got a benchmarking application where I’m separating simulated RGBA camera data into three R/G/B planes (the alpha channel is skipped) and converting that planar data to floating point. The camera resolution is 1920x1080 with 32 bits per pixel and 8 bits per R/G/B/A channel. The application will eventually be extended to handle a total of 4 cameras, so I’ll be converting 1920x1080*4 pixels; reducing the amount of time these operations take is going to be very important.
The kernel I’ve defined combines the channel separation and floating-point conversion:
__global__
void swizzle_channels_and_convert_u8_to_f(int pixel_count, int *srcRGBA, float *dstRGBFloatPlanes)
{
    int pixel_index = blockIdx.x * blockDim.x + threadIdx.x;
    int pixel_stride = blockDim.x * gridDim.x;
    int in_step;
    int out_step;

    for (int pixel = pixel_index; pixel < pixel_count; pixel += pixel_stride)
    {
        in_step = pixel;
        out_step = pixel;
        dstRGBFloatPlanes[out_step + pixel_count * 0] = __int2float_rd((srcRGBA[in_step] & 0x0000FF) >> 0);  // r
        dstRGBFloatPlanes[out_step + pixel_count * 1] = __int2float_rd((srcRGBA[in_step] & 0x00FF00) >> 8);  // g
        dstRGBFloatPlanes[out_step + pixel_count * 2] = __int2float_rd((srcRGBA[in_step] & 0xFF0000) >> 16); // b
    }
}
I’m using unified memory and allocate both the RGBA data and the floating-point planar data with cudaMallocManaged(). I invoke the kernel like this:
int blockSize = 512;
int numBlocks = (width*height + blockSize - 1) / blockSize;
/* ...other setup code here (random simulated camera data)... eliminated for brevity... */
swizzle_channels_and_convert_u8_to_f<<<numBlocks,blockSize>>>(width*height,(int *)RGBAChannelData,RGBFloatPlaneData);
cudaDeviceSynchronize(); // wait for operation to complete
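For completeness, the allocations (part of the setup code elided above) are essentially the following sketch. Only RGBAChannelData and RGBFloatPlaneData appear in the call above; the other names and exact types here are just for illustration:

size_t pixelCount = (size_t)width * height;                                    // 1920 * 1080

unsigned char *RGBAChannelData = NULL;  // interleaved 8-bit R/G/B/A, 4 bytes per pixel
float *RGBFloatPlaneData = NULL;        // 3 contiguous float planes: all R, then all G, then all B

cudaMallocManaged(&RGBAChannelData, pixelCount * 4 * sizeof(unsigned char));   // ~8.3 MB
cudaMallocManaged(&RGBFloatPlaneData, pixelCount * 3 * sizeof(float));         // ~24.9 MB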
When I run this on the Xavier using nvprof, I get the following execution time results:
nvidia@tegra-ubuntu:~/projects/cuda$ sudo /usr/local/cuda/bin/nvprof ./rgb-planar --skipnpp --count 10
counts: 10
Block size: 512, block count: 4050
==27469== NVPROF is profiling process 27469, command: ./rgb-planar --skipnpp --count 10
==27469== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
CUDA kernel loop...
2560,2462,2022,2058,2458,2427,2384,2070,2310,2443
Elapsed time. Min = 2022, Max = 2560, Avg = 2319, Trimmed Mean = 2326
Done!
==27469== Profiling application: ./rgb-planar --skipnpp --count 10
==27469== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 11.355ms 10 1.1355ms 1.1243ms 1.1919ms swizzle_channels_and_convert_u8_to_f(int, int*, float*)
API calls: 92.43% 322.91ms 2 161.45ms 2.1241ms 320.78ms cudaMallocManaged
4.81% 16.821ms 10 1.6821ms 1.3583ms 1.9267ms cudaDeviceSynchronize
1.69% 5.8868ms 10 588.68us 532.09us 635.58us cudaLaunchKernel
0.91% 3.1758ms 2 1.5879ms 1.4845ms 1.6912ms cudaFree
0.14% 480.31us 96 5.0030us 2.2080us 119.43us cuDeviceGetAttribute
0.01% 29.026us 1 29.026us 29.026us 29.026us cuDeviceTotalMem
0.01% 24.993us 3 8.3310us 3.0720us 12.704us cuDeviceGetCount
0.00% 7.2960us 2 3.6480us 2.8160us 4.4800us cuDeviceGet
0.00% 6.3680us 1 6.3680us 6.3680us 6.3680us cuDeviceGetName
0.00% 3.2650us 1 3.2650us 3.2650us 3.2650us cuDeviceGetUuid
The profiling shows that I’m averaging 1.14ms to convert 1920x1080 RGBA 32-bit pixel values into three 1920x1080 R/G/B floating-point planes. Assuming I’m already maximizing the parallelism, additional cameras would increase this linearly, so at 4 cameras I’m looking at ~4.6ms. There are going to be subsequent image corrections after this step, so I really need to keep all of this as small as possible.
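For context, my back-of-the-envelope numbers for the data the kernel has to move per frame (treating it as purely bandwidth-bound) look like this:

input:   1920 * 1080 * 4 bytes       = ~8.3 MB   (interleaved RGBA, 8 bits per channel)
output:  1920 * 1080 * 3 * 4 bytes   = ~24.9 MB  (three float planes)
total:                                 ~33.2 MB per frame
33.2 MB / 1.1355 ms                  = ~29 GB/s effective bandwidth

So even taking the kernel time at face value, I’m only getting somewhere around 29GB/s of effective bandwidth, which seems well under what the memory should be able to deliver.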
I get the following throughput results with nvprof:
nvidia@tegra-ubuntu:~/projects/cuda$ sudo /usr/local/cuda/bin/nvprof --metrics gld_throughput,gst_throughput ./rgb-planar --skipnpp --count 10
counts: 10
Block size: 512, block count: 4050
==26554== NVPROF is profiling process 26554, command: ./rgb-planar --skipnpp --count 10
CUDA kernel loop...
==26554== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (2 of 2)...
2 internal events
Replaying kernel "swizzle_channels_and_convert_u8_to_f(int, int*, float*)" (done)
411384,308067,303048,302804,300768,303385,300472,306879,302784,303794
Elapsed time. Min = 300472, Max = 411384, Avg = 314338, Trimmed Mean = 303941
Done!
==26554== Profiling application: ./rgb-planar --skipnpp --count 10
==26554== Profiling result:
==26554== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Xavier (0)"
Kernel: swizzle_channels_and_convert_u8_to_f(int, int*, float*)
10 gld_throughput Global Load Throughput 20.224GB/s 20.542GB/s 20.481GB/s
10 gst_throughput Global Store Throughput 20.224GB/s 20.542GB/s 20.481GB/s
The global load and store throughput of roughly 20GB/s seems relatively low. The specs on the Xavier’s LPDDR4x memory indicate 137GB/s, so I’m off by roughly a factor of 7. I’m not sure what’s going on under the hood when data goes from the CPU to the GPU with unified memory. Is nvprof reporting the right order of magnitude here? Is there anything that can be done to improve the throughput I’m seeing?
I’m wondering if I should be partitioning my input and output memory differently, or if I should somehow constrain how much data the individual threads end up working on. I tried separating my kernel into 3 different kernels, each operating against a single plane, but that did not seem to have any appreciable effect (it actually ended up being slightly slower).
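One other variant I’ve sketched but haven’t benchmarked yet has each thread handle 4 pixels per loop iteration, using a single 128-bit uint4 load and float4 stores, on the theory that wider transactions might help the effective bandwidth. This is untested code, and it assumes pixel_count is a multiple of 4 (1920x1080 is), which also keeps the float4 plane pointers 16-byte aligned:

__global__
void swizzle_channels_and_convert_u8_to_f_vec4(int pixel_count, const uint4 *srcRGBA, float *dstRGBFloatPlanes)
{
    // Each thread processes 4 pixels per iteration: one 128-bit load, three 128-bit stores.
    int quad_count  = pixel_count / 4;
    int quad_index  = blockIdx.x * blockDim.x + threadIdx.x;
    int quad_stride = blockDim.x * gridDim.x;

    float4 *dstR = reinterpret_cast<float4 *>(dstRGBFloatPlanes + pixel_count * 0);
    float4 *dstG = reinterpret_cast<float4 *>(dstRGBFloatPlanes + pixel_count * 1);
    float4 *dstB = reinterpret_cast<float4 *>(dstRGBFloatPlanes + pixel_count * 2);

    for (int quad = quad_index; quad < quad_count; quad += quad_stride)
    {
        uint4 px = srcRGBA[quad];   // 4 consecutive RGBA pixels
        // 0-255 values convert to float exactly, so a plain cast matches __int2float_rd here.
        dstR[quad] = make_float4((float)( px.x        & 0xFF), (float)( px.y        & 0xFF),
                                 (float)( px.z        & 0xFF), (float)( px.w        & 0xFF));
        dstG[quad] = make_float4((float)((px.x >>  8) & 0xFF), (float)((px.y >>  8) & 0xFF),
                                 (float)((px.z >>  8) & 0xFF), (float)((px.w >>  8) & 0xFF));
        dstB[quad] = make_float4((float)((px.x >> 16) & 0xFF), (float)((px.y >> 16) & 0xFF),
                                 (float)((px.z >> 16) & 0xFF), (float)((px.w >> 16) & 0xFF));
    }
}

The launch would compute numBlocks from pixel_count/4 instead of pixel_count. If anyone knows whether this kind of vectorization actually buys anything on the Xavier’s integrated GPU with managed memory, or whether the time is dominated by something else entirely, I’d appreciate the insight.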
Any hints or help will be greatly appreciated!
Thanks,
–tim
rgb-planar.tar (20 KB)