Inexplicable CUDA kernel speedup when using tall thin rectangles to process image

I’ve been using this example to apply some real-time graphical effects to a webcam image:

https://docs.nvidia.com/jetson/l4t-multimedia/l4t_mm_v4l2_cam_cuda_group.html

I have 1024 threads running, each on a separate rectangle of the image. I have discovered that if I split the image into a series of tall, thin rectangles (each thread processes one rectangle), the kernel runs MUCH faster than if the rectangles are wide and short.

i.e. this is slow (each rectangle corresponds to one thread and is overlaid on the webcam image to show which pixels that thread processes):

[Image: wide, short rectangles overlaid on the webcam frame]

But this is fast:

[Image: tall, thin rectangles overlaid on the webcam frame]

Why is this? I have read the entire CUDA Best Practices Guide but not once does it mention this.

Hi,
Please share a patch based on 12_camera_v4l2_cuda so that we can reproduce it and do further investigation.

Do you have a 4K webcam like the Logitech Brio?

You will only be able to observe the slowdown when running my patch with a 4K video feed.

Hi,
We have an E-Con See3CAM CU135 which can run in 4K, but we don’t observe the issue. So you can reproduce it without any patch on 12_camera_v4l2_cuda?

There are two patches: one runs at 11 fps, the other at 25 fps. The 11 fps patch splits the image across a block of 32x32 threads; the 25 fps patch splits it across a block of 1024x1 threads (columns x rows).

--- NvAnalysis.cu       2020-05-20 15:20:35.108022675 +0100
+++ NvAnalysis_11fps.cu 2020-05-20 15:06:27.899181576 +0100
@@ -35,12 +35,23 @@
 __global__ void
 addLabelsKernel(int *pDevPtr, int pitch)
 {
-    int row = blockIdx.y * blockDim.y + threadIdx.y + BOX_H;
-    int col = blockIdx.x * blockDim.x + threadIdx.x + BOX_W;
-    char *pElement = (char *)pDevPtr + row * pitch + col;
-
-    pElement[0] = 0;
+    int row = blockIdx.y * blockDim.y + threadIdx.y;
+    int col = blockIdx.x * blockDim.x + threadIdx.x;
+    int row_pixel = row * (2048/32);   // each thread's tile starts on its own 64-row band
+    int col_pixel = col * (4096/32);   // neighbouring threads in x start 128 bytes apart
+    char *pElement = (char *)pDevPtr;
+    int offset;
 
+    for(int x_offset = 0; x_offset < (4096/32); x_offset++)
+    {
+        for(int y_offset = 0; y_offset < (2048/32); y_offset++)
+        {
+            offset = ((row_pixel + y_offset) * pitch) + col_pixel + x_offset;
+            
+            pElement[offset] = (pElement[offset] * pElement[offset]) & 0xFF;   // arbitrary per-pixel work: square each byte in place
+        }
+    }
+    
     return;
 }

And the other patch:

--- NvAnalysis.cu       2020-05-20 15:20:35.108022675 +0100
+++ NvAnalysis_25fps.cu 2020-05-20 15:18:05.852321754 +0100
@@ -35,19 +35,30 @@
 __global__ void
 addLabelsKernel(int *pDevPtr, int pitch)
 {
-    int row = blockIdx.y * blockDim.y + threadIdx.y + BOX_H;
-    int col = blockIdx.x * blockDim.x + threadIdx.x + BOX_W;
-    char *pElement = (char *)pDevPtr + row * pitch + col;
-
-    pElement[0] = 0;
+    int row = blockIdx.y * blockDim.y + threadIdx.y;
+    int col = blockIdx.x * blockDim.x + threadIdx.x;
+    int row_pixel = row * (2048);      // blockDim.y is 1, so row_pixel is always 0
+    int col_pixel = col * (4);         // neighbouring threads start only 4 bytes apart
+    char *pElement = (char *)pDevPtr;
+    int offset;
 
+    for(int x_offset = 0; x_offset < (4); x_offset++)
+    {
+        for(int y_offset = 0; y_offset < (2048); y_offset++)
+        {
+            offset = ((row_pixel + y_offset) * pitch) + col_pixel + x_offset;
+            
+            pElement[offset] = (pElement[offset] * pElement[offset]) & 0xFF;
+        }
+    }
+    
     return;
 }
 
 int
 addLabels(CUdeviceptr pDevPtr, int pitch)
 {
-    dim3 threadsPerBlock(BOX_W, BOX_H);
+    dim3 threadsPerBlock(1024, 1);
     dim3 blocks(1,1);
 
     addLabelsKernel<<<blocks,threadsPerBlock>>>((int *)pDevPtr, pitch);

Apply these patches with:
cd mmapi_samples/samples/common/algorithm/cuda/
patch < name.patch
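
If you don’t have a 4K camera, the two access patterns can also be exercised on a synthetic buffer. Below is a minimal standalone sketch (my own harness, not part of the MMAPI sample; it assumes a 4096x2048 single-byte plane with pitch equal to the width, and the file name standalone_repro.cu is hypothetical):

// standalone_repro.cu -- hypothetical harness, not part of the MMAPI sample.
// Exercises both kernel layouts on a synthetic 4096x2048 single-byte plane
// (pitch assumed equal to the width), so no camera is needed.
#include <cstdio>
#include <cuda_runtime.h>

#define WIDTH  4096
#define HEIGHT 2048

// 11fps layout: each thread of one 32x32 block walks a 128x64 tile, so
// neighbouring threads in a warp start 128 bytes apart.
__global__ void tileKernel(char *p, int pitch)
{
    int row_pixel = (blockIdx.y * blockDim.y + threadIdx.y) * (HEIGHT / 32);
    int col_pixel = (blockIdx.x * blockDim.x + threadIdx.x) * (WIDTH / 32);
    for (int x = 0; x < WIDTH / 32; x++)
        for (int y = 0; y < HEIGHT / 32; y++) {
            int off = (row_pixel + y) * pitch + col_pixel + x;
            p[off] = (p[off] * p[off]) & 0xFF;   // same per-pixel work as the patches
        }
}

// 25fps layout: each thread of one 1024x1 block walks a 4x2048 column, so
// neighbouring threads in a warp start only 4 bytes apart.
__global__ void columnKernel(char *p, int pitch)
{
    int col_pixel = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    for (int x = 0; x < 4; x++)
        for (int y = 0; y < HEIGHT; y++) {
            int off = y * pitch + col_pixel + x;
            p[off] = (p[off] * p[off]) & 0xFF;
        }
}

// Single-block launches, matching the sample's blocks(1,1) configuration.
static void launchTile(char *p, int pitch)   { tileKernel<<<1, dim3(32, 32)>>>(p, pitch); }
static void launchColumn(char *p, int pitch) { columnKernel<<<1, dim3(1024, 1)>>>(p, pitch); }

// Time a single launch with CUDA events.
static float timeKernel(void (*launch)(char *, int), char *p)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch(p, WIDTH);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    char *p;
    cudaMalloc((void **)&p, (size_t)WIDTH * HEIGHT);
    cudaMemset(p, 1, (size_t)WIDTH * HEIGHT);

    timeKernel(launchTile, p);   // warm-up
    printf("32x32 tiles : %.2f ms\n", timeKernel(launchTile, p));
    printf("1024x1 cols : %.2f ms\n", timeKernel(launchColumn, p));

    cudaFree(p);
    return 0;
}

Build with nvcc -O2 standalone_repro.cu; the 32x32 tile layout should come out markedly slower than the 1024x1 column layout, without any camera involved.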

Why is there such a large difference in speed? Why is this not discussed in the CUDA optimisation guide?

Hi,
We will try to reproduce the issue and investigate further.


Thanks. BTW, the command line I use to run the program is:

./camera_v4l2_cuda -d /dev/video0 -s 4096x2160 -f MJPEG -c

The Logitech Brio at 4K uses MJPEG; there is no raw option.

Hi,

This looks like a memory coalescing problem: that particular access pattern can cause a performance drop.
Here is an introductory video on memory coalescing and strided access.

Here is the profiling data for strided access on a desktop GPU.
You can see the bandwidth drop from roughly 70 GB/s to 10 GB/s as the stride increases from 1 to 8.

It looks like the stride between neighbouring threads is 128 bytes (4096/32) in patch 1 but only 4 bytes in patch 2. With a 128-byte stride, the 32 threads of a warp each land in a different memory segment, so every warp access is serviced by many separate transactions; with a 4-byte stride, a warp's accesses fall within a few contiguous segments and coalesce into far fewer transactions. This may explain why the performance of patch 1 is so much lower.
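
For reference, here is a minimal sketch of such a stride experiment (an illustration only; the file name, sizes and launch parameters are my assumptions, not the setup behind the numbers above). Each thread copies one float, and neighbouring threads access elements stride apart:

// stride_bw.cu -- illustrative sketch; the numbers it prints will vary by GPU.
#include <cstdio>
#include <cuda_runtime.h>

// Neighbouring threads access elements stride apart, so a warp's 32 loads
// span 32 * stride * 4 bytes instead of one contiguous 128-byte segment.
__global__ void strideCopy(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}

int main()
{
    const int nThreads  = 1 << 20;   // 1M threads per launch
    const int maxStride = 32;
    size_t bytes = (size_t)nThreads * maxStride * sizeof(float);

    float *in, *out;
    cudaMalloc((void **)&in, bytes);
    cudaMalloc((void **)&out, bytes);
    cudaMemset(in, 0, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    for (int stride = 1; stride <= maxStride; stride *= 2) {
        strideCopy<<<nThreads / 256, 256>>>(in, out, stride);   // warm-up
        cudaEventRecord(t0);
        strideCopy<<<nThreads / 256, 256>>>(in, out, stride);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0;
        cudaEventElapsedTime(&ms, t0, t1);
        double gb = 2.0 * nThreads * sizeof(float) / 1e9;        // read + write
        printf("stride %2d : %6.1f GB/s\n", stride, gb / (ms / 1e3));
    }

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(in);
    cudaFree(out);
    return 0;
}

On most GPUs the coalesced stride-1 case should come out several times faster than stride 8 and beyond, which is the same effect the two patches are showing.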
Thanks.