Inexplicable CUDA kernel speedup when using tall thin rectangles to process image

I’ve been using this example to apply some real-time graphical effects to a webcam image:

https://docs.nvidia.com/jetson/l4t-multimedia/l4t_mm_v4l2_cam_cuda_group.html

I have 1024 threads running, each on a separate rectangle of the image. I have discovered that if I split the image into a series of tall, thin rectangles (each thread processes one rectangle), the kernel runs MUCH faster than if the rectangles are wide and short.

i.e. this is slow (each rectangle corresponds to one thread and is overlaid on the webcam image to show which pixels that thread processes):

[Image: wide, short rectangles overlaid on the webcam frame]

But this is fast:

[Image: tall, thin rectangles overlaid on the webcam frame]

Why is this? I have read the entire CUDA Best Practices Guide but not once does it mention this.

Hi,
Please share a patch based on 12_camera_v4l2_cuda so that we can reproduce it and do further investigation.

Do you have a 4K webcam like the Logitech Brio?

You will only be able to observe the slowdown when running my patch with a 4K video feed.

Hi,
We have an E-Con See3CAM CU135 which can run in 4K, but we don’t observe the issue. So you can reproduce it without any patch on 12_camera_v4l2_cuda?

There are two patches: one runs at 11 fps, the other at 25 fps. The 11 fps patch splits the image across a block of 32x32 threads; the 25 fps patch splits it across a block of 1024x1 threads (columns x rows).

--- NvAnalysis.cu       2020-05-20 15:20:35.108022675 +0100
+++ NvAnalysis_11fps.cu 2020-05-20 15:06:27.899181576 +0100
@@ -35,12 +35,23 @@
 __global__ void
 addLabelsKernel(int *pDevPtr, int pitch)
 {
-    int row = blockIdx.y * blockDim.y + threadIdx.y + BOX_H;
-    int col = blockIdx.x * blockDim.x + threadIdx.x + BOX_W;
-    char *pElement = (char *)pDevPtr + row * pitch + col;
-
-    pElement[0] = 0;
+    int row = blockIdx.y * blockDim.y + threadIdx.y;
+    int col = blockIdx.x * blockDim.x + threadIdx.x;
+    int row_pixel = row * (2048/32);   // each thread's tile starts on its own 64-row band
+    int col_pixel = col * (4096/32);   // neighbouring threads in x start 128 bytes apart
+    char *pElement = (char *)pDevPtr;
+    int offset;
 
+    for(int x_offset = 0; x_offset < (4096/32); x_offset++)
+    {
+        for(int y_offset = 0; y_offset < (2048/32); y_offset++)
+        {
+            offset = ((row_pixel + y_offset) * pitch) + col_pixel + x_offset;
+            
+            pElement[offset] = (pElement[offset] * pElement[offset]) & 0xFF;   // arbitrary per-pixel work: square each byte in place
+        }
+    }
+    
     return;
 }

And the other patch:

--- NvAnalysis.cu       2020-05-20 15:20:35.108022675 +0100
+++ NvAnalysis_25fps.cu 2020-05-20 15:18:05.852321754 +0100
@@ -35,19 +35,30 @@
 __global__ void
 addLabelsKernel(int *pDevPtr, int pitch)
 {
-    int row = blockIdx.y * blockDim.y + threadIdx.y + BOX_H;
-    int col = blockIdx.x * blockDim.x + threadIdx.x + BOX_W;
-    char *pElement = (char *)pDevPtr + row * pitch + col;
-
-    pElement[0] = 0;
+    int row = blockIdx.y * blockDim.y + threadIdx.y;
+    int col = blockIdx.x * blockDim.x + threadIdx.x;
+    int row_pixel = row * (2048);      // blockDim.y is 1, so row_pixel is always 0
+    int col_pixel = col * (4);         // neighbouring threads start only 4 bytes apart
+    char *pElement = (char *)pDevPtr;
+    int offset;
 
+    for(int x_offset = 0; x_offset < (4); x_offset++)
+    {
+        for(int y_offset = 0; y_offset < (2048); y_offset++)
+        {
+            offset = ((row_pixel + y_offset) * pitch) + col_pixel + x_offset;
+            
+            pElement[offset] = (pElement[offset] * pElement[offset]) & 0xFF;
+        }
+    }
+    
     return;
 }
 
 int
 addLabels(CUdeviceptr pDevPtr, int pitch)
 {
-    dim3 threadsPerBlock(BOX_W, BOX_H);
+    dim3 threadsPerBlock(1024, 1);
     dim3 blocks(1,1);
 
     addLabelsKernel<<<blocks,threadsPerBlock>>>((int *)pDevPtr, pitch);

Apply these patches with:
cd mmapi_samples/samples/common/algorithm/cuda/
patch < name.patch
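
If you don’t have a 4K camera, the two access patterns can also be exercised on a synthetic buffer. Below is a minimal standalone sketch (my own harness, not part of the MMAPI sample; it assumes a 4096x2048 single-byte plane with pitch equal to the width, and the file name standalone_repro.cu is hypothetical):

// standalone_repro.cu -- hypothetical harness, not part of the MMAPI sample.
// Exercises both kernel layouts on a synthetic 4096x2048 single-byte plane
// (pitch assumed equal to the width), so no camera is needed.
#include <cstdio>
#include <cuda_runtime.h>

#define WIDTH  4096
#define HEIGHT 2048

// 11fps layout: each thread of one 32x32 block walks a 128x64 tile, so
// neighbouring threads in a warp start 128 bytes apart.
__global__ void tileKernel(char *p, int pitch)
{
    int row_pixel = (blockIdx.y * blockDim.y + threadIdx.y) * (HEIGHT / 32);
    int col_pixel = (blockIdx.x * blockDim.x + threadIdx.x) * (WIDTH / 32);
    for (int x = 0; x < WIDTH / 32; x++)
        for (int y = 0; y < HEIGHT / 32; y++) {
            int off = (row_pixel + y) * pitch + col_pixel + x;
            p[off] = (p[off] * p[off]) & 0xFF;   // same per-pixel work as the patches
        }
}

// 25fps layout: each thread of one 1024x1 block walks a 4x2048 column, so
// neighbouring threads in a warp start only 4 bytes apart.
__global__ void columnKernel(char *p, int pitch)
{
    int col_pixel = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    for (int x = 0; x < 4; x++)
        for (int y = 0; y < HEIGHT; y++) {
            int off = y * pitch + col_pixel + x;
            p[off] = (p[off] * p[off]) & 0xFF;
        }
}

// Single-block launches, matching the sample's blocks(1,1) configuration.
static void launchTile(char *p, int pitch)   { tileKernel<<<1, dim3(32, 32)>>>(p, pitch); }
static void launchColumn(char *p, int pitch) { columnKernel<<<1, dim3(1024, 1)>>>(p, pitch); }

// Time a single launch with CUDA events.
static float timeKernel(void (*launch)(char *, int), char *p)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch(p, WIDTH);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    char *p;
    cudaMalloc((void **)&p, (size_t)WIDTH * HEIGHT);
    cudaMemset(p, 1, (size_t)WIDTH * HEIGHT);

    timeKernel(launchTile, p);   // warm-up
    printf("32x32 tiles : %.2f ms\n", timeKernel(launchTile, p));
    printf("1024x1 cols : %.2f ms\n", timeKernel(launchColumn, p));

    cudaFree(p);
    return 0;
}

Build with nvcc -O2 standalone_repro.cu; the 32x32 tile layout should come out markedly slower than the 1024x1 column layout, without any camera involved.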

Why is there such a large difference in speed? Why is this not discussed in the CUDA optimisation guide?

Hi,
We will try to reproduce the issue and investigate further.


Thanks. BTW, the command line I use to run the program is:

./camera_v4l2_cuda -d /dev/video0 -s 4096x2160 -f MJPEG -c

The Logitech Brio at 4K uses MJPEG; there is no raw option.

Hi,

This looks like a memory coalescing problem: that particular access pattern can cause a performance drop.
Here is an introductory video on memory coalescing and strided access.

Here is the profiling data for strided access on a desktop GPU.
You can see the bandwidth drop from roughly 70 GB/s to 10 GB/s as the stride increases from 1 to 8.

It looks like the stride between neighbouring threads is 128 bytes (4096/32) in patch 1 but only 4 bytes in patch 2. With a 128-byte stride, the 32 threads of a warp each land in a different memory segment, so every warp access is serviced by many separate transactions; with a 4-byte stride, a warp's accesses fall within a few contiguous segments and coalesce into far fewer transactions. This may explain why the performance of patch 1 is so much lower.
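
For reference, here is a minimal sketch of such a stride experiment (an illustration only; the file name, sizes and launch parameters are my assumptions, not the setup behind the numbers above). Each thread copies one float, and neighbouring threads access elements stride apart:

// stride_bw.cu -- illustrative sketch; the numbers it prints will vary by GPU.
#include <cstdio>
#include <cuda_runtime.h>

// Neighbouring threads access elements stride apart, so a warp's 32 loads
// span 32 * stride * 4 bytes instead of one contiguous 128-byte segment.
__global__ void strideCopy(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}

int main()
{
    const int nThreads  = 1 << 20;   // 1M threads per launch
    const int maxStride = 32;
    size_t bytes = (size_t)nThreads * maxStride * sizeof(float);

    float *in, *out;
    cudaMalloc((void **)&in, bytes);
    cudaMalloc((void **)&out, bytes);
    cudaMemset(in, 0, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    for (int stride = 1; stride <= maxStride; stride *= 2) {
        strideCopy<<<nThreads / 256, 256>>>(in, out, stride);   // warm-up
        cudaEventRecord(t0);
        strideCopy<<<nThreads / 256, 256>>>(in, out, stride);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0;
        cudaEventElapsedTime(&ms, t0, t1);
        double gb = 2.0 * nThreads * sizeof(float) / 1e9;        // read + write
        printf("stride %2d : %6.1f GB/s\n", stride, gb / (ms / 1e3));
    }

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(in);
    cudaFree(out);
    return 0;
}

On most GPUs the coalesced stride-1 case should come out several times faster than stride 8 and beyond, which is the same effect the two patches are showing.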
Thanks.