My aim is to split a 1280x960 image of ints into small 4x4 regions and find the maximum pixel value in each 4x4 region (with a margin around the edge of the overall image). There is a second processing stage on each 4x4 region but I omit that for now. I hoped CUDA (running on a Jetson TX1) would help speed this up…
My initial design is as follows:
Split the image up into 16x30 overlapping blocks of 96x38 pixels. So 480 blocks in total – each processing 3648 pixels.
Then I used 20x8 threads (160 in total) to load the 3648 pixels into shared memory, i.e. 3648 x 4 = 14592 bytes. Each block can then be split into 160 4x4 regions, so each thread finds the maximum in its own 4x4 region.
I thought this would work well because the memory loads can be coalesced (96 is a multiple of 32), all 160 threads can be used for both loading and reducing with hardly any sitting idle for long, and the total thread count is a multiple of 32.
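In case it helps to be concrete, here is a simplified sketch of that kernel layout (not my actual code: the names, the 80x32 tile stride implied by the 16x30 grid, and the lack of margin/boundary handling are just for illustration):

```
// Simplified sketch of the design described above (illustrative only).
// Each block loads a 96x38 tile into shared memory, then each of the
// 20x8 threads reduces its own 4x4 region to a maximum.
#define TILE_W    96
#define TILE_H    38
#define THREADS_X 20
#define THREADS_Y 8

__global__ void maxPool4x4(const int *src, int *dst, int srcPitch, int dstPitch)
{
    __shared__ int tile[TILE_H][TILE_W];            // 3648 ints = 14592 bytes

    // Top-left corner of this block's (overlapping) tile in the source image.
    // The margin offset and edge clamping are omitted here.
    const int tileX = blockIdx.x * (THREADS_X * 4); // 80-pixel stride
    const int tileY = blockIdx.y * (THREADS_Y * 4); // 32-pixel stride

    // 160 threads cooperatively load the 3648 pixels (coalesced row by row).
    const int tid      = threadIdx.y * THREADS_X + threadIdx.x;
    const int nThreads = THREADS_X * THREADS_Y;
    for (int i = tid; i < TILE_W * TILE_H; i += nThreads) {
        int y = i / TILE_W;
        int x = i % TILE_W;
        tile[y][x] = src[(tileY + y) * srcPitch + (tileX + x)];
    }
    __syncthreads();

    // Each thread finds the maximum of its own 4x4 region.
    const int rx = threadIdx.x * 4;
    const int ry = threadIdx.y * 4;
    int m = tile[ry][rx];
    for (int dy = 0; dy < 4; ++dy)
        for (int dx = 0; dx < 4; ++dx)
            m = max(m, tile[ry + dy][rx + dx]);

    // One result per 4x4 region.
    dst[(blockIdx.y * THREADS_Y + threadIdx.y) * dstPitch
        + blockIdx.x * THREADS_X + threadIdx.x] = m;
}
```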
I have this design running, and a reference CPU version running as well. Unfortunately, the CPU version is twice as fast as the GPU version… A bit of profiling reports low occupancy, which might be down to using too much shared memory per block. Could anyone perhaps elaborate on this, or suggest a better design?
The Jetson TX1 uses the same physical memory to serve both CPU and GPU, I would think? And this appears to be a task that is bound by memory throughput since there is very little computation. Assuming the CPU and the GPU access the memory with the same throughput (I don’t know whether that is true, worth measuring with the STREAM benchmark), I would expect no speedup from doing the processing on the GPU.
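If you want a quick ballpark figure for device memory throughput without setting up STREAM, something along these lines should do (a rough sketch using CUDA event timing; the buffer size and repetition count are arbitrary):

```
// Rough device-to-device bandwidth check (sketch, not a rigorous benchmark).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64 * 1024 * 1024;     // 64 MB per buffer (arbitrary)
    const int    reps  = 20;

    int *d_src, *d_dst;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each copy reads and writes 'bytes', hence the factor of 2.
    double gbps = 2.0 * bytes * reps / (ms * 1e-3) / 1e9;
    printf("approx. device memory throughput: %.1f GB/s\n", gbps);

    cudaFree(d_src);
    cudaFree(d_dst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```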
You don’t say how low the occupancy is, but given the shared memory usage per thread block, it is certainly going to limit occupancy somewhat. Try cutting shared memory usage per thread block in half. Also, double check the access patterns for global and shared memory access. The profiler can help you determine how efficient these accesses are (coalescing, bank conflicts).
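You can also get a quick read on how much the shared memory allocation constrains occupancy with the occupancy API. Here is a rough sketch: the dummy kernel is only a stand-in (its register usage will differ from your real kernel), and the 160 threads / 14592 bytes per block are taken from your description, so treat the numbers as indicative only.

```
// Sketch: query how many blocks can be resident per SM for a given amount
// of dynamic shared memory per block.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(int *out)
{
    extern __shared__ int tile[];       // dynamic shared memory
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    const int blockSize = 160;                      // 20x8 threads
    const size_t smemSizes[2] = {14592, 7296};      // current usage vs. half

    for (int i = 0; i < 2; ++i) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, dummyKernel, blockSize, smemSizes[i]);
        printf("%u bytes smem/block -> %d resident blocks/SM (%d threads/SM)\n",
               (unsigned)smemSizes[i], blocksPerSM, blocksPerSM * blockSize);
    }
    return 0;
}
```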
I guess I’m a bit confused now.
Perhaps wrongly, I assumed the CPU had its own RAM and the GPU had its own memory: global (which any block and thread can access) and shared (which is shared by the threads within a block).
Anyhow, my timings don’t include copying memory from host to device (which I had assumed was a transfer between CPU RAM and GPU global memory). However, the timings do include the copies from what I had assumed was GPU global memory into GPU shared memory (which is then shared between each block’s threads).
Please could you explain how you think it is set up?
So as a relative newbie, I find the profiler a bit confusing tbh.
So the CUDA Application Analysis reports:
Low memcpy/compute overlap (0%),
low kernel concurrency (0%),
low memcpy throughput (1.073 GB/s avg, for memcpys accounting for 90.5% of all memcpy time).
The complete utilization graph looks to be down at about 30% throughout.
When I examine individual kernels (I only have one here):
Efficiency:
Global load efficiency = 100%
Global store efficiency = 15.8% (!)
Shared efficiency = 37.8% (!)
Warp execution efficiency = 71.5% (!)
Non-predicated warp execution efficiency = 70% (!)
Occupancy:
Achieved = 29.1% (!)
Theoretical = 46.9%
Limiter = Shared Memory
When I perform kernel analysis it says:
Kernel performance is bound by instruction and memory latency.
There is then a bar chart that shows the compute utilisation at less than 10% (the legend shows most of this 10% consists of arithmetic operations) and memory utilisation about 5%.
Perhaps you would be kind enough to suggest what this shows?
As I said, the kernel should be bound by memory throughput: only 15 max() calls, which map directly to IMAX instructions, are needed to find the maximum of 4x4 = 16 pixels, so I see no reason to worry about computation.
Global load efficiency is important, and at 100% it is as good as it gets. Stores are typically “fire and forget”, which should apply to your scenario (they are only used for writing out the final results), so I wouldn’t worry about the low store efficiency for now; revisit the store access pattern later.
The occupancy looks worryingly low, and it is caused by the shared memory usage per block being too high. Cut it in half.
I don’t know why there are memcpy() calls in the code. The Jetson TX1 has unified physical memory, does it not? I haven’t used one of these parts, but given the unified physical memory I do not see the need for copying; that just adds overhead.
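I have not tested this on a TX1 myself, so treat it as an untested sketch, but on a device with unified physical memory you can typically hand the same allocation to both sides instead of copying, e.g. with managed memory (zero-copy pinned memory via cudaHostAlloc is another option). The placeholder kernel below just computes the 4x4 maximum straight from global memory; it stands in for your real kernel.

```
// Sketch: use managed memory so CPU and GPU touch the same allocation,
// removing the explicit cudaMemcpy calls.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void regionMax4x4(const int *src, int *dst, int srcW, int outW, int outH)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= outW || oy >= outH) return;

    int m = src[(oy * 4) * srcW + ox * 4];
    for (int dy = 0; dy < 4; ++dy)
        for (int dx = 0; dx < 4; ++dx)
            m = max(m, src[(oy * 4 + dy) * srcW + ox * 4 + dx]);
    dst[oy * outW + ox] = m;
}

int main()
{
    const int W = 1280, H = 960;
    const int outW = W / 4, outH = H / 4;

    int *image = 0, *result = 0;
    cudaMallocManaged(&image,  W * H       * sizeof(int));
    cudaMallocManaged(&result, outW * outH * sizeof(int));

    // Fill the image directly from CPU code; no cudaMemcpy needed.
    for (int i = 0; i < W * H; ++i)
        image[i] = i % 255;

    dim3 block(20, 8);
    dim3 grid(16, 30);                 // 16*20 = 320, 30*8 = 240 outputs
    regionMax4x4<<<grid, block>>>(image, result, W, outW, outH);
    cudaDeviceSynchronize();           // required before the CPU reads 'result'

    printf("first output value: %d\n", result[0]);

    cudaFree(image);
    cudaFree(result);
    return 0;
}
```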
Working with the profiler will be a feedback loop. It suggests a bottleneck, you change the code to eliminate the bottleneck, now you have a new bottleneck, etc.