I am using CUDA for image processing, and I developed a pixel-to-pixel correlation method in which each pixel is processed by one CUDA thread. For small images (less than 512x512 pixels) it was OK, but when I increased the image size it became very slow. So I wonder about this issue that the NVIDIA profiler reports:

Very High Utilization
ALU is the highest-utilized pipeline (83.6%) based on active cycles, taking into account the rates of its different instructions. It executes integer and logic operations. The pipeline is over-utilized and likely a performance bottleneck. Based on the number of executed instructions, the highest utilized pipeline (83.6%) is ALU. It executes integer and logic operations. Comparing the two, the overall pipeline utilization appears to be caused by frequent, low-latency instructions. See the Kernel Profiling Guide or hover over the pipeline name to understand the workloads handled by each pipeline. The Instruction Statistics section shows the mix of executed instructions in this kernel.

And then: Instruction Statistics
FP32/64 Instructions: This kernel executes 1781720 fused and 979381685 non-fused FP32 instructions. By converting pairs of non-fused instructions to their fused, higher-throughput equivalent, the achieved FP32 performance could be increased by up to 50% (relative to its current performance). Check the Source page to identify where this kernel executes FP32 instructions.

Does this simply mean that I have reached the limit of the GPU and that all threads are busy?

No. The profiler actually tells you exactly what to look at to increase performance:

This kernel executes 1781720 fused and 979381685 non-fused FP32 instructions. By converting pairs of non-fused instructions to their fused, higher-throughput equivalent, the achieved FP32 performance could be increased by up to 50% (relative to its current performance). Check the Source page to identify where this kernel executes FP32 instructions.

The fused FP32 operation being referred to is FMA (= fused multiply-add). GPUs, and frankly all modern high-performance processors, are optimized for maximum throughput of this operation.

Typical non-fused FP32 operations are FADD and FMUL. At default settings (--fmad=true), the CUDA compiler will aggressively look for opportunities to contract an FMUL with a dependent FADD into an FMA. However, it will not otherwise re-associate floating-point arithmetic, because floating-point arithmetic (as opposed to actual mathematical operations) is non-associative, and this behavior preserves the programmer's sanity.

Use the profiler to identify where the kernel executes the bulk of the FP32 operations, and look into re-organizing the computation so as to maximize the use of FMA operations (use the fmaf() and fma() standard math library functions for that).

Before diving into this low-level aspect, you may want to direct your optimization effort "one level up" and double-check the algorithm(s) used:

(1) Are there alternative approaches that belong to a lower complexity class ("big-O")? The question hints that an algorithm of complexity O(n^2) or even O(n^3) may be involved.

(2) Are there alternative approaches that lend themselves more readily to the application of fused multiply-adds?

I will think about improving the algorithms and using fused multiply-adds, but this will complicate the source code, and both will take some time.

Just one more thing that is not completely clear from the profiler output: what is the limit at which I reach 100% occupancy, i.e., the maximum number N of active threads? My GPU is an RTX 3060, and in the Occupancy Calculator I see:

Active Threads per Multiprocessor 1536
Active Warps per Multiprocessor 48
Max Threads per Multiprocessor 1536
Threads per Warp 32

Sorry if this is not clear. If I process an image of size N = w x h, with one CUDA thread per pixel, should I expect that beyond a certain number of pixels N, the calculation time per image will start to depend linearly on N? I believe this happens when I reach maximal GPU utilization.

So if I have 1000 images, I will wait the same time whether I send them through the GPU pipeline one by one or in batches of two or more (if their size is bigger than N).

The use of batching mechanisms is recommended when each processed item (e.g. image, matrix) is "small", to make the best use of available GPU resources. It is true that the benefits of batching decrease and ultimately vanish as the size of each processed item becomes larger.

For any given use case, you can easily convince yourself of this by trying batching with increasing item sizes and plotting the resulting throughput on the y-axis. The resulting graph (which may be somewhat noisy) should asymptotically approach a horizontal line.