Extract overlapping image patches from an image in CUDA

I am currently planning on writing a function that extracts overlapping image patches from a 2D image (width x height) into a 3D batch of these patches (batch_id x patch_width x patch_height). As far as I know, there are no utilities in CUDA or OpenCV CUDA which make that very easy. (Please correct me if I am wrong here)

Since I need to resort to writing my own CUDA kernel for this task, I need to decide how to tackle it. As far as I can see, there are two ways to write the kernel:

1. Create a GPU thread for each pixel of the input image and map that pixel to potentially multiple locations in the 3D batch.
2. Create a GPU thread for each pixel of the 3D batch and let it fetch its corresponding pixel from the input image.

I didn’t find a clear answer in the CUDA Programming Guide as to whether either of these approaches has specific advantages or disadvantages. Would you favor one of them, or is there an even easier way of doing this?

This is actually my post. I wanted to ask here as well, since nobody in the SO thread could tell me whether there are any primitives for this kind of operation in CUDA or in one of the NVIDIA libraries.

It is fine to crosspost here and on Stack Overflow. It is courteous to let readers know the location of the respective crosspost, so no prospective answerer wastes their time redundantly replying to a question that has already been answered elsewhere.

The CUDA Programming Guide doesn’t specify how to assign threads to data, as that is one of the principal design decisions to be made by a CUDA programmer, and the most advantageous arrangement will depend on the use case.

As a rule of thumb, a reasonable design starting point is often to assign threads to data based on data output. That is, each thread produces one element of the output, and collects whatever inputs it needs to produce that output. A variant of this approach has each thread produce multiple elements of the overall output.
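To make that concrete for the question at hand, here is a minimal sketch of what the output-oriented arrangement (the asker's method 2) could look like, assuming square patches extracted on a regular stride. All names, parameters, and the patch memory layout are my own illustrative assumptions, not an established API:

```cpp
// One thread per output element: each thread computes which source pixel
// backs its (batch, py, px) coordinate, loads it, and stores it. Assumes
// patches lie fully inside the image and are laid out row-major per patch.
__global__ void extractPatchesOutputCentric(const float* __restrict__ img,
                                            float* __restrict__ batch,
                                            int imgW,
                                            int patchSize, int stride,
                                            int patchesX)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;  // x within the patch
    int py = blockIdx.y * blockDim.y + threadIdx.y;  // y within the patch
    int b  = blockIdx.z;                             // patch (batch) index

    if (px >= patchSize || py >= patchSize) return;

    // top-left corner of patch b in the source image
    int x0 = (b % patchesX) * stride;
    int y0 = (b / patchesX) * stride;

    // one load and one store per output element
    batch[(size_t)b * patchSize * patchSize + py * patchSize + px] =
        img[(y0 + py) * imgW + (x0 + px)];
}
```

With a launch such as `dim3 block(16, 16); dim3 grid((patchSize + 15) / 16, (patchSize + 15) / 16, numPatches);`, consecutive threads of a warp map to consecutive `px` values, so both the load and the store are coalesced.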

If a simple paper analysis of the problem at hand doesn’t clearly favor one arrangement over the other, I would highly recommend prototyping promising variants and examining them with the help of the CUDA profiler. Doing this repeatedly on different tasks will improve a CUDA programmer’s intuition (feel) for data-to-thread mapping in CUDA.

That’s why I added the link here.

The first two optimization objectives for any CUDA programmer are:

  1. expose enough parallelism: a first-order simplification/approximation here is to launch a kernel with a grid that has enough threads to saturate the GPU you are running on.

  2. make efficient use of memory

Both principles may be at play here. If we ignore item 1 above, then I would suggest that your method 1 is preferred, because it has at least the opportunity to do a single load operation per input pixel. A store will still be necessary for each patch of the 3D batch that the pixel appears in, but your method 2 would require a load for each store; method 1 would not, necessarily. So that could be a more efficient use of memory. When you have a memory-bound problem, which this one certainly will be, it's critical to focus on efficient use of memory. Related to this would be to make sure you do the best possible job of coalesced access, both for loads and stores. That should not be difficult to arrange.
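For comparison, here is a minimal sketch of method 1 under the same illustrative assumptions as the earlier sketch: one thread per source pixel, a single load, and one store for every patch the pixel falls into:

```cpp
// One thread per input pixel. The pixel is loaded exactly once and then
// scattered to every patch whose window covers it. The index math assumes
// patch (bx, by) covers columns [bx*stride, bx*stride + patchSize - 1]
// and the analogous rows.
__global__ void extractPatchesInputCentric(const float* __restrict__ img,
                                           float* __restrict__ batch,
                                           int imgW, int imgH,
                                           int patchSize, int stride,
                                           int patchesX, int patchesY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= imgW || y >= imgH) return;

    float v = img[y * imgW + x];  // the single load per input pixel

    // range of patch indices whose window contains (x, y)
    int bxLo = (x >= patchSize) ? (x - patchSize) / stride + 1 : 0;
    int byLo = (y >= patchSize) ? (y - patchSize) / stride + 1 : 0;
    int bxHi = min(patchesX - 1, x / stride);
    int byHi = min(patchesY - 1, y / stride);

    for (int by = byLo; by <= byHi; ++by)
        for (int bx = bxLo; bx <= bxHi; ++bx) {
            int b  = by * patchesX + bx;
            int px = x - bx * stride;   // position within patch b
            int py = y - by * stride;
            batch[(size_t)b * patchSize * patchSize + py * patchSize + px] = v;
        }
}
```

The loads coalesce (consecutive threads read consecutive `x`), and within each iteration of the inner loop the stores of adjacent threads land at adjacent addresses of the same patch row, apart from some divergence where threads near patch boundaries fall into different patch ranges.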

If we pull in item 1 above, then this raises the next set of questions, around dimensions. Dimensions/data set sizes often matter when we are talking about performance. If we use your method 1, we only have as many threads to launch as there are pixels in the input image. If there are enough pixels/threads, that may be sufficient to saturate your GPU. If not, method 2 may actually be better, because it has the potential to launch more threads overall (perhaps many more; again, dimensions matter).
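Some back-of-the-envelope arithmetic illustrates how much the dimensions matter. The sizes below are illustrative, not taken from the question:

```cpp
#include <cstdio>

int main() {
    // illustrative sizes: 512x512 image, 32x32 patches, stride 8
    int imgW = 512, imgH = 512, patchSize = 32, stride = 8;
    int patchesX = (imgW - patchSize) / stride + 1;  // 61
    int patchesY = (imgH - patchSize) / stride + 1;  // 61
    long long m1 = (long long)imgW * imgH;           // method 1 thread count
    long long m2 = (long long)patchesX * patchesY
                 * patchSize * patchSize;            // method 2 thread count
    printf("method 1: %lld threads, method 2: %lld threads\n", m1, m2);
    // prints: method 1: 262144 threads, method 2: 3810304 threads
    return 0;
}
```

Here method 2 exposes roughly 15x more threads. A large GPU needs on the order of a couple hundred thousand resident threads to saturate, so method 1's 262,144 threads would be borderline on such a device, while method 2's ~3.8 million leave plenty of headroom.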

I don’t know of any primitives (which I translate to mean “library calls that would be useful”). On SO, questions asking for library recommendations are explicitly off-topic, so it's possible that the people not answering that part of your question were simply adhering to site expectations.

CUDA, to a first-order approximation, is C++ plus libraries. Are there C++ “primitives” that meet your needs? If not, most likely there won’t be any in CUDA either. You would then need to look at libraries. I don’t know of any libraries that would be an obvious fit for what you are asking about here. You could certainly use NPP to do image copying in the general case, but I doubt it would be an obvious performance “win”, and it won’t be amenable to the various optimization strategies discussed here.

A very good suggestion (IMO) was given to you in your cross-posting: don’t copy the data. Instead, just create a mapping for each patch in the 3D stack that maps its pixels back to the original source image. GPUs can often do math at a much higher rate than memory accesses, so if the mapping can be done mathematically/algorithmically, it may overall be faster not to create the 3D batch at all and simply use the mapping functions to access pixels in the original image.

The effectiveness of this suggestion will vary depending on how many times you subsequently access each pixel in the 3D batch. If you have to access the 3D batch many times per pixel (which will require a read per pixel, no matter what), it may be more expedient to create the 3D batch. If you only need to access those output pixels once, then a mapping may be much better: creating the output batch minimally requires a read, a write, and another read for the final op, whereas with the mapping there is only the read associated with the final op (if you only read the result once), plus the mapping arithmetic cost. This could be a 3:1 reduction in overall memory bandwidth consumed, and that could make as much as a 3:1 difference in performance. OTOH, if you access each pixel in the output 3D batch many times, then the bandwidth cost of using the mapping will be approximately the same as the bandwidth cost of using the 3D batch.
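A minimal sketch of what that mapping could look like, again under my illustrative layout assumptions (regular stride, row-major patch indexing): a small device-side helper that downstream kernels can call wherever they would otherwise have read from the materialized batch:

```cpp
// Maps a virtual (batch, py, px) coordinate of the 3D patch stack back to
// the source image and reads the pixel there. No 3D batch is ever stored.
__device__ __forceinline__
float patchPixel(const float* __restrict__ img, int imgW,
                 int stride, int patchesX,
                 int b, int py, int px)
{
    int x0 = (b % patchesX) * stride;  // origin of patch b in the image
    int y0 = (b / patchesX) * stride;
    return img[(y0 + py) * imgW + (x0 + px)];
}
```

The division and modulo cost a handful of integer instructions, which is cheap compared to a trip to global memory.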

As an addendum: in situations where there is a goal conflict between coalescing loads versus coalescing stores, optimizing the coalescing of loads is more important in most practically encountered scenarios. There will be other operations waiting on the load data to be returned, while this is typically not the case for stored data. This means that low effective load throughput directly results in lower overall performance.
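The classic case of such a goal conflict is a matrix transpose. Absent the usual shared-memory tiling (which lets both sides coalesce), the guidance above says to prefer the variant with coalesced loads, as in this sketch:

```cpp
// Naive transpose written so that the loads coalesce: consecutive threads
// read consecutive input addresses, while the stores are strided. Purely
// illustrative; a tiled shared-memory transpose would coalesce both sides.
__global__ void transposeCoalescedLoads(const float* __restrict__ in,
                                        float* __restrict__ out,
                                        int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[(size_t)x * h + y] = in[(size_t)y * w + x];  // coalesced load, strided store
}
```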