How to make parallel calls to NPP functions?

I need to remap a large stack of images with the nppiRemap functions. I thought I would be smart and parallelize this process by calling those functions from within a CUDA kernel, but alas, they are technically __host__ functions. It didn’t seem particularly wise to create a separate stream context for each image in the stack, but it’s starting to seem like that might be the only option. Am I missing something?
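For concreteness, the per-image loop I'd like to speed up looks roughly like this (a sketch, not my actual code; all names are placeholders, every pointer is a device pointer, and all images share one set of remap tables here):

```cpp
#include <cuda_runtime.h>
#include <npp.h>

void remapStackSequential(const Npp8u *const *d_src, Npp8u *const *d_dst,
                          int numImages,
                          NppiSize srcSize, int srcStep, NppiRect srcROI,
                          const Npp32f *d_xmap, int xmapStep,
                          const Npp32f *d_ymap, int ymapStep,
                          NppiSize dstSize, int dstStep)
{
    // One host-side call per image; each call launches its own kernel(s).
    for (int i = 0; i < numImages; ++i) {
        nppiRemap_8u_C1R(d_src[i], srcSize, srcStep, srcROI,
                         d_xmap, xmapStep, d_ymap, ymapStep,
                         d_dst[i], dstStep, dstSize,
                         NPPI_INTER_LINEAR);
    }
}
```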

nppiRemap() is highly parallelized internally. What set of circumstances led to the hypothesis that additional performance could be gained by parallel execution of multiple invocations of nppiRemap()?

What’s the size (in pixels) of the images?

Are you indicating that a simple for loop would be no less efficient than parallel invocation? Interesting; I would not have expected that to be the case.

The input images are approximately 1024x512. The shape of the output images varies (depending on a parameter); I wouldn't expect anything smaller than 512x512 or larger than 2048x2048. They are single-channel uint8 images, and there can be anywhere between 1 and ~500 images in a stack.

Imagine a factory containing 10 machines producing widgets. In order to keep these machines busy at all times as long as there is still work to do, team lead nppiRemap assigns a team of 100 workers to these machines, who take turns using them, with zero switching time between workers. Their boss, QuaternionsRock, in order to boost production, now proposes to assign 10 teams of 100 workers to the machines. Switching out teams incurs a finite, non-zero time overhead epsilon.

Under this proposal, how much does total production per unit of time increase?

I suppose I wouldn’t have expected a remap operation on a single image to be able to achieve full utilization of the GPU, especially considering there are versions of nppiRemap that operate on multi-channel planar images - if there is no efficiency to be gained, why not just tell the user to iterate through the channel planes themselves?

The execution resources of the GPU (called "CUDA cores" in marketing speak) are on the order of 10K. The number of pixels in each image is at least on the order of 260K (512 x 512 = 262,144). It is fairly common in image processing to assign one thread to each output pixel. Given that here we have a single byte-wide channel, nppiRemap may be assigning four consecutive pixels to each thread for efficient memory access, using only about 65K threads (262,144 / 4 = 65,536).
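If you want to put a concrete number on "execution resources" for your particular GPU, the runtime API reports the resident-thread capacity directly; a minimal sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Maximum number of threads that can be resident on the device at once
    int residentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("SMs: %d, max resident threads: %d\n",
           prop.multiProcessorCount, residentThreads);
    // Compare against the ~65K threads estimated above for one 512x512 image.
    return 0;
}
```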

This arithmetic tells us that in this scenario a single call to nppiRemap() is able to fully utilize all available GPU execution resources. Depending on the details of the processing (some modes of this function would seem to require more computation than others), the utilization of the execution resources may actually not be the performance-limiting factor. Instead, the processing may be limited by available memory bandwidth.
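A back-of-the-envelope estimate (the bandwidth figure is an assumption, not a measurement): with bilinear interpolation, each output pixel requires two 32-bit map reads (8 bytes), up to four source-pixel bytes, and a 1-byte write, so call it roughly 13 bytes of memory traffic per pixel. A 512x512 output then moves on the order of 262,144 x 13 ≈ 3.4 MB; at an assumed 500 GB/s that is a handful of microseconds per image, short enough that per-call launch overhead starts to matter across a stack of ~500 images.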

Yes, given the context of the question. Concurrent execution of multiple kernel invocations only makes sense when a single launch is unable to fully utilize the GPU resources. This does occur in real life, but rarely. When it happens, it should be an incentive to ponder how more parallelism can be exposed (for example, by batch processing). And if it happens a lot, the task at hand might not be a good match for the massive parallelism of GPUs.
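If profiling does show that individual launches underutilize the device (plausible only for the smallest images here), the stream-per-image idea from the original question can be tried without creating 500 streams: round-robin the stack over a handful of streams via the _Ctx variants. A sketch under those assumptions (all names are placeholders; buffers are assumed to be device-resident already):

```cpp
#include <cuda_runtime.h>
#include <npp.h>

void remapStackStreamed(const Npp8u *const *d_src, Npp8u *const *d_dst,
                        int numImages,
                        NppiSize srcSize, int srcStep, NppiRect srcROI,
                        const Npp32f *d_xmap, int xmapStep,
                        const Npp32f *d_ymap, int ymapStep,
                        NppiSize dstSize, int dstStep)
{
    const int kNumStreams = 4;              // a handful is plenty
    cudaStream_t streams[kNumStreams];
    NppStreamContext ctx[kNumStreams];

    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        nppGetStreamContext(&ctx[s]);       // fills in device properties
        ctx[s].hStream = streams[s];        // retarget the context at this stream
    }

    // Launches in different streams may overlap if the GPU has idle resources.
    for (int i = 0; i < numImages; ++i) {
        nppiRemap_8u_C1R_Ctx(d_src[i], srcSize, srcStep, srcROI,
                             d_xmap, xmapStep, d_ymap, ymapStep,
                             d_dst[i], dstStep, dstSize,
                             NPPI_INTER_LINEAR, ctx[i % kNumStreams]);
    }

    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

But per the analogy above: if each call already fills the machines, this buys nothing beyond hiding some launch latency.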