nppiResize_8u_C1R function


I was wondering whether the nppiResize_8u_C1R function (part of the NPP library) can somehow be called within a specific CUDA stream.

I need to resize images from different host threads. Ideally I would like to push GPU usage as close to 100% as possible. However, I cannot achieve more than 20% GPU usage with, say, 16 host threads.

My assumption is that the nppiResize_8u_C1R function somehow serializes, i.e. it waits for one host thread to finish before it handles the next one, and so on. Is this correct? If so, how would one get concurrent execution of this function?

Thanks a lot

NPP has nppGetStream() and nppSetStream() functions to handle streams. You can refer to the documentation for these functions.

(e.g. p33)

If you issue nppSetStream(…) to some stream you have created, subsequent NPP calls (within a given CPU thread) should be issued to that stream.

Otherwise, the NPP functions will be issued to the default stream and they will serialize.
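To make the stream selection concrete, here is a minimal sketch of one resize issued to a user-created stream. The helper names `resizeOnStream` and `runResizeExample`, the image sizes, and the fill value are made up for illustration. The nppiResize_8u_C1R argument order shown is the one I believe recent toolkits use (roughly CUDA 9 through 12); older toolkits take scale factors instead, so check the header of your own NPP version.

```cuda
#include <cuda_runtime.h>
#include <npp.h>

// Hypothetical helper: resize one 8-bit, single-channel device image on `stream`.
// Steps are row pitches in bytes.
NppStatus resizeOnStream(cudaStream_t stream,
                         const Npp8u* d_src, int srcStep, NppiSize srcSize,
                         Npp8u* d_dst, int dstStep, NppiSize dstSize)
{
    // Route subsequent NPP calls to `stream` instead of the default stream.
    nppSetStream(stream);

    NppiRect srcRoi = {0, 0, srcSize.width, srcSize.height};
    NppiRect dstRoi = {0, 0, dstSize.width, dstSize.height};

    // Assumed signature (CUDA 9-12 era NPP); verify against your nppi headers.
    return nppiResize_8u_C1R(d_src, srcStep, srcSize, srcRoi,
                             d_dst, dstStep, dstSize, dstRoi,
                             NPPI_INTER_LINEAR);
}

// Hypothetical driver: returns 0 on success, nonzero on any failure.
int runResizeExample()
{
    const NppiSize srcSize = {640, 480};   // illustrative sizes
    const NppiSize dstSize = {320, 240};

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

    int srcStep = 0, dstStep = 0;
    Npp8u* d_src = nppiMalloc_8u_C1(srcSize.width, srcSize.height, &srcStep);
    Npp8u* d_dst = nppiMalloc_8u_C1(dstSize.width, dstSize.height, &dstStep);

    int rc = 1;
    if (d_src != nullptr && d_dst != nullptr) {
        // Fill the source with a constant gray value on the same stream.
        cudaMemset2DAsync(d_src, srcStep, 128,
                          srcSize.width, srcSize.height, stream);

        NppStatus st = resizeOnStream(stream, d_src, srcStep, srcSize,
                                      d_dst, dstStep, dstSize);

        // Wait only for this stream, not for the whole device.
        cudaError_t err = cudaStreamSynchronize(stream);
        rc = (st == NPP_SUCCESS && err == cudaSuccess) ? 0 : 1;
    }

    if (d_src) nppiFree(d_src);
    if (d_dst) nppiFree(d_dst);
    nppSetStream(0);              // hand NPP back to the default stream
    cudaStreamDestroy(stream);
    return rc;
}
```

Build with nvcc and link the NPP libraries that your toolkit ships (the library names vary by version). Calling `runResizeExample()` from `main` should return 0 if the stream, allocations, and resize all succeed.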

Whether or not this affects GPU utilization I can't say. I'm not sure which GPU utilization monitor you are using, and it may not be indicating what you think it is. If an individual kernel is fully utilizing the GPU (which would probably be the case for a reasonably large image), then launching additional kernels may not improve utilization much. You may still be able to improve overall efficiency by overlapping copy and compute, which is also facilitated by streams (although more than just stream usage is required).
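For what "more than just stream usage" means in practice, here is a rough sketch of the usual copy/compute overlap pattern: pinned host memory, asynchronous copies, and each chunk of work issued to its own stream. The kernel `invertKernel`, the function `runOverlapExample`, and the chunk sizes are placeholders I made up; the kernel stands in for whatever processing you actually do. Whether the copies and kernels really overlap depends on the GPU's copy engines, which this sketch does not check.

```cuda
#include <cuda_runtime.h>

// Placeholder for real work (e.g. an image filter): invert each byte.
__global__ void invertKernel(unsigned char* data, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = 255 - data[i];
}

// Hypothetical driver: returns 0 if every byte was processed correctly.
int runOverlapExample()
{
    const int    kStreams = 4;
    const size_t kChunk   = 1 << 20;          // bytes handled per stream
    const size_t kTotal   = kChunk * kStreams;

    unsigned char* h_buf = nullptr;
    unsigned char* d_buf = nullptr;

    // Pinned (page-locked) host memory is needed for cudaMemcpyAsync
    // to run asynchronously with respect to the host and other streams.
    if (cudaMallocHost(&h_buf, kTotal) != cudaSuccess) return 1;
    if (cudaMalloc(&d_buf, kTotal) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return 1;
    }
    for (size_t i = 0; i < kTotal; ++i) h_buf[i] = (unsigned char)(i & 0xFF);

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    const int threads = 256;
    const int blocks  = (int)((kChunk + threads - 1) / threads);

    // Each stream gets its own copy-in, kernel, copy-out sequence, so the
    // copy for one chunk can overlap with the kernel for another.
    for (int s = 0; s < kStreams; ++s) {
        const size_t off = s * kChunk;
        cudaMemcpyAsync(d_buf + off, h_buf + off, kChunk,
                        cudaMemcpyHostToDevice, streams[s]);
        invertKernel<<<blocks, threads, 0, streams[s]>>>(d_buf + off, kChunk);
        cudaMemcpyAsync(h_buf + off, d_buf + off, kChunk,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams, then verify the results on the host.
    int bad = (cudaDeviceSynchronize() != cudaSuccess);
    for (size_t i = 0; i < kTotal && !bad; ++i)
        if (h_buf[i] != (unsigned char)(255 - (i & 0xFF))) bad = 1;

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return bad;
}
```

A profiler timeline is the reliable way to confirm that the transfers and kernels from different streams actually run concurrently on your hardware.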

For a multithreaded application, you’ll also want to be sure your GPU is in the correct compute mode, i.e. Default, or Exclusive Process.
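If you want to check the compute mode from inside the application, something along these lines should work. The function name `computeModeAllowsThreads` is my own; the attribute and enum values come from the CUDA runtime API.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: true if the device is in Default or Exclusive Process
// mode, i.e. several host threads of one process can use it together.
bool computeModeAllowsThreads(int device)
{
    int mode = -1;
    if (cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, device)
            != cudaSuccess)
        return false;   // query failed (e.g. no such device)

    return mode == cudaComputeModeDefault ||
           mode == cudaComputeModeExclusiveProcess;
}
```

Calling `computeModeAllowsThreads(0)` once at startup, before spawning worker threads, gives an early warning if device 0 is in a mode that would block them.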

Thanks txbob.

I think overlap of copy and compute would certainly improve the speed. Do you perhaps know some good resources where I can learn about that?