NPP JPEG Routines Concurrency

Hi,

I am investigating GPU-accelerated JPEG processing via NPP, using the IDCT, DCT, and resize functions. We’re looking at deploying this as a service, so concurrency is essential for driving throughput, and we’re running into some issues.
As I understand, there are two ways of driving concurrency in the CUDA world:
1. Go multi-threaded and use different CUDA streams.
The problem here is that the IDCT and DCT functions are marked as “not thread safe” in the NPP manual. Is there any news on these being made thread safe in newer versions of NPP?
2. Use the MPS server and go multi-process.
This is the only practical option for us if multi-threading is out.
The problem we are facing here is that the service is to be deployed on Amazon EC2, and the GPU instances there use an NVIDIA GRID K520, which appears to support only SM 3.0, whereas the MPS server requires SM 3.5 or higher (a quick compute-capability check is sketched below).
Anyone faced this issue on Amazon EC2 instances?
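Not EC2-specific, but for reference, here is a minimal check (plain CUDA runtime API, nothing NPP-related; device 0 assumed) that could be run on the instance to confirm the compute capability before relying on MPS:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability of device 0; MPS needs SM 3.5 or higher.
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s: SM %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```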

Would greatly appreciate some info here.

You can have a multi-threaded app as long as only a single thread drives NPP.

You’re not likely to get much concurrency benefit from trying to drive multiple NPP routines concurrently on a single GPU for images of any reasonable size: the grids launched by the NPP kernels will generally be large enough to fill the GPU, preventing any kernel concurrency.

Therefore, one possible approach to consider might be a multi-threaded app that hands all of its NPP work to a single thread, which performs the NPP calls and then returns the results to the various requesting threads. Other aspects of concurrency, such as overlap of copy and compute, can still be managed by that single NPP thread.
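A minimal sketch of that pattern, using names of my own invention (the body of each submitted task is where the actual NPP routine, e.g. a DCT/IDCT or resize call, would go):

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// All NPP work is funneled through one worker thread; other threads enqueue
// tasks (each capturing its own NPP call) and wait on the returned future.
class NppDispatcher {
public:
    NppDispatcher() : worker_([this] { run(); }) {}
    ~NppDispatcher() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    std::future<void> submit(std::function<void()> task) {
        std::packaged_task<void()> pt(std::move(task));
        std::future<void> f = pt.get_future();
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(pt)); }
        cv_.notify_one();
        return f;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<void()> pt;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (done_ && q_.empty()) return;
                pt = std::move(q_.front());
                q_.pop();
            }
            pt();  // runs the captured NPP call(s) on this single thread
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> q_;
    bool done_ = false;
    std::thread worker_;
};

// Usage from a request-handling thread:
// NppDispatcher dispatcher;
// auto done = dispatcher.submit([&] { /* nppi... call on this image */ });
// done.wait();
```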

You won’t be able to deploy MPS on an SM 3.0 device.

NPP has undergone some changes in the CUDA 8.0 RC. If you haven’t taken a look at it yet, you may wish to, although a couple of functions are still marked as not thread safe:

nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R
nppiDCTQuantInv8x8LS_JPEG_16s8u_C1R