nvJPEG library: how do you use CUDA streams to get the best concurrency?

I could not find any information about nvJPEG other than the library updates (not even a topic about nvJPEG in this forum), so I posted several questions about it in DALI's GitHub repository; see https://github.com/NVIDIA/DALI/pull/117.
The DALI developers told me that your performance report uses one CPU core for the CPU test and one stream for the GPU test, and that if I want to use multiple streams I should also use multiple threads (one CPU thread per stream).

I've tried using one thread per stream, as DALI does (https://github.com/NVIDIA/DALI/blob/master/dali/pipeline/operators/decoder/nvjpeg_decoder.h), but CPU and GPU utilization are still low (about 20-30% of 12 CPU cores; 10-20% of the GPU, a 1080 Ti). Four streams seems to be the limit: going above 4 does not increase performance (FPS).
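
A thread-per-stream layout of this kind can be sketched roughly as below. This is a simplified C++ illustration under stated assumptions, not DALI's implementation: `WorkerContext`, `decode_one`, and `decode_all` are hypothetical names, and the GPU work is replaced by a placeholder so that only the threading structure is shown. In a real program each worker would own its own CUDA stream and nvJPEG decode state and would call the decoder where the placeholder is.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-thread resources. In a real program this would hold the
// worker's own CUDA stream and nvJPEG decode state, created once per thread.
struct WorkerContext {
    int stream_id;        // placeholder standing in for a CUDA stream
    std::size_t decoded;  // number of images this worker has handled
    explicit WorkerContext(int id) : stream_id(id), decoded(0) {}
};

// Hypothetical placeholder for the decode step. A real worker would issue the
// nvJPEG decode on its own stream here and then synchronize that stream.
void decode_one(WorkerContext& ctx, const std::string& jpeg_bytes) {
    (void)jpeg_bytes;  // unused in this sketch
    ++ctx.decoded;
}

// Decode all images using num_workers threads, one context (stream) per thread.
// Returns how many images each worker handled.
std::vector<std::size_t> decode_all(const std::vector<std::string>& images,
                                    int num_workers) {
    assert(num_workers > 0);

    std::vector<WorkerContext> contexts;
    contexts.reserve(static_cast<std::size_t>(num_workers));
    for (int i = 0; i < num_workers; ++i) contexts.emplace_back(i);

    const std::size_t stride = static_cast<std::size_t>(num_workers);
    std::vector<std::thread> threads;
    for (std::size_t w = 0; w < stride; ++w) {
        threads.emplace_back([&contexts, &images, w, stride] {
            // Static round-robin split: worker w takes images w, w+N, w+2N, ...
            // Each worker touches only its own context, so no locking is needed.
            for (std::size_t i = w; i < images.size(); i += stride) {
                decode_one(contexts[w], images[i]);
            }
        });
    }
    for (auto& t : threads) t.join();

    std::vector<std::size_t> counts;
    for (const auto& c : contexts) counts.push_back(c.decoded);
    return counts;
}
```

Calling `decode_all(images, 4)` corresponds to the 4-stream configuration described above. The static round-robin split is just one possible choice; a shared work queue would balance better when image sizes vary a lot.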

So, if I want more performance, is it a good idea to launch more streams?