How do I use ping-pong surfaces to improve the CUDA video decoder's performance?

The last page of <> says:
“To improve performance, having 2 or more D3D9 or OpenGL surfaces to ping/pong can improve
performance. This enables this driver to schedule workload without blocking the display thread.”

My program uses nvcuvid to decode 15-18 H.264 HD streams, so I need to optimize it for the best possible performance. Following the suggestion above, I allocate two textures for every decode worker (I use OpenGL to render the video): one receives the decoded frame from nvcuvid, and the other is used for display rendering. The pseudocode looks like this:

for (int i = 0; i < numOfStreams; i++)
{
    GLuint &texFore = aryOfTexFore[i];  // texture currently being displayed
    GLuint &texBack = aryOfTexBack[i];  // texture that receives the next decoded frame

    // read out the decoded image into texBack
    cuMemcpy2D(...);                    // copy out the decoded image

    // render the video using texFore

    // swap the fore and back textures for the next rendering pass
}
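
To make the swap step concrete, here is a minimal C++ sketch of the per-stream bookkeeping only. It assumes a plain unsigned integer as a stand-in for GLuint so it compiles without OpenGL or CUDA headers. The names TextureId, PingPongPair, makePairs and stepFrame are hypothetical helpers, not part of nvcuvid or OpenGL. It shows how the two textures trade roles each frame and makes no claim about whether the driver actually overlaps the copy with rendering.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumption: stand-in for GLuint so the sketch builds without GL headers.
using TextureId = unsigned int;

// Hypothetical helper: one ping/pong pair of textures per decoded stream.
struct PingPongPair {
    TextureId back;  // receives the next decoded frame
    TextureId fore;  // currently used for display rendering

    // Exchange the roles of the two textures.
    void swap() { std::swap(back, fore); }
};

// Hypothetical helper: build one pair per stream. The ids 2*i+1 and 2*i+2
// are placeholders for names that glGenTextures would return in a real program.
inline std::vector<PingPongPair> makePairs(std::size_t numStreams) {
    std::vector<PingPongPair> pairs;
    pairs.reserve(numStreams);
    for (std::size_t i = 0; i < numStreams; ++i) {
        pairs.push_back({TextureId(2 * i + 1), TextureId(2 * i + 2)});
    }
    assert(pairs.size() == numStreams);
    return pairs;
}

// Hypothetical per-frame step for one stream. In the real program the
// decoded frame would be copied into p.back (for example with cuMemcpy2D)
// and p.fore would be drawn before the swap. Here we only return the id
// that would have been drawn and flip the pair for the next frame.
inline TextureId stepFrame(PingPongPair& p) {
    TextureId drawn = p.fore;
    p.swap();
    return drawn;
}
```

Calling stepFrame once per stream per frame reproduces the loop above: the texture that was filled during one frame becomes the one displayed in the next.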

Disappointingly, I get no performance increase at all: the fps is the same as with the code that uses one texture for all streams' decoding. How should I rewrite my code? I also found that my program can reach 99% VPU usage at maximum (according to GPU-Z's real-time monitor), so is there any need for the ping-pong optimization at all?