Integration of CUDA and OpenGL is slow on Linux

I have run into a problem on Linux.
I wrote a program that decodes H.264 videos with CUDA and renders them with OpenGL.
In my program, each thread decodes one video and renders it in its own window (the program runs several such threads).
However, I found that performance becomes almost 3x worse when I double the number of threads to decode twice as many videos.
Does anyone know why the performance degrades so much?
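
To make the thread layout concrete, here is a rough sketch of the one-thread-per-video structure described above. It is only an illustration, not the actual program: `decode_frame`, `render_frame` and `run_pipelines` are hypothetical names, the two stubs do no real CUDA or OpenGL work, and the per-thread window and context ownership noted in the comments is an assumption.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical placeholder for the CUDA H.264 decode step (does no real work).
static void decode_frame(int /*video_id*/) {}

// Hypothetical placeholder for the OpenGL draw and buffer swap (does no real work).
static void render_frame(int /*video_id*/) {}

// Starts one decode+render thread per video and returns the total number
// of frames processed across all threads.
int run_pipelines(int num_videos, int frames_per_video) {
    std::atomic<int> total_frames{0};
    std::vector<std::thread> workers;
    workers.reserve(num_videos);

    for (int id = 0; id < num_videos; ++id) {
        workers.emplace_back([id, frames_per_video, &total_frames] {
            // Assumption: in the real program each thread would create its
            // own window and make its own GL context current here.
            for (int f = 0; f < frames_per_video; ++f) {
                decode_frame(id);   // decode the next frame of video `id`
                render_frame(id);   // draw it to that video's window
                ++total_frames;
            }
        });
    }

    for (auto& t : workers) {
        t.join();
    }
    return total_frames.load();
}
```

Calling `run_pipelines(2, 100)` and then `run_pipelines(4, 100)` with the stubs replaced by the real decode and render calls, and timing each call, would be one way to reproduce the comparison between the original and the doubled thread count.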