Hi guys, I need some advice. Here is what I want to do:
I am implementing a video player that decodes images on the GPU and displays them with OpenGL. Since it gives me better performance, I always decode multiple frames at once in a single kernel execution (say, a GOP size of 8). This kernel execution takes significantly longer than 1/24 s, but it decodes 8 frames in less than 8/24 s, so in theory real-time playback at 24 fps should be feasible.
Now, when I start a sequence, the player suffers from quite regular hiccups, i.e. it stutters: a couple of frames play in real time, then the player pauses for a moment, and so on.
Is this because the OpenGL drawing functions have to wait for the really long decoding kernel execution? Or can OpenGL functions and CUDA kernels run concurrently? Btw, I am running OpenGL and CUDA in two different host threads and have the contexts share the PBOs that are used to copy the decoded frames from a CUDA buffer to an OpenGL texture.
If you’re using the same card for both CUDA and display, you’ll run into problems. The GPU cannot run two kernels simultaneously… and OpenGL is effectively a kernel. So you need to make sure that your kernels are significantly shorter than 1/24 of a second in order to give the display enough time slices.
This doesn’t preclude you from doing your buffered 8-at-once strategy. Just structure your computation as several shorter, sequential kernel calls. That can be annoying, but perhaps not too bad depending on your algorithm.
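As a rough illustration of what this restructuring might look like, here is a hedged sketch: one short launch per frame instead of one long launch per GOP, with a synchronize between launches so the display driver gets a time slice. The kernel name `decode_frame_kernel`, the buffers `d_bitstream`/`d_frames`, and the launch configuration are all placeholders for whatever the real decoder uses.

```cuda
#define GOP_SIZE 8

// Placeholder per-frame decode kernel; the real one depends on the codec.
__global__ void decode_frame_kernel(const unsigned char *bitstream,
                                    unsigned char *frame, int frame_idx)
{
    /* ... per-frame decode work ... */
}

void decode_gop(const unsigned char *d_bitstream, unsigned char *d_frames,
                size_t frame_bytes)
{
    dim3 grid(64), block(256);   // placeholder launch configuration

    for (int i = 0; i < GOP_SIZE; ++i) {
        decode_frame_kernel<<<grid, block>>>(d_bitstream,
                                             d_frames + i * frame_bytes, i);
        // Force this launch to finish before queuing the next one, which
        // guarantees the driver a gap between kernels for display work.
        cudaThreadSynchronize();  // CUDA 2.x-era API
    }
}
```

Each launch then has to stay well under 1/24 s on its own; the total for the whole GOP can still be under 8/24 s as before.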
Don’t be too concerned with the overhead of even dozens of kernel launches, it’s quite minor in practice.
If it’s possible, you might want to consider looking at the problem from a different angle. Instead of computing eight frames at once, would it be possible to split each frame into 16 (or more) sub-frames, compute those sub-frames in parallel, and calculate the frames sequentially? This works well for some problems and not for others, but it may be worth considering. It would let you calculate a frame, call OpenGL, calculate the next frame, call OpenGL, and so on.
@Ailleur: Yes, I do have to use PBOs to make my decoded frames available to OpenGL. I register them with CUDA only once at the beginning, and only map and unmap them as each frame is decoded. But no, I cannot confirm that these are time-consuming operations; they execute in <1 ms. Possibly this issue was fixed with CUDA 2.0.
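For reference, the register-once, map-per-frame pattern described above looks roughly like this with the CUDA 2.x OpenGL interop API. This is only a sketch: the buffer name `pbo`, the device pointer `d_decoded`, and the helper names are placeholders, and error checking is omitted.

```cuda
GLuint pbo;   // created once elsewhere with glGenBuffers/glBufferData

void init_interop(void)
{
    // Register the PBO with CUDA once, at startup.
    cudaGLRegisterBufferObject(pbo);
}

void upload_frame(const void *d_decoded, size_t frame_bytes)
{
    void *d_pbo = 0;

    // Map only for the duration of one copy, then unmap so the OpenGL
    // context can use the buffer again.
    cudaGLMapBufferObject(&d_pbo, pbo);
    cudaMemcpy(d_pbo, d_decoded, frame_bytes, cudaMemcpyDeviceToDevice);
    cudaGLUnmapBufferObject(pbo);

    // The OpenGL thread then binds the PBO and calls glTexSubImage2D
    // to update the texture from it.
}
```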
@SPWorley: This is exactly what I was afraid of. I wish the architecture were designed so that multiple kernels could run at once, but there are probably many good reasons why they can’t. I had the same workaround you suggested in mind and will now tackle this.
@ColinS: Unfortunately I cannot reveal exactly what kind of codec I am working on, but your suggestion does not apply to my problem. Thanks anyway.
I have successfully separated my big fat kernel function into a bunch of shorter ones. But now, when I play back a video at 24 fps, after 3 or 4 seconds the display turns gray and I have to restart my PC. The only advance warning I get is that the GPU fan noise level rises significantly, but I guess that’s normal when I keep the GPU so busy.
Has anyone experienced a similar problem? Does the Nvidia driver create a crash log of some sort?
According to PC Wizard 2008, the GPU temperature rises from its normal 65°C to 86°C before the PC crashes with a gray screen. Ambient temperature rises from 51°C to 55°C.
And anyway, there is plenty of space in my PC case, as the GTX-280 is my only card so far. Temperature shouldn’t be an issue … I hope.