Writing an Optimized Decoder

I’m trying to write an optimized decoder. I’ve reviewed the samples (which are generally very helpful, but none of them is optimized) and the documentation. I’m stuck at this point on page 17 of the API Reference: “This thread checks if there are any decoded frames available”. I may be missing something obvious, but how exactly do I do this? Specifically, I need one thread feeding NAL units into the decoder (cuvidDecodePicture()) and another pulling decoded frames out. How exactly do the two threads synchronize? How and when do they block and wake up? Sample code would be a terrific help. Thanks.
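To make the question concrete, here is the generic pattern I would guess at: a minimal C++ sketch of a bounded queue guarded by a mutex and two condition variables. It makes no CUVID calls. `Frame`, `FrameQueue`, and `run_demo` are hypothetical names of my own, and whether the decoder API expects this kind of hand-off at all is an assumption.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a decoded picture; not a CUVID type.
struct Frame { int index; };

// Bounded queue shared by the feeding thread and the output thread.
class FrameQueue {
public:
    explicit FrameQueue(std::size_t capacity) : capacity_(capacity) {}

    // Feeding thread: blocks while the queue is full.
    void push(Frame f) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push_back(f);
        not_empty_.notify_one();  // wake the output thread
    }

    // Feeding thread: signals end of stream.
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Output thread: blocks while the queue is empty.
    // Returns false once the queue is closed and drained.
    bool pop(Frame& out) {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop_front();
        not_full_.notify_one();  // wake the feeding thread
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::deque<Frame> q_;
    std::size_t capacity_;
    bool closed_ = false;
};

// Pushes frame_count frames from the calling thread and drains them on a
// second thread. Returns how many frames the second thread received.
int run_demo(int frame_count) {
    FrameQueue queue(4);
    int consumed = 0;
    std::thread consumer([&] {
        Frame f{};
        while (queue.pop(f)) {
            assert(f.index == consumed);  // frames arrive in push order
            ++consumed;
        }
    });
    for (int i = 0; i < frame_count; ++i) queue.push(Frame{i});
    queue.close();
    consumer.join();
    return consumed;
}
```

In this sketch the feeding thread would call `push` where it hands a picture to the decoder, and the output thread would loop on `pop`. What I can't tell from the documentation is where the real decoder calls fit into a pattern like this.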