Nobody is replying… interesting
OK, I believe I have found the cause of the CUDA_ERROR_LAUNCH_FAILED error. There appear to be two pretty critical values passed to the decoder in the CUVIDDECODECREATEINFO structure:
ulNumDecodeSurfaces: this relates directly to the size of the PicIndex circular buffer. I'm not sure what happens if this is too small, but I suspect it isn’t nice. There appears to be a limit of 32 on this value.
ulNumOutputSurfaces: I have no idea what this controls, but when I changed it from 2 to 10 the errors stopped.
I am running what is considered the decoder post-processing (mapping the frame, color conversion, conversion to RGB, and unmapping the frame) in a separate thread. I suspect that if that thread's queue grows longer than the number of decode surfaces, the PicIndex circular buffer wraps around onto frames that have not been post-processed yet, and things fail. Evidently, the number of output surfaces matters as well.
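To make that suspicion concrete, here is a rough single-threaded model in Python. It is purely illustrative and does not touch the NVCUVID API. It assumes, as guessed above and not from any documentation, that the decoder hands out surfaces round-robin, so frame n lands in slot n % ulNumDecodeSurfaces; count_clobbered and its parameters are made-up names. With an unbounded queue, a slow post-processing thread eventually falls a full lap behind, and the decoder overwrites a surface that is still waiting to be mapped.

```python
from collections import deque


def count_clobbered(num_decode_surfaces, num_frames, consume_every, max_pending=None):
    """Hypothetical model: the decoder writes frame n into slot n % num_decode_surfaces
    while a slower post-processing thread drains a FIFO queue.

    Returns how many frames were overwritten before being post-processed.
    max_pending=None means an unbounded queue (no backpressure); otherwise it must be >= 1.
    """
    pending = deque()  # frames decoded but not yet mapped/converted/unmapped
    clobbered = 0
    for frame in range(num_frames):
        # Backpressure: the decode side waits until post-processing frees queue space.
        while max_pending is not None and len(pending) >= max_pending:
            pending.popleft()
        # The previous occupant of this slot is frame - N; if it is still queued,
        # the decoder overwrites it before post-processing has read it.
        victim = frame - num_decode_surfaces
        if victim in pending:
            pending.remove(victim)
            clobbered += 1
        pending.append(frame)
        # Post-processing is slower: it finishes one frame per consume_every decodes.
        if frame % consume_every == consume_every - 1:
            pending.popleft()
    return clobbered


print(count_clobbered(8, 100, consume_every=2))                 # unbounded queue
print(count_clobbered(8, 100, consume_every=2, max_pending=8))  # queue capped at N
```

Capping the queue (making the decode thread wait) keeps post-processing within one lap of the ring. In the real decoder some surfaces are presumably also held as reference frames, so a safe cap is probably lower than ulNumDecodeSurfaces.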
What impact on memory does the number of output surfaces have? I have a 2GB GTX 670 card, but I would like my code to run on smaller cards with significantly less memory.
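A back-of-envelope estimate, assuming (not confirmed anywhere I can find) that each decode or output surface costs roughly one NV12 4:2:0 frame of width × height × 1.5 bytes. The helper names are hypothetical, and the driver may well allocate more for pitch alignment, padding, or internal state.

```python
def nv12_frame_bytes(width, height):
    """Bytes in one 4:2:0 (NV12) frame: width*height luma + width*height/2 interleaved chroma."""
    return width * height * 3 // 2


def estimated_surface_bytes(width, height, num_decode_surfaces, num_output_surfaces):
    """Rough total, ASSUMING every decode and output surface is one NV12 frame of this size.
    Real driver allocations will differ."""
    return (num_decode_surfaces + num_output_surfaces) * nv12_frame_bytes(width, height)


extra = (estimated_surface_bytes(1920, 1080, 8, 10)
         - estimated_surface_bytes(1920, 1080, 8, 2))
print(f"2 -> 10 output surfaces at 1080p: ~{extra / 1e6:.1f} MB more")
```

On that assumption, going from 2 to 10 output surfaces at 1920x1080 costs 8 × 3,110,400 = 24,883,200 bytes, which is small on a 2GB card but not negligible on a 256MB one.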
There is a comment in the cudaDecodeD3D9 sample saying that they limit decode memory to 24MB by capping the total pixel count of the decode surfaces at 16,777,216 (16M pixels × 1.5 bytes per pixel for 4:2:0 = 24MB). OK, that sounds reasonable, except that with most cards having at least 256MB available, isn’t 24MB a rather aggressive limit?
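As I read it, the sample's cap amounts to something like the loop below. This is a Python rendering, not the sample's actual code; max_decode_surfaces is a made-up name, the starting count is whatever the application requests, and the upper bound of 32 is the limit observed earlier in this post. Note that only decode surfaces enter the calculation; output surfaces don't appear at all.

```python
BYTES_PER_PIXEL_420 = 1.5            # 4:2:0: full-size luma plus quarter-size Cb and Cr
PIXEL_BUDGET = 16 * 1024 * 1024      # 16,777,216 pixels, i.e. 24MB at 1.5 bytes/pixel


def max_decode_surfaces(coded_width, coded_height, requested, hard_limit=32):
    """Shrink the requested decode-surface count until all surfaces fit the pixel budget."""
    n = min(requested, hard_limit)   # 32 is the limit observed in this thread, not a documented value
    while n > 1 and n * coded_width * coded_height > PIXEL_BUDGET:
        n -= 1
    return n


for w, h in [(1920, 1088), (1280, 720)]:
    print(f"{w}x{h}: {max_decode_surfaces(w, h, requested=20)} decode surfaces")
```

So at 1920x1088 (the usual H.264 coded size for 1080p) only 8 decode surfaces fit, while 1280x720 gets 18.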
And why is there no comment about the number of output surfaces and what effect that has on memory?