I am a developer at Civolution France, a company specializing in video applications.
We are new to CUDA development, but we aim to develop an application that decodes an H.264 stream, inserts a logo into it, and re-encodes the video to a new H.264 stream.
As we need high speed for this application, we want to use CUDA to speed up the decoding and encoding process.
We have identified an H.264 encoder SDK, provided by MainConcept, which is based on CUDA. For the decoding process, we tried the NVCUVID API from NVIDIA, which seems to be quite fast at decoding H.264 streams.
So, for test purposes, I started from the cudaDecodeGL sample provided by NVIDIA and added calls to the MainConcept SDK to encode to a new H.264 stream.
When using only one board (tested with a GeForce 470 and a Tesla C2050), the results obtained by this test program are quite good.
My problem comes when I start several instances of the program simultaneously, each on a different board. For these tests, we had access to a server with 8 Tesla C2050 boards (running Windows 7 Professional). Our test program takes a command-line option that selects which card to use. So I created a .bat file which runs 8 instances of the test program (one instance per Tesla board). When I ran the .bat file, the PC froze, then I got a blue screen and a reboot.
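For reference, the device selection in our test program looks roughly like the sketch below. The option name `-device=N` and the function name `parse_device_index` are only illustrative here, not taken from the NVIDIA sample. The returned index is what would then be passed as the ordinal to `cuDeviceGet` before creating the context.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Returns the board index given by a hypothetical "-device=N" option.
// Returns 0 when the option is absent, and -1 when the value is
// malformed or outside [0, device_count).
int parse_device_index(int argc, const char* const* argv, int device_count) {
    assert(device_count > 0);  // caller must have found at least one board
    const char* prefix = "-device=";
    const std::size_t prefix_len = std::strlen(prefix);
    int index = 0;  // default board when no option is given
    for (int i = 1; i < argc; ++i) {
        if (std::strncmp(argv[i], prefix, prefix_len) != 0) continue;
        const char* value = argv[i] + prefix_len;
        char* end = nullptr;
        const long parsed = std::strtol(value, &end, 10);
        if (end == value || *end != '\0') return -1;           // not a number
        if (parsed < 0 || parsed >= device_count) return -1;   // no such board
        index = static_cast<int>(parsed);
    }
    return index;
}
```

Rejecting an out-of-range index before touching the driver at least rules out a bad command line as the cause of the failures.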
When I run only one instance of the program, everything is fine, whichever Tesla board I use. I then did another test: I introduced a pause command in the .bat file between each launch of the test program, so that the instances do not all start at the same time. I did not get any blue screen then, but some of the instances returned an error code or crashed (up to 4 or 5 instances can work, but all the others return errors or crash).
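The staggered launch can also be written as a small launcher instead of a .bat file. This is only a sketch of the idea: the program name, the `-device=N` option, and the helper names `build_launch_commands` and `launch_staggered` are hypothetical, and the actual process creation is left to a `spawn` callback supplied by the caller because it is platform-specific.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Builds one command line per board: "<program> -device=<index>".
std::vector<std::string> build_launch_commands(const std::string& program,
                                               int device_count) {
    assert(device_count >= 0);
    std::vector<std::string> commands;
    for (int d = 0; d < device_count; ++d) {
        commands.push_back(program + " -device=" + std::to_string(d));
    }
    return commands;
}

// Hands each command to `spawn`, waiting `delay` between consecutive
// launches (no wait before the first one). Returns the number launched.
int launch_staggered(const std::vector<std::string>& commands,
                     std::chrono::milliseconds delay,
                     const std::function<void(const std::string&)>& spawn) {
    int launched = 0;
    for (std::size_t i = 0; i < commands.size(); ++i) {
        if (i > 0) std::this_thread::sleep_for(delay);
        spawn(commands[i]);
        ++launched;
    }
    return launched;
}
```

With a delay of zero this reproduces the simultaneous start of the first .bat file, and a delay of a few seconds reproduces the second test with the pause.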
So I repeated my tests using the unmodified decoding path of the cudaDecodeGL sample from NVIDIA (changed only so that I can select the device to run on), and I got the same results.
I also tried removing the OpenGL code from this sample (as we do not need any display in our case), but with the same result: one instance works fine, several instances do not (it can be fine up to 4 or 5 instances, but no more).
I have no idea where the problem is.
Does anyone know whether there is a limitation in the NVCUVID API? Is it supposed to run simultaneously on several NVIDIA boards?
If yes, can anyone provide a simple code sample that does this? (As we do not need any display, we do not need the OpenGL code.)