CUDA in a multithreaded environment

I am building a real-time signal processing system in which I run 8 CPU threads for data acquisition and 8 threads for writing this data into files. I keep the data read from the DAQ systems in arrays of ints. Once each DAQ system has read at least one set of data, I launch a separate thread which calls a CUDA kernel for signal processing. The CUDA thread cycles through the int DAQ arrays, picks up a block of data from each of the 8 DAQ arrays, processes them, and then moves on to the next block.

The problem I am facing is that when I run the program, the CUDA kernel generates a lot of garbage and incorrect values at its output. However, if I write the input to the CUDA kernel into a file and then run the CUDA kernel via a different process, reading the files as input, I get the correct results. So I guess I can safely assume that the input to the CUDA kernel is correct. But the CUDA kernel fails to give the correct result in real time, while it works fine with the same inputs offline. Does the multithreaded environment have something to do with this? I am thoroughly confused as to why this is happening.

P.S.: the 8 DAQ buffers are sufficiently long that there is no overflow or wraparound, and no other threads write to a buffer while it is being consumed; the consuming threads only read from the buffers.

You provide little detail about your host-thread-to-device setup, but from what you mention it seems your host threads are out of sync or inconsistent with respect to the data.

Make sure your host-to-device memory transfers complete before the kernel launches, and that this is guaranteed even when using multiple host threads.

I normally manage this by explicitly creating streams in the parent thread beforehand, and passing the streams and data pointers to the host child threads.

Or, simply try a cudaDeviceSynchronize() before the kernel launch.
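A sketch of that pattern (one stream and one device buffer per DAQ channel; all names, sizes, and the dummy kernel are illustrative assumptions): the parent thread creates the streams up front, each worker issues its async copy on its own stream, and the launcher synchronizes each stream before the kernel touches that channel's data.

```cuda
#include <cuda_runtime.h>

#define NUM_DAQ 8
#define BLOCK_LEN 1024

__global__ void process(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;  // stand-in for the real signal processing
}

int main() {
    cudaStream_t streams[NUM_DAQ];
    int *h_buf[NUM_DAQ], *d_buf[NUM_DAQ];

    // Parent thread: create streams and buffers once, then hand
    // (streams[i], h_buf[i], d_buf[i]) to worker thread i.
    for (int i = 0; i < NUM_DAQ; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMallocHost(&h_buf[i], BLOCK_LEN * sizeof(int));  // pinned, so async copies really overlap
        cudaMalloc(&d_buf[i], BLOCK_LEN * sizeof(int));
    }

    // Worker thread i, after its DAQ block is filled:
    for (int i = 0; i < NUM_DAQ; ++i)
        cudaMemcpyAsync(d_buf[i], h_buf[i], BLOCK_LEN * sizeof(int),
                        cudaMemcpyHostToDevice, streams[i]);

    // Launcher thread: guarantee each copy has completed before the
    // kernel reads that channel's data.
    for (int i = 0; i < NUM_DAQ; ++i) {
        cudaStreamSynchronize(streams[i]);  // or, more bluntly, one cudaDeviceSynchronize()
        process<<<(BLOCK_LEN + 255) / 256, 256, 0, streams[i]>>>(d_buf[i], BLOCK_LEN);
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Issuing the copy and the launch on the same stream also works, since operations within one stream execute in order; the failure mode to avoid is a copy on one stream (or thread) racing a launch on another with no synchronization between them.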

Note that features like texture references are not CPU-thread-safe; the same likely applies to several NPP functions.
See http://stackoverflow.com/questions/19662388/how-to-get-gpu-kernels-using-global-texture-references-thread-safe-for-multiple
and
https://devtalk.nvidia.com/default/topic/534966/can-npp-be-safely-used-in-multi-threaded-code-/

“However, if I write the input to the CUDA kernel into a file and then run the CUDA kernel via a different process, reading the files as input, I get the correct results. So I guess I can safely assume that the input to the CUDA kernel is correct. But the CUDA kernel fails to give the correct result in real time, while it works fine with the same inputs offline.”

I think your conclusions are unsound. If writing the input to a file (where it is verifiably correct, and probably unaffected by other system behavior) and then reading it back produces the correct result, then the CUDA kernel and its processing are probably sound. The most likely explanation is that in the real-time case the input to the CUDA kernel is (somehow) not the same as what was read in the file case. I would focus on verifying the data-delivery path rather than the CUDA kernel itself.