Asynchronous processing?

I have an application that needs to process a large number of individual, computationally isolated streams of data. The streams can be started asynchronously, and each would (presumably) have to be processed on a single GPU core, with some of its memory private and some shared.

I can’t find any good examples anywhere showing that this has been done before. I’ve gone through the tutorials and written some basic programs based on them, but I’m not seeing any way to asynchronously stream data into, and get data back from, a single process tied to a single GPU core.
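For concreteness, here is roughly the pattern I’m imagining, pieced together from the tutorials: one CUDA stream per data stream, with pinned host memory and cudaMemcpyAsync so the copies and kernels for different streams can overlap. This is an untested sketch; the kernel `processChunk`, the stream count, and the chunk size are all placeholders for my actual workload:

```cuda
#include <cuda_runtime.h>

// Placeholder per-stream kernel: each launch handles one chunk of one
// isolated data stream.
__global__ void processChunk(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;  // stand-in for the real per-stream work
}

int main()
{
    const int kStreams = 4;       // one CUDA stream per data stream (made up)
    const int kChunk   = 1 << 20; // elements per chunk (made up)

    float *hIn[kStreams], *hOut[kStreams], *dIn[kStreams], *dOut[kStreams];
    cudaStream_t streams[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        // Pinned host memory is needed for async copies to overlap with kernels.
        cudaMallocHost(&hIn[s],  kChunk * sizeof(float));
        cudaMallocHost(&hOut[s], kChunk * sizeof(float));
        cudaMalloc(&dIn[s],  kChunk * sizeof(float));
        cudaMalloc(&dOut[s], kChunk * sizeof(float));
        cudaStreamCreate(&streams[s]);
    }

    // Each stream copies in, computes, and copies out independently of the
    // others; work queued in different streams may overlap on the device.
    for (int s = 0; s < kStreams; ++s) {
        cudaMemcpyAsync(dIn[s], hIn[s], kChunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        processChunk<<<(kChunk + 255) / 256, 256, 0, streams[s]>>>(
            dIn[s], dOut[s], kChunk);
        cudaMemcpyAsync(hOut[s], dOut[s], kChunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();  // wait for all streams before using results

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(hIn[s]);
        cudaFreeHost(hOut[s]);
        cudaFree(dIn[s]);
        cudaFree(dOut[s]);
    }
    return 0;
}
```

Even if something like that compiles and runs, I’m not sure it gives me what I described above: long-lived streams that can be started and fed independently, rather than a fixed batch of launches. Is this the intended model for that kind of workload?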

Is this possible with CUDA, or is CUDA strictly limited to parallelizing a single operation (like matrix arithmetic, etc.)?

Thanks in advance for any responses, pointers, etc.