Streams are part of the CUDA API, intended to expose more concurrency. In your code, you can create streams and launch kernels, memcpys, and memsets into them. A stream is like a FIFO queue of work for the GPU: each operation launched into a stream is guaranteed to complete before the next operation in that stream starts. Work in separate streams may execute concurrently if the hardware has resources available. For example, a GPU with two copy engines can execute multiple kernels, a host-to-device memcpy, and a device-to-host memcpy all at the same time, as long as those operations are in different streams.
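A minimal sketch of the pattern described above: two streams, each running its own host-to-device copy, kernel, and device-to-host copy. The kernel name `scale` and the array sizes are illustrative, not from the original text. Within each stream the three operations run in FIFO order; across the two streams they may overlap if the GPU has a free copy engine and SM capacity.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int N = 1 << 20;
    float *h[2], *d[2];
    cudaStream_t s[2];

    for (int k = 0; k < 2; ++k) {
        cudaStreamCreate(&s[k]);
        // Pinned host memory is required for async copies to actually overlap.
        cudaMallocHost(&h[k], N * sizeof(float));
        cudaMalloc(&d[k], N * sizeof(float));
        for (int i = 0; i < N; ++i) h[k][i] = 1.0f;
    }

    for (int k = 0; k < 2; ++k) {
        // Each stream is a FIFO: copy-in, kernel, copy-out execute in order
        // within the stream, but stream 0 and stream 1 may run concurrently.
        cudaMemcpyAsync(d[k], h[k], N * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(N + 255) / 256, 256, 0, s[k]>>>(d[k], N);
        cudaMemcpyAsync(h[k], d[k], N * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }

    cudaDeviceSynchronize();  // wait for both streams to drain
    printf("h[0][0] = %f, h[1][0] = %f\n", h[0][0], h[1][0]);

    for (int k = 0; k < 2; ++k) {
        cudaStreamDestroy(s[k]);
        cudaFreeHost(h[k]);
        cudaFree(d[k]);
    }
    return 0;
}
```

In a profiler timeline, the two `cudaMemcpyAsync` calls and the two kernel launches here would appear on two separate stream rows, with the overlap visible wherever the hardware ran them concurrently.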
See here for more info, particularly the section about streams.
The trace tools then display which stream each workload executed in. Since a stream is a serialized sequence of work, it makes sense to display each stream as its own row on the timeline.