Thread vs Stream: what is the difference?

I am quite puzzled looking at Thread and Stream (CUDA Toolkit 3.2).
Can somebody explain the differences and similarities between these two objects?

Streams are a feature of the CUDA APIs which allow for concurrent operations within a single GPU context. Initially this was limited to overlapping copying with kernel execution, but on Fermi hardware it has been extended to permit (resources allowing) concurrent kernel execution on a GPU.

On the GPU, a thread is the basic execution element. You write kernel code for a single thread, tell the device how you want those threads assembled into blocks and how many blocks you wish to run, and the execution model collects SIMD groupings (“warps”) of threads and schedules them on multiprocessors.
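To make the thread side concrete, here is a minimal sketch of a kernel written for a single thread, plus the launch that assembles threads into blocks. The names `scaleKernel` and `scaleOnDevice` and the block size of 256 are my own choices for illustration, not anything from the toolkit, and the code assumes a reasonably recent toolkit (it uses `nullptr`).

```cuda
#include <cuda_runtime.h>
#include <vector>

// The kernel body describes the work of ONE thread: scale one element.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n)                                      // the last block may have idle threads
        data[i] *= factor;
}

// Hypothetical host helper: scales `host` in place on the GPU.
// Returns true if every CUDA call succeeded.
bool scaleOnDevice(std::vector<float> &host, float factor)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return true;

    size_t bytes = n * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // Tell the device how threads are grouped: 256 threads per block (an assumed
    // value), and enough blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(dev, factor, n);
    bool ok = (cudaGetLastError() == cudaSuccess);  // catches bad launch configurations

    // This blocking copy also waits for the kernel to finish.
    ok = (cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) && ok;
    cudaFree(dev);
    return ok;
}
```

Calling `scaleOnDevice(v, 3.0f)` on a vector of 2.0f values should leave every element at 6.0f. Note that no stream appears anywhere: everything here runs in the default stream.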

The GPU hardware has NO knowledge of a “Stream”. It only knows about spawning threads and executing kernels.

Contexts and streams are NVIDIA driver concepts. A stream is an in-order channel of GPU operations.
Every context has a default stream.
However, sometimes the application needs to leverage certain GPU features, such as:

  1. Concurrent memory copy and kernel execution
  2. Concurrent kernel execution

To leverage these, the NVIDIA driver allows multiple streams of execution inside a single GPU context.
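As a rough sketch of the first feature, the code below splits one buffer into two chunks and issues copy-in, kernel, and copy-out for each chunk in its own stream, so the copy for one chunk may overlap the kernel for the other if the hardware allows it. The names `addOne` and `addOneWithStreams` and the choice of two streams are assumptions for illustration. The host buffer must be page-locked (allocated with `cudaMallocHost`) for the copies to be truly asynchronous, and `cudaDeviceSynchronize` assumes a toolkit newer than 3.2.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: adds 1 to every element of a pinned host buffer,
// processing each half of the buffer in a separate stream.
bool addOneWithStreams(float *pinnedHost, int n)
{
    const int kStreams = 2;                       // assumed stream count
    int chunk = (n + kStreams - 1) / kStreams;    // elements per stream, rounded up

    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < kStreams; ++s) {
        int offset = s * chunk;
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        if (count <= 0) break;
        size_t bytes = count * sizeof(float);

        // Operations within one stream run in order; different streams may overlap.
        cudaMemcpyAsync(dev + offset, pinnedHost + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(count + 255) / 256, 256, 0, streams[s]>>>(dev + offset, count);
        cudaMemcpyAsync(pinnedHost + offset, dev + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams to drain before touching the results on the host.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    return ok;
}
```

Whether any overlap actually happens depends on the device; the code is correct either way, because each stream still executes its own three operations in order.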

Hope this helps,

Thanks avidday and Sarnath.

So in my case of performing an FFT (simple 1D R2C), the cuFFT library takes care of the threads, while I can use streams through cufftSetStream().

You mention that may be


In my case, the deviceQuery SDK sample reports the following for my GTX460:

  1. Concurrent Copy & Execution = Yes

  2. Concurrent Kernel execution = Yes

But should I use cudaSetDeviceFlags, and if so, how do I find the right flag?



No, that shouldn’t be necessary. Just use the runtime API's stream functions to create as many streams as you need for concurrent copying and execution, and pass the streams to CUFFT and the copy APIs as per the documentation. If you want to see how streams can work in practice, have a look at the SDK simpleStreams example.
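For what it's worth, a minimal sketch of that recipe for a single 1D R2C transform might look like the following. The helper name `fftR2COnStream` is hypothetical, both host buffers are assumed to be page-locked, and the program needs to be linked against cuFFT (for example with `-lcufft`). The cuFFT calls used are `cufftPlan1d`, `cufftSetStream`, `cufftExecR2C`, and `cufftDestroy`.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: forward 1D real-to-complex FFT of n samples, with the
// copies and the transform all issued into one user-created stream.
// hostIn  : n floats (assumed pinned)
// hostOut : n/2 + 1 cufftComplex values (assumed pinned)
bool fftR2COnStream(const float *hostIn, cufftComplex *hostOut, int n)
{
    size_t inBytes  = n * sizeof(cufftReal);
    size_t outBytes = (n / 2 + 1) * sizeof(cufftComplex);  // R2C keeps n/2+1 bins

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    cufftReal *devIn = nullptr;
    cufftComplex *devOut = nullptr;
    cufftHandle plan;
    bool havePlan = false;

    bool ok = cudaMalloc(&devIn, inBytes) == cudaSuccess &&
              cudaMalloc(&devOut, outBytes) == cudaSuccess;
    if (ok) {
        havePlan = (cufftPlan1d(&plan, n, CUFFT_R2C, 1) == CUFFT_SUCCESS);
        ok = havePlan;
    }
    // Attach the plan to our stream so its kernels are queued there.
    if (ok) ok = (cufftSetStream(plan, stream) == CUFFT_SUCCESS);

    if (ok) {
        cudaMemcpyAsync(devIn, hostIn, inBytes, cudaMemcpyHostToDevice, stream);
        ok = (cufftExecR2C(plan, devIn, devOut) == CUFFT_SUCCESS);
        cudaMemcpyAsync(hostOut, devOut, outBytes, cudaMemcpyDeviceToHost, stream);
        // Block until everything queued in this stream has finished.
        ok = (cudaStreamSynchronize(stream) == cudaSuccess) && ok;
    }

    if (havePlan) cufftDestroy(plan);
    cudaFree(devOut);   // cudaFree on a null pointer is a no-op
    cudaFree(devIn);
    cudaStreamDestroy(stream);
    return ok;
}
```

With a single stream there is nothing to overlap with; the benefit would come from running several such transforms, each with its own plan and stream, over different chunks of data. Creating the plan once and reusing it is also likely to matter more for performance than this sketch suggests.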

Thanks. I had a look at the simpleStreams example. It is already quite complicated for a newcomer to the field.

I recently posted a topic in this forum to check whether I am doing this correctly. Have you had a look at it? I have had no reply yet.


I don’t use CUFFT at all, so I can’t help you with that sort of performance tuning.