number of simultaneous threads

Hey all,

I'm considering buying a GTS 250, but before I do I want to be clear on a threads question. The CUDA manual says it has 16 multiprocessors (128 processors).

So does this mean that in a CUDA program there can be up to 512x128 threads running simultaneously? Or are threads simply a software concept, with only 128 threads running simultaneously? (SIMULTANEOUS being my key question.)

Thanks for any help.


Physically, only 128 threads can be executed simultaneously.

However, you should assume that all threads execute simultaneously when designing your parallel program.

For example, you need __syncthreads() to synchronize the threads in a thread block. If you could determine the order in which the warps run, you could avoid __syncthreads(). However, avoiding __syncthreads() is dangerous: sometimes it works, sometimes it does not.

Actually, more than 128 threads will be in flight simultaneously, since the processors are heavily multithreaded to hide latency. When a thread is stalled on a memory access, for example, the processor simply executes a different thread. Each core switches to a different thread every 4 clock cycles (which works out to 1 instruction across a warp), choosing among a number of threads defined by the occupancy of the kernel. That is, the fewer registers and the less shared memory a kernel uses, the more threads the processor can switch between at a time, and the better it can hide latency. Consider 96 threads per multiprocessor a bare minimum for decent performance.
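To make the occupancy point concrete, here is a rough host-side C++ sketch of how registers and shared memory cap the number of threads a multiprocessor can switch between. The limits in SmLimits (768 threads, 8 blocks, 8192 registers, 16 KB shared memory per multiprocessor) are my assumptions for a compute capability 1.1 card like the GTS 250, and the struct and function names are made up for illustration. It also ignores allocation granularity, so treat the result as an estimate, not what the hardware will report.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor limits for a compute capability 1.1 part.
// These are illustrative defaults; check the programming guide for your card.
struct SmLimits {
    int max_threads = 768;     // resident threads per multiprocessor
    int max_blocks = 8;        // resident blocks per multiprocessor
    int registers = 8192;      // 32-bit registers per multiprocessor
    int shared_bytes = 16384;  // shared memory per multiprocessor
};

// Rough estimate of how many threads one multiprocessor can keep resident
// for a kernel. Ignores register/shared-memory allocation granularity.
int resident_threads_per_sm(int threads_per_block, int regs_per_thread,
                            int shared_per_block, SmLimits sm = SmLimits{}) {
    assert(threads_per_block > 0 && regs_per_thread > 0 && shared_per_block >= 0);
    int blocks = sm.max_blocks;
    // Limit from the thread count.
    blocks = std::min(blocks, sm.max_threads / threads_per_block);
    // Limit from the register file: every thread of a block needs its registers.
    blocks = std::min(blocks, sm.registers / (regs_per_thread * threads_per_block));
    // Limit from shared memory, if the kernel uses any.
    if (shared_per_block > 0)
        blocks = std::min(blocks, sm.shared_bytes / shared_per_block);
    return blocks * threads_per_block;
}
```

With these assumed limits, a kernel with 256 threads per block and 10 registers per thread fits 768 threads on a multiprocessor, while the same block size at 32 registers per thread fits only 256, so there are far fewer warps to switch to when one stalls.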

Yeah, it all depends on your definition of “simultaneous”. :)

The multiprocessors are designed to switch rapidly between threads without the usual context switch required on a CPU. (When you have 16,384 registers at your disposal, you can do stuff like that.) And you’ll have instructions from nearly all your threads at some stage in the pipeline of one of the stream processors.

This is the interesting thing about streaming processors. A streaming processor can run 4 threads simultaneously using the “pipeline” concept.

So basically each core can run 4 threads in parallel, which makes 32 threads per multiprocessor (4*8), which is a warp :rolleyes: (so now I know how a warp works).
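The 4*8 arithmetic can be written down as a tiny C++ sketch. The function name is made up for illustration, and it assumes each core handles one thread of the warp per clock, which is the simplified picture used in this thread, not a statement about the real pipeline.

```cpp
// Clock cycles one multiprocessor needs to issue a single instruction for
// a whole warp, assuming each core takes one thread of the warp per clock.
constexpr int clocks_per_warp_instruction(int warp_size, int cores_per_sm) {
    return (warp_size + cores_per_sm - 1) / cores_per_sm;  // round up
}

// A warp of 32 threads on 8 cores takes 4 clocks per instruction.
static_assert(clocks_per_warp_instruction(32, 8) == 4,
              "32 threads over 8 cores should take 4 clocks");
```

Read the other way round, 8 cores times 4 clocks gives the 32 threads of one warp.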

But there is one more thing: the “thread scheduler”, a unit that schedules threads and blocks on the GPU.

This scheduler can schedule up to 768 threads or 8 blocks on an SM, so instead of just running one warp in parallel, the GPU actually switches context among the warps that are ready to run.

So you can schedule 768 threads on an SM, which makes 768*16 = 12,288 threads in flight on a 16 SM device.

I picked up this hardware concept in a recent webinar. Can somebody confirm this?

As far as I know, each multiprocessor can really compute 32 threads (one warp) in parallel.

That means you get 16 * 32 threads of “hardware parallelism”.
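The different answers in this thread come from counting at different levels. Here is a small C++ sketch that puts the three counts side by side. The function names are made up, and the inputs (16 multiprocessors, 8 cores each, warps of 32, up to 768 scheduled threads per multiprocessor) are just the figures quoted by posters above.

```cpp
// Threads that can execute in the very same clock cycle: one per core.
constexpr int threads_per_clock(int sm_count, int cores_per_sm) {
    return sm_count * cores_per_sm;
}

// Threads belonging to the warps being issued at one moment,
// counting one warp per multiprocessor.
constexpr int threads_in_issuing_warps(int sm_count, int warp_size) {
    return sm_count * warp_size;
}

// Threads the scheduler can keep resident and switch between.
constexpr int resident_threads(int sm_count, int max_threads_per_sm) {
    return sm_count * max_threads_per_sm;
}
```

For a 16 multiprocessor card these give 128, 512 and 12,288 respectively, so which one is "simultaneous" depends on whether you count cores, issuing warps, or scheduled threads.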

Actually, the pipelines are more than 4 stages deep. (We haven’t been given a precise number, but the estimate from some people is ~20 stages, much like a modern CPU.)

This is correct, yes. Warps which are not waiting on a memory load or store can be sent into the pipeline for processing.