What is the actual limit on simultaneously running threads? As in, is it possible for more than one block to run at once?

I was wondering what the actual number of threads that run simultaneously is. I know it’s up to 1024 threads per block, but how many blocks can run at once, and how many grids?

Many Thanks,
Phill

It depends on your resource usage (registers, shared memory, etc), but:

It is possible to run more than one block at once per Multiprocessor. [But you are not in control of this]

As for grids, I’m not sure, but I think it’s 1 per kernel call (and it is possible to have more than one kernel call running simultaneously).

Up to 8 blocks per SM or 1536 threads on GF100, depending on other limits (registers, shared memory, etc).

You can have up to 16 kernels executing concurrently from a single context.
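
If you want to check these numbers on your own card, here is a minimal sketch (my own illustration, not from any of these posts) that reads the per-SM resident-thread limit and the SM count from the CUDA runtime and multiplies them; it assumes device 0.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

    // Maximum resident threads per SM (1536 on GF100) times the number of SMs
    // gives the upper bound on simultaneously resident threads for the device.
    int residentThreads = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;

    printf("SMs:                          %d\n", prop.multiProcessorCount);
    printf("Max resident threads per SM:  %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident threads (total): %d\n", residentThreads);
    return 0;
}
```

Whether you actually reach that bound still depends on registers, shared memory, and block size, as noted above.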

Whilst we’re on this topic, I have a follow-up question:

If one has one device running from one context with a theoretical limit of N ‘simultaneous’ threads, and one then runs K kernels concurrently, will:

  1. NK threads be running simultaneously, or

  2. N threads be running simultaneously (but each kernel with N/K’th of the resources)?

  1. None of the above. If you can launch N threads on your GPU, then launching K kernels of N threads each (set up so that they can run concurrently) will run at most 2 kernels at a time.

(and in that one nugget, you now know everything you need to know about concurrent kernels on GF100)

Just to clarify, what you’re saying is that, on GF100, K is limited to 2, and so at most you can have 2N threads running simultaneously (where N is the maximum number of simultaneous threads in one kernel call)?

i.e. 1536 x numberOfSMs x K(=2) [GF100]

Let N be the number of possible threads resident on the GPU at a time for a given kernel A.

You launch K copies of kernel A in separate streams, of N threads each.

Assuming that every warp has the same runtime, you will never have more than two copies of kernel A resident at the same time.
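
For concreteness, a minimal sketch of what “launching K copies of kernel A in separate streams” looks like (my own example; kernelA, K, and the launch configuration are arbitrary assumptions, not taken from this thread):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;                       // trivial work so the launch is valid
}

int main()
{
    const int K       = 4;                 // number of concurrent launches (assumed)
    const int threads = 256;
    const int blocks  = 64;
    const int n       = threads * blocks;  // roughly the "N" threads per launch

    float       *buf[K];
    cudaStream_t streams[K];

    for (int k = 0; k < K; ++k) {
        cudaMalloc(&buf[k], n * sizeof(float));
        cudaStreamCreate(&streams[k]);
        // Same kernel, different stream: no dependencies between the launches.
        kernelA<<<blocks, threads, 0, streams[k]>>>(buf[k]);
    }

    cudaDeviceSynchronize();

    for (int k = 0; k < K; ++k) {
        cudaStreamDestroy(streams[k]);
        cudaFree(buf[k]);
    }
    printf("done\n");
    return 0;
}
```

The streams only remove the dependencies between launches; whether any two of them actually overlap on the device is still governed by the resident-thread and block limits discussed above.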

But you will have 2? I.e. a total of 2N resident threads?

Nope, still N threads.

But then it’s not really concurrent? More like interleaved?

Sorry to bombard you with questions, but where can one find this out? There are but a few sentences in the 3.1.1 Programming Guide about Concurrent Kernel Execution.

Edit:
I guess this sort of explains it visually:
[attached image]

No, it’s concurrent, but just because you say there are no dependencies between launches doesn’t mean the machine will run them all simultaneously; other resource limits still apply.