I was wondering what the actual number of threads that run simultaneously is. I know it's up to 1024 threads per block, but how many blocks can run at once, and how many grids?
Many Thanks,
Phill
It depends on your resource usage (registers, shared memory, etc), but:
It is possible to run more than one block at once per Multiprocessor. [But you are not in control of this]
As for grids, I'm not sure, but I think one per kernel call (and it is possible to have more than one kernel running simultaneously).
Up to 8 blocks per SM or 1536 threads on GF100, depending on other limits (registers, shared memory, etc).
You can have up to 16 kernels executing concurrently from a single context.
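The per-device limits being discussed here can be read straight off the hardware at runtime. A minimal sketch (device 0 assumed; error checking omitted) querying the relevant fields of cudaDeviceProp:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Concurrent kernels:    %s\n", prop.concurrentKernels ? "yes" : "no");

    // Upper bound on simultaneously resident threads for one kernel,
    // before register/shared-memory limits are taken into account:
    printf("Theoretical resident threads: %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

The actual number resident for a given kernel will usually be lower, once its register and shared-memory usage are factored in.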
Whilst we’re on this topic, I have a follow up question -
If one has one device running from one context with a theoretical limit of N 'simultaneous' threads, and one then runs K kernels concurrently, will:
N·K threads be running simultaneously, or
N threads be running simultaneously (but each kernel with only a 1/K share of the resources)?
(and in that one nugget, you now know everything you need to know about concurrent kernels on GF100)
Just to clarify, what you're saying is that, on GF100, K is limited to 2, and so at most you can have 2N threads running simultaneously (where N is the maximum number of simultaneous threads in one kernel call)
i.e. 1536 × numberOfSMs × K(=2) [GF100]
let N be the number of possible threads resident on the GPU at a time for a given kernel A.
you launch K copies of kernel A in separate streams of N threads each.
assuming that every warp has the same runtime, you will never have more than two copies of kernel A resident at the same time.
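To make the "K copies of kernel A in separate streams" scenario concrete, a minimal sketch (the kernel body, grid size, and K = 4 are all placeholder assumptions; error checking omitted):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;  // trivial placeholder work
}

int main() {
    const int K = 4;                       // number of concurrent launches (assumption)
    const int blocks = 64, threads = 256;  // N threads per launch (assumption)
    const int n = blocks * threads;

    float *buf[K];
    cudaStream_t streams[K];

    for (int k = 0; k < K; ++k) {
        cudaMalloc(&buf[k], n * sizeof(float));
        cudaStreamCreate(&streams[k]);
        // Launches in different non-default streams have no implied
        // ordering, so the hardware MAY overlap them -- but, as noted
        // above, only to the extent that resources (resident threads,
        // registers, shared memory) allow.
        kernelA<<<blocks, threads, 0, streams[k]>>>(buf[k]);
    }

    cudaDeviceSynchronize();
    for (int k = 0; k < K; ++k) {
        cudaStreamDestroy(streams[k]);
        cudaFree(buf[k]);
    }
    return 0;
}
```

If each launch already fills the device with N resident threads, the copies end up draining into one another rather than truly doubling the resident count, which is the point being made above.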
But you will have 2? i.e. a total of 2N resident threads?
Nope, still N threads.
But then it’s not really concurrent? More like interleaved?
Sorry to bombard you with questions, but where can one find this out? There are but a few sentences in the 3.1.1 Programming Guide about Concurrent Kernel Execution
Edit:
I guess this sort of explains it visually:
External Media
No, it's concurrent, but just because you say there are no dependencies between launches doesn't mean the machine will run them all simultaneously; other resource limits still apply.