Concurrently running blocks from multiple kernels on the same SM (Fermi and the unified shader architecture)

Hi There,

Fermi allows us to concurrently run multiple kernels. The unified shader architecture also allows different shaders (fragment/geometry/vertex) to run at the same time, so in a sense the two are similar. To me, it looks like NVIDIA exposed to programmers, with the Fermi architecture, a capability that had already existed since the unified shader architecture. If this assumption is right, I was wondering what kept NVIDIA from exposing this feature in the architectures released earlier than Fermi? What was the bottleneck, and how was it solved?

I am also wondering about the scheduling details of the blocks on these architectures.

I think there is only one scheduling queue for the blocks launched from different kernels, and the scheduler issues these blocks in the order they were received. Also, I think the scheduler keeps sending these blocks to SMs as long as there are free resources. Assuming that there are no dependencies between the kernels (e.g. in Fermi each kernel is launched from a different stream), I was wondering if blocks from different kernels can run together on the same SM at the same time.

For instance, let's assume kernel1 is launched before kernel2 and the number of blocks of kernel1 is not enough to fill the machine. In that case, is it possible that one SM runs blocks of kernel1 (the last blocks of kernel1) and blocks of kernel2 (the first blocks of kernel2) at the same time?

Or is this not possible, so that blocks from different kernels cannot reside on the same SM at the same time? Maybe the scheduler only allows blocks from the same kernel/shader to run on one SM at any given time?
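For concreteness, here is a minimal sketch of the launch pattern I am asking about, assuming two independent kernels issued into separate non-default streams; the kernel names, bodies, and launch configurations are just placeholders:

```
// Minimal sketch: two independent kernels in separate streams, so the hardware
// is free to overlap them (and, with concurrent-kernel hardware, possibly to
// place their blocks on the same SM). kernel1/kernel2 are placeholders.
#include <cuda_runtime.h>

__global__ void kernel1(float *a) { a[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }
__global__ void kernel2(float *b) { b[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

void launchIndependently(float *a, float *b)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // kernel1 uses only a few blocks, so it cannot fill the machine by itself.
    kernel1<<<4, 256, 0, s1>>>(a);
    // kernel2 has no dependency on kernel1; its first blocks may start while
    // kernel1's last blocks are still running.
    kernel2<<<64, 256, 0, s2>>>(b);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```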

Thanks!

Pinar

Yes, an SM can run blocks from multiple kernels. Check out slide 52 of the Fermi Fundamental Optimizations slides from the SC10 tutorial:

http://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Fundamental_Optimizations.pdf

Hello there,

Thanks, paulius, for this tutorial; it looks quite helpful for us.

Arastirmaci was assuming that there are no dependencies between kernels.

But what happens if there are dependencies between kernels on Fermi? Can we assume that there is a global synchronization among blocks when a kernel ends, as on compute capability 1.x devices? If so, that means the kernels are not executed at the same time, right?

In my code I am observing that kernels with dependencies between them are not executed at the same time, but one after the other. I've checked that by printing something out in the kernel.

Another question I have is about shared memory usage. If an SM can host blocks from different kernels at the same time, the total shared memory allocated on the SM would be the sum of the shared memory allocated by the blocks running on it, even though the shared memory usage differs from block to block. I am having strange launching problems in certain cases, and I was wondering if that is the reason.
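As a small diagnostic for the shared memory question (my own sketch, with a hypothetical kernel), cudaFuncGetAttributes reports each kernel's static shared memory per block, and checking cudaGetLastError after a launch distinguishes a genuinely invalid configuration from blocks that are simply waiting for resources:

```
// Sketch only: query a kernel's per-block resource footprint and check for
// launch errors. kernelA is a hypothetical stand-in.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA(float *a)
{
    __shared__ float tile[1024];              // 4 KB static shared memory per block
    tile[threadIdx.x % 1024] = (float)threadIdx.x;
    __syncthreads();
    if (a) a[threadIdx.x] = tile[threadIdx.x % 1024];
}

void reportFootprint(const void *func, const char *name)
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, func);
    printf("%s: %zu bytes static smem per block, %d registers per thread\n",
           name, attr.sharedSizeBytes, attr.numRegs);
}

int main()
{
    reportFootprint((const void *)kernelA, "kernelA");

    // Blocks from different kernels can only share an SM if the sum of their
    // shared memory (and register) requests fits; otherwise a block simply
    // waits for an SM with room. A launch *error* below would instead mean the
    // configuration itself is invalid (e.g. too much dynamic shared memory).
    kernelA<<<8, 256>>>(nullptr);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();
    return 0;
}
```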

Thank you very much in advance

Kernels issued into the same stream (which is stream 0 if no explicit stream argument is provided) are assumed to be dependent. Therefore, no threadblock from the second kernel should be executed by the hardware until all threadblocks of the preceding kernel have completed.

Regarding your smem question: yes, the hardware will not launch a threadblock if sufficient resources are not available. Here the resources would be either registers or shared memory.
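A minimal illustration of the same-stream behaviour described above (my own sketch; produce/consume are made-up names): both launches go to the default stream, so the hardware treats them as dependent and will not start any block of the second kernel until every block of the first has finished.

```
#include <cuda_runtime.h>

__global__ void produce(float *buf) { buf[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f; }
__global__ void consume(float *buf) { buf[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

void dependentLaunch(float *buf)
{
    // No stream argument: both kernels go into stream 0 and are assumed dependent,
    // so consume() starts only after produce() has completely finished; this gives
    // the same implicit global synchronization at the kernel boundary as on
    // compute capability 1.x devices.
    produce<<<128, 256>>>(buf);
    consume<<<128, 256>>>(buf);
    cudaDeviceSynchronize();
}
```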

Paulius, thank you very much for answering the second part of my question. Do you have any comments on the first part:

“Fermi allows us to concurrently run multiple kernels. The unified shader architecture also allows different shaders (fragment/geometry/vertex) to run at the same time, so in a sense the two are similar. It looks like NVIDIA exposed to programmers, with the Fermi architecture, a capability that had already existed since the unified shader architecture. If this assumption is right, I was wondering what kept NVIDIA from exposing this feature in the architectures released earlier than Fermi? What was the bottleneck, and how was it solved?”

Thanks again!

arastirmaci

Paul,

I know you work for NVIDIA. But still… may I take the liberty of asking this question?

Slide 52 states:

"Scheduling:
– Kernels are executed in the order in which they were issued
– Threadblocks for a given kernel are scheduled if all threadblocks for preceding kernels have been scheduled and there still are SM resources available"

There are two ways to interpret the second bullet.

  1. “SM resources” can be taken as a single unit, meaning: if there are free SMs, threadblocks of the other kernel will be scheduled onto them.

  2. “SM resources” can be read as “resources belonging to an SM”. In that sense, threadblocks of different kernels can co-exist concurrently within a single SM at run time.

Which of the above interpretations is correct?

The second is correct. Tim Murray stated a while ago (not long after the introduction of Fermi) that a single SM can process blocks from multiple kernels at the same time. (Cool!)

Sure - concurrent kernels were not possible prior to Fermi.