Memory requests for 1 warp vs 32 blocks of 1 thread. (1 subcore vs 32 subcore) Grouping of memory re

I can imagine two situations (I am not sure if it can be forced to actually happen in practice but let’s suppose it can):

  1. One subcore executes 32 threads in parallel (a warp). The memory access is perfectly aligned and sequential within the warp.

The second situation is:

  2. 32 subcores each execute 1 thread in parallel. In principle the memory access is still perfectly aligned and sequential. (It’s the same code/kernel as situation 1.)

According to the guide memory requests can be “grouped” into one big memory request. (For some compute versions cache also comes into play, but let’s suppose it’s not in the cache).

Situation 1: The warp’s memory accesses are probably grouped into 1, 2 or 3 big memory requests (depending on the element type/size).
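As a rough check on that guess, here is a small host-side Python model (not GPU code) that counts how many fixed-size memory segments one warp's accesses touch. It assumes a 128-byte segment size and that thread i reads element i of a contiguous array. The function name and parameters are hypothetical and exist only for this illustration.

```python
def segments_touched(base_addr, elem_size, num_threads=32, segment_bytes=128):
    """Count distinct aligned segments touched when thread i reads
    elem_size bytes starting at base_addr + i * elem_size.

    segment_bytes=128 is an assumption about the transaction size.
    """
    touched = set()
    for i in range(num_threads):
        first = base_addr + i * elem_size
        last = first + elem_size - 1
        # An element can straddle a segment boundary if it is misaligned.
        for seg in range(first // segment_bytes, last // segment_bytes + 1):
            touched.add(seg)
    return len(touched)


for size in (4, 8, 16):
    print(size, segments_touched(0, size))
```

Under these assumptions, aligned sequential access with 4-, 8- and 16-byte elements touches 1, 2 and 4 segments respectively, and shifting the base address off a segment boundary adds one more.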

But what would happen in situation 2: would the individual memory accesses/requests of the subcores also be grouped into one big memory request?

Or would this become 32 individual memory requests?

CUDA terminology is confusing enough as it is, without coming up with new terms. :)

What do you mean by “subcore”? Are you referring to a “CUDA core” (as they now call it), also known as a “streaming processor” or “SP” in the old docs?

As far as I know, the GPU has “multi-processors”, and each multi-processor is divided further into “subcores”.

So by “multi-processor” I mean “streaming (multi) processor”, which you seem to refer to as SM or SP or SMP.

So by “subcore” I mean “CUDA core”, which you have no reference for ;)

OK, so the terminology used in the CUDA C Programming Guide is “multiprocessor” and “CUDA core”.

Given that, I can answer (or at least clarify) your question:

Situation 1 and situation 2 can’t happen, because that’s not how the hardware works.

In compute capability 1.x, a multiprocessor contained 8 CUDA cores, and all 8 CUDA cores processed the same warp. A warp instruction finished every 4 clock cycles, since the warp size is 32 and each CUDA core can complete one simple instruction per thread per clock. In compute capability 2.x, the multiprocessor has 32 or 48 CUDA cores that are grouped so that 16 CUDA cores process a single warp. (Double precision is an exception, in which case 32 cores work together on one warp.) Compute capability 2.0 devices complete 2 warp instructions (from different warps) every 2 clock cycles per multiprocessor. Compute capability 2.1 devices complete up to 3 warp instructions (from 2 warps) every 2 clock cycles.
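The cycle counts above follow from dividing the warp size by the number of cores working on one warp. A minimal Python sketch of that arithmetic, assuming one simple instruction per core per clock as stated above (the function name is hypothetical):

```python
WARP_SIZE = 32


def cycles_per_warp_instruction(cores_per_warp, warp_size=WARP_SIZE):
    """Clock cycles to push one simple instruction through a whole warp
    when each core completes one thread-instruction per clock."""
    # Ceiling division, in case the warp does not divide evenly.
    return -(-warp_size // cores_per_warp)


print(cycles_per_warp_instruction(8))   # compute capability 1.x: 8 cores per warp
print(cycles_per_warp_instruction(16))  # compute capability 2.x: 16 cores per warp
```

This gives 4 cycles for 8 cores and 2 cycles for 16 cores, matching the figures quoted for 1.x and 2.x.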

Keep in mind that CUDA cores are not autonomous. The instruction scheduler exists at the multiprocessor level, not at the CUDA core level. Memory coalescing happens at the warp instruction level, when threads in the same warp executing the same memory load/store instruction access memory in a consecutive pattern. There is no fusion of reads from different warps, or of different instructions within the same warp.
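To make the "no fusion across warps" point concrete, here is a host-side Python model (not GPU code) comparing one block of 32 threads with 32 blocks of 1 thread. It assumes 128-byte segments and 4-byte elements, both chosen only for illustration, and it merges accesses within a warp but never between warps. Since a warp never spans two blocks, 32 one-thread blocks are modeled as 32 warps with a single active thread each.

```python
SEGMENT_BYTES = 128  # assumed transaction granularity
ELEM_BYTES = 4       # assumed element size, e.g. a 32-bit float


def transactions(warps):
    """warps: list of warps, each a list of byte addresses accessed by that
    warp's active threads for one load instruction.

    Accesses are merged per warp only; nothing is merged across warps.
    Addresses are assumed aligned to ELEM_BYTES.
    """
    total = 0
    for addrs in warps:
        total += len({a // SEGMENT_BYTES for a in addrs})
    return total


# One block of 32 threads: a single full warp.
one_warp = [[i * ELEM_BYTES for i in range(32)]]
# 32 blocks of 1 thread: 32 warps, each with one active thread.
thirty_two_warps = [[i * ELEM_BYTES] for i in range(32)]

print(transactions(one_warp))
print(transactions(thirty_two_warps))
```

In this model the single warp issues 1 transaction while the 32 one-thread warps issue 32, even though the addresses are identical.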

Ok, now I understand.

So each CUDA core only executes 1 thread.

So the GT 520 has 48 CUDA cores, so it’s like a 48-core processor.

So it can execute at most 48 threads in parallel.

The “warp” is a “grouping” technology, which tries to take as many threads as possible and “group” them into warps.

So this probably means the 3D thread block and 3D grid are turned into a 2D or perhaps even 1D situation, where threads are simply numbered from 0 to N (or 0 to N and 0 to M) and regrouped by the warp scheduler, which distributes them across CUDA cores and tries to let the CUDA cores work together in a warp. The warp is probably some memory access/fusion technology that tries to make memory access efficient, and perhaps execution too, by sharing the instruction cache among multiple CUDA cores.
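For reference, the CUDA C Programming Guide defines the thread ID within a block as x + y*Dx + z*Dx*Dy and forms warps from consecutive thread IDs, so the grouping is fixed by the thread index. A small Python sketch of that mapping follows; the helper names are hypothetical.

```python
WARP_SIZE = 32


def linear_thread_id(x, y, z, dim_x, dim_y):
    """Thread ID within a block for thread index (x, y, z) and block
    dimensions (dim_x, dim_y, dim_z): x + y*Dx + z*Dx*Dy."""
    return x + y * dim_x + z * dim_x * dim_y


def warp_and_lane(tid):
    """Warp number within the block and position within that warp."""
    return tid // WARP_SIZE, tid % WARP_SIZE


def warps_per_block(threads_per_block):
    """Number of warps needed for a block; the last one may be partly empty."""
    return -(-threads_per_block // WARP_SIZE)


tid = linear_thread_id(3, 2, 0, dim_x=16, dim_y=16)
print(tid, warp_and_lane(tid))
print(warps_per_block(1024), warps_per_block(48))
```

For a 16x16 block, thread (3, 2, 0) gets ID 35 and lands in warp 1 at lane 3. A 1024-thread block is 32 warps, and a 48-thread block is 2 warps with the second only half filled.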

The warp size is 32, which is a bit strange since there are only 48 CUDA cores: this would mean one warp of 32 and one warp of 16. That’s probably why a “half warp” is mentioned.

Why is there a limit of 1024 threads per block? What does that limit imply? Is that the total size of a thread block which can be programmed? Or is it something else…

Perhaps 1024 threads per block means it can allocate 1024 threads per block, but why wouldn’t it be able to swap in more threads from memory?!

Surely it can store “thread information” in global memory…?! So this is still a bit strange…

Perhaps 1024 means the amount of “thread information” that can be stored inside the GPU…

I am still confused about what this limitation actually means.
