Kernel launch failure plus Warp execution performance


Section 1:

Suppose I only launch a single warp and all threads of the launched warp need to update the same 4-byte address in shared memory.

__shared__ int a[5];
a[2] = 5;

Suppose all threads of the warp need to perform the update a[2] = 6.

Assumptions: (1) a warp consists of 32 threads, (2) a few lines later in all scenarios I have a __syncthreads(), (3) all data is single precision.

(scenario a)

__global__ void test_kernel()
{
  __shared__ int a[5];
  a[2] = 6; 
}

From the warp’s context, (scenario a) should execute in a single clock cycle, since all threads of the warp write to the same shared memory location.

(scenario b)

__global__ void test_kernel()
{
  __shared__ int a[5];
  if(threadIdx.x == 1) a[2] = 6; 
}

(scenario b) should execute in a single clock cycle.

(scenario c)

__global__ void test_kernel()
{
  __shared__ int a[5];
  if(threadIdx.x == 31) a[2] = 6; 
}

(scenario c) should also execute in a single clock cycle.

Question 1: In the above scenarios (a), (b) and (c), each seems to finish in a single clock cycle. So, from a performance point of view, are (a), (b) and (c) equivalent, or is one better than the others? If one of (a), (b), (c) is better than the others, could you please let me know why?

Question 2: I have often read that memory gets requested in sizes of 32 bytes, 64 bytes or 128 bytes. But a warp has 32 threads, so for single precision, 128 bytes (32 * 4 bytes = 128 bytes) of cache line or memory should get requested. In what scenario does 32 bytes or 64 bytes of memory get requested?

Section 2:

For an A100 GPU, a block is allowed 1024 threads and 48 KB of static shared memory. Right now my block size is 1024 threads and I am using 32 KB of shared memory. The kernel I wrote did not get executed. Upon canonical error checking using cudaPeekAtLastError, I see the error message 'too many resources requested for launch test_tslu_block.cu 75'.

test_tslu_block.cu is my code, which launches a kernel at line 74; at line 75 I have cudaPeekAtLastError.

//line 74 - launch_kernel_xyz<<<>>>()
//line 75 - cudaPeekAtLastError()
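For reference, this is the pattern around those two lines (a sketch only; launch_kernel_xyz and the launch configuration are placeholders rather than the actual code):

#include <cstdio>

__global__ void launch_kernel_xyz() { }      // placeholder for the real kernel

int main()
{
    dim3 grid(1);
    dim3 block(1024);                        // the failing configuration

    launch_kernel_xyz<<<grid, block>>>();    // "line 74": the launch

    // "line 75": reports launch-configuration errors such as
    // "too many resources requested for launch"
    cudaError_t err = cudaPeekAtLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // optionally also catch asynchronous errors raised while the kernel runs
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));
    return 0;
}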

If I change my block size from 1024 threads to 512 threads, the kernel gets executed successfully. And I am only using 32 KB of shared memory, less than the 48 KB allowed for a block. I am not sure how to make my kernel run with a block size of 1024 threads, since I do not know what excess resources my kernel is requesting when the block size is 1024.

Question: how do I investigate what excess resources my kernel is requesting when my block size is 1024 threads?

They should all have the same performance.

The total number of bytes requested is determined by the number of threads in the warp participating, and the size of the request from each thread. If each thread is requesting a byte, then there are 32 bytes requested. Similarly, if I have a statement like yours:

if (threadIdx.x < 2) ...

then only the threads with threadIdx.x of 0 and 1 will be considered for the number of bytes requested.
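For illustration (a sketch of my own, not code from this thread): with the guard below, only two threads of the warp participate, so for 4-byte elements the warp-level request covers 2 x 4 = 8 bytes; without the guard, all 32 threads would participate and the request would cover 32 x 4 = 128 bytes.

__global__ void partial_copy(const float* in, float* out)
{
    // Only threads 0 and 1 of the warp issue the load and store,
    // so each warp-level request covers 2 threads x 4 bytes = 8 bytes.
    if (threadIdx.x < 2)
        out[threadIdx.x] = in[threadIdx.x];
}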

A fairly common question; see here. There are many other questions like it on the forums. Registers are not the only resource that can give rise to this, but they are perhaps the most common reason for that report. Shared memory is also a resource, but it will usually produce a different error code. Local/stack memory is another possible resource that may, in some cases, also produce a different error code.
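One practical way to see what the kernel is asking for (a sketch; test_kernel stands in for the real kernel) is to compile with nvcc -Xptxas -v, which prints per-kernel register, shared, constant and spill usage, or to query the attributes at run time:

#include <cstdio>

__global__ void test_kernel() { }            // stand-in for the real kernel

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, test_kernel);
    printf("registers per thread   : %d\n",  attr.numRegs);
    printf("static shared per block: %zu\n", attr.sharedSizeBytes);
    printf("local memory per thread: %zu\n", attr.localSizeBytes);
    printf("max threads per block  : %d\n",  attr.maxThreadsPerBlock);
    return 0;
}

If maxThreadsPerBlock comes back smaller than 1024, the limiting resource is usually registers: with 65,536 registers per block on A100, anything above 64 registers per thread cannot support a 1024-thread block.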


To exclude shared memory as the reason, increase the maximum size beyond 48 KB with cudaFuncSetAttribute: CUDA C++ Programming Guide
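A minimal sketch of that (my own, with illustrative sizes): only dynamically allocated shared memory can be raised above the default 48 KB per block, so a static __shared__ array has to be converted to an extern __shared__ one.

__global__ void test_kernel_dyn()
{
    extern __shared__ int a[];               // dynamic shared memory
    a[2] = 6;
}

int main()
{
    int smem = 64 * 1024;                    // e.g. 64 KB; A100 allows up to 163 KB per block
    cudaFuncSetAttribute(test_kernel_dyn,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, smem);
    test_kernel_dyn<<<1, 1024, smem>>>();    // pass the dynamic size at launch
    cudaDeviceSynchronize();
    return 0;
}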

To exclude registers as the reason, use maxrregcount, __maxnreg__ or __launch_bounds__ (CUDA C++ Programming Guide).
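Sketches of the register-side knobs (numbers are illustrative): the whole file can be compiled with nvcc --maxrregcount=64, or individual kernels can be constrained:

// Tells the compiler this kernel must be launchable with 1024 threads per block,
// so ptxas limits register usage accordingly (spilling to local memory if needed).
__global__ void __launch_bounds__(1024) test_kernel_lb()
{
    // kernel body
}

// Per-kernel register cap (available in newer toolkits via __maxnreg__);
// 64 registers per thread is the most that still permits a 1024-thread block on A100.
__global__ void __maxnreg__(64) test_kernel_reg()
{
    // kernel body
}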

Also make sure it was the kernel launch on line 74 and not a previous kernel launch that caused the error.


@Robert_Crovella

if (threadIdx.x < 2)

Suppose each thread requests 4 bytes, and only two threads of a warp request data from global memory, adjacent data of 4 bytes each. Then 32 bytes of memory get requested? … And suppose only 9 threads of a warp request adjacent data from global memory of 4 bytes each, then the memory request is 64 bytes? Is my understanding correct?

Also, when only two threads of a warp request data from shared memory, adjacent data of 4 bytes each, then only the requested 8 bytes get read, and not 32 bytes, right?

Thank you. It was the registers.

Global memory is a logical space. That alone is not enough information to fully answer the question. If the request to global memory hits in the L1 cache, then only the requested data is returned, in this case 8 bytes. If the request misses in the L1 but hits in the L2, then AFAIK 32 bytes (a so-called “sector”) will be transferred from L2 to L1, and then the requested bytes will be delivered. Likewise, if it also misses in the L2, then a “segment” from DRAM will be transferred to the L2. For all current DRAM designs I am familiar with, a segment is 32 bytes.

A similar thought process governs your next question. In short, the data size “granularity” of both DRAM and the L1 and L2 cache (in modern GPUs, Pascal or newer) is 32 bytes.

For the purposes of this discussion, the behavior of shared memory is something like the behavior of L1: only the requested bytes are delivered.


Hi Robert,
can you confirm: for L1, as you stated, the granularity for reads/stores is 32 bytes (a warp can access four 32-byte sectors in modern architectures), but to fully use the capacity of the L1 cache, the L1 cache line size of 128 contiguous bytes has to be considered. If only 32 bytes of each 128-byte cache line are used, do 75% of the L1 stay unused?

That is generally my understanding.

It seems like it should not be difficult to construct a directed test, coupled with Nsight Compute, to confirm this.

  • determine L1 cache size
  • cyclically access 32-byte segments in memory, which are scattered, i.e. not adjacent
  • do this for 1/4 of the L1 cache size
  • observe L1 hit rate approaching 100%
  • do this for 1/2 of the L1 cache size
  • observe L1 hit rate drop to below 50% or close to zero, depending on pattern and footprint

You need to consider this per-SM, so that may add some coding complexity. I haven’t thought through all the details.
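A rough sketch of the kernel side of such a test (entirely my own construction, untested): one block repeatedly revisits a set of scattered 4-byte words, one word per 128-byte line, so each touched word owns its own 32-byte sector; footprintBytes is set to 1/4 or 1/2 of the L1 size, counted in touched sectors. The Nsight Compute metric name in the trailing comment is from memory.

// Sketch: one block repeatedly sweeps nLines scattered 32-byte sectors.
// The backing buffer must cover at least (footprintBytes / 32) * 128 bytes.
__global__ void l1_sweep(const int* __restrict__ data,
                         int footprintBytes, int repeats, int* sink)
{
    const int wordsPerLine = 128 / sizeof(int);   // 32 ints per 128-byte line
    const int nLines = footprintBytes / 32;       // one touched 32-byte sector per line
    int acc = 0;
    for (int r = 0; r < repeats; ++r)
        for (int i = threadIdx.x; i < nLines; i += blockDim.x)
            acc += data[i * wordsPerLine];        // revisit the same scattered sectors
    if (acc == 123456789)                         // keep the loads from being optimized away
        *sink = acc;
}

// Then compare the L1 hit rate for the two footprints in Nsight Compute,
// e.g. with: ncu --metrics l1tex__t_sector_hit_rate.pct ./a.out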

It is inherent in the definition of a sectored cache that the tag applies to the entire cache line. The point of using a sectored cache is to save on tag storage while still allowing relatively fine granularity of transfers on a miss, so each sector has its own status bit(s), e.g. valid or dirty. Historical examples might store the tag bits on-chip but use off-chip SRAM for cached data storage. In general, sectored caches can be beneficial in a multi-level cache hierarchy in which the first-level cache is “small”.

NVIDIA A100 Tensor Core GPU Architecture whitepaper

@Robert_Crovella in the white paper, on page 22, figure 7, it shows that each SM has 4 warp schedulers and each scheduler has access to 16 functional units. That means a warp will take 2 clock cycles to finish.

Question: then should not scenario (b) (where only the thread with threadIdx.x == 1 executes the write) provide better performance than scenario (c) (where only threadIdx.x == 31 executes it), since scenario (b) would be executed a clock cycle earlier than scenario (c)?

Also, figure 7 on page 22 labels the Warp Scheduler as (32 thread/clk) and the Warp Dispatcher as (32 thread/clk).

Question: I am not sure what they mean by ‘(32 thread/clk)’. A single warp scheduler and warp dispatcher have only 16 functional units available per clock cycle, so should it not be (16 thread/clk)?

Question: So a warp on an A100 Tensor Core GPU gets executed in two cycles. But can there be a scenario where the first 16 threads of warp_0 get scheduled and executed by warp_scheduler_1 in cycle_1, and in the next cycle (cycle_1 + 1) 16 threads from a different warp (warp_1) get scheduled and executed by the same warp_scheduler_1?

Question: Also, I was trying to understand the role of a warp dispatcher. I know the warp scheduler is a piece of hardware which picks un-stalled warps. So does the warp dispatcher pick warps from the warp scheduler's queue and dispatch them to the functional units?

You’re getting into a level of questions that are probably at the edge of what I can answer. Furthermore, in my view the understanding of these questions/answers is perhaps of academic interest, but has essentially no impact on CUDA programmer behavior or anything measurable about program behavior.

The GPU is largely a throughput machine. Your question is focused (I guess) on latency, and a very micro-focus on latency at that. How would you measure this? Whether there is or isn’t a better case here, I doubt it would be measurable. To give the most direct answer: I don’t know if one scenario would give better performance. If it did, I have no idea how to measure it. The performance benefit would be minuscule, the way I see it. The usual measurement of performance, when the kernel finishes, would require the entire warp to complete its work. To see the latency effect, you would normally have to string together a dependent sequence. But even a dependent sequence would not obviate the need to issue each half warp, in every other clock cycle. I don’t see how this could yield any benefit.

32 threads per clock is like saying one warp per clock. You’ve concluded (I guess) that there is only one type of functional unit in the SM. What if there were multiple types of functional units (there are) that come in varying granularities (they do)? For example, what if the LSU could accept 32 instructions (i.e. an entire warp’s worth)? For the sake of this discussion, I would simply say that the warp scheduler can schedule “up to” 32 threads per clock, and the architectural whitepaper does not give you or me enough information to conclude that that could never be possible.

I don’t think so, but I don’t know that there is published technical documentation to cover this. I think a sensible mental model is that if one half of a warp for a particular instruction gets scheduled in clock cycle X, the second half of the warp will always get scheduled in clock cycle X+1 (for that same instruction). Can I point to documentation that states/guarantees that? I cannot.

I don’t know that this is spelled out anywhere. Academically interesting? Perhaps. Useful to know as a CUDA programmer? Not obvious to me.

My role here is generally not to provide previously unreleased information (unless it is discoverable by a competent programmer) nor is it my role to go down to any arbitrary level of depth, such as a register transfer level description of the GPU, to answer such questions. Sorry. I know that my “answers” are of such questionable value that a couple times I asked myself whether I should be writing any kind of response, at all.

Maybe Greg will come by at some point and do a better job.

I think Greg shows this in “Example 2”, here. It’s hard to see an advantage in splitting two warps across two cycles.


@Robert_Crovella you have been a great help. I was just working on the code, trying to figure out the bottlenecks, and simultaneously looking at the white paper, so these questions came to my mind out of curiosity and I was not sure whom to ask. Thank you @Robert_Crovella, I have learned a lot from you.

There are some instructions that benefit when not the whole warp of 32 threads executes them with different data.

  • memory instructions: the chance for uncoalesced reads and bank conflicts is reduced (see the sketch below)
  • uniform datapath: if the whole warp uses the same data (guaranteed at compile time) for integer calculations, predicates, and similar, then the uniform datapath can be used. It frees the threads from register pressure for common tasks like (some) loops and offset calculations
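To make the first bullet concrete (a hypothetical example of mine, assuming the usual 32 shared memory banks of 4 bytes each): the stride-32 store below maps every active thread to bank 0, so with a full warp it is a 32-way bank conflict that serializes into 32 shared-memory transactions, while with only two active threads it is merely 2-way.

__global__ void bank_conflict_demo(int* out)
{
    __shared__ int s[32 * 32];
    // Stride-32 indexing: every active thread addresses bank 0.
    if (threadIdx.x < 2)                     // with "< 32" this becomes a 32-way conflict
        s[threadIdx.x * 32] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        *out = s[0];                         // keep the stores observable
}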

As far as I understand (and Robert hinted at it here; see also questions about sp and sm - #6 by Robert_Crovella), deactivating some threads within a warp will not increase computational bandwidth for the remaining threads. It always costs as much time as if the whole warp participated. Otherwise people would happily use diverging warps.

This restriction simplifies the scheduling, but also the data paths within the GPU. E.g. if an SM partition has 16 INT32 units, each unit only ever takes its input from the registers of 2 of the 32 threads of a warp; the registers of the other 30 threads will never be used as input to that unit.

As we are talking about pipelines, the 16/clk or 16 INT32 cores just specifies the bandwidth in general, without giving a hint whether the hardware has 16 fast units or 32 slow units.
