Section 1:
Suppose I only launch a single warp and all threads of the launched warp need to update the same 4 byte address in shared memory.
_shared__ int a[5];
a[2] = 5;
suppose all threads of a warp need to update a[2] = 6.
Assumption : (1)warp consist of 32 threads, (2) a few lines later in all scenarios I have a syncthreads(), (3) all data is single precision
(scenario a)
__global__ void test_kernel()
{
__shared__ int a[5];
a[2] = 6;
}
from a warp’s context (scenario a) should execute in the same clock cycle, since all threads of the warp write to the same shared memory location.
(scenario b)
__global__ void test_kernel()
{
__shared__ int a[5];
if(threadIdx.x == 1) a[2] = 6;
}
(scenario b) should execute in single clock cycle.
(scenario c)
__global__ void test_kernel()
{
__shared__ int a[5];
if(threadIdx.x == 31) a[2] = 6;
}
(scenario c) should execute in the same single clock cycle.
Question 1 : In the above scenarios - (a), (b) and (c) , each seem to finish in a single clock cycle. So from performance point of view are (a), (b) and (c) equivalent or is one better than the other. If any one of (a), (b), (c) is better than the other, could you please let me know why ?
Question 2: I have often read memory gets requested in size of 32 bytes , 64 bytes or 128 bytes. But a warp has 32 threads and in context to single precision, 128 bytes (32 * 4 byes = 128 bytes) of cache line or memory should get requested. So, in what scenario does 32 bytes or 64 bytes of memory get requested?
Section 2:
For A100 GPU, a block is allowed 1024 threads and 48 KB of static shared memory. Right now my block size is 1024 threads and I am using 32KB of shared memory . The kernel I wrote did not get executed. Upon canonical error checking using cudaPeekAtLastError, I see the error message 'too many resources requested for launch test_tslu_block.cu 75 '.
test_tslu_block.cu is my code that lunches a kernel call at line 74 and at line 75 I have cudaPeekAtLastError.
//line 74 - launch_kernel_xyz<<<>>>()
//line 75 - cudaPeekAtLastError()
If i change my block size from 1024 threads to 512 threads, the kernel gets executed successfully. And I am only using 32 kb of shared memory less than 48 kb allowed for a block. I am not sure how to make my kernel run with a block size of 1024 threads, since I do not know what excess resources my kernel is requesting when block size is 1024.
Question : how do I investigate, what excess resources my kernel is requesting when my block size is 1024 threads?