hi,
i’ve written a very small test kernel to measure shared memory performance:
extern __shared__ float shared_int[];

__global__ void cuda_empty_kernel() {
    // each thread computes its own index into shared memory
    unsigned int id = threadIdx.x + blockDim.x * threadIdx.y;
    for (int i = 0; i < 500; i++)
        shared_int[id] = (float)id;   // 500 writes to a thread-specific location
}
// invocation (5000 times):
{
    dim3 block(threads, 1);
    dim3 grid(1, 1);
    cuda_empty_kernel<<<grid, block, sizeof(float) * 512>>>();
}
i call this kernel 5000 times and record the execution time using the “time stamp counter” (rdtsc) of my host cpu (Intel® Core™2 CPU 6600 @ 2.40GHz / 2GB RAM).
i do this with different block dimensions (thread counts) but always with only one block – i want to test shared memory, which is local to each multiprocessor…
for each thread count i do one warm-up call of the kernel, which is not measured.
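for reference, the measurement loop looks roughly like this (a sketch: my_rdtsc() is just an illustrative wrapper around the rdtsc instruction, and the cudaThreadSynchronize() calls make sure the counter covers the complete kernel executions):

#include <stdio.h>

static inline unsigned long long my_rdtsc(void) {
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

int main() {
    for (int threads = 1; threads <= 512; threads++) {
        dim3 block(threads, 1);
        dim3 grid(1, 1);

        cuda_empty_kernel<<<grid, block, sizeof(float) * 512>>>();   // warm-up (not measured)
        cudaThreadSynchronize();

        unsigned long long t0 = my_rdtsc();
        for (int run = 0; run < 5000; run++)
            cuda_empty_kernel<<<grid, block, sizeof(float) * 512>>>();
        cudaThreadSynchronize();   // wait for all 5000 launches to finish
        unsigned long long t1 = my_rdtsc();

        printf("%3d threads: %.0f cpu cycles per launch\n",
               threads, (double)(t1 - t0) / 5000.0);
    }
    return 0;
}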
attached is a plot of the average execution time of one kernel invocation against the block dimension.
i would like to ask for comments that could explain the observed behaviour.
things i believe i understand:
- here, shared memory is accessed via 32-bit floats
- here, each thread writes to a thread-specific location (shared_int is indexed by the thread id)
- with <= 16 threads i get NO shared-memory bank conflicts, because two consecutive 32-bit values lie on two different memory banks (16 × 1 kB memory banks; see the bank-mapping sketch after this list)
- if i have a block size with more than 16 threads, i start to get bank conflicts, because two threads try to access the same memory bank at the same time
- …nearly the same time; i don't know how the different thread/processor divisions interact here:
— 8 stream processors per multiprocessor, each can handle one thread (assuming they execute the same code)
— one warp consists of 32 threads
- execution time only changes in 16-thread increments (because threads are grouped into two stream-processor cycles, or because of the 16 memory banks???)
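to make the bank mapping concrete, here is a tiny host-side sketch (assuming the layout i describe above: 16 banks, consecutive 32-bit words on consecutive banks) that prints which bank each thread's write to shared_int[id] lands in:

#include <stdio.h>

int main(void) {
    const int NUM_BANKS = 16;              // banks per multiprocessor (assumed)
    for (int tid = 0; tid < 32; tid++) {
        // shared_int[tid] is one 32-bit word, so its bank is tid mod 16
        int bank = tid % NUM_BANKS;
        printf("thread %2d -> bank %2d\n", tid, bank);
    }
    // threads 0..15 hit banks 0..15 exactly once (conflict-free half-warp);
    // threads 16..31 hit banks 0..15 again, but belong to the other half-warp
    return 0;
}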
some questions i can formulate so far:
- up until 16 threads we have only a half-warp, which could be handled in two stream-processor cycles. does the cuda hardware exploit this fact? (or is one warp (four cycles) the smallest execution unit?) is this the explanation of the 1.5 µs time increment from 16 to 17 threads? (or is that because of conflicts?)
- how fast is shared memory? if i have 32 threads – one warp – and they write to 32 consecutive 32-bit values in shared memory indexed by thread id, then they are executed in 4 stream-processor cycles.
if shared memory can be accessed within one stream-processor cycle, no conflicts should occur.
if shared memory can be accessed within two stream-processor clock cycles, also no conflicts should occur (2*8 = 16 consecutive 32-bit values fall onto the 16 different memory banks).
if shared memory can be accessed within four stream-processor clock cycles, then we have a conflict, because the threads of the first half of the warp access the same memory banks as the second half.
i think this last assumption is right (one shared-memory access per 4 stream-processor clock cycles)?
- at least from 32 threads on, the number of bank conflicts should increase, so i would expect a nearly constant rise of the execution time here?! but i get the same execution times whether i have 32 threads (even 17) or 144 threads? (a stride experiment to provoke conflicts deliberately is sketched after this list.)
- and where does the nearly constant slowdown starting from 177 up to 512 threads come from?
- what happens between thread counts 144 and 176? execution time with 177 threads is faster than with 176??
- are there other theories about the execution-time jump from 16 to 17 threads?
- other opinions about the constant time from 16 to 144 or the rising time from 145 up to 512?
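regarding the conflict questions above: one experiment that could isolate the cost of bank conflicts from scheduling effects would be a strided variant of the test kernel (a sketch; cuda_stride_kernel and its stride parameter are my invention for illustration). with stride = 1 the writes are conflict-free, with stride = 16 all 16 threads of a half-warp hit the same bank:

extern __shared__ float shared_int[];

// same work as cuda_empty_kernel, but with a configurable index stride;
// stride = 1  -> consecutive words, one bank per thread (conflict-free)
// stride = 16 -> every thread of a half-warp lands in the same bank
__global__ void cuda_stride_kernel(int stride) {
    unsigned int id  = threadIdx.x + blockDim.x * threadIdx.y;
    unsigned int idx = (id * stride) % 512;   // stay inside the 512-float buffer
    for (int i = 0; i < 500; i++)
        shared_int[idx] = (float)id;
}

// launched exactly like the original kernel, e.g.:
// cuda_stride_kernel<<<grid, block, sizeof(float) * 512>>>(16);

the difference between the two timings should then show the pure conflict penalty, independent of how warps are scheduled.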
i hope to see more interesting questions about this plot and twice as many good explanations!!
flo.