shared memory performance kernel execution timings with one block

hi,

i’ve written a very small test kernel to measure shared memory performance:

extern __shared__ float shared_int[];

__global__ void cuda_empty_kernel() {

        unsigned int id = threadIdx.x + blockDim.x * threadIdx.y;

       for(int i = 0; i < 500; i++)

                shared_int[id] = (float)id;

}

// invocation (5000 times):

{

        dim3 block(threads, 1);

        dim3 grid(1, 1);

        cuda_empty_kernel<<<grid, block, sizeof(float) * 512>>>();

}

i call this kernel 5000 times and record the execution time using the “time stamp counter” (rdtsc) of my host cpu (Intel® Core™2 CPU 6600 @ 2.40GHz / 2GB RAM).

i do this with different block-dimensions (thread counts) but always only one block – i want to test shared memory which is local per multiprocessor…

i do a warm-up call of the kernel once for each thread-count, which is not measured.

attached is a plot of the average execution time of one kernel invocation against the block-dimension.

i would like to ask for comments which could explain the observed behaviour.

things i belief to understand:

  • here, shared memory is accessed via 32bit float’s

  • here, each thread writes on a thread-specific location (shared_int is indexed by thread id)

  • with <= 16 threads i get NO shared memory bank conflicts because two consecutive 32bit values are on two different memory banks (16*1kB memory banks)

  • if i have a blocksize with more than 16 threads, i start to get memory conflicts, because two threads try to access the same memory bank at the same time

  • …nearly the same time, i dont know how the different thread-processor-divisions interact here:

— 8 stream processors per multiprocessor, each can handle one thread (assuming they execute the same code)

— one wrap consists of 32 threads

  • execution time only changes in 16 thread increments. (because they are grouped in 2 stream processor cycles or because of 16 memory banks???)

some questions i can formulate so far:

  • up until 16 threads we have only a half wrap, which could be handled in two stream processor cycles. does the cuda hardware utilize this fact? (or is one wrap (four cycles) the smallest execution unit?) is this the explanation of the 1.5us time increment from 16-17 threads? (or is that because of conflicts?)

  • how fast is shared memory? if i have 32 threads – one wrap – and they are writing to 32 consecutive 32bit values in shared memory indexed by thread-id, then they are executed in 4 stream processor cycles.

if the shared memory can be read within one stream processor cycle, no conflicts should occour.

if the shared memory can be read within two stream processor clock cycles, also no conflicts should occour (2*8=16 consecutive 32bit values fall onto the 16 different memory banks)

if the shared memory can be read within four stream processor clock cycles, then we have a conflict, because the threads of the first half of the wrap accesses the same memory banks as the second half.

i think this last assumption is right (one shared memory access per 4 stream processor clock cycles)?

  • at least from 32 threads on, the count of bank conflicts should increase. so i would estimate a nearly constant rising of the execution time here?! but i get the same execution times whether i have 32 threads (-- even 17 here) or 144threads?

  • and where does the nearly constant slow down starting from 177 up to 512 threads come from?

  • what happens between thread count 144 and 176? execution time with 177 threads is faster than with 176??

  • are there other theories about the execution time skip from 16-17?

  • other opinions about the constant time from 16-144 or the raising time from 145 up to 512?

i hope to see more interesting questions about this plot and twice as much good explanations!!

flo.
avg.png

A few comments, with varying relevance:

  • in each iteration of the loop you are casting an integer into a float. Since their binary representations are different, this is not a free operation and it definately contributes to your elapsed times. Why not just had a shared array of ints, since you’re assigning ints.

  • you assign the same exact value in each iteration of the loop. While I think it’s probably OK in current CUDA case, some C/C++ compilers would optimize out 499 iterations of your loop. I’ve run into cases where Intel’s C++ compiler would not even execute a loop that called a function if the functions results were not used - I had to articificically “consume” some of functions output to get meaningful timing results. You could insure yourself to a limited extent, for example, by assigning the value of i in each iteration.

  • the way I understand it, half-warp (16 threads) is the execution unit on current hardware (g80). If you use fewer than 16 threads, the computation still takes as long as 16 would have, since computation takes place in lock step. That explains why your results differed in 16-thread increments.

  • Programming Guide section 6.1.2.4 (version 0.8.2) explains bank conflicts. Basically, since half-warp is the execution unit, bank conflicts can only occur between threads in the same half-warp. So, you will not get bank extra bank conflicts by using more than 16 threads in your block.

  • as far as the increase in time as you increase the number of threads per block, that’s expected. With 512 threads you are writing (500 times) to 2KB of shared memory. With 16 - only to 64B. The apparent constant time at the beginning is either due to the timer resolution (executions take less than a timer can notice), or due to the overhead of starting/ending the kernel.

  • dont’ forget that when you time your kernel with CPU instructions, you also time thread launch, termination, etc. Since you’re interested in shared mem performance only, I’d suggest to also use CUDA profiler to check how many microseconds the kernel was executing, just to balance against your CPU timings.

  • finally, I think that shared memory performance should be pretty predictable. According to the Programming Guide, accessing shared mem takes as fast as accessing a register. So, you should be able to just multiply the number of accesses by access clock cycles. If there are bank-conflicts, you just need to multiply your result by how many ways a conflict occurs.

Hope this helps,
Paulius

thank you for your explanations!

the main reason because i use floats here is because they are 32 bit wide. for reference i attached an plot with uint32_t as datatype

extern __shared__ uint32_t shared_int[];

__global__ void cuda_empty_kernel() {

        uint32_t id = threadIdx.x + blockDim.x * threadIdx.y;

	for(int i = 0; i < 500; i++)

  shared_int[id] = id;

}

plot: avg_uint32.png [attachment=3319:attachment]

interestingly in this plot timings for 177 up to 192 look as expected – they are slower that threads from 161-176 (but not in the original float case???)

other thread counts seem to perform identical to the float-version.

i dont think that the timer resolution is a matter, because i can produce much shorter execution times if i decrement the inner loop from 500, to lets say 100. (first plot attached: avg.png [attachment=3317:attachment])

to get a feeling for the starting/ending overhead i’ve also attached a plot with an empty kernel. (avg_empty_kernel.png [attachment=3318:attachment])

code:

__global__ void cuda_empty_kernel() {

        unsigned int id = threadIdx.x + blockDim.x * threadIdx.y;

}

on this plot you can get a feeling for how “sharp” “my timer” is. kernel execution overhead is somewhere between 17 and 18us.

so i don’t think that the constant time at the beginning is due to the overhead or the timer resolution.

i am aware of this, but i assumed that the start-/stop-overhead would be constant regardless of how many threads i start.

and i wanted to get an “objective” measurement i can trust :) (not the nvidia delivered timing values)

furthermore, in an application using cuda hardware, i think start-/stop overhead of kernels is very important for design decisions.


I don’t think start time is a big deal if the kernel does moderate tasks(say, >10ms) and there’re not too many loops containing the kernel.

and I agree cuda (same in shaders) will definitely optimize out the codelines that won’t contribute to final results. In this sense, the compiler is very smart:)