Per-thread local memory specified in the CUDA C Programming Guide

I have a question:
The CUDA C Programming Guide specifies the "per thread" local memory for CC 2.x devices as a maximum of 512 KB.
Does this mean that 512 KB is the maximum amount of memory that can be allocated per thread? In other words, is there at most 512 KB available per thread for automatic variables (both scalars and arrays) created on the stack?

Correct. Given the large number of concurrent threads (up to 1536 per SM, times the number of SMs on the device, which can be up to 16), this means that a significant portion of total GPU memory can be consumed by thread-local storage: at the limit, 512 KB × 1536 threads × 16 SMs = 12 GB. Note that by default the driver allocates a much smaller amount of per-thread local memory and adjusts the bound upward as needed. Programmers can also set the amount explicitly via cudaDeviceSetLimit(cudaLimitStackSize, bytes).
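A minimal host-side sketch of querying and raising the per-thread stack limit (the 64 KB value is purely illustrative; error handling follows the standard CUDA runtime pattern):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stackSize = 0;

    // Query the driver's current default per-thread stack size.
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("Default per-thread stack: %zu bytes\n", stackSize);

    // Raise the limit to 64 KB per thread (illustrative value;
    // the CC 2.x architectural maximum is 512 KB per thread).
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("New per-thread stack: %zu bytes\n", stackSize);
    return 0;
}
```

Keep in mind that raising the limit causes the driver to reserve stack space for every resident thread, so a large value can noticeably reduce the device memory available for explicit allocations.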