Doubt about the convolution separable PDF

Please read the first paragraph on page 8 (above Figure 6) of the convolution separable document. They use 16x48 threads in a thread block. My question is how that is possible: a thread block can hold at most 512 threads, but 16x48 = 768 threads. Or have I misunderstood it? Please tell me what the thread block size is in this case.

In the SDK, the thread block size they use for columns is
#define COLUMNS_BLOCKDIM_X 16
#define COLUMNS_BLOCKDIM_Y 8

And for rows they use
#define ROWS_BLOCKDIM_X 16
#define ROWS_BLOCKDIM_Y 4

I am not able to understand the 16x48 thread block given in the PDF.

Thanks in advance

To separate a PDF document, a PDF splitting control may be used.

The maximum number of threads per thread block depends on the hardware. The 512-thread limit applies to older GPUs of compute capability 1.x. Fermi and Kepler GPUs can run up to 1024 threads per thread block; see Table 12 of the CUDA programming guide.
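
To make the arithmetic in this thread concrete, here is a minimal C sketch that multiplies out the block dimensions and compares them with the two limits mentioned above. The `#define` values are the ones quoted from the SDK in the question. The helper names `threads_per_block` and `block_fits` and the two limit macros are made up for illustration and are not part of the SDK. The limits are simply the figures stated in this thread, so check the programming guide for your own GPU.

```c
#include <stdbool.h>

/* Block dimensions quoted from the SDK sample in the question. */
#define COLUMNS_BLOCKDIM_X 16
#define COLUMNS_BLOCKDIM_Y 8
#define ROWS_BLOCKDIM_X 16
#define ROWS_BLOCKDIM_Y 4

/* Per-block thread limits as stated in this thread (assumptions for
 * illustration; the real limit depends on the GPU). */
#define MAX_THREADS_OLDER_GPU 512
#define MAX_THREADS_FERMI_KEPLER 1024

/* Hypothetical helper: total number of threads in a 2D thread block. */
int threads_per_block(int dim_x, int dim_y)
{
    return dim_x * dim_y;
}

/* Hypothetical helper: true if a dim_x by dim_y block stays within the limit. */
bool block_fits(int dim_x, int dim_y, int max_threads)
{
    return threads_per_block(dim_x, dim_y) <= max_threads;
}
```

With these values, the SDK's column block has 16 x 8 = 128 threads and the row block has 16 x 4 = 64 threads, both well under 512. A 16x48 block would have 768 threads, so `block_fits(16, 48, MAX_THREADS_OLDER_GPU)` is false while `block_fits(16, 48, MAX_THREADS_FERMI_KEPLER)` is true.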