Hello,
I think you are mixing up threads per block and blocks per grid. In CUDA, threads are packed into blocks of size (x, y, z). When you launch a kernel you decide how many blocks of threads to launch, so the total number of threads is dimBlock * dimGrid. In your case that is 16 * 64 in the x dimension and 16 * 64 in the y dimension → 1024 x 1024 threads, though they do not all run at the same time: blocks are scheduled and assigned to the multiprocessors as resources become free. Either way, from a global point of view you have launched 1,048,576 threads.
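As a quick check of that arithmetic, here is a minimal host-side sketch in plain C++ (no GPU needed). The `Dim2` struct and `totalThreads` function are hypothetical names used only for illustration; they mimic the x and y fields of CUDA's `dim3`, and the 16x16 block with a 64x64 grid is the configuration I am assuming from your post.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the x and y fields of CUDA's dim3.
struct Dim2 {
    std::uint32_t x;
    std::uint32_t y;
};

// Total threads launched = (threads per block * blocks per grid) in x,
// multiplied by the same product in y.
std::uint64_t totalThreads(Dim2 block, Dim2 grid) {
    assert(block.x > 0 && block.y > 0 && grid.x > 0 && grid.y > 0);
    std::uint64_t threadsX = static_cast<std::uint64_t>(block.x) * grid.x;
    std::uint64_t threadsY = static_cast<std::uint64_t>(block.y) * grid.y;
    return threadsX * threadsY;
}
```

Calling `totalThreads(Dim2{16, 16}, Dim2{64, 64})` gives 1024 * 1024 = 1,048,576.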
Now that you have decided on your kernel configuration, you must compute a unique index for each thread from its index within the block and the block's index within the grid. As you wrote:
xid = threadIdx.x + blockDim.x * blockIdx.x; // 0 to 1023 for threads in the X axis
yid = threadIdx.y + blockDim.y * blockIdx.y; // 0 to 1023 for threads in the Y axis
Keep in mind that each block also has a unique id. So, for the first block, with id (0,0), you will get the indices:
xid = threadIdx.x + 16 * 0; // 0 to 15 in the X axis
yid = threadIdx.y + 16 * 0; // 0 to 15 in the Y axis
Your global unique id is:
gid = xid + 1024 * yid; // 0 to 1,048,575. Note that 1024 is the width of your array.
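The same index arithmetic can be sketched on the host in plain C++ to check the ranges. `Idx2` and `globalIndex` are hypothetical names, not CUDA built-ins; inside a real kernel you would read `threadIdx`, `blockIdx` and `blockDim` directly instead of passing them in.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 2D CUDA index (threadIdx, blockIdx, blockDim).
struct Idx2 {
    std::uint32_t x;
    std::uint32_t y;
};

// Mirrors the kernel-side arithmetic:
//   xid = threadIdx.x + blockDim.x * blockIdx.x
//   yid = threadIdx.y + blockDim.y * blockIdx.y
//   gid = xid + width * yid   (row-major, width = array width)
std::uint64_t globalIndex(Idx2 thread, Idx2 block, Idx2 blockDim,
                          std::uint32_t width) {
    assert(thread.x < blockDim.x && thread.y < blockDim.y);
    std::uint64_t xid = thread.x + static_cast<std::uint64_t>(blockDim.x) * block.x;
    std::uint64_t yid = thread.y + static_cast<std::uint64_t>(blockDim.y) * block.y;
    assert(xid < width);  // the formula assumes xid stays inside one row
    return xid + static_cast<std::uint64_t>(width) * yid;
}
```

With 16x16 blocks and a width of 1024, thread (0,0) of block (0,0) maps to 0 and thread (15,15) of block (63,63) maps to 1,048,575, the last element.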
To sum up: each thread has a unique index within its block, each block has a unique index within the grid, and you combine the two to generate a unique global index. The per-block thread index is what you would use to address shared memory, while the global index addresses your array in global memory.
Finally, where is thread ID 401,000?
xid = 8 + 16 * 38; // threadIdx.x + blockDim.x * blockIdx.x
yid = 7 + 16 * 24; // threadIdx.y + blockDim.y * blockIdx.y
gid = xid + 1024 * yid; // 401000
It’s in the block (38,24) using thread (8,7).
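If you want to convince yourself that this is the only (block, thread) pair producing 401,000, a brute-force host-side check in plain C++ is enough. This is only a sketch: `Owner` and `findOwner` are hypothetical names, and the 64x64 grid of 16x16 blocks over a 1024-wide array is the configuration assumed throughout this reply.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: which block and which thread own a global id.
struct Owner {
    std::uint32_t blockX, blockY;
    std::uint32_t threadX, threadY;
    int matches;  // how many (block, thread) pairs produced the id
};

// Tries every block and every thread of the assumed configuration and
// records the pair whose global id equals the target.
Owner findOwner(std::uint64_t target) {
    const std::uint32_t blockSize = 16, gridSize = 64, width = 1024;
    assert(target < static_cast<std::uint64_t>(width) * width);
    Owner owner{0, 0, 0, 0, 0};
    for (std::uint32_t by = 0; by < gridSize; ++by)
        for (std::uint32_t bx = 0; bx < gridSize; ++bx)
            for (std::uint32_t ty = 0; ty < blockSize; ++ty)
                for (std::uint32_t tx = 0; tx < blockSize; ++tx) {
                    std::uint64_t xid = tx + blockSize * bx;
                    std::uint64_t yid = ty + blockSize * by;
                    std::uint64_t gid = xid + width * yid;
                    if (gid == target)
                        owner = Owner{bx, by, tx, ty, owner.matches + 1};
                }
    return owner;
}
```

For a target of 401000 the search finds exactly one match, block (38,24) with thread (8,7).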
Hope this helps.
Homework: how do you get the block and thread indices back from 401,000? :)
Best regards!