I think the fundamental flaw in your understanding is assuming a coupling between the data and the threads. A grid consists of a number of blocks (which can have up to 3 dimensions), and each block is in turn composed of threads (which can also have up to 3 dimensions). You specify the grid and block dimensions when you launch your kernel. As a word of advice, aim for block sizes that are multiples of 32; you might not care why yet, but it's because warps (the smallest unit of threads the hardware schedules to execute your code) are groups of 32 threads. Inside your kernel, each thread executes your function independently of the data you pass in (i.e. there's no tight coupling between threads and data). Each thread can work out its own identity from its threadIdx and blockIdx values and use that in your function logic (if it needs it).
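To make that concrete, here's a minimal sketch (the kernel name, array, and sizes are my own illustrative choices, not anything from your code): each thread derives one global index from its blockIdx/threadIdx, and the launch configuration is chosen to cover the array.

```cuda
__global__ void scale(float *d_data, int n, float factor)
{
    // Each thread computes a unique global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // the grid is rounded up, so guard against overrun
        d_data[i] *= factor;
}

void launch(float *d_data, int n)
{
    int threadsPerBlock = 256;  // a multiple of 32, per the warp advice above
    // Round the grid size up so blocks * threads covers all n elements.
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, n, 2.0f);
}
```

Note the kernel itself never assumes any particular grid shape; the mapping from threads to data lives entirely in how *you* compute `i`.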
Also, there's no mandate that you need 1 thread per array element. Depending on your application (and yours has very low arithmetic intensity, i.e. very little computation per byte of memory traffic), it might be more efficient NOT to use 1 thread per element.
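The usual way to do that is a grid-stride loop: launch fewer threads than elements and have each thread walk the array with a stride equal to the total number of threads in the grid. A sketch (names like `numSMs` are hypothetical placeholders):

```cuda
__global__ void scale(float *d_data, int n, float factor)
{
    int stride = gridDim.x * blockDim.x;  // total threads launched
    // Each thread handles elements i, i + stride, i + 2*stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        d_data[i] *= factor;
}

// The launch configuration is now decoupled from n: you might size the
// grid to the hardware (e.g. some multiple of the SM count) instead.
// scale<<<numSMs * 4, 256>>>(d_data, n, 2.0f);
```

This also has the nice property that the same kernel works correctly for any n and any launch configuration, which makes it easy to tune block/grid sizes later.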