Relation between .raw file dimensions, threads, and blocks. How many threads for 512 x 512 x 512 voxels?

I created 64 x 64 x 64 volume data as an 8-bit raw file and then ran marching cubes on it in CUDA.
It works great, but it is slower than the 32 x 32 x 32 version.
I guess that is because 64 x 64 x 64 has double the number of voxels.

Now, for example, if I have .raw data with dimensions 512 x 512 x 284,
how should I choose the number of threads and the number of blocks?

For example, for the 32 version I used 128 threads, and for the 64 version I used 256.

I am still not clear on how I should decide the number of threads.

For 512 x 512 x 284, do I need to use a voxel grid of dimension 512 x 512 x 284?
Could someone please explain the lines below a bit?
How should I decide the number of threads? Is it related to the voxel dimensions?

int threads = 256; //128;
dim3 grid(numVoxels / threads, 1, 1);

// get around maximum grid size of 524287 in each dimension
if (grid.x > 524287)              //65535)
{
	// braces are required so that grid.x is only changed when the limit is exceeded
	grid.y = grid.x / 262144;
	grid.x = 262144;
}
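To make the arithmetic in the snippet above concrete, here is a minimal host-side sketch of the same idea: one thread per voxel, with the blocks spilling into grid.y once grid.x would exceed a limit. It is plain C++ so it can be compiled without the CUDA toolkit. `GridDim`, `ceilDiv`, `makeGrid`, and the `maxGridX` parameter are hypothetical names for illustration, not part of the original code, and the real limit should be queried from the device. Unlike the snippet, it rounds up, so a voxel count that is not a multiple of the block size is still fully covered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3 (x and y only), so that this
// sketch compiles as ordinary C++.
struct GridDim {
    std::uint32_t x;
    std::uint32_t y;
};

// Round-up integer division: how many chunks of size b are needed to cover a.
inline std::uint64_t ceilDiv(std::uint64_t a, std::uint64_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Pick a grid for "one thread per voxel".
// maxGridX is an assumed per-dimension limit passed in by the caller.
inline GridDim makeGrid(std::uint64_t numVoxels,
                        std::uint32_t threadsPerBlock,
                        std::uint32_t maxGridX) {
    assert(maxGridX > 0);
    const std::uint64_t blocks = ceilDiv(numVoxels, threadsPerBlock);
    GridDim grid{1, 1};
    if (blocks <= maxGridX) {
        grid.x = static_cast<std::uint32_t>(blocks);
    } else {
        // Too many blocks for one dimension: spread them over x and y.
        grid.x = maxGridX;
        grid.y = static_cast<std::uint32_t>(ceilDiv(blocks, maxGridX));
    }
    return grid;
}
```

For 512 x 512 x 284 voxels and 256 threads per block this needs 290816 blocks. Because grid.x * grid.y can overshoot that number, a kernel launched with such a grid would have to skip any linear index that is not less than numVoxels.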

I think you’ll find that when you double the size of each dimension, the number of voxels grows cubically, that is, it octuples.

Are you talking about threads? I was asking about the number of threads.
For 32 x 32 x 32 it was 128,

but for 512 x 512 x 512 we cannot have 2048 threads. Can we?

[1] When going from a volume of 32x32x32 to a volume of 64x64x64 the number of voxels does not double (2x), it octuples (8x).
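As a quick check of point [1], the voxel counts can be computed directly. This is only a small sketch, and `voxelCount` is a hypothetical helper name, not something from the code under discussion.

```cpp
#include <cassert>
#include <cstdint>

// Total number of voxels in an nx x ny x nz volume.
// 64-bit result so that large volumes do not overflow.
inline std::uint64_t voxelCount(std::uint32_t nx, std::uint32_t ny, std::uint32_t nz) {
    assert(nx > 0 && ny > 0 && nz > 0);
    return static_cast<std::uint64_t>(nx) * ny * nz;
}
```

With this, voxelCount(32, 32, 32) is 32768 and voxelCount(64, 64, 64) is 262144, a factor of 8. The 512 x 512 x 284 volume has 74448896 voxels.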

[2] The number of threads per thread block is limited to 1024 (see CUDA documentation, specifically table 14 in appendix H of the CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities).

Thanks a lot!
The first explanation cleared up a big confusion.
The second link is very helpful too! It gives all the information.
I knew about the second point, I mean the 1024-thread limit, so I am wondering: when the grid size is 512 x 512 x 512, what should the number of threads be? Or will it work if I use fewer threads? It may be a bit slow, but if it works, let me know.
Any advice will be a big help.

Many thanks again, NJ! :)

I am not sure what you are asking. You get to decide how many voxels each thread will handle in your code. In fact, finding a suitable mapping of threads to data is one of the primary design decisions when writing CUDA code.
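One common way to let each thread handle more than one voxel is a strided loop over the linear voxel index. The following is a host-side emulation in plain C++, only meant to show which voxels a given thread would visit. `voxelsForThread` is a hypothetical name, and in a real kernel the thread id and total thread count would be derived from blockIdx, blockDim, threadIdx, and gridDim rather than passed in.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Emulates a strided loop on the host: returns the voxel indices that the
// thread with linear id `tid` would process when `totalThreads` threads
// share `numVoxels` voxels between them.
inline std::vector<std::uint64_t> voxelsForThread(std::uint64_t tid,
                                                  std::uint64_t totalThreads,
                                                  std::uint64_t numVoxels) {
    assert(totalThreads > 0 && tid < totalThreads);
    std::vector<std::uint64_t> mine;
    // Start at the thread's own index, then jump by the total thread count.
    for (std::uint64_t i = tid; i < numVoxels; i += totalThreads) {
        mine.push_back(i);
    }
    return mine;
}
```

With this kind of mapping the number of threads no longer has to match the number of voxels: fewer threads simply means each thread loops more times.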

Often it is best to design CUDA code so there is a favorable (e.g. regularized) mapping of threads to destination data, then have each thread collect whatever source data it needs to produce that destination data.
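As an illustration of such a regular mapping, here is a minimal sketch that turns a linear thread index into the (x, y, z) coordinates of the voxel it is responsible for. It assumes x varies fastest in memory, which may or may not match the layout of a given .raw file. `Voxel` and `voxelFromIndex` are hypothetical names. Division and modulo are used so that it also works for sizes that are not powers of two, such as 284.

```cpp
#include <cassert>
#include <cstdint>

// Coordinates of one voxel in the volume.
struct Voxel {
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t z;
};

// Map a linear index i in [0, nx*ny*nz) to voxel coordinates,
// assuming x is the fastest-varying dimension and z the slowest.
inline Voxel voxelFromIndex(std::uint64_t i,
                            std::uint32_t nx, std::uint32_t ny, std::uint32_t nz) {
    const std::uint64_t slice = static_cast<std::uint64_t>(nx) * ny;
    assert(nx > 0 && ny > 0 && i < slice * nz);
    Voxel v;
    v.x = static_cast<std::uint32_t>(i % nx);
    v.y = static_cast<std::uint32_t>((i / nx) % ny);
    v.z = static_cast<std::uint32_t>(i / slice);
    return v;
}
```

Each thread can then read whatever neighbouring samples it needs around (x, y, z) to produce the output for its own voxel.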

If you are modifying someone else’s code, you need to understand the mapping of threads to data used by that code, then determine whether that can be scaled to the size of data set you want to handle. That may not always be possible (limited scalability due to resource limitations), and you may need to change how threads are mapped.

Many thanks, NJ! I need to do a lot of digging, but the clue you gave above is a big help.
I am a beginner, and it looks like I have to change how the threads are mapped. But before that I need to learn many things. I have to understand CUDA clearly at a conceptual level, and only then can I remap.
I will be back after a few days of brainstorming and digging through the code.