Allocating threads in CUDA function

The motivation for my question: I am translating C++ functions containing single and doubly nested for loops to CUDA.

How should I allocate the threads in x and y dimensions?

In the CUDA kernel, I assume I am correctly computing the global indices as -

int64_t threadx_dim = blockIdx.x * blockDim.x + threadIdx.x;
int64_t thready_dim = blockIdx.y * blockDim.y + threadIdx.y;

But I am not sure whether I am allocating them correctly in the host code.

When I was translating C++ functions with just a single for loop, or multiple for loops none of which were nested, I did -

if (std::max(arg1, arg2) > 1024) {
  blocks_per_grid = dim3(ceil(std::max(arg1, arg2) / 1024.0), 1, 1);
  threads_per_block = dim3(1024, 1, 1);
} else {
  blocks_per_grid = dim3(1, 1, 1);
  threads_per_block = dim3(std::max(arg1, arg2), 1, 1);
}

Where arg1 and arg2 are the iteration bounds of the two non-nested for loops.

When it comes to assigning threads in the Y dimension as well, extending the same idea might look something like this -

// Note: the *total* threads per block (x * y * z) is capped at 1024,
// so a 2D block cannot be 1024 x 1024; 32 x 32 = 1024 is the largest
// square block.
if (arg1 > 32 && arg2 > 32) {
  blocks_per_grid = dim3(ceil(arg1 / 32.0), ceil(arg2 / 32.0), 1);
  threads_per_block = dim3(32, 32, 1);
} else if (arg1 > 32) {
  blocks_per_grid = dim3(ceil(arg1 / 32.0), 1, 1);
  threads_per_block = dim3(32, arg2, 1);
} else if (arg2 > 32) {
  blocks_per_grid = dim3(1, ceil(arg2 / 32.0), 1);
  threads_per_block = dim3(arg1, 32, 1);
} else {
  blocks_per_grid = dim3(1, 1, 1);
  threads_per_block = dim3(arg1, arg2, 1);
}

Where arg1 is the bound of the outer loop and arg2 the bound of the inner loop.

Is there a cleaner way to do this?