These are probably basic questions, but I would like to confirm.
In CUDA C, the threads within a block are ordered linearly so that threadIdx.x increases fastest, then threadIdx.y, and finally threadIdx.z.
Is this the same in CUDA Fortran?
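To make the ordering I mean concrete, here is a minimal sketch in plain host C (no GPU required). The `idx3` type and the `linear_thread_id` function are hypothetical names used only for illustration; they stand in for CUDA's `threadIdx` and `blockDim` and are not part of any CUDA API.

```c
/* Hypothetical stand-in for CUDA's 3-component index types
   (illustrative only, not a CUDA type). */
typedef struct { unsigned x, y, z; } idx3;

/* Linear thread ID within a block as described above for CUDA C:
   x varies fastest, then y, then z (all indices 0-based). */
static unsigned linear_thread_id(idx3 t, idx3 dim) {
    return t.x + t.y * dim.x + t.z * dim.x * dim.y;
}
```

For a block of dimensions (4, 3, 2), stepping x by one moves the linear ID by 1, stepping y moves it by 4, and stepping z moves it by 12.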
Another question: memory allocated with cudaMalloc() is guaranteed to be aligned; is the same true when using allocate()?
Finally, when using such runtime APIs from Fortran, is the data laid out in column-major order, or in row-major order as in CUDA C?
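To be clear about the two layouts I am asking about, here is a small sketch in plain C. The function names `row_major_offset` and `col_major_offset` are hypothetical and chosen only for illustration; both use 0-based indices for a 2D array with `rows` rows and `cols` columns.

```c
#include <stddef.h>

/* Row-major layout (as in C): elements of one row are contiguous,
   so the column index j varies fastest in memory. */
static size_t row_major_offset(size_t i, size_t j, size_t rows, size_t cols) {
    (void)rows;  /* not needed for the row-major offset */
    return i * cols + j;
}

/* Column-major layout (as in standard Fortran): elements of one column
   are contiguous, so the row index i varies fastest in memory. */
static size_t col_major_offset(size_t i, size_t j, size_t rows, size_t cols) {
    (void)cols;  /* not needed for the column-major offset */
    return i + j * rows;
}
```

For a 3 x 4 array, element (1, 2) sits at offset 6 in the row-major layout and at offset 7 in the column-major layout, which is why the answer matters for coalesced access.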