I don’t get why it needs to be flattened. If the 4x3 elements are contiguous in memory, the memory copy should be as simple as copying 12*sizeof(element) bytes.
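To make the point above concrete, here is a minimal sketch of copying a contiguous 4x3 block in a single operation. It assumes 32-bit float elements stored row-major; the names `src`, `dst`, and `nbytes` are just for illustration, and the byte-level slice assignment stands in for a single `memcpy` call.

```python
from array import array

ROWS, COLS = 4, 3

# A 4x3 matrix stored row-major in one contiguous buffer (assumed float32 elements).
src = array("f", [float(r * COLS + c) for r in range(ROWS) for c in range(COLS)])
dst = array("f", [0.0] * (ROWS * COLS))

# Total size of the block: 12 * sizeof(element).
nbytes = ROWS * COLS * src.itemsize

# Copy the raw bytes in one go, the Python analogue of a single memcpy.
memoryview(dst).cast("B")[:nbytes] = memoryview(src).cast("B")[:nbytes]
```

This only works because the 12 elements sit back to back; if rows were padded or strided, a single copy of that size would pick up the wrong bytes.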
The sequence of elements depends on how you organized your data.
In TRT, if an input tensor is marked as NCHW FP32, it will be option 2.
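As a rough illustration of what a densely packed (linear) NCHW layout implies for element order, the sketch below computes the offset of each element. It assumes the 4x3 data is a single image with one channel, H = 4 and W = 3, and no padding or vectorized packing; `nchw_offset` is a hypothetical helper, not a TensorRT function.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Linear element offset of (n, c, h, w) in a densely packed NCHW buffer."""
    return ((n * C + c) * H + h) * W + w

# Assumed shape: one image, one channel, 4 rows, 3 columns.
N, C, H, W = 1, 1, 4, 3

# Visit the elements row by row and record where each one lands in memory.
offsets = [nchw_offset(0, 0, h, w, C, H, W) for h in range(H) for w in range(W)]
```

Under these assumptions W varies fastest, so walking the matrix row by row visits offsets 0 through 11 in order, and the byte offset of an FP32 element is its element offset times 4.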