Suggestion from experts: what kind of architecture would you suggest?

DISCLAIMER: I know this question may be a bit off topic, as it is not directly related to an NVIDIA product. Nevertheless, I hope that someone here with expertise in AI can suggest how to tackle this problem. I apologise for this, but I’d really love some advice from an expert!
Now to the real question.

I am having trouble deciding which kind of ML architecture to adopt for an academic example.

The physical problem involves fluid dynamics in a “system of pipes”. Ad-hoc numerical methods exist for solving these problems, and I use them to generate the (output) data on which the network is later trained. The “problem” lies in the dimensional mismatch between input and output. The input consists of data describing the “system discretization & characteristics”, which is basically the input these numerical models need in order to run. It can be arranged as either a 1D or a 2D tensor; in the latter case the input data is simply grouped by “segment”.

The output is structured differently. While it shares the same base system discretization, it “augments” it, in the sense that each base segment is divided into a different number of sub-segments. Additionally, it depends on a time discretization (i.e. the dynamics). To give a concrete example, the input might have shape (5, 17) (5 pieces of base information for each of the 17 base segments), while the output has shape (355, 100) (where 355 is the total number of sub-segments and 100 is the number of time steps).

So far I have been using, for “inheritance issues”, both 1D and 2D Dense-Encoder-Decoder CNNs. From the literature I was able to find, I gathered that these (especially the 2D ones) are well suited to (2D) images, or to problems that can be translated into a “state-image” counterpart, where each pixel is spatially correlated with its neighbours. In my case, I really cannot see how that applies. There is some sort of “spatial coherence” between the base segments, but clearly not of the same kind as between the pixels of an image.
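To make the shape mismatch concrete, here is a minimal NumPy sketch of the data layout. The per-segment sub-segment counts in `sub_counts` are invented purely for illustration (they only need to sum to 355), the arrays are random placeholders rather than real solver data, and the names `parent` and `x_sub` are hypothetical helpers, not part of my actual pipeline.

```python
import numpy as np

n_features, n_base = 5, 17        # input: 5 values for each of 17 base segments
n_sub_total, n_steps = 355, 100   # output: 355 sub-segments x 100 time steps

# ASSUMPTION: made-up number of sub-segments per base segment (sums to 355).
sub_counts = np.array([10, 15, 20, 25, 30] * 3 + [25, 30])
assert sub_counts.shape == (n_base,)
assert sub_counts.sum() == n_sub_total

rng = np.random.default_rng(0)
x = rng.random((n_features, n_base))     # placeholder for the network input
y = rng.random((n_sub_total, n_steps))   # placeholder for the solver output

# parent[i] is the index of the base segment that contains sub-segment i.
parent = np.repeat(np.arange(n_base), sub_counts)

# Copy each base segment's features onto its sub-segments, so that the
# input and the output share the same spatial axis of length 355.
x_sub = x[:, parent]

print(x.shape, y.shape, parent.shape, x_sub.shape)
```

The point of the sketch is only that the relation between the two spatial axes (17 versus 355) is an irregular one-to-many map from base segments to sub-segments, and that the time axis (100) has no counterpart in the input at all.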

Now the question: hoping I have at least conveyed the idea of the problem, what type of architecture would you suggest for this kind of problem?

Thanks to everyone.
Cheers :)