Please do check out RAPIDS, and BlazingSQL (built on RAPIDS) for out-of-core processing given your data sizes. This can be helpful in a few ways beyond just speeding up your ETL as you scale out. It would be good to clarify what your reduce step entails. From what I have read so far, it seems like you are performing aggregations. In that case, our groupby aggs and join operations are extremely performant, especially at scale.
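To show the shape of that groupby-agg plus join "reduce" step, here is a minimal sketch. It uses pandas, since cuDF deliberately mirrors the pandas API (on a GPU you would swap in `import cudf` and the same calls apply); the tables, column names, and aggregations are made up for illustration, not taken from your pipeline.

```python
import pandas as pd  # cuDF mirrors this API; use `import cudf as pd` on GPU

# Hypothetical event data standing in for your larger dataset
events = pd.DataFrame({
    "user": ["a", "b", "a", "c", "b", "a"],
    "value": [10, 20, 30, 40, 50, 60],
})
users = pd.DataFrame({"user": ["a", "b", "c"],
                      "region": ["us", "eu", "us"]})

# The "reduce" step as a groupby aggregation...
agg = events.groupby("user", as_index=False).agg(
    total=("value", "sum"),
    n=("value", "count"),
)

# ...followed by a join back to a dimension table
result = agg.merge(users, on="user")
print(result)
```

These are exactly the operations (groupby aggs and joins) that cuDF accelerates on GPU, so the same code scales up largely unchanged.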
ETL- I think this is a similar use case (video from Walmart). You may get exactly what you seek. It may take a little rework on your side, as RAPIDS is in Python, but the results may be worth it. We have both JSON readers as well as streaming capability with cuStreamz.
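For the JSON reader, a quick sketch of reading newline-delimited JSON, again shown with pandas since `cudf.read_json` follows the same call shape; the sample records are invented:

```python
import io
import pandas as pd  # on GPU: cudf.read_json takes the same arguments

# Newline-delimited JSON, a common shape for streaming ETL logs (made-up sample)
raw = '{"id": 1, "amt": 9.5}\n{"id": 2, "amt": 3.0}\n'

df = pd.read_json(io.StringIO(raw), lines=True)
print(df["amt"].sum())
```

In practice you would point the reader at files (or batches arriving via cuStreamz) rather than an in-memory string.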
Interactive visual EDA and specialist tools- For insights, RAPIDS allows some interesting "in process, interactive visualizations". Check out this notebook I made analyzing taxi data, this video on how we used it to build out a specialist tool for genomics, and this webapp dashboard tutorial with census data for COVID.