OptiX 6.5 - Multi-GPU

I want to move my OptiX application, which currently runs on version 6.5, to a multi-GPU setup.
From the documentation I inferred that OptiX uses all available GPUs by default. It does, but performance is slower than with a single GPU. I have some questions about that. I have quite a number of buffers marked as RT_BUFFER_OUTPUT (roughly 10 or so). What happens to these buffers in a multi-GPU setup? Is there a copy of each of them on every GPU, with a sync step after the computation is done? Or do all the buffers reside on the host, with the data computed on the GPUs and transferred via PCIe? Does the same happen for RT_BUFFER_INPUT?

Please have a look at the following threads about multi-GPU topics in OptiX 6 and earlier:
https://forums.developer.nvidia.com/t/cuda-optix-gpu-utilisation/58621
https://forums.developer.nvidia.com/t/multi-gpu/40472
https://forums.developer.nvidia.com/t/question-about-handling-buffers-when-using-multiple-gpus/54011
https://forums.developer.nvidia.com/t/very-poor-multi-gpu-scaling-on-dgx-1/67139
https://forums.developer.nvidia.com/t/createbufferfromglbo-function-crash-in-multi-gpu-environment/62060/4
Look for “pinned memory” and RT_BUFFER_GPU_LOCAL in these explanations.

There are also sections in the OptiX 6.5.0 programming guide that touch on multi-GPU:
https://raytracing-docs.nvidia.com/optix6/guide_6_5/index.html#cuda#interoperability-with-cuda
https://raytracing-docs.nvidia.com/optix6/guide_6_5/index.html#performance#performance-guidelines

That said, with OptiX 7 you would have explicit control over all multi-GPU behavior, because OptiX 7 itself knows nothing about multiple devices. That part is handled entirely by the CUDA host code you control!

The OptiX 7 applications linked here contain one example which shows different methods to distribute the rendering workload of one frame over multiple GPUs:
https://forums.developer.nvidia.com/t/optix-advanced-samples-on-github/48410/4
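
To give a rough idea of what "distributing the workload of one frame" can mean on the host side, here is a minimal sketch that splits the rows of an image into one contiguous strip per device. This is only an illustration and not necessarily one of the methods used in the linked example. The names RowRange and splitRows are hypothetical, and no OptiX or CUDA calls are involved. In a real application each strip would be handed to a launch on the matching device.

```cpp
#include <cassert>
#include <vector>

// Half-open range of image rows [begin, end) assigned to one device.
// Hypothetical helper type, not part of OptiX or CUDA.
struct RowRange
{
    int begin;
    int end;
};

// Split 'height' rows into 'deviceCount' contiguous strips.
// When the height is not evenly divisible, the first
// (height % deviceCount) devices each get one extra row.
std::vector<RowRange> splitRows(int height, int deviceCount)
{
    assert(height >= 0 && deviceCount > 0);

    std::vector<RowRange> ranges;
    ranges.reserve(deviceCount);

    const int base  = height / deviceCount; // rows every device gets
    const int extra = height % deviceCount; // leftover rows to hand out

    int begin = 0;
    for (int i = 0; i < deviceCount; ++i)
    {
        const int rows = base + (i < extra ? 1 : 0);
        ranges.push_back({begin, begin + rows});
        begin += rows;
    }
    return ranges;
}
```

Calling splitRows(1080, 2) would give each of two devices 540 rows. Equal strips only balance well when all GPUs are similar and the scene cost is roughly uniform over the image. Smaller interleaved tiles are a common alternative when that is not the case.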