Can I perform parallel reduction accumulation on the data stored in OptiX's CUDAOutputBuffer?

If I create a CUDAOutputBuffer like below:
sutil::CUDAOutputBuffer<float3> result(sutil::CUDAOutputBufferType::CUDA_DEVICE, rayTubeNumPerLine, rayTubeNumPerLine);
Is there a way to accumulate the float3 values directly on the GPU with this buffer type?

Is your question how to implement the reduction algorithm or just how to access the CUDA device data?

The sutil::CUDAOutputBuffer class is just a thin wrapper around the underlying buffer data.

If that buffer was allocated on the CUDA device, its map() function returns a 64-bit CUDA device pointer, which you can use in other CUDA kernels within the same process.
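As a rough illustration of using such a device pointer outside of OptiX, here is a minimal sketch that sums float3 values with Thrust's reduce. The names Float3Add and sumFloat3 are made up for this example and are not part of sutil or OptiX; the sketch only assumes that d_data is a valid device pointer to count float3 elements.

```cuda
#include <cuda_runtime.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>

// Component-wise addition for float3, callable on host and device.
// (Hypothetical helper, not part of sutil or OptiX.)
struct Float3Add
{
    __host__ __device__ float3 operator()(const float3& a, const float3& b) const
    {
        return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
    }
};

// Sums `count` float3 values stored in device memory and returns the
// result to the host. `d_data` is assumed to be a device pointer, for
// example the one returned by map() on a CUDA_DEVICE output buffer.
float3 sumFloat3(const float3* d_data, size_t count)
{
    // thrust::device tells Thrust that the raw pointers refer to device memory.
    return thrust::reduce(thrust::device, d_data, d_data + count,
                          make_float3(0.0f, 0.0f, 0.0f), Float3Add());
}
```

With the buffer from the question, the call would presumably look like sumFloat3(result.map(), size_t(rayTubeNumPerLine) * rayTubeNumPerLine), followed by result.unmap() as the SDK examples do. The file needs to be compiled with nvcc.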

Generally, there is no requirement to use any of the sutil code in your own applications. It is only there to make writing the OptiX SDK examples easier. That means you could also allocate and use your CUDA device memory directly with cudaMalloc() or cuMemAlloc() instead.
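For the case without sutil, a minimal sketch of a hand-written parallel reduction over memory allocated with cudaMalloc() could look like the following. The names reduceFloat3, sumOnDevice and kBlockSize are invented for this example, error checking is left out for brevity, and the host upload only stands in for data that an OptiX launch would normally have written on the device already.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // power of two, required by the tree reduction

// Each thread accumulates a private sum, each block combines its threads'
// sums in shared memory, and thread 0 adds the block result to *out.
__global__ void reduceFloat3(const float3* in, size_t count, float3* out)
{
    __shared__ float3 partial[kBlockSize];

    // Grid-stride loop so any grid size covers all elements.
    float3 sum = make_float3(0.0f, 0.0f, 0.0f);
    for (size_t i = blockIdx.x * size_t(blockDim.x) + threadIdx.x; i < count;
         i += size_t(gridDim.x) * blockDim.x)
    {
        sum.x += in[i].x;
        sum.y += in[i].y;
        sum.z += in[i].z;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
        {
            partial[threadIdx.x].x += partial[threadIdx.x + stride].x;
            partial[threadIdx.x].y += partial[threadIdx.x + stride].y;
            partial[threadIdx.x].z += partial[threadIdx.x + stride].z;
        }
        __syncthreads();
    }

    // One atomic update per block and component.
    if (threadIdx.x == 0)
    {
        atomicAdd(&out->x, partial[0].x);
        atomicAdd(&out->y, partial[0].y);
        atomicAdd(&out->z, partial[0].z);
    }
}

// Hypothetical helper: allocates its own device buffers with cudaMalloc(),
// uploads the host data, reduces it on the GPU and returns the sum.
float3 sumOnDevice(const float3* h_data, size_t count)
{
    float3* d_in  = nullptr;
    float3* d_out = nullptr;
    cudaMalloc(&d_in, count * sizeof(float3));
    cudaMalloc(&d_out, sizeof(float3));
    cudaMemcpy(d_in, h_data, count * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float3));  // all-zero bits are 0.0f

    int blocks = int((count + kBlockSize - 1) / kBlockSize);
    if (blocks > 1024) blocks = 1024;  // grid-stride loop handles the rest
    if (blocks < 1) blocks = 1;
    reduceFloat3<<<blocks, kBlockSize>>>(d_in, count, d_out);

    float3 result;
    cudaMemcpy(&result, d_out, sizeof(float3), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Because the per-block results are combined with atomicAdd, the order of the floating-point additions can differ between runs, so the last bits of the sum are not guaranteed to be reproducible. If that matters, writing one partial sum per block to a second buffer and reducing that in a second pass avoids the atomics.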