ROI Pooling with CUDA 9.2

Hi all,

What’s the most efficient way to perform an ROI pooling operation with CUDA 9.2? Is it more efficient to have a single thread pool an entire region, or to split the work so that each thread pools only a single channel of a region?

Presumably a single thread per region would be fairly expensive, since each thread would have to run the entire (regionWidth, regionHeight, channel) loop nest, and it could result in that whole loop nest being unrolled.
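
To make the two options concrete, here is a rough sketch of what I have in mind. The `Roi` struct, the function and kernel names, and the memory layouts (features as [C][H][W], output as [R][C][PH][PW], max pooling, ROIs already in feature-map coordinates and inside the map) are all assumptions made up for illustration, not taken from any library.

```cuda
#include <cfloat>
#include <cmath>

// Hypothetical ROI in feature-map coordinates, covering [x0, x1) x [y0, y1).
struct Roi { int x0, y0, x1, y1; };

// Max-pools one channel of one region into a PH x PW grid written to outRC.
// Assumes features is laid out as [C][H][W] and the ROI lies inside the map.
__host__ __device__ inline void poolRegionChannel(
    const float* features, Roi roi, int c,
    int H, int W, int PH, int PW, float* outRC)
{
    int rw = roi.x1 - roi.x0;
    int rh = roi.y1 - roi.y0;
    for (int py = 0; py < PH; ++py) {
        for (int px = 0; px < PW; ++px) {
            // Bin bounds: floor for the start, ceil for the end.
            int ys = roi.y0 + (py * rh) / PH;
            int ye = roi.y0 + ((py + 1) * rh + PH - 1) / PH;
            int xs = roi.x0 + (px * rw) / PW;
            int xe = roi.x0 + ((px + 1) * rw + PW - 1) / PW;
            float m = -FLT_MAX;
            for (int y = ys; y < ye; ++y)
                for (int x = xs; x < xe; ++x)
                    m = fmaxf(m, features[(c * H + y) * W + x]);
            outRC[py * PW + px] = m;
        }
    }
}

// Option A: one thread per region; each thread loops over every channel.
__global__ void roiPoolPerRegion(
    const float* features, const Roi* rois, int R,
    int C, int H, int W, int PH, int PW, float* out)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= R) return;
    for (int c = 0; c < C; ++c)
        poolRegionChannel(features, rois[r], c, H, W, PH, PW,
                          out + (r * C + c) * PH * PW);
}

// Option B: one thread per (region, channel) pair.
__global__ void roiPoolPerRegionChannel(
    const float* features, const Roi* rois, int R,
    int C, int H, int W, int PH, int PW, float* out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= R * C) return;
    int r = idx / C;
    int c = idx % C;
    poolRegionChannel(features, rois[r], c, H, W, PH, PW,
                      out + idx * PH * PW);
}
```

With 256 threads per block, option A would be launched with (R + 255) / 256 blocks and option B with (R * C + 255) / 256 blocks, so option B exposes C times as many threads for the same input.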

Thanks!