Histogram, 4 million points, 140 million bins, 100 FPS, fastest implementation?

I’m relatively new to CUDA programming, and I’m trying to theoretically determine the right path to go on, but I keep getting stuck.

I have 4096 data generators (pixels) that independently generate about 100,000 data points per second. These data points are grouped into chunks of 1000. This data needs to be histogrammed and processed (peak search x4) in real time at 100 frames per second.

Each data point for each pixel represents a histogram bin index from 0 to 8191. Each data point is also qualified as a “valid” or “invalid” datum. Valid datums need to be binned separately from invalid datums, but both pieces of information are required for later processing. As a result, each pixel is assigned two separate histograms - one for valid datums, and one for invalid datums.
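To make the layout concrete, here is a minimal sketch of the flat index math I have in mind, assuming the valid histogram for each pixel is stored immediately before its invalid histogram (the layout and the `hist_index` helper are just my own illustration, not anything from a library):

```c
#include <stddef.h>
#include <stdint.h>

enum { NUM_BINS = 8192 };  /* bin indices run 0..8191 */

/* Flat counter index for (pixel, bin, valid-flag).
 * Each pixel owns 2 * NUM_BINS counters; the invalid
 * histogram occupies the upper half of the pixel's slab. */
static size_t hist_index(size_t pixelIdx, unsigned binIdx, int isValid)
{
    return pixelIdx * 2u * NUM_BINS + (isValid ? 0u : NUM_BINS) + binIdx;
}
```

With 4096 pixels this gives 4096 × 2 × 8192 counters in one contiguous allocation, so both histograms of a pixel stay adjacent in memory.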

Everything I read online says to maximize throughput by using shared memory, but shared memory is so limited that I can practically only assign one pixel to its own SM (a single pixel's pair of 8192-bin histograms is already 64 KB with 32-bit counters), and I would need to perform ~512 batches of histograms (maybe only 32 or 48 batches on the newest cards). This seems wasteful.
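For reference, this is roughly the shared-memory variant I have been picturing: one block per pixel, both histograms privatized in dynamic shared memory, then flushed to global memory. The packed sample format (bin index in the low 13 bits, valid flag in bit 15) is purely my assumption for illustration; the 64 KB of dynamic shared memory also needs an explicit opt-in via `cudaFuncSetAttribute` on Volta and newer parts.

```cuda
#include <stdint.h>

// Sketch only, under assumed sample packing: bits 0-12 = bin, bit 15 = !valid.
// Launch: pixelHistKernel<<<4096, 256, 2 * 8192 * sizeof(uint32_t)>>>(...)
// after cudaFuncSetAttribute(pixelHistKernel,
//     cudaFuncAttributeMaxDynamicSharedMemorySize, 2 * 8192 * sizeof(uint32_t));
__global__ void pixelHistKernel(const uint16_t *samples,  // [numPixels * chunkLen]
                                uint32_t *hist,           // [numPixels * 2 * 8192]
                                int chunkLen)
{
    extern __shared__ uint32_t sbins[];       // 2 * 8192 private counters
    const int pixel = blockIdx.x;

    for (int i = threadIdx.x; i < 2 * 8192; i += blockDim.x)
        sbins[i] = 0;                         // zero the private histograms
    __syncthreads();

    const uint16_t *in = samples + (size_t)pixel * chunkLen;
    for (int i = threadIdx.x; i < chunkLen; i += blockDim.x) {
        uint16_t s  = in[i];                  // coalesced read
        int bin     = s & 0x1FFF;             // 13-bit bin index (0..8191)
        int invalid = (s >> 15) & 1;          // hypothetical valid/invalid flag
        atomicAdd(&sbins[invalid * 8192 + bin], 1u);   // shared-memory atomic
    }
    __syncthreads();

    uint32_t *out = hist + (size_t)pixel * 2 * 8192;
    for (int i = threadIdx.x; i < 2 * 8192; i += blockDim.x)
        if (sbins[i]) atomicAdd(&out[i], sbins[i]);    // flush nonzero bins
}
```

Note the 64 KB per block limits occupancy to one (maybe two) resident blocks per SM, which is exactly the batching problem described above; with only 1000 samples per chunk, most of the time would go to zeroing and flushing bins rather than binning.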

The other option is to use global memory (high latency) and treat the problem as one super-dataset: any thread can grab any of the 4 million datums from any pixel and atomically increment the corresponding bin, since the flat bin index is simply `pixelIdx * 8192 + binIdx` (plus an offset to select the valid or invalid histogram). I imagine that reads from the raw data can be coalesced this way, but since writes to the histogram bins are effectively random access, the warps will have major memory access problems.
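The global-atomics version I am describing would look something like this grid-stride loop (again assuming the same hypothetical packed sample format, and that the 1000 samples of each chunk are stored contiguously per pixel):

```cuda
#include <stdint.h>

// One thread per sample via a grid-stride loop over the whole 4M-sample frame.
// Assumed packing: bits 0-12 = bin index, bit 15 = !valid.
__global__ void globalHistKernel(const uint16_t *samples,  // [n] packed samples
                                 uint32_t *hist,           // [numPixels * 2 * 8192]
                                 size_t n, int chunkLen)
{
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        uint16_t s  = samples[i];            // coalesced read of raw data
        size_t pix  = i / chunkLen;          // contiguous-per-pixel assumption
        int bin     = s & 0x1FFF;            // bin index 0..8191
        int invalid = (s >> 15) & 1;         // hypothetical valid/invalid flag
        atomicAdd(&hist[(pix * 2 + invalid) * 8192 + bin], 1u);  // global atomic
    }
}
```

One mitigating factor: global atomics are resolved in L2 on current hardware, and with 67 million counters spread across 4096 pixels the collision rate per bin should be low, so the "random writes" may be less catastrophic than the worst case suggests.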

From your experience as CUDA programmers, is there any type of implementation that jumps out as being better suited to this problem? If so, how did you arrive at this conclusion? Is it possible to theoretically determine the best path, or is it something that just needs to be benchmarked with multiple implementations?

P.S. Currently, I have tried the ArrayFire multidimensional histogramming function, but the maximum number of bins is only 4000, I cannot apply logic to decide whether a datum is valid or invalid without doing two separate histogram passes, and it is too slow to run in real time for my application (with all of the additional processing overhead).