Multi-sampling, deferred shading and OpenGL/DirectX/CUDA interop

Firstly, I apologize that this is largely a graphics question rather than a CUDA one. I’d like to use as much CUDA as possible, but my feeling is that I need a graphics API to make the best use of the rasterization hardware. I haven’t completely ruled out writing an entirely CUDA-based rasterizer or ray-tracer, though.

I’m looking at creating an X-ray radiograph simulator along the lines of gVirtualXRay (http://gvirtualxray.sourceforge.net/gvirtualxray.php).

They use a triangle mesh representation for their object and rely on hardware rasterization. They disable depth testing and back-face culling to ensure that all triangles are rendered. A custom pixel shader calculates a signed distance from the X-ray source to the triangle, where the sign indicates whether the ray is entering or leaving the object at that intersection. Render target blending then sums these signed distances, and the sum is the path length through the object. The intensity of a pixel in the simulated radiograph (which represents the amount of X-ray transmission) is proportional to exp(-attenuation_coefficient * path_length).
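To make the accumulation concrete, here is a minimal CPU-side sketch of what the blending stage computes for a single ray. The names (`Crossing`, `pathLength`, `transmission`) are hypothetical, and the sign convention (entering subtracts, leaving adds) is my assumption; it only gives a meaningful length for a closed mesh, where every entry is matched by an exit.

```cpp
#include <cmath>
#include <vector>

// One ray/surface crossing: distance from the X-ray source to the hit point,
// and whether the ray enters (front face) or leaves (back face) the object.
struct Crossing {
    double distance;
    bool entering;
};

// Equivalent of additive blending over signed distances.
// Assumed convention: entering crossings subtract, leaving crossings add.
// The order of the crossings does not matter, which is why depth testing
// and sorting are unnecessary.
double pathLength(const std::vector<Crossing>& crossings) {
    double sum = 0.0;
    for (const Crossing& c : crossings)
        sum += c.entering ? -c.distance : c.distance;
    return sum;
}

// Fraction of the incident beam transmitted along a path of the given length
// through a material with attenuation coefficient mu.
double transmission(double mu, double length) {
    return std::exp(-mu * length);
}
```

For a ray that enters at 2, leaves at 5, enters again at 7 and leaves at 8, the sum is (5 - 2) + (8 - 7) = 4 regardless of the order in which the triangles are rasterized.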

The main modification I would like to make is to calculate these path lengths for multiple ray paths within each pixel and then to combine them in a second pass. The second pass is needed because I have to combine intensities rather than path lengths: the exponential is nonlinear, so averaging the path lengths first would give a different result. Ideally I’d like to use CUDA for the second pass so I can make use of shared memory, thread synchronization, atomic operations, etc.
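A small sketch of why the order of operations matters, assuming a single attenuation coefficient `mu` and equal sample weights (both are simplifications, and the function names are hypothetical):

```cpp
#include <cmath>
#include <vector>

// What the second pass should compute: convert each sample's path length to
// an intensity, then average. Returns 1.0 (no attenuation) for an empty list,
// which is an arbitrary choice for this sketch.
double meanIntensity(const std::vector<double>& lengths, double mu) {
    if (lengths.empty()) return 1.0;
    double sum = 0.0;
    for (double len : lengths) sum += std::exp(-mu * len);
    return sum / static_cast<double>(lengths.size());
}

// What you would get by resolving (averaging) the path lengths first.
double intensityOfMeanLength(const std::vector<double>& lengths, double mu) {
    if (lengths.empty()) return 1.0;
    double sum = 0.0;
    for (double len : lengths) sum += len;
    return std::exp(-mu * sum / static_cast<double>(lengths.size()));
}
```

At an object edge where half the samples miss (length 0) and half pass through 2 units with mu = 1, the first gives (1 + exp(-2)) / 2, roughly 0.57, while the second gives exp(-1), roughly 0.37. A standard multi-sample resolve of the path-length target would produce the second, darker value.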

This is very straightforward if I limit myself to regular grid sampling but I’m interested in experimenting with other sampling patterns.
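For reference, this is the kind of pattern generation I have in mind, as a host-side sketch. The names are hypothetical, and the rotated grid here is just the regular grid rotated about the pixel centre; it is not claimed to match any hardware MSAA pattern.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sub-pixel sample position relative to the pixel centre.
// The pixel covers [-0.5, 0.5] on both axes.
struct Offset {
    double x, y;
};

// n x n samples placed at the centres of an n x n grid of sub-pixel cells.
std::vector<Offset> regularGrid(int n) {
    std::vector<Offset> samples;
    samples.reserve(static_cast<std::size_t>(n) * n);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            samples.push_back({(i + 0.5) / n - 0.5, (j + 0.5) / n - 0.5});
    return samples;
}

// The same grid rotated about the pixel centre by angleRadians.
// Note: for larger n the corner samples can rotate outside the pixel,
// so the grid may need to be scaled down as well.
std::vector<Offset> rotatedGrid(int n, double angleRadians) {
    const double c = std::cos(angleRadians);
    const double s = std::sin(angleRadians);
    std::vector<Offset> samples = regularGrid(n);
    for (Offset& o : samples)
        o = {c * o.x - s * o.y, s * o.x + c * o.y};
    return samples;
}
```

With a regular grid the same result can be had by simply rendering at n times the resolution, which is why that case is easy; any other pattern needs per-sample positions inside the pixel.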

Would it be possible to perform the first pass using the “sample” interpolation modifier (https://msdn.microsoft.com/en-us/library/windows/desktop/bb509668(v=vs.85).aspx) and a multi-sample render target? Can I still use render target blending on a multi-sample render target?

What about the second pass - can I read the multi-sample render target from a CUDA kernel?

How do I control the number of multi-samples and the pattern (e.g. regular grid, rotated grid, etc.)?

Apparently Maxwell has programmable sampling patterns (http://www.geforce.com/whats-new/articles/multi-frame-sampled-anti-aliasing-delivers-better-performance-and-superior-image-quality) - where do I find documentation on how to program it?

Finally, I read that recent NVIDIA GPUs use a tile-based rendering approach. Would there be any way to leverage that to avoid the huge memory (and memory bandwidth) requirements of a high level of multi-sampling? Essentially I would like to somehow keep the results of the first pass (for the current tile) in shared memory or L1/L2 cache and then perform the second pass (for the current tile) and only actually output the results of the second pass to global memory. I could of course manually divide the image into tiles and render each one separately but I imagine this will result in a huge amount of redundant vertex shading unless I can somehow cull triangles that do not intersect the current tile.
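For the manual tiling fallback, the culling step I am picturing is a conservative screen-space bounding-box test, sketched below on the CPU. It assumes the triangles have already been projected to pixel coordinates, and all names are hypothetical. It has nothing to do with how the hardware tiler works internally.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Vec2 {
    float x, y;
};

// Triangle already projected to screen space (pixel coordinates).
struct Triangle {
    Vec2 a, b, c;
};

// For each tile (row-major), list the indices of triangles whose bounding
// box overlaps it. Conservative: a triangle may be listed for a tile it
// does not actually touch, but it is never missed.
std::vector<std::vector<int>> binTriangles(const std::vector<Triangle>& tris,
                                           int width, int height, int tileSize) {
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> bins(static_cast<std::size_t>(tilesX) * tilesY);

    for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
        const Triangle& t = tris[i];
        const float minX = std::min({t.a.x, t.b.x, t.c.x});
        const float maxX = std::max({t.a.x, t.b.x, t.c.x});
        const float minY = std::min({t.a.y, t.b.y, t.c.y});
        const float maxY = std::max({t.a.y, t.b.y, t.c.y});

        // Skip triangles whose bounding box is entirely off-screen.
        if (maxX < 0.0f || maxY < 0.0f || minX >= width || minY >= height)
            continue;

        // Clamp the box to the image, then convert to tile indices.
        const int tx0 = static_cast<int>(std::max(minX, 0.0f)) / tileSize;
        const int ty0 = static_cast<int>(std::max(minY, 0.0f)) / tileSize;
        const int tx1 = static_cast<int>(std::min(maxX, float(width - 1))) / tileSize;
        const int ty1 = static_cast<int>(std::min(maxY, float(height - 1))) / tileSize;

        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[static_cast<std::size_t>(ty) * tilesX + tx].push_back(i);
    }
    return bins;
}
```

Each tile would then be drawn with only its own index list, so vertex work scales with the number of triangle/tile overlaps rather than triangles times tiles. The binning itself still needs the projected vertices, so one transform pass over the whole mesh remains.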