Thanks for the continued replies, JerryChang,
It looks like we are dancing around the answer here a bit, but perhaps I have enough to move along with the GPIO issue. The function you referenced above is a kernel-space abstraction for use from user space; the underlying function __gpio_set_value() lives at the level where we want IO access. Either way, the function you referenced sets only one GPIO pin at a time. If we want to set up 8 bits of output and 3 bits of address, pulse a strobe, and read 8 bits of input, that requires 21 separate calls to this function. With direct memory access (DMA) to the GPIO registers, the same work can be done in about 4 accesses (roughly a 5x speedup), because we read and write entire DWORDs instead of doing a read, masking a single bit, and writing it back to the IO register for every pin.

As a general industry best practice for driver-like code, we want to occupy the CPU for as short an interval as possible, leaving as many cycles as possible on that core for other code.
Again, that may be enough for me to follow down the rabbit hole and see which source code actually touches which registers. That tentatively pauses my GPIO question, but I am still interested in the preempt_rt kernel patch support discussed earlier in this thread.
Thanks to Nvidia and you for the continued responses.