NPP Subtraction: using CUDA NPP for array subtraction

I’m attempting to use the CUDA NPP function nppiSub_8u_C1RSfs(), but I can’t find any documentation that explains the operand order of the subtraction. Is dest = (src1 - src2) or dest = (src2 - src1)?
Of course this would be simple to establish if I could get an example working.
I have coded up a simple example that allocates memory using cudaMalloc().
It then copies in a simple 3x3 array of data using cudaMemcpy2D().
I have verified the copies in both directions with cudaMemcpy2D(): the data that comes back from the device matches what went in.
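
For reference, the pitched-copy semantics can be modeled on the CPU. The sketch below is not CUDA code: copy2D and roundTrip3x3 are hypothetical helpers that mimic how cudaMemcpy2D() interprets its arguments, with the width and both pitches given in bytes. The padded pitch is an assumption for illustration only; with plain cudaMalloc() the device pitch is whatever row stride the caller chooses.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical CPU model of a pitched 2D copy (not the CUDA call itself):
// copies `height` rows of `widthBytes` bytes each; consecutive rows start
// `dstPitch` / `srcPitch` bytes apart in their respective buffers.
void copy2D(std::uint8_t* dst, std::size_t dstPitch,
            const std::uint8_t* src, std::size_t srcPitch,
            std::size_t widthBytes, std::size_t height)
{
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(dst + y * dstPitch, src + y * srcPitch, widthBytes);
    }
}

// Round trip: tightly packed 3x3 host array -> padded buffer -> packed array.
// `paddedPitch` is an assumed row stride in bytes and must be at least 3.
std::vector<std::uint8_t> roundTrip3x3(const std::vector<std::uint8_t>& host,
                                       std::size_t paddedPitch)
{
    const std::size_t w = 3, h = 3;
    std::vector<std::uint8_t> padded(paddedPitch * h, 0);
    std::vector<std::uint8_t> back(w * h, 0);
    copy2D(padded.data(), paddedPitch, host.data(), w, w, h);  // "upload"
    copy2D(back.data(), w, padded.data(), paddedPitch, w, h);  // "download"
    return back;
}
```

Calling roundTrip3x3(host, 64) with a nine-element host vector should hand back the same nine bytes. If the element type were wider than one byte, the width and pitch arguments would have to be scaled by the element size.
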
When I try to use the nppiSub_8u_C1RSfs() function, however, I don’t get the expected results. It’s as if only the first 3 values get subtracted while the others simply result in zero.

My src2 is a 3x3 array with all values set to 200.
My src1 is a 3x3 array with all values set to 73.
Upon return I get an array with values:

127 0 0
127 0 0
127 0 0
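
As a sanity check on what a full 3x3 result should look like, here is a minimal CPU-side sketch. It is not NPP code: refSub8u and subConst3x3 are hypothetical helpers, and the clamping to [0, 255] with a scale factor of 0 is an assumption about how the saturating 8u variant behaves. Under that assumption, 200 - 73 gives 127 in every cell while 73 - 200 clamps to 0 in every cell, so a 127 in the output would be consistent with dest = src2 - src1 for the cells that were actually processed.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference (not an NPP call): per-pixel a - b on 8-bit data,
// clamped to [0, 255]. Assumes a scale factor of 0, so no scaling is modeled.
// The step arguments are row strides in bytes.
void refSub8u(const std::uint8_t* a, int aStep,
              const std::uint8_t* b, int bStep,
              std::uint8_t* dst, int dstStep,
              int width, int height)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int diff = int(a[y * aStep + x]) - int(b[y * bStep + x]);
            diff = std::min(std::max(diff, 0), 255);  // saturate to 8 bits
            dst[y * dstStep + x] = static_cast<std::uint8_t>(diff);
        }
    }
}

// Fill two tightly packed 3x3 images with constants and return a - b.
std::vector<std::uint8_t> subConst3x3(std::uint8_t aVal, std::uint8_t bVal)
{
    const int w = 3, h = 3;
    std::vector<std::uint8_t> a(w * h, aVal), b(w * h, bVal), dst(w * h, 0);
    refSub8u(a.data(), w, b.data(), w, dst.data(), w, w, h);
    return dst;
}
```

Comparing subConst3x3(200, 73) and subConst3x3(73, 200) against the device output separates the operand-order question from the question of why only part of the array is being written.
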

I’d appreciate any insight into using this function. I’m running on an NVS 160M under Windows XP Pro with CUDA v3.2.