Zero-copy NPP: how to reuse a device value without copying back and forth to the CPU

Dear All,

Is it possible to reuse the result of one NPP function as an argument to another NPP function (without an extra cudaMemcpy to host memory)?
If not, what is the good practice for pipelining NPP kernels in a zero-copy manner?


nppiMean_StdDev_32f_C1R(d_input,X_SIZE * sizeof(Npp32f),total_npp,d_scratch,

nppiThreshold_LTVal_32f_C1IR(d_input,X_SIZE * sizeof(Npp32f),total_npp,
				*d_mean_f+*d_std_f, //BUS ERROR HERE

Yes, it's possible. Most NPP data is referenced by device pointers anyway. If you want to use a particular single output from one function as a particular single input to the next, just use the output pointer of one NPP function as the input to the next.

You cannot do what you are suggesting in your code, however. That is taking two device pointers (two different outputs from previous function(s)) and attempting to add what they point to in host code. That is not allowed. If you want to add scalar device quantities like that, you’ll have to use a library routine to do so (e.g. a cuBLAS axpy routine such as cublasDaxpy).

In short, you cannot do this:

*d_mean_f+*d_std_f

in host code, in CUDA. That is dereferencing a device pointer in host code, which is illegal.

I don’t happen to know what that particular parameter expects for that function. But you need to make sure if it is a scalar quantity you are passing it correctly, and if it is a pointer you are passing it correctly.
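
As an illustration of the legal alternative to dereferencing device pointers on the host, here is a minimal sketch that copies the two scalars to host memory first and adds them there. The helper name sumDeviceScalarsOnHost is hypothetical (it is not an NPP or CUDA function), and it uses double because Npp64f is a typedef for double.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not part of NPP or CUDA): copies two device-resident
// doubles to the host and writes their sum to *h_sum.
// d_mean and d_std must point to device memory; h_sum must point to host memory.
cudaError_t sumDeviceScalarsOnHost(const double* d_mean,
                                   const double* d_std,
                                   double* h_sum)
{
    double h_mean = 0.0;
    double h_std  = 0.0;

    // Each copy moves only 8 bytes, but it is still a blocking device-to-host transfer.
    cudaError_t err = cudaMemcpy(&h_mean, d_mean, sizeof(double),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(&h_std, d_std, sizeof(double), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) return err;

    // Both values now live in host memory, so adding them here is legal.
    *h_sum = h_mean + h_std;
    return cudaSuccess;
}
```

The resulting host value can then be cast to Npp32f and passed wherever a function expects a host scalar.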

Thanks for your help.

The point is that
nppiMean_StdDev_32f_C1R has Npp64f *d_mean_f, *d_std_f as output arguments (pointers to GPU memory),
while
nppiThreshold_LTVal_32f_C1IR has const Npp32f nThreshold as an input argument (a scalar in host memory, unfortunately).

I obviously can reuse the image data, but not the mean or the standard deviation, because the second function requires a scalar, not a pointer. Can you confirm my understanding? Is there any workaround?

In my use case, even a small data transfer from device to host is an issue… (I am evaluating high-throughput data transfer from 2D imaging devices, up to 2000 fps x 1024 x 512 pixels x 16 bits x 4 or 8 modules (X-ray facility), using RDMA.)

BTW, what is the status of the Thrust library from NVIDIA's perspective?


Yes, your understanding is correct. You would need to copy that value from device to host.
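
If even that small copy is a problem, one option outside NPP is a hand-written kernel that reads the mean and standard deviation directly from device memory. The sketch below is only an assumption about how that could look: the names thresholdLtValFromDevice and launchThresholdLtVal are hypothetical, and the kernel imitates the "replace pixels below the threshold with a value" behaviour rather than calling any NPP routine.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (not part of NPP): in-place "less than threshold -> value"
// on a 32-bit float image, with the threshold computed on the device as
// *d_mean + *d_std so that no scalar has to travel to the host.
__global__ void thresholdLtValFromDevice(float* img, int width, int height,
                                         size_t stepBytes,
                                         const double* d_mean,
                                         const double* d_std,
                                         float value)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Dereferencing device pointers is legal here because this is device code.
    float threshold = static_cast<float>(*d_mean + *d_std);

    // stepBytes is the row pitch in bytes, as with NPP step arguments.
    float* row = reinterpret_cast<float*>(
        reinterpret_cast<char*>(img) + static_cast<size_t>(y) * stepBytes);
    if (row[x] < threshold) row[x] = value;
}

// Hypothetical host-side launcher; all pointers are device pointers.
cudaError_t launchThresholdLtVal(float* d_img, int width, int height,
                                 size_t stepBytes,
                                 const double* d_mean, const double* d_std,
                                 float value, cudaStream_t stream)
{
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    thresholdLtValFromDevice<<<grid, block, 0, stream>>>(
        d_img, width, height, stepBytes, d_mean, d_std, value);
    return cudaGetLastError();  // reports launch errors only
}
```

Launched on the same stream as the statistics computation, such a kernel would run after the mean and standard deviation have been written, without any host round trip in between.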

The Thrust library is available with all CUDA toolkit versions between 4.0 and 10.0. If you’re asking for forward-looking statements, I wouldn’t be able to provide that.