nppiAbsDiff_8u_C3R incorrect for non-blocking npp stream

When I use a non-blocking stream with npp, I seem to get incorrect values using nppiAbsDiff_8u_C3R.
I attached a minimal example to reproduce the case below. Here’s what I observed:

  • When I use nppSetStream(0), I get the correct result
  • When I use CUDA_LAUNCH_BLOCKING=1 environment variable, I also get the correct result
  • When I use an async stream, I mostly get an incorrect result (sometimes correct).

This smells like a race condition. Additionally, when I investigate with nvvp, I can see my async stream and one other async stream (created by npp I guess), which runs a single kernel ForEachPixelNaive (Screenshot

My system:

// nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

// nppversion.h

Here is the output of the sample program:

Correct:   1
Incorrect: 99

Correct:   100

And this is the source code of my program:

#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <npp.h>
#include <nppi.h>
#include <stdlib.h>
#include <vector>

inline void check(int status) {
  if (status != 0) {
    std::cerr << "Return code: " << status << std::endl;

void test(bool async_stream) {
  NppiSize size{5 * 1024, 1024};
  const size_t step = size.width * 3;
  const size_t bytes = step * size.height;

  uint8_t *src1, *src2, *diff;
  cudaStream_t stream;

  if (async_stream) {
    check(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
  } else {
    stream = 0;

  std::vector<uint8_t> ones(bytes, 1);
  std::vector<uint8_t> twos(bytes, 2);
  std::vector<uint8_t> result(bytes);

  check(cudaMalloc(&src1, bytes));
  check(cudaMalloc(&src2, bytes));
  check(cudaMalloc(&diff, bytes));
  std::vector<int> counters{0, 0};

  for (int i = 0; i < 100; ++i) {
    check(cudaMemcpyAsync(src1,, bytes, cudaMemcpyHostToDevice, stream));
    check(cudaMemcpyAsync(src2,, bytes, cudaMemcpyHostToDevice, stream));
    check(cudaMemcpyAsync(diff,, bytes, cudaMemcpyHostToDevice, stream));

    check(nppiAbsDiff_8u_C3R(src1, step, src2, step, diff, step, size));

    check(cudaMemcpyAsync(, diff, bytes, cudaMemcpyDeviceToHost, stream));

    auto is_correct = std::equal(ones.begin(), ones.end(), result.begin());

  std::cout << "Correct:   " << counters[1] << "\nIncorrect: " << counters[0]
            << std::endl;

  if (async_stream) {

int main() {
  std::cout << "Async:\n";
  std::cout << "\nSync:\n";

I would suggest repeating your test with the latest CUDA 10.1

When I do so, the async also reports 100 correct.

If you read the first page of the npp documentation,

you will see that a bunch of refactoring for stream handling went into npp in the CUDA 10.1 release.

Thanks for the answer. I just tried with Cuda 10.1 and for me the test is still very flaky. I.e. I get a couple of 100/0 but still mostly 6/94 etc.

Here is how I build the executable in docker:

nvidia-docker run -v`pwd`:/host -ti nvidia/cuda:10.1-devel-ubuntu18.04 /bin/bash
$ nvcc -std=c++14 /host/ -lnppial -lnppc

You’re right. I didn’t run it enough times. My suggestion would be to file a bug. The instructions are linked at the top of the cuda programming forum in a sticky post.

In the meantime it seems you already have some ideas for workarounds.