cudaStreamSynchronize vs cudaDeviceSynchronize across threads

I have a C++ program with several threads that use CUDA.
On one thread I am doing custom point cloud stitching.
After a single stitch cycle, a correspondence vector whose x component is NaN marks an invalid correspondence, which is OK.
I then call cudaStreamSynchronize(nullptr) and dump some debug data to a file.
During the dump I check isnan(CorrespBuffer[i].x) and it is not NaN,
but then the code below prints the entry and checks it again:

	if (isnan(CorrespBuffer[i].x)) file_out << "Pre is Nan" << std::endl;
	file_out << i << " : [" << (i % CLOUDLETTE_W) << "," << (i / CLOUDLETTE_W) << "] " << CorrespBuffer[i].x << ", " << CorrespBuffer[i].y << ", " << CorrespBuffer[i].z << std::endl;
	if (isnan(CorrespBuffer[i].x)) file_out << "Post is Nan" << std::endl;

I never get "Pre is Nan", but occasionally get "Post is Nan".
This seems to indicate that CorrespBuffer[i].x changed from a valid number to NaN during the file output.
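
One way to confirm that something is still writing the buffer during the dump is to copy the element once, print only the copy, and then compare the copy bit-for-bit against the live buffer. This is only a minimal sketch: `Vec3`, `snapshot`, and `changed_since` are hypothetical names, and the element type of CorrespBuffer is assumed to be float3-like. The comparison is bitwise because NaN never compares equal to itself with `==`.

```cpp
#include <cmath>
#include <cstring>

// Hypothetical stand-in for the element type of CorrespBuffer
// (assumed to be three packed floats, like float3).
struct Vec3 { float x, y, z; };

// Read one element from the live buffer exactly once.
// volatile keeps the compiler from reusing an earlier cached read.
inline Vec3 snapshot(const volatile Vec3* buf, int i) {
    Vec3 s{buf[i].x, buf[i].y, buf[i].z};
    return s;
}

// Re-read the live element and compare it bit-for-bit with the snapshot.
// Returns true if anything (for example a still-running kernel) modified it.
inline bool changed_since(const Vec3& snap, const volatile Vec3* buf, int i) {
    Vec3 now = snapshot(buf, i);
    return std::memcmp(&snap, &now, sizeof(Vec3)) != 0;
}
```

In the dump loop this would look like `Vec3 s = snapshot(CorrespBuffer, i);`, then writing `s.x`, `s.y`, `s.z` to the file, then logging a warning if `changed_since(s, CorrespBuffer, i)` returns true. A hit means the buffer was written after the synchronize call returned, which points at work the synchronize did not wait for rather than at the file output itself.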

If I call cudaDeviceSynchronize() instead, the strangeness goes away.

I have carefully checked my code for other threads that change CorrespBuffer and there are none.

(no Pre is Nan)

27065 : [185,84] 0.139842, 0.0362788, 0.00396447
Post is Nan

Additional thoughts: the memory allocation (cudaHostAlloc) is done on the main thread, and the processing I am referring to is done much later on another thread (my StitchWorker thread).
Could this cause a synchronization issue with cudaStreamSynchronize?

More info: Interesting, I changed the code to ensure that the same C++ thread (on Windows 10, Visual C++) that uses the memory in CUDA also allocates it, and the bug goes away. CORRECTION: It does not go away.

Is this a bug in CUDA? Is it supposed to work this way? Can the same block of CUDA memory be shared between CUDA streams?

Further update: using cudaDeviceSynchronize at least makes the correspondence summing work. The failure is very intermittent.

So the slides at

Explain concurrency and also indicate the problem here. Error between keyboard and chair.

