Problem regarding data transfer overlap between multiple asynchronous streams

I am trying to use one stream to write data to global memory and another stream to load the data filled in by the first stream back to the host concurrently. The code as following:

__global__ void kernel(volatile int* acc)
	int tid = threadIdx.x;
	acc[tid] = tid;

int main()
	cudaStream_t streams[2];

	int* dev_acc;
	int* acc = new int[64];

	cudaMalloc(&dev_acc, sizeof(int) * 64);

	kernel << < 1, 64, 0, streams[0] >> > (dev_acc);

	for (int i = 0; i < 64; i++)
		int temp = 0;
		int temp_2 = 0;
		cudaMemcpyAsync(&temp, dev_acc + i, sizeof(int), cudaMemcpyDeviceToHost, streams[1]);
		cudaMemcpyAsync(&temp_2, dev_acc + i + 1, sizeof(int), cudaMemcpyDeviceToHost, streams[1]);
		if (i != 63) printf("acc[%d] = %d, acc[%d] = %d\n", i, temp, i+1, temp_2);
		else printf("acc[%d] = %d\n", i, temp);


	cudaMemcpy(acc, dev_acc, sizeof(int) * 64, cudaMemcpyDeviceToHost);
	for (int i = 0; i < 64; i++)
		printf("acc[%d] = %d\n", i, acc[i]);

	delete[] acc;

	return 0;

However, the output showed that all the data printed within the for loop at line 21 equals to 0. Yet the memory copy after stream synchronizing at line 36 gave me the correctly filled up data set, i.e. acc[0] = 0, acc[1] = 1, acc[n] = n;

The profiler timeline analysis:
where stream13 running kernel, 14 doing async memory copy, 15 is the default stream doing memory copy at line 17 and 34.

From the profile we can see that the kernel is finished before the async mem copy, thus data filling should be completed before the first time we copy the data back to host. Then why is all the data printed at line 28 equal to 0(the initialized value with cudamalloc)?

My guess is the writes done by stream 13 are still in caches instead of the global memory before the occurrence of first async memory copy. Is there a way to make sure the data goes to the device memory instead of staying in cache before executing next command in the device code?

Program is compiled using CUDA 7.5 runtime on Windows 10 with GTX980M.

Thanks in advance.

What outcome do you expect if one stream reads the same memory that another stream writes, without synchronization?
I wouldn’t give too much on the pretty profiler chart. Your program’s behavior is still undefined.

The so-called without synchronization is footless when the profiler rendered the fancy chart that actually reflects what happened down in registers and memory spaces. The problem of this post is confirmed to be caused by L2 cache and the solution to it is to launch another kernel as data loader to force cache flush.
The Nsight debugger, nvvp and mem-check standalone is once again proved more reliable than the actual CUDA API where lots of features are missed on Windows yet functional on Linux.

Not sure what you are referring to. You may want to clarify. It is certainly true that the Windows WDDM driver model imposes certain restrictions on software like CUDA that can make it difficult to expose maximum concurrency. If you switch to a GPU that supports the TCC driver on Windows you should see no significant differences to performance characteristics observed with Linux. Of course, simply using Linux instead of Windows is an equally attractive alternative (in my experience, actually more attractive).

Yes, simply using Linux for a non-TCC card is the to-go choice for exploring the power of CUDA. To clarify the missing features on Windows, one of them is the CudaAsyncmemcpy(). On various versions of Windows, the Async function simply does not work at all with a non-TCC card(Please correct me if I am wrong, for the god of love I do hope I am wrong since I have to migrate my previous project to Linux to make the Async work.).

Another example is the unified memory, it only works on Windows 7 and 8 with kepler and Maxwell arch in CUDA 6+. With the simplicity it provides with, the performance is much worse than a well-tuned ayncmemcpy system. ofc with CUDA 8 coming, unified memory is boosted with Pascal arch, but whether it supports Windows 10 remains unknown to me(would love to know about it if you have codes that can prove it works on Windwos 10).

Thank you for your response.

WDDM certainly complicates things. The non-obvious rule is that CUDA operations may only run concurrently if they are within the same WDDM command buffer. The command buffer is pushed to the device at non-obvious points, so this makes concurrency on WDDM considerably more difficult to achieve in a predictable manner, but it is not impossible. If you’re serious about high-performance coding with maximum control of concurrency, WDDM is not the best choice.

In the initial CUDA 6/7 version of UM, it has been repeatedly stated in a variety of places that it will not do better than an expert programmer.

This statement is not unique to windows, or windows WDDM.

UM on any platform is limited to Kepler and newer GPUs.

This statement is not unique to windows, or windows WDDM.

It’s correct that CUDA 8 demand-paged UM memory management (when it becomes available) is a linux only feature. It is not available on Windows WDDM OR Windows TCC.

Hi, txbob.

Thanks a lot for clarifying all of the facts for anyone who may concern about such non-obvious issues at the moment or in the future.
I collected this kinda information by experiments, GTC notes and run through the CUDA Samples. Will you recommend me some resources to learn more facts like this thoroughly? (books with in-depth thought or maybe these facts are actually laid in the release note or the CUDA documentation which I have not yet fully read through with full attention.)

Thanks again!

As txbob explained, cudaMemcopyAsync() as such is fully functional under Windows. However, it is iffy to get the desired concurrency when using WDDM. With TCC (and also the obsolete Windows XP driver model) it works exactly as one would naively expect.

As for unified memory, prior to the demand-paging introduced with CUDA 8 (yet to be released!), it uses the same hardware features and software approach as user-level software (essentially, software-initiated asynchronous copies across the PCIe interconnect), which means it is a convenience feature rather than a performance feature.

Book recommendations are always difficult because different people prefer different kind of books, and come with different levels of knowledge and general programming expertise. For CUDA programmers that are past the introductory stage, my standard recommendation has been “The CUDA Handbook” by Nicholas Wilt. Nick was a member of the team that originally created CUDA (as was I). I find his writing to be very clear, and virtually free of errors, including distracting typos [full disclosure: I reviewed a few sections of the book prior to publication]. There may be newer books covering the latest features in CUDA, but I am no familiar with them.

Personally, I would recommend learning by experimenting, using the extensive and free documentation by NVIDIA as the resource. Note, however, that NVIDIA is notoriously secretive about various aspects of their platform (both hardware and software), e.g. don’t expect details about the machine language used by the different GPU architectures to be forthcoming, or detailed microarchitectural descriptions. some people have unlocked some of that information through diligent reverse engineering.

I would recommend using Linux as the platform with the fewest gotchas. While I am an equal-opportunity Windows and Linux (Unix) developer, having spent about 50% of my career on either, I tend to build a 30% “Windows tax” into development schedules as developing on Windows makes everything just a tad more convoluted, confusing, and difficult (as you found out the hard way here). In terms of distros, I am personally partial to RHEL, and observe the most problems reported with Ubuntu, not just because many people use it but also because the Ubuntu folks seem to insist to have things “their own way” in some aspects.

Hi Njuffa, Thank you very much for your reply!