Problem regarding data transfer overlap between multiple asynchronous streams

Irokin · August 10, 2016, 11:48pm

I am trying to use one stream to write data to global memory and another stream to load the data filled in by the first stream back to the host concurrently. The code as following:

__global__ void kernel(volatile int* acc)
{
	int tid = threadIdx.x;
	
	acc[tid] = tid;
}

int main()
{
	cudaStream_t streams[2];
	cudaStreamCreate(&streams[0]); 
	cudaStreamCreate(&streams[1]);

	int* dev_acc;
	int* acc = new int[64];

	cudaMalloc(&dev_acc, sizeof(int) * 64);

	kernel << < 1, 64, 0, streams[0] >> > (dev_acc);

	for (int i = 0; i < 64; i++)
	{
		int temp = 0;
		int temp_2 = 0;
		cudaMemcpyAsync(&temp, dev_acc + i, sizeof(int), cudaMemcpyDeviceToHost, streams[1]);
		cudaMemcpyAsync(&temp_2, dev_acc + i + 1, sizeof(int), cudaMemcpyDeviceToHost, streams[1]);
		cudaStreamSynchronize(streams[1]);
		if (i != 63) printf("acc[%d] = %d, acc[%d] = %d\n", i, temp, i+1, temp_2);
		else printf("acc[%d] = %d\n", i, temp);
	}

	cudaStreamSynchronize(streams[0]);

	cudaMemcpy(acc, dev_acc, sizeof(int) * 64, cudaMemcpyDeviceToHost);
	
	for (int i = 0; i < 64; i++)
	{
		printf("acc[%d] = %d\n", i, acc[i]);
	}

	delete[] acc;
	cudaDeviceReset();

	return 0;
}

However, the output showed that all the data printed within the for loop at line 21 equals to 0. Yet the memory copy after stream synchronizing at line 36 gave me the correctly filled up data set, i.e. acc[0] = 0, acc[1] = 1, acc[n] = n;

The profiler timeline analysis:
https://drive.google.com/open?id=0B-0UQdS3Z6R-cHF4MlVqYkZXV2M
where stream13 running kernel, 14 doing async memory copy, 15 is the default stream doing memory copy at line 17 and 34.

From the profile we can see that the kernel is finished before the async mem copy, thus data filling should be completed before the first time we copy the data back to host. Then why is all the data printed at line 28 equal to 0(the initialized value with cudamalloc)?

My guess is the writes done by stream 13 are still in caches instead of the global memory before the occurrence of first async memory copy. Is there a way to make sure the data goes to the device memory instead of staying in cache before executing next command in the device code?

Program is compiled using CUDA 7.5 runtime on Windows 10 with GTX980M.

Thanks in advance.

tera · August 11, 2016, 1:35am

What outcome do you expect if one stream reads the same memory that another stream writes, without synchronization?
I wouldn’t give too much on the pretty profiler chart. Your program’s behavior is still undefined.

Irokin · September 6, 2016, 12:51am

The so-called without synchronization is footless when the profiler rendered the fancy chart that actually reflects what happened down in registers and memory spaces. The problem of this post is confirmed to be caused by L2 cache and the solution to it is to launch another kernel as data loader to force cache flush.
The Nsight debugger, nvvp and mem-check standalone is once again proved more reliable than the actual CUDA API where lots of features are missed on Windows yet functional on Linux.

njuffa · September 6, 2016, 2:32am

Not sure what you are referring to. You may want to clarify. It is certainly true that the Windows WDDM driver model imposes certain restrictions on software like CUDA that can make it difficult to expose maximum concurrency. If you switch to a GPU that supports the TCC driver on Windows you should see no significant differences to performance characteristics observed with Linux. Of course, simply using Linux instead of Windows is an equally attractive alternative (in my experience, actually more attractive).

Irokin · September 10, 2016, 11:57pm

Yes, simply using Linux for a non-TCC card is the to-go choice for exploring the power of CUDA. To clarify the missing features on Windows, one of them is the CudaAsyncmemcpy(). On various versions of Windows, the Async function simply does not work at all with a non-TCC card(Please correct me if I am wrong, for the god of love I do hope I am wrong since I have to migrate my previous project to Linux to make the Async work.).

Another example is the unified memory, it only works on Windows 7 and 8 with kepler and Maxwell arch in CUDA 6+. With the simplicity it provides with, the performance is much worse than a well-tuned ayncmemcpy system. ofc with CUDA 8 coming, unified memory is boosted with Pascal arch, but whether it supports Windows 10 remains unknown to me(would love to know about it if you have codes that can prove it works on Windwos 10).

Thank you for your response.

Robert_Crovella · September 11, 2016, 12:27am

WDDM certainly complicates things. The non-obvious rule is that CUDA operations may only run concurrently if they are within the same WDDM command buffer. The command buffer is pushed to the device at non-obvious points, so this makes concurrency on WDDM considerably more difficult to achieve in a predictable manner, but it is not impossible. If you’re serious about high-performance coding with maximum control of concurrency, WDDM is not the best choice.

In the initial CUDA 6/7 version of UM, it has been repeatedly stated in a variety of places that it will not do better than an expert programmer.

This statement is not unique to windows, or windows WDDM.

UM on any platform is limited to Kepler and newer GPUs.

This statement is not unique to windows, or windows WDDM.

It’s correct that CUDA 8 demand-paged UM memory management (when it becomes available) is a linux only feature. It is not available on Windows WDDM OR Windows TCC.

Irokin · September 11, 2016, 12:42am

Hi, txbob.

Thanks a lot for clarifying all of the facts for anyone who may concern about such non-obvious issues at the moment or in the future.
I collected this kinda information by experiments, GTC notes and run through the CUDA Samples. Will you recommend me some resources to learn more facts like this thoroughly? (books with in-depth thought or maybe these facts are actually laid in the release note or the CUDA documentation which I have not yet fully read through with full attention.)

Thanks again!

njuffa · September 11, 2016, 1:22am

As txbob explained, cudaMemcopyAsync() as such is fully functional under Windows. However, it is iffy to get the desired concurrency when using WDDM. With TCC (and also the obsolete Windows XP driver model) it works exactly as one would naively expect.

As for unified memory, prior to the demand-paging introduced with CUDA 8 (yet to be released!), it uses the same hardware features and software approach as user-level software (essentially, software-initiated asynchronous copies across the PCIe interconnect), which means it is a convenience feature rather than a performance feature.

Book recommendations are always difficult because different people prefer different kind of books, and come with different levels of knowledge and general programming expertise. For CUDA programmers that are past the introductory stage, my standard recommendation has been “The CUDA Handbook” by Nicholas Wilt. Nick was a member of the team that originally created CUDA (as was I). I find his writing to be very clear, and virtually free of errors, including distracting typos [full disclosure: I reviewed a few sections of the book prior to publication]. There may be newer books covering the latest features in CUDA, but I am no familiar with them.

Personally, I would recommend learning by experimenting, using the extensive and free documentation by NVIDIA as the resource. Note, however, that NVIDIA is notoriously secretive about various aspects of their platform (both hardware and software), e.g. don’t expect details about the machine language used by the different GPU architectures to be forthcoming, or detailed microarchitectural descriptions. some people have unlocked some of that information through diligent reverse engineering.

I would recommend using Linux as the platform with the fewest gotchas. While I am an equal-opportunity Windows and Linux (Unix) developer, having spent about 50% of my career on either, I tend to build a 30% “Windows tax” into development schedules as developing on Windows makes everything just a tad more convoluted, confusing, and difficult (as you found out the hard way here). In terms of distros, I am personally partial to RHEL, and observe the most problems reported with Ubuntu, not just because many people use it but also because the Ubuntu folks seem to insist to have things “their own way” in some aspects.

Irokin · September 11, 2016, 10:50am

Hi Njuffa, Thank you very much for your reply!

Topic		Replies	Views
cudaMemcpyAsync CUDA Programming and Performance	10	20735	October 16, 2015
Cannot get any stream parallelism. CUDA Programming and Performance	13	1280	December 31, 2019
How to Overlap Data Transfers in CUDA C/C++ Technical Blog	23	2227	January 18, 2023
Question on Stream, Connection and Performance CUDA Programming and Performance hw , cuda	6	1193	February 23, 2024
GPU Pro Tip: CUDA 7 Streams Simplify Concurrency Technical Blog	51	2107	February 5, 2020
Unexpected Synchronization Behavior in Windows vs. Linux for CUDA Async Operations with Multiple Streams CUDA Programming and Performance synchronization	6	22	March 13, 2025
streams vs. direct use of zero copy memory CUDA Programming and Performance	14	13126	March 30, 2011
Time intervals and non-concurrent in multi streaming CUDA Programming and Performance cuda	6	576	April 6, 2023
Some CUDA/GPU implementation related questions CUDA Programming and Performance	6	2259	May 30, 2009
Why doesn't overlapping data transfers and kernel execution work here? CUDA Programming and Performance	88	252	April 8, 2025

Problem regarding data transfer overlap between multiple asynchronous streams

Related topics