If I have a for loop invoking cudaMemcpyAsync where I always use the zero stream (the default stream), can I expect the data to be copied to the destination in parallel and asynchronously, and therefore see a speedup in my program? Or do I need to associate a distinct stream with each value of i to see a speedup? For example:

for(int i=0;i<100;i++){

src is pinned memory allocated using cudaHostAlloc


cudaMemcpy operations, issued in the same direction (i.e. host to device) will always serialize. The data will not be “copied in parallel”. This is due to the characteristics of the PCIE bus: only one outstanding operations can be transmitted at a time.

It’s not really clear what you are trying to accomplish. The usual reasons for use of the async API are for overlap:

kernel - kernel
memcpy - kernel
memcpy - memcpy (one is one direction, the other is in the other direction)
host - device

There are many nuances to get this correct. I would suggest that you start by reading the section on asynchronous concurrency in the programming guide.

I’m going to bump this because I’m doing something similar, except I’m trying to use cudaMemcpyAsync to pull back data from the GPU randomly. What I want to do is launch a kernel on the default null stream, and then create another stream to handle the async memory copies. I’m trying to get a kernel running that’s just constantly copying back data to the host until it completes.

For example:

cudaMallocHost(img); //fixed typo and moved before kernel launch

kernel<<<blocks, threads>>>(d_img);
while(...) {
   cudaMemcpyAsync(img, d_img, size, cudaMemcpyDeviceToHost, stream1);
cudaMemcpy(finalimg, d_img, size, cudaMemcpyDeviceToHost);


But when I do this, I get back an array of zeros in “img”, which it cannot be because the final processed image with real and complex values always comes out fine at the end when I do a regular cudaMemcpy.

Does this just not work in the way I am intending? I’m not even sure if the portion between the kernel and cudaMemcpy is happening asynchronously to the kernel computing, because if I add a “Sleep(5000)” right under kernel, it adds almost exactly that much time to the program run time. That doesn’t make sense to me because I thought kernel launches were asynchronous with respect to the host, and because the kernel takes a lot longer than 5 seconds to complete, I figured I wouldn’t see the increase of total run time at all.

Kernel launches are asynchronous with the host. Kernel launches are synchronous with respect to other activity in the same stream, i.e. all operations in the same stream execute in-order and consecutively. If you want to overlap of kernel execution and asynchronous copies, avoid the null-stream as it has special properties.

I am reasonably confident that all host calls to CUDA memory allocation and de-allocation API calls are completely synchronous with the GPU, i.e. they only take place when the GPU is idle (otherwise one might “pull out the rug” from underneath running kernels).

The logic of pulling data from the GPU at random times, while a kernel is running, escapes me. How do you make sure each data set pulled is consistent (e.g. from the same iteration of an algorithm running on the kernel)?

There is potentially a lot of ground to cover and a lot of things to unravel here. You might want to run things with the visual profiler to understand what is overlapping with what.

  1. In general I don’t recommend using the default null stream at all when you are trying to do multi-stream concurrency. If you read the programming guide, it explains why. So my suggestion would be to create a non-default stream for the kernel.

  2. There is no cudaHostMalloc. I assume you meant cudaHostAlloc. (there is also cudaMallocHost) A cudaHostAlloc issued after a kernel like this might not begin until the kernel has completed. The reason for this is that cudaHostAlloc can mess with the GPU memory map (depends on parameters of call you haven’t shown as well as whether UVA is active in your setup), and modification of the GPU memory map cannot occur while the kernel is running. Therefore I would expect this to be a blocking call, which might be inconsistent with your statements about adding the sleep() function. Apart from that, I would do the cudaHostAlloc prior to launching the kernel.

  3. Given the above, your kernel should have completed by the point at which your very first cudaMemcpyAsync begins. This means that the first call should retrieve whatever is in d_img. Again, inconsistent with your report.

  4. In a producer-consumer model like this, there is no guarantee that global memory holds what you think it does. This depends on the states of the caches and what your code looks like exactly. If you want global memory to definitely have data, there are various code mechanisms like volatile, __syncthreads(), __threadfence(), and atomics, which will affect visibility of data in global memory. Having said that, L2 cache should be a pretty good proxy, and your cudaMemcpyAsync function will hit in L2 before it goes to global.

So my suggestion would be a short, complete test case that demonstrates your behavior. And if this is on windows, WDDM can have all sorts of interfering effects on concurrency, so a description of your test setup would be in order also.

It’s doesn’t need to be consistent, but only to show “compute progress”. No matter when or what I pull out, I see should some kind of image coming that slowly “comes into focus”. The further the kernel gets, the more clear the image will be. I want to be able to do this without having to break at some point during the computation.

  1. I’ll take that into account.

  2. Sorry, I meant cudaMallocHost. I’ll fix that typo. And I’ve also moved it to before the kernel launch.

Visual Profiler just doesn’t want to work with this program, for reasons unknown, and it’s definitely not fun going without it. It’s some larger extremely embedded C program that I’ve now added some GPU acceleration into. But when I run it through Profiler, I see the program outputting in the console, with an error that comes out saying no cuda application was profiled (I’ve added cudaProfilerStart() and Stop() in the relevant portions of the code).

I’m going to work on this a little bit more and see if I can get anywhere. I’ll probably just start a new CUDA project and try to make something similar where profiler should at least work.

“Visual Profiler just doesn’t want to work with this program”

your kernel/ code sounds persistent, or semi-persistent

i am wondering whether the profiler can indeed profile persistent code - it seems to hate kernels that hardly terminate, as it then struggles to ‘wrap’ the kernel

perhaps txbob can elaborate on this

The profiler cannot profile a kernel that never terminates. You may be able to work around this in some long-running scenarios with profiler start/stop controls, but a kernel that never terminates cannot be profiled.

The profiler also has a variety of limits in what it can successfully profile. Many of these limits arise from the underlying counters used to keep track of statistics. The underlying counters have finite widths, and when an overflow/rollover condition occurs, profiling will be impacted. The usual suggestion is to trim down your kernel in some way:

  • reduce the run time
  • reduce the number of blocks launched
  • reduce the number of kernel invocations


I’m just thinking it may not be possible to do what I want.

Here’s some code I quickly wrote up demonstrating what I am trying to accomplish: I want to launch a kernel on stream0, and then continuously do some cudaMemcpyAsync on stream1 to work on the data on the host side. The problem is that if I do not call cudaStreamSychronize() directly after the D2H async copy, I get an array of zeros on host. And if I do call cudaStreamSychronize(), it completely blocks the host, and in a really awkward way!

Case 1:
Without cudaStreamSychronize()
Result: D2H overlaps as it should, but I cannot work with the data as I get zeros - notice the sums in the lower left in the image below. I’m assuming this is because the D2H doesn’t finish before I try working on it.

With cudaStreamSychronize()
Result: Completely blocked, and somehow the D2H copies are happening before the kernel launch, even though the kernel launch comes first in the code? What is happening???

The code is below. This should just compile and run as is on Windows/VS2013/CUDA 7.5RC.

#include <stdio.h>
#include <cstdlib>

#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cuda_profiler_api.h>

#define NPS 512

#define NX 4096
#define NY 4096
#define BLOCKSIZEX 16
#define BLOCKSIZEY 16

__global__ void fKernel(float *arrout) {
	__shared__ float var[BLOCKSIZEX][BLOCKSIZEY];

	//Global coordinates and index
	size_t i = (blockIdx.x * blockDim.x) + threadIdx.x;
	size_t j = (blockIdx.y * blockDim.y) + threadIdx.y;
	size_t index = j * NX + i;

	if (index < NX * NY) {

		//Local tile coordinates
		size_t idx = threadIdx.x;
		size_t idy = threadIdx.y;

		//Copy to shared memory
		var[idx][idy] = arrout[index];

		//Work on the shared memory
		for (int i = 0; i < NPS; i++)
			var[idx][idy] *= 1.01;

		//Copy back to the global memory
		arrout[index] = var[idx][idy];

float sum(float *arr, int len)
	float sum = 0;
	for (int i = 0; i < len; i++) {
		sum += abs(arr[i]);
	return sum;

int main()
	dim3 numBlocks(NX / numThreads.x, NY / numThreads.y);

	float *h_outVector = (float *)malloc(NX * NY * sizeof(float));
	for (int i = 0; i < NX * NY; i++)
		h_outVector[i] = (float)i;

	float *d_outVector;
	cudaMalloc(&d_outVector, NX * NY*sizeof(float));
	cudaMemcpy(d_outVector, h_outVector, NX * NY *
		sizeof(float), cudaMemcpyHostToDevice);

	float *res00, *res01, *res02;
	cudaMallocHost(&res00, NX * NY * sizeof(float));
	cudaMallocHost(&res01, NX * NY * sizeof(float));
	cudaMallocHost(&res02, NX * NY * sizeof(float));

	cudaStream_t stream00;
	fKernel << <numBlocks, numThreads, 0, stream00 >> >(d_outVector);

	cudaStream_t stream01;

	size_t asyncCounter = 0;
	double sum00 = 5, sum01 = 3, sum02 = 0;

        /*The reason for this loop is to break when two subsequent copies are identical
        which should only be possible after kernel has completed. But I cannot test that 
        yet since my overlapped transfers always return array of zeros */
	while (asyncCounter < 6) {
		if (asyncCounter % 2 == 0) {
			cudaMemcpyAsync(res00, d_outVector, NX * NY *
				sizeof(float), cudaMemcpyDeviceToHost, stream01);
			cudaStreamSynchronize(stream01); //Result is array of zeros without this
			sum00 = sum(res00, NX * NY);
			printf("sum00: %.2f\n", sum00);
		else {
			cudaMemcpyAsync(res01, d_outVector, NX * NY *
				sizeof(float), cudaMemcpyDeviceToHost, stream01);
			cudaStreamSynchronize(stream01); //Result is array of zeros without this
			sum01 = sum(res01, NX * NY);
			printf("sum01: %.2f\n", sum01);
		asyncCounter += 1;

	cudaMemcpy(res00, d_outVector, NX * NY *
		sizeof(float), cudaMemcpyDeviceToHost);

	sum02 = sum(res00, NX * NY);
	printf("Final sum02: %.2f\n", sum02);


	cudaFreeHost(res00); cudaFreeHost(res01); cudaFreeHost(res02);

    return 0;

Here is what I get, running your code, as-is, on a GT640, on linux:

CUDA_VISIBLE_DEVICES="1" ./t948 sum00: 146610319261696.00 sum01: 156922049200128.00 sum00: 194720093437952.00 sum01: 281474959933440.00 sum00: 281474976710656.00 sum01: 329981733371904.00 Final sum02: 22464500024213504.00

So it’s perhaps closer to what you were expecting (I think.)

The problem is that independent streams are asynchronous to each other, so they can have any timing relationship at all, with respect to each other, including the behavior you’ve demonstrated, where a CUDA kernel, issued before a cudaMemcpyAsync operation, executes after the cudaMemcpyAsync operation. Windows isn’t helping you here, but the problem is not windows, the problem is that you are assuming a particular kind of synchronization should occur, which your application does not enforce.

Try taking a look at this example instead (i.e. the code in the answer that provides a worked code), which does something similar:

The use of the “mailboxes” there forces the host-issued cudaMemcpyAsync operations to “wait” until the data is ready to be consumed.

As an aside, I would create all my streams up front, in an application like this. Avoid creating streams, doing cudaMalloc, initialiizing libraries, doing cudaHostAlloc or other operations like these in time-critical loops or in areas where you are trying to achieve a particular concurrency.

And not calling cudaStreamSynchronize,after a cudaMemcpyAsync operation, when you intend to “consume” the data on the host, is completely broken. You’re getting all zeroes because without that barrier, your sum function is using data that the cudaMemcpyAsync operation has not even written to yet, because that operation has not even run yet. The cudaStreamSynchronize forces the operation to complete before the sum routine tries to use the data.

That’s the behavior I was looking to see (increasing random values every time after an async memory copy is made, because that is what the kernel is doing). I’ll take a look at the link early next week and see what I can do, as I’d like the code to work multi platform. I really appreciate the time and help, txbob!