Is thrust::copy synchronous or asynchronous?

Hi All,
According to my tests, which are detailed below, it seems that thrust::copy is asynchronous with CUDA 7.0.
The same tests indicate that with CUDA 6.5 thrust::copy is synchronous.
Is that correct?

Which thrust functions are asynchronous (transform, copy_if, reduce_by_key, etc.)?
Here’s the test procedure:

  1. Fill a device buffer with 0s and a host buffer with 1s.
  2. Copy from device to host buffer using thrust::copy.
  3. Count host elements which are !=0 before and after cudaDeviceSynchronize().
  4. Print result to screen.
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/sort.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/fill.h>
#include <thrust/execution_policy.h>
#include <iostream>

// test modified values after thrust::copy before and after cudaDeviceSynchronize 

int main(int argc, char **argv)
{
	unsigned int length = 99999;
	unsigned int* d_inVal;

	// allocate device buffer, fill with 0s:
	if(cudaSuccess != cudaMalloc((void **) &d_inVal, length * sizeof(unsigned int))){
		std::cout << "cudaMalloc err\n";
		return 1;
	}
	cudaMemset(d_inVal, 0, length * sizeof(unsigned int));

	// allocate pinned host buffer, fill with 1s:
	unsigned int* hostBuf;
	cudaHostAlloc((void**)&hostBuf, length * sizeof(unsigned int), cudaHostAllocPortable);
	for(unsigned int idx = 0; idx < length; idx++){
		hostBuf[idx] = 1;
	}

	// copy to host buffer:
	thrust::device_ptr<unsigned int> d_inVPtr(d_inVal);
	thrust::copy(d_inVPtr, d_inVPtr + length, hostBuf);

	// if the copy were synchronous, the buffer would already be all 0s:
	unsigned int cnt = 0;
	for(int idx = length - 1; idx >= 0; idx--){
		if(hostBuf[idx] != 0){
			cnt++;
		}
	}
	std::cout << "mismatches before sync = " << cnt << "\n";

	// now sync and test again:
	cudaDeviceSynchronize();
	cnt = 0;
	for(unsigned int idx = 0; idx < length; idx++){
		if(hostBuf[idx] != 0){
			cnt++;
		}
	}
	std::cout << "mismatches after sync = " << cnt << "\n";

	return 0;
}

nvcc -arch=compute_52 -o run <source>.cu



Results with CUDA 7.0:

mismatches before sync = 7879
mismatches after sync = 0

Results with CUDA 6.5:

mismatches before sync = 0
mismatches after sync = 0

Would be grateful to get your advice about this.
Ronen Halevy.

Yes, that particular copy call is being done by a cudaMemcpyAsync, which means it can be asynchronous with respect to the host code that follows. Thrust has been migrating in the direction of supporting streams and full concurrency in the CUDA backend for several releases now, so it’s not surprising the behavior in this respect has changed.

I don’t know that it is documented which thrust calls are asynchronous, and under what circumstances. However, thrust is an open-source template library, so this is discoverable in principle. Given the above, the safe course is to be aware of the potential for concurrency in general, and to assume that calls may be asynchronous.

In general, thrust operations may be asynchronous. Normally, using ordinary (pageable) allocations instead of pinned allocations, and not specifying any stream execution policy, is enough to ensure synchronous behavior. There may be exceptions to this, however, such as thrust calls that return a scalar result to the host (a reduction sum, say, or an iterator).
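To illustrate the scalar-return case: a call like thrust::reduce cannot return until the reduced value is actually available on the host, so it is necessarily synchronous from the host code's point of view. A minimal sketch (requires an NVIDIA GPU to run):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <iostream>

int main()
{
	// 1000 elements, all set to 1, on the device
	thrust::device_vector<int> d(1000, 1);

	// reduce must deliver the sum to the host before it can return,
	// so by the time 'sum' is assigned, the GPU work has completed.
	int sum = thrust::reduce(d.begin(), d.end());

	std::cout << "sum = " << sum << "\n"; // prints 1000
	return 0;
}
```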

In this case, you’ve facilitated the asynchronous host/device activity by using cudaHostAlloc: a copy into pinned host memory can be issued as a cudaMemcpyAsync. The safe assumption with thrust is that its behavior in general is asynchronous with respect to host code.
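One way to make the synchronization point explicit in your test is to issue the copy on a stream you own and synchronize on that stream before reading the host buffer. A sketch, assuming CUDA 7.0 / thrust 1.8 or newer for the thrust::cuda::par.on() execution policy:

```cuda
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>

int main()
{
	const unsigned int length = 99999;
	unsigned int *d_buf, *h_buf;
	cudaMalloc((void**)&d_buf, length * sizeof(unsigned int));
	cudaMemset(d_buf, 0, length * sizeof(unsigned int));
	cudaHostAlloc((void**)&h_buf, length * sizeof(unsigned int), cudaHostAllocPortable);

	cudaStream_t stream;
	cudaStreamCreate(&stream);

	// issue the copy on an explicit stream
	thrust::device_ptr<unsigned int> d_ptr(d_buf);
	thrust::copy(thrust::cuda::par.on(stream), d_ptr, d_ptr + length, h_buf);

	// block until all work on that stream has completed;
	// only after this is h_buf safe to read from the host
	cudaStreamSynchronize(stream);

	cudaStreamDestroy(stream);
	cudaFreeHost(h_buf);
	cudaFree(d_buf);
	return 0;
}
```

Whether or not the copy is issued asynchronously, the explicit cudaStreamSynchronize makes the code correct under either behavior, so it is robust across CUDA 6.5 and 7.0.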

You can get a little bit of insight here by profiling your code with

nvprof --print-gpu-trace …


nvprof --print-api-trace …

and I think you’ll witness, and then be able to explain for yourself, the difference between CUDA 6.5 and CUDA 7.0.

Note that there is a thrust google group discussion list, called thrust-users, which you can access here: https://groups.google.com/forum/#!forum/thrust-users

You might want to sign up for this list. If you wanted to suggest changes in behavior or new features for thrust, that is a good place to do it.

Thanks, txbob.