Concurrent memcpy Performance RTX A2000 vs. Tesla T4


After searching this topic quite a bit without finding a solution, I'm making my initial post here in hopes of gaining some insight and, possibly, an answer.

We are using Windows 10 21H2 and are currently on CUDA 10.1; however, the issue we see also occurs on 11.6. The GPU is not being used to drive any displays, i.e., nothing is connected to any of its ports. EDIT: we are using the latest display driver as of this post, 512.59.

We have been using a Tesla T4 and are now testing a more recent Ampere card, an RTX A2000. Hardware-accelerated GPU scheduling has been enabled in the Windows settings. While performing concurrent memcpys (cudaMemcpyAsync) with a subsequent cudaStreamSynchronize within our application, we see delays of 20+ ms on the A2000. These do not happen at all on the T4.
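For context, the pattern in question looks roughly like the following sketch (function and buffer names are illustrative, not our production code):

```cpp
// Rough sketch of our per-thread pattern: an async device-to-host copy on a
// dedicated stream, followed by a synchronize on that stream.
#include <cstdio>
#include <cuda_runtime.h>

void CopyAndWait(unsigned char* hostBuffer, const unsigned char* deviceBuffer,
                 size_t bytes, cudaStream_t stream)
{
	if (cudaMemcpyAsync(hostBuffer, deviceBuffer, bytes,
	                    cudaMemcpyDeviceToHost, stream) != cudaSuccess)
		printf("cudaMemcpyAsync failed\n");

	// This is the call that occasionally stalls for 20+ ms on the A2000.
	if (cudaStreamSynchronize(stream) != cudaSuccess)
		printf("cudaStreamSynchronize failed\n");
}
```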

Our code generation in the VS project is set to: compute_61,sm_61;compute_70,sm_70;compute_75,sm_75

I was able to recreate this with a sample application. It differs from our typical operation: our application uses streams across multiple threads, while the sample below only uses cudaMemcpy(), though the sizes of the memory being copied are the same as ours. It is only meant to highlight what I believe is the same behavior:

#include <cstdio>
#include <thread>
#include <windows.h>
#include <cuda_runtime.h>

class Stopwatch
{
public:
	Stopwatch();

	void Start();
	void Stop();

	long long Current() const { return _delta; }
	long long Milliseconds();

private:
	LARGE_INTEGER _startTick;
	long long _delta;
	double _period;
};

Stopwatch::Stopwatch()
	: _delta(0)
{
	// Cache the tick period once; QueryPerformanceFrequency is fixed at boot.
	LARGE_INTEGER freq;
	QueryPerformanceFrequency(&freq);
	_period = 1.0 / (double)freq.QuadPart;
	_startTick.QuadPart = 0;
	Start();
}

void Stopwatch::Start()
{
	_delta = 0;
	QueryPerformanceCounter(&_startTick);
}

void Stopwatch::Stop()
{
	LARGE_INTEGER now;
	QueryPerformanceCounter(&now);
	_delta = now.QuadPart - _startTick.QuadPart;
}

long long Stopwatch::Milliseconds()
{
	Stop();
	return static_cast<long long>(_delta * _period * 1000.0);
}
const int HOST_SIZE = 4096;
const int DEVICE_SIZE = 12256;

void RunTest(const char* id)
{
	unsigned char* hostBuffer = nullptr;
	unsigned char* deviceBuffer = nullptr;

	if (cudaMallocHost((void**)&hostBuffer, HOST_SIZE) != cudaSuccess)
		printf("Unable to malloc host buffer!\n");
	if (cudaMalloc((void**)&deviceBuffer, DEVICE_SIZE) != cudaSuccess)
		printf("Unable to malloc device buffer!\n");

	Stopwatch swTotal;

	long long minTime = 9999999;
	long long maxTime = 0;

	printf("cudaMemcpy test started: %s\n", id);

	int failures = 0;
	for (int i = 0; i < 100000 && failures < 10; i++)
	{
		Stopwatch thisTime;
		if (cudaMemcpy(hostBuffer, deviceBuffer, HOST_SIZE, cudaMemcpyDeviceToHost) != cudaSuccess)
		{
			printf("cudaMemcpy failed\n");
			failures++;
		}

		const auto ms = thisTime.Milliseconds();
		if (ms < minTime)
			minTime = ms;
		if (ms > maxTime)
			maxTime = ms;

		if (ms > 5)
			printf("cudaMemcpy took %lldms\n", ms);
	}

	const auto total = swTotal.Milliseconds();

	printf("cudaMemcpy test complete (%s), min: %lldms; max: %lldms; total: %lldms\n", id, minTime, maxTime, total);

	cudaFree(deviceBuffer);
	cudaFreeHost(hostBuffer);
}

int main(int argc, char** argv)
{
	if (cudaSetDevice(0) != cudaSuccess)
	{
		printf("Unable to set device!\n");
		return -1;
	}

	cudaDeviceProp devProp{};
	if (cudaSuccess != cudaGetDeviceProperties(&devProp, 0))
	{
		printf("Unable to read properties!\n");
		return -1;
	}

	printf("Device properties major: %d, minor: %d, name: %s\n", devProp.major, devProp.minor, devProp.name);

	// Single-threaded baseline first, then two concurrent copy loops.
	RunTest("sync");

	std::thread t1([&] { RunTest("1/2"); });
	std::thread t2([&] { RunTest("2/2"); });

	t1.join();
	t2.join();

	return 0;
}

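As an aside, the QueryPerformanceCounter timing in the sample is Windows-specific; a portable equivalent using std::chrono (my naming, not from our project) would look like:

```cpp
#include <chrono>

// Portable alternative to the QueryPerformanceCounter-based Stopwatch,
// using std::chrono::steady_clock; mirrors the same interface.
class ChronoStopwatch
{
public:
	ChronoStopwatch() { Start(); }

	void Start()
	{
		_start = std::chrono::steady_clock::now();
		_delta = {};
	}

	void Stop()
	{
		_delta = std::chrono::steady_clock::now() - _start;
	}

	long long Milliseconds()
	{
		Stop();
		return std::chrono::duration_cast<std::chrono::milliseconds>(_delta).count();
	}

private:
	std::chrono::steady_clock::time_point _start;
	std::chrono::steady_clock::duration _delta{};
};
```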

Here is the output for each GPU type:
Device properties major: 7, minor: 5, name: Tesla T4
cudaMemcpy test started: sync
cudaMemcpy test complete (sync), min: 0ms; max: 1ms; total: 666ms
cudaMemcpy test started: 1/2
cudaMemcpy test started: 2/2
cudaMemcpy test complete (1/2), min: 0ms; max: 2ms; total: 1392ms
cudaMemcpy test complete (2/2), min: 0ms; max: 2ms; total: 1472ms

Device properties major: 8, minor: 6, name: NVIDIA RTX A2000
cudaMemcpy test started: sync
cudaMemcpy test complete (sync), min: 0ms; max: 0ms; total: 3481ms
cudaMemcpy test started: 1/2
cudaMemcpy test started: 2/2
cudaMemcpy took 76ms
cudaMemcpy took 14ms
cudaMemcpy took 14ms
cudaMemcpy took 166ms
cudaMemcpy took 149ms
cudaMemcpy took 143ms
cudaMemcpy took 36ms
cudaMemcpy took 25ms
cudaMemcpy took 63ms
cudaMemcpy took 165ms
cudaMemcpy took 75ms
cudaMemcpy took 111ms
cudaMemcpy took 7ms
cudaMemcpy took 117ms
cudaMemcpy took 59ms
cudaMemcpy took 10ms
cudaMemcpy took 2003ms
cudaMemcpy took 171ms
cudaMemcpy took 14ms
cudaMemcpy took 369ms
cudaMemcpy took 34ms
cudaMemcpy took 62ms
cudaMemcpy took 33ms
cudaMemcpy took 431ms
cudaMemcpy took 28ms
cudaMemcpy took 520ms
cudaMemcpy took 48ms
cudaMemcpy took 246ms
cudaMemcpy test complete (1/2), min: 0ms; max: 2003ms; total: 8148ms
cudaMemcpy took 895ms
cudaMemcpy test complete (2/2), min: 0ms; max: 895ms; total: 9406ms

What might be going on with the A2000, and why does it behave so differently from the T4? What can we do to avoid this?

Even if the copy takes a couple of milliseconds, that’s fine for our needs, but 20+ms seems a bit perplexing.

Thank you!

After abandoning further code changes to diagnose this, I stumbled upon the nvidia-smi utility. It turns out the Tesla T4 defaults to TCC mode, whereas the RTX A2000 defaults to WDDM mode. Switching the A2000 to TCC had an immediate impact.

I switched the A2000 to TCC mode (see: Tesla Compute Cluster (TCC)) by issuing:
nvidia-smi -g 0 -dm 1
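For anyone doing the same, the current driver model can be checked first; to the best of my recollection the query fields below exist on Windows builds of nvidia-smi (verify against nvidia-smi --help-query-gpu):

```shell
# Show each GPU's current and pending driver model (WDDM or TCC)
nvidia-smi --query-gpu=name,driver_model.current,driver_model.pending --format=csv

# Switch GPU 0 to TCC (1 = TCC, 0 = WDDM); needs admin rights and a reboot
nvidia-smi -g 0 -dm 1
```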

Maybe this will be of use for someone in the future.
