Handful of Slow Memory Transfers

AlexP312 · June 16, 2016, 4:26pm

I’m trying to figure out why a number of cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost calls are significantly slower than the rest of my transfers. I have a program that copies a matrix to the gpu, finds the inverse of each element, and then copies it back which then repeats this process on the cpu. It does this 1000 times, printing out the time taken for each matrix. For the most part, all of the memory transfers max out my 970M’s bandwidth except for around 10 percent of them.

Here’s the code

#include <iostream>
#include <numeric>
#include <stdlib.h>

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

/**
 * CUDA kernel that computes reciprocal values for a given vector
 */
__global__ void reciprocalKernel(float *data, unsigned vectorSize) {
	unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
	if (idx < vectorSize)
		data[idx] = 1.0/data[idx];
}

/**
 * Host function that copies the data and launches the work on GPU
 */
float *gpuReciprocal(float *data, unsigned size)
{
	float *rc = new float;
	float *gpuData;

	CUDA_CHECK_RETURN(cudaMalloc((void **)&gpuData, sizeof(float)*size));
	CUDA_CHECK_RETURN(cudaMemcpy(gpuData, data, sizeof(float)*size, cudaMemcpyHostToDevice));
	
	static const int BLOCK_SIZE = 256;
	const int blockCount = (size+BLOCK_SIZE-1)/BLOCK_SIZE;
	reciprocalKernel<<<blockCount, BLOCK_SIZE>>> (gpuData, size);

	CUDA_CHECK_RETURN(cudaMemcpy(rc, gpuData, sizeof(float)*size, cudaMemcpyDeviceToHost));
	CUDA_CHECK_RETURN(cudaFree(gpuData));
	return rc;
}

float *cpuReciprocal(float *data, unsigned size)
{
	float *rc = new float;
	for (unsigned cnt = 0; cnt < size; ++cnt) rc[cnt] = 1.0/data[cnt];
	return rc;
}

void initialize(float *data, unsigned size)
{
	for (unsigned i = 0; i < size; ++i)
		data[i] = .5*(i+1);
}

int main(void)
{
	static const int WORK_SIZE = 65530;
	float *data = new float[WORK_SIZE];

	float time;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	for (int i = 0; i < 1000; i++)
	{
		cudaEventRecord(start, 0);

		initialize (data, WORK_SIZE);

		float *recCpu = cpuReciprocal(data, WORK_SIZE);
		float *recGpu = gpuReciprocal(data, WORK_SIZE);
		float cpuSum = std::accumulate (recCpu, recCpu+WORK_SIZE, 0.0);
		float gpuSum = std::accumulate (recGpu, recGpu+WORK_SIZE, 0.0);

		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);
		cudaEventElapsedTime(&time, start, stop);

		/* Verify the results */
		std::cout << i << ") gpuSum = "<<gpuSum<< " cpuSum = " <<cpuSum<< " time = " <<time<< std::endl;

		delete[] recCpu;
		delete[] recGpu;
	}

	/* Free memory */
	delete[] data;

	return 0;
}

/**
 * Check the return value of the CUDA runtime API call and exit
 * the application if the call has failed.
 */
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
	if (err == cudaSuccess)
		return;
	std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
	exit (1);
}

Has anyone experienced this before? I’ve uploaded some screenshots (Sli Mag) of the data I’ve collected. The first two graphs show data from a 970, then a 960 and then a 745. What’s interesting is this curve is apparent on all the 900 series GPU’s I’ve tested, but not in the 745.

njuffa · June 16, 2016, 6:30pm

Does the behavior change when you pin this CUDA app to a particular CPU core for the entire run? Are there any other processes running concurrently with this CUDA app? Is this a dual CPU socket platform?

Robert_Crovella · June 17, 2016, 2:25am

970M so that GPU may be servicing a display. The display servicing may be periodically sucking up some of your available memory bandwidth. And it’s not clear from your timing measurement that you could even narrow this down to cudaMemcpy operations. It’s also not clear to me how to interpret your output data - what are the x axis categories from 1-12?

AlexP312 · June 17, 2016, 1:20pm

Hi njuffa, thanks for the response.

Is there a way to make the profiler use a single core? I’m using a Intel® Core™ i7-4700EQ

AlexP312 · June 17, 2016, 2:20pm

The integrated graphics adapter is driving the display. x-axis is transfer speed in GB/s.

njuffa · June 17, 2016, 3:25pm

On Windows, SetProcessAffinityMask() allows an application to lock to a specific logical CPU, and you can use this to pin a CUDA app.

I would also suggest alternative measurements of transfer rates without the profiler, under the theory that any profiler on an platform tends to be intrusive and usually has negative impact on performance (there may be periodic bursts of activity due to the profiler itself, such as writing buffers to logs).

As txbob has pointed out, as long as he GPU is also driving the display, there may be interference with the CUDA copy operations on the GPU side. My questions were more focused on interference on the CPU side: transfers between system and device memory may be slowed down or delayed due to other system memory activity, or because a process may be schedule to either the “near” or the “far” CPU in a dual socket system. These are a very common sources of variability in CPU/GPU transfer performance. Since you are running Windows with the WDDM driver model (since the GPU is driving the display), memory management activity initiated by the operating system could also impact the transfers.

AlexP312 · June 17, 2016, 6:12pm

I’m actually using Ubuntu 14. I thought the profiler might be causing the issue but I’ve tried running both on my other linux machines and they didn’t have this problem.

njuffa · June 17, 2016, 8:20pm

In that case, disregard my comments with regard to WDDM. Unfortunately this information about different Linux machines is inconclusive.

My long-time observation is that Ubuntu has a higher incidence of “weird behavior” not seen with any other common Linux distro, so maybe the problem is specific to Ubuntu. However, the issue could also be specific to a hardware platform; do your other Linux boxes use the exact same hardware? Or the machines could be configured or used differently in ways we don’t know at this stage.

Since only one of your machine exhibits the objectionable behavior, you might want to debug this through one-by-one elimination of the differences between the “bad” and the “good” systems.

Topic		Replies	Views
950M faster than a 970M CUDA Programming and Performance	4	812	May 12, 2016
Slow memory transfers CUDA Programming and Performance	7	1997	May 23, 2011
DMA transfers in parallel 2-way SLI with 2 GTX 280 CUDA Programming and Performance	8	3694	March 16, 2009
Memory copy very slow memory copy, image CUDA Programming and Performance	10	12508	April 7, 2011
Problem regarding data transfer overlap between multiple asynchronous streams CUDA Programming and Performance	8	799	September 11, 2016
massive hiccups when transferring flags back to host challenging the commonly quoted 2-10us latency CUDA Programming and Performance	14	38117	February 1, 2011
Data transfers are slower when overlapped than when running sequentially CUDA Programming and Performance	9	1421	September 29, 2021
A few questions on CUDA performance with pictures! CUDA Programming and Performance	6	3349	January 10, 2009
Really strange memcpy time in matrixMul at SDK CUDA Programming and Performance	6	5103	July 9, 2009
cudaMemcpy latency unusually high on some machines CUDA Programming and Performance	9	147	November 11, 2024

Handful of Slow Memory Transfers

Related topics