Handful of Slow Memory Transfers

I’m trying to figure out why a number of cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost calls are significantly slower than the rest of my transfers. I have a program that copies a matrix to the gpu, finds the inverse of each element, and then copies it back which then repeats this process on the cpu. It does this 1000 times, printing out the time taken for each matrix. For the most part, all of the memory transfers max out my 970M’s bandwidth except for around 10 percent of them.

Here’s the code

#include <iostream>
#include <numeric>
#include <stdlib.h>

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

/**
 * CUDA kernel that computes reciprocal values for a given vector
 */
__global__ void reciprocalKernel(float *data, unsigned vectorSize) {
	unsigned idx = blockIdx.x*blockDim.x+threadIdx.x;
	if (idx < vectorSize)
		data[idx] = 1.0/data[idx];
}

/**
 * Host function that copies the data and launches the work on GPU
 */
float *gpuReciprocal(float *data, unsigned size)
{
	float *rc = new float;
	float *gpuData;

	CUDA_CHECK_RETURN(cudaMalloc((void **)&gpuData, sizeof(float)*size));
	CUDA_CHECK_RETURN(cudaMemcpy(gpuData, data, sizeof(float)*size, cudaMemcpyHostToDevice));
	
	static const int BLOCK_SIZE = 256;
	const int blockCount = (size+BLOCK_SIZE-1)/BLOCK_SIZE;
	reciprocalKernel<<<blockCount, BLOCK_SIZE>>> (gpuData, size);

	CUDA_CHECK_RETURN(cudaMemcpy(rc, gpuData, sizeof(float)*size, cudaMemcpyDeviceToHost));
	CUDA_CHECK_RETURN(cudaFree(gpuData));
	return rc;
}

float *cpuReciprocal(float *data, unsigned size)
{
	float *rc = new float;
	for (unsigned cnt = 0; cnt < size; ++cnt) rc[cnt] = 1.0/data[cnt];
	return rc;
}

void initialize(float *data, unsigned size)
{
	for (unsigned i = 0; i < size; ++i)
		data[i] = .5*(i+1);
}

int main(void)
{
	static const int WORK_SIZE = 65530;
	float *data = new float[WORK_SIZE];

	float time;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	for (int i = 0; i < 1000; i++)
	{
		cudaEventRecord(start, 0);

		initialize (data, WORK_SIZE);

		float *recCpu = cpuReciprocal(data, WORK_SIZE);
		float *recGpu = gpuReciprocal(data, WORK_SIZE);
		float cpuSum = std::accumulate (recCpu, recCpu+WORK_SIZE, 0.0);
		float gpuSum = std::accumulate (recGpu, recGpu+WORK_SIZE, 0.0);

		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);
		cudaEventElapsedTime(&time, start, stop);

		/* Verify the results */
		std::cout << i << ") gpuSum = "<<gpuSum<< " cpuSum = " <<cpuSum<< " time = " <<time<< std::endl;

		delete[] recCpu;
		delete[] recGpu;
	}

	/* Free memory */
	delete[] data;

	return 0;
}

/**
 * Check the return value of the CUDA runtime API call and exit
 * the application if the call has failed.
 */
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
	if (err == cudaSuccess)
		return;
	std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
	exit (1);
}

Has anyone experienced this before? I’ve uploaded some screenshots (http://sli.mg/a/ENVCJl) of the data I’ve collected. The first two graphs show data from a 970, then a 960 and then a 745. What’s interesting is this curve is apparent on all the 900 series GPU’s I’ve tested, but not in the 745.

Does the behavior change when you pin this CUDA app to a particular CPU core for the entire run? Are there any other processes running concurrently with this CUDA app? Is this a dual CPU socket platform?

970M so that GPU may be servicing a display. The display servicing may be periodically sucking up some of your available memory bandwidth. And it’s not clear from your timing measurement that you could even narrow this down to cudaMemcpy operations. It’s also not clear to me how to interpret your output data - what are the x axis categories from 1-12?

Hi njuffa, thanks for the response.

Is there a way to make the profiler use a single core? I’m using a Intel® Core™ i7-4700EQ

The integrated graphics adapter is driving the display. x-axis is transfer speed in GB/s.

On Windows, SetProcessAffinityMask() allows an application to lock to a specific logical CPU, and you can use this to pin a CUDA app.

I would also suggest alternative measurements of transfer rates without the profiler, under the theory that any profiler on an platform tends to be intrusive and usually has negative impact on performance (there may be periodic bursts of activity due to the profiler itself, such as writing buffers to logs).

As txbob has pointed out, as long as he GPU is also driving the display, there may be interference with the CUDA copy operations on the GPU side. My questions were more focused on interference on the CPU side: transfers between system and device memory may be slowed down or delayed due to other system memory activity, or because a process may be schedule to either the “near” or the “far” CPU in a dual socket system. These are a very common sources of variability in CPU/GPU transfer performance. Since you are running Windows with the WDDM driver model (since the GPU is driving the display), memory management activity initiated by the operating system could also impact the transfers.

I’m actually using Ubuntu 14. I thought the profiler might be causing the issue but I’ve tried running both on my other linux machines and they didn’t have this problem.

In that case, disregard my comments with regard to WDDM. Unfortunately this information about different Linux machines is inconclusive.

My long-time observation is that Ubuntu has a higher incidence of “weird behavior” not seen with any other common Linux distro, so maybe the problem is specific to Ubuntu. However, the issue could also be specific to a hardware platform; do your other Linux boxes use the exact same hardware? Or the machines could be configured or used differently in ways we don’t know at this stage.

Since only one of your machine exhibits the objectionable behavior, you might want to debug this through one-by-one elimination of the differences between the “bad” and the “good” systems.