Small Memory Transfers with cudaMemcpyAsync

Hello,

I develop a large scientific software package for materials science and am currently implementing a GPU version of the code. I created a minimal GPU example which mimics the workload of the rate-limiting step in the main code. It is still somewhat complex, however, as there are multiple POSIX threads and each runs its own kernel for its own job. I set up the threads to use stream concurrency, with each POSIX thread using its own default stream.

I then profiled the code using NVTX and Nsight Systems, looking at it with both the CLI and the GUI. One thing that sticks out to me is that small DtoH memory transfers are taking an outsized share of the runtime. After each kernel run, two doubles are transferred DtoH, and these take longer than the 64,000 doubles transferred HtoD before the kernel run.

This is the runtime breakdown of my code from nsys --stats

Time (%)  Total Time (s)  Range
    27.4             280  :ENERGY_OUT
    25.8             264  :FENERGY_OUT
    22.9             234  :CUDAMalloc
    10.3             105  :KERNEL
    10.1             103  :CUDAFree
     2.8              28  :ELEVELS_IN

ELEVELS_IN is the 64,000-double HtoD transfer; ENERGY_OUT and FENERGY_OUT are each a single-double DtoH transfer.

I had assumed that stream concurrency would keep a steady queue of data transfers, but I see long empty periods between transfers. I am having some difficulty interpreting these results, as asynchronous calls can confound the profiling: depending on where the cudaStreamSynchronize calls are placed, it can look as if one call is taking a long time when it is actually the previous call finishing. To mitigate this, I call cudaStreamSynchronize after each GPU transfer or kernel.

I would really appreciate any help or advice you can provide as this is outside of my capabilities to fix and I am stuck.

Why does it look like the small memory transfers are so slow? I assumed these would be almost negligible.

Is there an intrinsic latency to each memory transfer?
Is there a reason that DtoH would be slower than HtoD?
Are all the calls to cudaStreamSynchronize slowing down the run?
Are there conflicts between the memory transfers on different streams?
Is there a problem with my stream setup that makes them less concurrent than I expect?
Is the profile inaccurate because of concurrency/asynchronous calls?

The code is given below. It is compiled with:

nvcc my_program.cu -lm -lpthread --default-stream per-thread -o my_program

A typical run is:
my_program 256 1 1024 1 1024 64000 27

The arguments here are:

256 - POSIX threads
1 - block (don't change this)
1024 - threads per block
1 -  run_mask (1 is GPU, 2 is CPU, 3 is both)
1024 - repeats
64,000 - vector size
27 - Memory Transfer Mask (There are 5 memory transfers available, the value of 27 produces the 3 transfers mentioned in this post)

Once again, thank you for your help and let me know if there is any other information I can provide that would be of assistance.

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <pthread.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <nvtx3/nvToolsExt.h>

enum MEMCPY_BITMASK {ELEVELS_IN, FENERGY_IN, WEIGHTS_OUT, FENERGY_OUT, ENERGY_OUT};

//Fallback search-window limits for the fermi level when the initial bracket fails.
//These values are assumptions for this stand-alone example (the energy levels are drawn from [0,1]);
//the full code presumably defines them elsewhere.
#define optics_llimit 0.0
#define optics_ulimit 1.0

__device__ void fermiIteration(double *energyLevels, double fermiGuess, double temperature, int numEnergyLevels, double *weights, double *energyWeights, double *weightSum)
{
	double threadEnergy=0;//Sum for each thread
	double threadWeight=0;
	int iLevel; 
	for(iLevel = threadIdx.x;iLevel < numEnergyLevels;iLevel+=blockDim.x)
	{
		weights[iLevel] = 1.0/(1.0+exp((*(energyLevels+iLevel)-fermiGuess)/temperature));
		threadEnergy +=weights[iLevel] * energyLevels[iLevel];
		threadWeight += weights[iLevel];
	}

	energyWeights[threadIdx.x] = threadEnergy; //Store sum from each thread at base value
	weightSum[threadIdx.x] = threadWeight; //Store sum from each thread at base value
	for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) 
	{
		__syncthreads();

		if(threadIdx.x < stride)
		{
			weightSum[threadIdx.x ] += weightSum[threadIdx.x+stride];
			energyWeights[threadIdx.x] += energyWeights[threadIdx.x+stride];
		}
	}
	__syncthreads();
}

__global__ void fermiKernel(double *energyLevels, double *fermiEnergy, double currentExcitations, double fermiConvergence, double temperature, int numEnergyLevels,  double *weights, double *energyWeights, double *d_energy, double *weightSum)
{
	double fermiGuess = *fermiEnergy;
	double fermiMin = fermiGuess - 7.5*fermiConvergence;
	double fermiMax = fermiGuess + 8.49*fermiConvergence;//This is slightly offset so that the fermi level could be almost unchanged. 
	double totalCarriersMin=0;
	double totalCarriersMax=0; 
	double totalCarriers;
	//Try to bound the fermi level within a small box around the previous result which requires not more than 6 iterations to converge.
	fermiIteration(energyLevels, fermiMin, temperature, numEnergyLevels, weights, energyWeights, weightSum);
	totalCarriersMin = *weightSum;
	fermiIteration(energyLevels, fermiMax, temperature, numEnergyLevels, weights, energyWeights, weightSum);
	totalCarriersMax = *weightSum;
	
	if((totalCarriersMin > currentExcitations) || (totalCarriersMax < currentExcitations))//If the solution is not within the bounds expand the bounds.  	
	{
		//printf("Fermi level outside bounds, Min %.10g max %.10g, min carriers %.10g, max carriers %.10g, target carriers %g\n",fermiMin,fermiMax,totalCarriersMin,totalCarriersMax,currentExcitations);
		fermiMin = optics_llimit;
		fermiMax = optics_ulimit;
	}
	
	while(fermiMax - fermiMin > fermiConvergence)
	{
 
		fermiGuess = (fermiMax+fermiMin)/2.0;	
		
		__syncthreads();
		fermiIteration(energyLevels, fermiGuess, temperature, numEnergyLevels, weights, energyWeights, weightSum);
		totalCarriers = *weightSum;
	
		if(totalCarriers > currentExcitations) //Lower search range
			fermiMax = fermiGuess;
		else
			fermiMin = fermiGuess;
		//printf("Fermi range %g %g Carriers %g %g\n",fermiMin,fermiMax,totalCarriers,currentExcitations);
                __syncthreads();
	}
	
	if(threadIdx.x == 0)
	{
		*fermiEnergy = fermiGuess;
		*d_energy = *energyWeights;
	}
}


void fermiSum( double *energyLevels, double fermiEnergy, double temperature, int numEnergyLevels, double *weights, double *weightSum, double *energySum)
{
 	/*
	Evaluates the fermi Sum, N=sum(fi) and E=sum(fi * Ei)
	where fi is the fermi function fi = 1/(1+exp((Ei-Ef)/kT))
	
	Inputs are:
	energyLevels (Ei)
	fermiEnergy (Ef)
	temperature (kT)
	numEnergyLevels (n)
	
	outputs are:
	weights (fi)
	weightSum (N)
	energySum (E)
	*/
	*weightSum=0;
	*energySum=0;
	for(int i = 0;i<numEnergyLevels;i++)
	{
		*(weights+i) = 1.0/(1.0+exp((*(energyLevels+i)-fermiEnergy)/temperature));
		*weightSum += *(weights+i);
		*energySum += *(energyLevels+i) * *(weights+i);
	}
}


double solveFermiEnergy(double currentExcitations, double fermiGuess, double fermiConvergence, double *energyLevels,double *weights, int numEnergyLevels, double temperature, double *energy)
{
	double fermiMin = fermiGuess - 7.5*fermiConvergence;
	double fermiMax = fermiGuess + 8.49*fermiConvergence;//This is slightly offset so that the fermi level could be almost unchanged. 

	double totalCarriersMin=0;
	double totalCarriersMax=0; 
	//Try to bound the fermi level within a small box around the previous result which requires not more than 6 iterations to converge.
	fermiSum( energyLevels, fermiMin, temperature, numEnergyLevels, weights, &totalCarriersMin, energy);
	fermiSum( energyLevels, fermiMax, temperature, numEnergyLevels, weights, &totalCarriersMax, energy);
	
	
	if((totalCarriersMin > currentExcitations) || (totalCarriersMax < currentExcitations))//If the solution is not within the bounds expand the bounds.  	
	{
		//printf("Fermi level outside bounds, Min %.10g max %.10g, min carriers %.10g, max carriers %.10g, target carriers %g\n",fermiMin,fermiMax,totalCarriersMin,totalCarriersMax,currentExcitations);
		fermiMin = optics_llimit;
		fermiMax = optics_ulimit;
	}
		
	double totalCarriers=0;
	while(fermiMax-fermiMin > fermiConvergence)
	{
		fermiGuess = (fermiMax+fermiMin)/2.0;	
		fermiSum( energyLevels, fermiGuess, temperature, numEnergyLevels, weights, &totalCarriers, energy);
		if(totalCarriers > currentExcitations) //Lower search range
			fermiMax = fermiGuess;
		else
			fermiMin = fermiGuess;
		//printf("Fermi range %g %g Carriers %g %g\n",fermiMin,fermiMax,totalCarriers,currentExcitations);
	}
	return fermiGuess;
}

double RandDouble(double low, double high, unsigned int *state) {
	double t = (double)rand_r(state) / (double)RAND_MAX;
	return (1.0f - t) * low + t * high;
}

int num_pthreads = 256;
// Number of elements per vector; arbitrary,
int ELEMENT_N = 64000;
// Total number of data elements
int sDATA_SZ;
int num_blocks = 1;
int num_threads = 128;
int run_mask = 3;
int repeats = 1;
int memcpy_mask = 31;
double fermiConvergence = 1e-7;
	
void * single_thread(void *data)
{
	int thread_number = *((int *)data);
	int num_gpu;
	cudaGetDeviceCount(&num_gpu);
	cudaSetDevice((int)((double)thread_number/num_pthreads*num_gpu));//Spread the threads evenly across the available GPUs
	double *h_energyLevels, *h_weights, *weights_CPU, fermiEnergy_CPU = 0.5, h_fermiEnergy, h_energy, energy_CPU;//Start the CPU search from the same 0.5 guess as the GPU path
	double *d_energyLevels, *d_weights, *d_energyWeights, *d_weightSum, *d_fermiEnergy, *d_energy;
	int i,iRepeat;	
	int saccum_size = num_threads*sizeof(double);
	if(thread_number==0)
		printf("Initializing data...\n");
	if(thread_number==0)
		printf("...allocating CPU memory.\n");
	h_energyLevels = (double *)malloc(sDATA_SZ);
	h_weights = (double *)malloc(sDATA_SZ);
	weights_CPU = (double *)malloc(sDATA_SZ);

	if(thread_number==0)
		printf("...allocating GPU memory.\n");
	
	nvtxRangePushA(":CUDAMalloc");
	cudaMalloc((void **)&d_energyLevels, sDATA_SZ);
	cudaMalloc((void **)&d_weights, sDATA_SZ);
	cudaMalloc((void **)&d_energyWeights, saccum_size);
	cudaMalloc((void **)&d_energy, sizeof(double));
	cudaMalloc((void **)&d_weightSum, saccum_size);
	cudaMalloc((void **)&d_fermiEnergy, sizeof(double));
	
  	cudaStreamSynchronize(0);//Finish Mallocs before starting Memcpy
  	nvtxRangePop();
  	
  	if(thread_number==0)
		printf("...initializing data.\n");
	
	unsigned int state = thread_number;//Seed rand_r so each thread gets its own reproducible sequence
	for (i = 0; i < ELEMENT_N; i++) 
	{
		h_energyLevels[i] = RandDouble(0.0f, 1.0f,&state);
	}
	h_fermiEnergy = 0.5;
	
	if((run_mask / 2) % 2 == 1)
	{
		if(thread_number==0)
			printf("...Beginning CPU calculation.\n");
		for(iRepeat=0;iRepeat<repeats;iRepeat++)
		{
			fermiEnergy_CPU =  solveFermiEnergy( ELEMENT_N/2, fermiEnergy_CPU, fermiConvergence, h_energyLevels,weights_CPU,ELEMENT_N,0.025,&energy_CPU);
		
		}
		if(thread_number==0)	
			printf("Fermi Level from CPU is %g energy %g\n",fermiEnergy_CPU,energy_CPU); 
	}

	if ((run_mask % 2 == 1))
	{
		if(thread_number==0)
		printf("...copying input data to GPU mem.\n");
		
		// Copy data to GPU memory for processing
		cudaMemcpyAsync(d_energyLevels, h_energyLevels, sDATA_SZ, cudaMemcpyHostToDevice,0);
		cudaMemcpyAsync(d_fermiEnergy, &h_fermiEnergy, sizeof(double), cudaMemcpyHostToDevice,0);
  		cudaStreamSynchronize(0);
  		
  		if(thread_number==0)
		printf("...Running GPU Kernel.\n");
			
		for(iRepeat=0;iRepeat<repeats;iRepeat++)
		{
			nvtxRangePushA(":ELEVELS_IN"); 
			if((memcpy_mask >> ELEVELS_IN) %2)
			cudaMemcpyAsync(d_energyLevels, h_energyLevels, sDATA_SZ, cudaMemcpyHostToDevice,0);
			cudaStreamSynchronize(0);
			nvtxRangePop();
			nvtxRangePushA(":FENERGY_IN"); 
			if((memcpy_mask >> FENERGY_IN) %2)
			cudaMemcpyAsync(d_fermiEnergy, &h_fermiEnergy, sizeof(double), cudaMemcpyHostToDevice,0);
			cudaStreamSynchronize(0);
			nvtxRangePop();
			nvtxRangePushA(":KERNEL");	
			fermiKernel<<<1, num_threads,0,0>>>(d_energyLevels, d_fermiEnergy,  ELEMENT_N/2, fermiConvergence , 0.025, ELEMENT_N, d_weights,d_energyWeights,d_energy,d_weightSum);
 			cudaStreamSynchronize(0);		
			nvtxRangePop();
			nvtxRangePushA(":WEIGHTS_OUT");	
			
			if((memcpy_mask >> WEIGHTS_OUT) %2)
			cudaMemcpyAsync(h_weights, d_weights, sDATA_SZ, cudaMemcpyDeviceToHost,0);
			cudaStreamSynchronize(0);
			nvtxRangePop();
			nvtxRangePushA(":FENERGY_OUT");
			if((memcpy_mask >> FENERGY_OUT) %2)
			cudaMemcpyAsync(&h_fermiEnergy, d_fermiEnergy,  sizeof(double), cudaMemcpyDeviceToHost,0);
			cudaStreamSynchronize(0); 
			nvtxRangePop();
			nvtxRangePushA( ":ENERGY_OUT");
			if((memcpy_mask >> ENERGY_OUT) %2) 
			cudaMemcpyAsync(&h_energy, d_energyWeights, sizeof(double), cudaMemcpyDeviceToHost,0);
			cudaStreamSynchronize(0);
			nvtxRangePop(); 
  		}

  		cudaMemcpyAsync(h_weights, d_weights, sDATA_SZ, cudaMemcpyDeviceToHost,0);
		cudaMemcpyAsync(&h_fermiEnergy, d_fermiEnergy,  sizeof(double), cudaMemcpyDeviceToHost,0);  
		cudaMemcpyAsync(&h_energy, d_energyWeights, sizeof(double), cudaMemcpyDeviceToHost,0); 	
  		
  		cudaStreamSynchronize(0);	
 	
  		if(thread_number==0)
  		printf("Fermi Level GPU is %g Energy GPU is %g\n",h_fermiEnergy,h_energy);
  		cudaStreamSynchronize(0);//Finish Memcpys before Cuda_free
  	}
  	
	if(thread_number==0)
  	printf("...freeing GPU memory.\n");
  	nvtxRangePushA(":CUDAFree");
	cudaFree(d_energyLevels);
	cudaFree(d_weights);
	cudaFree(d_energyWeights);
	cudaFree(d_weightSum);
	cudaFree(d_fermiEnergy);
	cudaFree(d_energy);
	cudaStreamSynchronize(0);
	nvtxRangePop(); 
	return 0;
}

int main(int argc, char **argv)
{

	if (argc == 8)
	{
		num_pthreads = atoi(argv[1]);
		num_blocks = atoi(argv[2]);
		num_threads = atoi(argv[3]);
		run_mask = atoi(argv[4]);
		repeats = atoi(argv[5]);
		ELEMENT_N = atoi(argv[6]);
		memcpy_mask = atoi(argv[7]);
	}
	else
		printf("Usage: distribution_GPU  num_pthreads num_blocks num_threads run_mask repeats elements memcpy_mask[0-31]\n");
	
	
	printf("Running with %d pthreads %d blocks %d threads, %d run_mask and %d repeats %d elements\n",num_pthreads,num_blocks,num_threads,run_mask,repeats,ELEMENT_N);
	if((memcpy_mask >> ELEVELS_IN) %2)
		printf("Copying energy levels in\n");
	if((memcpy_mask >> FENERGY_IN) %2)
		printf("Copying fermi energy in\n");
	if((memcpy_mask >> WEIGHTS_OUT) %2)
		printf("Copying weights out\n");
	if((memcpy_mask >> FENERGY_OUT) %2)
		printf("Copying fermi energy out\n");
	if((memcpy_mask >> ENERGY_OUT) %2)
		printf("Copying energy out\n");
	printf("%s Starting...\n\n", argv[0]);
	sDATA_SZ = ELEMENT_N * sizeof(double);

  	int iThread = 0;
  	int *threadNumber = (int *)malloc(num_pthreads*sizeof(int));
  	void *retval;
  	pthread_t *threads = (pthread_t *)malloc(num_pthreads*sizeof(pthread_t));
  	for(iThread=0;iThread<num_pthreads;iThread++)
  	{
  		*(threadNumber+iThread) = iThread;
  		pthread_create(threads+iThread,NULL,(void *(*)(void *))single_thread,(void *)(threadNumber+iThread));
  	}
  	for(iThread=0;iThread<num_pthreads;iThread++)
  	{
  		pthread_join(*(threads+iThread),&retval);
  	}
}

Sanity check: all of the host memory involved in the cudaMemcpyAsync calls is pinned, correct?

Regarding intrinsic latency: yes, there is. Most of it is PCIe hardware communication overhead; a minor portion is overhead in the CUDA software stack. If you are using a fast host system (CPU base frequency >= 3.5 GHz highly recommended), you are already doing all you can to drive the latter to epsilon. You can easily visualize the lower throughput at smaller transfer sizes by making a shmoo plot that tracks transfer size on the x-axis (use a logarithmic scale) against measured throughput on the y-axis. Maximum throughput should be achieved once transfer size reaches 16 MB or thereabouts.
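Something along these lines would produce the data for such a plot in the DtoH direction (a minimal, stand-alone sketch; the size range and iteration count are arbitrary choices, not anything taken from your code):

#include <stdio.h>
#include <cuda_runtime.h>

//Sweep DtoH transfer sizes and report measured throughput for each size.
int main(void)
{
	const int iterations = 100;
	const size_t maxBytes = 256 * 1024 * 1024;
	double *h_buf, *d_buf;
	cudaMallocHost((void **)&h_buf, maxBytes); //pinned host buffer
	cudaMalloc((void **)&d_buf, maxBytes);     //contents are irrelevant for a bandwidth test
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	for (size_t bytes = 1024; bytes <= maxBytes; bytes *= 2)
	{
		cudaEventRecord(start, 0);
		for (int i = 0; i < iterations; i++)
			cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, 0);
		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);
		float ms = 0.0f;
		cudaEventElapsedTime(&ms, start, stop);
		printf("%10zu bytes  %8.3f GB/s\n", bytes,
		       (double)bytes * iterations / (ms * 1.0e-3) / 1.0e9);
	}
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(d_buf);
	cudaFreeHost(h_buf);
	return 0;
}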

As for DtoH being slower than HtoD: there are multiple possible reasons. Hardware: a slow host system with low system memory throughput and high system memory latency. Software: the CUDA software stack can optimize small HtoD transfers (for some value of "small") by sending the data of an async copy through the command queue, which has no DtoH counterpart. How big is the directional throughput difference being observed, and at what transfer size?

What GPU is being used here? What are the specifications of the host system’s [1] CPU [2] memory subsystem (channels, DDR type and speed grade)?

Not sure what you mean by “conflict”. Generally speaking, any time multiple agents act upon a shared resource (such as a physical memory or an interconnect) there is a chance for each agent to observe (1) increased latency (2) reduced throughput. In addition, the use of multiple agents usually involves overhead for (A) locking (B) communication.

256 threads targeting CUDA may not be a great design. There are a few reasons for this:

  1. You effectively have 256 streams. There will be some false serialization (lack of independence) that you might not have been expecting.
  2. It may encourage you to issue work in small amounts to the GPU. That isn’t advisable.
  3. The CUDA runtime API requires acquisition of locks internally in a multi-threaded situation. For this reason, people in multi-threaded scenarios report from time to time that CUDA runtime API calls take longer than expected. (Here is a recent example. I'm not suggesting it exactly duplicates your case or is highly instructive; it's just an example, and there are many such reports.)

None of that is a direct answer to your question, just some advice. CUDA allows asynchronous work delivery to the GPU; your use of cudaStreamSynchronize() after every work item issuance completely defeats that paradigm. I think a better approach is to issue all work from one thread, asynchronously. You already have a loop; loop over all the work. Again, just a suggestion.
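As a minimal sketch of that pattern (the per-job arrays d_levels, h_levels, d_fermi, h_fermi, h_energy, d_w, d_ew, d_e, d_ws, and the numJobs/NUM_STREAMS values are placeholders, not something from your code; host buffers assumed pinned; the kernel call mirrors your fermiKernel):

//Issue all work from one host thread, asynchronously, using a small pool of non-default streams.
const int NUM_STREAMS = 8;
cudaStream_t streams[NUM_STREAMS];
for (int s = 0; s < NUM_STREAMS; s++)
	cudaStreamCreate(&streams[s]);

for (int job = 0; job < numJobs; job++)
{
	int s = job % NUM_STREAMS;
	cudaMemcpyAsync(d_levels[job], h_levels[job], sDATA_SZ, cudaMemcpyHostToDevice, streams[s]);
	fermiKernel<<<1, num_threads, 0, streams[s]>>>(d_levels[job], d_fermi[job], ELEMENT_N/2,
		fermiConvergence, 0.025, ELEMENT_N, d_w[job], d_ew[job], d_e[job], d_ws[job]);
	cudaMemcpyAsync(h_fermi[job], d_fermi[job], sizeof(double), cudaMemcpyDeviceToHost, streams[s]);
	cudaMemcpyAsync(h_energy[job], d_e[job], sizeof(double), cudaMemcpyDeviceToHost, streams[s]);
}
//A single synchronization point after all work has been issued.
for (int s = 0; s < NUM_STREAMS; s++)
	cudaStreamSynchronize(streams[s]);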

If I were going to proceed further with what you have, then the first thing I would do is rerun your timing with a single thread. That would at least (hopefully) confirm your expectation, that the D2H copies can be shorter than the H2D copies. After that, it might be necessary to plow into the multi-threaded case to see if the observation can be explained.

I did not know this was a thing. Switching to cudaMallocHost() for the single-element data transfers made a substantial difference. I believe the difference between DtoH and HtoD may actually have been a difference between transferring stack-allocated memory and heap-allocated memory. The stack is particularly far from the device.
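For reference, the change in single_thread() amounts to roughly this (a sketch; only the scalar host variables are shown, everything else stays as posted):

//Before: double h_fermiEnergy, h_energy; on the stack.
//After: allocate the scalars as pinned host memory so the async copies have a genuinely pinned target.
double *h_fermiEnergy, *h_energy;
cudaMallocHost((void **)&h_fermiEnergy, sizeof(double));
cudaMallocHost((void **)&h_energy, sizeof(double));
*h_fermiEnergy = 0.5;

//DtoH copies now target the pinned pointers directly:
cudaMemcpyAsync(h_fermiEnergy, d_fermiEnergy, sizeof(double), cudaMemcpyDeviceToHost, 0);
cudaMemcpyAsync(h_energy, d_energyWeights, sizeof(double), cudaMemcpyDeviceToHost, 0);
cudaStreamSynchronize(0);

//Freed at the end of the thread alongside the device allocations:
cudaFreeHost(h_fermiEnergy);
cudaFreeHost(h_energy);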

I am using the Perlmutter GPU nodes at NERSC. They are very nice. 4x A100s (40 GB), AMD EPYC 7763 processor. It looks like 8 sticks of 32GB DDR4 each with total throughput of 204.8 GB/s. Architecture - NERSC Documentation

This system has 4 GPUs, so it should be 64 streams per GPU. Still, I am uncertain whether I am actually filling up the GPU cores. Are there separate locks for each GPU?

I was thinking that if one 8-element transfer is faster than eight 1-element transfers, then it would be better to have a single thread transfer a block rather than each thread transfer its own memory. I think what you are suggesting takes this one step further: have a single thread issue all the commands to the device. I will have to think on this one, as there are a lot of ways it could go wrong. E.g. if there are 7 jobs ready and you typically do batches of 8, what do you do? I can also see it being difficult to ensure a job has completed and its data is available without stumbling over wait/race conditions. After the last improvement I clocked the GPU version as 7.8x faster than the CPU version. That's good, but I think it should be more, considering this is a 4-GPU system.

For clarity, I am calling cudaStreamSynchronize(0); with each thread using its own default stream, so I believe the streams are not blocking each other. However, once this gets merged into the full code, the host can do some work before checking whether the GPU finished its job.
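For the full code I am picturing an event-based check rather than a blocking synchronize, roughly like this (just a sketch of the pattern, using the variable names from the posted code):

cudaEvent_t done;
cudaEventCreate(&done);
cudaMemcpyAsync(&h_fermiEnergy, d_fermiEnergy, sizeof(double), cudaMemcpyDeviceToHost, 0);
cudaEventRecord(done, 0);
while (cudaEventQuery(done) == cudaErrorNotReady)
{
	//do other useful host work here instead of blocking in cudaStreamSynchronize
}
cudaEventDestroy(done);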

I found an answer to this question:

My profiling results are quite different with 128 pthreads versus 256 pthreads. I believe this is a 128-core/256-thread CPU, and I observe higher throughput with 256 pthreads, but a thread often has to go to sleep to let another thread finish. With a high number of pthreads, the time attributed to the different calls looks more even. This shows up as a high max call time, which I believe comes from a context switch rather than from the call itself. Here is the profiling result with 256 pthreads (note how the kernel, the 64k transfer in, and the two single-element transfers out all have roughly the same max time):

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)     Min (ns)    Max (ns)      StdDev (ns)    Style      Range
 --------  ---------------  ---------  -------------  -------------  --------  -------------  -------------  -------  ------------
     43.9  250,044,289,522    262,144      953,843.3      404,374.0   350,650     20,718,314    1,529,691.8  PushPop  :KERNEL
     24.5  139,402,611,650    262,144      531,778.8      359,277.0    29,257     17,159,312      631,668.5  PushPop  :ELEVELS_IN
     12.9   73,431,921,535    262,144      280,120.6       50,267.0     9,689     17,544,098      786,868.7  PushPop  :FENERGY_OUT
      9.3   52,971,212,582    262,144      202,069.1       48,544.0     9,578     19,465,275      473,665.7  PushPop  :ENERGY_OUT
      6.3   35,664,992,164        256  139,316,375.6  111,839,820.0    78,041  1,057,843,289  153,440,675.1  PushPop  :CUDAMalloc
      3.1   17,788,378,300        256   69,485,852.7   52,823,088.5   585,896    299,917,528   64,844,321.6  PushPop  :CUDAFree

If I use 128 pthreads, the time for the 64k transfer is reported as much higher than for the single-element transfers. I believe this is closer to the truth.

Time (%)  Total Time (ns)  Instances    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)    Style      Range
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  -------  ------------
     47.5   66,645,569,361    131,072     508,465.3     494,534.0    29,427   11,219,580     283,247.1  PushPop  :ELEVELS_IN
     38.8   54,460,186,533    131,072     415,498.2     392,756.5   350,450   10,886,955     142,290.6  PushPop  :KERNEL
      5.4    7,544,313,291    131,072      57,558.5      40,659.0    10,009    5,634,528      93,292.4  PushPop  :FENERGY_OUT
      5.3    7,480,814,518    131,072      57,074.1      40,769.0     9,518    9,735,211      90,775.0  PushPop  :ENERGY_OUT
      2.5    3,441,146,288        128  26,883,955.4  23,891,682.0   152,005  109,727,832  16,553,211.3  PushPop  :CUDAMalloc
      0.6      797,412,478        128   6,229,785.0   4,632,279.0   428,892   25,223,711   4,543,624.4  PushPop  :CUDAFree

You could define a time t (depending on the typical running time of the jobs).

If, a time t after the last batch was issued to the GPU, 8 jobs are not yet ready, the jobs that are ready are issued anyway, even though there are fewer than 8.
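Something like this, as a sketch (queue_mutex, queue_cond, num_queued, and submit_batch are placeholders for whatever the real job queue uses; producer threads would increment num_queued and signal queue_cond):

#include <pthread.h>
#include <errno.h>
#include <time.h>

#define BATCH 8
static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond  = PTHREAD_COND_INITIALIZER;
static int num_queued = 0;
void submit_batch(int n); /* issues n (<= BATCH) queued jobs to the GPU */

void *issuer_thread(void *arg)
{
	struct timespec deadline;
	for (;;)
	{
		pthread_mutex_lock(&queue_mutex);
		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_sec += 1; /* t = 1 second here; tune to the typical job runtime */
		while (num_queued < BATCH)
		{
			if (pthread_cond_timedwait(&queue_cond, &queue_mutex, &deadline) == ETIMEDOUT)
				break; /* timeout: flush whatever is ready, even if fewer than BATCH */
		}
		int n = num_queued;
		num_queued = 0;
		pthread_mutex_unlock(&queue_mutex);
		if (n > 0)
			submit_batch(n);
	}
	return NULL;
}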

I am afraid I have no clue what "far from the device" means here. Far in a NUMA sense, i.e. number of hops?

Yes, I think it is more hops from device to stack than device to heap. Otherwise, I cannot think of any explanation for why the small transfer took longer than the large transfer.

regards,

You could experiment with numactl to see whether fixing processor and memory affinity, such that each GPU communicates with the nearest CPU core complex and its associated memory controller(s), improves performance. EPYC-based systems do have NUMA characteristics, but the effect is usually quite mild, which allows programmers to treat them as "almost UMA" in many situations. Wouldn't hurt to look into, though.
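For example, something along these lines (the GPU/NUMA-node pairing and thread count below are placeholders; numactl -H and nvidia-smi topo -m show the actual mapping on your nodes):

CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=3 --membind=3 ./my_program 64 1 1024 1 1024 64000 27

i.e., one process per GPU, with each process bound to the cores and memory of the NUMA node closest to its GPU.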


Setting the NUMA affinity made a big difference.

I used numactl -H to find the mapping between my cores and NUMA nodes. Then lstopo and nvidia-smi showed the mapping between GPUs and NUMA nodes. Finally, I mapped each thread to a GPU with cudaSetDevice and to a NUMA node. Unfortunately, the mapping is system-specific, so I am not certain the code will translate for other users. c - How to set CPU affinity of a particular pthread? - Stack Overflow

Here is an example profiling run:

Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)    Min (ns)    Max (ns)      StdDev (ns)    Style      Range
 --------  ---------------  ---------  -------------  ------------  --------  -------------  -------------  -------  ------------
     34.6   49,082,614,693    131,072      374,470.6     363,672.0   350,687     17,128,793       74,552.9  PushPop  :KERNEL
     32.5   46,082,798,665        128  360,021,864.6  31,328,988.5   100,324  1,022,184,745  378,240,968.2  PushPop  :CUDAMalloc
     21.4   30,329,103,454    131,072      231,392.7     189,055.5    50,918      2,685,647      192,856.0  PushPop  :ELEVELS_IN
      6.0    8,551,019,179        128   66,804,837.3  19,945,552.5   448,676    688,951,435  143,984,350.0  PushPop  :CUDAFree
      3.1    4,364,703,292    131,072       33,300.0      23,275.0     9,247      2,770,120       63,060.8  PushPop  :FENERGY_OUT
      2.3    3,330,166,834    131,072       25,407.2      20,910.0     8,927      2,771,362       31,318.9  PushPop  :ENERGY_OUT

I still feel like I can't completely explain why using cudaMallocHost and setting the affinity made such a difference, however. The computers I am running on do not have any swap space, so there is no paging. I did notice that if I tried to use numa_alloc_onnode for the single-element allocations, the performance was awful; I think numa_alloc_onnode rounds the allocation up to the nearest page size. Another possibility is that, even though there is no swap space, using cudaMallocHost made the CUDA API aware that the memory could safely be assumed to be pinned.

Anyway, I clocked the GPU version as being 11.4x faster than the CPU version now, even with a few cudaMallocs, which is close to what I think the maximum would be.

Also, I realized that this node should use 128 threads, not 256. It has a single 64-core/128-thread CPU; the case I was thinking of was 2x 64-core/128-thread CPUs.

Can you avoid redoing the memory allocations for each run, e.g. by having a background service keep them allocated? That would improve your overall performance.

There is a difference between "pinned memory" and "pageable memory with a zero-sized page file".

Using cudaMallocHost to allocate pinned host memory cuts down on the total system memory traffic and the latency of the transfers.

The DMA engines (“copy engines”) in the GPU transfer data between physically contiguous memory blocks. When pageable system memory is used with CUDA copy APIs, the CUDA driver transfers data from/to a pinned memory block of limited size (typically, a few MB), which GPU DMA transfers use as a target / source.

This two-step process doubles the system memory load, and markedly increases latency of GPU ↔ host data transfer, especially in host systems with low system memory performance. In addition, for large transfers, it necessitates splitting the transfer into multiple chunks based on the size of the pinned memory buffer maintained by the driver. That is why, as a best practice, cudaMemcpyAsync should always use programmer-allocated pinned memory allocations.
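Where host buffers are allocated by other parts of an application and cannot easily be switched to cudaMallocHost, an existing allocation can also be pinned after the fact (sketch, error checking omitted):

double *h_energyLevels = (double *)malloc(sDATA_SZ);
cudaHostRegister(h_energyLevels, sDATA_SZ, cudaHostRegisterDefault); //page-lock the existing allocation
/* ... cudaMemcpyAsync to/from h_energyLevels now uses the pinned fast path ... */
cudaHostUnregister(h_energyLevels);
free(h_energyLevels);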

The reason fixing processor and memory affinity with numactl can improve performance is that it minimizes the number of CPU-CPU interconnect hops (either between CPU sockets or between core complexes inside a single CPU) traversed per host/device data transfer. This reduces latency. When finite buffering is used, an increase in latency is often accompanied by a reduction in throughput, and this effect becomes noticeable when approaching peak throughput.

In addition, for some workloads the CPU-CPU interconnect can become a bottleneck, with contention for the same link by transfers between different CPU sockets or core complexes. Fixing processor and memory affinity, such that each GPU communicates with the nearest core complex and its associated memory controllers, minimizes use of the CPU-CPU interconnect and thus contention for the links between sockets / core complexes.

Unfortunately, the mapping is system specific, so I am not certain if the code will translate for other users.

That is par for the course in NUMA environments. It is also the reason CPU designers try to minimize NUMA effects as much as possible so programmers can largely avoid dealing with them; but there are limits to what they can do. You could try to automate (script, program) the manual process you used to find the mappings relevant to the numactl bindings.
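One possible building block for such automation is NVML, which can report the CPU affinity mask the driver associates with each GPU (a sketch under the assumption that the NVML headers/library are available; error handling omitted, link with -lnvidia-ml):

#include <stdio.h>
#include <nvml.h>

//Print the host CPUs the driver considers closest to each GPU.
int main(void)
{
	nvmlInit();
	unsigned int deviceCount;
	nvmlDeviceGetCount(&deviceCount);
	for (unsigned int i = 0; i < deviceCount; i++)
	{
		nvmlDevice_t dev;
		unsigned long cpuSet[16] = {0}; //bitmask, 64 CPUs per element
		nvmlDeviceGetHandleByIndex(i, &dev);
		nvmlDeviceGetCpuAffinity(dev, 16, cpuSet);
		printf("GPU %u: nearest CPUs, mask[0] = 0x%lx\n", i, cpuSet[0]);
	}
	nvmlShutdown();
	return 0;
}

The reported masks could then be translated into the pthread affinity and cudaSetDevice calls, rather than hard-coding the mapping.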

The system I am using has 4 NUMA nodes and 4 GPUs, which is probably not a coincidence. I noticed that if I have cores from different nodes bind to the same GPU (while still dividing the work evenly across nodes and GPUs), the performance of the cudaMemcpys drops quite badly. I am curious what happens for systems with a different number of NUMA nodes than GPUs. If you had 2 CPU sockets and a single GPU, it might be better to have only one of the CPUs talk to the GPU.

I think I understand the logic behind the mapping. It's not entirely general, but this code is consistent with the numbering on my system. It assumes one pthread per hardware thread (stick_this_thread_to_core comes from the Stack Overflow link above):

#define hyperthreads 2
#define sockets 1
#define nodesPerSocket 4
#define numPthreads 128

stick_this_thread_to_core(iThread); //pin this pthread to hardware thread iThread
int numGPU;
cudaGetDeviceCount(&numGPU);

int cores = numPthreads/hyperthreads;
//Map the thread's physical core to a NUMA node and a GPU; the double math deliberately truncates to an int index.
//iNUMA is used to bind this thread's memory to the matching NUMA node (binding call not shown here).
int iNUMA = sockets * nodesPerSocket * ((double)(iThread % cores))/(cores);
int iGPU = numGPU * ((double)(iThread % cores))/(cores);

cudaSetDevice(iGPU);

That is very likely. Keep in mind that each EPYC CPU itself is internally constructed from multiple chiplets, so access to the GPU might not even be uniform for all the chiplets inside that one CPU socket.