ECC Impact on memcpy performance

Hi

I just found a performance bottleneck in my code when running it on multiple GPUs on a big cluster.

After some profiling I pinned the problem down to ECC memory being enabled.

The culprits are cudaMemcpyToSymbol calls, which are used quite heavily and whose number is constant for each process (basically they copy single ints and floats).

I wrote a little test program (see further down) and ran it on two clusters and a workstation. The basic conclusion is that with ECC enabled, copy operations seem to take about 4 times as long as without ECC. Note, though, that this is mostly a latency effect, since only single integers and the like are being copied.

The output is:

NICS Keeneland (M2050, ECC on):
Start
no error
Done Const: 2.169880e-05
Done Mem: 2.027810e-05 (BW: 1.881191e-01)

eStella (C2050, ECC on):
Start
no error
Done Const: 1.934510e-05
Done Mem: 1.817390e-05 (BW: 2.098998e-01)

eStella (C2050, ECC off):
Start
no error
Done Const: 5.480500e-06
Done Mem: 5.147300e-06 (BW: 7.411065e-01)

Workstation (GTX470, no ECC):
Start
no error
Done Const: 5.013000e-06
Done Mem: 4.917500e-06 (BW: 7.757391e-01)

Is a performance hit like that expected?

Cheers

Christian

Code (compiled with “nvcc -arch=sm_20 -lrt test.cu”):

#include <cstdio>
#include <ctime>
#include "cuda_runtime.h"

__device__ __constant__ int c;

// Copy a single int into constant memory (the call that dominates the profile).
// The symbol is passed directly rather than by its string name.
void UpdateConst(int a)
{
  cudaMemcpyToSymbol(c, &a, sizeof(int));
}

// Copy 'size' ints from host to device global memory.
void UpdateMem(int* d_a, int* a, int size)
{
  cudaMemcpy(d_a, a, size*sizeof(int), cudaMemcpyHostToDevice);
}

int main()
{
  printf("Start\n");

  int size=1;
  int loopsC=10000;
  int loopsD=10000;
  cudaThreadSynchronize();

  int* d_a;
  cudaMalloc(&d_a, size*sizeof(int));
  int* a=new int(0);
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));

  timespec time1, time2;

  // Average time per constant-memory update.
  UpdateConst(0);
  clock_gettime(CLOCK_REALTIME, &time1);
  for(int i=0; i<loopsC; i++)
    UpdateConst(i);
  clock_gettime(CLOCK_REALTIME, &time2);
  double time = time2.tv_sec-time1.tv_sec + 1.0*(time2.tv_nsec-time1.tv_nsec)/1000000000;
  printf("Done Const: %e\n", time/loopsC);

  // Average time per global-memory copy, plus bandwidth in MiB/s.
  UpdateMem(d_a, a, size);
  clock_gettime(CLOCK_REALTIME, &time1);
  for(int i=0; i<loopsD; i++)
    UpdateMem(d_a, a, size);
  clock_gettime(CLOCK_REALTIME, &time2);
  time = time2.tv_sec-time1.tv_sec + 1.0*(time2.tv_nsec-time1.tv_nsec)/1000000000;
  printf("Done Mem: %e (BW: %e)\n", time/loopsD, loopsD*size*sizeof(int)/time/1024/1024);

  return 0;
}

The following code is another example of a severe performance reduction (2x-3x) when activating ECC.

It is basically a random device-memory read plus a coalesced write (some data reordering I need in my code).

The bad thing is that when I use more GPUs for a given simulation, this part of the code is needed more often.

As a consequence, ECC has an increasing impact as the number of GPUs grows, which reduces scaling performance significantly.

I compiled again with nvcc -arch=sm_20 check_ecc.cu -lrt and ran with a.out -nmax 100000 -first 80000 -nlist 10000 -nloop 10000. The results are:

nmax     nfirst   nlist   ECC   noECC
100000   80000    10000   6.1   3.2
100000   90000    2000    4.5   1.8

Any thoughts? Is this expected behaviour? The usual documents mention something like a 20-25% performance reduction, with a larger hit for scattered writes (which do not occur here, right?).

Cheers

Christian

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <ctime>

// Gather three floats per element from random positions given by 'list'
// (uncoalesced reads) and write them to a contiguous region (coalesced writes).
__global__ void Move_Kernel(int* list, int n, float* x, int nmax, int first)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if(i<n)
  {
    int j = list[i];
    x[i+first]        = x[j];
    x[i+first+nmax]   = x[j+nmax];
    x[i+first+2*nmax] = x[j+2*nmax];
  }
}

int main(int narg, char** arg)
{
  int nmax=100000;
  int first=80000;
  int nlist=10000;
  int nloop=10000;
  for(int i=0; i<narg; i++)
  {
    if(strstr(arg[i],"-nmax"))  nmax=atoi(arg[++i]);
    if(strstr(arg[i],"-first")) first=atoi(arg[++i]);
    if(strstr(arg[i],"-nlist")) nlist=atoi(arg[++i]);
    if(strstr(arg[i],"-nloop")) nloop=atoi(arg[++i]);
  }
  printf("%i %i %i\n", nmax, first, nlist);
  if(nmax<first+nlist) exit(0);

  float* x=new float[3*nmax];
  int* list=new int[nlist];
  float* d_x;
  int* d_list;
  cudaMalloc(&d_x, sizeof(float)*3*nmax);
  cudaMalloc(&d_list, sizeof(int)*nlist);

  // Fill the index list with random source positions.
  srand(19181);
  for(int i=0; i<nlist; i++)
    list[i]=rand() % first;
  cudaMemcpy(d_list, list, sizeof(int)*nlist, cudaMemcpyHostToDevice);

  // Warm-up launch.
  Move_Kernel<<<(nlist+127)/128,128>>>(d_list, nlist, d_x, nmax, first);
  cudaThreadSynchronize();

  // Time nloop kernel launches (average seconds per launch).
  timespec time1, time2;
  clock_gettime(CLOCK_REALTIME, &time1);
  for(int i=0; i<nloop; i++)
  {
    Move_Kernel<<<(nlist+127)/128,128>>>(d_list, nlist, d_x, nmax, first);
    cudaThreadSynchronize();
  }
  clock_gettime(CLOCK_REALTIME, &time2);
  double time = time2.tv_sec-time1.tv_sec + 1.0*(time2.tv_nsec-time1.tv_nsec)/1000000000;
  printf("Done Time: %e\n", time/nloop);
}

Batch them using async calls and synchronize only at the end.
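For the constant updates, a minimal sketch of what that could look like, assuming the per-call values can be staged in a pinned host buffer and a single stream is acceptable (the buffer name h_vals and the loop structure are just illustrative and mirror the benchmark above, not the production code):

#include <cstdio>
#include "cuda_runtime.h"

__device__ __constant__ int c;

int main()
{
  const int loops = 10000;

  // Pinned host staging buffer so the async copies do not fall back to
  // synchronous behaviour; one slot per copy avoids overwriting a value
  // that may still be in flight.
  int* h_vals;
  cudaMallocHost((void**)&h_vals, loops*sizeof(int));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  for(int i=0; i<loops; i++)
  {
    h_vals[i] = i;
    cudaMemcpyToSymbolAsync(c, &h_vals[i], sizeof(int), 0,
                            cudaMemcpyHostToDevice, stream);
  }

  // Synchronize once at the end instead of implicitly after every copy.
  cudaStreamSynchronize(stream);
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));

  cudaStreamDestroy(stream);
  cudaFreeHost(h_vals);
  return 0;
}

In real code, each async copy would presumably be followed by the kernel launch that consumes the value, issued into the same stream so ordering stays correct; how much of the per-copy ECC latency this hides depends on how many copies can queue up before a synchronization is required.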

That's for the cudaMemcpyToSymbol calls, right? Any thoughts on the random-read scenario?

Cheers

Christian