Passing textures to a __global__ function

I’m trying to pass two textures to one of my global functions, but doing so gets a compiler error that I cannot trace down.

An abstract of the code would be:

__global__ void ApplyEfrcSplines(texture<int2, 1, cudaReadModeElementType> ErfcIndexTable,
                                 texture<float4, 1, cudaReadModeElementType> ErfcCoeffsTable,
                                 float* r2, float* results)
{
  int i, cidx;
  int2 eidx;
  float tr2;
  float4 ecoef;
  unsigned int i16r2, iSr2;

  i = threadIdx.x + blockIdx.x*gridDim.x;
  while (i < 256*37) {
    tr2 = r2[i];
    i16r2 = trunc((float)16.0*tr2);
    iSr2 = i16r2 >> 5;
    eidx = tex1Dfetch(ErfcIndexTable, iSr2);
    cidx = eidx.x + (i16r2 >> eidx.y);
    ecoef = tex1Dfetch(ErfcCoeffsTable, cidx);
    results[i] = (ecoef.x * tr2) + ecoef.y + (ecoef.z / tr2) + (ecoef.w / (tr2*tr2));
    i += blockDim.x*gridDim.x;
  }
}

int main()
{
 float* testpts;
  float* results;
  float* gpu_testpts;
  float* gpu_results;
  int2* ErfcIndexTable;
  int2* gpu_ErfcIndexTable;
  float4* ErfcCoeffsTable;
  float4* gpu_ErfcCoeffsTable;
  texture<int2, 1, cudaReadModeElementType> texErfcIndexTable;
  texture<float4, 1, cudaReadModeElementType> texErfcCoeffsTable;

  // Compute the indexing and coefficients tables                                               
  ErfcIndexTable = MakeIndexTable();
  ErfcCoeffsTable = MakeCoeffsTable();

  // Make a list of r2 test points                                                              
  testpts = MakeTestPoints();
  results = (float*)malloc(256*37*sizeof(float));

  // Allocate memory on the GPU and upload the computed textures                                
  cudaMallocAdvise((void**)&gpu_ErfcIndexTable, 256*sizeof(int2), "main", "ErfcIndexTable");
  cudaMallocAdvise((void**)&gpu_ErfcCoeffsTable, 256*sizeof(float4), "main",
                   "ErfcCoeffsTable");
  cudaMallocAdvise((void**)&gpu_testpts, 256*37*sizeof(float), "main", "ErfcIndexTable");
  cudaMallocAdvise((void**)&gpu_results, 256*37*sizeof(float), "main", "ErfcIndexTable");
  cudaMemUpload(gpu_ErfcIndexTable, ErfcIndexTable, 256*sizeof(int2), "main",
		"ErfcIndexTable");
  cudaMemUpload(gpu_ErfcCoeffsTable, ErfcCoeffsTable, 256*sizeof(float4), "main",
		"ErfcCoeffsTable");
  cudaMemUpload(gpu_testpts, testpts, 256*37*sizeof(float), "main", "testpts");

  // CHECK                                                                                      
  int i;
  for (i = 0; i < 256*37; i++) {
    printf("%12.6lf\n", testpts[i]);
  }
  // END CHECK                                                                                  

  // Bind textures                                                                              
  texErfcIndexTable.normalized = 0;
  texErfcIndexTable.filterMode = cudaFilterModePoint;
  texErfcIndexTable.addressMode[0] = cudaAddressModeClamp;
  texErfcIndexTable.channelDesc.x = 32;
  texErfcIndexTable.channelDesc.y = 32;
  texErfcIndexTable.channelDesc.z = 0;
  texErfcIndexTable.channelDesc.w = 0;
  cudaBindTexture(NULL, texErfcIndexTable, gpu_ErfcIndexTable, 256*sizeof(int2));
  texErfcCoeffsTable.normalized = 0;
  texErfcCoeffsTable.filterMode = cudaFilterModePoint;
  texErfcCoeffsTable.addressMode[0] = cudaAddressModeClamp;
  texErfcCoeffsTable.channelDesc.x = 32;
  texErfcCoeffsTable.channelDesc.y = 32;
  texErfcCoeffsTable.channelDesc.z = 32;
  texErfcCoeffsTable.channelDesc.w = 32;
  cudaBindTexture(NULL, texErfcCoeffsTable, gpu_ErfcCoeffsTable, 256*sizeof(float4));

  // CHECK                                                                                      
  ApplyEfrcSplines<<<128, 128>>>(texErfcIndexTable, texErfcCoeffsTable, gpu_testpts,
                                 gpu_results);
  cudaMemDownload(results, gpu_results, 256*37*sizeof(float), "main", "results");
  // END CHECK                                                                                  

  // Evaluate the results                                                                       
  EvaluateErfcSplines(ErfcIndexTable, ErfcCoeffsTable, 0.4, 20.0);

return 0;
}

Ignoring for the moment what MakeIndexTable, MakeCoeffsTable, and some of the (OK, OK, lame) functions I have for encapsulating CUDA memory allocation and error checking are, I think that the relevant code is in there.

And, the error I’m getting is:

In file included from tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c:2:
/tmp/tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c:11: error: expected ‘,’ or ‘…’ before ‘::’ token
In file included from tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c:2:
/tmp/tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c: In function ‘void _device_stub__Z16ApplyEfrcSplines7textureI4int2Li1EL19cudaTextureReadMode0EES_I6float4Li1ELS1_0EEPfS5(_Z7textureI4int2Li1EL19cudaTextureReadMode0EE&)’:
/tmp/tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c:11: error: ‘::__par0’ has not been declared
/tmp/tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c:11: error: ‘::__par0’ has not been declared
/tmp/tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c:11: error: ‘::__par1’ has not been declared
/tmp/tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c:11: error: ‘::__par1’ has not been declared
/tmp/tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c:11: error: ‘__par2’ was not declared in this scope
/tmp/tmpxft_00002f86_00000000-4_Driver.cudafe1.stub.c:11: error: ‘__par3’ was not declared in this scope

Any help on this would be great. I’m working on one of the early scientific computing successes of NVIDIA’s CUDA (AMBER pmemd), and if I can get these textures in order my estimates (from simulating the same work with garbage numbers in the existing code) are that it will be a new leap forward in speed.

Maybe you need to study some texture sample codes?

http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-texture

You appear to be attempting to use texture references:

texture<int2, 1, cudaReadModeElementType> texErfcIndexTable;
texture<float4, 1, cudaReadModeElementType> texErfcCoeffsTable;

but:

  1. texture references are declared at global (or file) scope, not at main (or function) scope
  2. we don’t pass texture references as kernel arguments.

This blog also talks about differences between texture references and texture objects:

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility/

Indeed, it was the file scope that I needed to understand properly and the blog post makes that clear. After declaring them outside the function but in the file I’ve got things compiling (had to fix a couple of other bugs, like schizophrenia regarding what was texErfcIndex… and what was just ErfcIndex…).

The texture OBJECTS look much more like what I want to do, and indeed the reason I was trying to pass them as function arguments. The objects are much more the way I’d like to approach textures.

So you are using tabulated coefficients to approximate erfc() with splines because CUDA’s built-in math function is too slow? How much faster is that? What kind of accuracy do you achieve?

Ah, master Juffa! I had wondered if we might spar with our light sabers before my training is complete.

As I said, I’ve made some estimates by running pmemd.cuda with garbage numbers but doing the operations I would need in order to spline for d/dr [erfc®/r]. I’m not just splining for erfc()–if I was it would be pointless–but rather taking pmemd back to its Fortran roots and replacing the mammoth r^2-indexed spline with one indexed by something akin to log(r^2). When we just need forces, this appears to be a good bet. When we need forces and energies, I’m sticking to your (very clever) fasterfc() function.

The protocol is as follows:

  • Compute i16r2 = trunc(r2*(PMEFloat)16.0) [ the data typing is WAY important I’ve realized ]
  • Compute iSr2 = i16r2 >> 5
  • Read idx = ErfcIndexTable[iSr2], which gives an int2
  • Read coefs = ErfcCoeffsTable[idx.x + i16r2 >> idx.y]
  • Compute r2inv = 1.0/r2 (this is needed for Lennard-Jones calculations anyway)
  • Compute d_swtch_dx = coefs.xr2 + coefs.y + (coefs.z + coefs.wr2inv)*r2inv [ working on how to explicitly fuse the mults and adds there ]

Size of ErfcIndexTable with cut = 10.0 Angstroms ~ 400 bytes
Size of ErfcCoeffsTable with cut = 10.0 Angstroms ~ 2.7 kb

Size of ErfcIndexTable with cut = 22.0 Angstroms ~ 4.0 kb
Size of ErfcCoeffsTable with cut = 22.0 Angstroms ~ 4.0 kb

For r < 2.0, the interval covered by each set of spline coefficients is 0.0625 A^2. The interval covered by each set of spline coefficients for r > 16.0 is 8.0 A^2, and I don’t foresee anyone needing even that long a cutoff (it’s a switching function, after all, not the end of the world). As you can see above, by the time we’re out to 16A cutoff and beyond, the indexing table (which grows in size as the square of the cutoff) is already the size of the coefficients table and growing twice as fast, so I don’t make provisions for spline intervals any longer than 8.0 A^2.

I could express the indexing table, and i16r2 and iSr2, as short ints–i16r2 would be at most 16384 for a mammoth 32A cutoff, if you think that could go faster. It would be a very modest savings in memory for the lengths of cutoffs that anyone will USE, however, and my expectation would be that array indices want to see a 32-bit int. You surely know this stuff better than I do, though.

If you’re familiar with pmemd.cuda, d_swtch_dx (it should really be called d_swtch_dr) is the value we’re trying to get in 999/1000 time steps. Compared to your fasterfc() function, this will save us a handful of arithmetic operations, replace rsqrt(r2) with just 1.0/r2, and spare us the need to compute either exp2f() as happens in fasterfc() or the straight up exp() as is needed after fasterfc() to complete the derivative. The value we spline for will be ready out-of-the-box to get added to the Lennard-Jones force magnitude and then multiply through by dx, dy, dz to get the components acting on each atom. When we do forces, we don’t need 1/r, just 1/r2 and multiples thereof. When we need energies, we DO need 1/r and we need erfc®/r, not just d/dr [erfc®/r]. I could use the same indexing table and write another spline for erfc®/r, but it would be used on so few steps that I’m happy to save the texture memory.

Oh, and accuracy! As good as the floating point numbers that I’m doing this in. I compute the coefficients by least squares fitting in double precision, then reduce to 32-bit floating point precision. The doubles would be accurate to within a few parts per billion, but after reduction to float the precision is about 5 to 10 parts in a hundred million. Not bad for numbers that are only precise to one part in 8 million anyway, but last night I ginned up a system to fiddle with the last couple of bits in the significand of each coefficient and improve the error cancellation to increase the precision ~35% on average. I haven’t done a head-to-head comparison with fasterfc (or, perhaps more fairly, the number d_swtch_dx that comes out at the end), but my expectation is that I’m getting that number about as precisely as I can with 32-bit floats.

Funny (peculiar): I have no recollection of providing such a function to anyone. When and/or where was that? None of the fasterfc() version I can find on the internet look like code I wrote. Is my memory this hazy?

Not sure what you mean. That computation seems to naturally map to three fmaf() calls, which then turn directly into three FFMA instructions on the GPU (and any x86 or ARM processor that supports FMA, when using appropriate compiler switches):

d_swtch_dx = fmaf (r2inv, fmaf (coefs.w, r2inv, coefs.z), fmaf (coefs.x, r2, coefs.y));

[Later:] I see that Enenkel et. al. recommended the cubic spline approach based on their work on molecular dynamics on BlueGene: https://www.research.ibm.com/bluegene/jrd_2005/cust_math/cust-math-secure.pdf

[Even later:] I think I found the code you are referring to :-) I totally forgot about this quick one-off.

// Faster ERFC approximation courtesy of Norbert Juffa. NVIDIA Corporation
static __forceinline__ __device__ float fasterfc(float a) 
{
  /* approximate log(erfc(a)) with rel. error < 7e-9 */
  [...]
  return exp2f(t);
}

Here is your fasterfc as it now appears in the Amber code:

//---------------------------------------------------------------------------------------------
// cudaERFC: struct to hold the ten constants required by the fasterfc function below
//---------------------------------------------------------------------------------------------
struct cudaERFC
{
    PMEFloat c0;
    PMEFloat c1;
    PMEFloat c2;
    PMEFloat c3;
    PMEFloat c4;
    PMEFloat c5;
    PMEFloat c6;
    PMEFloat c7;
    PMEFloat c8;
    PMEFloat c9;
};
static __constant__ cudaERFC cERFC;

//---------------------------------------------------------------------------------------------
// SetkCalculatePMENonbondEnergyERFC: set constants for the fasterfc() function on the device.
//
// Arguments:
//   ewcoeff:    the Ewald coefficient, 1/(2*Gaussian sigma)
//---------------------------------------------------------------------------------------------
void SetkCalculatePMENonbondEnergyERFC(double ewcoeff)
{
  cudaERFC c;
  c.c0 = (PMEFloat)(-1.6488499458192755E-006 * pow(ewcoeff, 10.0));
  c.c1 = (PMEFloat)( 2.9524665006554534E-005 * pow(ewcoeff,  9.0));
  c.c2 = (PMEFloat)(-2.3341951153749626E-004 * pow(ewcoeff,  8.0));
  c.c3 = (PMEFloat)( 1.0424943374047289E-003 * pow(ewcoeff,  7.0));
  c.c4 = (PMEFloat)(-2.5501426008983853E-003 * pow(ewcoeff,  6.0));
  c.c5 = (PMEFloat)( 3.1979939710877236E-004 * pow(ewcoeff,  5.0));
  c.c6 = (PMEFloat)( 2.7605379075746249E-002 * pow(ewcoeff,  4.0));
  c.c7 = (PMEFloat)(-1.4827402067461906E-001 * pow(ewcoeff,  3.0));
  c.c8 = (PMEFloat)(-9.1844764013203406E-001 * ewcoeff * ewcoeff);
  c.c9 = (PMEFloat)(-1.6279070384382459E+000 * ewcoeff);
  cudaError_t status;
  status = cudaMemcpyToSymbol(cERFC, &c, sizeof(cudaERFC));
  RTERROR(status, "cudaMemcpyToSymbol: SetERFC copy to cERFC failed");
}

//---------------------------------------------------------------------------------------------
// __internal_fmad: encapsulates a call to __fmaf_rn (FMAD = Floating point Multiply-ADd), an
//                  intrinsic that returns a*b + c by the nomenclature given below.
//---------------------------------------------------------------------------------------------
static __forceinline__ __device__ float __internal_fmad(float a, float b, float c)
{
  return __fmaf_rn (a, b, c);
}

//---------------------------------------------------------------------------------------------
// Faster ERFC approximation courtesy of Norbert Juffa, NVIDIA Corporation
//
// Arguments:
//   a:     the argument to erfc(a)--take the complimentary error function of the number a
//---------------------------------------------------------------------------------------------
static __forceinline__ __device__ PMEFloat fasterfc(PMEFloat a)
{
  // Approximate log(erfc(a)) with rel. error < 7e-9
  PMEFloat t, x = a;
  t = cERFC.c0;
  t = __internal_fmad(t, x, cERFC.c1);
  t = __internal_fmad(t, x, cERFC.c2);
  t = __internal_fmad(t, x, cERFC.c3);
  t = __internal_fmad(t, x, cERFC.c4);
  t = __internal_fmad(t, x, cERFC.c5);
  t = __internal_fmad(t, x, cERFC.c6);
  t = __internal_fmad(t, x, cERFC.c7);
  t = __internal_fmad(t, x, cERFC.c8);
  t = __internal_fmad(t, x, cERFC.c9);
  t = t * x;
  return exp2f(t);
}

As long as what I’m doing maps cleanly by the compiler, I’m happy. The system I’ve got tests things on the GPU to see how they roll out, but I may still write my own intrinsics explicitly to make sure that I don’t get fmaf_rn when I’m testing it and fmaf_rz when it’s actually implemented in the inner loop over particles. I’m really just trying to follow the way that one flavor of fmaf is explicitly called by fasterfc().

Here’s the error I’m getting. If the relative error of fasterfc is 7e-9, it looks like I can actually exceed that with my splines, at least for the region where d/dr [erfc®/r] has any significant value.

Init. Error: error of the floating point spline coefficients as the come dumped out of double precision
Final Error: error of the optimized 32-bit coefficients

: the number of the trial that did the best (I just loop over x, z, and w perturbing them by up to three parts per 2^23, then find the optimal adjustment to y)

Std. Dev./Mean : standard deviation of the error divided by the mean value of d/dr [erfc®/r] for the applicable interval

Each line pertains to one set of cofficients. Not easy, but you can back out what they are. Line 1, r2 = [0.0625, 0.125); Line 2, r2 = [0.125, 0.1875); Line 32, r2 = [2.0, 2.125); Line 64, r2 = [6.0, 6.25); Line 96, r2 = [14.0, 14.5); Line 128, r2 = [30.0, 31.0). I’d have to divide the second column by the average value of d/dr [erfc®/r] to get something directly comparable to fasterfc(), but it looks like I’m meeting or exceeding 7.0e-9 relative error for at least the first 48 intervals. After that target diminishes a great deal and the rate of diminution get a bit ahead of my basis functions, so I lose a bit of relative accuracy but the absolute error remains pretty low.

Init. Error Final Error ## Std. Dev./Mean
3.1434e-07 --> 3.1349e-07 (147) ( 1.0773e-09)
4.4618e-08 --> 4.4017e-08 (197) ( 2.9290e-10)
1.2725e-08 --> 1.2300e-08 (329) ( 1.1989e-10)
5.1317e-09 --> 4.5173e-09 ( 89) ( 5.7587e-11)
2.4835e-09 --> 2.2609e-09 (254) ( 3.5439e-11)
1.8836e-09 --> 1.7257e-09 (159) ( 3.1992e-11)
1.2252e-09 --> 1.1127e-09 (125) ( 2.3754e-11)
1.0616e-09 --> 9.1099e-10 ( 56) ( 2.1963e-11)
1.4440e-09 --> 8.9719e-10 (222) ( 2.4069e-11)
1.0237e-09 --> 9.3979e-10 ( 33) ( 2.7730e-11)
1.4883e-09 --> 8.0993e-10 ( 22) ( 2.6039e-11)
1.2324e-09 --> 1.1068e-09 (145) ( 3.8476e-11)
1.4063e-09 --> 1.0897e-09 ( 86) ( 4.0692e-11)
1.6889e-09 --> 1.0780e-09 ( 24) ( 4.3009e-11)
1.5535e-09 --> 1.3324e-09 (265) ( 5.6524e-11)
1.4271e-09 --> 1.0203e-09 ( 46) ( 4.5835e-11)
1.3610e-09 --> 1.1223e-09 ( 40) ( 5.3201e-11)
1.3561e-09 --> 9.9882e-10 ( 12) ( 4.9800e-11)
1.1715e-09 --> 1.1076e-09 (336) ( 5.7921e-11)
1.4653e-09 --> 1.0862e-09 ( 15) ( 5.9425e-11)
1.2562e-09 --> 1.1893e-09 (111) ( 6.7920e-11)
1.5633e-09 --> 1.2659e-09 ( 28) ( 7.5307e-11)
1.4169e-09 --> 1.2246e-09 (328) ( 7.5742e-11)
1.4044e-09 --> 1.2613e-09 ( 93) ( 8.0968e-11)
1.5548e-09 --> 1.2667e-09 (167) ( 8.4270e-11)
1.5924e-09 --> 1.2171e-09 (125) ( 8.3783e-11)
2.0641e-09 --> 1.2761e-09 ( 64) ( 9.0782e-11)
2.0252e-09 --> 1.5488e-09 (323) ( 1.1373e-10)
2.2901e-09 --> 1.3862e-09 ( 87) ( 1.0494e-10)
1.8783e-09 --> 1.4628e-09 ( 35) ( 1.1404e-10)
2.0393e-09 --> 1.6082e-09 ( 11) ( 1.2899e-10)
1.6911e-09 --> 1.2941e-09 (141) ( 1.0669e-10)
1.7648e-09 --> 1.2845e-09 ( 6) ( 1.0876e-10)
1.5111e-09 --> 1.3358e-09 ( 87) ( 1.1606e-10)
1.6526e-09 --> 1.2796e-09 ( 30) ( 1.1400e-10)
2.1484e-09 --> 1.3482e-09 ( 23) ( 1.2306e-10)
1.9572e-09 --> 1.4248e-09 (294) ( 1.3316e-10)
1.8583e-09 --> 1.3072e-09 (123) ( 1.2501e-10)
1.5261e-09 --> 1.2460e-09 ( 61) ( 1.2185e-10)
1.8207e-09 --> 1.4365e-09 (301) ( 1.4357e-10)
1.9779e-09 --> 1.4548e-09 (182) ( 1.4852e-10)
1.9542e-09 --> 1.5293e-09 (253) ( 1.5939e-10)
2.0981e-09 --> 1.2433e-09 (203) ( 1.3223e-10)
2.0980e-09 --> 1.2058e-09 ( 90) ( 1.3079e-10)
1.6998e-09 --> 1.1160e-09 (258) ( 1.2341e-10)
1.7884e-09 --> 1.5046e-09 ( 90) ( 1.6954e-10)
1.8466e-09 --> 1.3442e-09 (107) ( 1.5428e-10)
2.7623e-09 --> 1.9834e-09 (100) ( 2.3178e-10)
3.1721e-09 --> 2.0996e-09 (235) ( 2.4972e-10)
2.7273e-09 --> 1.6913e-09 (262) ( 2.0467e-10)
2.9431e-09 --> 1.8863e-09 (186) ( 2.3215e-10)
2.3184e-09 --> 2.1173e-09 (138) ( 2.6494e-10)
2.5919e-09 --> 2.0864e-09 (128) ( 2.6536e-10)
2.7148e-09 --> 2.1283e-09 ( 2) ( 2.7503e-10)
2.9566e-09 --> 2.2516e-09 (300) ( 2.9556e-10)
2.8850e-09 --> 2.3750e-09 (223) ( 3.1658e-10)
2.5070e-09 --> 1.9651e-09 ( 10) ( 2.6592e-10)
3.2062e-09 --> 2.7766e-09 (122) ( 3.8135e-10)
2.1137e-09 --> 1.5225e-09 (288) ( 2.1217e-10)
2.8201e-09 --> 1.7518e-09 ( 28) ( 2.4764e-10)
2.2678e-09 --> 1.9365e-09 (136) ( 2.7764e-10)
3.0133e-09 --> 1.8456e-09 (214) ( 2.6829e-10)
3.5994e-09 --> 2.0041e-09 (306) ( 2.9531e-10)
4.5060e-09 --> 2.0189e-09 (278) ( 3.0151e-10)
3.3577e-09 --> 2.4073e-09 (218) ( 3.6429e-10)
3.5218e-09 --> 2.1379e-09 ( 51) ( 3.2774e-10)
3.2838e-09 --> 2.4032e-09 (168) ( 3.7314e-10)
3.4635e-09 --> 2.1898e-09 (212) ( 3.4431e-10)
3.4757e-09 --> 2.7280e-09 ( 41) ( 4.3429e-10)
3.6044e-09 --> 2.7823e-09 (226) ( 4.4838e-10)
3.4396e-09 --> 2.9117e-09 (299) ( 4.7491e-10)
5.6013e-09 --> 2.8373e-09 ( 40) ( 4.6831e-10)
2.9719e-09 --> 2.1181e-09 (328) ( 3.5372e-10)
3.8681e-09 --> 2.2141e-09 (320) ( 3.7405e-10)
4.7534e-09 --> 2.1589e-09 (135) ( 3.6889e-10)
2.9467e-09 --> 2.2819e-09 (104) ( 3.9431e-10)
3.9112e-09 --> 2.3830e-09 (122) ( 4.1638e-10)
4.5938e-09 --> 2.1545e-09 (190) ( 3.8059e-10)
4.1953e-09 --> 2.3918e-09 ( 29) ( 4.2710e-10)
3.8552e-09 --> 2.2796e-09 (167) ( 4.1143e-10)
6.7325e-09 --> 2.7216e-09 (221) ( 4.9640e-10)
3.7623e-09 --> 2.6993e-09 ( 67) ( 4.9747e-10)
2.9020e-09 --> 2.4771e-09 (310) ( 4.6123e-10)
4.2443e-09 --> 2.4486e-09 (188) ( 4.6059e-10)
5.4001e-09 --> 2.0725e-09 ( 0) ( 3.9376e-10)
6.8210e-09 --> 2.1336e-09 (105) ( 4.0941e-10)
2.8506e-09 --> 2.2777e-09 (320) ( 4.4137e-10)
4.6237e-09 --> 2.8073e-09 (240) ( 5.4928e-10)
3.8416e-09 --> 2.4061e-09 (247) ( 4.7530e-10)
3.0601e-09 --> 2.0694e-09 (310) ( 4.1267e-10)
3.6324e-09 --> 2.2368e-09 ( 37) ( 4.5026e-10)
4.2412e-09 --> 2.6652e-09 (251) ( 5.4149e-10)
2.8227e-09 --> 2.4042e-09 ( 22) ( 4.9294e-10)
3.3220e-09 --> 2.3244e-09 (320) ( 4.8091e-10)
3.0816e-09 --> 2.6741e-09 ( 4) ( 5.5826e-10)
3.3154e-09 --> 2.8518e-09 (250) ( 6.0064e-10)
4.4901e-09 --> 2.3376e-09 (180) ( 4.9669e-10)
4.4469e-09 --> 3.0099e-09 ( 38) ( 6.4512e-10)
3.7499e-09 --> 2.7872e-09 (265) ( 6.0255e-10)
5.5215e-09 --> 4.8564e-09 (319) ( 1.0588e-09)
5.6186e-09 --> 5.0147e-09 (149) ( 1.1026e-09)
5.8466e-09 --> 5.2435e-09 (111) ( 1.1626e-09)
5.4114e-09 --> 4.7855e-09 ( 12) ( 1.0698e-09)
6.2728e-09 --> 6.1197e-09 (261) ( 1.3793e-09)
8.7793e-09 --> 5.2765e-09 ( 24) ( 1.1990e-09)
8.5436e-09 --> 6.8527e-09 ( 89) ( 1.5697e-09)
9.5902e-09 --> 6.5095e-09 (318) ( 1.5030e-09)
9.6415e-09 --> 7.3511e-09 (339) ( 1.7107e-09)
1.1773e-08 --> 7.2014e-09 (268) ( 1.6890e-09)
1.0707e-08 --> 8.3449e-09 (148) ( 1.9724e-09)
1.1359e-08 --> 8.1193e-09 (150) ( 1.9338e-09)
1.3362e-08 --> 7.9280e-09 (244) ( 1.9027e-09)
1.4226e-08 --> 8.9983e-09 (140) ( 2.1758e-09)
1.9533e-08 --> 1.0720e-08 ( 17) ( 2.6116e-09)
1.6381e-08 --> 1.0526e-08 ( 87) ( 2.5834e-09)
1.2775e-08 --> 1.1525e-08 (132) ( 2.8495e-09)
1.8099e-08 --> 1.2787e-08 (338) ( 3.1845e-09)
2.4530e-08 --> 1.3855e-08 (191) ( 3.4754e-09)
2.1664e-08 --> 1.5354e-08 (127) ( 3.8791e-09)
2.2988e-08 --> 1.5258e-08 (121) ( 3.8822e-09)
2.2156e-08 --> 1.7657e-08 (236) ( 4.5243e-09)
2.9178e-08 --> 1.5962e-08 (122) ( 4.1187e-09)
3.1375e-08 --> 2.1073e-08 ( 22) ( 5.4753e-09)
3.4838e-08 --> 2.3826e-08 ( 70) ( 6.2333e-09)
3.7226e-08 --> 2.2116e-08 (239) ( 5.8254e-09)
3.7732e-08 --> 2.5274e-08 ( 22) ( 6.7022e-09)
4.2201e-08 --> 2.8086e-08 ( 42) ( 7.4982e-09)
5.2723e-08 --> 3.5166e-08 (124) ( 9.4509e-09)
4.9193e-08 --> 2.6694e-08 (258) ( 7.2217e-09)
4.2201e-08 --> 3.5546e-08 (230) ( 9.6796e-09)
4.1274e-08 --> 3.6738e-08 (149) ( 1.0069e-08)
4.9703e-08 --> 3.8330e-08 (250) ( 1.0574e-08)
1.5781e-07 --> 4.6875e-08 (214) ( 1.3014e-08)
8.5343e-08 --> 4.9357e-08 ( 24) ( 1.3791e-08)
8.5768e-08 --> 5.9663e-08 (260) ( 1.6776e-08)
1.3083e-07 --> 6.9799e-08 (194) ( 1.9749e-08)
7.4918e-08 --> 5.7078e-08 (204) ( 1.6251e-08)
9.1375e-08 --> 6.8937e-08 (332) ( 1.9748e-08)
1.7323e-07 --> 6.8124e-08 ( 98) ( 1.9636e-08)
1.8718e-07 --> 7.7021e-08 ( 53) ( 2.2336e-08)
1.7202e-07 --> 9.0428e-08 (335) ( 2.6383e-08)
1.4648e-07 --> 1.1615e-07 (124) ( 3.4092e-08)
2.2633e-07 --> 1.4247e-07 ( 30) ( 4.2068e-08)
3.0307e-07 --> 1.0856e-07 (160) ( 3.2245e-08)
3.0171e-07 --> 1.1748e-07 (150) ( 3.5100e-08)
1.7691e-07 --> 1.1560e-07 (163) ( 3.4741e-08)
1.7506e-07 --> 1.4182e-07 (171) ( 4.2868e-08)
2.0710e-07 --> 1.6426e-07 (150) ( 4.9938e-08)
2.2531e-07 --> 1.8668e-07 ( 29) ( 5.7080e-08)
1.9181e-07 --> 1.4761e-07 (136) ( 4.5391e-08)
2.3185e-07 --> 1.7474e-07 (261) ( 5.4040e-08)
4.3110e-07 --> 1.8425e-07 (184) ( 5.7302e-08)
4.7326e-07 --> 2.1579e-07 (278) ( 6.7487e-08)
3.1345e-07 --> 2.3429e-07 (316) ( 7.3681e-08)
3.4825e-07 --> 2.5356e-07 (315) ( 8.0180e-08)
4.1234e-07 --> 2.1408e-07 (326) ( 6.8068e-08)
5.6897e-07 --> 2.3746e-07 ( 46) ( 7.5915e-08)
6.1640e-07 --> 2.4222e-07 (194) ( 7.7855e-08)
3.5517e-07 --> 2.2052e-07 (246) ( 7.1263e-08)
7.8048e-07 --> 3.3005e-07 ( 27) ( 1.0723e-07)
6.0888e-07 --> 4.1853e-07 ( 81) ( 1.3670e-07)
9.8973e-07 --> 3.8919e-07 ( 93) ( 1.2779e-07)
7.4127e-07 --> 5.2642e-07 (132) ( 1.7376e-07)
4.5575e-07 --> 4.1007e-07 ( 67) ( 1.3606e-07)
1.0030e-06 --> 5.1136e-07 (176) ( 1.7055e-07)
8.4837e-07 --> 4.1775e-07 ( 20) ( 1.4006e-07)
1.2071e-06 --> 5.5080e-07 (103) ( 1.8561e-07)
1.0835e-06 --> 6.5317e-07 (185) ( 2.2123e-07)
1.1224e-06 --> 6.6885e-07 (206) ( 2.2769e-07)
1.5826e-06 --> 6.8932e-07 (104) ( 2.3584e-07)
1.8712e-06 --> 7.6983e-07 (234) ( 2.6471e-07)
9.1783e-07 --> 8.3636e-07 (158) ( 2.8903e-07)
1.3412e-06 --> 9.6348e-07 (293) ( 3.3461e-07)
1.2736e-06 --> 9.0916e-07 (117) ( 3.1731e-07)
1.3763e-06 --> 1.0063e-06 (213) ( 3.5293e-07)
2.3129e-06 --> 1.3588e-06 (103) ( 4.7890e-07)
2.0138e-06 --> 1.1252e-06 (119) ( 3.9848e-07)
3.1468e-06 --> 1.4532e-06 ( 52) ( 5.1715e-07)

Not sure what you mean. Unless you explicitly code an __fmaf_rz() in your code, you won’t get an FMA with truncation in your machine code. The __internal_fmad() was a convention I used in CUDA math library code to abstract sm_1x’s FMAD vs the FFMA available in later architectures, as fmaf() on sm_1x would have called dog slow emulation. CUDA no longer supports sm_1x (that was dropped years ago), so these days you can just code the standard math function fmaf() and be done.

The coefficients used in fastererfc() look like I dumped them straight out of the Remez minimax generator, they could certainly be carefully adjusted to optimize accuracy for single-precision evaluation. It might even be possible to shave a term off the polynomial. I might look into that when I get back home in three weeks.

The accuracy of the final result will be limited by the representational error of the IEEE-754 single-precision format at best (so think order of 1e-8), and in practice my code cannot get there because of inaccuracy in exp2f(). This should be clear when running an exhaustive test over the relevant argument range

That all sounds sensible. I’m going to put a ribbon on the coefficients evaluation and then get on to integrating this into pmemd. While I’ve got you on the line, I notice that the first allocation of memory on the GPU takes 2-3 seconds (it’s just allocating a few kb with cudaMalloc as far as I can see). After that things go pretty fast–I’ve got a lot of memory uploading and donwloading further down the line which I may try to get eliminate, but my kernel execution is fine. Is this normal behavior? Is there a lot of stuff that has to get primed as soon as anything wants to access the GPU that just counts as overhead? I feel like I kind of see this behavior with pmemd.cuda itself–there’s 5-10 seconds where it seems to be getting its ducks in a row but then it’s like WHOOSH.

First GPU memory allocation slow? Common question. CUDA runtime has lazy initialization. First API call triggers context creation, PTX code JIT compilation, etc. The overhead grows with the amount of memory in the system because CUDA needs to map all of system memory and all of GPU memory into a single unified address space. Lots of OS calls used for that, not much parallelism there. Use host with high single-thread performance for smallest overhead. CPUs with base core clocks > 3.5 GHz recommended (e.g. http://ark.intel.com/products/92994/Intel-Xeon-Processor-E5-1650-v4-15M-Cache-3_60-GHz)

(1) Put driver into persistent mode if on Linux (on Windows the driver never unloads so is persistent after system startup).
(2) Control where the context initialization kicks in by placing a cudaFree(0) in your code.
(3) Build fat binaries that include SASS for all targeted GPU architectures and PTX only for the latest one (for forward compatibility with future, as yet unknown GPU architecture). Never JIT compile unless you absolutely have to.

That makes sense; thanks for taking care of my common question–I didn’t think of specific keywords to search for. I will remember that about getting the GPU initialization started with a cudaFree(0) early on, but it sounds like other work on the host won’t be able to hide all that much of the GPU initialization if it’s all about having a fast CPU.

At this point, access to the GPU is driver based, and since the driver runs on the CPU, pretty much all the driver activity happens there. There are some data structures the driver creates in GPU memory, I think, but I am reasonably sure they are manipulated only from the CPU.

In a GPU accelerated system, the GPU should take care of most of the parallel portion of workloads, leaving the serial portion to the CPU. That means CPUs used in such systems should have very high single-thread performance to avoid bottlenecking on that serial portion. High system memory throughput can also help, that is why a four-channel DDR4 based memory subsystem is a good idea.

In the best of all possible future worlds, GPUs will become true processor peers to CPUs, and the OS will take care of all the work currently done by the CUDA driver.

OK guys (Norbert in particular). You were wondering how much faster the spline approach is than directly computing erfc(). It looks like the spline approach makes the code run about 5-7% faster overall, which is significant but not a game changer. As for the spline versus fasterfc, I’d estimate that the spline is about twice the speed of fasterfc + (other math needed to complete the number the spline gives straight away).

The success hinges on a different trick, though: I’m now only using the coefficients table, not an indexing table in front of it. What I do is one floating-point add to r2, then grab the int I need out of the result by doing two bit shifts on it. (I’m taking IEEE-754 for granted, but that’s not a big ask.) Much faster than float-to-int conversion and I roll out with exactly the index I need for the adaptively indexed spline table. I interlace two splines, in fact, one of them d/dr [erfc®/r - 1/r], so that I can eliminate some branching further down the line, which is also part of the success.

I wonder what effect on accuracy it would have to use polynomials in frac(16.0f*tr2) rather than tr2.
And if I had time on my hands (which I don’t unfortunately) I’d also try to create a function for d/dr [erfc®/r] rather than erfc© and see if that can be made faster than the spline (potentially optimizing it just for the range of the spline). Would be interesting to see if there really is a case where such table lookup beats calculation on the GPU.

Hi Tera,

My expectation is that, if you were to try and create a function like fasterfc() above for d/dr [erfc®/r] directly, it’d be hard, especially if you were to try and do it as any function of r2. log(erfc(a * r)), where a is a constant, is a very well-behaved curve as function of r. This was probably Norbert’s insight when he wrote fasterfc(). If you try to get that as a function of r2, or fold in a division by r to get the (log of the) energy function erfc(a * r)/r, or virtually anything else, I’d bet you get an ill-behaved hockey-stick plot that becomes very hard to attack with a polynomial. fasterfc() is a very well constructed function with a solid approach–I was only able to do better when I discovered the bit shifting trick on floating point numbers, and even then about half my success seems to come from the fact that my particular problem has branching down the line that I can eliminate by doing one add (take 2*(original index) + (logical evaluation of condition)) to modulate the spline table lookup.

I am back in town and have started looking into tweaked variants of fasterfc(). One problem is that I do not recall for what range of arguments fasterfc() was optimized. My test data suggests that the relevant range is [0,4].

With regard to accuracy, I will note that the error bound stated in the code applies to the approximation of log(erfc()) by a polynomial, not the results from fasterfc(), which incurs additional error: evaluation error when evaluating the polynomial approximation in finite precision, which is then amplified by the well-known error magnification inherent in any exponentiation, finally adding the error in exp2f() itself.

Preliminary experiments show that the degree of the polynomial used by fasterfc() can’t be reduced without a significant negative impact on accuracy, which leaves the tuning of the coefficients for best possible accuracy.

Welcome back!
Thanks for looking into the fasterfc() tune up. I’ve done some accuracy tests in addition to the speed tests for the splines I already posted. The upshot is this:

Mean unsigned error, spline method: 9.9576e-10
Mean unsigned error using fasterfc: 5.2608e-09

Please don’t take that as a dig at fasterfc()—rather, the error I’m posting is the error you get after evaluating rsqrt(r2), taking fasterfc®, then dividing by r3 and adding exp(a2*r2)/r2 to it. The spline gets that quantity straight away. I was surprised at how well the bit-optimized splines did to hammer out errors—I didn’t expect to get a 5-6 fold reduction. I can share more of the results outside of the thread if you like.

Have you looked into what happens if you replace expf() with __expf()? Computational error should increase but so should performance. Is ‘a2’ a constant or a variable? If the former, one might even switch to the direct use of exp2f().

As you are finding in your case, piecewise quadratic or cubic interpolation can be both more accurate and faster than “global” approximations. But in the context of GPU computing the required table lookups can be a significant drag on performance if there is any significant amount of “access divergence”, which is why use of tables usually is not indicated if more than about three different addresses are used across the 32 threads in a warp. That’s a heuristic I worked out in the Kepler time frame.

On the other hand, computing any function as exp(something) is usually a no-go area in terms of accuracy due to the error magnification issue inherent to exponentiation. It just so happens that in the given context, fasterfc() is accurate enough for the purposes of this code.

The available trade-offs make the superior choice a close call here, possibly also dependent on architecture.

That’s a very useful metric to keep in mind, and I’m grateful that you’re continuing to help me understand this. One other thing that may be helping me, I should bring up again, is that I’m able to use the splines to reduce thread divergence a bit–the code hit an if-then-else after the fasterfc® computation, and the else consists of adding 1/r3 (as rinv * r2inv, which have both already been calculated). I’ve traded thread divergence for more access divergence, making a second spline which has the 1/r3 already folded in, so I can skip the else part but I still have to do the if. This seemed to be of some help to me, but perhaps I need to reconsider where the performance really came from.

From your description, it seems that at least one side of the if-then-else contains just a couple of instructions, so I would expect the compiler to use if-conversion to convert that into non-branchy code. It has long been the case that with CUDA, programmers should not “sweat the small stuff” when it comes to thread divergence on data-dependent branches. You may want to examine the generated SASS to convince yourself that the branch got eliminated.

A useful source-level idiom to maximize the chances that the impact from potentially divergent branches is minimized is to compute the common-path result, then use an if-statement to conditional overwrite it with a special-case result. If I recall correctly, this technique was initially pointed out to me by Alex Fit-Florea around 2009, and it still seems to be a useful idiom with CUDA 8 based on some of my recent experiments. To be sure, this is one of those things that exploits compiler artifacts to some degree, so it is a good idea to always double check where performance is critical.

Over the years, I had several discussions with the CUDA compiler team trying to convince them to support likelihood attributes for branches to allow the compiler to optimize branches even better, to no avail. The heuristics used by the current CUDA compiler to deal with branches are pretty clever, bad choices are are from what I can see.