Double precision and threads per block: slightly different results in DP calculations when changing threads per block

Hello,

I have one question about DP math and CUDA.

My kernel produces an array of 20 million doubles. When I change the number of threads per block I get slightly different results compared to other threads/block setups.

I have dumped the buffers for 128 thr/block and 256 thr/block to files as binary and compared them - there are differences: a few thousand bytes out of 160 MB.

When I dumped those doubles as strings with 10 decimals, both files (for 128 thr/block and 256 thr/block) were identical. But when I dumped them with 16 decimal digits, differences appeared.

Results for a fixed threads/block configuration are identical to the last decimal place.
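The comparison itself is nothing fancy; below is a minimal sketch of that kind of check (file names and the element count are illustrative, not the real project paths). At 10 decimals the differing values print identically, at 17 significant digits the last-bit differences show up.

    // Minimal host-side sketch of comparing two binary dumps of the result buffer
    // (file names and element count are illustrative).
    #include <cstdio>
    #include <cmath>
    #include <vector>

    int main()
    {
        const size_t N = 1024ULL * 1024ULL * 20ULL; // ~20 million doubles, ~160 MB
        std::vector<double> a(N), b(N);

        FILE *fa = fopen("results_128tpb.bin", "rb"); // dump from the 128 thr/block run
        FILE *fb = fopen("results_256tpb.bin", "rb"); // dump from the 256 thr/block run
        if (!fa || !fb) { fprintf(stderr, "cannot open dump files\n"); return 1; }
        if (fread(a.data(), sizeof(double), N, fa) != N ||
            fread(b.data(), sizeof(double), N, fb) != N)
        { fprintf(stderr, "short read\n"); return 1; }
        fclose(fa); fclose(fb);

        size_t nDiff = 0;
        double maxAbsDiff = 0.0;
        for (size_t i = 0; i < N; i++)
        {
            if (a[i] != b[i])
            {
                nDiff++;
                maxAbsDiff = fmax(maxAbsDiff, fabs(a[i] - b[i]));
                if (nDiff <= 5) // a few examples: identical at 10 decimals, different at 17 digits
                    printf("i=%zu  %.10f vs %.10f  |  %.17g vs %.17g\n", i, a[i], b[i], a[i], b[i]);
            }
        }
        printf("%zu of %zu elements differ, max abs diff = %.17g\n", nDiff, N, maxAbsDiff);
        return 0;
    }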

Since the calculations are independent of the thread setup, I'm puzzled what is causing these slight DP precision changes?

Do you guys have any idea what is causing this?

Here are some thread configurations and the corresponding hash results:

32 thread/block

    4096x32 9a8245b0899750daa22d2523331c3d72cc0f7350

    8192x32 9a8245b0899750daa22d2523331c3d72cc0f7350

64 thread/block

    1024x64 d14316cd1a5f53f3b07041b2bb03b684fbb58aee

    8192x64 d14316cd1a5f53f3b07041b2bb03b684fbb58aee

128 thread/block

    2048x128 12ce58c340077083deb0544665ed37581c02f76a

    4096x128 12ce58c340077083deb0544665ed37581c02f76a

    8192x128 12ce58c340077083deb0544665ed37581c02f76a

256 thread/block

    1024x256 3dbac7ae03cbdbaf134a922be07c7a22c8e00e46

    2048x256 3dbac7ae03cbdbaf134a922be07c7a22c8e00e46

    4096x256 3dbac7ae03cbdbaf134a922be07c7a22c8e00e46

many thanks

Mirko

This seems very weird unless there is some way that the thread configuration can impact the order of operations in your kernel. Can you say more about how each thread computes its array values?

Thanks Seibert,

There are 1024x1024x20 doubles in total. Those doubles are stored in a one-dimensional array in global memory, initialized to zero at the beginning of the calculation. Let’s call this array resultArray.

Every iteration calculates 20 doubles in local memory and stores them back to their position in resultArray. There is no need for any atomic updates of resultArray since every iteration calculates a unique set of 20 doubles.

If the program launches 1024 blocks with 64 threads, each thread calculates 16x20 doubles, i.e. 16 iterations: the grid contains 1024x64 = 65536 threads, the total number of iterations is 1024x1024 = 1048576, so the number of iterations per thread is 1048576/65536 = 16.
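Stripped down to just the work distribution, the indexing looks like the sketch below (the real kernel is posted right after this; the sketch only zeroes each scenario's 20 output slots to show which thread owns which part of resultArray).

    // Sketch of the work distribution only - not the real kernel, which follows below.
    // With 1024 blocks x 64 threads, nstride = 65536, so each thread handles scenarios
    // tid, tid + 65536, tid + 2*65536, ... : 16 scenarios of 2*c_nMarkets = 20 doubles each.
    __global__ void WorkDistributionSketch(int startscenario, int endscenario,
                                           int c_nMarkets, double *resultArray)
    {
        unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int nstride = gridDim.x * blockDim.x;

        for (unsigned int scenario = startscenario + tid; scenario < endscenario; scenario += nstride)
        {
            unsigned long long base = (unsigned long long)scenario * 2 * c_nMarkets;
            for (int i = 0; i < 2 * c_nMarkets; i++)
                resultArray[base + i] = 0.0; // each scenario owns its own, non-overlapping 20 slots
        }
    }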

This is the actual kernel.

It has the following input parameters:

  1. startscenario - start of the range, i.e. 0

  2. endscenario - end of the range, i.e. 1048576

  3. c_nMarkets - number of markets, i.e. 10

  4. etc…

  5. etc…

  6. double *stream - OUTPUT array (resultArray)

  7. double *stakes - pre-calculated stakes - same for every kernel

  8. unsigned long long *devcombinations - pre-calculated combinations - same for every kernel

__global__ void CalculateFullScenario(int startscenario, int endscenario, int c_nMarkets,
                                      int c_nSelections, int c_nLegs, int c_nTotalComb,
                                      double *stream, double *stakes,
                                      unsigned long long *devcombinations)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nstride = gridDim.x * blockDim.x;
    unsigned int scenario = startscenario + tid;

    register unsigned long long int combinations = 0;
    unsigned int nSelVar = pow(1.0 * c_nSelections, c_nLegs);
    unsigned long long globalindexA = 0;
    int ii = 0;

    register unsigned long long int outcome = 0;
    register unsigned long long int LegVariation = 0;

    double localresults[32] = {0.0};

    while (scenario < endscenario) // repeat until you reach the final number, i.e. 1048576
    {
        globalindexA = scenario * 2 * c_nMarkets; // starting index in the global array (stream), c_nMarkets = 10

        ii = 0;
        while (ii < c_nTotalComb) // start calculation for one scenario; in our case it repeats 252*243 = 61236 times
        {
            combinations = devcombinations[ii]; // take a value from global memory; every thread reads them in the same order, 0..252 in this test case

            outcome = 0;
            for (int jk = 0; jk < c_nLegs; jk++)
                outcome = outcome | ((((scenario >> (((combinations >> (jk * 4)) & 15ULL))*2)) & 3ULL) << (jk * 2)); // calculate some values based on scenario - it is TID based actually

            for (int varcnt = 0; varcnt < nSelVar; varcnt++)
            {
                int decimal = varcnt;
                double dStake = stakes[ii * nSelVar + varcnt];
                LegVariation = 0;

                for (int ix = 0; ix < c_nLegs; ix++)
                {
                    LegVariation |= (unsigned long long)(decimal % c_nSelections) << (ix * 2);
                    decimal = decimal / c_nSelections;
                }

                int nLoosers = 0;
                int nWinners = 0;
                int nVoids = 0;
                double dSumP = 0.0;
                double dPrice = 1.0;
                for (int jk = 0; jk < c_nLegs; jk++)
                {
                    int nOutcomePart = (outcome >> (jk * 2)) & 3ULL;
                    int nLegValue = (LegVariation >> (jk * 2)) & 3ULL;
                    if (nOutcomePart == 3)
                    {
                        nVoids++;
                    }
                    else if (nOutcomePart == nLegValue)
                    {
                        nWinners++;
                        dPrice = dPrice * aPrices[((combinations >> (jk * 4)) & 15ULL) * c_nSelections + nLegValue];
                        dSumP += aPrices[((combinations >> (jk * 4)) & 15ULL) * c_nSelections + nLegValue];
                    }
                    else
                    {
                        nLoosers++;
                        dPrice = 0;
                    }
                }

                if (nLoosers > 0)
                {
                    for (int jjk = 0; jjk < c_nLegs; jjk++)
                    {
                        int nOutcomePart = (outcome >> (jjk * 2)) & 3ULL;
                        int nLegValue = (LegVariation >> (jjk * 2)) & 3ULL;
                        if (nOutcomePart != 3 && nOutcomePart != nLegValue)
                        {
                            double value = (dStake / nLoosers);
                            localresults[((combinations >> (jjk * 4)) & 15) * 2] += value;
                        }
                    }
                }
                else if (nWinners > 0)
                {
                    for (int jjk = 0; jjk < c_nLegs; jjk++)
                    {
                        int nOutcomePart = (outcome >> (jjk * 2)) & 3ULL;
                        int nLegValue = (LegVariation >> (jjk * 2)) & 3ULL;
                        if (nOutcomePart == nLegValue)
                        {
                            double value1 = (dStake / nWinners);
                            double value2 = (dStake / nWinners + dStake * (dPrice - 1) * aPrices[((combinations >> (jjk * 4)) & 15ULL) * c_nSelections + nLegValue] / dSumP);
                            localresults[((combinations >> (jjk * 4)) & 15) * 2] += value1;
                            localresults[((combinations >> (jjk * 4)) & 15) * 2 + 1] += value2;
                        }
                    }
                }
                else
                {
                    for (int jjk = 0; jjk < c_nLegs; jjk++)
                    {
                        double value1 = (dStake / nVoids);
                        localresults[((combinations >> (jjk * 4)) & 15) * 2] += value1;
                        localresults[((combinations >> (jjk * 4)) & 15) * 2 + 1] += value1;
                    }
                }
            }
            ii++;
        }

        for (int i = 0; i < 2 * c_nMarkets; i++)
        {
            stream[globalindexA + i] = localresults[i]; // store values back in global memory
            localresults[i] = 0.0; // reset localresults and prepare it for the next iteration
        }

        scenario += nstride; // advance to the next scenario
    }
}

Some values (stakes and devcombinations) are pre-calculated in other kernels before this kernel is called. Selection prices are stored in constant memory (aPrices).
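The aPrices table is roughly like this (illustrative only; the exact declaration in the project differs, and the size here just matches this test case):

    // Illustrative only - sizes just fit this test case: the market index is packed
    // in 4 bits (up to 16 markets) and c_nSelections = 3.
    __constant__ double aPrices[16 * 3];

    // filled from the host before launching the kernels, e.g.:
    // cudaMemcpyToSymbol(aPrices, hostPrices, sizeof(aPrices));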

It doesn’t seem to me that thread ordering should have any impact on the calculation.

Basically this kernel calculates 642105999360 doubles for the given test case (c_nMarkets = 10, c_nLegs = 5, c_nSelections = 3).

I don’t know if this bunch of words made things any clearer to you :)

regards

Mirko

This issue occurs on the Tesla-based C1060 card but it does not occur on the Fermi-based GTX 580. I will retest it on a GTX 285 and a GTX 570.

regards
Mirko

This is a minor issue, but it surprised me:

double localresults[32] = {0.0};

Does this really initialize all elements of the array to zero?

I have checked it in NVIDIA Parallel Nsight and it shows all elements initialized to zero.

All elements get reset to zero at the end of each iteration as well.

M.

Just to be really sure, could you change it to explicitly set all 32 elements to zero? (I can’t find a reference that says an array initializer with only one element initializes all the elements…) Uninitialized memory could easily result in strange, inconsistent behavior on different devices. It might not help, but it would make your code less surprising. :)

Hello Seibert,

Thank you very much for your help.

You were right. Explicit initialization of localresults did the trick.

With explicit initialization to 32 zeros, the Tesla produces exactly the same result as the Fermi for any thread configuration.

Correct me if I’m wrong, but isn’t it covered by the C99 standard that if you specify fewer initializers in braces, the rest of the elements are initialized to zero? (http://c0x.coding-guidelines.com/6.7.8.html)

thanks

Mirko

EDIT: I would classify this as an nvcc bug. I’m still confused how, with the implicit initialization, different numbers of blocks at the same threads/block still managed to produce identical results.

By my reading of the ANSI C99 standard, this

double localresults[32] = {0.0};

should be equivalent to this

double localresults[32] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

The applicable section of the standard appears to be:

6.7.8 Initialization

[...]

21 If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate, or fewer characters in a string literal used to initialize an array of known size than there are elements in the array, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration.

The 1998 ANSI C++ standard contains equivalent language in section 8.5.1 paragraph 7. If you have traced the undesired behavior in your code conclusively to this variable initialization, I would recommend filing a compiler bug, attaching a self-contained repro app. The smaller the repro code the easier it will be for the compiler team to track down what’s going on. Please also note the difference in behavior you saw across various platforms in the bug report, as this may provide valuable clues as to where the problem may be. Thanks.
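For example, the repro could be as small as a kernel that relies on the suspect initializer and reports any element that did not come up as zero. The sketch below is purely a hypothetical shape, not code from this thread; note that a check this trivial may get optimized away, so the actual repro will probably have to mirror your kernel's access pattern more closely.

    // Hypothetical shape of a self-contained repro - not the code from this thread.
    // Counts array elements that were not zero-initialized by "double a[32] = {0.0};".
    #include <cstdio>

    __global__ void InitCheck(int *badCount)
    {
        double a[32] = {0.0}; // the initializer under suspicion
        for (int i = 0; i < 32; i++)
            if (a[i] != 0.0)
                atomicAdd(badCount, 1);
    }

    int main()
    {
        int *d_badCount;
        int h_badCount = 0;
        cudaMalloc(&d_badCount, sizeof(int));
        cudaMemcpy(d_badCount, &h_badCount, sizeof(int), cudaMemcpyHostToDevice);

        InitCheck<<<1024, 128>>>(d_badCount);
        cudaDeviceSynchronize();

        cudaMemcpy(&h_badCount, d_badCount, sizeof(int), cudaMemcpyDeviceToHost);
        printf("elements not zero-initialized: %d\n", h_badCount);
        cudaFree(d_badCount);
        return 0;
    }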

Many thanks for the advice, njuffa. Where can I file a compiler bug? Can I file it here on the forums?

Thank you for your help. You need to be a registered developer to file a bug. The forums are intended to provide a CUDA developer community, not a bug reporting channel. Signing up as a registered developer should be a fairly streamlined process these days.

OK, thanks again. I’m a registered developer. I’m finishing the test cases and test scripts to reproduce this behavior. I will pack the complete project (source + binaries + test case scripts).

I will generate result sets for the following thread configurations:

1024 blocks 64 threads

1024 blocks 128 threads

2048 blocks 64 threads

8192 blocks 128 threads

for both cases

a) without explicitly initializing all elements to zero i.e.

double localresults[32] = {0.0};

b) with explicitly initializing all elements to zero i.e.

double localresults[32] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

Then I will calculate the SHA1 of the output array and dump the results to files:

a) results.bin - raw binary dump of result array

b) results.txt - formatted file with 20 doubles per row

and run a file compare against the reference files.

This will demonstrate that with implicit array initialization the result depends on the thread configuration, while with an explicitly initialized array the result is the same no matter what thread configuration is used.
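The dump and hash step itself is plain host code, roughly like the sketch below (illustrative only: file names are placeholders, and here I use OpenSSL’s SHA1 for the hash; resultHost is the buffer copied back from the device with cudaMemcpy).

    // Host-side sketch of the dump/verify step (illustrative; assumes OpenSSL for SHA1,
    // link with -lcrypto). resultHost is the result buffer copied back from the device.
    #include <cstdio>
    #include <openssl/sha.h>

    void DumpAndHash(const double *resultHost, size_t n, int nMarkets)
    {
        // a) raw binary dump of the result array
        FILE *fbin = fopen("results.bin", "wb");
        fwrite(resultHost, sizeof(double), n, fbin);
        fclose(fbin);

        // b) formatted dump, 2 * nMarkets = 20 doubles per row
        FILE *ftxt = fopen("results.txt", "w");
        for (size_t i = 0; i < n; i++)
            fprintf(ftxt, "%.17g%c", resultHost[i], ((i + 1) % (2 * nMarkets) == 0) ? '\n' : ' ');
        fclose(ftxt);

        // SHA1 over the raw bytes, printed as hex
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA1(reinterpret_cast<const unsigned char *>(resultHost), n * sizeof(double), digest);
        for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
            printf("%02x", digest[i]);
        printf("\n");
    }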

M.

POST REMOVED by ME :)