Double precision and threads per block: slightly different results in DP calculations when changing threads per block

Hello,

I have one question about DP math and CUDA.

My kernel produces an array of 20 million doubles. When I change the number of threads per block I get slightly different results compared to other threads/block setups.

I have dumped the buffers for 128 thr/block and 256 thr/block to files as binary and compared them - there are differences: a few thousand bytes out of 160 MB.

When I dumped those doubles as strings with 10 decimals, both files (for 128 thr/block and 256 thr/block) were identical. But when I dumped them with 16 decimal digits, differences appeared.

Results for a fixed threads/block configuration are identical to the last decimal place.
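The comparison itself is nothing fancy; below is a minimal sketch of that kind of check (file names and the element count are illustrative, not the real project paths). At 10 decimals the differing values print identically, at 17 significant digits the last-bit differences show up.

    // Minimal host-side sketch of comparing two binary dumps of the result buffer
    // (file names and element count are illustrative).
    #include <cstdio>
    #include <cmath>
    #include <vector>

    int main()
    {
        const size_t N = 1024ULL * 1024ULL * 20ULL; // ~20 million doubles, ~160 MB
        std::vector<double> a(N), b(N);

        FILE *fa = fopen("results_128tpb.bin", "rb"); // dump from the 128 thr/block run
        FILE *fb = fopen("results_256tpb.bin", "rb"); // dump from the 256 thr/block run
        if (!fa || !fb) { fprintf(stderr, "cannot open dump files\n"); return 1; }
        if (fread(a.data(), sizeof(double), N, fa) != N ||
            fread(b.data(), sizeof(double), N, fb) != N)
        { fprintf(stderr, "short read\n"); return 1; }
        fclose(fa); fclose(fb);

        size_t nDiff = 0;
        double maxAbsDiff = 0.0;
        for (size_t i = 0; i < N; i++)
        {
            if (a[i] != b[i])
            {
                nDiff++;
                maxAbsDiff = fmax(maxAbsDiff, fabs(a[i] - b[i]));
                if (nDiff <= 5) // a few examples: identical at 10 decimals, different at 17 digits
                    printf("i=%zu  %.10f vs %.10f  |  %.17g vs %.17g\n", i, a[i], b[i], a[i], b[i]);
            }
        }
        printf("%zu of %zu elements differ, max abs diff = %.17g\n", nDiff, N, maxAbsDiff);
        return 0;
    }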

Since the calculations are independent of the thread setup, I'm puzzled what is causing these slight DP precision changes?

Do you guys have any idea what is causing this?

Here are some thread configurations and the corresponding hash results:

32 thread/block

    4096x32 9a8245b0899750daa22d2523331c3d72cc0f7350

    8192x32 9a8245b0899750daa22d2523331c3d72cc0f7350

64 thread/block

    1024x64 d14316cd1a5f53f3b07041b2bb03b684fbb58aee

    8192x64 d14316cd1a5f53f3b07041b2bb03b684fbb58aee

128 thread/block

    2048x128 12ce58c340077083deb0544665ed37581c02f76a

    4096x128 12ce58c340077083deb0544665ed37581c02f76a

    8192x128 12ce58c340077083deb0544665ed37581c02f76a

256 thread/block

    1024x256 3dbac7ae03cbdbaf134a922be07c7a22c8e00e46

    2048x256 3dbac7ae03cbdbaf134a922be07c7a22c8e00e46

    4096x256 3dbac7ae03cbdbaf134a922be07c7a22c8e00e46

many thanks

Mirko

This seems very weird unless there is some way that the thread configuration can impact the order of operations in your kernel. Can you say more about how each thread computes its array values?

Thanks Seibert,

There are 1024x1024x20 doubles in total. Those doubles are stored in a one-dimensional array in global memory, initialized to zero at the beginning of the calculation. Let’s call this array resultArray.

Every iteration calculates 20 doubles in local memory and stores them back to their position in resultArray. There is no need for any atomic updates of resultArray since every iteration calculates a unique set of 20 doubles.

If the program launches 1024 blocks with 64 threads, each thread calculates 16x20 doubles, i.e. 16 iterations: the grid contains 1024x64 = 65536 threads, the total number of iterations is 1024x1024 = 1048576, so the number of iterations per thread is 1048576/65536 = 16.
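Stripped down to just the work distribution, the indexing looks like the sketch below (the real kernel is posted right after this; the sketch only zeroes each scenario's 20 output slots to show which thread owns which part of resultArray).

    // Sketch of the work distribution only - not the real kernel, which follows below.
    // With 1024 blocks x 64 threads, nstride = 65536, so each thread handles scenarios
    // tid, tid + 65536, tid + 2*65536, ... : 16 scenarios of 2*c_nMarkets = 20 doubles each.
    __global__ void WorkDistributionSketch(int startscenario, int endscenario,
                                           int c_nMarkets, double *resultArray)
    {
        unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int nstride = gridDim.x * blockDim.x;

        for (unsigned int scenario = startscenario + tid; scenario < endscenario; scenario += nstride)
        {
            unsigned long long base = (unsigned long long)scenario * 2 * c_nMarkets;
            for (int i = 0; i < 2 * c_nMarkets; i++)
                resultArray[base + i] = 0.0; // each scenario owns its own, non-overlapping 20 slots
        }
    }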

This is the actual kernel.

It has the following input parameters:

  1. startscenario - start of the range, i.e. 0

  2. endscenario - end of the range, i.e. 1048576

  3. c_nMarkets - number of markets, i.e. 10

  4. etc…

  5. etc…

  6. double *stream - OUTPUT array (resultArray)

  7. double *stakes - pre-calculated stakes - same for every kernel

  8. unsigned long long *devcombinations - pre-calculated combinations - same for every kernel

__global__ void CalculateFullScenario(int startscenario, int endscenario, int c_nMarkets,
                                      int c_nSelections, int c_nLegs, int c_nTotalComb,
                                      double *stream, double *stakes,
                                      unsigned long long *devcombinations)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nstride = gridDim.x * blockDim.x;
    unsigned int scenario = startscenario + tid;

    register unsigned long long int combinations = 0;
    unsigned int nSelVar = pow(1.0 * c_nSelections, c_nLegs);
    unsigned long long globalindexA = 0;
    int ii = 0;

    register unsigned long long int outcome = 0;
    register unsigned long long int LegVariation = 0;

    double localresults[32] = {0.0};

    while (scenario < endscenario) // repeat until you reach the final number, i.e. 1048576
    {
        globalindexA = scenario * 2 * c_nMarkets; // starting index in the global array (stream), c_nMarkets = 10

        ii = 0;
        while (ii < c_nTotalComb) // start calculation for one scenario; in our case it repeats 252*243 = 61236 times
        {
            combinations = devcombinations[ii]; // take a value from global memory; every thread reads them in the same order, 0..252 in this test case

            outcome = 0;
            for (int jk = 0; jk < c_nLegs; jk++)
                outcome = outcome | ((((scenario >> (((combinations >> (jk * 4)) & 15ULL))*2)) & 3ULL) << (jk * 2)); // calculate some values based on scenario - it is TID based actually

            for (int varcnt = 0; varcnt < nSelVar; varcnt++)
            {
                int decimal = varcnt;
                double dStake = stakes[ii * nSelVar + varcnt];
                LegVariation = 0;

                for (int ix = 0; ix < c_nLegs; ix++)
                {
                    LegVariation |= (unsigned long long)(decimal % c_nSelections) << (ix * 2);
                    decimal = decimal / c_nSelections;
                }

                int nLoosers = 0;
                int nWinners = 0;
                int nVoids = 0;
                double dSumP = 0.0;
                double dPrice = 1.0;
                for (int jk = 0; jk < c_nLegs; jk++)
                {
                    int nOutcomePart = (outcome >> (jk * 2)) & 3ULL;
                    int nLegValue = (LegVariation >> (jk * 2)) & 3ULL;
                    if (nOutcomePart == 3)
                    {
                        nVoids++;
                    }
                    else if (nOutcomePart == nLegValue)
                    {
                        nWinners++;
                        dPrice = dPrice * aPrices[((combinations >> (jk * 4)) & 15ULL) * c_nSelections + nLegValue];
                        dSumP += aPrices[((combinations >> (jk * 4)) & 15ULL) * c_nSelections + nLegValue];
                    }
                    else
                    {
                        nLoosers++;
                        dPrice = 0;
                    }
                }

                if (nLoosers > 0)
                {
                    for (int jjk = 0; jjk < c_nLegs; jjk++)
                    {
                        int nOutcomePart = (outcome >> (jjk * 2)) & 3ULL;
                        int nLegValue = (LegVariation >> (jjk * 2)) & 3ULL;
                        if (nOutcomePart != 3 && nOutcomePart != nLegValue)
                        {
                            double value = (dStake / nLoosers);
                            localresults[((combinations >> (jjk * 4)) & 15) * 2] += value;
                        }
                    }
                }
                else if (nWinners > 0)
                {
                    for (int jjk = 0; jjk < c_nLegs; jjk++)
                    {
                        int nOutcomePart = (outcome >> (jjk * 2)) & 3ULL;
                        int nLegValue = (LegVariation >> (jjk * 2)) & 3ULL;
                        if (nOutcomePart == nLegValue)
                        {
                            double value1 = (dStake / nWinners);
                            double value2 = (dStake / nWinners + dStake * (dPrice - 1) * aPrices[((combinations >> (jjk * 4)) & 15ULL) * c_nSelections + nLegValue] / dSumP);
                            localresults[((combinations >> (jjk * 4)) & 15) * 2] += value1;
                            localresults[((combinations >> (jjk * 4)) & 15) * 2 + 1] += value2;
                        }
                    }
                }
                else
                {
                    for (int jjk = 0; jjk < c_nLegs; jjk++)
                    {
                        double value1 = (dStake / nVoids);
                        localresults[((combinations >> (jjk * 4)) & 15) * 2] += value1;
                        localresults[((combinations >> (jjk * 4)) & 15) * 2 + 1] += value1;
                    }
                }
            }
            ii++;
        }

        for (int i = 0; i < 2 * c_nMarkets; i++)
        {
            stream[globalindexA + i] = localresults[i]; // store values back in global memory
            localresults[i] = 0.0; // reset localresults and prepare it for the next iteration
        }

        scenario += nstride; // advance to the next scenario
    }
}

Some values (stakes and devcombinations) are pre-calculated in other kernels before this kernel is called. Selection prices are stored in constant memory (aPrices).
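The aPrices table is roughly like this (illustrative only; the exact declaration in the project differs, and the size here just matches this test case):

    // Illustrative only - sizes just fit this test case: the market index is packed
    // in 4 bits (up to 16 markets) and c_nSelections = 3.
    __constant__ double aPrices[16 * 3];

    // filled from the host before launching the kernels, e.g.:
    // cudaMemcpyToSymbol(aPrices, hostPrices, sizeof(aPrices));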

It doesn’t seem to me that thread ordering should have any impact on the calculation.

Basically this kernel calculates 642105999360 doubles for the given test case (c_nMarkets = 10, c_nLegs = 5, c_nSelections = 3).

I don’t know if this bunch of words made things any clearer to you :)

regards

Mirko

This issue occurs on the Tesla-based C1060 card but it does not occur on the Fermi-based GTX 580. I will retest it on a GTX 285 and a GTX 570.

regards
Mirko

This is a minor issue, but it surprised me:

double localresults[32] = {0.0};

Does this really initialize all elements of the array to zero?

I have checked it in NVIDIA Parallel Nsight and it shows all elements initialized to zero.

All elements get reset to zero at the end of each iteration as well.

M.

Just to be really sure, could you change it to explicitly set all 32 elements to zero? (I can’t find a reference that says an array initializer with only one element initializes all the elements…) Uninitialized memory could easily result in strange, inconsistent behavior on different devices. It might not help, but it would make your code less surprising. :)

Hello Seibert,

Thank you very much for your help.

You were right. Explicit initialization of localresults did the trick.

With explicit initialization to 32 zeros, the Tesla produces exactly the same result as the Fermi for any thread configuration.

Correct me if I’m wrong, but isn’t it covered by the C99 standard that if you specify fewer initializers in braces, the rest of the elements are initialized to zero? (http://c0x.coding-guidelines.com/6.7.8.html)

thanks

Mirko

EDIT: I would classify this as an nvcc bug. I’m still confused how, with the implicit initialization, different numbers of blocks at the same threads/block still managed to produce identical results.

By my reading of the ANSI C99 standard, this

double localresults[32] = {0.0};

should be equivalent to this

double localresults[32] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

The applicable section of the standard appears to be:

6.7.8 Initialization

[...]

21 If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate, or fewer characters in a string literal used to initialize an array of known size than there are elements in the array, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration.

The 1998 ANSI C++ standard contains equivalent language in section 8.5.1 paragraph 7. If you have traced the undesired behavior in your code conclusively to this variable initialization, I would recommend filing a compiler bug, attaching a self-contained repro app. The smaller the repro code the easier it will be for the compiler team to track down what’s going on. Please also note the difference in behavior you saw across various platforms in the bug report, as this may provide valuable clues as to where the problem may be. Thanks.
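For example, the repro could be as small as a kernel that relies on the suspect initializer and reports any element that did not come up as zero. The sketch below is purely a hypothetical shape, not code from this thread; note that a check this trivial may get optimized away, so the actual repro will probably have to mirror your kernel's access pattern more closely.

    // Hypothetical shape of a self-contained repro - not the code from this thread.
    // Counts array elements that were not zero-initialized by "double a[32] = {0.0};".
    #include <cstdio>

    __global__ void InitCheck(int *badCount)
    {
        double a[32] = {0.0}; // the initializer under suspicion
        for (int i = 0; i < 32; i++)
            if (a[i] != 0.0)
                atomicAdd(badCount, 1);
    }

    int main()
    {
        int *d_badCount;
        int h_badCount = 0;
        cudaMalloc(&d_badCount, sizeof(int));
        cudaMemcpy(d_badCount, &h_badCount, sizeof(int), cudaMemcpyHostToDevice);

        InitCheck<<<1024, 128>>>(d_badCount);
        cudaDeviceSynchronize();

        cudaMemcpy(&h_badCount, d_badCount, sizeof(int), cudaMemcpyDeviceToHost);
        printf("elements not zero-initialized: %d\n", h_badCount);
        cudaFree(d_badCount);
        return 0;
    }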

Many thanks for the advice, njuffa. Where can I file a compiler bug? Can I file it here on the forums?

Thank you for your help. You need to be a registered developer to file a bug. The forums are intended to provide a CUDA developer community, not a bug reporting channel. Signing up as a registered developer should be a fairly streamlined process these days.

OK, thanks again. I’m a registered developer. I’m finishing the test cases and test scripts to reproduce this behavior. I will pack the complete project (source + binaries + test case scripts).

I will generate result sets for the following thread configurations:

1024 blocks 64 threads

1024 blocks 128 threads

2048 blocks 64 threads

8192 blocks 128 threads

for both cases

a) without explicitly initializing all elements to zero i.e.

double localresults[32] = {0.0};

b) with explicitly initializing all elements to zero i.e.

double localresults[32] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

                           0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

Then I will calculate the SHA1 of the output array and dump the results to files:

a) results.bin - raw binary dump of result array

b) results.txt - formatted file with 20 doubles per row

and run a file compare against the reference files.

This will demonstrate that with implicit array initialization the result depends on the thread configuration, while with an explicitly initialized array the result is the same no matter what thread configuration is used.
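The dump and hash step itself is plain host code, roughly like the sketch below (illustrative only: file names are placeholders, and here I use OpenSSL’s SHA1 for the hash; resultHost is the buffer copied back from the device with cudaMemcpy).

    // Host-side sketch of the dump/verify step (illustrative; assumes OpenSSL for SHA1,
    // link with -lcrypto). resultHost is the result buffer copied back from the device.
    #include <cstdio>
    #include <openssl/sha.h>

    void DumpAndHash(const double *resultHost, size_t n, int nMarkets)
    {
        // a) raw binary dump of the result array
        FILE *fbin = fopen("results.bin", "wb");
        fwrite(resultHost, sizeof(double), n, fbin);
        fclose(fbin);

        // b) formatted dump, 2 * nMarkets = 20 doubles per row
        FILE *ftxt = fopen("results.txt", "w");
        for (size_t i = 0; i < n; i++)
            fprintf(ftxt, "%.17g%c", resultHost[i], ((i + 1) % (2 * nMarkets) == 0) ? '\n' : ' ');
        fclose(ftxt);

        // SHA1 over the raw bytes, printed as hex
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA1(reinterpret_cast<const unsigned char *>(resultHost), n * sizeof(double), digest);
        for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
            printf("%02x", digest[i]);
        printf("\n");
    }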

M.

POST REMOVED by ME :)