Different results using different memory types

So guys, I have two code snippets that give different results even though they perform the same computation.

P.S. I am using device emulation mode just to compare the results with the CPU version (I have different code that runs on the actual device, and it gives me the exact same values / errors).

Snippet 1 (Result Incorrect) (Result Achieved 0.0026914661)

float total = 0;

// Multiply column vector with our input vector
for (int i = 0; i < columnSize_; i++)
{
	total += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)] * shared_values[threadIdx.x*columnSize_ + i];
}

device_outputVector[0] += total;

OR

shared_buff[threadIdx.x] = 0;

__syncthreads();

// Multiply column vector with our input vector
for (int i = 0; i < columnSize_; i++)
{
	shared_buff[threadIdx.x] += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)] * shared_values[threadIdx.x*columnSize_ + i];
}

device_outputVector[0] += shared_buff[threadIdx.x];

Snippet 2 (Result Correct) (Result Achieved 0.0026918666)

for (int i = 0; i < columnSize_; i++)
{
	device_outputVector[0] += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)] * shared_values[threadIdx.x*columnSize_ + i];
}

Although the results are pretty close to each other, this minor discrepancy between them is causing havoc in my algorithm, and I can't seem to pinpoint what's wrong.

device_outputVector is accessed by all threads without any atomics in place…

As I said, I am running that sample code in device emulation mode, where each thread runs sequentially. I have different code for the actual CUDA architecture, which makes use of warp-level reductions and atomic float addition, yet I still get the same results I observe with the code posted above.

What I didn’t mention is that there are multiple threads per block (256 threads) and multiple blocks in a single dimension (300+).

So I tried a new piece of code and it gives the correct results, but I can’t understand what is going wrong. Why am I getting different results?

shared_buff[threadIdx.x] = 0;

// Block until all threads in the block have written their data to shared mem
__syncthreads();

// Multiply column vector with our input vector
for (int i = 0; i < columnSize_; i++)
{
	shared_buff[threadIdx.x] += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)] * shared_values[threadIdx.x*columnSize_ + i];

	device_outputVector[2] += shared_buff[threadIdx.x];

	shared_buff[threadIdx.x] = 0;
}

Disassemble the code produced by the compiler and see what the differences are.

At the risk of looking like a complete idiot: how do you disassemble the code? I’m using Visual Studio. (I’m pretty desperate right now :P)

Nvm, I think I found how, but I’m not much of an assembly guy, so it will pretty much look like Chinese to me.

Sorry, no idea about Microsoft’s toolchain. You could use nvcc to compile to ptx and look at that instead (given the device is doing the same thing).
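For example, something along these lines should get you the PTX to look at (the file name kernel.cu is just a placeholder for wherever your kernel lives):

nvcc -ptx kernel.cu -o kernel.ptx    # emit PTX only
nvcc -keep kernel.cu                 # keep all intermediate files (.ptx, .cubin, ...) next to the build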

float total = 0;

000000013F85B7B3  xorps	   xmm5,xmm5 

000000013F85B7B6  movss	   dword ptr [total],xmm5 

000000013F85B7BC  mov		 dword ptr [i],0 

000000013F85B7C4  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+221h (13F85B7D1h) 

000000013F85B7C6  mov		 eax,dword ptr [i] 

000000013F85B7CA  add		 eax,1 

000000013F85B7CD  mov		 dword ptr [i],eax 

000000013F85B7D1  mov		 eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B7D7  cmp		 dword ptr [i],eax 

000000013F85B7DB  jge		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+28Eh (13F85B83Eh) 

		//Multiply column vector with our input vector

		#pragma unroll 

		for (int i = 0; i < columnSize_; i++)

		{

			total  += (float)(device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i]  - 1 )]*shared_values[threadIdx.x*columnSize_ + i]);

000000013F85B7DD  mov		 eax,dword ptr [threadIdx (13F8D8EB8h)] 

000000013F85B7E3  imul		eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B7EA  add		 eax,dword ptr [i] 

000000013F85B7EE  mov		 ecx,eax 

000000013F85B7F0  mov		 rax,qword ptr [shared_rowIDs] 

000000013F85B7F5  mov		 eax,dword ptr [rax+rcx*4] 

000000013F85B7F8  sub		 eax,1 

000000013F85B7FB  movsxd	  rcx,eax 

000000013F85B7FE  mov		 rax,qword ptr [device_inputVector] 

000000013F85B806  movss	   xmm1,dword ptr [rax+rcx*4] 

000000013F85B80B  mov		 eax,dword ptr [threadIdx (13F8D8EB8h)] 

000000013F85B811  imul		eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B818  add		 eax,dword ptr [i] 

000000013F85B81C  mov		 ecx,eax 

000000013F85B81E  mov		 rax,qword ptr [shared_values] 

000000013F85B823  movss	   xmm0,dword ptr [rax+rcx*4] 

000000013F85B828  mulss	   xmm1,xmm0 

000000013F85B82C  movss	   xmm0,dword ptr [total] 

000000013F85B832  addss	   xmm0,xmm1 

000000013F85B836  movss	   dword ptr [total],xmm0 

		}

000000013F85B83C  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+216h (13F85B7C6h) 

				

		device_outputVector[0]  +=  (float)total;

000000013F85B83E  mov		 rax,qword ptr [device_outputVector] 

000000013F85B846  movss	   xmm0,dword ptr [rax] 

000000013F85B84A  addss	   xmm0,dword ptr [total] 

000000013F85B850  mov		 rax,qword ptr [device_outputVector] 

000000013F85B858  movss	   dword ptr [rax],xmm0 

000000013F85B85C  mov		 dword ptr [i],0 

000000013F85B864  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+2C1h (13F85B871h) 

000000013F85B866  mov		 eax,dword ptr [i] 

000000013F85B86A  add		 eax,1 

000000013F85B86D  mov		 dword ptr [i],eax 

000000013F85B871  mov		 eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B877  cmp		 dword ptr [i],eax 

000000013F85B87B  jge		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+33Fh (13F85B8EFh) 

		//atomicAdd(&device_outputVector[0], (total) );

				

		

		for (int i = 0; i < columnSize_; i++)

		{

			//device_outputVector[(shared_resultOrder[threadIdx.x*columnSize_ + i])]  += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i]  - 1 )]*shared_values[threadIdx.x*columnSize_ + i];

			device_outputVector[1]  += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i]  - 1 )]*shared_values[threadIdx.x*columnSize_ + i];

000000013F85B87D  mov		 eax,dword ptr [threadIdx (13F8D8EB8h)] 

000000013F85B883  imul		eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B88A  add		 eax,dword ptr [i] 

000000013F85B88E  mov		 ecx,eax 

000000013F85B890  mov		 rax,qword ptr [shared_rowIDs] 

000000013F85B895  mov		 eax,dword ptr [rax+rcx*4] 

000000013F85B898  sub		 eax,1 

000000013F85B89B  movsxd	  rcx,eax 

000000013F85B89E  mov		 rax,qword ptr [device_inputVector] 

000000013F85B8A6  movss	   xmm1,dword ptr [rax+rcx*4] 

000000013F85B8AB  mov		 eax,dword ptr [threadIdx (13F8D8EB8h)] 

000000013F85B8B1  imul		eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B8B8  add		 eax,dword ptr [i] 

000000013F85B8BC  mov		 ecx,eax 

000000013F85B8BE  mov		 rax,qword ptr [shared_values] 

000000013F85B8C3  movss	   xmm0,dword ptr [rax+rcx*4] 

000000013F85B8C8  mulss	   xmm1,xmm0 

000000013F85B8CC  mov		 rax,qword ptr [device_outputVector] 

000000013F85B8D4  movss	   xmm0,dword ptr [rax+4] 

000000013F85B8D9  addss	   xmm0,xmm1 

000000013F85B8DD  mov		 rax,qword ptr [device_outputVector] 

000000013F85B8E5  movss	   dword ptr [rax+4],xmm0 

		}

000000013F85B8EA  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+2B6h (13F85B866h) 

000000013F85B8EF  mov		 dword ptr [i],0 

000000013F85B8F7  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+354h (13F85B904h) 

000000013F85B8F9  mov		 eax,dword ptr [i] 

000000013F85B8FD  add		 eax,1 

000000013F85B900  mov		 dword ptr [i],eax 

000000013F85B904  mov		 eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B90A  cmp		 dword ptr [i],eax 

000000013F85B90E  jge		 000000013F85B9DB

That’s the assembly, but I’m pretty much lost in it.

Are you sure you are not running into problems with the non-associativity of floating point additions?
How large is columnSize_?

Edit: Just realized that this was your original question. So I think yes, that’s the problem. What results do you get with double precision?

The column is 400,000 elements long, which is then split across 256 threads per block with around 300 blocks total,

where each thread calculates a sub-part of the actual column and then adds its result to the total.

But even if it is the float associativity problem, I am running the sample code in device emulation mode, which would run on the CPU in a sequential manner, wouldn’t it?

Yes, but you still have different orders in the different samples. In the cases that you have labeled as incorrect, you have done blockwise sums and an extra reduction of the block sums, while in the cases labeled as correct, you’ve directly summed up all values.

And with 400,000 values, the discrepancy can easily be explained by rounding errors violating float associativity. You’d either need to make sure you sum these in order of (roughly) increasing absolute value, or use double precision.
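A tiny host-side illustration of the effect (made-up numbers, nothing to do with your kernel): in single precision, the grouping of the additions changes the result.

#include <cstdio>

int main()
{
    // A large value followed by many small ones: the small terms are lost
    // when added to the running total one by one, but survive when summed
    // among themselves first.
    float big = 1.0e8f;
    float small = 1.0f;

    float left_to_right = big;
    for (int i = 0; i < 1000; ++i)
        left_to_right += small;          // each 1.0f is below the rounding granularity of 1e8f

    float small_first = 0.0f;
    for (int i = 0; i < 1000; ++i)
        small_first += small;            // sums to exactly 1000
    float grouped = big + small_first;   // 100001000

    printf("%f vs %f\n", left_to_right, grouped);
    return 0;
}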

Thanks mate, I really needed that explanation since it was driving me crazy. I’ll try using double precision and see how it goes.

Kahan summation might be another useful option to explore too.
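A minimal sketch of the idea, applied to your inner loop (reusing the variable names from your snippets; this is a sketch, not tested code):

float sum = 0.0f;
float c = 0.0f;                          // running compensation for lost low-order bits

for (int i = 0; i < columnSize_; i++)
{
	float term = device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)]
	           * shared_values[threadIdx.x*columnSize_ + i];
	float y = term - c;                  // apply the compensation to the new term
	float t = sum + y;                   // big + small: low-order bits of y may be lost here
	c = (t - sum) - y;                   // recover what was lost
	sum = t;
}

Note that aggressive floating-point optimizations (e.g. fast-math style reassociation) can optimize the compensation away.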

Any reading material you would suggest?

Actually, yes. This article appeared in Dr. Dobb’s 15 years ago, and I still have a very old, coffee-stained copy of it in an easy-to-reach place. It draws heavily on an older ACM paper which I think should be mandatory reading:

@article{103163,

 author = {Goldberg, David},

 title = {What every computer scientist should know about floating-point arithmetic},

 journal = {ACM Comput. Surv.},

 volume = {23},

 number = {1},

 year = {1991},

 issn = {0360-0300},

 pages = {5--48},

 doi = {http://doi.acm.org/10.1145/103162.103163},

 publisher = {ACM},

 address = {New York, NY, USA},

 }

So, one question guys: for an atomic double addition function, can I recycle the float one, or is there a built-in function in CUDA?

Atomic double operations are difficult as there are no atomic operations on 8-byte words in CUDA. You’d need a semaphore to provide exclusive access to the variable. I’d strongly suggest not to take that direction.

Rather, just save the individual block sums and start an extra kernel for the final addition.
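A minimal sketch of that two-pass approach (names like blockSums and the launch configuration are placeholders, not your code):

// Pass 1: each block reduces its partial results and writes the block sum
// to blockSums[blockIdx.x] (e.g. via the shared-memory reduction you already have).
// Pass 2: a small follow-up kernel adds the per-block sums together.
__global__ void reduceBlockSums(const float *blockSums, float *result, int numBlocks)
{
	if (blockIdx.x == 0 && threadIdx.x == 0)
	{
		float total = 0.0f;
		for (int b = 0; b < numBlocks; ++b)   // a serial loop is fine for a few hundred blocks
			total += blockSums[b];
		result[0] = total;
	}
}

// launched as: reduceBlockSums<<<1, 1>>>(d_blockSums, d_result, numBlocks);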

Before I implement the kernel to reduce the threads, do you think this code I found from Atomic Double Add will work?

static __inline__ __device__ double atomicAdd(double *addr, double val)
{
	double old = *addr, assumed;

	do {
		assumed = old;
		// Reinterpret the doubles as 64-bit integers so atomicCAS can be used;
		// the swap only succeeds if *addr still holds the value we assumed.
		old = __longlong_as_double(atomicCAS((unsigned long long int *)addr,
		                                     __double_as_longlong(assumed),
		                                     __double_as_longlong(val + assumed)));
	} while (assumed != old);

	return old;
}

I stand corrected.
I seemed to remember that there are no 64-bit atomic operations in CUDA. Checking the programming guide, I see they are there for compute capability 1.2 and above; they just need to operate on global memory. So yes, you can use any of those atomic adds.