What could cause example code oclMatVecMul to produce wrong results? "GPU Result DOESN'T MATCH CPU Result within allowable tolerance"

Hi everybody,

Today I tried another example from the SDK, oclMatVecMul, but I get the following error message: “GPU Result DOESN’T MATCH CPU Result within allowable tolerance”

oclMatrixMul, on the other hand, works. What could cause this behavior?

Thanks in advance!

Hi,

I did some further research, and it looks like the result depends on the dimensions of the matrix but not on its total size: the slightly bigger matrix works, the slightly smaller one doesn’t. Do you have any idea what could cause this?

I attached the results of two different runs. The first one passes with a matrix of width 2525 and height 13271, i.e. 2525 * 13271 * 4 = 134037100 bytes. The second one fails with a matrix of width 5050 and height 6635, i.e. 5050 * 6635 * 4 = 134027000 bytes. The difference is 134037100 - 134027000 = 10100 = 4 * 2525 bytes.

Determining Matrix height from available GPU mem...

 oclGetPlatformID...

 clGetDeviceIDs clCreateContext...

 clGetDeviceInfo...

 Matrix width	= 2525

 Matrix height	= 13271

Allocate and Init Host Mem...

Get the Platform ID...

Get the Device info and select Device...

 # of Devices Available = 1

 Using Device 0: Quadro FX 580

 # of Compute Units = 4

clCreateContext...

clCreateCommandQueue...

clCreateBuffer (M, V and W in device global memory, mem_size_m = 134037100)...

oclLoadProgSource (oclMatVecMul.cl)...

clCreateProgramWithSource...

clBuildProgram...

clEnqueueWriteBuffer (M and V)...

Running with Kernel MatVecMulUncoalesced0...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 13312

  Local Work Size 		= 256

  # of Work Groups 		= 52

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulUncoalesced0)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

Running with Kernel MatVecMulUncoalesced1...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulUncoalesced1)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

Running with Kernel MatVecMulCoalesced0...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulCoalesced0)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

Running with Kernel MatVecMulCoalesced1...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulCoalesced1)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

Running with Kernel MatVecMulCoalesced2...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulCoalesced2)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

Running with Kernel MatVecMulCoalesced3...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulCoalesced3)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

PASSED

Starting Cleanup...

oclMatVecMul.exe Exiting...

Press <Enter> to Quit

-----------------------------------------------------------

######################################################################################

Determining Matrix height from available GPU mem...

 oclGetPlatformID...

 clGetDeviceIDs clCreateContext...

 clGetDeviceInfo...

 Matrix width	= 5050

 Matrix height	= 6635

Allocate and Init Host Mem...

Get the Platform ID...

Get the Device info and select Device...

 # of Devices Available = 1

 Using Device 0: Quadro FX 580

 # of Compute Units = 4

clCreateContext...

clCreateCommandQueue...

clCreateBuffer (M, V and W in device global memory, mem_size_m = 134027000)...

oclLoadProgSource (oclMatVecMul.cl)...

clCreateProgramWithSource...

clBuildProgram...

clEnqueueWriteBuffer (M and V)...

Running with Kernel MatVecMulUncoalesced0...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 6656

  Local Work Size 		= 256

  # of Work Groups 		= 26

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulUncoalesced0)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result DOESN'T MATCH CPU Result within allowable tolerance

Running with Kernel MatVecMulUncoalesced1...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulUncoalesced1)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result DOESN'T MATCH CPU Result within allowable tolerance

Running with Kernel MatVecMulCoalesced0...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulCoalesced0)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

Running with Kernel MatVecMulCoalesced1...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulCoalesced1)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

Running with Kernel MatVecMulCoalesced2...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulCoalesced2)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

Running with Kernel MatVecMulCoalesced3...

Clear result with clEnqueueWriteBuffer (W)...

  clCreateKernel...

  Global Work Size 		= 2048

  Local Work Size 		= 256

  # of Work Groups 		= 8

  clSetKernelArg...

clEnqueueNDRangeKernel (MatVecMulCoalesced3)...

  clEnqueueReadBuffer (W)...

  Comparing against Host/C++ computation...

GPU Result MATCHES CPU Result within allowable tolerance

FAILED

Starting Cleanup...

oclMatVecMul.exe Exiting...

Press <Enter> to Quit

-------------------------

During further testing, I found that in some cases (here with oclMatrixMul) the kernel returns almost immediately, with an execution time reported as 0 s, and every GPU result is “0”. Any idea?

./oclMatrixMul Starting...

Device 0: Quadro FX 580

Using Matrix Sizes: A(8000 x 16000), B(8000 x 8000), C(8000 x 16000)

Running Computations on 1 - 1 GPU's...

oclMatrixMul, Throughput = 30906456.9153 GFlops/s, Time = 0.00007 s, Size = 2048000000000, NumDevsUsed = 1, Workgroup = 256

Kernel execution time on GPU 0 	: 0.00000 s

Comparing results with CPU computation... 

Listing first 100 Differences > 0.000010...

Row 0:

    Loc(0,0)	CPU=1992.44897	GPU=0.00000	Diff=1992.448975

    Loc(1,0)	CPU=1992.28149	GPU=0.00000	Diff=1992.281494

    Loc(2,0)	CPU=2000.21448	GPU=0.00000	Diff=2000.214478

    Loc(3,0)	CPU=2008.32019	GPU=0.00000	Diff=2008.320190

    Loc(4,0)	CPU=1998.52942	GPU=0.00000	Diff=1998.529419

    Loc(5,0)	CPU=2017.58997	GPU=0.00000	Diff=2017.589966

    Loc(6,0)	CPU=1992.80225	GPU=0.00000	Diff=1992.802246

    Loc(7,0)	CPU=2004.28503	GPU=0.00000	Diff=2004.285034

    Loc(8,0)	CPU=2000.10669	GPU=0.00000	Diff=2000.106689

    Loc(9,0)	CPU=1997.23682	GPU=0.00000	Diff=1997.236816

    Loc(10,0)	CPU=1985.83606	GPU=0.00000	Diff=1985.836060

    Loc(11,0)	CPU=2007.43274	GPU=0.00000	Diff=2007.432739

    Loc(12,0)	CPU=1992.11462	GPU=0.00000	Diff=1992.114624

    Loc(13,0)	CPU=1995.93408	GPU=0.00000	Diff=1995.934082

    Loc(14,0)	CPU=1990.54260	GPU=0.00000	Diff=1990.542603

    Loc(15,0)	CPU=1996.68384	GPU=0.00000	Diff=1996.683838

    Loc(16,0)	CPU=1988.17065	GPU=0.00000	Diff=1988.170654

    Loc(17,0)	CPU=2012.80896	GPU=0.00000	Diff=2012.808960

    Loc(18,0)	CPU=2015.77893	GPU=0.00000	Diff=2015.778931

    Loc(19,0)	CPU=2027.57263	GPU=0.00000	Diff=2027.572632

    Loc(20,0)	CPU=2013.63623	GPU=0.00000	Diff=2013.636230

    Loc(21,0)	CPU=2034.78210	GPU=0.00000	Diff=2034.782104

    Loc(22,0)	CPU=1997.21558	GPU=0.00000	Diff=1997.215576

    Loc(23,0)	CPU=2045.86340	GPU=0.00000	Diff=2045.863403

    Loc(24,0)	CPU=2027.31152	GPU=0.00000	Diff=2027.311523

    Loc(25,0)	CPU=1984.99695	GPU=0.00000	Diff=1984.996948

    Loc(26,0)	CPU=1996.16833	GPU=0.00000	Diff=1996.168335

    Loc(27,0)	CPU=1993.75476	GPU=0.00000	Diff=1993.754761

    Loc(28,0)	CPU=2004.76868	GPU=0.00000	Diff=2004.768677

    Loc(29,0)	CPU=2007.93701	GPU=0.00000	Diff=2007.937012

    Loc(30,0)	CPU=2030.28711	GPU=0.00000	Diff=2030.287109

    Loc(31,0)	CPU=2027.95984	GPU=0.00000	Diff=2027.959839

    Loc(32,0)	CPU=2003.72021	GPU=0.00000	Diff=2003.720215

    Loc(33,0)	CPU=1992.63281	GPU=0.00000	Diff=1992.632812

    Loc(34,0)	CPU=2012.11938	GPU=0.00000	Diff=2012.119385

    Loc(35,0)	CPU=2018.05933	GPU=0.00000	Diff=2018.059326

    Loc(36,0)	CPU=1991.52344	GPU=0.00000	Diff=1991.523438

    Loc(37,0)	CPU=1993.33020	GPU=0.00000	Diff=1993.330200

    Loc(38,0)	CPU=1991.59412	GPU=0.00000	Diff=1991.594116

    Loc(39,0)	CPU=2019.39380	GPU=0.00000	Diff=2019.393799

    Loc(40,0)	CPU=1994.77686	GPU=0.00000	Diff=1994.776855

    Loc(41,0)	CPU=1996.65344	GPU=0.00000	Diff=1996.653442

    Loc(42,0)	CPU=1993.86499	GPU=0.00000	Diff=1993.864990

    Loc(43,0)	CPU=2015.09985	GPU=0.00000	Diff=2015.099854

    Loc(44,0)	CPU=1997.42065	GPU=0.00000	Diff=1997.420654

    Loc(45,0)	CPU=1987.11414	GPU=0.00000	Diff=1987.114136

    Loc(46,0)	CPU=1984.02380	GPU=0.00000	Diff=1984.023804

    Loc(47,0)	CPU=2012.55786	GPU=0.00000	Diff=2012.557861

    Loc(48,0)	CPU=2003.70850	GPU=0.00000	Diff=2003.708496

    Loc(49,0)	CPU=2006.56262	GPU=0.00000	Diff=2006.562622

    Loc(50,0)	CPU=2010.09790	GPU=0.00000	Diff=2010.097900

    Loc(51,0)	CPU=2007.61536	GPU=0.00000	Diff=2007.615356

    Loc(52,0)	CPU=1992.81860	GPU=0.00000	Diff=1992.818604

    Loc(53,0)	CPU=2002.70422	GPU=0.00000	Diff=2002.704224

    Loc(54,0)	CPU=1982.52747	GPU=0.00000	Diff=1982.527466

    Loc(55,0)	CPU=2022.27722	GPU=0.00000	Diff=2022.277222

    Loc(56,0)	CPU=2001.84509	GPU=0.00000	Diff=2001.845093

    Loc(57,0)	CPU=2003.65381	GPU=0.00000	Diff=2003.653809

    Loc(58,0)	CPU=2012.68188	GPU=0.00000	Diff=2012.681885

    Loc(59,0)	CPU=1994.68628	GPU=0.00000	Diff=1994.686279

    Loc(60,0)	CPU=2017.70618	GPU=0.00000	Diff=2017.706177

    Loc(61,0)	CPU=1993.10938	GPU=0.00000	Diff=1993.109375

    Loc(62,0)	CPU=2036.02258	GPU=0.00000	Diff=2036.022583

    Loc(63,0)	CPU=2006.94653	GPU=0.00000	Diff=2006.946533

    Loc(64,0)	CPU=2003.39282	GPU=0.00000	Diff=2003.392822

    Loc(65,0)	CPU=1972.17664	GPU=0.00000	Diff=1972.176636

    Loc(66,0)	CPU=1990.85706	GPU=0.00000	Diff=1990.857056

    Loc(67,0)	CPU=2003.34399	GPU=0.00000	Diff=2003.343994

    Loc(68,0)	CPU=2014.35437	GPU=0.00000	Diff=2014.354370

    Loc(69,0)	CPU=2015.83765	GPU=0.00000	Diff=2015.837646

    Loc(70,0)	CPU=1992.88477	GPU=0.00000	Diff=1992.884766

    Loc(71,0)	CPU=1985.77124	GPU=0.00000	Diff=1985.771240

    Loc(72,0)	CPU=1989.29761	GPU=0.00000	Diff=1989.297607

    Loc(73,0)	CPU=2012.30798	GPU=0.00000	Diff=2012.307983

    Loc(74,0)	CPU=2009.36536	GPU=0.00000	Diff=2009.365356

    Loc(75,0)	CPU=1981.96594	GPU=0.00000	Diff=1981.965942

    Loc(76,0)	CPU=1997.63953	GPU=0.00000	Diff=1997.639526

    Loc(77,0)	CPU=2009.42236	GPU=0.00000	Diff=2009.422363

    Loc(78,0)	CPU=1993.30200	GPU=0.00000	Diff=1993.302002

    Loc(79,0)	CPU=1999.61621	GPU=0.00000	Diff=1999.616211

    Loc(80,0)	CPU=1996.50671	GPU=0.00000	Diff=1996.506714

    Loc(81,0)	CPU=1999.69434	GPU=0.00000	Diff=1999.694336

    Loc(82,0)	CPU=2021.29834	GPU=0.00000	Diff=2021.298340

    Loc(83,0)	CPU=1994.06824	GPU=0.00000	Diff=1994.068237

    Loc(84,0)	CPU=1996.23657	GPU=0.00000	Diff=1996.236572

    Loc(85,0)	CPU=2004.75378	GPU=0.00000	Diff=2004.753784

    Loc(86,0)	CPU=1994.15723	GPU=0.00000	Diff=1994.157227

    Loc(87,0)	CPU=2012.92725	GPU=0.00000	Diff=2012.927246

    Loc(88,0)	CPU=1999.31653	GPU=0.00000	Diff=1999.316528

    Loc(89,0)	CPU=2002.83374	GPU=0.00000	Diff=2002.833740

    Loc(90,0)	CPU=1988.23853	GPU=0.00000	Diff=1988.238525

    Loc(91,0)	CPU=2005.37439	GPU=0.00000	Diff=2005.374390

    Loc(92,0)	CPU=2037.14099	GPU=0.00000	Diff=2037.140991

    Loc(93,0)	CPU=2010.84180	GPU=0.00000	Diff=2010.841797

    Loc(94,0)	CPU=2017.66699	GPU=0.00000	Diff=2017.666992

    Loc(95,0)	CPU=1994.46619	GPU=0.00000	Diff=1994.466187

    Loc(96,0)	CPU=1990.31714	GPU=0.00000	Diff=1990.317139

    Loc(97,0)	CPU=1986.71008	GPU=0.00000	Diff=1986.710083

    Loc(98,0)	CPU=2008.22083	GPU=0.00000	Diff=2008.220825

    Loc(99,0)	CPU=2000.99719	GPU=0.00000	Diff=2000.997192

Total Errors = 128000000

FAILED 

Press <Enter> to Quit...

-----------------------------------------------------------

Hi, have you solved it by now?