Fun with complex numbers. Today: 2x2 matrix inversions, millions of them ;) An example use of my cudacomplex class

Hi,

Today I am going to post a little fun project that is based on my cudacomplex.h header file (posted in the other thread named “An attempt at a complex maths library”). This project demonstrates performing inversions on 2x2 matrices, many of them in parallel. As an added bonus, all read and write accesses are coalesced to obtain nearly optimal memory throughput even on Compute 1.0 hardware. The difficulty of this tutorial is low to moderate, so CUDA beginners are invited to have a look.

The code is based on the template SDK project. So if you need to create a Makefile for Linux and Mac OS, just take the template Makefile and change the file names inside from template to complextest. Or just use this one; be sure to work within a subdirectory of the SDK’s projects folder.

Makefile

############################################################
#
# Build script for project
#
############################################################

# Add source files here
EXECUTABLE	  := complextest

# CUDA source files (compiled with cudacc)
CUFILES		 := complextest.cu

# CUDA dependency files
CU_DEPS		 := \
		complextest_kernel.cu cudacomplex.h complexmatrix.h

# C/C++ source files (compiled with gcc / c++)
CCFILES		 := \
		complextest_gold.cpp

############################################################
# Rules and targets

include ../../common/common.mk

First let me introduce a matrix2_2 class (it is hardcoded to 2x2 dimensions for best speed). Name this file complexmatrix.h. It stores the matrix elements in struct members a, b, c, d and overloads some common operators for matrix computations. The coding conventions are similar to those in the cudacomplex.h header file, which you have to grab from the second posting in the other thread.
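For reference, the inv() member below implements the textbook closed-form inverse of a 2x2 matrix (standard linear algebra, nothing specific to my code):

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}, \qquad \det = ad - bc \neq 0.$$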

complexmatrix.h

/*
* Copyright (c) 2008-2009 Christian Buchner <Christian.Buchner@gmail.com>
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*	 * Redistributions of source code must retain the above copyright
*	   notice, this list of conditions and the following disclaimer.
*	 * Redistributions in binary form must reproduce the above copyright
*	   notice, this list of conditions and the following disclaimer in the
*	   documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY Christian Buchner ''AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL Christian Buchner BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#ifndef COMPLEXMATRIX_H
#define COMPLEXMATRIX_H

#include "cudacomplex.h"
#include <stdio.h>

// C++ struct to represent a 2x2 matrix with the following layout:
// [ a b ]
// [ c d ]
// Some operators are overloaded and members for inversion are provided.
struct matrix2_2 {

    // data members
    cudacomplex a, b, c, d;

    // ctor's
    M_HOSTDEVICE matrix2_2() {
        a = b = c = d = 0.0f;
    }
    M_HOSTDEVICE matrix2_2(const cudacomplex REF(a), const cudacomplex REF(b), const cudacomplex REF(c), const cudacomplex REF(d)) :
        a(a), b(b), c(c), d(d) {}

    // overloaded operators

    // multiplication with another matrix
    M_HOSTDEVICE matrix2_2 operator*(const matrix2_2 REF(A)) const {
        return matrix2_2(
            A.a * a + A.c * b, A.b * a + A.d * b,
            A.a * c + A.c * d, A.b * c + A.d * d );
    }

    // division, i.e. multiplication with inverse of another matrix
    M_HOSTDEVICE matrix2_2 operator/(const matrix2_2 REF(A)) const {
        matrix2_2 A_inv = A.inv();
        return matrix2_2(
            A_inv.a * a + A_inv.c * b,  A_inv.b * a + A_inv.d * b,
            A_inv.a * c + A_inv.c * d,  A_inv.b * c + A_inv.d * d );
    }

    // sum with another matrix
    M_HOSTDEVICE matrix2_2 operator+(const matrix2_2 REF(A)) const {
        return matrix2_2(A.a + a,  A.b + b,
                         A.c + c,  A.d + d);
    }

    // difference between this and another matrix
    M_HOSTDEVICE matrix2_2 operator-(const matrix2_2 REF(A)) const {
        return matrix2_2(a - A.a,  b - A.b,
                         c - A.c,  d - A.d);
    }

    // negation of this matrix
    M_HOSTDEVICE matrix2_2 operator-() const {
        return matrix2_2(-a, -b,
                         -c, -d);
    }

    // member functions

    // compute the determinant
    M_HOSTDEVICE cudacomplex det() const {
        return a*d - b*c;
    }

    // invert this matrix
    M_HOSTDEVICE matrix2_2 inv() const {
        cudacomplex  i_det = 1.0f / det();
        cudacomplex _i_det = -i_det;  // saves one explicit negation below
        return matrix2_2( d *  i_det,  b * _i_det,
                          c * _i_det,  a *  i_det);
    }

    // print matrix contents (host code only!)
    M_HOST void print() {
        printf("[ %.4f + %.4fj\t", a.real(), a.imag()); printf("%.4f + %.4fj\n",   b.real(), b.imag());
        printf("  %.4f + %.4fj\t", c.real(), c.imag()); printf("%.4f + %.4fj ]\n", d.real(), d.imag());
    }

    // static member functions

    // return empty matrix
    static M_HOSTDEVICE matrix2_2 zero(void) {
        return matrix2_2(constants::zero, constants::zero,
                         constants::zero, constants::zero);
    }

    // return matrix of all ones
    static M_HOSTDEVICE matrix2_2 ones(void) {
        return matrix2_2(constants::one,  constants::one,
                         constants::one,  constants::one);
    }

    // return unit matrix
    static M_HOSTDEVICE matrix2_2 unit(void) {
        return matrix2_2(constants::one,  constants::zero,
                         constants::zero, constants::one);
    }
};

/**
 * An object that can present itself like an array of matrix2_2 objects,
 * but with an internal struct-of-arrays data layout for memory coalescing.
 */
#if 1  // use #if 0 to disable memory coalescing
template <int height> struct matrixstack2_2
{
    // data members
    cudacomplex a[height], b[height], c[height], d[height];

    // Retrieve the corresponding matrix2_2 object by index
    M_HOSTDEVICE matrix2_2 operator[](int index) const {
        return matrix2_2(a[index], b[index],
                         c[index], d[index]);
    }

    // Set the corresponding matrix2_2 by index.
    // Note that the assignment operator won't work because
    // the [] operator gives us a temporary matrix2_2 object
    // and not an actual reference to the internal storage.
    // So you HAVE to use the set() function.
    // WARNING: CUDA uses local memory when compiling this on the device,
    //          which slows down your program.
    //          See the overload below for a possible remedy.
    M_HOSTDEVICE void set(int index, const matrix2_2 REF(data))
    {
        a[index] = data.a; b[index] = data.b;
        c[index] = data.c; d[index] = data.d;
    }

    // This overload of set() avoids CUDA's use of local memory because
    // it doesn't pass the larger matrix2_2 type as function argument
    M_HOSTDEVICE void set(int index, const cudacomplex REF(_a), const cudacomplex REF(_b), const cudacomplex REF(_c), const cudacomplex REF(_d))
    {
        a[index] = _a; b[index] = _b;
        c[index] = _c; d[index] = _d;
    }
#else
/**
 * A very similar matrix2_2 stack object, but without an internal
 * data layout that enables memory coalescing.
 * This is just for testing purposes (use #if 0 above to enable this).
 */
template <int height> struct matrixstack2_2
{
    // data members
    matrix2_2 m[height];

    // Retrieve the corresponding matrix2_2 object by index
    M_HOSTDEVICE matrix2_2 operator[](int index) const {
        return m[index];
    }

    // Set the corresponding matrix2_2 by index.
    // Note that the assignment operator won't work because
    // the [] operator gives us a temporary matrix2_2 object
    // and not an actual reference to the internal storage.
    // So you HAVE to use the set() function.
    // WARNING: CUDA uses local memory when compiling this on the device,
    //          which slows down your program.
    //          See the overload below for a possible remedy.
    M_HOSTDEVICE void set(int index, const matrix2_2 REF(data)) {
        m[index] = data;
    }

    // This overload of set() avoids CUDA's use of local memory because
    // it doesn't pass the larger matrix2_2 type as function argument
    M_HOSTDEVICE void set(int index, const cudacomplex REF(_a), const cudacomplex REF(_b), const cudacomplex REF(_c), const cudacomplex REF(_d)) {
        m[index] = matrix2_2(_a, _b, _c, _d);
    }
#endif
};

#endif // #ifndef COMPLEXMATRIX_H

When you want to perform simultaneous computations on matrix2_2 objects in many threads, you will notice that memory accesses are not coalesced: an array of structs (like matrix2_2) generally prevents coalescing in CUDA, because consecutive threads then touch addresses that are a whole struct apart.

So I came up with the trick of introducing an intermediate storage object named “matrixstack2_2”, which I declare as a template class. In this class I store the matrix elements a, b, c, d in arrays of size “height”, where height is an integer template parameter that has to be specified at compile time. I overload the [] operator to read individual elements from the matrix stack and provide a set() function to modify (overwrite) an element of the stack. Because the matrix stack now internally stores everything as a struct of arrays, the memory accesses will be coalesced, as the sketch below illustrates.
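Here is a minimal illustration of why the layout matters (my sketch, not part of the project files; it assumes cudacomplex is an 8-byte float2-style wrapper as in the other thread, and the coalescing #if 1 variant of the stack):

// Illustration only: per-thread access strides of the two layouts.
__global__ void readAoS(const matrix2_2* m, cudacomplex* out)
{
    // thread i reads at byte offset i * sizeof(matrix2_2) = i * 32:
    // neighbouring threads are 32 bytes apart -> not coalesced on Compute 1.0
    out[threadIdx.x] = m[threadIdx.x].a;
}

__global__ void readSoA(const matrixstack2_2<STACKHEIGHT>* s, cudacomplex* out)
{
    // thread i reads at byte offset i * sizeof(cudacomplex) = i * 8:
    // neighbouring threads touch adjacent 8-byte words -> fully coalesced
    out[threadIdx.x] = s->a[threadIdx.x];
}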

Try this test program “complextest.cu”. It generates a lot of random matrices, inverts them, and multiplies each with its inverse, so we get plenty of unit matrices. The computation is repeated on the CPU in a “gold” kernel and the results are verified. I print the estimated memory throughput as well as GFlops/sec during the computation.

complextest.cu

/*
* Copyright (c) 2008-2009 Christian Buchner <Christian.Buchner@gmail.com>
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*	 * Redistributions of source code must retain the above copyright
*	   notice, this list of conditions and the following disclaimer.
*	 * Redistributions in binary form must reproduce the above copyright
*	   notice, this list of conditions and the following disclaimer in the
*	   documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY Christian Buchner ''AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL Christian Buchner BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Inversion of 2x2 matrices using coalesced memory access patterns.
 * Based on nVidia's template SDK project.
 * Host code.
 */

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cutil_inline.h>

// NOTE: this definition is duplicated in complextest_gold.cpp
// when you change it, please change both!
#define STACKHEIGHT 384

#include "cudacomplex.h"
#include "complexmatrix.h"

// includes, kernels
#include <complextest_kernel.cu>

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest( int argc, char** argv);

extern "C"
void computeGold( matrixstack2_2<STACKHEIGHT>* reference, matrixstack2_2<STACKHEIGHT>* idata, const unsigned int threads, const unsigned int samples );

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, char** argv)
{
    runTest( argc, argv);

    cutilExit(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void
runTest( int argc, char** argv)
{
    // use command-line specified CUDA device, otherwise use device with highest Gflops/s
    if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
        cutilDeviceInit(argc, argv);
    else
        cudaSetDevice( cutGetMaxGflopsDeviceId() );

    cutilSafeCall( cudaFree(NULL) ); // force creation of CUDA context

    unsigned int num_samples = 10000;
    unsigned int mem_size = sizeof( matrixstack2_2<STACKHEIGHT> ) * num_samples;

    printf("Complex class and Complex 2x2 matrix class functional test: Performing\n");
    printf("%d complex 2x2 matrix inversions and multiplications (in %d bytes)\n", STACKHEIGHT * num_samples, mem_size);

    // allocate host memory
    matrixstack2_2<STACKHEIGHT>* h_idata = (matrixstack2_2<STACKHEIGHT>*) malloc( mem_size );

    // initialize the memory
    for( unsigned int i = 0; i < num_samples; ++i)
    {
        for ( unsigned int j = 0; j < STACKHEIGHT; j++)
        {
            // create random 2x2 matrices with elements in the range 0...1
            //                 (for both real and imaginary components)
            cudacomplex a = { (float)rand() / RAND_MAX, (float)rand() / RAND_MAX };
            cudacomplex b = { (float)rand() / RAND_MAX, (float)rand() / RAND_MAX };
            cudacomplex c = { (float)rand() / RAND_MAX, (float)rand() / RAND_MAX };
            cudacomplex d = { (float)rand() / RAND_MAX, (float)rand() / RAND_MAX };
            h_idata[i].set(j, matrix2_2( a, b, c, d ));
        }
    }

    // allocate device memory
    matrixstack2_2<STACKHEIGHT>* d_idata;
    cutilSafeCall( cudaMalloc( (void**) &d_idata, mem_size));
    // allocate device memory for result
    matrixstack2_2<STACKHEIGHT>* d_odata;
    cutilSafeCall( cudaMalloc( (void**) &d_odata, mem_size));

    unsigned int timer = 0;
    cutilCheckError( cutCreateTimer( &timer));
    cutilCheckError( cutStartTimer( timer));

    // copy host memory to device
    cutilSafeCall( cudaMemcpy( d_idata, h_idata, mem_size,
                                cudaMemcpyHostToDevice) );

    // setup execution parameters
    dim3  grid( num_samples, 1, 1);
    dim3  threads( STACKHEIGHT, 1, 1);

    unsigned int timer2 = 0;
    cutilCheckError( cutCreateTimer( &timer2));
    cutilCheckError( cutStartTimer( timer2));

    // execute the kernel
    testKernel<<< grid, threads >>>( d_idata, d_odata );

    // check if kernel execution generated an error
    cutilCheckMsg("Kernel execution failed");

    cudaThreadSynchronize();
    cutilCheckError( cutStopTimer( timer2));

    // allocate mem for the result on host side
    matrixstack2_2<STACKHEIGHT>* h_odata = (matrixstack2_2<STACKHEIGHT>*) malloc( mem_size );
    // copy result from device to host
    cutilSafeCall( cudaMemcpy( h_odata, d_odata, mem_size,
                                cudaMemcpyDeviceToHost) );

    cutilCheckError( cutStopTimer( timer));
    printf( "Copy+Processing time (GPU): %f (ms)\n", cutGetTimerValue( timer));
    printf( "Only processing time (GPU): %f (ms)\n", cutGetTimerValue( timer2));
    // compute memory bandwidth, assuming we're reading and writing all matrix elements once
    printf( "GPU read/write performance during computation: %f GBytes/s\n", (((double)mem_size*2 / ((double)cutGetTimerValue( timer2)/1000)))/(1024*1024*1024));
    printf( "Matrices processed per second (inversion+multiplication): %.0f \n", (((double)STACKHEIGHT*num_samples/ ((double)cutGetTimerValue( timer2)/1000))));
    // 90 floating point ops can be found in the PTX file, 15 of which are MADs
    printf( "Estimated Gflops/s: %f \n", (((double)STACKHEIGHT*num_samples*(90+15)/ ((double)cutGetTimerValue( timer2)/1000)))/1E9);
    cutilCheckError( cutDeleteTimer( timer));
    cutilCheckError( cutDeleteTimer( timer2));

    // compute reference solution
    matrixstack2_2<STACKHEIGHT>* reference = (matrixstack2_2<STACKHEIGHT>*) malloc( mem_size );
    unsigned int timer3 = 0;
    cutilCheckError( cutCreateTimer( &timer3));
    cutilCheckError( cutStartTimer( timer3));
    computeGold( reference, h_idata, STACKHEIGHT, num_samples);
    cutilCheckError( cutStopTimer( timer3));
    printf( "	 Processing time (CPU): %f (ms)\n", cutGetTimerValue( timer3));
    cutilCheckError( cutDeleteTimer( timer3));

    // Check if the GPU computed result is equivalent to the CPU reference solution.
    // Typecast buffer to float, such that cutComparefe works for us.
    // This requires a large epsilon value because random matrices may be ill-conditioned.
    CUTBoolean res = cutComparefe( (float*)reference, (float*)h_odata, 2*2*2*STACKHEIGHT*num_samples, 0.001f);
    printf( "CPU<->GPU comparison Test %s\n", (1 == res) ? "PASSED" : "FAILED");

    // The mathematically correct solution of a matrix multiplication
    // with its inverse matrix is a unit matrix [ 1 0 ]
    //                                          [ 0 1 ]
    for( unsigned int i = 0; i < num_samples; ++i)
        for ( unsigned int j = 0; j < STACKHEIGHT; j++)
            reference[i].set(j, matrix2_2::unit());

    // Check if the computed result is equivalent to the mathematical reference solution.
    // This also requires a large epsilon value because random matrices may be ill-conditioned.
    res = cutComparefe( (float*)reference, (float*)h_odata, 2*2*2*STACKHEIGHT*num_samples, 0.001f);
    printf( "Mathematical correctness Test %s\n", (1 == res) ? "PASSED" : "FAILED");

    // cleanup memory
    free( h_idata);
    free( h_odata);
    free( reference);
    cutilSafeCall(cudaFree(d_idata));
    cutilSafeCall(cudaFree(d_odata));

    cudaThreadExit();
}

complextest_kernel.cu

/*
* Copyright (c) 2008-2009 Christian Buchner <Christian.Buchner@gmail.com>
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*	 * Redistributions of source code must retain the above copyright
*	   notice, this list of conditions and the following disclaimer.
*	 * Redistributions in binary form must reproduce the above copyright
*	   notice, this list of conditions and the following disclaimer in the
*	   documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY Christian Buchner ''AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL Christian Buchner BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

/* Inversion of 2x2 matrices using coalesced memory access patterns.
 * Based on nVidia's template SDK project.
 * Device code.
 */

#ifndef _COMPLEXTEST_KERNEL_H_
#define _COMPLEXTEST_KERNEL_H_

#include "cudacomplex.h"
#include "complexmatrix.h"

////////////////////////////////////////////////////////////////////////////////
//! Simple test kernel for inverting 2x2 matrices
//! @param g_idata  input data in global memory
//! @param g_odata  output data in global memory
////////////////////////////////////////////////////////////////////////////////
__global__ void
testKernel( matrixstack2_2<STACKHEIGHT>* g_idata, matrixstack2_2<STACKHEIGHT>* g_odata)
{
    // access thread id
    const unsigned int tid    = threadIdx.x;
    const unsigned int sample = blockIdx.x;

    // multiply each matrix with its own inverse; the element-wise set()
    // overload keeps the result in registers (see the discussion below)
    matrix2_2 tmp = g_idata[sample][tid] * g_idata[sample][tid].inv();
    g_odata[sample].set(tid, tmp.a, tmp.b, tmp.c, tmp.d);
}

#endif // #ifndef _COMPLEXTEST_KERNEL_H_

complextest_gold.cpp

/*
* Copyright (c) 2008-2009 Christian Buchner <Christian.Buchner@gmail.com>
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*	 * Redistributions of source code must retain the above copyright
*	   notice, this list of conditions and the following disclaimer.
*	 * Redistributions in binary form must reproduce the above copyright
*	   notice, this list of conditions and the following disclaimer in the
*	   documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY Christian Buchner ''AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
* WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL Christian Buchner BE LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#include "cudacomplex.h"

#include "complexmatrix.h"

#define STACKHEIGHT 384

////////////////////////////////////////////////////////////////////////////////

// export C interface

extern "C" 

void computeGold( matrixstack2_2<STACKHEIGHT>* reference, matrixstack2_2<STACKHEIGHT>* idata, const unsigned int threads, const unsigned int samples );

////////////////////////////////////////////////////////////////////////////////

//! Compute reference data set (inversion of 2x2 matrices)

//! @param reference  reference data, computed but preallocated

//! @param idata	  input data as provided to device

//! @param threads	height of matrix stacks

//! @param samples	total number of matrix stacks

////////////////////////////////////////////////////////////////////////////////

void

computeGold(  matrixstack2_2<STACKHEIGHT>* reference, matrixstack2_2<STACKHEIGHT>* idata, const unsigned int threads, const unsigned int samples ) 

{

	for( unsigned int i = 0; i < samples; ++i)

	{

		for( unsigned int j = 0; j < threads; ++j) 

		{

			reference[i].set(j, idata[i][j] * idata[i][j].inv());

		}

	}

}

The program’s output looks like this on my laptop with a 9600M GT graphics chip (32 shader units). You may need 256MB of graphics memory to run this, as the program tries to grab about 120MB in a single block. Cool, so we’ve just inverted 176 million complex matrices per second. Eat this, Intel CPU!

Complex class and Complex 2x2 matrix class functional test: Performing
3840000 complex 2x2 matrix inversions and multiplications (in 122880000 bytes)
Copy+Processing time (GPU): 218.680099 (ms)
Only processing time (GPU): 21.812899 (ms)
GPU read/write performance during computation: 10.492958 GBytes/s
Matrices processed per second (inversion+multiplication): 176042628
Estimated Gflops/s: 18.707373
	 Processing time (CPU): 1572.325073 (ms)
CPU<->GPU comparison Test PASSED
Mathematical correctness Test PASSED

To see the catastrophic effect of uncoalesced memory access, change the #if 1 in complexmatrix.h to #if 0, which replaces matrixstack2_2 with a version that uses an array of structs, and rebuild the project. The memory throughput tanks and the computation time increases significantly (for me: 10.5 GBytes/sec -> 2.6 GBytes/sec, 21.8ms -> 87.2ms).

This application is mostly memory bound, not compute bound, as you can tell from the low GFlops numbers. Also note that the memory transfer overhead to and from the GPU is significant, yet even including this overhead the actual computation still beats the CPU by a large factor.
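A quick back-of-the-envelope check of the memory-bound claim (my arithmetic, assuming an 8-byte cudacomplex, and counting one read and one write per matrix as the program itself does): each thread moves 32 + 32 = 64 bytes while executing about 105 FLOPs, i.e. roughly 105/64 ≈ 1.6 FLOPs per byte. At the measured 10.5 GBytes/s (about 11.3e9 bytes/s) that predicts roughly 18.5 GFlops/s, which is consistent with the 18.7 GFlops/s reported above; the arithmetic rate is dictated almost entirely by memory throughput.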

One special detail is the pair of set() functions in the matrixstack2_2 struct. I found that when I used the first version, which passes a const reference to a matrix2_2, the code became needlessly slow. As the CUDA Visual Profiler revealed, the compiler started to pass the matrix2_2 through local memory, which is essentially uncached and eats memory bandwidth. So I created the alternate version where you pass the matrix elements a, b, c, d explicitly as individual arguments. Then all data stays in registers, as you can verify by using the -keep compiler option and inspecting the resulting PTX code.
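To make the difference concrete, here is the kernel's write-back in both variants (my annotation; the local-memory spill is what the profiler showed with my toolchain, and other compiler versions may behave differently):

matrix2_2 tmp = g_idata[sample][tid] * g_idata[sample][tid].inv();
g_odata[sample].set(tid, tmp);                         // struct argument: spilled to local memory
g_odata[sample].set(tid, tmp.a, tmp.b, tmp.c, tmp.d);  // individual elements: stay in registers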

I hope you have enjoyed this little tutorial. The code isn’t awfully complex, and maybe you can make use of the cudacomplex.h and complexmatrix.h headers in some other projects.

I am attaching a Visual C++ 2008 project file. Let me know if you were able to create a Makefile and run this on Linux or Mac. I haven’t tested compilation of the matrix2_2 class on these systems yet.

Christian

P.S.: So why would you need to work with lots of 2x2 matrices in parallel? A practical use of parallel complex 2x2 matrix inversions would be in communications engineering, e.g. the simulation of a 2x2 MIMO antenna system using OFDM modulation. There you essentially have to perform matrix computations for every radio carrier in each simulation time step, and there can be hundreds or thousands of carriers in parallel in an OFDM system such as LTE or 802.11 draft n.
complextest_vc90.zip (1.95 KB)

Errata, comments regarding the 2x2 code above:

Aaah, sorry guys. My initially posted code underestimated the GFlops count by a factor of about 5. I originally counted each maths operator on a cudacomplex as one FLOP, and of course that’s not true. A complex addition is 2 FLOPs, and a complex multiplication amounts to even more. So I went to check the PTX file and counted the FLOPs thoroughly this time. I found 90 tightly packed floating point ops, 15 thereof MADs (fused multiply-adds), so that is 105 FLOPs per thread.
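For reference, the underlying counts are just standard complex arithmetic, nothing specific to my code: an addition (a+bi) + (c+di) costs 2 real additions, while a multiplication (a+bi)(c+di) = (ac - bd) + (ad + bc)i costs 4 real multiplications and 2 additions, i.e. 6 FLOPs. A MAD instruction counts as 2 FLOPs (one multiply plus one add), which is how 90 instructions with 15 MADs add up to 75 + 2*15 = 105 FLOPs.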

I also just tested this on Linux. Except for a few “missing braces around initializer” warnings, the code compiles and works fine. I’ve since updated cudacomplex.h to no longer produce these warnings. As a side note, the quite current gcc compiler on openSUSE 11.1 is giving my GPU a hard time: with a little bit of extra ISSE and multithreading magic, the CPU would actually beat the combined memcpy and processing time of the GPU.

Copy+Processing time (GPU): 226.570999 (ms)
	 Processing time (CPU): 354.365997 (ms)

Finally, there was a glitch in the matrix2_2 subtraction operator: it had the operands backwards, so it would return the wrong sign on the result. Fixed this (the listing above already includes the fix).

Christian

Replying to one’s own thread is lame, but I’ve got some news regarding 4x4 matrix operations:

Based on the code posted above, I’ve constructed a matrix4_4 object consisting of four matrix2_2 objects. Likewise, I created a matrixstack4_4 consisting of four matrixstack2_2 objects. I will not post my new 4x4 source code, but leave this as an exercise for the reader.

For the 4x4 matrix inversion I used a blockwise partitioning scheme as described in http://www.da.isy.liu.se/pubs/eilert/eilert-iscas2007.pdf (sorry, the link to the full paper is not available, only the abstract). An alternative solution is given in “Numerical Recipes in C”. An explicitly unrolled multiplication operator for 4x4 matrices was a bit messy to program because of the hierarchical sub-partitioning, but it works fine; see the identity below.
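For readers who want to reconstruct it: the usual blockwise inversion identity (standard linear algebra; I am not quoting the paper's exact notation) partitions the matrix into 2x2 blocks and only ever needs 2x2 inversions and multiplications, so the matrix2_2 machinery above carries over directly:

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix},\qquad S = D - C A^{-1} B,\qquad M^{-1} = \begin{pmatrix} A^{-1} + A^{-1} B S^{-1} C A^{-1} & -A^{-1} B S^{-1} \\ -S^{-1} C A^{-1} & S^{-1} \end{pmatrix},$$

valid whenever $A$ and the Schur complement $S$ are invertible.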

The memory access remains fully coalesced and all data is held fully in registers (no local memory is used).

Now I am able to invert and multiply about 19.3 million 4x4 matrices per second (achieved memory bandwidth is 4.6 GBytes/sec, GFlops is roughly 22.4), all that on the same “cheap” laptop hardware as above.

The resulting PTX code requires 81 registers per thread, which is a bit on the high side for a G80/G92 chip. For this reason I am now running only 64 threads per block. There are about 1158 densely packed FLOP instructions per thread in the PTX, which seems to make the application more compute limited than bandwidth limited. However, I am not even able to get close to the advertised 120 GFlops of the 9600M GT laptop GPU.
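Some quick occupancy arithmetic (mine, assuming the 8192 registers per multiprocessor of G80/G92 parts): a 64-thread block needs 64 x 81 = 5184 registers, so only one block fits per multiprocessor, and 128 threads (10368 registers) would already exceed the register file. That low occupancy leaves little latency hiding and presumably contributes to the gap to the theoretical peak.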

—UPDATE

Here’s the 4x4 matrix benchmark run repeated on an nVidia 8800GT with 112 shaders

Complex class and Complex 4x4 matrix class functional test: Performing
1000000 complex 4x4 matrix inversions and multiplications (in 128000000 bytes)
Copy+Processing time (GPU): 215.403976 (ms)
Only processing time (GPU): 11.187735 (ms)
GPU read/write performance during computation: 21.310711 GBytes/s
Matrices processed per second (inversion+multiplication): 89383600
Estimated Gflops/s: 103.506209
	 Processing time (CPU): 4350.023926 (ms)
CPU<->GPU comparison Test PASSED
Mathematical correctness Test PASSED

Scary that the actual inversion and multiplication of 1 million original matrices with their inverses takes only 11.18 milliseconds, considering that the CPU code took 4.3 seconds to repeat the same (without ISSE vectorization or multithreading). The speed-up is about a factor of 400, not counting memory copy overhead.

Can’t wait to repeat this on a G200 chip with its 16384 registers.

—END UPDATE

What I don’t understand is why the Visual Profiler reports 80000 coalesced loads but 640000 coalesced stores. That doesn’t make much sense to me: the PTX contains no branches and exactly the same number of global v2.f32 vector loads as v2.f32 vector stores, so both numbers should theoretically match up.

Anyway, 4x4 MIMO antenna array and MMSE receiver simulation, here we go ;)

Christian