Electrostatic calculations - Need help optimizing

I started a small project that calculates electrostatic field lines and displays them with OpenGL. The purpose was to learn CUDA and get familiar with making visual representations of the results. I’m a bit stuck on optimizing the “master” kernel. I’m not as interested in squeezing maximum performance out of this particular kernel as I am in getting comfortable with effective optimization techniques. Here’s the code:

[codebox]template <typename T>
struct Vector3
{
	T x, y, z;
};[/codebox]

[codebox]// Allows point charges to be read and written as float4 type
template <>
struct __align__(16) pointCharge<float>
{
	union
	{
		float4 ftype;
		struct
		{
			Vector3<float> position;
			float magnitude;
		};
	};
};[/codebox]

[codebox]__global__ void CalcField_SPkernel(Vector3<float> *FieldVecs, pointCharge<float> *Charges,
				unsigned int n, unsigned int p, unsigned int fieldIndex, float resolution)
{
	unsigned int tx = threadIdx.x;
	unsigned int ti = blockDim.x * blockIdx.x + tx;
	// Shared array to hold the charges
	__shared__ pointCharge<float> charge[BLOCK_X];
	// Previous point, used to calculate the current point, and cumulative field vector
	Vector3<float> point, temp = {0, 0, 0};
	// Number of iterations of the main loop
	unsigned int steps = (p + BLOCK_X - 1) / BLOCK_X;
	// Load starting point
	point = FieldVecs[n * (fieldIndex - 1) + ti];
	for (unsigned int step = 0; step < steps; step++)
	{
		// Load point charges from global memory
		unsigned int pId = step * BLOCK_X + tx;
		if (pId < p)			// This if statement has minimal impact on performance
			charge[tx] = Charges[pId];
		else
			charge[tx].magnitude = 0;	// A magnitude of 0 should not affect the result
		// Wait for all loads to complete
		__syncthreads();
		// Completely unroll the following loop
		#pragma unroll
		for (unsigned int i = 0; i < BLOCK_X; i++)
		{
			vec3Addto(temp, electroPartField(charge[i], point));	// 15 FLOPs (electroPartField) + 3 FLOPs
		}
		// Make sure every thread is done reading before the next iteration overwrites the shared charges
		__syncthreads();
	}
	// Finally, add the unit vector of the field divided by the resolution to the previous point to get the next point
	FieldVecs[n * fieldIndex + ti] = vec3Div(vec3Add(point, vec3Unit(temp)), resolution);	// 15 FLOPs (3 div + 3 add + 9 unit)
}[/codebox]

Now, Vector3 is a 12-byte type, and I was not able to coalesce the memory reads. I tried loading the array into shared memory in pieces, but the overhead associated with that was too great, and performance poor. I’m currently getting about 136 GFLOP/s (on a 9800GT) with 16384 threads, and it maxes out around 147 GFLOP/s at 20480 threads or more. My main concern is getting more performance with these large data sets. I’m not as worried about “multithreading” individual threads.

Any suggestion is greatly appreciated.

Also, the block size is 64. I’ve tried other values, but this one produces the best performance.

I like this project a lot.

How many point charges do you usually have? Moving these to constant memory might help. This memory is cached, whereas global memory is not.

I’m not sure whether this is any faster than your use of shared memory; however, if you keep the point charges in constant memory, the shared memory becomes available for something else.

I see that each thread block computes one “resolution” step of several field lines and then terminates. You could make each thread block compute several “resolution” steps in an outer loop, holding intermediate results, such as the previously computed field line vector in shared memory (or a local variable “point”, as currently done). After moving the point charges to constant memory, computing several of these vector positions in a loop will make good use of the constant memory cache. Right now you have a global memory load (to fill the variable “point”) and one global memory write per iteration (the last line of the kernel).

If you computed sixteen resolution steps in a loop, then you would have one read and 16 writes. So you would save 15 reads and get at most twice the throughput (GFLOPS). How large the speed-up will be, will depend a lot on the total amount of point charges that you’re accumulating.
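To make the suggestion concrete, here is a CPU-side sketch of the restructure (the layout matches FieldVecs in the kernel: step-major, n points per step; `advanceOneStep` is just a placeholder for the real charge-accumulation loop, not code from the project):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Placeholder for the real field computation; just shifts the point.
static Vec3 advanceOneStep(const Vec3 &p) {
    return Vec3{p.x + 1.0f, p.y, p.z};
}

// Advance one field line stepsPerLaunch steps in a single "launch",
// keeping the current point in a local variable and writing each
// computed step to the step-major array. Returns the number of
// global reads performed: one, instead of one per step.
size_t advanceLine(std::vector<Vec3> &fieldLines, size_t n, size_t line,
                   size_t firstStep, size_t stepsPerLaunch) {
    Vec3 point = fieldLines[n * (firstStep - 1) + line]; // single read
    for (size_t s = 0; s < stepsPerLaunch; ++s) {
        point = advanceOneStep(point);
        fieldLines[n * (firstStep + s) + line] = point;  // one write per step
    }
    return 1;
}
```

With 16 steps per launch, the per-line global reads drop by a factor of 16, and so does the number of kernel invocations.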

Oh… and use the --keep option to have a look at the generated PTX. Often you’ll see that the compiler chooses sub-optimal instructions to achieve what you want. Then you can tweak the .cu source code until the PTX becomes more elegant.

And one final thought. I noticed your computation is separable (along the x,y, and z coordinates) right until the very last line where you divide by the length of the vector. So you could be doing the field summation component-wise. If you operate on a struct of arrays instead of an array of structs like detailed in the attached nVidia paper you will achieve coalesced memory access when you do the point charge field summation component by component.
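For illustration, a minimal host-side sketch of the AoS-to-SoA split that the paper describes (the names are mine, not from the project):

```cpp
#include <cassert>
#include <vector>

// An array of 12-byte {x,y,z} structs is split into three separate
// float arrays, so that consecutive threads reading component i touch
// consecutive 4-byte words -- the pattern the hardware can coalesce.
struct Vec3Aos { float x, y, z; };

struct Vec3Soa {
    std::vector<float> x, y, z;
};

Vec3Soa toSoa(const std::vector<Vec3Aos> &aos) {
    Vec3Soa soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const Vec3Aos &v : aos) {   // one pass, component-wise scatter
        soa.x.push_back(v.x);
        soa.y.push_back(v.y);
        soa.z.push_back(v.z);
    }
    return soa;
}
```

On the GPU the three arrays would sit in global memory, so thread i reading x[i] is adjacent to thread i+1 reading x[i+1].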

Regarding this statement, do you mean the BLOCK_X define or the number of threads in a thread block?

Christian

For testing, I have about 1000 point charges. Of course, that could be any arbitrary number, but I see what you mean; it is small enough to fit in constant memory. I was reading them from texture memory at one point, but that proved to be slower. Now, since both constant and texture memory are cached, is there an advantage to using one over the other?

I have made some slight modifications since then, so I will definitely try that again.

I’m not sure what you mean. Each thread has its “own” field line to process, and in the end computes the resolution step for its field line. The resolution indicates how “short” a step should be (the higher the resolution, the shorter the step)

There was a slight bug that I corected:

[codebox]
// This is wrong
vec3Div(vec3Add(point, vec3Unit(temp)), resolution);
// I corrected it to
vec3Add(point, vec3Div(vec3Unit(temp), resolution));
[/codebox]

Thanks for the PDF. I was wondering if there is a reasonable way to coalesce 12-byte types. I’ll have to study the matter a little more, as the electroPartField function is set up to take structures as parameters. I’m also passing the data as an array of structures to the kernel wrapper, so I’ll have to see how to copy it properly. I don’t think it should be too hard to come up with something.

Also, electroPartField() uses all three components of a vector to determine the partial field, making the computation non-separable at that stage.

Both. It’s a one-dimensional block. Here’s the kernel wrapper:

template <>
inline int CalcField_core<float>(Vector3<float> *fieldLines, unsigned int steps, unsigned int lines,
		pointCharge<float> *pointCharges, unsigned int points,
		float resolution)
{
	// Compute dimension requirements
	dim3 block(BLOCK_X, 1, 1),
		grid(((unsigned int)lines + BLOCK_X - 1)/BLOCK_X, 1, 1);
	// Call the kernel
	for (unsigned int i = 1; i < steps; i++)
	{
		CalcField_SPkernel<<<grid, block>>>(fieldLines, pointCharges, lines, points, i, resolution);
	}
	return 0;
}

I guess I could make a structure of arrays like the following:

template <typename T>
struct Vector3SOA
{
	float2 *xyInterleaved;
	float *z;
};

With that, I could easily read a Vector3 with a 64-bit coalesced and 32-bit coalesced access into the “point” variable. However, I would need a way to copy from device to host memory with an element size of 8 and 4 bytes respectively, and a stride of 16 bytes. To my knowledge, CUDA does not provide such a memcpy routine.

EDIT - I was reading through the reference manual. Would cudaMemcpy2D() be suited for the job?
EDIT2 - Experimenting, I answered my own question. cudaMemcpy2D() is what I need.
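For anyone following along, here is a rough CPU model of the strided copy that cudaMemcpy2D performs here (an illustration of the pitch semantics, not the CUDA implementation): `width` bytes are copied out of every `spitch`-byte source row into consecutive `dpitch`-byte destination rows.

```cpp
#include <cassert>
#include <cstring>

// With width = 8, spitch = 12 (sizeof(Vector3<float>)), dpitch = 8,
// each "row" is one vector, and the copy packs the xy pairs tightly --
// exactly the strided gather needed to build the xyInterleaved array.
// Offsetting the source by 8 bytes with width = 4 extracts the z array.
void memcpy2dModel(void *dst, size_t dpitch, const void *src, size_t spitch,
                   size_t width, size_t height) {
    for (size_t row = 0; row < height; ++row)
        std::memcpy((char *)dst + row * dpitch,
                    (const char *)src + row * spitch, width);
}
```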

OK, I am getting around 139 GFLOP/s now that I’ve optimized memory access.

Here’s the kernel:

[codebox]__global__ void CalcField_SPkernel_coalesced(float2 *xyInterleaved, float *z, pointCharge<float> *Charges,
				unsigned int n, unsigned int p, unsigned int fieldIndex, float resolution)
{
	unsigned int tx = threadIdx.x;
	unsigned int ti = blockDim.x * blockIdx.x + tx;
	// Shared array to hold the charges
	__shared__ pointCharge<float> charge[BLOCK_X];
	// Previous point, used to calculate the current point, and cumulative field vector
	Vector3<float> point, temp = {0, 0, 0};
	// Number of iterations of the main loop
	unsigned int steps = (p + BLOCK_X - 1) / BLOCK_X;
	// Load starting point
	float2 ptXY = xyInterleaved[n * (fieldIndex - 1) + ti];
	point.x = ptXY.x;
	point.y = ptXY.y;
	point.z = z[n * (fieldIndex - 1) + ti];
	for (unsigned int step = 0; step < steps; step++)
	{
		// Load point charges from global memory
		unsigned int pId = step * BLOCK_X + tx;
		if (pId < p)
			charge[tx] = Charges[pId];
		else
			charge[tx].magnitude = 0;	// A magnitude of 0 should not affect the results
		// Wait for all loads to complete
		__syncthreads();
		// Completely unroll the following loop
		#pragma unroll
		for (unsigned int i = 0; i < BLOCK_X; i++)
		{
			temp += electroPartField(charge[i], point);	// ElectroPartFieldFLOP + 3 FLOPs
		}
		// Make sure every thread is done reading before the next iteration overwrites the shared charges
		__syncthreads();
	}
	// Finally, add the unit vector of the field divided by the resolution to the previous point to get the next point
	point = point + vec3SetLen(temp, resolution);	// 13 FLOPs (10 set len + 3 add)
	ptXY.x = point.x; ptXY.y = point.y;
	xyInterleaved[n * fieldIndex + ti] = ptXY;
	z[n * fieldIndex + ti] = point.z;
}[/codebox]

Low level wrapper:

[codebox]template <>
inline int CalcField_coreCoal<float>(Vec3SOA<float> &fieldLines, unsigned int steps, unsigned int lines,
		pointCharge<float> *pointCharges, unsigned int points,
		float resolution)
{
	// Compute dimension requirements
	dim3 block(BLOCK_X, 1, 1),
		grid(((unsigned int)lines + BLOCK_X - 1)/BLOCK_X, 1, 1);
	// Call the kernel
	for (unsigned int i = 1; i < steps; i++)
	{
		CalcField_SPkernel_coalesced<<<grid, block>>>(fieldLines.xyInterleaved, fieldLines.z,
			pointCharges, lines, points, i, resolution);
	}
	return 0;
}[/codebox]

And high level wrapper:

[codebox]template <class T>
inline int CalcField_wrap(Array<Vector3<T> >& fieldLines, Array<pointCharge<T> >& pointCharges,
		size_t n, T resolution, perfPacket& perfData, bool useTex)
{
	// Used to measure execution time
	LARGE_INTEGER lfreq, lstart, lend;
	QueryPerformanceFrequency(&lfreq);
	unsigned __int64 freq = lfreq.QuadPart;
	// Place runtime checks here
	// Check the amount of needed memory
	size_t neededRAM = fieldLines.GetSizeBytes() + pointCharges.GetSizeBytes();
#if _DEBUG
	printf(" Call requires %u MB to complete\n", (unsigned int)(neededRAM/1024/1024));
#endif
	// Get sizing information
	size_t size, steps = fieldLines.GetSize()/n;
	unsigned int p = (unsigned int)pointCharges.GetSize();
	// Memory allocation
	pointCharge<T> *charges;
	size = pointCharges.GetSizeBytes();
	CUDA_SAFE_CALL(cudaMalloc((void**)&charges, size));
	CUDA_SAFE_CALL(cudaMemcpy(charges, pointCharges.GetDataPointer(), size, cudaMemcpyHostToDevice));
	CUDA_SAFE_CALL(cudaBindTexture(0, constSPcharges, charges, size));
	Vec3SOA<T> coalVec;
	size_t compSize = (fieldLines.GetElemSize()*2)/3;
	size = steps*n*compSize;
	CUDA_SAFE_CALL(cudaMalloc((void**)&coalVec.xyInterleaved, size));
	CUDA_SAFE_CALL(cudaMemcpy2D(coalVec.xyInterleaved, compSize,
		fieldLines.GetDataPointer(), fieldLines.GetElemSize(),
		compSize, steps*n,
		cudaMemcpyHostToDevice));
	compSize /= 2;
	size = steps*n*compSize;
	CUDA_SAFE_CALL(cudaMalloc((void**)&coalVec.z, size));
	CUDA_SAFE_CALL(cudaMemcpy2D(coalVec.z, compSize,
		(char*)fieldLines.GetDataPointer() + compSize*2, fieldLines.GetElemSize(),
		compSize, steps*n,
		cudaMemcpyHostToDevice));
	QueryPerformanceCounter(&lstart);
	// Call the core function
	CalcField_coreCoal(coalVec, (unsigned int)steps, (unsigned int)n, charges, p, resolution);
	cudaThreadSynchronize();
	QueryPerformanceCounter(&lend);
	// Compute performance info
	perfData.time = (double)(lend.QuadPart - lstart.QuadPart) / freq;
	perfData.performance = ((steps - 1) * CalcField_kernelFLOP(n, p)) / perfData.time / 1E9;	// Convert from FLOPS to GFLOPS
	// Retrieve data
	compSize = (fieldLines.GetElemSize()*2)/3;
	CUDA_SAFE_CALL(cudaMemcpy2D(fieldLines.GetDataPointer(), fieldLines.GetElemSize(),
		coalVec.xyInterleaved, compSize,
		compSize, steps*n,
		cudaMemcpyDeviceToHost));
	compSize /= 2;
	CUDA_SAFE_CALL(cudaMemcpy2D((char*)fieldLines.GetDataPointer() + compSize*2, fieldLines.GetElemSize(),
		coalVec.z, compSize,
		compSize, steps*n,
		cudaMemcpyDeviceToHost));
	// Free device memory
	CUDA_SAFE_CALL(cudaFree(charges));
	return 0;
}[/codebox]

I know this needs a ton of cleanup, but I will leave that for tomorrow.

What I was trying to suggest is that one kernel invocation should perform the calculation to advance the field lines several steps forward (storing all computed steps to global memory).

This will lower the number of global memory reads that are required - and the number of kernel invocations in total (reducing overhead). When using textures or constant memory to hold the point charges, these will likely remain cached during kernel execution (unless you spill the 8 KB cache with too many charges).

I believe between distinct kernel calls the caches are always getting flushed, so you wouldn’t see a lot of benefit from the constant or texture caches when setting N to 1.

Have you considered making your step resolution inversely proportional to the electrical field strength? Alternatively: the more the field line is bending, the shorter the step size should be. This will lower the number of vertices that the OpenGL viewer has to plot, without diminishing the accuracy much.
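Something like the following could implement that (a plain sketch; the constant k and the clamping bounds are tuning parameters I made up, not anything from the project):

```cpp
#include <cassert>
#include <cmath>

// Make the step length inversely proportional to the local field
// strength, clamped to [minStep, maxStep] so that a near-zero field
// doesn't produce an unbounded step. k is a scene-dependent constant.
float adaptiveStep(float fieldMagnitude, float k,
                   float minStep, float maxStep) {
    if (fieldMagnitude <= k / maxStep)   // field too weak: cap the step
        return maxStep;
    float step = k / fieldMagnitude;     // inverse proportionality
    return step < minStep ? minStep : step;
}
```

The same shape works if you feed it a curvature estimate instead of the field magnitude.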

Christian

Christian,

Thanks a lot for your input. You definitely gave me some great ideas for optimizations.

Thanks. That sounds like a good idea. I’ll get working on it.

I’m currently using 1000 charges, 16 bytes each, so that will definitely spill over the 8 KB boundary. One idea is to use textured reads when the number of charges is less than 512.

I’m thinking of computing the curvature (+ some constant, so that I don’t have to worry when the curvature is zero), and using its inverse as the resolution, or something similar. The extra calculations should not bring much overhead, as the hot spot is in the partial field calculation and summation.

Alex G.

As a final idea - quantizing each charge’s x, y, z position (and strength) to 16-bit signed integers would allow you to double the number of charges at the expense of precision. In the kernel you would dequantize the values to floats for the computation. It depends on the geometry of your scenario whether such an approximation is acceptable or not.
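A possible shape for that quantization, assuming the world coordinates fit a known range (the `range` parameter and the rounding choice are assumptions on my part):

```cpp
#include <cassert>
#include <cstdint>
#include <cmath>

// Map a coordinate from [-range, +range] onto a signed 16-bit integer,
// halving the storage per component. The kernel would dequantize back
// to float before the field computation.
int16_t quantize(float v, float range) {
    float scaled = v / range * 32767.0f;        // -> [-32767, 32767]
    if (scaled > 32767.0f) scaled = 32767.0f;   // clamp out-of-range input
    if (scaled < -32767.0f) scaled = -32767.0f;
    return (int16_t)std::lround(scaled);
}

float dequantize(int16_t q, float range) {
    return (float)q / 32767.0f * range;
}
```

The worst-case round-trip error is about range/65534 per component, which is what decides whether the approximation is acceptable for a given scene.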

Emulating 16-bit floats using an 8-bit exponent (+ sign) and an 8-bit mantissa would also be an idea (maybe a little slow?).

By the way… for visualization: do your point charges have a radius? If you showed some of the larger charges as 3D spheres you’d get a nice effect. The field lines would start from the surface of one sphere and terminate on the surface of another. Your kernel could be extended to stop computing a field line when it hits one of the spheres.

Once your application produces some cool images, show us some screen shots please. Or even better - post a binary or source code ;)

Combining the CUDA particle and nbody samples with your code would be ultimately cool. Thousands of 3D balls moving according to gravity and electrostatic forces AND the field lines visible. Woo-hoo.

Christian

Here’s a small (64-bit) binary. I’ve reduced the data size to 128 lines for demo purposes (so only 2 MPs are used).

Use WSAD for front/back/strafe, QETG for looking left/right and up/down, and IO for changing the field of view.

It’s far from ready, but shows the direction I’m aiming for.

Alex
Electromag.zip (657 KB)

Hmm, using textured reads, I can eliminate the outer loop and manually unroll the remaining loop, but the global/shared-memory kernel is still about 4% faster, and that is for cases where the charges don’t spill the 8 KB boundary. I tried unrolling the loop 16, 32, and 64 times and got the same performance every time. I’ve used the occupancy calculator, and a block of 64 or 128 threads gives me the maximum occupancy of 67%, so there’s no optimization in that direction.
Also, for some odd reason, the texturing kernel spits out QNANs whenever the number of charges is greater than 512, but works fine otherwise. I will stick with the regular kernel for now.

EDIT - I’m not very fond of reducing precision for the point charges. I plan to use this code a few years down the road for hardcore simulations. The electric field calculation is just the first kernel; I chose it as my starting point because it’s the easiest calculation to build and to visually display the results for.
About the point charges: they do not have a radius. I’ve looked over the nbody code (my favorite SDK example by far) to see how they do it. I see they use sprites and GLEW (I have no idea what that is), and I don’t know how that code works. I prefer to write my own code rather than copy code I do not understand; that’s the best way to learn.
Of course, I am planning to give the point charges a more exciting appearance than some blue dots on the screen.

Alex

EDIT2 - I’m currently using constant memory when 511 or fewer charges are present, and regular global/shared memory otherwise.

Alright, I’ve done some refinements to the point that I’m feeling comfortable about releasing the source code. Feel free to play with it, change it as you like, and use it in your own projects. I welcome any fixes to the named known issues. Please post any constructive modifications so that I can incorporate them in the code.

You may encounter some problems compiling, as I have set up some custom environment variables - let me know if you have problems compiling under Windows. Linux users, sorry, I was not yet able to link the GPU code in my app.

Test setup (Windows / Linux):

  • Windows Server 2008 x64 / Fedora 10
  • Visual Studio 2008 v9.0.21022.8 / NetBeans 6.1
  • Intel C++ Compiler 11.0.072 / Intel C++ Compiler for Linux 11.0.072
  • CUDA SDK 2.1 64-bit / CUDA SDK 64-bit
  • CUDA toolkit 2.1 64-bit (and 32-bit) / CUDA toolkit 64-bit
  • NVIDIA GeForce 182.06 driver / N/A (tests performed under VMware)

Improvements and bugfixes:

  • Introduced a “multithreaded” kernel that scales well with data sizes that are not a multiple of the number of processors. (On a 9800GT, data that was not a multiple of 112 would produce 10% to 20% lower throughput; the new kernel loses roughly 2 to 3% when the data is not a multiple.)
  • Fixed a potential bug where reads would not be coalesced if the size of a row was not a multiple of the alignment width.
  • Fixed a potential bug where, under certain compilation conditions, starting and using OpenGL to render the data would cause a segmentation fault or access violation.
  • Redesigned host-to-device memory copies to take advantage of non-paged memory, and ensured coalescing occurs with all data sizes. This helped reduce data transfer time by roughly two orders of magnitude.

Known issues:

  • The “non-multithreaded” kernel is about 7 to 8% faster than the “multithreaded” kernel when the data size is a multiple of the number of multiprocessors.
  • Under 64-bit mode, when not compiling with -maxrregcount 18, 19, or 20, the “multithreaded” kernel uses 21 registers, which reduces its throughput significantly.
  • With the 182.06 driver, non-paged to device memory copies seem to be capped around 3.3 GB/s.
  • Under OpenGL, depth-buffering, blending, and antialiasing are not working, although they are explicitly enabled.
  • The CPU implementation of the calculations may use SSE instructions inefficiently, as not all elements of a vector are aligned to a 16-byte boundary.
  • Other unknown issues may exist under 32-bit mode, as I did not test intensively under 32-bit mode.

Linux cross-compatibility issues:

  • The code will need small adjustments to compile with GCC or Intel C++ under Linux, although threading and timing are cross-implemented.
  • Under Linux, enabling depth-buffering, blending, or antialiasing will cause a segmentation fault.
  • I was not able to test the GPU kernel under Linux, as I was not able to link the GPU code.

Thank you for posting this. I am going to check this out for sure. Maybe I can even learn a new thing or two.

Thank you for your interest. It’s nice to know someone is interested in my little project.

I will be posting more regular updates on my personal website, http://g-tech.homeserver.com/HPC.htm, as I don’t want to clutter this thread with mini updates.

Alex

A few sleepless nights and a few indexing bugs later, multi-GPU support is well fed and running.

Any suggestions and feedback (including bashing for naive algorithms) are appreciated.
Alex
Electromagnetism_sim.zip (126 KB)