FEM Jacobi Method

Hi everybody

I'm trying to create a FEM solver using the Jacobi iterative method. My data is structured in the following way: I have one table that indicates how many neighbors each node has, a second one that lists which nodes are the neighbors of that node, and a last one that holds the value of each node. The idea is to take the average of the neighbors and store it in a new table. In all tables, the row number indicates the node number. For example:

table 1 (number of neighbors per node)

row 4: 3
row 5: 8

table 2 (neighbor indices of each node)

row 4: 1 2 3
row 5: 1 2 4 6 8 9 10 12

table 3 (value of each node)

row 1: 0.58963
row 2: 0.69774
...
row 9: 0.25689
The idea is to take the average of the nodes indicated in table 2. The main problem I have is that not all rows have the same number of neighbors.

The tables are stored in row format, and table 2 is a sparse matrix whose number of elements per row is given by table 1.
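In flat arrays the data looks roughly like this (just a sketch; the neighbor counts from table 1 are accumulated into a running sum, which is how the kernel below indexes them, and TOTAL_NEIGH is only a placeholder name):

// Sketch of the host-side layout (names are placeholders)
unsigned int h_numNeigh[NUM_NODES];   // table 1, accumulated: h_numNeigh[n] = total number of
                                      // neighbor entries up to and including node n
unsigned int h_neigh[TOTAL_NEIGH];    // table 2 flattened: all neighbor lists back to back (1-based node indices)
float        h_meas[NUM_NODES];       // table 3: current value of each node

// the neighbors of node n (n > 0) are h_neigh[h_numNeigh[n-1]] ... h_neigh[h_numNeigh[n] - 1];
// for node 0 they start at h_neigh[0]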

I already made an attempt, but it is 3 times slower than my C++ implementation.

Here is the code:

__global__ void Jacobi_Kernel(unsigned int *d_neigh, unsigned int *d_numNeigh, float *d_meas, float *d_results) // add for the submatrix
{
	int Node = blockIdx.x;

	// d_numNeigh holds the accumulated (running) neighbor counts, so the number of
	// neighbors of this node is d_numNeigh[Node] - d_numNeigh[Node - 1]
	unsigned int numData = d_numNeigh[Node];
	float Sum = 0;

	if (Node < NUM_NODES && Node > 0){

		unsigned int temp = d_numNeigh[Node - 1];	// start offset of this node's neighbor list

		numData -= temp;

		for (unsigned int i = 0; i < numData; i++)
		{
			Sum += d_meas[d_neigh[i + temp] - 1];	// neighbor indices are 1-based, hence the -1
		}

	}

	if (Node == 0){

		// the first node's neighbor list starts at offset 0
		for (unsigned int i = 0; i < numData; i++)
		{
			Sum += d_meas[d_neigh[i] - 1];
		}

	}

	if (Node < NUM_NODES)
		d_results[Node] = Sum / numData;

}

I hope you can give me some advice to improve my code.

Regards,
Luis

The big drawback with this sort of problem on the GPU is the unpredictable memory access patterns, which make achieving anything resembling coalesced memory access very hard indeed. I generally work with structured continuum meshes, which makes life considerably easier, so I haven't tried to do this for unstructured meshes, but textures spring immediately to mind. If you bind the node quantities to textures and use texture reads, you might get a considerable improvement, because there is caching, and textures tend to work well for spatially coherent data, which is what you have here (or can have, if you do some preliminary spatial ordering of the nodes or graph partitioning of the mesh). If you use some shared memory as an output buffer, you should also be able to coalesce the writes.
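For example, something along these lines using the texture reference API (just a rough, untested sketch - adapt the names to your own code):

texture<float, 1, cudaReadModeElementType> texMeas;	// node values bound to a 1D texture

// host side, after filling d_meas:
//     cudaBindTexture(0, texMeas, d_meas, NUM_NODES * sizeof(float));

// in the kernel, replace the uncached global read with a texture fetch:
//     Sum += tex1Dfetch(texMeas, d_neigh[i + temp] - 1);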

(Disclaimer: these are only ideas, I have no proof of concept that they would actually work).

Hi,

I have no idea what a FEM solver is… but your code is absolutely not coalesced (even before getting to what avidday is talking about).

for example:

int Node = blockIdx.x;
unsigned int numData = d_numNeigh[Node];

The code above makes all threads in a block read the same value. You should do something like this:

__shared__ unsigned int smNumData;	// This will replace the numData variable...

if ( 0 == threadIdx.x )
	smNumData = d_numNeigh[Node];

__syncthreads();

// Now all your threads have the correct value...

The same goes for these lines - they should all be read into shared memory and shared between the threads (see the sketch after the snippet)…

unsigned int temp = d_numNeigh[Node - 1];
numData -= temp;

for (unsigned int i = 0; i < numData; i++)
{
	Sum += d_meas[d_neigh[i + temp] - 1];
}
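For example, something like this (MAX_NEIGH is just a placeholder for an upper bound on the number of neighbors a node can have, and smNumData/temp would be filled in as above - only a sketch):

__shared__ unsigned int smNeigh[MAX_NEIGH];	// staging buffer for this node's neighbor indices

// each thread loads one neighbor index from global memory
if ( threadIdx.x < smNumData )
	smNeigh[threadIdx.x] = d_neigh[temp + threadIdx.x];

__syncthreads();

// from here on the threads can work on smNeigh[] instead of re-reading global memory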

I guess the whole code should be re-evaluated, as it's not clear that you've managed to parallelize it correctly on the GPU.

Avidday/others - on compute 1.3 (the newest cards), what would be the penalty of code such as this:

int iValue = pValues[ blockIdx.x ];

What would be the penalty of all the threads looping through an array, all accessing the same memory location?

Will the new hardware relax the coalescing penalty? By how much?

thanks

eyal

On compute 1.3 devices, when a warp (or half-warp, I can't remember which) reads a single global memory address, a broadcast mechanism is used, and it ends up being a single transaction. On cc 1.1 devices there's no broadcast for global memory, AFAIR, and this results in 16 separate trips to memory - a remedy for this is the second code block eyalhir74 posted.

So int iValue = pValues[ blockIdx.x ]; is fine on cc 1.3.

When all threads in a warp (or half-warp) read the same address in shared memory, it's a broadcast read on all devices (even cc 1.0) and there are no bank conflicts or anything of the sort.

I believe he's talking about the Finite Element Method (http://en.wikipedia.org/wiki/Finite_element_method). It would be helpful if people put the full name in the title instead of acronyms, especially for anyone searching the forums.

Thanks for the clarification. That means that the other piece of code:

for (unsigned int i = 0; i < numData; i++)
{
	Sum += d_meas[d_neigh[i] - 1];
}

will also use broadcasting. Aside from the fact that d_neigh and d_meas should have been read into smem by all threads whose id < numData, to hide latency, on CC 1.3 this code will not show a x16 performance degradation.

I can't help but wonder why NVIDIA would do such a thing. It lets people write bad, un-coalesced code and use faulty methods and coding practices on the GPU. Sometimes I think it would have been better to let the user suffer the x16 performance degradation, if only to make sure people write good code. What do you think?

eyal

You’re joking, right?

99% joking :)

I modified this part of the code and got some improvements. I also tried to use shared memory, but got no improvement at all.

int Node = blockIdx.x*blockDim.x + threadIdx.x;

Sorry, the code I posted the first time had this big error.

I'm still working on the use of textures. I don't know whether it will bring any improvement.
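Roughly, the whole kernel now looks like this with one thread per node (same data layout as before, just a sketch):

__global__ void Jacobi_Kernel(unsigned int *d_neigh, unsigned int *d_numNeigh, float *d_meas, float *d_results)
{
	int Node = blockIdx.x * blockDim.x + threadIdx.x;	// one thread per node

	if (Node >= NUM_NODES)
		return;

	// start offset and count of this node's neighbor list (d_numNeigh holds the running sum)
	unsigned int start = (Node > 0) ? d_numNeigh[Node - 1] : 0;
	unsigned int numData = d_numNeigh[Node] - start;

	float Sum = 0;
	for (unsigned int i = 0; i < numData; i++)
		Sum += d_meas[d_neigh[start + i] - 1];	// neighbor indices are 1-based

	d_results[Node] = Sum / numData;
}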