I'm trying to create a FEM solver using the Jacobi iterative method. My data is structured in the following way: I have one table that indicates how many neighbors each node has, a second one that lists which nodes are the neighbors of that node, and a last one that holds the values of the nodes. The idea is to take the average of the neighbors and store it in a new table. In all tables, the row number indicates the node number.
So, for example, for each node the idea is to get the average of the nodes indicated in table number 2. The main problem I have is that not all rows have the same number of neighbors.
The tables are stored in row format, and table number 2 is a sparse matrix whose number of elements per row is given by table number 1.
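To make the layout concrete, here is a small made-up example (not my real data) of how I understand the tables; note that table number 1 holds a running (cumulative) count, which is why the kernel below subtracts the previous entry:
    // Made-up 4-node example, just to illustrate the layout
    unsigned int h_numNeigh[4] = { 2, 5, 8, 10 };                        // cumulative neighbor counts
    unsigned int h_neigh[10]   = { 2, 3,  1, 3, 4,  1, 2, 4,  2, 3 };    // 1-based neighbor lists, packed node after node
    float        h_meas[4]     = { 1.0f, 2.0f, 3.0f, 4.0f };             // current node values
    // New value for node 1 would be (h_meas[1] + h_meas[2]) / 2 = 2.5f, since its neighbors are nodes 2 and 3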
I already made an attempt, but it is 3 times slower than my C++ implementation.
Here is the code:
__global__ void Jacobi_Kernel(unsigned int *d_neigh, unsigned int *d_numNeigh, float *d_meas, float *d_results) // add for the submatrix
{
    int Node = blockIdx.x;                        // one block per node
    unsigned int numData = d_numNeigh[Node];      // cumulative neighbor count up to this node
    float Sum = 0;

    if (Node < NUM_NODES && Node > 0) {
        unsigned int temp = d_numNeigh[Node - 1]; // where this node's neighbor list starts
        numData -= temp;                          // number of neighbors of this node
        for (unsigned int i = 0; i < numData; i++)
        {
            Sum += d_meas[d_neigh[i + temp] - 1]; // neighbor indices are 1-based
        }
    }

    if (Node == 0) {
        for (unsigned int i = 0; i < numData; i++)
        {
            Sum += d_meas[d_neigh[i] - 1];
        }
    }

    if (Node < NUM_NODES)
        d_results[Node] = Sum / numData;
}
I hope you can give me some advice on how to improve my code.
The big drawback with this sort of problem on the GPU is the unpredictable memory access patterns, which makes achieving anything resembling coalesced memory access very hard indeed. I generally work with structured continuum meshes, which makes life considerably easier, so I haven't tried to do this for unstructured meshes, but textures spring immediately to mind. If you bind the node quantities to textures and use texture reads, you might get a considerable improvement, because there is caching and textures tend to work well for spatially coherent data, which is what you have here (or can have if you do some preliminary spatial ordering of the nodes or graph partitioning on the mesh). If you use some shared memory as an output buffer, you should be able to coalesce the writes as well.
(Disclaimer: these are only ideas, I have no proof of concept that they would actually work).
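For instance, the texture idea might look roughly like this using the texture reference API (a minimal, untested sketch; the kernel and texture names are mine):
    // Texture reference for the node values (what is currently d_meas)
    texture<float, 1, cudaReadModeElementType> tex_meas;

    __global__ void Jacobi_Kernel_Tex(unsigned int *d_neigh, unsigned int *d_numNeigh,
                                      float *d_results)
    {
        int Node = blockIdx.x;
        if (Node >= NUM_NODES) return;

        unsigned int first   = (Node > 0) ? d_numNeigh[Node - 1] : 0;
        unsigned int numData = d_numNeigh[Node] - first;

        float Sum = 0.0f;
        for (unsigned int i = 0; i < numData; ++i)
            // cached read through the texture unit instead of a plain global load
            Sum += tex1Dfetch(tex_meas, (int)(d_neigh[first + i] - 1));

        d_results[Node] = Sum / numData;
    }

    // Host side, before the kernel launch:
    // cudaBindTexture(0, tex_meas, d_meas, NUM_NODES * sizeof(float));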
I have no idea what an FEM solver is… but your code is absolutely not coalesced (even apart from what avidday is talking about).
for example:
int Node = blockIdx.x;
unsigned int numData = d_numNeigh[Node];
the code above makes all threads in a block read the same value. You should do something like this:
__shared__ unsigned int smNumData;   // This will replace the numData variable...
if (0 == threadIdx.x)
    smNumData = d_numNeigh[Node];
__syncthreads();
// Now all your threads have the correct value...
The same goes for these lines - they should all be read into shared memory and shared between the threads…
unsigned int temp = d_numNeigh[Node - 1];
numData -= temp;
for (unsigned int i = 0; i < numData; i++)
{
    Sum += d_meas[d_neigh[i + temp] - 1];
}
I guess the whole code should be re-evaluated, as it's not certain that you've managed to parallelize it correctly on the GPU.
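For example, here is a rough, untested sketch of one way to do it - one block per node, where the threads of the block split the neighbor list between them and reduce partial sums in shared memory (BLOCK_SIZE is a power-of-two thread count you'd have to define yourself):
    __global__ void Jacobi_Kernel_Block(unsigned int *d_neigh, unsigned int *d_numNeigh,
                                        float *d_meas, float *d_results)
    {
        __shared__ unsigned int smFirst, smNumData;
        __shared__ float smSum[BLOCK_SIZE];   // BLOCK_SIZE = threads per block, power of two

        int Node = blockIdx.x;
        if (Node >= NUM_NODES) return;

        // One thread fetches the offsets, the rest pick them up from shared memory
        if (0 == threadIdx.x) {
            smFirst   = (Node > 0) ? d_numNeigh[Node - 1] : 0;
            smNumData = d_numNeigh[Node] - smFirst;
        }
        __syncthreads();

        // Each thread accumulates a strided portion of the neighbor list
        float partial = 0.0f;
        for (unsigned int i = threadIdx.x; i < smNumData; i += blockDim.x)
            partial += d_meas[d_neigh[smFirst + i] - 1];

        smSum[threadIdx.x] = partial;
        __syncthreads();

        // Standard shared-memory tree reduction
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                smSum[threadIdx.x] += smSum[threadIdx.x + s];
            __syncthreads();
        }

        if (0 == threadIdx.x)
            d_results[Node] = smSum[0] / smNumData;
    }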
Avidday/others - on compute 1.3 (the newest cards), what would be the penalty of code like this:
int iValue = pValues[ blockIdx.x ];
What would be the penalty if all the threads loop through an array, all accessing the same memory location?
Does the new hardware relax the coalescing penalty? By how much?
On compute 1.3 devices, when a warp (or half-warp, I can't remember) reads a single global memory address, a broadcast mechanism is used and it ends up being a single transaction. On cc 1.1 devices there's no broadcast for global memory, AFAIR, and this results in 16 separate trips to memory - a remedy for this is the second code block eyalhir74 posted.
So int iValue = pValues[ blockIdx.x ]; is fine on cc 1.3.
When all threads in a warp (half-warp) read the same address in shared memory, it's a broadcast read on all devices (even cc 1.0) and there are no bank conflicts or anything like that.
I believe he’s talking about Finite Element Method (http://en.wikipedia.org/wiki/Finite_element_method). It would be helpful it people put the full name in the title instead of acronyms, which is especially helpful when searching the forums.