Thanks, Peter, for the quick reply.
- The code is an unstructured fluid dynamics FEM solver that has been reformulated to work on edges instead of elements. That has the advantage of reducing the flop count quite a bit. The indirect-addressing array is accessed sequentially, while the unknown array is reached in a more "jump all over the place" manner. The jump distances are reduced as much as possible with various renumbering techniques, but there is only so much we can do, especially with an unstructured 2D/3D mesh. I usually write unkno coalesced in groups of 5 to 10 depending on the type of solver used, and then the accesses become non-coalesced from point to point. This was coded that way for supercomputers using OpenMP. Most of the computation is done at the edge level and then scattered back to the points. On machines such as the SGI Altix, with its NUMA memory architecture, the result is a very high GFlops rate, very little penalty for cache misses (those machines have 4 MB caches), and a very good speed-up to 128 CPUs.
Well, trying to code the solver that way in CUDA is not a good approach, mostly because of the scatter. It is preferable to do twice the amount of work with only one scatter back to the points than to do the work once with two scatters, even if the writes are almost coalesced. I recoded a simple 1D solver to loop over the edges connected to each point, so that only one scatter is needed, and it is now much faster. Now, the 1D case uses coalesced data de facto (each point has one point before and one point after, except for the boundary points). I will check on a 2D case as soon as I have more time; there, the accesses will be non-sequential but with fewer scatters back.
I will therefore try to eliminate the scattering process as much as possible.
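For the record, the point-centric kernel I ended up with looks roughly like this (a minimal sketch, one thread per point; esup1/esup2 are hypothetical CSR-style "edges surrounding a point" arrays, and the flux expression is a placeholder). Each edge flux is computed twice, once per endpoint, but there is only one store per point:

// point-centric version: gather from edges, single write per point
__global__ void point_loop(const int *esup2, const int *esup1,
                           const int *list, const float *unkno,
                           float *rhs, int npoin)
{
    int ip = blockIdx.x * blockDim.x + threadIdx.x;
    if (ip >= npoin) return;

    float sum = 0.0f;
    for (int k = esup2[ip]; k < esup2[ip + 1]; ++k) {
        int ie = esup1[k];                    // edge connected to point ip
        int p1 = list[2 * ie];                // point 1 of edge ie
        int p2 = list[2 * ie + 1];            // point 2 of edge ie
        float flux = 0.5f * (unkno[p2] - unkno[p1]);  // placeholder flux
        sum += (ip == p1) ? flux : -flux;     // signed contribution
    }
    rhs[ip] = sum;   // the single write per point, no scatter conflicts
}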
- I am doing the following: for each point I compute an allowed timestep, and I then find the minimum of those timesteps, which becomes the timestep used for advancing my solution. Computing the timestep at each point is fairly efficient (some computation and only one scatter per point). The problem was getting the min and retrieving it from device memory into the host. I looked at the SDK example but need to do some timing to see whether I would gain anything from that methodology versus a simple 1-block/1-thread run on the device array. In an OpenMP Fortran paradigm it would be as simple and efficient as:
!$omp parallel do reduction(min:vmin)
do i = 1, n
   vmin = min(vmin, dt(i))
end do
On the device, the simple loop might be more GPU-efficient than the scanning version presented in the SDK; that example is computation-intensive, which the min is not.
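For what it is worth, the simple single-block version I have in mind looks like this (a minimal sketch: one block of 256 threads striding over the array, names are mine; only one float then goes back to the host):

#include <float.h>

__global__ void min_reduce(const float *dt, float *vmin, int n)
{
    __shared__ float smin[256];
    int tid = threadIdx.x;

    // each thread strides through the array keeping a running minimum
    float m = FLT_MAX;
    for (int i = tid; i < n; i += blockDim.x)
        m = fminf(m, dt[i]);
    smin[tid] = m;
    __syncthreads();

    // tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smin[tid] = fminf(smin[tid], smin[tid + s]);
        __syncthreads();
    }
    if (tid == 0) *vmin = smin[0];
}

// launched as min_reduce<<<1, 256>>>(d_dt, d_vmin, n), followed by a
// 4-byte cudaMemcpy of vmin back to the host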
- In fact the list is accessed linearly, but the indirect addressing is jumpy. Taking my example of edges, the list array is laid out as follows:
list[i]   = point 1 of edge 1
list[i+1] = point 2 of edge 1
list[i+2] = point 1 of edge 2
list[i+3] = point 2 of edge 2
...
For each edge i, I am then accessing the unknowns of point 1 and point 2:
unkno[list[i]*NBR_UNKNOWNS] and unkno[list[i+1]*NBR_UNKNOWNS]
The access is therefore non-coalesced. If I keep list as a standard global-memory array and move the unknowns to a texture, will that help? Of course, if I do so, the increments for each point and unknown will stay in a standard global-memory array. I will then have to update the texture containing the unknowns for the next cycle of the simulation. Will that help? What is the cost of copying a global array to a texture?
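If it helps the discussion, the variant I am picturing is roughly this (a sketch using the linear-memory tex1Dfetch path; all names are mine, and NBR_UNKNOWNS is just a placeholder value). If I read the documentation correctly, cudaBindTexture over the existing device array does not copy anything, which is part of what I want to confirm:

#define NBR_UNKNOWNS 5   // placeholder; 5 to 10 in practice

texture<float, 1, cudaReadModeElementType> unknoTex;

__global__ void edge_loop(const int *list, float *edflux, int nedge)
{
    int ie = blockIdx.x * blockDim.x + threadIdx.x;
    if (ie >= nedge) return;

    int p1 = list[2 * ie];        // coalesced: list is read sequentially
    int p2 = list[2 * ie + 1];

    // the jumpy reads now go through the texture cache
    float u1 = tex1Dfetch(unknoTex, p1 * NBR_UNKNOWNS);
    float u2 = tex1Dfetch(unknoTex, p2 * NBR_UNKNOWNS);
    edflux[ie] = u2 - u1;         // placeholder flux
}

// host side: bind over the existing device array before each cycle
// cudaBindTexture(0, unknoTex, d_unkno, npoin * NBR_UNKNOWNS * sizeof(float));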
- I looked into the CUBLAS library and am wondering whether I can use arrays allocated with cudaMalloc as arrays in CUBLAS. I cannot afford duplicating arrays.
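To be concrete, what I would like to do is something along these lines (a sketch against the CUBLAS v1 API; whether the cudaMalloc'd pointer is legal here is exactly what I want to confirm):

#include <cuda_runtime.h>
#include <cublas.h>

float device_norm(int n)
{
    float *d_x;
    cublasInit();
    cudaMalloc((void **)&d_x, n * sizeof(float));
    // ... my own kernels fill d_x in place ...

    // hoping I can hand the same pointer to CUBLAS instead of
    // allocating a second copy through cublasAlloc
    float nrm = cublasSnrm2(n, d_x, 1);

    cudaFree(d_x);
    cublasShutdown();
    return nrm;
}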
I have some more questions after a few tests this weekend.
Is there a penalty in retrieving small amounts of data (e.g. a timestep, the conservation error, etc.) from device to host? I have not played with structures and have not looked much in the documentation about them. Is it possible to define a structure and transfer data as structures between the device and the host?
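A sketch of what I mean by "transfer data as structures" (the layout is just an example, names are mine):

#include <cuda_runtime.h>

struct StepInfo {
    float dt;        // global timestep
    float conserr;   // conservation error
};

__global__ void fill_info(StepInfo *info)
{
    // in the real code these would come from the reductions above
    info->dt = 1.0e-3f;
    info->conserr = 0.0f;
}

int main(void)
{
    StepInfo *d_info, h_info;
    cudaMalloc((void **)&d_info, sizeof(StepInfo));
    fill_info<<<1, 1>>>(d_info);

    // one small transfer instead of several scalar copies
    cudaMemcpy(&h_info, d_info, sizeof(StepInfo), cudaMemcpyDeviceToHost);
    cudaFree(d_info);
    return 0;
}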