I am new in the CUDA/GPU coding and trying to implement a FEM solver. I have a pretty extensive experience implementing solvers on shared memory parallel supercomputers. My solvers are using all kind of renumbering/coloring schemes (on Supercomputers) to avoid scattering interference and reduce cache misses as much as possible.
global memory. I understand that a scattering to an array allocated with cudaMemAlloc is a slow process even is the access is sequential (coalesced)… but I still need to do it in some cases. What is the best option to reduce the scattering to global memory? if there is any.
min(array): I am computing timestep at each point of my mesh using a kernel. This is done on an array located in global memory. I need to get back to the host the min( timestep(i)). How do I do this without getting the array out the global memory? I have seen the atomic functions in 0.9 but they only work on integers. In my case those are floats.
Indirect addressing f[list[*]]: my array “list” is an array of integer and f is an array of floating points. list is also an array not evolving in time. if list is allocated using cudaMemAlloc(), it is therefore located in global memory. list is accessed sequentially. By quickly reading through the doc, it looks like the global mem access is not cached. This is not the case for the texture mem. If this is the case, I would greatly benefit in implemented list inside a texture. Am I right?
is there ways to accelerate functions such as
a = v , a = b+c , a = b*c …
with the array a,b,c being in global memory. Note that the access in those arrays is sequential.
That’s it for now. I hope that some of you will spend the time to read this and provide me with some answers.
there is no real alternative if you need to do scattering. You could however try to turn the scatter into coalesced writes if you can bound the scatter range. In that case, if you can bound it such that all threads of a block write within it, you can stage the scatter in shared mem and then flush the shared mem block to global mem using coalesced writes. Maybe you can do that in some few passes, if the scatter range would create a large shared mem block.
“…do this without getting the array out the global memory?” Do you mean you want to avoid reading it back to the CPU? In that case, take a look at the SDK prefix scan example. It is easy to modify it to do min/max etc instead of the sum. It leaves the computed result in the last array position, so you can fetch it from there in the next pass.
If you access it linearly (and coalesced for the threads), there is no benefit of using a texture. In this case, you even need more registers, as the texfetch instruction has some overhead there. However if there is a chance that the threads do not do coalesced reads in list, or you are reading nearby values more than once per kernel, texture is the big winner.
Take a look at the CUBLAS lib that comes with the toolkit.
I’ve seen you talking about coloring in another topic about scatter operations and you’re talking about that here again. After research I found a IEEE paper from Iwacrapa and Shimasaki (2002) and I could see it was a technique for separating unknowns which don’t have direct relationships with eath other. Fine. But you have to figure out by yourself one algorithm suitable for your solver. Coloring technique is just a generic term to say we have to find a way for not having any conflicts for a parallel computation. As far as I understood it doesn’t bring a direct solution.
I’m working on a finite element solver as well. In my case I would need sets of elements which don’t share any nodes. Assuming I can find an algorithm which will find me few big sets of elements satisfying this condition, I would have to run my kernel on each set. Right? I’m kind of worried about the performances I’ll get if I got too many sets of elements.
About atomic operations on floats we have talked about that in another topic if you are interested. Osiris and myself have managed to figure out something to ‘simulate’ atomic operations on float. If you’re lucky in some cases the compiler doesn’t optimize the code and the method proposed by Osiris works. But it is bloody slow… If you want to have a look though it’s here:
[url=“The Official NVIDIA Forums | NVIDIA”]The Official NVIDIA Forums | NVIDIA
Last question, I can see a bit everywhere that people are talking about CUDA 0.9. Where did you get it? You can access to it if you’re register as a developer?
Peter gave some good advice, here is what I can add from my experience so far.
One step of my algorithm (at least the current implementation) does 100’s of MB to GB of coalesced memory reads and a few MB of non-coalesced scattered writes. The scattered writes here don’t seem to impact performance too much, partially because there are so few of them and partially because this code does lots and lots of FLOPS for each memory access.
I did device a way to reorganize the calculation to do the scatter part into shared memory and then do a fully coalesced write at the end. However, in my case the overhead of doing the reorganization was slower than scattering into global memory in the first place.
So the lesson here is that coalesced writes are something to strive for, but non-coalesced scattering into global memory might be the better choice depending on how much data is written/read.
Keep list as a normal global memory array and access it sequentially. But, but f into array memory (cudaMallocArray) and access it a texture. If you can sort your data so that the memory access pattern doesn’t jump around a lot, it will help, but you may be amazed at the performance of the cache: I was. In a completely randomly arranged array, my code that does a lot of indirect addressing still is able to pull over 50GB/s from the texture fetches. Sorted to improve cache hits, I max out the device at ~70GB/s. Since the cache is 2D, I just take my 1D dataset and treat it as 2D with M columns and N/M+1 rows.
The code is a fluid dynamics unstructured FEM solvers which has been reformulated based on edges instead of elements. That has the advantage of reducing quite a bit the number of flops to perform. The indirect addressing is accessed sequentially, while the unknown array is reached in a more “jump all over the place” manner. The jump distances are reduced as much as possible with various renumbering techniques but here is so much that we can do… especially with a unstructured 2D/3D mesh. I usually write coalesced unkno by group of 5 to 10 depending on the type of solver used and then it gets non-coalesced from point to point. This was coded in that way for super-computers using openMP. Most of the computation is done at the edge level then scattered back to the points. On machine such as SGI-altix NUMA mem architecture, the result is a very high GFlops, very little penalty for cache misses (those machines have 4MB caches) and a very good speed-up to 128 CPU.
Well trying to code the solver that way using Cudas is not the good approach mostly because of the scatter. It is preferable to do twice the amount of work and only one scatter back to the point than once the work and 2 scatters even is almost coalesced write. I recoded a simple 1D solver in order to loop over the edges connected to each point therefore only allowing 1 scatter and it is now much faster. Now … the 1D is using coalesced data de facto (each point as one point before and one point after, except for the boundary points). I will check on a 2D case as soon as I have more time. In this case, there will be non-sequential access and fewer scatter back.
I will therefore try to eliminate the scattering process as much as possible.
I am doing the following: for each point I am computing an allowed timestep, I then find the minimum of those timestep that becomes the timestep used for advancing my solution. Computing the timestep at each points is fairly efficient (some computation and only one scatter per point. The problem was getting the min and retrieving the min from the device memory into the host. I looked at the SDK example but need to do some timing to see if I would anything by using that methodology versus a simple 1block-1thread run in the card on the device array. In an openMP fortran paradigm it would be a simple and efficient
omp parallel for reduced(min:vmin)
do i=1,nbrPoints
vmin = min(vmin,dt[i])
enddo
On the device the simple loop might be more gpu-efficient than the scanning version presented in the SDK. the example is computation intensive, not the min.
In fact the list is accessed linearly but the indirect addressing is jumpy. Taking my example of edges the list array is as follows:
list[0] = point 1 of edge 1
list[1] = point 2 of edge 1
list[2] = point 1 of edge 2
list[3] = point 2 of edge 2 …
I am then accessing for each edge the unknowns of point 1 and point 2
unkno[list[*]NBR_UNKNOWNS] and unkno[list[+1]*NBR_UNKNOWNS
It is then a non-coalesced access. If I keep list as a standard global memory array and move the unknown to the texture, will that help?. Of course If I do so, the increment for each points and unknowns will be a a standard global mem array. I will than have to update the texture containing the unknowns for the next cycle of the simulation. Will that help? what is the cost of copying global array to a texture?
I looked into the Cublas lib and am wandering if I can used arrays allocated with cudaMemAlloc as arrays in cublas? I can not afford duplicatign arrays.
I have some more questions after a few tests this week-end.
Is there a penalty retrieving from device > host small amount of data (i.e. a timestep, conservatin error etc.) I have not played with structures and not looked much in the doc about it. Is it possible to define a structure and retrieve data as structures between the device and the host?
You are correct, there are many algorithms and we have coded and tried most of then renumbering not only the elements but the points. This leads to very low cache misses and more sequential access. That is fine on machines such as altix SGI with large cache (4Mb) and NUMA architecture.
You properly guessed the fact that you have to reorganize your elements in batches of non-touching elements… then use the kernel on each of of them. As they are not of the same size as well. … and as I experienced with a simple 1D case, the performance is not there. It is better to precompute the list of all the elements connected to each point then use a thread per point with a loop over the elements connected to it. In my case the fluxes are computed in the middle of the element than dispatched to the point. There are 2 gattering ways of handling the problem depending if you can spare extra memory
Method #1. I can spare extra memory
a. kernel to compute fluxes for each element store results in a separate array
b. kernel looping over the points. each thread has a loop overt the connected elements summing the values found in the previous array.
Method #2. I can not spare extra memory
a. kernel looping over the points. Each thread has the loop and the fluxes are computed there.
Method #1 will perform less work but will have requirement for extra mem and will perform an extra coalesced scatter.
Method #2 will perform more work (3times for triangles (2d) 2 times for edges) but no need for extra mem and no extra scatter.
There seems to be a balance between cost of scattering and amount of work. That depends on the type of solver used.
I will look at your atomic version for float unfortunately it seems that it is compiler dependent???
I have installed CUDA 0.9 using the partners.nvidia.com channel. I used to be on their developer database but have not done much in past few years and have therefore being removed from it. Now I can not log in tehpartners.nvidia.com part of the site to check if I have the correct driver. Logging in gets me a “page” does not exist… weird.
yeah, the SGI is actually ccNUMA (cache-coherent NUMA) which is exactly what the G80 is not, ie. shared resources are not coherent.
The “simple 1block-1thread” solution is absolutely the worst thing you can do. You will waste 127 ALUs and get full latency for every access! The SDK scan example uses more threads (hide latency) and hierarchically computes the result (log complexity)
Yes, I would access unkno using a texfetch. Modify the addressing such that it is actually 2D (using some artificial field width) and you will get full cache assistance for the i+1 fetch if NBR_UNKNOWNS is not too big. Because you cannot read and write to the same 2D array, do a ping-pong. That way, you don’t need a copy between the simulation cycles.
Don’t know.
(new) The overhead for device2host copy is always the same. The copy is pretty slow normally. So batching several small copies into a larger one does help a lot. If you can stage the data distribution on the host, take a look at pinned host memory (same latency but much better bandwidth). I always transfer structs simply as uchar arrays. Just cast the pointers to the cudaMemcpy call.
Check out the section “Texture Reference Management” in the guide (4.5.2.4 in 0.9). You can bind a texture to an array with cudaBindTextureToArray. You can also create 1D textures with cudaMallocArray/cudaBindTextureToArray, but beware of the restrictions on the size of those textures (maximum 2^13).
You can bind a global memory pointer to a texture too with cudaBindTexture, which has a maximum width of 2^27, though it won’t benefit as well from the texture cache as memory allocated by cudaMallocArray … at least according to the 0.8 guide, the 0.9 guide doesn’t seem to mention that, or I missed it. Anyways, I haven’t tried cudaBindTexture because the 2D array memory is working very well for me.
The “f” that I am randomly accessing is just an array of particle positions. But treating this 1D list of particles as a 2D array as I mentioned above gives a good speedup, most likely due to improved cache hits. The texture cache is designed to work best if your data has good 2D locality, as this is a graphics card after all, where textures are usually 2D.
I think I will try the method with different batches of elements. With a enough high number of elements I might not have too many batches and therefore not wasting too much resources.
I’m implementing the method 1. It can run but it’s not really stable if the number of elements is too high. I’m still trying to figure out why (if this is my algorithm by itself or the CUDA implementation). The method 2 is not very appropriate in my case. I need to compute the force for each node. But for that, I need to add the force contribution from every element which shares this node. In my mesh (tetrahedral) the node valence is up to 24. In other words one node may be connected up to 24 elements. So computing element values each time for each node would be too costly.
About float atomic writes it is kind of compiler dependent yes. But it seems to have no difference between 0.8 and 0.9 though. Our trick works but in some cases the compiler re-orders the instructions. But if you work within a loop or when you check the ID you use a global memory variable and not the local one to do the comparison, this works. Read the topic to understand better.
About 0.9 how did you register on partners.nvidia.com ? My team asked for a developer account but it takes time.
Obviously with a valence of 24, it is not a good option to recompute the forces at the level of the loop of points. That means redoing the work 4 times. the extra storage is almost a requirement especially if the force computation is involving a lot of computation. If you do not have memory to spare, you may want to try to reformulate your FEM formulation based on edges instead of elements. That is what we have done and found in our case (CFD) that it was cutting down the amount of computation. Now you do the work twice but the gain is worth it (everything is coalescent). I will try to find one of the paper written or at least provide you with a reference.
Related to registering to partners.nvidia.com. I do not remember how I got there. I was registered as a developer a long time ago and gained access to that by the mean of an email that Nvidia sent me. Unfortunately, I have erased it. Ask the NVidia admin guys for a way to speed up the registration process.
If you can provide me a paper about that, why not. Thank you. I want to stabilize the gather implementation first. But later I might try.
About developer registration, we’ve left a message this morning in another forum about registration issues. One guy from NVIDIA answered us by email to tell us they had no trace of our registration. Apparently they have problems if you’re not using internet explorer… And we’re running Firefox on Kubuntu. It’s almost unbelievable that we have to use internet explorer to register. But we got a quick answer from NVIDIA so it’s great. At least we’re gonna be able to fix this.