I have searched around, but as I know really nothing about CUDA I probably don’t know where to look, or what to search for.
What I would like to do is offload the calculation of the Laplacian of a 2d array (float) to the GPU, so I could execute it for differing radii and stencil neighborhood types, with the goal ultimately of using this in C# as an extern call.
Did that make sense?
My questions are three:
Is there something which exists that does this already?
Is there a concise source example out there that an utter neophyte to CUDA could hack into doing this?
Is CUDA best suited for this, or another GPU language like OpenCL, GLSL, etc…?
The language of choice matters not to me; my only goal is to offload the Laplacian calculation in a way C#/Java can interact with, and be done with it.
I do not want a huge library, I just want to offload this one task to the GPU, and a huge lib would just confuse me in trying to refactor out the one thing I need. I also do not have time to learn a new language in enough detail to code this from scratch, because I am not being paid for this…it is art.
This part of the PDE calculation takes so much time on the CPU I am limited in array size, depth and radii of the Laplacian neighborhood. If I had something to do this, it would open up a whole lot of possibilities for me.
The goal for typical array x and y dimensions are up to 8192 at a z depth of up to 64, with Laplacian stencil radii up to 1024. So, GPU is pretty much required for this.
For reference, could you state what library you are currently using for the CPU version of this code? If you plan on using GPUs in your work, I can recommend CUDA as the most mature and widely-used parallel programming platform with the largest eco-system, but of course I am biased.
The maximum dimension of you problem (8K x 8K x 64) would seem to suggest that you will need a GPU near the top of the line, as only those have sufficient amount of on-board memory. I assume this is single-precision computation?
Currently I hand code all PDE solvers, and my code for handling Laplacian stencils and such is my own as well due to the unusual nature of my work. I use no libraries, and use only the base, core functions of any language I code in for this purpose, which allows me to copy/paste into nearly any C based language with few modifications.
I use plain arrays to contain the data and for loops to iterate through them.
I am not familiar with Laplacian calculations, but a 30-second internet search for “Laplacian” plus “CUDA” results in very large number of hits that appear relevant, including various papers and codes. Have you had a chance to browse through that material to see if there is something that fits your needs relatively closely?
Everything I have found thus far has assumed either that the entire app will be written in CUDA and so a lot of stuff is intertwined with it to the point It is unnavigable for me/or they are doing a whole lot more than just a Laplacian, and all of it assumes a level of understanding of CUDA that I simply do not have, so separating out only what I need becomes rather a task.
That is why I was hoping someone knew of a library or an app which is doing this one thing in isolation, and nothing else, so that I would be able to identify and extract exactly what I want.
At that point the problem becomes encapsulating the relevant code in an assembly of some kind which I can call from C# or some such, which is a much better defined thing, and far easier to locate the relevant information for.
I’m not asking that someone do it for me, just for help in finding a much more narrow example, so that I might do it myself.
There is not enough context information about the use case here, but one potential issue with wanting to “outsource” just a particular portion of a compute pipeline to the GPU is the overhead of transporting data from and to the GPU, that’s a fairly thin straw at 12 GB/sec for a PCIe gen3 connection. How serious of an issue this is here is impossible to say.
The most successful GPU-accelerated applications are typically those that mostly move data to the GPU once, run that data through the entire compute pipeline, then copy back the final results to the host. It is the weekend now, I would suggest waiting a few days to see whether someone familiar with Laplacian calaculations shows up here with relevant advice on the best /recommended approach on how to go about Laplacian calculations with CUDA. I was just trying to gather some basic context information, so it is readily available to the next reader.
Not sure what you mean by “better defined”. CUDA is essentially a subset of C++, with a few extensions for parallel programming. What do you find under-defined relative to C#, and what practical problems does that give rise to?
I don’t know about C#, but I am aware that many people use CUDA via Python bindings, in particular PyCUDA. Seems like a good approach for programmers not comfortable with CUDA’s raw C+±ness.
Given that on the CPU attempting to calculate the laplacian of even a 5-stencil on data of 4096x4096 takes almost multiple days…I am sure that GPU would be much better, despite the transport issues.
Added to that the face I work with many layers and radii of stencil, the problem gets much, much worse. The only way for me, short of a supercomputer, will be GPU.
That is precisely what I wish to do with it. In the most simple case I will have arrays like so:
The laplacian will be calculated on u and v, then stored in a temporary array which is iterated over and used in the PDE, updating uu and vv, which are then swapped. This allows me to entirely offload the laplacian, run it, get it back, then use it.
I mean only that there are far more instances of information about encapsulating a CUDA operation in a C# consumable library than there are about how to do specifically what I want with regards to Laplacian, and so it will be much easier to figure out how to do that part of it.
I use vs2015 enterprise with the NVIDIA plugin for it so I’m not too worried about that part. I think once I see how to handle the calculation of Laplacian on N arrays that the offload/push back to c# part will be comparatively trivial.
The most simple example of laplacian would be as thus, for a 5-point stencil:
Basically all I want to do is encapsulate the two loops and that line so CUDA will execute it. Once I have that, I can easily extend it for the more advanced operations and neighborhoods which I use, then make it into an externally callable library.
As I said, it isn’t that this is a hard problem, it is that I have literally no understanding of how to do it with CUDA. There are probably links out there that explain the likely very few steps involved, but I don’t even know how to frame the search terms, because of my having no CUDA domain knowledge.
I expect the answer is rather simple, as is the problem itself.
It is my personal experience that it is much easier for domain experts in physical sciences or various engineering disciplines to pick up CUDA knowledge than the other way around. From what I have seen, in many cases one to two weeks was all it took for a scientist or engineer to acquire enough proficiency in CUDA to create a basic GPU port that was already running faster than the CPU solution by small integer multiples.
As a computer science guy without a background in computational physics, computational chemistry, or fluid dynamics (and the relevant numerical methods) I have found it very tough to work from the other end of the game. So I don’t think there is a reason to worry. You might want to start with some introductory material to get a feel for the basic techniques of mapping threads to data in CUDA; learning CUDA exclusively from the accompanying documentation might be a bit overwhelming. My standing recommendation is to learn by experimenting, starting with simple programs then working towards more advanced techniques. CUDA comes with a bunch of example programs, from trivial to reasonably sophisticated.
Yes, absolutely this can be accelerated on the GPU, but (like most things) with many caveats and many opportunities/limitations depending on your exact problem. In the simplest, a large kernel laplacian evaluation is just a convolution (though with sparse samples). If the kernel width is small then the computation is so trivial that your speed will be dominated just by getting data to and from the GPU, so there would be no net speedup. If your kernel width is large (and you mentioned 1024, which is huge, with thousands of merged samples) then a GPU’s memory bandwidth will be a big advantage, plus you can start playing algorithmic games to make the speed sublinear in the number of stencil samples.
But before diving into THAT question, do you really really need a kernel of size 1024? That seems awfully large. And more importantly, does your kernel size change with position? It probably does, and that changes the algorithm significantly. So explaining more details about your needed algorithm would help a lot. In particular it may be much more efficient (on CPU or GPU) to do repeated relaxation passes of a much smaller kernel (and tune the size of the kernel for best net speed).
I’m smiling about this post since I’ve been doing GPU PDE R+D for fast Laplace equation electromagnetic solvers for years, on both CPU and GPU. The finite difference stencil approach is really “old school” but still quite reasonable for some cases, especially if you need the values evaluated on a grid at the end. You used the word “art” so I wonder if you’re implementing a form of rasterizing vector diffusion curves? Though that wouldn’t explain a Z depth.
In terms of trying to use the GPU with minimal GPU specific code, you may give up a lot of perfomance and control, but for a simple stencil compute probably you can use a higher level library like Thrust to use the GPU without having to start CUDA coding. Again it depends very much on the specifics of your needed algorithm.
Yup, indeed I do. What I am doing is multiscale coupled reaction-diffusion algorithms, to produce very interesting output.
I have many 2d RD layers that are coupled, each layer having differing stencil size for some algorithms. In others, such as below, I have many differently sized grids which are interpolated and coupled in strange ways.
No, not at present [in the x/y, they vary by layer]…but that is an interesting idea to try…thank you…that may yield more interesting results.
Here is an example of what I both know how to do, and want to do, but am currently limited by all of my code being CPU. The image below is by Jonathan McCabe, and it is a multiscale Belousov-Zhabotinsky RD system.
Well, I need to begin somewhere, and I will get a much better feel for it if I start simple, like through implementing the offload of a simple stencil laplacian op. Generally once I get my head around the same problem in two different languages so I am operating in the same context, I can learn what I need to figure out how to do in order to support more complex operations with better performance.
If I start with a library I will only learn how to use the library, which is a less than optimal solution. :)
“Days” seems very very very slow. Perhaps it’s not really slow, but you have used all of your CPU’s RAM and you’re paging?
For reaction diffusion, the laplace operator is seperable and symmetric so it’s way cheaper. It’s really a gaussian blur. You just alternate 1D passes in X and Y, and as a bonus you can accumulate various intermediates to get a gaussian blur of variable radius.
This kind of strategy is very successful for realtime RD systems, which may not the same be yours, but clearly this is apropos if you’re looking for some implementation inspriation. That link is a gaussian-blur based RD system implemented as an image shader, so it runs entirely on GPU. It updates at about 2fps for a 4K image on my low power GTX 750Ti. Note Buffer B and Buffer C doing the 1D X and Y convolutions alternately, giving an effective stencil size of 81.
No, even my laptop has 64gb ram, so its all from the loops.
I have versions of split 1D box blurs that I have tried to use in place of laplacian, but there is some difference in the final output that is both nonlinear and unknown to me, so it tends to blow up beyond what is expected.
I have looked at shadertoy examples in the past, and I have no idea what that is doing. I avoid shader languages like the plague, because it just doesn’t make sense to me, for many reasons which I am sure you care not to hear, but suffice it to say what is happening with cross buffer communication and where variables that have no obvious assignment are getting their values is most of it. ;-)
I think it would be easier for me if I can keep working with things I am familiar with, and if I can do that with CUDA, I’ll be happy.
Like I said I’m not being paid for this and investing a lot of time learning a whole new way to think about not only the problem at hand but the way I must code it is just…not worth it. So I am trying to find a middle ground where I can just offload the worst parts of this and keep the rest as-is.
Independent of CUDA, if one wants to take advantage of modern parallel architectures, one has to learn how to “think parallel”. After two decades of serial programming, I found the switch quite challenging. It definitely got me out of my comfort zone. But it is necessary to reap the performance benefits, on both CPUs and GPUs.
CPUs offer instruction level parallelism (out of order execution), data-level parallelism (MMX, SSE, AVX, Neon, Altivec, etc), thread-level parallelism (hyper-threading, OpenMP), task level parallelism (multi-core, MPI), and only some of these layers are handled automatically by today’s sophisticated tool chains.
I would claim the hard part about programming with CUDA is not the mechanics of writing code (just a small number of extensions to standard C++), it is the orchestration of work across multiple levels of parallelism (warp, thread block, grid, multi-GPU), in much the same way this is the major challenge for accelerating code on CPUs.
It isn’t parallelism that presents a problem, it is the weird texel stuff and odd way many shading languages do it that is off-putting.
I have no problem with creating code which runs in parallel, after all, I have been professionally creating software for around 23 years now, but GPU languages haven’t been part of this.
As stated I am not familiar with CUDA and have only seen other shading languages used in the capacity I seek thus far.
Yep, like C# task parallel library and such. It is useful in a general sense but is by no means granularly optimized like one may accomplish through lower level and more situation specific efforts, and I certainly do not want to get down so far that I am manually using processor level extensions. Not fun! :)
I am not pushing for extreme optimization, I only seek to offload the laplacian calculation to the GPU, so I don’t want to get super deep into the weeds. Were I to spend the time to go as low level as is possible, I doubt I would see a performance increase worth the time spent doing so over that gained by the initial the offload to gpu. In fact, if I were really pushing for the absolute best performance I would probably switch to a matrix based solution.
I will be happy with just pushing the laplacian to the GPU for now though, for a number of reasons.
I’m fine with C++/C#/similar, it is just the things like GLSL/HLSL/WebGL which represent the bulk of what I have seen so far which I find awkward. It is good to know CUDA is less like those than it is C++.
I will take what I have learned with regards to some terms and such and use this to refine my searches. I always knew that one day I would have to go GPU, and the less painful I can make it, the better.
This seems like something that is a solved problem, somewhere. I would be willing to bet that someone, somewhere has wanted and made this exact thing, right down to the limited purview and scope of it.
Of course if anyone happens to be aware of the location of such a thing, feel free to speak up. ;-)
Using some terms from the replies above, I think I might have found exactly what I needed…please let me know if there is some glaring evil in the GPU transport parts of this…here is what I found:
I am not advocating the use of shader languages, that is how people were forced to do GPU computing prior to the creation of CUDA. In other words, CUDA was designed specifically as the answer to programmer’s struggle with shader languages. One major part of that was to make it as close to standard C/C++ as possible and thus allow GPU computing to become ubiquitous by being accessible to all, not just ninja programmers.
I was a member of the original five-engineer team that created CUDA, and ease-of-use was always on the top of our list of design goals for the CUDA environment; the rapid adoption of CUDA would appear to validate that approach. The design of CUDA is not perfect, of course, and it created some issues that were hard to foresee in 2005, such as release schedule conflicts caused by the tight integration of CUDA with the host toolchains.
You keep stressing the desire to have as simple and minimal GPU code as possible, which is a totally reasonable goal for your project. I had mentioned Thrust, which is a C++ library that abstracts much of the GPU compute to be more like a set of routines that operate on arrays. Most of us here in this forum of course prefer the “real” GPU language of CUDA which gives you most of the GPU’s power.
But that’s not even the whole heirarchy of options. Perhaps at the topmost level, a system called OpenACC is designed to give non-GPU programmers access to some of the simplest GPU abilities. Kind of the “low hanging fruit” of a GPU. I’ve never used OpenACC since it exposes so little of the GPU’s versatility, but it might actually be something that will give you some speedup with the smallest effort.