I am not sure if this post belongs here, but from what I have read, the ideal CUDA program (CUDA kernel?) is one where the kernel is called once, does an extremely involved parallel processing task, sends the answer back to the main program, and is then finished for good (or at least for that program’s execution). This minimizes GPU-to-CPU bus transfer.
However, in my situation I have a subprogram, currently in ANSI C, that seems to be called hundreds of thousands of times. The obvious problem is that there will be a lot of CPU-GPU bus transfer, and I am not sure how to minimize it. The subprogram that takes up such a large proportion of the run time is parallelizable, but as stated previously, it is not called just once with its results then sent back to the main part of the program.
I have been looking through the CUDA examples for something like this. Is it still possible to write a parallel CUDA kernel that is called from the main C program many times and realize a significant speedup?
That is a pretty hypothetical question, and it is impossible to give anything but a hypothetical answer, but yes, you can (in certain cases) get a good speedup in that sort of scenario.
I solve transient PDEs a lot. Explicit multi-stage methods might require hundreds of thousands or even millions of forward Euler steps to solve a complete problem. When a single Euler step is computationally very cheap (like a simple function evaluation), it makes no sense to implement it in CUDA. When a single Euler step involves spatial discretization of the PDE across a million-cell mesh, it makes a lot of sense. Algorithms can also be redesigned to minimize the data exchange between host and device. The more data you can keep in device memory, and the longer it stays there through the application lifetime, the better it will be. But even what might be considered “out of core” operations can still yield a good speedup with CUDA if you know what you are doing.
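To make that concrete, the structure I have in mind looks roughly like the sketch below. The kernel, mesh size, and step count are invented purely for illustration; the point is just that the solution arrays are allocated once and stay resident in device memory for the whole run, so the kernel can be launched hundreds of thousands of times while the only host-device traffic is the single copy back at the end.

#include <cuda_runtime.h>
#include <stdlib.h>

/* Toy 1D explicit diffusion update - a stand-in for a real spatial discretization. */
__global__ void euler_step(const float *u_old, float *u_new, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        u_new[i] = u_old[i] + dt * (u_old[i - 1] - 2.0f * u_old[i] + u_old[i + 1]);
    }
}

int main(void)
{
    const int n = 1 << 20;        /* ~1M cells                              */
    const int steps = 100000;     /* hundreds of thousands of cheap launches */
    float *d_a, *d_b;

    cudaMalloc(&d_a, n * sizeof(float));      /* allocate once, reuse every step */
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));    /* initial/boundary data lives on the device */
    cudaMemset(d_b, 0, n * sizeof(float));

    for (int s = 0; s < steps; ++s) {
        euler_step<<<(n + 255) / 256, 256>>>(d_a, d_b, n, 1e-4f);
        float *tmp = d_a; d_a = d_b; d_b = tmp;   /* swap device buffers, no host copy */
    }

    float *h_u = (float *)malloc(n * sizeof(float));
    cudaMemcpy(h_u, d_a, n * sizeof(float), cudaMemcpyDeviceToHost); /* one copy at the very end */

    free(h_u);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}

Swapping the device pointers between steps avoids any intermediate transfers; in a real solver you would of course check the CUDA return codes and copy snapshots back only when you actually need them on the host.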
I have less experience with this than our local gurus, but my understanding is that it is the cudaMalloc and cudaMemcpy calls that eat up the bandwidth, while the math operations on the GPU themselves are fast and not troublesome. (If this is wrong, I would be happy to be enlightened by said gurus.)
You can eliminate malloc/free overheads (if your code uses a lot of them) just by preallocating all the memory you need and running your own block/chunk based memory manager in host code, which basically makes memory management free. PCI-e bus transfers can be amortized or hidden by using asynchronous copies that overlap with kernel execution, and by offloading calculations back to the host CPU asynchronously with kernel execution. But none of that helps if there isn’t sufficient data-parallel work in the actual kernel itself. That is the key to getting speedup in things like iterative solvers, time integration schemes, and hybrid linear algebra functions, where the GPU is used more as an accelerator than as the primary computation device.
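As a rough sketch of the asynchronous copy idea (the kernel, chunk size, and stream count below are invented purely for illustration): the work is split into independent chunks queued on two streams, so the transfer of one chunk can overlap with the kernel working on another. Pinned host memory is what lets cudaMemcpyAsync actually run asynchronously.

#include <cuda_runtime.h>

/* Stand-in for the real per-element work. */
__global__ void process(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main(void)
{
    const int n = 1 << 22;
    const int chunks = 8;
    const int chunk = n / chunks;
    float *h_x, *d_x;

    cudaHostAlloc(&h_x, n * sizeof(float), cudaHostAllocDefault); /* pinned host memory */
    cudaMalloc(&d_x, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_x[i] = (float)i;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];
        size_t off = (size_t)c * chunk;
        /* Copy chunk c up, process it, copy it back - all queued on one stream, */
        /* so copies for one chunk overlap with the kernel on the other stream.  */
        cudaMemcpyAsync(d_x + off, h_x + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(chunk + 255) / 256, 256, 0, st>>>(d_x + off, chunk);
        cudaMemcpyAsync(h_x + off, d_x + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }

    /* The host is free to do its own calculations here before waiting on the GPU. */
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFreeHost(h_x);
    cudaFree(d_x);
    return 0;
}

While those chunks are queued, the host can keep computing until the cudaDeviceSynchronize() call, which is the “offloading calculations back to the host CPU” part mentioned above.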
Is there an example of this available? I am looking for something similar to the Kirk book’s MRI example. I would like to make memory management free, but I am unsure how. I need an example.
Yes, you can see an example of this strategy here, but seriously, learn to walk before you try competing in the Olympic 100 m final. If you aren’t already fluent in C and C++, you won’t understand anything in that code, and if you don’t have a thorough understanding of the toolchain, you won’t even be able to get it to compile.
I would focus on learning the mechanics of writing and building functional, slow code that works, and only then worry about how to make it faster.