Newbie: Would like to build a physics sim completely contained in the GPU

I have an NVIDIA 4090 and would like to know if someone could point me to a place that explains the limitations of having a kernel run other kernels (such as 2D FFTs on complex arrays), interspersed with math functions on the same data, constantly looping it all within the GPU. I want to keep all the data arrays on the GPU for the duration of the sim and only send data back to the CPU once in a while for analysis and display. Is this possible, or do I have to go back to the C host program every time a kernel has finished?
Thanks
glenn

Hi Glenn,
you should differentiate between where (GPU/CPU) the data is kept, which side (GPU/CPU) decides on the functions to run and their parameters, and which side (GPU/CPU) actually calls them. What is the reason you want the physics simulation completely contained in the GPU? To speed it up, either because of the time for data transfers or for invoking kernels? To keep the source code compact?

Many variants of fully GPU-resident code, as well as numerous hybrid approaches, have been done successfully in the past. Keeping the data in GPU (global) memory without copying, but calling the kernels themselves from the CPU, is the most common way to work with GPUs. If you want to call kernels from kernels, that is possible. It is even possible to combine several kernels into one with a (nearly) infinite loop and a switch-case that decides which operation to do next. Several SMs can communicate with each other over global memory and atomic accesses.
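A minimal sketch of that usual approach, with a placeholder kernel standing in for the real physics (nothing here is specific to your simulation):

```
// Sketch: the field stays in GPU global memory for the whole run; the CPU only
// launches kernels in a loop and copies a little back occasionally.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void stepKernel(double2* field, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        field[i].x *= 0.999;   // stand-in for the real per-point physics update
        field[i].y *= 0.999;
    }
}

int main()
{
    const int n = 2048 * 2048;
    double2* d_field = nullptr;
    cudaMalloc(&d_field, n * sizeof(double2));   // allocated once, never copied per step
    cudaMemset(d_field, 0, n * sizeof(double2));

    double2* h_sample = (double2*)malloc(n * sizeof(double2));

    for (int step = 0; step < 10000; ++step) {
        stepKernel<<<(n + 255) / 256, 256>>>(d_field, n);
        if (step % 1000 == 0) {                  // occasional transfer for analysis/display
            cudaMemcpy(h_sample, d_field, n * sizeof(double2), cudaMemcpyDeviceToHost);
            printf("step %d, sample value %g\n", step, h_sample[0].x);
        }
    }

    cudaFree(d_field);
    free(h_sample);
    return 0;
}
```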

Dear Curefab
Thanks for getting back to me. Sorry I’ve been on holiday break.
I am trying to run laser simulations with 100 2Kx2K or 4Kx4K double-precision complex arrays representing optical fields in the resonator. I have another 20 or so double-precision real arrays for the multiple energy-level densities corresponding to the gain dynamics. I have been told that memory swaps are time-expensive, but I do not know how expensive kernel swaps are. The code is probably smaller than the arrays I am using. I also do not need to keep or output all the arrays that I process; there are likely only two out of the 100 processed per sim time step that I need to send back to the CPU for simple analysis and I/O. The communication with the CPU can be fairly sparse and asynchronous as long as the GPU can keep all the arrays inside it.

Right now I have a GeForce 4090, but I don't know if that is the best GPU for doing complex 2D FFTs. It seems to have enough memory, but it's the only GPU in the system, and while the load on it looks trivial for routine things, I don't know what will happen when I demand most of its time. Should I have bought a different GPU (or GPUs)?

The other aspect is that this will be a diagnostic calculator that is only useful in real time. Watching the laser fields develop is essential for building intuition in this arena. That requires processing 1,000 to 10,000 time steps on each array in about 5-10 minutes.

It also seems difficult to get NVIDIA to tell me which GPUs work best on which types of problems. My sims have less linear algebra than special functions.

I am writing simple dynamics code now, but I would like to write a kernel that does gain dynamics and then immediately does the FFTs on the array for propagation.
Most of the examples I’ve seen are doing a single function on the GPU and returning an answer. Do you know where I can look for examples doing multiple different functions in the kernel and only selecting some subset of them for returning to the CPU?
Sorry, I have a lot of questions.

Again, thanks for your time.
glenn

A couple of comments:

It’s not unusual for the GPU workload to be fed from a CPU loop through different kernels. At startup, many loop iterations may be queued on the GPU. Kernel startup overhead is of the order of 5 microseconds, so as long as your kernel duration exceeds this by a reasonable margin, it’s insignificant.
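As a fragment (kernel names, launch configuration and device pointers are placeholders assumed to exist elsewhere), the CPU can queue many iterations into a stream without waiting:

```
// Launches are asynchronous, so the CPU loop runs ahead and queues work;
// the ~5 us launch cost is hidden as long as each kernel runs noticeably longer.
cudaStream_t stream;
cudaStreamCreate(&stream);

for (int step = 0; step < numSteps; ++step) {
    gainDynamicsKernel<<<grid, block, 0, stream>>>(d_field, d_levels, n);
    propagateKernel<<<grid, block, 0, stream>>>(d_field, n);
}
cudaStreamSynchronize(stream);   // block the CPU only when results are needed
```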

You mention double precision; how heavily you rely on that level of precision will have an impact on performance. Only the data-centre-class GPUs offer full double-precision throughput compared to the consumer-oriented ones. An idea of the differential can be seen here, comparing the throughput of double vs. float instructions across the various architectures. Taking your 4090 (CC 8.9) vs. an A100 (CC 8.0), double throughput is 1:16. The 4090 should still be fine for developing on, but this is a factor to consider.

Hi glenn,
the different Nvidia GPUs are more similar than different in how balanced the various execution units are. Of course, some are overall much faster than others.

However, some differences are:

  • As rs277 said, a few workstation GPUs have high double-precision throughput; the other cards are quite limited with double precision.
  • The amount of memory (2 GB vs. 80 GB) available to keep all the needed working data.
  • The ratio of global memory speed to computation speed is roughly similar, but some cards favor one parameter, some the other.
  • The L2 cache size (for algorithms repeatedly accessing the same dataset); it plays the same role as the total global GPU memory, just on a smaller time and size scale.
  • The workstation cards have enterprise features, e.g. ECC memory or RDMA data transfer to other PCIe devices (e.g. InfiniBand network cards).

The RTX 4090 is the fastest consumer graphics card, so even with the high double-precision penalty it is still quite fast for double precision.

Even older workstation GPUs would be quite expensive in comparison, e.g. the Titan V/GV100/V100 (32 GB version, 7x as fast with double), or the newer A30 (24 GB, 4x as fast with double as the RTX 4090, costing around 5000), A100 (40 or 80 GB, 8x as fast, around 13000), H100 (80 GB), GH200 (96 or 144 GB). So, when you see the prices, there was nothing wrong with getting the RTX 4090 first, or in addition.
Newer generations have better Tensor Cores, which can help with FFT calculations.

Global memory size: 100 2Kx2K double-complex arrays = 6.4 GB. Sometimes you need more than one copy of each array, e.g. input + output, so let's say 12.8 GB, or 51.2 GB for 4Kx4K. The RTX 4090 has 24 GB, so you really have to be careful here to stay below 24 GB.
Do all 100 simulations have to be run in parallel or can they run sequentially?
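If in doubt, you can check the budget at runtime before allocating everything; a tiny sketch using the numbers above:

```
// Check the available GPU memory against the planned allocation.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    size_t freeB, totalB;
    cudaMemGetInfo(&freeB, &totalB);

    const size_t oneField = 2048ull * 2048ull * sizeof(double) * 2;  // double complex, 64 MiB
    const size_t needed   = 100 * oneField * 2;                      // 100 fields, input + output

    printf("free %.1f GB of %.1f GB, need ~%.1f GB\n",
           freeB / 1e9, totalB / 1e9, needed / 1e9);
    return 0;
}
```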

PCIe for transfer between host and GPU is a bottleneck. With PCIe 4.0x16, transferring the 24 GB costs about 750ms.

Perhaps you can do the analysis on the GPU? But if you only analyze a few of them, it won’t matter so much.

I/O can be overlapped (streams and cudaMemcpyAsync) with calculations.
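A rough sketch of that overlap (placeholder kernel and buffer names; the device-side staging copy keeps later steps from overwriting the data while it is in flight, and pinned host memory is needed for truly asynchronous copies):

```
cudaStream_t computeStream, copyStream;
cudaStreamCreate(&computeStream);
cudaStreamCreate(&copyStream);

cudaEvent_t snapshotReady;
cudaEventCreate(&snapshotReady);

double2* h_diag = nullptr;
cudaMallocHost(&h_diag, fieldBytes);             // pinned host buffer for the diagnostic field

for (int step = 0; step < numSteps; ++step) {
    simulationStep<<<grid, block, 0, computeStream>>>(d_fields, n);

    if (step % 100 == 0) {
        // snapshot into a staging buffer so later steps cannot overwrite it mid-transfer
        cudaMemcpyAsync(d_stage, d_fields, fieldBytes,
                        cudaMemcpyDeviceToDevice, computeStream);
        cudaEventRecord(snapshotReady, computeStream);
        cudaStreamWaitEvent(copyStream, snapshotReady, 0);
        cudaMemcpyAsync(h_diag, d_stage, fieldBytes,
                        cudaMemcpyDeviceToHost, copyStream);   // runs while compute continues
    }
}
cudaStreamSynchronize(copyStream);
```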

Kernel swaps are cheap. The program is loaded into the GPU (at least after first use) and just called whenever needed. The overhead for a kernel call, as rs277 said, is about 5 µs. For your data sizes, it does not matter.

Use the existing cuFFT library for real and complex 2D FFTs. You can try out the speed of different sizes; powers of 2 are the fastest, so 2K = 2048 and 4K = 4096 are good values. If you can fall back to single-precision numbers, it will run faster. Be advised that the plans should be created only once and then reused, as creating them takes some time.
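For example, a fragment (assuming d_field is a cufftDoubleComplex* already resident in GPU memory and numSteps is defined elsewhere):

```
// Create the 2048x2048 double-complex plan once, reuse it every time step.
#include <cufft.h>

cufftHandle plan;
cufftPlan2d(&plan, 2048, 2048, CUFFT_Z2Z);       // done once, outside the time loop

for (int step = 0; step < numSteps; ++step) {
    // ... gain-dynamics kernel on d_field ...
    cufftExecZ2Z(plan, d_field, d_field, CUFFT_FORWARD);
    // ... apply the propagation transfer function in the spatial-frequency domain ...
    cufftExecZ2Z(plan, d_field, d_field, CUFFT_INVERSE);   // note: cuFFT's inverse is unnormalized
}
cufftDestroy(plan);
```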

For the overall speed, I would first try out how fast the FFTs run on your graphics card. Use the cuFFT batch functions to process several FFTs at the same time (easier for the library to parallelize). Try out single and double precision, real and complex FFTs, and both 2Kx2K and 4Kx4K. Run them on memory that is already on the GPU. Then you can decide whether your GPU is fast enough. There is also the option of using multiple GPUs in one or several PCs.
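A batched plan might look roughly like this (a fragment; d_fields is a placeholder for 100 contiguous 2Kx2K fields, and the plan's workspace also consumes GPU memory):

```
// One batched plan transforms many 2Kx2K fields per call.
int n[2]  = {2048, 2048};
int batch = 100;                                  // fields laid out contiguously in d_fields
cufftHandle planBatched;
cufftPlanMany(&planBatched, 2, n,
              nullptr, 1, 2048 * 2048,            // default (contiguous) input layout
              nullptr, 1, 2048 * 2048,            // default (contiguous) output layout
              CUFFT_Z2Z, batch);

cufftExecZ2Z(planBatched, d_fields, d_fields, CUFFT_FORWARD);
```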

I would (half guessing) expect that with a single RTX 4090, 1000 time steps and 2Kx2K you will get into the right ballpark for single precision; for double precision with 10000 time steps and 4Kx4K it will probably take a few hours to simulate.

You are talking about special functions. The FFT is linear algebra. The SFU (Special Function Unit) helps with exponentials, cosines, etc. Those can often be approximated either by look-up tables or by polynomial approximations, which again comes back to linear algebra.
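Purely as an illustration of the polynomial idea (not production-accurate; the built-in exp/expf already do proper range reduction and should normally be used):

```
// Degree-7 Taylor polynomial for exp(x), Horner scheme with FMAs.
// Only reasonable for small |x|; shown only to illustrate polynomial approximation.
__device__ double exp_poly_small_x(double x)
{
    double p = 1.0 / 5040.0;            // 1/7!
    p = fma(p, x, 1.0 / 720.0);         // 1/6!
    p = fma(p, x, 1.0 / 120.0);         // 1/5!
    p = fma(p, x, 1.0 / 24.0);          // 1/4!
    p = fma(p, x, 1.0 / 6.0);           // 1/3!
    p = fma(p, x, 0.5);                 // 1/2!
    p = fma(p, x, 1.0);                 // 1/1!
    p = fma(p, x, 1.0);                 // 1/0!
    return p;
}
```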

For doing several different functions on the same data, just take the examples for single functions, chain them one after another, and do not copy the memory back between the kernel calls.
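As a sketch (all names are placeholders; planBatched as in the cuFFT example above):

```
// Several different operations chained on the same device buffers,
// with only the needed fields copied back to the host.
for (int step = 0; step < numSteps; ++step) {
    gainDynamics<<<grid, block>>>(d_fields, d_levels, nPerField, numFields);
    cufftExecZ2Z(planBatched, d_fields, d_fields, CUFFT_FORWARD);
    applyPropagator<<<grid, block>>>(d_fields, d_transferFn, nPerField, numFields);
    cufftExecZ2Z(planBatched, d_fields, d_fields, CUFFT_INVERSE);
    rescale<<<grid, block>>>(d_fields, nPerField, numFields);   // undo cuFFT's unnormalized inverse

    if (step % exportInterval == 0) {
        // only the one or two fields needed for analysis leave the GPU
        cudaMemcpy(h_export, d_fields + exportIndex * (size_t)nPerField,
                   nPerField * sizeof(cufftDoubleComplex), cudaMemcpyDeviceToHost);
    }
}
```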

Best,
Sebastian

The SFU offers only single-precision operations. It’s not really helpful for double-precision computations.

The DP:SP ratio of 1:64 specified for compute capability 8.9 means that using double-float computation is potentially attractive, but it also puts severe limits on the dynamic range. On a related note, I have been wondering for a while whether integer-based emulation of certain FP64 operations might actually be faster than equivalent native FP64 computation on that architecture.

@njuffa In this paper (https://arxiv.org/pdf/2203.03341.pdf) FP32 operations are simulated with TF32 operations on an A100 to achieve greater speed. The library is here: GitHub - wmmae/wmma_extension: An extension library of WMMA API (Tensor Core API)

I have myself only done combined INT8 Tensor core operations to achieve matrix multiplication with higher accuracy (complex fixed point so far). The number of operations increases with the square of the bit size. So 64 bits (or 53 significant bits) is perhaps a bit much for INT8 simulation? INT32 probably is good.

I was actually thinking of Dekker’s classical technique for approximately doubling the precision by operating on pairs of floating-point numbers, comprising the “head” and the “tail”, where |tail| <= 0.5 * ulp(|head|). See this answer of mine at Stackoverflow.

Assuming support for FMA (fused multiply-add), which is a given for GPUs, this boils down to something like 20 instructions for a pair-precision addition and 8 instructions for a pair-precision multiply. So on a platform where FP32 has 64 times the throughput of FP64, as is the case with sm_89, that should always result in a win in terms of performance.

Getting the basic arithmetic into place is not very hard; finding a library of basic mathematical functions would probably be more of a challenge. double-float retains the exponent range of float, which I could imagine is too limiting for general physics. My experience is that double computation dominates in computational physics.
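For what it's worth, a rough sketch of the pair-precision multiply built from the FMA-based error-free transformations discussed above (the addition needs more steps; see the linked Stack Overflow answer for complete, tested routines):

```
// Double-float (pair-precision) multiply: heads in .x, tails in .y.
__device__ float2 df_mul(float2 a, float2 b)
{
    float hi = a.x * b.x;
    float lo = fmaf(a.x, b.x, -hi);            // exact rounding error of the head product
    lo = fmaf(a.x, b.y, fmaf(a.y, b.x, lo));   // fold in the cross terms (tail*tail is dropped)
    float s = hi + lo;                         // renormalize into a new head/tail pair
    float e = lo - (s - hi);
    return make_float2(s, e);
}
```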

rs277
Thanks for the feedback. The laser dynamics needs a dynamic range of about 10 billion or more, pretty much due to the interference terms in the amplitudes. Heterodyne detection schemes use this effect for single-photon counting. Single precision, using natural-log functions without being able to flip between two or three approximations of the function, is a disaster. If I want to keep if/then statements out of my kernels, I will need double precision. If the if/then statements are not a problem, and the faster throughput makes up for different parts of the arrays needing different approximations, then this would be a consideration.

Also, how many kernels can I store in the GPU code memory that would operate on the same data in series? Can I preload them and just trigger them with the CPU? I would like to be able to load all the functions into a single kernel without having to go back and forth loading new kernels. I definitely want to keep the majority of the sim results in the GPU, only swap out what I need using a different kernel, and only release the memory when I'm done with the sim.

Thanks
glenn

Hello, Curefab
Thanks for the info. I'll need to find out how big my kernels are and whether they can fit in the available L2.

I’ll talk to folks and see what they suggest for workstation GPUs
glenn

Thanks njuffa, I will look into it. I am solving both a wave equation in a split-step sense and, in between the steps, a set of eight first-order non-linear differential equations. 2D complex FFTs, exponential functions and logarithms are essential. There is very little matrix or linear algebra in the model (except for the FFTs). The dynamics part is highly parallelizable, as the physics is all local and I can process many arrays through the same functions. Then I have to do the propagation of the optical fields, and I can do that in bulk too. Then switch back and repeat ad infinitum.

It’s also my first project using GPUs. I am very much a newbie.
glenn

Kernels are loaded into global memory on the GPU at startup. As for how many, I haven’t come across a figure, and I haven’t seen anyone complaining about a limit in that regard.

If you haven’t already seen it, you may find some benefit in the “Best Practices Guide”.

I’ll look at that guide

thanks

In the earliest CUDA-capable Nvidia GPU generations there was a limit of 2 million instructions per kernel. Since then, Nvidia GPUs have gained more complex memory management, and currently, AFAIK, the limit for kernel size is the (virtual) addressable space of 2^59 bytes for the Hopper generation (pre-Hopper: 2^49 bytes; the 2^49-byte address space is confirmed in the Nvidia Kernel Profiling Guide), with a SASS instruction size of 16 bytes, and kernels are (theoretically) allowed to span that size.

There are limits which become relevant much earlier: if your kernel has an inner loop and you make it too large, the instruction cache becomes too small.

Each SM partition (there are 4 partitions per SM) has about 32 KiB of L1 instruction cache and may have to serve several warps (typically around 4-8 or 4-16, depending on GPU capabilities), which may execute the same kernel position or totally different ones.
Each SM has about 128 KiB of L1.5 instruction cache.
After that the requests go into the L2 and then the global memory of the GPU.

Those cache sizes are from Turing generation and may have slightly changed.

I mostly saw this effect and the resulting slowdown when I unrolled too many loops (e.g. to use registers more efficiently, as they cannot be dynamically indexed; with loop unrolling, the index becomes static) and executed complicated algorithms within those unrolled loops (e.g. a complex single-precision FFT of size 64 done locally, just within the registers of one thread).

The CUDA programming model is meant for a limited number of kernels (which can still be thousands), which are either preloaded at program start or (especially in past years) loaded at first invocation, and which then stay within the huge global memory of the GPU. The amount of data and parallelization is huge compared to the size of the kernels. So using large kernels is possible, using many kernels is possible, and starting and restarting kernels is possible.

If you want to invoke a lot of kernels in a predefined and repeating way from the CPU side (instead of fusing the kernels into one larger kernel), you can look into CUDA graphs, which optimize away the microseconds of kernel invocation. But I do not think kernel invocation will be your predominant bottleneck.
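A sketch of stream capture for one repeated time step (placeholder kernel and buffer names; cuFFT work is routed to the captured stream with cufftSetStream, assuming a cuFFT version that supports capture; the cudaGraphInstantiate call uses the CUDA 12 signature):

```
// Capture one time step into a graph, then replay it with a single launch per step.
cudaStream_t stream;
cudaStreamCreate(&stream);
cufftSetStream(planBatched, stream);          // cuFFT work must go to the captured stream

cudaGraph_t graph;
cudaGraphExec_t graphExec;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
gainDynamics<<<grid, block, 0, stream>>>(d_fields, d_levels, n);
cufftExecZ2Z(planBatched, d_fields, d_fields, CUFFT_FORWARD);
applyPropagator<<<grid, block, 0, stream>>>(d_fields, d_transferFn, n);
cufftExecZ2Z(planBatched, d_fields, d_fields, CUFFT_INVERSE);
cudaStreamEndCapture(stream, &graph);

cudaGraphInstantiate(&graphExec, graph, 0);   // flags = 0

for (int step = 0; step < numSteps; ++step)
    cudaGraphLaunch(graphExec, stream);       // one cheap launch replays the whole step
cudaStreamSynchronize(stream);
```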


@qgeek4755 For a CUDA newbie, it will probably be best to follow the old maxim “Get it right before you make it fast” and use double computation for the initial implementation. This will provide a useful baseline in terms of functionality, accuracy, and performance.

Once everything works to your satisfaction, you could then explore where you can make do with float computation as part of a mixed-precision setup or where double computation can be replaced with pair-precision computation.