CUDA test particle simulation with 10^6 particles and 10^5 constants

Hi,
I am trying to do a test particle simulation of 10^6 electrons moving in a magnetic field using the Runge-Kutta algorithm. The magnetic field everywhere is determined by a function containing 10^5 constants (such as B = a_1 sin(b_1 x) + a_2 sin(b_2 x) + …).

Where should I put the constants?

a) in constant memory (I wonder if it is large enough?)

b) write the constants directly into the code (the code would become very long, and the program itself would also occupy a lot of memory)

or is there some other method?

All the best, and thanks in advance
Yiwei

__constant__ memory holds up to 64 KB. Therefore 10^5 constants likely won’t all fit. To take advantage of the __constant__ cache you want to make sure your constant usage is uniform across the warp. If so, and there is some logical separation of your constants, you might still put some in __constant__ memory.
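
If a subset does fit, a minimal sketch of that route might look like this (array names and sizes are placeholders I made up, not from your code; 2 × 2048 double coefficients keeps the total well under 64 KB):

```
#include <cuda_runtime.h>

// Hypothetical __constant__ layout for a subset of the coefficients.
// 2 x 2048 doubles = 32 KB, comfortably inside the 64 KB limit.
#define NCONST 2048

__constant__ double d_a[NCONST];   // amplitudes a_i
__constant__ double d_b[NCONST];   // wave numbers b_i

// Host side: upload the coefficients once, before launching any kernels.
void upload_coefficients(const double *h_a, const double *h_b)
{
    cudaMemcpyToSymbol(d_a, h_a, NCONST * sizeof(double));
    cudaMemcpyToSymbol(d_b, h_b, NCONST * sizeof(double));
}
```

Since every thread of a warp would read d_a[i] and d_b[i] at the same loop index i, the access is uniform and the constant-cache broadcast works in your favor.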

Putting the constants directly in the code is likely a good choice if that is workable for you.

Otherwise, put the constants in global memory and mark the pointer that you pass to the kernel with const ... __restrict__ to take advantage of the read-only (RO) cache.
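
As a rough illustration of that last option, here is a sketch (kernel and parameter names are mine, and I am assuming the field form B(x) = a_1 sin(b_1 x) + a_2 sin(b_2 x) + … from your post):

```
// Hypothetical sketch: coefficients in global memory, pointers marked
// const ... __restrict__ so the loads can be served by the read-only cache.
__global__ void eval_field(const double * __restrict__ a,   // amplitudes a_i
                           const double * __restrict__ b,   // wave numbers b_i
                           const double * __restrict__ x,   // particle positions
                           double       * __restrict__ B,   // field at each particle
                           int n_coeff, int n_particles)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_particles) return;

    double xp  = x[p];
    double sum = 0.0;
    for (int i = 0; i < n_coeff; i++) {
        // all threads of a warp read a[i] and b[i] at the same i,
        // so these loads are uniform and cache well
        sum += a[i] * sin(b[i] * xp);
    }
    B[p] = sum;
}
```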

Hi,
Thanks for your kind reply. I have a question: where is the program itself loaded? Into global memory, or somewhere else? The program itself seems to occupy a lot of space.

All the best
Yiwei

Global memory is a logical space. It is usually physically backed by GPU DRAM memory.

The program itself loads (i.e., instructions are fetched) from a separate logical space that is also physically backed by GPU DRAM memory.

thanks

Please let us know how it goes. The GPU provides multiple constant banks, one of which is exposed as the 64 KB of __constant__ memory available to programmers, and another one is used by the compiler to store literal constants from the code. I have no idea what will happen when you compile code that contains 10^5 literal constants, as the constant bank can’t hold that many constants, best I know. Maybe the compiler converts them into loads of immediate data embedded into the instructions themselves.

Having a single function defined by 10^5 constants strikes me as highly unusual (to be perfectly honest, my very first thought was “that’s just ridiculous”). What is driving this? Is this a common occurrence in your field?

[Later:]

I did a quick experiment with a function defined by many literal constants of type double. For architectures < sm_70, the CUDA compiler uses MOV32I instructions to load the constants.

For sm_70, the compiler sticks the literal constants into constant bank 2 (that is, c[0x2][0xnnnn]). The compiler abnormally terminates with an access violation in ptxas when I exceed 8192 constants (= 64 KB). This is with the CUDA 9.2 toolchain. Will try the CUDA 11.1 toolchain next.
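
For anyone who wants to reproduce this, here is a sketch of how such a stress-test kernel can be generated (a reconstruction for illustration, not the exact code I used; the generated kernel evaluates a polynomial Horner-style so none of the literals can be folded away at compile time):

```
// Hypothetical generator: writes gen_kernel.cu containing a kernel with
// N distinct double literals, each consumed at run time via s = s * x + c_i.
// Inspect the generated machine code with, e.g.:
//   nvcc -arch=sm_70 -cubin gen_kernel.cu && cuobjdump --dump-sass gen_kernel.cubin
#include <cstdio>

int main(void)
{
    const int N = 10000;   // number of literal constants to emit
    FILE *f = fopen("gen_kernel.cu", "w");
    if (!f) return 1;
    fprintf(f, "__global__ void kern(double *out, double x)\n{\n");
    fprintf(f, "    double s = 0.0;\n");
    for (int i = 0; i < N; i++) {
        fprintf(f, "    s = s * x + %.17e;\n", 1.0 + i * 1e-9);
    }
    fprintf(f, "    *out = s;\n}\n");
    fclose(f);
    return 0;
}
```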

Cool, you have an sm_70 device now. (I misread.)

This observation of yours (if borne out on a recent toolchain) would be worth filing a bug about…
I could cobble something together, but it seems you already have the code…

The conclusion is incorrect. I simply compiled for sm_70, which is the latest architecture that CUDA 9.2 supports. This is a really old machine with an sm_30 GPU which I use for email and web browsing. I have to get over to my current developer machine with the Quadro RTX 4000 now.

The CUDA 11.1 toolchain seems to handle literal constants in excess of 64KB just fine.

What I see for sm_70 is that the constants that don’t fit into the constant bank are loaded using two moves from immediate, a MOV Rn, 0xnnnnnnnn for the least significant 32 bits followed by an IMAD.MOV.U32 Rn+1, RZ, RZ, 0xnnnnnnnn for the most significant 32 bits.

For sm_75 and sm_80, literal double constants that don’t fit into the constant bank are loaded with two consecutive UMOV Rn, 0xnnnnnnnn instructions.

The performance impact from having to use immediate loads instead of constant bank reference is probably (conjecture!) minimal for double-precision code. For single-precision code, the compiler should be able to fit many single-precision floating-point literals into the arithmetic instructions themselves so they don’t take up space in the constant bank, and twice as many constants can fit into the constant bank. Where that isn’t sufficient, a noticeable impact from the increased dynamic instruction count due to immediate load instructions seems likely.

The performance impact will probably depend on how the code is compiled. The original post suggests that there is one call to sin() for every two constants. So the majority of the time would likely be spent in trig function evaluation, unless the code is single-precision computation compiled with -use_fast_math.

Formatting hint: This forum supports markup like <sup></sup> for superscripts and <sub></sub> for subscripts: a<sup>b</sup>, log<sub>a</sub>b.


Thanks for the reply.
I am expecting a “uniform and isotropic” magnetic field. As is well known, a magnetic field has a specific direction, so if I want the field to be “uniform and isotropic”, I need to superpose a large number of magnetic fields (the number is determined by the desired degree of uniformity and isotropy).

Thanks for the explanation of the use case. Across a wide variety of computations, the largest number of coefficients I had encountered, regardless of the set of basis functions used, was about 1200 as an extreme case, but usually not in excess of a few hundred.

In response to this thread I also searched the literature on simulations of magnetic fields in various contexts and wasn’t able to find any use case requiring 10^5 compile-time constants.

(The attached code seems to be misleading, so I withdraw it.)

Use the CUDA profiler to guide you to the bottlenecks. One thing that doesn’t look right in this code is the calls to curand_init() inside the loop. My expectation would be that this is called once at the start of the kernel. Why are there calls to curand_uniform_double() given that the entire computation seems to be done in single precision, i.e. float?
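
The usual pattern is to initialize the per-thread PRNG state once, then draw inside the loop. A generic sketch (kernel and variable names are placeholders; this is not your withdrawn code):

```
#include <curand_kernel.h>

// Generic sketch: curand_init() once per thread before the loop,
// cheap curand_uniform() draws inside the loop. All single precision.
__global__ void particle_push(float *out, int n_steps, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState state;
    curand_init(seed, tid, 0, &state);   // expensive: do it once, not per step

    float acc = 0.0f;
    for (int step = 0; step < n_steps; step++) {
        // curand_uniform() returns a float, which matches float computation,
        // so curand_uniform_double() is not needed here
        acc += curand_uniform(&state);
    }
    out[tid] = acc;
}
```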

What does the loop represent? Looping over the 10^5 coefficients that define the function B()? If you do this much computation for each of 10^6 particles, I am not surprised that this proceeds at a snail’s pace. There seem to be on the order of 10^2 FP32 operations per loop iteration, so you are looking at roughly 10^2 × 10^5 × 10^6 = 10^13 FP32 operations for a single pass over all particles. Even on the fastest GPUs this would take a second or more, based on my back-of-the-envelope calculation.

As I stated previously, I have never encountered a function defined by 10^5 parameters, and I did not find anything like this in my perusal of the literature related to magnetic field simulations. This would suggest that either you are exploring an entirely novel idea here, or that the chosen approach is not well matched to the problem at hand.

This is your field, not mine, so it is really not my place to dispense domain-specific advice, but you might want to re-think the overall algorithmic approach before doing a deep dive on code optimization.

Thanks anyway.