CUDA test particle simulation with 10^6 particles and 10^5 constants

Hi,
I am trying to do a test particle simulation of 10^6 electrons moving in a magnetic field using the Runge-Kutta algorithm. The magnetic field everywhere is determined by a function containing 10^5 constants (such as B = a_1 sin(b_1 x) + a_2 sin(b_2 x) + …).

Where should I put the constants?

a) in constant memory (I wonder if it is large enough?)

b) write the constants directly into the code (the code would become very long, and the program itself would also occupy a lot of memory)

or is there some other method?

All the best, and thanks in advance
Yiwei

__constant__ memory holds up to 64 KB. Therefore 10^5 constants likely won’t all fit. To take advantage of the __constant__ cache you want to make sure your constant usage is uniform across the warp. If so, and there is some logical separation of your constants, you might still put some in __constant__ memory.
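
If a subset does fit, a minimal sketch of that route might look like this (array names and sizes are placeholders I made up, not from your code; 2 × 2048 double coefficients keeps the total well under 64 KB):

```
#include <cuda_runtime.h>

// Hypothetical __constant__ layout for a subset of the coefficients.
// 2 x 2048 doubles = 32 KB, comfortably inside the 64 KB limit.
#define NCONST 2048

__constant__ double d_a[NCONST];   // amplitudes a_i
__constant__ double d_b[NCONST];   // wave numbers b_i

// Host side: upload the coefficients once, before launching any kernels.
void upload_coefficients(const double *h_a, const double *h_b)
{
    cudaMemcpyToSymbol(d_a, h_a, NCONST * sizeof(double));
    cudaMemcpyToSymbol(d_b, h_b, NCONST * sizeof(double));
}
```

Since every thread of a warp would read d_a[i] and d_b[i] at the same loop index i, the access is uniform and the constant-cache broadcast works in your favor.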

Putting the constants directly in the code is likely a good choice if that is workable for you.

Otherwise, put the constants in global memory and mark the pointer that you pass to the kernel with const ... __restrict__ to take advantage of the read-only (RO) cache.
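
As a rough illustration of that last option, here is a sketch (kernel and parameter names are mine, and I am assuming the field form B(x) = a_1 sin(b_1 x) + a_2 sin(b_2 x) + … from your post):

```
// Hypothetical sketch: coefficients in global memory, pointers marked
// const ... __restrict__ so the loads can be served by the read-only cache.
__global__ void eval_field(const double * __restrict__ a,   // amplitudes a_i
                           const double * __restrict__ b,   // wave numbers b_i
                           const double * __restrict__ x,   // particle positions
                           double       * __restrict__ B,   // field at each particle
                           int n_coeff, int n_particles)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_particles) return;

    double xp  = x[p];
    double sum = 0.0;
    for (int i = 0; i < n_coeff; i++) {
        // all threads of a warp read a[i] and b[i] at the same i,
        // so these loads are uniform and cache well
        sum += a[i] * sin(b[i] * xp);
    }
    B[p] = sum;
}
```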

Hi,
Thanks for your kind reply. I have a question: where is the program itself loaded? Into global memory, or somewhere else? The program itself seems to occupy a lot of space.

All the best
Yiwei

Global memory is a logical space. It is usually physically backed by GPU DRAM memory.

The program itself loads (i.e., instructions are fetched) from a separate logical space that is also physically backed by GPU DRAM memory.

thanks

Please let us know how it goes. The GPU provides multiple constant banks, one of which is exposed as the 64 KB of __constant__ memory available to programmers, and another one is used by the compiler to store literal constants from the code. I have no idea what will happen when you compile code that contains 10^5 literal constants, as the constant bank can’t hold that many constants, best I know. Maybe the compiler converts them into loads of immediate data embedded into the instructions themselves.

Having a single function defined by 10^5 constants strikes me as highly unusual (to be perfectly honest, my very first thought was “that’s just ridiculous”). What is driving this? Is this a common occurrence in your field?

[Later:]

I did a quick experiment with a function defined by many literal constants of type double. For architectures < sm_70, the CUDA compiler uses MOV32I instructions to load the constants.

For sm_70, the compiler sticks the literal constants into constant bank 2 (that is, c[0x2][0xnnnn]). The compiler abnormally terminates with an access violation in ptxas when I exceed 8192 constants (= 64 KB). This is with the CUDA 9.2 toolchain. Will try the CUDA 11.1 toolchain next.
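
For anyone who wants to reproduce this, here is a sketch of how such a stress-test kernel can be generated (a reconstruction for illustration, not the exact code I used; the generated kernel evaluates a polynomial Horner-style so none of the literals can be folded away at compile time):

```
// Hypothetical generator: writes gen_kernel.cu containing a kernel with
// N distinct double literals, each consumed at run time via s = s * x + c_i.
// Inspect the generated machine code with, e.g.:
//   nvcc -arch=sm_70 -cubin gen_kernel.cu && cuobjdump --dump-sass gen_kernel.cubin
#include <cstdio>

int main(void)
{
    const int N = 10000;   // number of literal constants to emit
    FILE *f = fopen("gen_kernel.cu", "w");
    if (!f) return 1;
    fprintf(f, "__global__ void kern(double *out, double x)\n{\n");
    fprintf(f, "    double s = 0.0;\n");
    for (int i = 0; i < N; i++) {
        fprintf(f, "    s = s * x + %.17e;\n", 1.0 + i * 1e-9);
    }
    fprintf(f, "    *out = s;\n}\n");
    fclose(f);
    return 0;
}
```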

Cool, you have an sm_70 device now. (I misread.)

This observation of yours (if borne out on a recent toolchain) would be worth filing a bug about…
I could cobble something together, but it seems you already have the code…

The conclusion is incorrect. I simply compiled for sm_70, which is the latest architecture that CUDA 9.2 supports. This is a really old machine with an sm_30 GPU which I use for email and web browsing. I have to get over to my current developer machine with the Quadro RTX 4000 now.

The CUDA 11.1 toolchain seems to handle literal constants in excess of 64KB just fine.

What I see for sm_70 is that the constants that don’t fit into the constant bank are loaded using two moves from immediate, a MOV Rn, 0xnnnnnnnn for the least significant 32 bits followed by an IMAD.MOV.U32 Rn+1, RZ, RZ, 0xnnnnnnnn for the most significant 32 bits.

For sm_75 and sm_80, literal double constants that don’t fit into the constant bank are loaded with two consecutive UMOV Rn, 0xnnnnnnnn instructions.

The performance impact from having to use immediate loads instead of constant bank reference is probably (conjecture!) minimal for double-precision code. For single-precision code, the compiler should be able to fit many single-precision floating-point literals into the arithmetic instructions themselves so they don’t take up space in the constant bank, and twice as many constants can fit into the constant bank. Where that isn’t sufficient, a noticeable impact from the increased dynamic instruction count due to immediate load instructions seems likely.

The performance impact will probably depend on how the code is compiled. The original post suggests that there is one call to sin() for every two constants. So the majority of the time would likely be spent in trig function evaluation, unless the code is single-precision computation compiled with -use_fast_math.

Formatting hint: This forum supports markup like <sup></sup> for superscripts and <sub></sub> for subscripts: a<sup>b</sup>, log<sub>a</sub>b.


Thanks for the reply.
I am expecting a “uniform and isotropic” magnetic field. As is well known, a magnetic field has a specific direction, so if I want the field to be “uniform and isotropic”, I need to superpose a large number of magnetic fields (the number is determined by the desired degree of uniformity and isotropy).

Thanks for the explanation of the use case. Across a wide variety of computations, the largest number of coefficients I had encountered, regardless of the set of basis functions used, was about 1200 as an extreme case, but usually not in excess of a few hundred.

In response to this thread I also searched the literature on simulations of magnetic fields in various contexts and wasn’t able to find any use case requiring 10^5 compile-time constants.

(The attached code seems to be misleading, so I withdraw it.)

Use the CUDA profiler to guide you to the bottlenecks. One thing that doesn’t look right in this code is the calls to curand_init() inside the loop. My expectation would be that this is called once at the start of the kernel. Why are there calls to curand_uniform_double() given that the entire computation seems to be done in single precision, i.e. float?
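
The usual pattern is to initialize the per-thread PRNG state once, then draw inside the loop. A generic sketch (kernel and variable names are placeholders; this is not your withdrawn code):

```
#include <curand_kernel.h>

// Generic sketch: curand_init() once per thread before the loop,
// cheap curand_uniform() draws inside the loop. All single precision.
__global__ void particle_push(float *out, int n_steps, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState state;
    curand_init(seed, tid, 0, &state);   // expensive: do it once, not per step

    float acc = 0.0f;
    for (int step = 0; step < n_steps; step++) {
        // curand_uniform() returns a float, which matches float computation,
        // so curand_uniform_double() is not needed here
        acc += curand_uniform(&state);
    }
    out[tid] = acc;
}
```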

What does the loop represent? Looping over the 10^5 coefficients that define the function B()? If you do this much computation for each of 10^6 particles, I am not surprised that this proceeds at a snail’s pace. There seem to be on the order of 10^2 FP32 operations per loop iteration, so you are looking at roughly 10^2 × 10^5 × 10^6 = 10^13 FP32 operations for a single pass over all particles. Even on the fastest GPUs this would take a second or more, based on my back-of-the-envelope calculation.

As I stated previously, I have never encountered a function defined by 10^5 parameters, and I did not find anything like this in my perusal of the literature related to magnetic field simulations. This would suggest that either you are exploring an entirely novel idea here, or that the chosen approach is not well matched to the problem at hand.

This is your field, not mine, so it is really not my place to dispense domain-specific advice, but you might want to re-think the overall algorithmic approach before doing a deep dive on code optimization.

Thanks anyway.