Can NVIDIA's development stack replace the need for an FPGA in CNC motion control?

I work for a Computer Numerical Control (CNC) company that manufactures router tables and waterjets, AXYZ Automation Group (AAG). Our current motion controller, the A2MC, uses a Field Programmable Gate Array (FPGA) to output step and direction signals for the stepper motors that drive the machine. I am curious whether NVIDIA’s hardware/software solution could eliminate the need for the FPGA in our hardware. The FPGA was selected in 2008 because it was able to solve the standard formula for motion deterministically at 5.125 MHz.

Can NVIDIA’s hardware running CUDA code solve d = (j * t^3)/6 + (a * t^2)/2 + v * t at a minimum of 1 MHz (but ideally significantly faster) for 9 integrators, with +/-10 nanoseconds of jitter (discrepancies in the determinism), at a deterministic rate of 100 microseconds?

d = distance, in millimeters (mm); on the A2MC’s FPGA this unit was the Smidge, a very small unit

t = time, in milliseconds (ms); on the A2MC’s FPGA this unit is the Tick, where there are 4,150,000 ticks in a second

v = velocity, in millimeters per milliseconds (mm/ms)

a = acceleration, in millimeters per milliseconds squared (mm/ms^2)

j = jerk, in millimeters per milliseconds cubed (mm/ms^3)
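
For concreteness, here is a minimal sketch of the per-cycle calculation for the 9 integrators in CUDA (the axis count is ours; the parameter values and names are illustrative, not real A2MC data):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_AXES = 9; // one thread per integrator

// Evaluate d = (j*t^3)/6 + (a*t^2)/2 + v*t for each axis.
__global__ void motionStep(const double *j, const double *a,
                           const double *v, double t, double *d)
{
    int axis = blockIdx.x * blockDim.x + threadIdx.x;
    if (axis < NUM_AXES) {
        d[axis] = (j[axis] * t * t * t) / 6.0  // jerk term
                + (a[axis] * t * t) / 2.0      // acceleration term
                + v[axis] * t;                 // velocity term
    }
}

int main()
{
    double *j, *a, *v, *d;
    cudaMallocManaged(&j, NUM_AXES * sizeof(double));
    cudaMallocManaged(&a, NUM_AXES * sizeof(double));
    cudaMallocManaged(&v, NUM_AXES * sizeof(double));
    cudaMallocManaged(&d, NUM_AXES * sizeof(double));
    for (int i = 0; i < NUM_AXES; i++) {
        j[i] = 0.001; // mm/ms^3 (illustrative)
        a[i] = 0.01;  // mm/ms^2
        v[i] = 1.0;   // mm/ms
    }
    motionStep<<<1, NUM_AXES>>>(j, a, v, 0.1 /* ms */, d);
    cudaDeviceSynchronize();
    printf("axis 0: d = %f mm\n", d[0]);
    return 0;
}
```

The math itself is trivial; the question is whether it can be done on the stated schedule.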

While an FPGA, a GPU, and a CPU are fundamentally different, in 2024, can we solve the standard formula for motion deterministically using NVIDIA’s hardware stack (CPU and GPU)? Theoretically, if we can do this, we should be able to create a functional digital twin for our CNC machines. The value of this cannot be overstated: addressing bugs reported by customers, developing and testing new features, and tailoring machines could all be done without hardware. On the A2MC we can only reasonably achieve these things with fully functional machines. Buying into NVIDIA’s hardware stack promises other foundational improvements to our machines, such as deep learning vision models and predictive maintenance.

1 MHz corresponds to 1 microsecond cycle time.

You can’t do anything on current (ordinary, desktop, discrete) CUDA-capable GPUs in 1 microsecond, repetitively, deterministically. Furthermore, GPUs thrive on having a lot of the same kind of work to do, all at the same time. Even if you bumped your work up to 100 axes, there simply wouldn’t be enough parallelism exposed to take advantage of the GPU in a sensible way.
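
To put a number on that, a quick device-property query (a sketch; the exact figures depend on the GPU you run it on) shows how many threads a GPU wants resident just to stay busy:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of GPU 0
    int resident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("%s: %d SMs, up to %d resident threads\n",
           prop.name, prop.multiProcessorCount, resident);
    printf("9 axes would occupy %.3f%% of that thread capacity\n",
           9 * 100.0 / resident);
    return 0;
}
```

On a typical modern desktop GPU the answer comes back in the tens of thousands of threads or more, so 9 (or even 100) integrators leave nearly all of the machine idle.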

Many folks would say an ordinary desktop or discrete GPU is not deterministic at all, and I generally wouldn’t argue the point. These GPUs are nestled within a typically non-deterministic operating system - Linux or Windows - so to avoid a religious discussion about the definition, meaning, and interpretation of determinism, we could just offer that up and stop right there.

A GPU can certainly do a lot of calculations in a microsecond, but only in bulk, as part of a larger “kernel” that might take hundreds of microseconds or more to execute. Once you get down to 3-5 microseconds, you are at the launch latency of a GPU kernel.
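
You can measure that floor yourself with an empty kernel (a rough sketch; the number will vary with GPU, driver, and OS, and it captures launch plus completion, not launch alone):

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    emptyKernel<<<1, 1>>>(); // warm-up: absorb one-time initialization cost
    cudaDeviceSynchronize();

    const int iters = 10000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        emptyKernel<<<1, 1>>>();
        cudaDeviceSynchronize(); // wait for each launch to finish
    }
    auto stop = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(stop - start).count();
    printf("average launch + completion: %.2f us\n", us / iters);
    return 0;
}
```

Even with nothing to compute, each iteration costs on the order of microseconds - the entire 1 MHz budget is gone before any useful work starts.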

Aside: I guess numerical control has come a long way - not surprising. I used to work at Cincinnati Milacron on digital servos integrated into the A950 numerical control. We did not do PID, just PI; we used Siemens AC brushless servo motors, and we closed the position loop w.r.t. the upstream control at a rate of 2 milliseconds. We had a subspan interpolation system built on that, which we used to close the position and velocity loops every 500 microseconds (effectively dividing every CNC cycle into 4 subspans). We had a DSP chip running the PI calculations, plus a few other tasks such as resolver calculation, some torque ripple correction, and a few other things (that was primarily what I worked on - TMS320 DSP assembly).

This was sufficient for any work we could imagine on the turning and machining centers of the day. We were able to do things like solid tapping (without a floating tap holder), which was considered difficult at the time. We could remove metal as fast as any state of the art control of the day, such as GE/Fanuc, Siemens Sinumerik, or our own A850/A950 controls with full analog servos.

I guess there might be a need to close a velocity loop at a 1 microsecond rate, but it’s certainly beyond me, or anything we would have needed back then. The magnetic constants in the PM motors we used had no need for closure of a torque/current loop at a 1 microsecond rate, although at that point we were still doing our torque/current loop using analog electronics, with trapezoidal commutation. (Sinusoidal commutation was planned for but not implemented by the time I moved on to Honeywell.)

(Even 10 years after I left, controls engineers were imagining 100 microsecond torque loop closure rates, not 1 microsecond.)


Thanks for posting, this made for fascinating reading. I have a CS degree with a minor in mechanical engineering and my original goal was to work for a company like Traub or Kuka before Silicon Valley snatched me up :-)

The entire CUDA software stack, including all of the operating systems on which CUDA is supported, is not capable of hard real-time operation. All work is executed on a best-effort basis. Some people have successfully accelerated soft real-time applications with CUDA, where application performance degrades gracefully (e.g. dropped frames) when deadlines are missed.

Back when I was in school, an embedded 8086 with an 8087 math coprocessor or a DSP in the control was considered state of the art. Reading along, I therefore wondered why a fast embedded CPU with a built-in FPU (e.g. ARM) running an RTOS would not suffice in this application, as it should provide sufficient computational horsepower based on my understanding of the task at hand. Then I got to this requirement:

I would think that this precludes any solution based on common CPUs or GPUs. Assume a 1 GHz processor, where a mispredicted branch costs 12 cycles and an L2 cache miss 30 cycles. A limit of 10 ns of jitter (= 10 cycles) seems impossible to maintain in such an execution environment.
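
Spelling out the arithmetic (using the assumed figures above):

$$\frac{10\,\text{ns}}{1\,\text{ns/cycle}} = 10\ \text{cycles} \;<\; 12\ \text{cycles (one branch mispredict)} \;<\; 30\ \text{cycles (one L2 miss)}$$

A single mispredicted branch already exceeds the entire jitter budget.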

An FPGA seems indeed a good solution to the determinism requirements. I have fond memories of early Xilinx FPGAs, but I have not worked with FPGAs in the past 30 years. Back when I used them, the development tools were a source of major frustration. I would hope that much has changed for the better in this regard by now, making design changes less challenging.

Indeed, microstepping could involve very high pulse rates (e.g. 5 MHz). I don’t think that high a level of loop closure/control would be needed for practical BLDC/AC brushless servo motors, but I may be mistaken. We expected our axis motors to go up to either 3000 or 6000 RPM, and although the spindle could go much higher (20,000 to 200,000 RPM), we didn’t use the full velocity loop on the spindle motor. It was very often a variable-speed or flux-vector-control AC motor/drive anyway.
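
As a back-of-the-envelope check (my assumed figures, not from the original post: 200 full steps per revolution, 256x microstepping, 6000 RPM):

$$200 \times 256 \times \frac{6000}{60\,\text{s}} = 5.12\,\text{MHz}$$

which lands right around the step rates quoted at the top of the thread.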

Either I never heard of microstepping before or I did and forgot about it. Looking it up right now, it seems the earliest literature on it dates all the way back to 1980, and one of the earliest practical applications was in optical instruments requiring submicron positioning.

@Robert_Crovella Did you guys employ microstepping in any of your products when you were at Cincinnati Milacron?

No, we didn’t use stepper motors. We used what we called (brushless) AC servo motors. A stepper motor has a control methodology that, to my mind, doesn’t look much like that of classical DC motors. On the other hand, with a bit of effort (electronic commutation), our servo motors behaved exactly like classical DC motors, and we controlled them that way. You put current through the motor, and the torque constant told you how much torque, and in what direction, you would get.

Because there were no brushes or integrated commutation, you excited the motor in 3 phases. The motors had samarium cobalt permanent magnets in the rotor, and you had phases in the stator that you would energize (commutate) to make the motor spin. In some ways this is similar to the motors that Tesla uses in their EVs, although there are significant differences in rotor and stator design, as the Tesla motors blend BLDC and switched-reluctance designs for benefits and efficiency at both low and high speed. We didn’t care as much about efficiency as Tesla does; we weren’t running off batteries.
