I’m one of the programmers of The Usual Suspects; we emulate synthesizers of the 90s/2000s in realtime by emulating the DSP56300 series of Motorola DSPs.
The current technology targets either x86 or aarch64 host CPUs. We translate the DSP opcodes to host CPU opcodes, and we do this at runtime because the DSP code is self-modifying. While this works pretty well, I still have the (maybe crazy) idea to free up CPU resources by moving the DSP emulation to the GPU. The problem is that this DSP is not parallel in any way; it’s just the opposite.
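For context, the recompiler works roughly like this, greatly simplified (this is not our actual code, just the general shape of a block-based JIT with invalidation for self-modifying code; translate_block and next_pc are placeholders):
/* Simplified sketch of a block-based dynamic recompiler with invalidation
   for self-modifying code. translate_block() and next_pc() are placeholders
   for the real emulator internals. */
typedef void (*host_block_fn)(void);

extern host_block_fn translate_block (unsigned int pc);  /* placeholder */
extern unsigned int  next_pc (void);                     /* placeholder */

typedef struct {
    host_block_fn code;   /* translated host code for this DSP block        */
    int           valid;  /* cleared whenever the DSP writes into the block */
} block_entry;

static block_entry block_cache[65536];   /* one entry per DSP code address */

static void run_dsp (unsigned int pc)
{
    for (;;) {
        block_entry *b = &block_cache[pc & 0xffff];
        if (!b->valid) {                       /* (re)translate on demand    */
            b->code  = translate_block (pc);
            b->valid = 1;
        }
        b->code ();                            /* execute translated block   */
        pc = next_pc ();                       /* continue where it branched */
    }
}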
I have some CUDA experience, but that was many years ago, so I tried to read up a bit. This is my current idea:
Instead of generating x86 or aarch64 machine code, generating PTX code should be more or less straightforward. I read the PTX documentation and found some instructions in there that fit the 24-bit DSP pretty well, so that part should be easy to solve.
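For example, sign-extending a 24-bit operand and doing a widening multiply for a wide accumulator seem to map onto PTX quite directly. Something along these lines is what I have in mind (untested, wrapped in inline asm just for illustration; mac24 is a made-up name, and the DSP’s real fractional MAC semantics and saturation would need more care):
// Untested illustration: 24-bit operand handling via inline PTX.
// bfe.s32 extracts a bit field with sign extension, mul.wide.s32 yields a
// 64-bit product that can feed a wide accumulator (ignoring the DSP56300's
// fractional left shift and saturation modes).
__device__ long long mac24 (long long acc, int x, int y)
{
    int xs, ys;
    long long prod;
    asm ("bfe.s32 %0, %1, 0, 24;" : "=r"(xs) : "r"(x));  // sign-extend low 24 bits
    asm ("bfe.s32 %0, %1, 0, 24;" : "=r"(ys) : "r"(y));
    asm ("mul.wide.s32 %0, %1, %2;" : "=l"(prod) : "r"(xs), "r"(ys));
    return acc + prod;
}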
However, there are some open topics:
There is the PTX Compiler API, but what I haven’t figured out yet is whether it can be used on an end user system that only has recent drivers installed, or whether this compiler API is only available on development systems with the CUDA SDK installed. Being able to use it on end user systems would be mandatory.
Another concern is the speed of compilation. PTX is a text format and the driver needs to parse it; is there a binary format available too? I need to minimize compilation times, as I might have to rebuild code frequently (depending on the emulated DSP code).
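My fallback idea would be to hand the generated PTX text directly to the driver API, which JIT-compiles it at module load time and should therefore only require a driver on the end user machine. Roughly like this (untested sketch, error checking omitted; "dsp_block" is just a placeholder for the .entry name in the generated PTX):
/* Untested sketch: feed generated PTX (a text string) to the JIT embedded in
   the driver via the CUDA driver API. A CUDA context must already exist
   (cuInit/cuCtxCreate or any prior runtime API call). The same entry point
   also accepts a precompiled cubin image instead of PTX text. */
#include <cuda.h>

static CUfunction load_block (const char *ptx_source)
{
    CUmodule   module;
    CUfunction func;
    cuModuleLoadDataEx (&module, ptx_source, 0, NULL, NULL);  /* JIT happens here */
    cuModuleGetFunction (&func, module, "dsp_block");         /* placeholder name */
    /* in real code, keep the module handle around to unload it later */
    return func;                                  /* launch via cuLaunchKernel */
}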
Given that these DSPs are single-core, a GPU would be able to execute one thread only, i.e. there is hardly any parallelism. Are there any numbers available about the MIPS that can be achieved on a GPU in a scenario like this? The DSPs run at up to 150 MHz and execute roughly one instruction per cycle, and since most DSP instructions need more than one host instruction, I’d need good “single core” speed. How good are today’s GPUs at tasks like this? Is there any documentation about instruction latency/throughput for GPUs? For example, what is the latency/throughput of a multiply-add, etc.?
Also, once I have compiled a block of PTX code, what is the overhead involved in executing it? On a CPU, the overhead is just a jmp to the code; obviously the overhead to launch a PTX code block must be a lot higher. How many of these blocks can I execute per second?
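I could probably get a ballpark figure myself with something like the following (untested sketch using CUDA events; empty_kernel is just a stand-in for a compiled block):
// Untested sketch: measure the average launch + dispatch overhead of a trivial
// kernel by timing many back-to-back launches with CUDA events.
#include <stdio.h>

__global__ void empty_kernel (void) { }

int main (void)
{
    const int reps = 10000;
    float ms = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate (&start);
    cudaEventCreate (&stop);
    empty_kernel<<<1,1>>>();           // warm-up launch (context creation etc.)
    cudaDeviceSynchronize();
    cudaEventRecord (start, 0);
    for (int i = 0; i < reps; i++) {
        empty_kernel<<<1,1>>>();       // back-to-back null launches
    }
    cudaEventRecord (stop, 0);
    cudaEventSynchronize (stop);
    cudaEventElapsedTime (&ms, start, stop);
    printf ("average time per null kernel launch: %.2f microseconds\n",
            ms * 1000.0f / reps);
    cudaEventDestroy (start);
    cudaEventDestroy (stop);
    return 0;
}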
This is probably just a very crazy idea, and I admit that it would be GPU abuse, but in the audio production world GPUs usually do nothing more than draw some knobs on a screen, so there are plenty of unused resources there, while the CPU is always the limiting factor.
Would love to hear what you think. It would be a nice experiment; however, if there is no chance of reaching realtime speed, I can skip it.
Unless you wish to emulate 10K DSPs in parallel, GPUs are not a suitable target for this kind of emulation: they get their performance from running tens of thousands of threads concurrently, not from making a single thread fast.
What you would want instead is a CPU with extremely high single-core performance, and in particular one that is very fast at compiler-like tasks. A reasonable proxy for that is the 602.gcc component of SPECspeed 2017 Integer from the SPEC CPU 2017 benchmark suite, results for which are available in a public database.
Consulting the database, a system based on AMD Ryzen 9 7950X or AMD EPYC 4364P would appear to be the best choice, as these are tied for the high score of 20.6 in the gcc component.
Precisely that. Contemplating the use of a processor architecture optimized for maximum throughput on massively parallel tasks for a single-threaded, latency-sensitive use case strikes me as akin to inquiring about the towing capacity of an Alfa Romeo Spider as you plan to use it for an alpine crossing with a trailer in winter. You are obviously free to follow through on either idea …
For the speed of JIT compilation of PTX to SASS with ptxas, see this recent forum thread:
To get an idea of the performance of a single thread running on a GPU, you could use a mini-benchmark of your choice. For example, here I used a modified version of the “sieve” benchmark from the 1980s. CPU code:
#include <stdio.h>

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday (&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

#define true 1
#define false 0
#define size 65534
#define sizepl 65535

char flags[sizepl];

int main (void)
{
    double start, stop;
    int i, prime, k, count;

    start = second();
    count = 0;
    for (i = 0; i <= size; i++)
        flags[i] = true;
    // classic sieve over odd numbers: flags[i] stands for 2*i+3
    for (i = 0; i <= size; i++) {
        if (flags[i]) {
            prime = i + i + 3;
            k = i + prime;
            while (k <= size) {
                flags[k] = false;
                k += prime;
            }
            count = count + 1;
        }
    }
    stop = second();
    printf ("\n%d primes", count);
    printf ("\nelapsed=%.0f micro seconds\n", (stop - start) * 1e6);
    return 0;
}
Compile and run on a Xeon W-2133:
>icx /W4 /O3 /Qxskylake-avx512 sieve.c
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2023.0.0 Build 20221201
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
>sieve
12250 primes
elapsed=252 micro seconds
Now the CUDA code for the GPU:
#include <stdio.h>

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday (&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

#define true 1
#define false 0
#define size 65534
#define sizepl 65535

__device__ char flags[sizepl];

// Same sieve as the CPU version, executed by a single GPU thread.
__global__ void kernel (int *nbr_primes)
{
    int i, prime, k, count;

    count = 0;
    for (i = 0; i <= size; i++)
        flags[i] = true;
    for (i = 0; i <= size; i++) {
        if (flags[i]) {
            prime = i + i + 3;
            k = i + prime;
            while (k <= size) {
                flags[k] = false;
                k += prime;
            }
            count = count + 1;
        }
    }
    *nbr_primes = count;
}

int main (void)
{
    double start, stop;
    int count = 0, *count_d = 0;

    cudaMalloc ((void**)&count_d, sizeof (*count_d));  // also initializes the CUDA context
    start = second();
    kernel<<<1,1>>>(count_d);                          // one block, one thread
    cudaDeviceSynchronize();
    stop = second();
    cudaMemcpy (&count, count_d, sizeof count, cudaMemcpyDeviceToHost);
    printf ("\n%d primes", count);
    printf ("\nelapsed=%.0f micro seconds\n", (stop - start) * 1e6);
    cudaFree (count_d);
    return 0;
}
Note that this code is neither representative of best benchmarking practices nor necessarily representative of the performance characteristics of a processor-emulation task. But it is a reasonable demonstration that using a GPU to execute single-threaded code is not a particularly sensible approach.