Audio Arrays

Dear community,

Your help is needed.

We have developed a technology that allows an unlimited number of I2S, PDM, TDM, SPI or other serial channels to be streamed to a PC over USB 3.0 without using FPGA or DSP chips (https://coherent-receiver.com/audio-and-microphone-arrays). Just a plain USB or Ethernet controller plus a little magic. One USB port can process up to 62 channels. All channels are synchronous: all signals are synchronized to a common clock and arrive on the host synchronized and interleaved per USB port. Multiple USB ports can be synchronized on the host, and all channels are available over USB, as plain files, or as named FIFOs, all in parallel. We use libusb to read from USB on Linux, and a Jetson Nano can process hundreds of channels.
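
To illustrate what "interleaved per USB port" means for host-side code, here is a minimal sketch of splitting an interleaved buffer into per-channel arrays. It assumes signed 16-bit samples in native byte order and a frame layout of ch0, ch1, ..., chN-1 repeated; the actual sample format of the device is not stated here, and the name `deinterleave` is only illustrative.

```python
import array

def deinterleave(raw: bytes, num_channels: int, typecode: str = "h") -> list:
    """Split an interleaved PCM buffer into one array per channel.

    Assumption: samples are laid out frame by frame (ch0, ch1, ..., chN-1)
    and use the machine's native byte order. "h" means signed 16-bit.
    """
    frame_bytes = array.array(typecode).itemsize * num_channels
    # Drop an incomplete trailing frame so every channel has the same length.
    usable = len(raw) - len(raw) % frame_bytes
    samples = array.array(typecode)
    samples.frombytes(raw[:usable])
    # Every num_channels-th sample, starting at offset ch, belongs to channel ch.
    return [samples[ch::num_channels] for ch in range(num_channels)]
```

Because all channels share one clock, sample k of every returned array belongs to the same sampling instant, which is what array algorithms rely on.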

Our multichannel radio monitoring system and a microphone array with external microphones were built as proofs of concept on Jetson Nano and Jetson Xavier. We checked the array by reading from all microphones in parallel, compressing the audio with ffmpeg, and sending it to a local Icecast server. Every microphone can be listened to from a browser over the network. It is also possible to record all microphones as plain files and open them in Audacity in parallel.

Now the question where we need help: the Nano and Xavier have great GPUs that can be used for beamforming, direction finding, DSP operations, etc. Which application scenarios could or should be implemented for such an array? Which experiments could be done, and in which areas? Is there any interest outside academia, e.g. from the music industry for multi-channel recording or playback?

Thanks in advance for your comments.

Interesting topic! I’m not working on this commercially, but it is a good topic.

You’ve heard of phased radar? It uses a series of small antennas, combined with modification of phase between antennas (such as via inductance or capacitance), and a computer can control it. As a result it can essentially use software to “focus” all antennas, or split several apart and track multiple objects if the signal is strong enough. Audio could use such a counterpart, but I’m not sure how hard it would be since there are a lot of differences between sound and radio waves.

I’ll give you an example of an early car alarm I had thought about, but related to the above. The idea being that if the car windows are closed, then the inside of the car has a certain audio “resonance”. The direction of a sound wave might result in having multiple resonant frequencies, depending on direction of reflection. The simplest form would be a microphone inside the car at some point, and a speaker, capable of working at frequencies the human ear cannot hear. Turn up the gain on the microphone, feed it to the speaker, and you get a feedback which is related to the “shape” of the inside of the car. One could distinguish changes in the “shape” of the inside car by analyzing changes to the sound of feedback.

This could be extended to use multiple microphones and multiple speakers tuned to different parts of the geometry. An array such as yours, if coupled with an array of some sort of “emitter” (in this case high-frequency audio), could be in its own new class of alarm if AI could analyze the audio memorized for the space being monitored.

This could be extended to buildings, very large rooms, and so on. You might consider creating an equivalent array of speakers where each speaker is some “standardized” (not random) model with predictable performance.

Just for trivia, you’ve heard of the new Starlink satellite internet? The reason it is possible to do this is because of a phased array antenna, rather than a typical dish. Such an array is capable of concentrating on a number of satellites and tracking each prior to actually dropping one out and switching. Makes it seamless. I think most of where your audio device would be interesting starts with tracking or working with multiple devices simultaneously in a changing environment.

On a different idea…you could actually combine this audio output (analyzing it for position and type of audio) with 3D video and mapping on a drone (obviously a rather big drone, but it looks feasible). Search and rescue or fire monitoring drones would have a new ability.

One thing I would suggest is learning to use AI to figure out how to use MEMS microphones in the same way as phased radar. As I mentioned before, audio and radio have some similarities, but the microphones would need an understanding of how the timing of audio between several microphones might be used in a similar way (e.g., it might be digital delay instead of inductive/capacitive phasing). Between that and possibly a similar output array you’d have a new industry.
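
The "digital delay instead of inductive/capacitive phasing" idea is usually called delay-and-sum beamforming. Below is a minimal sketch, assuming a far-field source, whole-sample delays, and a speed of sound of 343 m/s; the function names and the array geometry are illustrative and not taken from any existing product.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed)

def steering_delays(mic_positions, direction, fs):
    """Arrival delay of each microphone in whole samples (all >= 0).

    mic_positions: (M, 3) coordinates in meters.
    direction: vector pointing from the array toward the far-field source.
    """
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    # A microphone further along `direction` is closer to the source,
    # so the wavefront reaches it earlier.
    arrival = -(np.asarray(mic_positions, dtype=float) @ direction) / SPEED_OF_SOUND
    arrival -= arrival.min()
    return np.round(arrival * fs).astype(int)

def delay_and_sum(signals, delays):
    """Advance each channel by its arrival delay, then average.

    signals: (M, N) array with one row per microphone.
    delays: length-M integer sample delays, e.g. from steering_delays().
    """
    signals = np.asarray(signals, dtype=float)
    delays = np.asarray(delays, dtype=int)
    usable = signals.shape[1] - int(delays.max())
    out = np.zeros(usable)
    for ch, d in enumerate(delays):
        out += signals[ch, d:d + usable]
    return out / len(delays)
```

Sound from the steered direction adds up coherently while sound from other directions is attenuated; steering to a different direction only requires a different set of delays, so several beams can be computed from the same recorded data.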

And yes, it would actually be useful for music or theater recording. Perhaps such use could be improved by an output device since tones and strengths could be used to calibrate the responses of a stage or theater (sorry, no feedback fun!).

Thanks for your answer.

  1. Indeed, our initial development was RF-based. The prototype included multiple Zero-IF tuners and audio ADCs from AKM with I2S output. Please take a look at the possible architecture at I2S Multituner Receivers | Multichannel SDR Transceivers and Audio Arrays. Therefore, it is possible to build active electronically scanned arrays as well. The 768 kHz limit comes from the audio ADC we used. Theoretically, faster serial ADCs can be used as well.
  2. Our array looks bulky and massive because of the external microphones. Most audio processing algorithms, e.g. beamforming or DF methods like MUSIC and ESPRIT, depend heavily on the array topology, and the topology follows from the frequencies of interest: the lower the frequency, the wider the distance between microphones. It is much simpler to put all microphones on a single PCB, but that will work for higher frequencies only, e.g. > 10 kHz. Maybe there are methods for placing the microphones closer together than the lower wavelengths require, using matrix arrays. There are a lot of academic papers in this area, but most of them are purely theoretical, with models in Matlab.
  3. The construction of external microphone modules and microphone arrays is also not as trivial as it looks. I would very much appreciate a good resource describing the pitfalls of building an external MEMS-based microphone module. We can share and discuss the details of our current construction and implementation.
  4. There is a great Starlink video from Ken Keiter: Starlink Teardown: DISHY DESTROYED! - YouTube, describing the construction of the Dishy in detail. My understanding of the architecture is as follows: there are 1264 active elements. The smaller ICs, called phase shifters/switchers in the video, pass an analog IF signal from/to the antennas. The larger chips (79 of them, called integrated RF front ends in the video) steer the phase shifters and combine the received signals from the smaller chips (time-synchronous averaging in order to increase SNR). Moreover, the larger chip includes the ADC/DAC and the demodulator. The decoded signals are sent via a two-wire bus (RFFE?) to an application processor. The application processor processes the 79 serial inputs: all ADCs/DACs are sampled synchronously and all signals are transmitted synchronously and in parallel. Therefore, it is pretty easy to do error correction if the larger chips are tuned to the same frequency, or to multiplex the results from different larger chips if the frequencies are different. Finally, the stream is converted to TCP/IP.
  5. It is possible to build an audio array like the Starlink Dishy. Such arrays will work for the higher frequencies or as smart matrix arrays. In this case the smaller chips would be PDM to TDM converters, e.g. PCMD3180 - Octal-Channel, PDM Input to TDM. One of our USB multiplexers can process 30 inputs. Therefore, 30 TDM inputs with 8 channels each result in 240 audio channels per USB port. The Jetson has 4 USB 3.0 ports: therefore, 240 * 4 = 960 audio channels can be processed by a single Jetson. A comparable academic project did it differently: they fed PDM to microcontrollers with 30 inputs each. There were more than 30 microcontrollers on the PCB, also resulting in more than 1,000 audio channels. It was not explained in detail how the individual microcontrollers were synchronized, but theoretically this is doable.
  6. Regarding the car and car alarm scenarios: there is a lot of research on this topic in the automotive sector, e.g. road noise cancellation or scenarios where passengers are speaking and the driver cannot hear them. These scenarios require input and output with very low latency. Though our solution enables bi-directional transmission, we didn’t test it. Moreover, latency is a science of its own: even a task as simple as measuring the latency from sound to bits is problematic. Our current implementation is a plain user-space process based on libusb, and we read 30/60/180 times faster than the rate of an individual audio signal, bypassing the Linux audio system: raw PCM, comparable to I/Q in SDR. Therefore, I believe that it is possible to achieve low latency using the GPU. But what is low latency? How can the algorithm adapt to a changed environment? These are the next big questions. So thanks for the idea, it is an interesting topic, but we would be in competition with the major auto makers and research labs in this field and would have no chance.
  7. Drone scenario: we discussed this topic about two weeks ago. There were several interesting videos this summer: a shouting person on the ground and a drone that flies to this person. Our current implementation is too bulky and heavy for this. The design was not optimized for weight and power supply. Therefore, it would have to be redesigned, and the next big question is whether external microphones are necessary or whether the microphones can be placed on the PCB. With microphones on the PCB, the complete design is much simpler and the resulting unit is significantly friendlier for mobile applications. Propeller and wind noise are major issues, and the idea was to put the microphone acquisition system outside of the drone.
  8. You propose using AI, but the concrete AI applications were not clear to me. Speaking about automotive again: it is possible to take the sound from external microphones in a car and run a classification task on the Jetson, e.g. detection of an emergency siren. If the classifier fires, the sound direction can be calculated using multiple microphones and presented to the driver. Such mixed AI scenarios do not require multichannel microphone processing in the AI itself. Recently, I read a paper about a medical study that used a microphone array and AI (classification) to monitor heart rhythms without contact. It was not clear to me why the microphone array is needed in this case and which types of microphones are necessary.
  9. And yes, music can be an application as well, but this scenario would require different types of microphones, and the music industry already has all the necessary solutions.
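
As a rough numeric illustration of point 2 (array size versus frequency), the sketch below computes the acoustic wavelength and the half-wavelength element spacing that is the usual rule of thumb for avoiding spatial aliasing in a uniform array. It assumes a speed of sound of 343 m/s; the function names are illustrative.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed)

def wavelength_m(freq_hz: float, speed: float = SPEED_OF_SOUND) -> float:
    """Acoustic wavelength in meters for a given frequency."""
    return speed / freq_hz

def half_wavelength_spacing_m(max_freq_hz: float, speed: float = SPEED_OF_SOUND) -> float:
    """Largest element spacing that avoids spatial aliasing up to max_freq_hz."""
    return wavelength_m(max_freq_hz, speed) / 2.0

def max_unaliased_freq_hz(spacing_m: float, speed: float = SPEED_OF_SOUND) -> float:
    """Highest frequency a uniform array with this spacing samples without aliasing."""
    return speed / (2.0 * spacing_m)
```

At 10 kHz the wavelength is about 3.4 cm, so a spacing of roughly 1.7 cm fits easily on a PCB. At 100 Hz the wavelength is about 3.4 m, which is why an array that should resolve direction at low frequencies ends up physically large.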

Many thanks for your time!

Sorry, below is long, but I find it interesting…read at your own risk! This is more or less for my own entertainment, not sure if it would be practical commercially.

From number 5 of the above:

I would think that USB is problematic unless operating in isochronous mode. I am curious if this is the case? I think isochronous mode is the most difficult to work with on the Jetson side. Not so much due to USB (the Jetson would be in host mode, and it can work with isochronous), but because the Jetsons would probably have trouble consuming that much data with any sort of hard real time processing. I am guessing the biggest challenge is getting data into and out of the GPU without losing “frames” of data. The Audio Processing Engine (“APE”) uses a Cortex-R5 processor, which is good with hard real time, but memory transfer likely has no such ability to run with hard real time (there would be a need for a buffer and the buffer would harm latency).

From number 6 of the above:

In the case of an alarm which works via sampling what is essentially a set of resonant frequencies of a complex enclosed 3D shape one would not necessarily even care about something like road noise or people speaking. The idea there is to cause feedback at multiple frequencies “tuned” to the inside of the car’s “shape”, and to watch this with something like a Fourier transform. If the selected resonant frequencies change, then the shape has changed, and it is time for an alarm.
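
A minimal sketch of that idea, assuming one microphone channel is already available as a NumPy array: find the strongest spectral peaks of a baseline recording, then flag a change when the peaks of a later recording have moved by more than a tolerance. The function names and the tolerance are hypothetical, and a real system would need averaging and calibration on top of this.

```python
import numpy as np

def dominant_peaks(signal, fs, count=3):
    """Frequencies (Hz) of the `count` strongest local maxima in the spectrum."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(len(signal))  # reduce spectral leakage
    mag = np.abs(np.fft.rfft(signal * window))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    # A bin is a peak if it is larger than both of its neighbours.
    is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] >= mag[2:])
    peak_idx = np.flatnonzero(is_peak) + 1
    strongest = peak_idx[np.argsort(mag[peak_idx])[-count:]]
    return np.sort(freqs[strongest])

def resonance_shifted(baseline_peaks, current_peaks, tolerance_hz):
    """True if any monitored resonance moved by more than tolerance_hz."""
    baseline_peaks = np.asarray(baseline_peaks)
    current_peaks = np.asarray(current_peaks)
    if baseline_peaks.shape != current_peaks.shape:
        return True  # a resonance appeared or disappeared
    return bool(np.any(np.abs(current_peaks - baseline_peaks) > tolerance_hz))
```

The tolerance plays the role of the trigger window: small drifts stay below it, while a change in the enclosed shape that moves a resonance further than that raises the alarm.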

Unlike many alarms you could probably set up the software for arrays to be immune to things like other cars driving by (resonant frequencies would not change), or mice moving (a window for how far a resonant frequency must shift before triggering), or a fan turning on (read a great story once about a mystery alarm that turned out to be going off whenever the fax machine ran in the middle of the night). Such a system would also be expandable to large buildings, which is why this much complexity is of interest. Listening to multiple microphones would allow more complex resonant shapes to be monitored.

The part which is not obvious is that it isn’t just the total spectrum of key resonant components which matters, but also the timing of when a given spectrum is measured at different microphones. To illustrate, consider wearing stereo headphones. This works great for stereo positional audio when wearing headphones to listen. However, there are manufacturers of “surround sound” headphones. One would wonder how this would be possible and not just a gimmick when we have only two ears. Positional audio in headphones sort of works because certain tones which we want to “hear” with more directional information are given controlled delays…either to both left and right headphone, or to left versus right. This produces an illusion to the human listener of more directional information (microphone arrays produce an actual time delay if the same sound reaches a second microphone later than it reaches some other microphone).

I think if you were to try to check for multiple resonant frequencies in a complex enclosed space by audio, then timing of receiving a signal from different microphones at different locations would make for a much more reliable system (relative delays between microphones can also have a Fourier transform to treat time delays of similar audio into a set of discrete cosines…for an alarm it isn’t just about what the array receives, it is also about shifting of which microphone receives when). This sort of data works great in a GPU.
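
One standard way to measure "which microphone receives when" is cross-correlation between two channels; this is a common technique swapped in here for illustration, not necessarily the method meant above. The sketch assumes both channels are sampled on the same clock and uses illustrative function names.

```python
import numpy as np

def estimate_delay_samples(reference, delayed):
    """Number of samples by which `delayed` lags `reference`.

    A negative result means `delayed` actually leads `reference`.
    """
    ref = np.asarray(reference, dtype=float)
    sig = np.asarray(delayed, dtype=float)
    ref = ref - ref.mean()
    sig = sig - sig.mean()
    # corr[i] is the similarity at lag i - (len(ref) - 1).
    corr = np.correlate(sig, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

def estimate_delay_seconds(reference, delayed, fs):
    """Same estimate converted to seconds for a sample rate fs in Hz."""
    return estimate_delay_samples(reference, delayed) / fs
```

`np.correlate` works directly in the time domain, which is fine for short blocks; for long blocks or many microphone pairs an FFT-based correlation, possibly on the GPU, would be the usual replacement.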

From number 8 above:

Just speculating, but I suspect that not only is the timing of various chambers of the heart being measured, but also the “shape” of how the muscle contracts like a wave over the muscle is being examined. To illustrate, if someone has had a heart attack in the past, then some part of the heart (muscle) has died and no longer contracts, but this is only part of any chamber in the heart. Knowing the timing of contractions of the heart is important, but knowing if the contraction is a smooth wave traveling over the heart, versus hearing some detail of a non-smooth contraction might give hints of muscle damage. An array might be able to determine that sort of defect, whereas a single microphone would not have that ability.

Thanks for the very valuable answer.

  • We are using USB bulk and control transfers.
  • I took a quick look at the Audio Processing Engine (APE). Some thoughts:
  • It is unclear at first glance whether it can be used. There are multiple forum messages (e.g. Use of APE for audio processing? - #2 by ShaneCCC) saying that the ADSP is currently not supported.
  • I didn’t take a more detailed look because, according to the features listed in the specification (https://robu.in/wp-content/uploads/2020/12/NVIDIA-Jetson-Xavier-NX-Developer-Kit.pdf , page 17), the ADSP has no practical use for our solution:
    • 96 KB Audio RAM
    • Audio Hub (AHUB) I/O Modules: 2xI2S/3xDMIC/2xDSPK
    • Audio Hub (AHUB) Internal Modules:
    • Sample Rate converter
    • Mixer
    • Audio Multiplexer
    • Audio De-multiplexer
    • Master Volume Controller
    • Multi-Channel IN/OUT
    • Digital Audio Mixer: 10-in/5-out
    • Up to eight channels per stream
    • Parametric equalizer: up to 12 bands
  • Therefore, the CPU and GPU with a custom implementation of the algorithms should be used for real-time applications. For simulations, the data can be recorded as ordinary files and processed with Matlab or GNU Radio. Moreover, named FIFOs make real-time processing possible in GNU Radio and Matlab as well; in this case more powerful processors should be used. For example, an Intel/AMD machine with a GPU will be necessary for complex simulations.
  • I need to think about the proposed application scenarios. Here is a link to the paper that I have mentioned in the previous post: Using smart speakers to contactlessly monitor heart rhythms | Communications Biology
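
For anyone who wants to consume one of the named FIFOs (or a recorded file) from their own code, a minimal reader could look like the sketch below. It assumes interleaved little-endian signed 16-bit samples, which is an assumption and not a documented format of this device; the path in the usage comment is hypothetical.

```python
import struct

def iter_frames(stream, num_channels):
    """Yield one tuple of samples (one value per channel) per frame.

    `stream` is any buffered binary file object, e.g. a named FIFO or a
    recorded file opened with open(path, "rb").
    Assumption: interleaved little-endian signed 16-bit samples.
    """
    frame_size = 2 * num_channels
    layout = "<%dh" % num_channels
    while True:
        frame = stream.read(frame_size)
        if len(frame) < frame_size:
            break  # writer closed the FIFO or the file ended
        yield struct.unpack(layout, frame)

# Usage sketch (hypothetical path and channel count):
#
#     with open("/tmp/mic_array.fifo", "rb") as fifo:
#         for frame in iter_frames(fifo, num_channels=8):
#             ...  # frame[0] is channel 0, frame[1] is channel 1, and so on
```

Reading frame by frame keeps the example short; a real consumer would read larger blocks at once to keep up with hundreds of channels.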

I think most people who use the APE for their own use end up putting something like OpenRT on the core. By default it is already set up to help with audio, though I have not read enough to know to what extent. Just guessing, but the Cortex-R5 is probably “normally” used for the i2s, plus something else related to the data which “standard” sound apps would use. If you were to use this core with OpenRT, then you’d probably be free to use it for other purposes with hard real-time response (though an R5 core is not particularly powerful). Such a core would be 100% for your purposes if you could give up the i2s (perhaps you could implement your own i2s, but then you’d be reinventing the wheel).

Incidentally the Sensor Processing Engine (SPE) is also a Cortex-R5. Not sure if this could be “hijacked” and used for something else, but if you don’t need cameras, then this might be worth investigating if it can be used with OpenRT (anyone here who knows the answer to that?).

You’re using mostly the GPU/CPU for what you are doing, and although I don’t know if it is practical, you might consider what you could do with a Cortex-R5 with OpenRT if you are falling short on “real time” behavior. I don’t expect such a thing would be without a learning curve since this is not the intended use of the R5 cores.

The APE is nowhere near powerful enough for massive numbers of channels.

Then again: “A lot of channels,” by itself, is not something that will sell a system. Professional audio interfaces have done hundreds of channels in parallel for quite some time.

“A lot of channels, cheaply” isn’t even all that easy to sell, because most people who have a lot of channels also want those channels to be rock solid through a range of operating conditions. (These interfaces tend towards PCI-express, Thunderbolt, or Ethernet because of this.) Also, these people tend to require pretty good, low-noise, low-distortion analog implementations for those channels. (Maybe you’re already using outboard converters, in which case this is probably already solved.)

So, to my ears, this sounds a little like “a solution in search of a market,” rather than the other way around. Generally, products become successful when they address a specific existing need in an existing market.

That being said, if you can use the GPU to do autoconvolution on a hundred channels of microphones from far away, you might be able to build a device that certain law enforcement and intelligence authorities might be interested in. That could perhaps be a way to try to build a successful product on these particular capabilities :-)

  1. It is good to know about the availability and functionality of the APE, but it seems that it has no practical relevance for a massive number of channels. Thanks for this confirmation.
  2. You write: “It sounds a little like “a solution in search of a market,” rather than the other way around. Generally, products become successful when they address a specific existing need in an existing market.”
    Thanks for getting straight to the point. Yes, and this is my problem. We have a solution but do not have a market. The audio industry, automotive, and intelligence sectors already have their own solutions.
    And exactly this remark: “‘A lot of channels, cheaply’ isn’t even all that easy to sell, because most people who have a lot of channels” … have a specific application to solve. They do not need a lot of channels just to have a lot of channels.
  3. Therefore, if you know real application scenarios that need a lot of channels processed by a GPU in parallel, or know applications that cannot be implemented using available solutions, I would be happy to discuss. We can provide a single synchronisation and up to 30 serial channels with up to 100 MHz each, delivered to the host in parallel and bit-exact aligned.

Thanks in advance.
