I saw a paper this morning on Hugging Face, “Step-Audio-EditX Technical Report,” and I wanted to try running the code following the directions on the GitHub page (stepfun-ai/Step-Audio-EditX: a 3B-parameter, LLM-based reinforcement-learning audio editing model that edits emotion, speaking style, and paralinguistics, and also does robust zero-shot text-to-speech).
I ran into one issue: Microsoft does not publish an aarch64 Linux wheel for the pinned onnxruntime-gpu version.
× No solution found when resolving dependencies:
╰─▶ Because onnxruntime-gpu==1.17.0 has no wheels with a matching platform tag (e.g., `manylinux_2_39_aarch64`) and you
require onnxruntime-gpu==1.17.0, we can conclude that your requirements are unsatisfiable.
hint: Wheels are available for `onnxruntime-gpu` (v1.17.0) on the following platforms: `manylinux_2_28_x86_64`,
`win_amd64`
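For context, you can confirm the platform-tag mismatch locally. A quick check (the outputs shown are what I’d expect on a DGX Spark, so treat them as assumptions):
uname -m
# aarch64
python3 -c "import sysconfig; print(sysconfig.get_platform())"
# linux-aarch64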
Dependency #1 – CUDNN_HOME
I found I needed to set CUDNN_HOME. I’m not sure of the best way to do this, but here is what I did:
mkdir -p ~/cudnn/
cd ~/cudnn/
wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-aarch64/cudnn-linux-aarch64-9.13.1.26_cuda13-archive.tar.xz
tar -xf cudnn-linux-aarch64-9.13.1.26_cuda13-archive.tar.xz
export CUDNN_HOME=~/cudnn/cudnn-linux-aarch64-9.13.1.26_cuda13-archive
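As a sanity check, the directory CUDNN_HOME points at should contain the headers and libraries the build expects:
ls $CUDNN_HOME/include/cudnn*.h
ls $CUDNN_HOME/lib/libcudnn*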
There has got to be a better way? Or is NVIDIA’s odd Python packaging of cuDNN the best option, and I just can’t figure out how to make ONNX Runtime’s build.sh use that package? Either way, CUDNN_HOME is needed for the ONNX Runtime build.
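If the pip route is viable, it might look something like the sketch below. I have not verified this on the Spark; it assumes NVIDIA publishes an nvidia-cudnn-cu13 wheel that keeps the include/ and lib/ layout under the module directory the way the cu12 wheels do, and it needs to run inside whatever venv you’re building in:
uv pip install nvidia-cudnn-cu13
export CUDNN_HOME=$(python -c "import os, nvidia.cudnn; print(os.path.dirname(nvidia.cudnn.__file__))")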
Dependency #2 – ONNX Runtime
This was an adventure to get built. I could not figure out how to get a compatible wheel from Microsoft, so I had to build the wheel from source.
I set up my cargo-cult build environment:
export TORCH_CUDA_ARCH_LIST=12.1a        # GB10 (Blackwell) compute capability on DGX Spark
export TRITON_PTXAS_PATH=$(which ptxas)  # point Triton at the system ptxas
export CUDA_HOME=/usr/local/cuda
export UV_TORCH_BACKEND=auto             # let uv pick a matching PyTorch CUDA build
export MAX_JOBS=4                        # limit parallel compile jobs
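Before kicking off the build, I’d sanity-check that the CUDA toolchain these variables point at is actually visible (my own habit, not from the linked instructions):
nvcc --version
which ptxas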
I found directions for building on DGX Spark in a comment on a GitHub issue: “Pip cannot find package for Nvidia DGX Spark (arm linux)” (microsoft/onnxruntime#26351).
uv venv venv-build-onnxruntime
source ./venv-build-onnxruntime/bin/activate
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
uv pip install cmake ninja packaging numpy setuptools
sh build.sh --config Release --build_dir build/cuda13 --parallel 4 --nvcc_threads 1 --use_cuda \
--cuda_version 13.0 --cuda_home $CUDA_HOME \
--cudnn_home $CUDNN_HOME \
--build_wheel --skip_tests \
--cmake_generator Ninja \
--use_binskim_compliant_compile_flags \
--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=121 onnxruntime_BUILD_UNIT_TESTS=OFF
mkdir -p ~/wheels
cp ./build/cuda13/Release/dist/onnxruntime_gpu-1.24.0-cp312-cp312-linux_aarch64.whl ~/wheels
deactivate
Now there is a wheel built that should work on the DGX Spark.
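A quick smoke test of the wheel (a sketch; run in any Python 3.12 venv):
uv pip install ~/wheels/onnxruntime_gpu-1.24.0-cp312-cp312-linux_aarch64.whl
python -c "import onnxruntime as ort; print(ort.__version__, ort.get_available_providers())"
# CUDAExecutionProvider should appear in the providers list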
Dependency #3 – ffmpeg
sudo apt install ffmpeg
This is needed for audio file processing in the demo app.
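A one-liner to confirm it installed cleanly:
ffmpeg -version | head -n 1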
Build Steps
Now things are ready to follow the GitHub directions, almost.
Where their GitHub says:
git clone https://github.com/stepfun-ai/Step-Audio-EditX.git
conda create -n stepaudioedit python=3.10
conda activate stepaudioedit
cd Step-Audio-EditX
pip install -r requirements.txt
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-Tokenizer
git clone https://huggingface.co/stepfun-ai/Step-Audio-EditX
I did the following:
uv venv venv-step-audio
source ./venv-step-audio/bin/activate
uv pip install ~/wheels/onnxruntime_gpu-1.24.0-cp312-cp312-linux_aarch64.whl
git clone https://github.com/stepfun-ai/Step-Audio-EditX.git
cd Step-Audio-EditX
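One caveat about the venv created above: the wheel is tagged cp312, so the venv needs Python 3.12 (upstream’s conda instructions use 3.10). If uv’s default interpreter is something else, pin it explicitly, e.g.:
uv venv venv-step-audio --python 3.12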
Then I had to comment out line 7 in requirements.txt:
diff --git a/requirements.txt b/requirements.txt
index de402f1..b8fa290 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,3 +6,3 @@ accelerate==1.3.0
openai-whisper==20240930
-onnxruntime-gpu==1.17.0
+#onnxruntime-gpu==1.17.0
onnxruntime
With that change I was able to finish the step, and everything I’ve tried below has been working.
uv pip install -r requirements.txt
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-Tokenizer
git clone https://huggingface.co/stepfun-ai/Step-Audio-EditX
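Because requirements.txt still pulls in the CPU-only onnxruntime package, I’d double-check that the locally built GPU wheel is the one that actually loads (my own sanity check, not part of the repo’s directions):
uv pip list | grep -i onnxruntime
python -c "import onnxruntime as ort; print(ort.__version__, ort.get_available_providers())"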
It runs a little demo server on port 7860 that lets you do one of two things with an uploaded voice recording. You can “Clone” the voice and make it say anything you want, or use “Edit” mode to clean up the voice and edit the emotion. You can also do paralinguistic editing, but I haven’t figured that one out in the UI yet.