Deploying SLM/DSLM Workloads on DGX Spark and Possible Optimization Strategies

Hi everyone,

I am a researcher at a university research center.

Our research focuses on Physical AI in extreme communication environments, particularly underwater and polar scenarios.

We are planning to use an NVIDIA DGX Spark system (GB10 Grace Blackwell Superchip with 128GB of unified memory) to run Small Language Models (SLMs) or Domain-Specific Language Models (DSLMs) for tasks such as sensor data interpretation, channel modeling, and adaptive communication decision support.

I would appreciate guidance from the community regarding the following.

  1. Deploying SLM/DSLM on the DGX Platform

What is the recommended approach for deploying and operating SLM/DSLM workloads on DGX Spark?

Specifically, I would like to understand:

  1. Which software stack is commonly used for SLM inference on this platform
    (e.g., TensorRT-LLM, Triton Inference Server, vLLM, NeMo, or other frameworks);
    a rough sketch of the setup I have in mind follows this list

  2. Best practices for utilizing the 128GB unified memory architecture when running models with longer context windows or time-series sensor data

  3. Whether there are recommended container-based pipelines (Docker / Kubernetes / NGC stacks) for running LLM/SLM workloads on DGX systems
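
For reference on item 1, this is roughly the kind of minimal setup I have in mind. It is only a sketch, assuming vLLM's Python API runs on the GB10 (which is part of what I am asking); the model ID, context length, and memory fraction below are placeholders, not recommendations:

```python
from vllm import LLM, SamplingParams

# Sketch only: the model ID below is a placeholder SLM.
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    dtype="bfloat16",
    max_model_len=8192,            # longer context for time-series sensor logs
    gpu_memory_utilization=0.80,   # leave headroom in the 128GB unified pool
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Interpret the following underwater acoustic channel measurements: ..."],
    params,
)
print(outputs[0].outputs[0].text)
```

In particular, I am unsure how a knob like gpu_memory_utilization should be set when CPU and GPU share a single 128GB pool (item 2), and whether this would normally run inside an NGC container (item 3).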

  2. Performance Optimization for SLM/DSLM Workloads

If additional performance optimization is required after deployment, what strategies are generally recommended on DGX Spark? For example:

  • Parameter-efficient fine-tuning approaches (e.g., LoRA / QLoRA) for adapting models to our domain (a minimal sketch follows this list)
  • Model compression techniques such as FP4 or INT8 quantization
  • GPU kernel or inference optimization through TensorRT-LLM or CUDA-based approaches
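
To make the first two bullets concrete, here is a minimal QLoRA-style sketch: the base model loaded with 4-bit NF4 weight quantization via bitsandbytes, with LoRA adapters attached via PEFT. I am assuming the standard Hugging Face transformers / peft / bitsandbytes stack is usable on this platform (again, part of my question); the model ID and LoRA hyperparameters are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder SLM

# Load the base model with 4-bit NF4 weight quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach low-rank adapters; only these small matrices are trained.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                  # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction should be trainable
```

My open question is whether this bitsandbytes-based path is supported on GB10, or whether the recommended route on Blackwell is NeMo / TensorRT-LLM tooling (e.g., for FP4), per the third bullet.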

Any references to documentation, example projects, or relevant NVIDIA resources would be greatly appreciated.

Also, if there are any existing threads or official guides in this forum that cover similar topics, I would appreciate it if you could point me toward them.

Thank you for reading my post.