Abstract
Audio editing aims to modify sound content while maintaining its acoustic context. However, few studies have addressed temporal audio editing with precise control of event timing. We propose Edit the Moment, Keep the Rest (EMKR), an instruction-driven framework built on Stable Audio Open that supports five editing operations on sound events: adding, removing, replacing, moving, and extending. EMKR is trained with a temporal editing pipeline that constructs (instruction, input audio, target audio) triplets, where the instruction carries information about when to edit. It further incorporates an explicit edit interval for precise timing and a source-event mask that preserves the original event instance when moving and extending. Experiments on mixtures from real-world datasets show that EMKR enables time-localized editing of polyphonic audio with 100 ms precision while preserving non-edited regions. This temporal control enables timing-critical editing tasks such as moving and extending.
1. Model Architecture & Key Ideas
1.1. Overview
1.2. Key Ideas
- Temporal Triplets: Training data is constructed as (instruction, input audio, target audio) triplets on the fly, where each instruction specifies what to edit and when.
- Interval-based Timing Conditioning: A reference interval R = [r_s, r_e] identifies the target event instance, and an edit interval E = [e_s, e_e] specifies where the change is applied. Four timing embeddings are injected into the DiT via global addition and cross-attention.
- Source-Event Masking: For moving and extending, a binary mask highlights the reference interval in the latent space, preserving the source event's acoustic identity while the model relocates or extends it.
- Five Editing Operations: Adding, Removing, Replacing, Moving, and Extending — all with instance-level, time-localized control in polyphonic mixtures.
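To make the triplet and masking ideas above concrete, here is a minimal sketch of how a moving triplet and a source-event mask could be built from an isolated event and a background. It is an illustration under stated assumptions, not the EMKR pipeline: the function names, the 16 kHz sample rate, and the 20 Hz latent frame rate are all hypothetical choices made for this example, and the instruction template simply mirrors the wording of the demo instructions on this page.

```python
import numpy as np

SAMPLE_RATE = 16_000    # assumed sample rate in Hz (illustrative, not from the source)
LATENT_RATE_HZ = 20.0   # assumed latent frame rate in frames per second (illustrative)


def place_event(background, event, start_s, sample_rate=SAMPLE_RATE):
    """Return a copy of `background` with `event` mixed in, starting at `start_s`."""
    out = background.copy()
    start = int(round(start_s * sample_rate))
    end = min(start + len(event), len(out))  # clip the event at the end of the clip
    out[start:end] += event[: end - start]
    return out


def make_move_triplet(background, event, ref_start_s, edit_start_s, label,
                      sample_rate=SAMPLE_RATE):
    """Build a hypothetical (instruction, input audio, target audio) triplet for moving.

    The input holds the event at the reference interval; the target holds the
    same event at the edit interval, over the same background.
    """
    dur_s = len(event) / sample_rate
    input_audio = place_event(background, event, ref_start_s, sample_rate)
    target_audio = place_event(background, event, edit_start_s, sample_rate)
    instruction = (
        f"Move {label} from {ref_start_s:.1f}s–{ref_start_s + dur_s:.1f}s "
        f"to {edit_start_s:.1f}s–{edit_start_s + dur_s:.1f}s."
    )
    return instruction, input_audio, target_audio


def source_event_mask(ref_start_s, ref_end_s, num_frames,
                      latent_rate_hz=LATENT_RATE_HZ):
    """Binary mask over latent frames: 1 inside the reference interval, 0 elsewhere."""
    frame_times = np.arange(num_frames) / latent_rate_hz
    inside = (frame_times >= ref_start_s) & (frame_times < ref_end_s)
    return inside.astype(np.float32)


# Usage: a 10 s silent background and a 1.9 s placeholder event.
background = np.zeros(10 * SAMPLE_RATE, dtype=np.float32)
event = np.ones(int(round(1.9 * SAMPLE_RATE)), dtype=np.float32)
instruction, x_in, x_tgt = make_move_triplet(background, event, 7.3, 0.4, "screaming")
mask = source_event_mask(7.3, 9.2, num_frames=200)
```

In this sketch the input and target differ only in where the event sits, which is what lets a model learn to leave the rest of the mixture untouched. The mask is defined on latent frames rather than samples because, per the description above, the mask is applied in the latent space; the actual frame rate depends on the autoencoder used.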
2. Audio Editing Demos
Below we present editing examples for each of the five tasks. For each example, we show the input audio with its mel-spectrogram and the editing instruction with its reference/edit intervals, then compare the ground-truth target with the outputs of EMKR (ours), SAO-Instruct, and AUDIT.
Task: Adding
Insert a new sound event into a specified time interval of the mixture.
Example 1
Input Audio
Instruction
"Add a electric guitar event between 1.6s and 8.5s."
Output Comparison
Example 2
Input Audio
Instruction
"Add snare drum, start at 4.9s and end at 5.3s."
Output Comparison
Task: Removing
Remove a specific event instance from the mixture at the given time interval.
Example 1
Input Audio
Instruction
"Remove alarm from 6.7s to 9.6s."
Output Comparison
Example 2
Input Audio
Instruction
"Remove trumpet appearing between 0.3s and 3.0s."
Output Comparison
Task: Replacing
Replace a specific event instance with a different sound class at the same time interval.
Example 1
Input Audio
Instruction
"Replace bass guitar with zipper (clothing) from 6.3s to 10.0s."
Output Comparison
Example 2
Input Audio
Instruction
"Replace vehicle horn, car horn & honking by piano between 4.7s and 6.1s."
Output Comparison
Task: Moving
Relocate a specific event instance from the reference interval to a different edit interval, preserving its acoustic identity and the surrounding background.
Example 1
Input Audio
Instruction
"Move screaming from 7.3s–9.2s to 0.4s–2.3s."
Output Comparison
Example 2
Input Audio
Instruction
"Move alarm from 7.2s–10.0s to 0.3s–3.1s."
Output Comparison
Task: Extending
Extend an existing event instance to cover a longer time interval while preserving the original segment and background.
Example 1
Input Audio
Instruction
"Extend trumpet in 2.2s–4.1s to 1.1s–4.9s."
Output Comparison
Example 2
Input Audio
Instruction
"Extend piano in 7.0s–8.8s to 6.5s–10.0s."
Output Comparison