Abstract

Audio editing aims to modify sound content while maintaining its acoustic context. However, few studies have addressed temporal audio editing, in which the timing of each event is controlled precisely. We propose Edit the Moment, Keep the Rest (EMKR), an instruction-driven framework built on Stable Audio Open that supports five editing operations on sound events: adding, removing, replacing, moving, and extending. EMKR is trained with a temporal editing pipeline that constructs triplets (instruction, input audio, target audio), where the instruction specifies when to edit. It further incorporates an explicit edit interval for precise timing and a source-event mask that preserves the original event instance in moving and extending. Experiments on mixtures from real-world datasets show that EMKR enables time-localized editing of polyphonic audio with 100 ms precision while preserving non-edited regions. This temporal control enables timing-critical editing tasks such as moving and extending.

Contents

  1. Model Architecture & Key Ideas
  2. Audio Editing Demos

1. Model Architecture & Key Ideas

1.1. Overview

Figure 1. Overall architecture of the proposed EMKR framework.

1.2. Key Ideas

2. Audio Editing Demos

Below we present editing examples for each of the five tasks. For each example, we show the input audio with its mel-spectrogram, the editing instruction with its reference interval R and edit interval E, and the ground-truth target alongside the outputs of EMKR (ours), SAO-Instruct, and AUDIT. AUDIT outputs are shown only for adding, removing, and replacing.
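To make the interval notation concrete, the following is a minimal sketch of how an interval such as R or E could be turned into a frame-level binary mask. The 100 ms frame size mirrors the precision reported in the abstract. The function name interval_to_mask, the 10 s clip length, and the mask representation itself are assumptions made for illustration and are not taken from the EMKR implementation.

```python
def interval_to_mask(start_s, end_s, clip_s=10.0, frame_s=0.1):
    """Return one 0/1 flag per frame, marking the frames in [start_s, end_s).

    clip_s and frame_s are illustrative defaults (assumed, not from EMKR).
    """
    n_frames = round(clip_s / frame_s)
    first = round(start_s / frame_s)
    last = round(end_s / frame_s)
    if not 0 <= first <= last <= n_frames:
        raise ValueError("interval must lie inside the clip")
    return [1 if first <= i < last else 0 for i in range(n_frames)]


# Intervals from the first Moving example on this page:
# R = [7.3 s, 9.2 s], E = [0.4 s, 2.3 s]
ref_mask = interval_to_mask(7.3, 9.2)
edit_mask = interval_to_mask(0.4, 2.3)
```

For adding, removing, and replacing, R and E coincide, so the two masks would be identical. For moving and extending they differ, which is where a separate reference interval becomes useful.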

Task: Adding

Insert a new sound event into a specified time interval of the mixture.

Example 1

Input Audio

Input mel-spectrogram

Instruction

"Add a electric guitar event between 1.6s and 8.5s."

R = [1.6 s, 8.5 s] E = [1.6 s, 8.5 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram
AUDIT
AUDIT spectrogram

Example 2

Input Audio

Input mel-spectrogram

Instruction

"Add snare drum, start at 4.9s and end at 5.3s."

R = [4.9 s, 5.3 s] E = [4.9 s, 5.3 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram
AUDIT
AUDIT spectrogram

Task: Removing

Remove a specific event instance from the mixture at the given time interval.

Example 1

Input Audio

Input mel-spectrogram

Instruction

"Remove alarm from 6.7s to 9.6s."

R = [6.7 s, 9.6 s] E = [6.7 s, 9.6 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram
AUDIT
AUDIT spectrogram

Example 2

Input Audio

Input mel-spectrogram

Instruction

"Remove trumpet appearing between 0.3s and 3.0s."

R = [0.3 s, 3.0 s] E = [0.3 s, 3.0 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram
AUDIT
AUDIT spectrogram

Task: Replacing

Replace a specific event instance with a different sound class at the same time interval.

Example 1

Input Audio

Input mel-spectrogram

Instruction

"Replace bass guitar with zipper (clothing) from 6.3s to 10.0s."

R = [6.3 s, 10.0 s] E = [6.3 s, 10.0 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram
AUDIT
AUDIT spectrogram

Example 2

Input Audio

Input mel-spectrogram

Instruction

"Replace vehicle horn, car horn & honking by piano between 4.7s and 6.1s."

R = [4.7 s, 6.1 s] E = [4.7 s, 6.1 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram
AUDIT
AUDIT spectrogram

Task: Moving

Relocate a specific event instance from the reference interval to a different edit interval, preserving its acoustic identity and the surrounding background.
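As a rough illustration of the (instruction, input audio, target audio) triplets mentioned in the abstract, the sketch below builds a moving triplet from a background track and an isolated event. It assumes that both are available as separate waveforms and that mixing is a plain sum. The helper names place and make_move_triplet, the 16 kHz sample rate, and the synthetic signals are hypothetical and are not taken from the EMKR pipeline.

```python
import numpy as np

SR = 16_000  # assumed sample rate, for illustration only


def place(event, start_s, total_len, sr=SR):
    """Return a silent track of total_len samples with `event` starting at start_s."""
    track = np.zeros(total_len, dtype=np.float32)
    start = int(round(start_s * sr))
    end = min(start + len(event), total_len)
    track[start:end] = event[: end - start]
    return track


def make_move_triplet(background, event, label, ref_start, edit_start, sr=SR):
    """Build (instruction, input audio, target audio) for a move edit."""
    dur = len(event) / sr
    n = len(background)
    # Input: event at the reference interval. Target: same event at the edit interval.
    input_audio = background + place(event, ref_start, n, sr)
    target_audio = background + place(event, edit_start, n, sr)
    instruction = (
        f"Move {label} from {ref_start:.1f}s–{ref_start + dur:.1f}s "
        f"to {edit_start:.1f}s–{edit_start + dur:.1f}s."
    )
    return instruction, input_audio, target_audio


# Synthetic stand-ins: low-level noise as background, a 1.9 s tone as the event.
rng = np.random.default_rng(0)
background = 0.01 * rng.standard_normal(10 * SR).astype(np.float32)
t = np.arange(round(1.9 * SR)) / SR
event = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)

instruction, x, y = make_move_triplet(background, event, "screaming", 7.3, 0.4)
```

Because the input and target share the same background and the same event waveform, they differ only inside the reference and edit intervals, which matches the goal of preserving the event's acoustic identity and the surrounding background.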

Example 1

Input Audio

Input mel-spectrogram

Instruction

"Move screaming from 7.3s–9.2s to 0.4s–2.3s."

R = [7.3 s, 9.2 s] E = [0.4 s, 2.3 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram

Example 2

Input Audio

Input mel-spectrogram

Instruction

"Move alarm from 7.2s–10.0s to 0.3s–3.1s."

R = [7.2 s, 10.0 s] E = [0.3 s, 3.1 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram

Task: Extending

Extend an existing event instance to cover a longer time interval while preserving the original segment and background.
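The relationship between the input and the target in an extend edit can be sketched as follows, assuming a longer recording of the event is available that spans the whole edit interval E. The input keeps only the portion inside the reference interval R, while the target keeps all of it. The function make_extend_pair, the 16 kHz sample rate, and the synthetic signals are hypothetical and serve only to illustrate the task definition above.

```python
import numpy as np

SR = 16_000  # assumed sample rate, for illustration only


def make_extend_pair(background, long_event, ref, edit, sr=SR):
    """Return (input_audio, target_audio) for an extend edit.

    long_event must span the edit interval exactly; ref must lie inside edit.
    """
    r0, r1 = (int(round(t * sr)) for t in ref)
    e0, e1 = (int(round(t * sr)) for t in edit)
    if not (e0 <= r0 < r1 <= e1):
        raise ValueError("reference interval must lie inside the edit interval")
    if len(long_event) != e1 - e0:
        raise ValueError("long_event must cover the edit interval exactly")
    full = np.zeros_like(background)
    full[e0:e1] = long_event
    partial = np.zeros_like(background)
    partial[r0:r1] = full[r0:r1]  # original segment, identical in input and target
    return background + partial, background + full


# Intervals from Example 1 below: R = [2.2 s, 4.1 s], E = [1.1 s, 4.9 s]
ref, edit = (2.2, 4.1), (1.1, 4.9)
rng = np.random.default_rng(0)
background = 0.01 * rng.standard_normal(10 * SR).astype(np.float32)
n_event = round(edit[1] * SR) - round(edit[0] * SR)
long_event = 0.5 * np.sin(2 * np.pi * 440 * np.arange(n_event) / SR).astype(np.float32)

x, y = make_extend_pair(background, long_event, ref, edit)
```

The hard cut at the edges of R in this sketch would produce audible clicks on real audio, so a practical pipeline would likely apply short fades. That detail is omitted here to keep the example small.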

Example 1

Input Audio

Input mel-spectrogram

Instruction

"Extend trumpet in 2.2s–4.1s to 1.1s–4.9s."

R = [2.2 s, 4.1 s] E = [1.1 s, 4.9 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram

Example 2

Input Audio

Input mel-spectrogram

Instruction

"Extend piano in 7.0s–8.8s to 6.5s–10.0s."

R = [7.0 s, 8.8 s] E = [6.5 s, 10.0 s]

Output Comparison

Ground Truth
Ground Truth spectrogram
EMKR (Ours)
EMKR spectrogram
SAO-Instruct
SAO-Instruct spectrogram