EMKR: Edit the Moment, Keep the Rest

Abstract

Audio editing aims to modify sound content while preserving its acoustic context. However, few studies have addressed temporal audio editing with precise control over event timing. We propose Edit the Moment, Keep the Rest (EMKR), an instruction-driven framework built on Stable Audio Open that supports five operations: adding, removing, replacing, moving, and extending sound events. EMKR is trained with a temporal editing pipeline that constructs (instruction, input audio, target audio) triplets, where the instruction specifies when to edit. It further incorporates an explicit edit interval for precise timing and a source-event mask that preserves the original event instance during moving and extending. Experiments on mixtures from real-world datasets show that EMKR performs time-localized editing of polyphonic audio with 100 ms precision while preserving non-edited regions. This temporal control makes timing-critical editing tasks such as moving and extending possible.

1. Model Architecture & Key Ideas

1.1. Overview

Figure 1. Overall architecture of the proposed EMKR framework.

1.2. Key Ideas

Temporal Triplets: Training data is constructed as (instruction, input audio, target audio) triplets on the fly, where each instruction specifies what to edit and when.
Interval-based Timing Conditioning: A reference interval R = [r_s, r_e] identifies the target event instance, and an edit interval E = [e_s, e_e] specifies where the change is applied. Four timing embeddings are injected into the DiT via global addition and cross-attention.
Source-Event Masking: For moving and extending, a binary mask highlights the reference interval in the latent space, preserving the source event's acoustic identity while the model relocates or extends it.
Five Editing Operations: Adding, Removing, Replacing, Moving, and Extending — all with instance-level, time-localized control in polyphonic mixtures.
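The interval conditioning and source-event mask described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Interval` class, the function names, and the `frame_rate` parameter (latent frames per second) are all hypothetical, and the example values are chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A time interval in seconds (hypothetical helper type)."""
    start: float
    end: float


def timing_values(ref: Interval, edit: Interval) -> tuple:
    """Collect the four endpoints (r_s, r_e, e_s, e_e) used as timing conditions."""
    return (ref.start, ref.end, edit.start, edit.end)


def source_event_mask(ref: Interval, num_frames: int, frame_rate: float) -> list:
    """Binary mask over latent frames: 1 inside the reference interval, 0 elsewhere.

    frame_rate is an assumed latent frame rate, not a value taken from the paper.
    """
    first = max(0, math.floor(ref.start * frame_rate))
    last = min(num_frames, math.ceil(ref.end * frame_rate))
    return [1 if first <= i < last else 0 for i in range(num_frames)]


# Illustrative moving edit: the event occupies R and should be relocated to E.
R = Interval(2.0, 3.5)
E = Interval(6.0, 7.5)
conditions = timing_values(R, E)
mask = source_event_mask(R, num_frames=200, frame_rate=20.0)
```

In this sketch the mask marks only where the source event currently sits (R); where it should end up is carried separately by the edit interval E.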

2. Audio Editing Demos

Below we present editing examples for each of the five tasks. For each example, we show the input audio with its mel-spectrogram, the editing instruction with its reference and edit intervals, and the ground-truth target alongside outputs from EMKR (ours), SAO-Instruct, and AUDIT.

Task: Adding

Insert a new sound event into a specified time interval of the mixture.
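As a rough illustration of how an adding triplet could be built on the fly, the sketch below mixes an isolated event into a background clip and writes the matching instruction. The function name, its parameters, and the use of plain Python lists for audio are assumptions made for readability; the instruction template mirrors one of the examples on this page, but the actual pipeline may use different templates and mixing rules.

```python
def make_adding_triplet(background, event, label, start_s, sample_rate):
    """Build an (instruction, input audio, target audio) triplet for an adding edit.

    The input is the background alone; the target is the background with the
    event mixed in starting at start_s. All names here are hypothetical.
    """
    offset = round(start_s * sample_rate)
    if offset < 0 or offset + len(event) > len(background):
        raise ValueError("event does not fit inside the background clip")

    target = list(background)
    for i, sample in enumerate(event):
        target[offset + i] += sample  # simple additive mixing

    end_s = (offset + len(event)) / sample_rate
    instruction = f"Add {label}, start at {start_s:.1f}s and end at {end_s:.1f}s."
    return instruction, list(background), target


# Toy example at 10 samples per second: a 0.4 s event placed at 4.9 s.
background = [0.0] * 100
event = [0.5] * 4
instruction, input_audio, target_audio = make_adding_triplet(
    background, event, "snare drum", start_s=4.9, sample_rate=10
)
```

For adding, the reference and edit intervals coincide, since the event exists only in the target.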

Example 1

Input Audio

Instruction

"Add a electric guitar event between 1.6s and 8.5s."

R = [1.6 s, 8.5 s] E = [1.6 s, 8.5 s]

Output Comparison

Ground Truth

EMKR (Ours)

SAO-Instruct

AUDIT

Example 2

Input Audio

Instruction

"Add snare drum, start at 4.9s and end at 5.3s."

R = [4.9 s, 5.3 s] E = [4.9 s, 5.3 s]

Output Comparison

Ground Truth

EMKR (Ours)

SAO-Instruct

AUDIT

Task: Removing

Remove a specific event instance from the mixture at the given time interval.

Example 1

Input Audio

Instruction

"Remove alarm from 6.7s to 9.6s."

R = [6.7 s, 9.6 s] E = [6.7 s, 9.6 s]

Output Comparison

Ground Truth

EMKR (Ours)

SAO-Instruct

AUDIT

Example 2

Input Audio

Instruction

"Remove trumpet appearing between 0.3s and 3.0s."

R = [0.3 s, 3.0 s] E = [0.3 s, 3.0 s]

Output Comparison

Ground Truth

EMKR (Ours)

SAO-Instruct

AUDIT

Task: Replacing

Replace a specific event instance with a different sound class at the same time interval.

Example 1

Input Audio

Instruction

"Replace bass guitar with zipper (clothing) from 6.3s to 10.0s."

R = [6.3 s, 10.0 s] E = [6.3 s, 10.0 s]

Output Comparison

Ground Truth

EMKR (Ours)

SAO-Instruct

AUDIT

Example 2

Input Audio

Instruction

"Replace vehicle horn, car horn & honking by piano between 4.7s and 6.1s."

R = [4.7 s, 6.1 s] E = [4.7 s, 6.1 s]

Output Comparison

Ground Truth

EMKR (Ours)

SAO-Instruct

AUDIT

Task: Moving

Relocate a specific event instance from the reference interval to a different edit interval, preserving its acoustic identity and the surrounding background.
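A moving triplet can be sketched in the same spirit: the same event is mixed into the background at the reference interval for the input and at the edit interval for the target, so everything else stays identical. The helper names, the instruction wording, and the list-based audio representation are hypothetical and serve only to illustrate the idea.

```python
def place_event(background, event, start_s, sample_rate):
    """Return a copy of background with event additively mixed in at start_s."""
    offset = round(start_s * sample_rate)
    if offset < 0 or offset + len(event) > len(background):
        raise ValueError("event does not fit inside the background clip")
    mixed = list(background)
    for i, sample in enumerate(event):
        mixed[offset + i] += sample
    return mixed


def make_moving_triplet(background, event, label, ref_start_s, edit_start_s, sample_rate):
    """Build (instruction, input audio, target audio) for a moving edit.

    The instruction template below is an assumed example, not one from the paper.
    """
    duration_s = len(event) / sample_rate
    input_audio = place_event(background, event, ref_start_s, sample_rate)
    target_audio = place_event(background, event, edit_start_s, sample_rate)
    instruction = (
        f"Move {label} from {ref_start_s:.1f}s-{ref_start_s + duration_s:.1f}s "
        f"to {edit_start_s:.1f}s-{edit_start_s + duration_s:.1f}s."
    )
    return instruction, input_audio, target_audio


# Toy example at 10 samples per second: a 0.5 s event moved from 2.0 s to 6.0 s.
background = [0.0] * 100
event = [1.0] * 5
instruction, input_audio, target_audio = make_moving_triplet(
    background, event, "dog bark", ref_start_s=2.0, edit_start_s=6.0, sample_rate=10
)
```

Because input and target share the same background and the same event waveform, the only difference between them is the event's position, which is what the model is asked to change.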