まえがき
使い方
- 関数の仕様
- サンプルコードとプロット
librosa.phase_vocoderの中身について
- 重要な中間変数を抜粋
- メイン処理
補足：フェーズボコーダを使ったピッチシフトの実現方法
- librosa.phase_vocoderの音質について
参考文献

まえがき

オーディオ編集をしていると、楽器の録音データや楽曲データに、

「ピッチを変えずに、鳴っている時間長を短く/長くしたい」

「時間長を変えずに、ピッチを低く/高くしたい」

という処理をしたくなることが多々あると思います。

それらを実現するには時間伸縮やリサンプリングといった処理が必要ですが、

前者についてはフェーズボコーダという手法が知られています。

Pythonの音楽分析モジュールLibROSAでも、ピッチシフトとタイムストレッチを行う関数 librosa.effects.time_stretchとlibrosa.effects.pitch_shiftのコアとしてフェーズボコーダが利用されています。

本記事では、LibROSAに実装されているフェーズボコーダの関数 librosa.phase_vocoder の使い方と実装方法を紹介します。

使い方

関数の仕様

librosa.phase_vocoder — librosa 0.8.1 documentation

librosa.phase_vocoder(D, rate, hop_length=None)

引数

D:
- 2次元のnumpy配列の、短時間フーリエ変換 (STFT) 行列
  - 1次元目: 周波数
  - 2次元目: 時間フレーム
rate:
- 速度（時間長）を制御する変数
- rate> 0: Dの発音している時間長は短くなります
- rate< 0: Dの発音している時間長は長くなります
hop_length: int > 0
- 位相進み計算用のパラメタ
- STFTのフレームシフトのサンプル数を与える
- デフォルト (hop_length=None)では、(2 * (D.shape[0] - 1)) //4に設定される返り値
D_stretched
- 2次元のnumpy配列の、短時間フーリエ変換 (STFT) 行列
  - 1次元目: 周波数
  - 2次元目: 時間フレーム
    - ただし、長さはD.shape[1] / rateとなる

サンプルコードとプロット

f:id:Kurene:20201007214806p:plain

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt


n_fft = 2048
hop_length = n_fft//4

y, sr   = librosa.load(librosa.ex('trumpet'))
D       = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)

D_fast  = librosa.phase_vocoder(D, 2.0, hop_length=hop_length)
y_fast  = librosa.istft(D_fast, hop_length=hop_length)

D_slow  = librosa.phase_vocoder(D, 1./3, hop_length=hop_length)
y_slow  = librosa.istft(D_slow, hop_length=hop_length)

length = np.array([len(y), len(y_fast), len(y_slow)]).max()

y       = librosa.util.fix_length(y     , length)
y_fast  = librosa.util.fix_length(y_fast, length)
y_slow  = librosa.util.fix_length(y_slow, length)


plt.subplot(3,1,1)
plt.plot(y)
plt.title("Original")
plt.subplot(3,1,2)
plt.plot(y_fast)
plt.title("x2")
plt.subplot(3,1,3)
plt.plot(y_slow)
plt.title("x 1/3")

plt.tight_layout()
plt.show()

librosa.phase_vocoderの中身について

librosa.core.spectrum — librosa 0.8.1 documentation

重要な中間変数を抜粋

# time_steps: 
# 時間長を縮める場合
# np.arange(0, 10, 2.0, dtype=np.float)
# >>> array([0., 2., 4., 6., 8.])
# 時間長を伸ばす場合
# np.arange(0, 10, 0.5, dtype=np.float)
# >>> array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])
time_steps = np.arange(0, D.shape[1], rate, dtype=np.float)

# 位相補正用のベクトル
phi_advance = np.linspace(0, np.pi * hop_length, D.shape[0])

# フェーズアキュムレータ 
# 初期値として時間フレーム先頭のD[:, 0]を代入
phase_acc = np.angle(D[:, 0])

メイン処理

# librosa.phase_vocoderにおける、時間フレームごとの処理

for (t, step) in enumerate(time_steps):
    # int(step):int(step + 2)であるため、
    # colums.shape[1]は2以上となる (columns[:,2::] は使わない)
    columns = D[:, int(step):int(step + 2)]

    # 現在の時間ステップstepに応じて、線形補間により振幅を修正
    alpha = np.mod(step, 1.0)
    mag = ((1.0 - alpha) * np.abs(columns[:, 0])
               + alpha * np.abs(columns[:, 1]))

    # 修正した振幅成分と位相成分を、出力の配列に代入
    d_stretch[:, t] = mag * np.exp(1.j * phase_acc)

    # 位相進みを計算
    dphase = (np.angle(columns[:, 1])
                  - np.angle(columns[:, 0])
                  - phi_advance)

    # 位相ラッピング（-pi:pi の範囲に変換）
    dphase = dphase - 2.0 * np.pi * np.round(dphase / (2.0 * np.pi))

    # フェーズアキュムレータ（位相進みの累積）
    # 次の時間ステップでの位相として利用
    phase_acc += phi_advance + dphase

return d_stretch

基本的に、

f:id:Kurene:20201007193237p:plain

を計算することで時間伸縮後のSTFT行列を算出します。

実際は離散信号を扱うので、振幅magと位相phase_accは前後の時間フレームの重み付き平均や差分に基づいて計算しています。

phi_advanceについては以下をご参照ください。

A Phase Vocoder in Matlab

補足：フェーズボコーダを使ったピッチシフトの実現方法

librosa.effects — librosa 0.8.1 documentation

def pitch_shift(y, sr, n_steps, bins_per_octave=12):
    rate = 2.0 ** (-float(n_steps) / bins_per_octave)

    # Stretch in time, then resample
    # librosa.effects.time_stretch(y, rate)
    y_shift = core.resample(time_stretch(y, rate), float(sr) / rate, sr)

    # Crop to the same dimension as the input
    return util.fix_length(y_shift, len(y))

def time_stretch(y, rate):

    # Construct the stft
    stft = core.stft(y)

    # Stretch by phase vocoding
    stft_stretch = core.phase_vocoder(stft, rate)

    # Invert the stft
    y_stretch = core.istft(stft_stretch, dtype=y.dtype)

    return y_stretch

以上より、例えば音高を2倍するには、