Improve tempo/pitch change algorithms #1524
Replies: 7 comments 7 replies
-
SoundTouch should easily be fast enough for real-time use. On my budget i7 laptop it takes about 10 seconds to process 1 hour of stereo audio, and a little over 2 minutes for 8 stereo tracks (44.1 kHz, 1 hour duration). The processing time for multiple channels could be greatly improved with parallel processing. Other apps already use libsoundtouch for real-time processing.
Input/output latency is said to be about 100 ms, which is a bit slow for a "live" effect (real-time on the recording input), but should not be a problem when applied after recording, provided that the app (Audacity) handles the latency to maintain playback synchronization. SBSMS "may" be quick enough for real-time processing, depending on the capabilities of the computer it's running on; on my machine, 1 hour of stereo (44.1 kHz) takes a little under 2 minutes.
The main weakness of SoundTouch is that percussive sounds / transients tend to echo. This can often be improved by manually tweaking the algorithm parameters (not currently supported in Audacity). For sounds that do not have abrupt transients, SoundTouch can sometimes sound better than SBSMS; for example, simple generated tones will often have lower "noise" with SoundTouch.
The other main weakness of SoundTouch is that the output length may not exactly match the input length (because the input length is not an exact multiple of the sequence length / overlap). Currently, Audacity pads the end with silence if necessary, though this could possibly be improved.
Some useful implementation details for SoundTouch are provided on the website: https://www.surina.net/soundtouch/README.html
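On the parameter-tweaking point: libsoundtouch exposes the sequence, seek-window and overlap durations through its `setSetting()` call. A minimal sketch of what exposing those knobs could look like, assuming the stock libsoundtouch headers and the default float sample type (the values and the helper function are illustrative, not tuned recommendations or Audacity code):

```cpp
// Sketch: stretch a stereo 44.1 kHz buffer to +20% tempo while tuning the
// parameters that influence transient echo. Illustrative, not Audacity code.
#include <SoundTouch.h>
#include <vector>

std::vector<float> stretchWithSoundTouch(const std::vector<float>& interleaved)
{
    soundtouch::SoundTouch st;
    st.setSampleRate(44100);
    st.setChannels(2);
    st.setTempoChange(20.0);  // +20% tempo, pitch unchanged

    // The knobs behind the "echoing transients" weakness; shorter sequences
    // usually tighten percussive material. Values here are illustrative.
    st.setSetting(SETTING_SEQUENCE_MS, 40);
    st.setSetting(SETTING_SEEKWINDOW_MS, 15);
    st.setSetting(SETTING_OVERLAP_MS, 8);

    // putSamples()/receiveSamples() count sample *frames*, not floats.
    st.putSamples(interleaved.data(),
                  static_cast<unsigned int>(interleaved.size() / 2));
    st.flush();  // drain the ~100 ms of internal latency

    std::vector<float> out;
    float buf[4096];  // room for 2048 stereo frames
    unsigned int received;
    while ((received = st.receiveSamples(buf, 2048)) != 0)
        out.insert(out.end(), buf, buf + received * 2);
    return out;
}
```

Shortening `SETTING_SEQUENCE_MS` generally tightens percussive material at the cost of extra processing, which is exactly the kind of trade-off a UI could surface.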
-
See a related feature request in Tenacity, asking to use rubberband:
-
I've used Audacity many times, mostly to analyze/manipulate speech signals. I think you've got a solid overall understanding of the issues. There are also nonlinear speech speedup algorithms, which listeners can understand at higher speeds than linear speedup algorithms allow.
I guess I'd have to ask what your primary goals/requirements are for Audacity. E.g., must it be a single algorithm for all use cases, or can you have one for speech and another for music? How efficient must they be? For example, should it avoid floating point and use fixed-point instead? That was important for libsonic to be adopted on battery-powered devices, such as Android devices.
I wrote libsonic specifically to work well at high speeds for voice, since I have very poor vision and in many cases need to listen at high speed. WSOLA falls apart at high speed. For 2X and slower, almost any algorithm works fine for speech, including WSOLA. Audible.com switched to libsonic for high speed on Android (which I can tell by listening to it), and that enabled them to offer up to 3.5X speedup. iOS still uses WSOLA, which is why folks on iOS listen at lower speeds when using Audible.com: it is just too tiring to listen at over about 2.5X. I typically listen to audiobooks at 3.5X speedup, and at work I listen at 4X. Some blind folks (e.g. Sina Bahram) can listen with reasonable comprehension at 7X speedup.
I believe the next big improvement in this space is likely nonlinear speedup: skipping silence and playing some consonants at low speed, while speeding up vowels and other parts of speech that can be sped up without impacting comprehension. Also, it is possible, and probably not very hard, to write a hybrid algorithm that incorporates both pitch-synchronous algorithms and WSOLA-like fixed-frame algorithms: detect when a single voice dominates the signal and use pitch-synchronous processing, and switch to fixed-frame algorithms when multiple voices are present. For <= 2.0X speed, just use WSOLA, since it works for everything.
How important is exact linear speedup in Audacity? Do folks use Audacity when mixing music to stretch or speed up vocals?
I haven't evaluated ESOLA before. I'll check it out to see if it is usable for voice at high speed; that's really the only area where I have anything useful to say anyway.
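For reference, since libsonic comes up here: its C API is tiny. A minimal usage sketch in C++ (the helper name, the 16 kHz rate, and the buffer handling are illustrative assumptions, not an endorsed integration):

```cpp
// Sketch: speeding up mono 16-bit speech with libsonic (sonic.h).
#include <vector>

extern "C" {
#include <sonic.h>  // sonic.h carries its own C++ guards; the wrapper is harmless
}

std::vector<short> speedUpSpeech(const std::vector<short>& mono, int sampleRate,
                                 float speed /* e.g. 3.5f */)
{
    sonicStream stream = sonicCreateStream(sampleRate, 1 /* channel */);
    sonicSetSpeed(stream, speed);

    // const_cast because older sonic.h versions take a non-const pointer.
    sonicWriteShortToStream(stream, const_cast<short*>(mono.data()),
                            static_cast<int>(mono.size()));
    sonicFlushStream(stream);  // push out whatever is buffered internally

    std::vector<short> out(sonicSamplesAvailable(stream));
    sonicReadShortFromStream(stream, out.data(), static_cast<int>(out.size()));
    sonicDestroyStream(stream);
    return out;
}
```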
-
@LWinterberg - next step would be a mockup spec:
We could then put a designer on it. Really cool stuff :)
-
(thought: once we figure out more stuff here, we probably should use that algorithm for play-at-speed as well, the use case being previewing long voiceovers for mistakes)
-
How about increasing the number of significant figures for the pitch/frequency change? Three digits after the decimal point is not enough for conversion to higher frequencies.
-
This particular discussion is no longer relevant; we used the algorithm from StaffPad in https://github.com/audacity/audacity/tree/master/libraries/lib-time-and-pitch, which is a type of phase vocoder.
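For readers wondering what "a type of phase vocoder" implies: the defining step of any phase vocoder is phase propagation across resized hops. Below is the standard textbook formulation, not necessarily the exact scheme lib-time-and-pitch implements. With FFT size $N$, analysis hop $H_a$, synthesis hop $H_s$ (the ratio $H_s/H_a$ sets the stretch factor), and analysis phases $\phi_k$, each bin's instantaneous frequency is estimated from its phase deviation and re-integrated at the synthesis hop:

$$
\omega_k = \frac{2\pi k}{N}, \qquad
\Delta\phi_k = \operatorname{princarg}\bigl(\phi_k(t) - \phi_k(t - H_a) - H_a\,\omega_k\bigr)
$$

$$
\hat{\omega}_k = \omega_k + \frac{\Delta\phi_k}{H_a}, \qquad
\psi_k(t) = \psi_k(t - H_s) + H_s\,\hat{\omega}_k
$$

where $\operatorname{princarg}$ wraps its argument into $(-\pi, \pi]$ and $\psi_k$ are the phases used when resynthesizing the frames at hop $H_s$.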
-
Background
Currently, Audacity uses a Waveform Similarity Overlap-and-Add algorithm (WSOLA, via SoundTouch) for fast tempo/pitch change and SBSMS for "high-quality" pitch/tempo change (which is an order of magnitude and a half slower). Neither supports real-time operation (probably only relevant for pitch change), so they'd need to be re-implemented for the upcoming stackable effects (#992) anyway.
As far as I understand the matter, WSOLA-style algorithms work best for music and leave much to be desired for speech. For speech, Time-Domain Pitch-Synchronous Overlap-and-Add (TD-PSOLA) algorithms work better, with PICOLA being the most widely used one (via sonic, which is what YouTube uses). Better still is ESOLA (Epoch-Synchronous Overlap-and-Add), which according to its introductory study is both significantly faster and better than TD-PSOLA, and Fuzzy ESOLA, which improves upon ESOLA for speech according to its authors. All of these share the same overlap-add core, sketched below; they differ in how the analysis frames are aligned.
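To make the family relationship concrete, here is a minimal overlap-add (OLA) skeleton in C++ (a hypothetical stand-alone sketch, not Audacity code; frame and hop sizes are illustrative). WSOLA, TD-PSOLA and ESOLA all build on this structure and differ only in how the next analysis position is chosen: maximum waveform similarity, pitch periods, or glottal epochs, respectively:

```cpp
// Minimal overlap-add (OLA) time stretch: the shared skeleton of WSOLA,
// TD-PSOLA and ESOLA. Hypothetical sketch; mono float samples assumed.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> olaStretch(const std::vector<float>& in, double stretch,
                              int frame = 1024)
{
    const double kPi = 3.14159265358979323846;
    const int synthesisHop = frame / 2;                         // 50% overlap
    const int analysisHop =
        std::max(1, static_cast<int>(synthesisHop / stretch));  // stretch > 1 => slower

    // Hann window so overlapping frames cross-fade smoothly.
    std::vector<float> win(frame);
    for (int i = 0; i < frame; ++i)
        win[i] = 0.5f - 0.5f * static_cast<float>(std::cos(2.0 * kPi * i / (frame - 1)));

    const size_t numFrames =
        in.size() < static_cast<size_t>(frame)
            ? 0 : (in.size() - frame) / analysisHop + 1;
    const size_t outLen = numFrames ? (numFrames - 1) * synthesisHop + frame : 0;
    std::vector<float> out(outLen, 0.0f);
    std::vector<float> norm(outLen, 1e-9f);  // running window sum per output sample

    for (size_t f = 0; f < numFrames; ++f) {
        // Fixed analysis positions; this is where WSOLA would instead search
        // for the best-matching offset, and (E)PSOLA would snap to periods/epochs.
        const size_t aPos = f * analysisHop;
        const size_t sPos = f * synthesisHop;
        for (int i = 0; i < frame; ++i) {
            out[sPos + i] += in[aPos + i] * win[i];
            norm[sPos + i] += win[i];
        }
    }
    for (size_t i = 0; i < outLen; ++i)
        out[i] /= norm[i];                   // undo the accumulated window gain
    return out;
}
```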
Ideas for the future
- There may be different implementations of the same algorithm, e.g. esolafast for a faster ESOLA implementation than the reference.
- If there's no clear winner (e.g. "WSOLA is best for music, FESOLA is best for speech"), provide a toggle in the app that makes clear what each algorithm is good at ("optimized for music" / "optimized for speech").
Benefits
Using better tempo/pitch algorithms might give Audacity a qualitative edge over commercial applications in this regard. Further, a faster "high-quality" algorithm than SBSMS means less waiting. Real-time pitch change would also be nice.
Question
What algorithm or library to choose? Discuss.