Improve tempo/pitch change algorithms #1524
Replies: 7 comments 7 replies
-
SoundTouch should easily be fast enough for real-time use. On my budget i7 laptop it takes about 10 seconds to process 1 hour of stereo audio, and a little over 2 minutes for 8 stereo tracks (44.1 kHz, 1 hour duration). The processing time for multiple channels could be greatly improved with parallel processing. Other apps already use libsoundtouch for real-time processing.
Input/output latency is said to be about 100 ms, which is a bit slow for a "live" effect (real-time on the recording input), but should not be a problem when applied after recording, provided that the app (Audacity) handles the latency to maintain playback synchronization. SBSMS "may" be quick enough for real-time processing, depending on the capabilities of the computer it's running on; on my machine, 1 hour of stereo (44.1 kHz) takes a little under 2 minutes.
The main weakness of SoundTouch is that percussive sounds / transients tend to echo. This can often be improved by manually tweaking the algorithm parameters (not currently supported in Audacity). For sounds that do not have abrupt transients, SoundTouch can sometimes sound better than SBSMS; for example, simple generated tones will often have lower "noise" with SoundTouch.
The other main weakness of SoundTouch is that the output length may not exactly match the input length (because the input length is not an exact multiple of the sequence length / overlap). Currently, Audacity pads the end with silence if necessary, though this could possibly be improved.
Some useful implementation details for SoundTouch are provided on the website: https://www.surina.net/soundtouch/README.html
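On the parameter-tweaking point: libsoundtouch exposes the sequence, seek-window and overlap durations through its `setSetting()` call. A minimal sketch of what exposing those knobs could look like, assuming the stock libsoundtouch headers and the default float sample type (the values and the helper function are illustrative, not tuned recommendations or Audacity code):

```cpp
// Sketch: stretch a stereo 44.1 kHz buffer to +20% tempo while tuning the
// parameters that influence transient echo. Illustrative, not Audacity code.
#include <SoundTouch.h>
#include <vector>

std::vector<float> stretchWithSoundTouch(const std::vector<float>& interleaved)
{
    soundtouch::SoundTouch st;
    st.setSampleRate(44100);
    st.setChannels(2);
    st.setTempoChange(20.0);  // +20% tempo, pitch unchanged

    // The knobs behind the "echoing transients" weakness; shorter sequences
    // usually tighten percussive material. Values here are illustrative.
    st.setSetting(SETTING_SEQUENCE_MS, 40);
    st.setSetting(SETTING_SEEKWINDOW_MS, 15);
    st.setSetting(SETTING_OVERLAP_MS, 8);

    // putSamples()/receiveSamples() count sample *frames*, not floats.
    st.putSamples(interleaved.data(),
                  static_cast<unsigned int>(interleaved.size() / 2));
    st.flush();  // drain the ~100 ms of internal latency

    std::vector<float> out;
    float buf[4096];  // room for 2048 stereo frames
    unsigned int received;
    while ((received = st.receiveSamples(buf, 2048)) != 0)
        out.insert(out.end(), buf, buf + received * 2);
    return out;
}
```

Shortening `SETTING_SEQUENCE_MS` generally tightens percussive material at the cost of extra processing, which is exactly the kind of trade-off a UI could surface.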
-
See a related feature request in Tenacity, asking to use rubberband:
-
I've used Audacity many times, mostly to analyze/manipulate speech signals. I think you've got a solid overall understanding of the issues. There are also nonlinear speech speedup algorithms, which listeners can understand at higher speeds than linear speedup algorithms allow.
I guess I'd have to ask what your primary goals/requirements are for Audacity. E.g., must it be a single algorithm for all use cases, or can you have one for speech and another for music? How efficient must they be? For example, should it avoid floating point and use fixed-point instead? That was important for libsonic to be adopted on battery-powered devices, such as Android devices.
I wrote libsonic specifically to work well at high speeds for voice, since I have very poor vision and in many cases need to listen at high speed. WSOLA falls apart at high speed. For 2X and slower, almost any algorithm works fine for speech, including WSOLA. Audible.com switched to libsonic for high speed on Android (which I can tell by listening to it), and that enabled them to offer up to 3.5X speedup. iOS still uses WSOLA, which is why folks on iOS listen at lower speeds when using Audible.com: it is just too tiring to listen at over about 2.5X. I typically listen to audiobooks at 3.5X speedup, and at work I listen at 4X. Some blind folks (e.g. Sina Bahram) can listen with reasonable comprehension at 7X speedup.
I believe the next big improvement in this space is likely nonlinear speedup: skipping silence and playing some consonants at low speed, while speeding up vowels and other parts of speech that can be sped up without impacting comprehension. Also, it is possible, and probably not very hard, to write a hybrid algorithm that incorporates both pitch-synchronous algorithms and WSOLA-like fixed-frame algorithms: detect when a single voice dominates the signal and use pitch-synchronous processing, and switch to fixed-frame algorithms when multiple voices are present. For <= 2.0X speed, just use WSOLA, since it works for everything.
How important is exact linear speedup in Audacity? Do folks use Audacity when mixing music to stretch or speed up vocals?
I haven't evaluated ESOLA before. I'll check it out to see if it is usable for voice at high speed; that's really the only area where I have anything useful to say anyway.
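For reference, since libsonic comes up here: its C API is tiny. A minimal usage sketch in C++ (the helper name, the 16 kHz rate, and the buffer handling are illustrative assumptions, not an endorsed integration):

```cpp
// Sketch: speeding up mono 16-bit speech with libsonic (sonic.h).
#include <vector>

extern "C" {
#include <sonic.h>  // sonic.h carries its own C++ guards; the wrapper is harmless
}

std::vector<short> speedUpSpeech(const std::vector<short>& mono, int sampleRate,
                                 float speed /* e.g. 3.5f */)
{
    sonicStream stream = sonicCreateStream(sampleRate, 1 /* channel */);
    sonicSetSpeed(stream, speed);

    // const_cast because older sonic.h versions take a non-const pointer.
    sonicWriteShortToStream(stream, const_cast<short*>(mono.data()),
                            static_cast<int>(mono.size()));
    sonicFlushStream(stream);  // push out whatever is buffered internally

    std::vector<short> out(sonicSamplesAvailable(stream));
    sonicReadShortFromStream(stream, out.data(), static_cast<int>(out.size()));
    sonicDestroyStream(stream);
    return out;
}
```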
-
@LWinterberg - next step would be a mockup spec:
We could then put a designer on it. Really cool stuff :)
-
(thought: once we figure out more stuff here, we probably should use that algorithm for play-at-speed as well, the use case being previewing long voiceovers for mistakes)
-
How about increasing the number of significant figures for the pitch/frequency change? Three digits after the decimal point is not enough for conversion to higher frequencies.
-
This particular discussion is no longer relevant; we used the algorithm from StaffPad in https://github.com/audacity/audacity/tree/master/libraries/lib-time-and-pitch, which is a type of phase vocoder.
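For readers wondering what "a type of phase vocoder" implies: the defining step of any phase vocoder is phase propagation across resized hops. Below is the standard textbook formulation, not necessarily the exact scheme lib-time-and-pitch implements. With FFT size $N$, analysis hop $H_a$, synthesis hop $H_s$ (the ratio $H_s/H_a$ sets the stretch factor), and analysis phases $\phi_k$, each bin's instantaneous frequency is estimated from its phase deviation and re-integrated at the synthesis hop:

$$
\omega_k = \frac{2\pi k}{N}, \qquad
\Delta\phi_k = \operatorname{princarg}\bigl(\phi_k(t) - \phi_k(t - H_a) - H_a\,\omega_k\bigr)
$$

$$
\hat{\omega}_k = \omega_k + \frac{\Delta\phi_k}{H_a}, \qquad
\psi_k(t) = \psi_k(t - H_s) + H_s\,\hat{\omega}_k
$$

where $\operatorname{princarg}$ wraps its argument into $(-\pi, \pi]$ and $\psi_k$ are the phases used when resynthesizing the frames at hop $H_s$.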
-
Background
Currently, Audacity uses a Waveform Similarity Overlap-and-Add algorithm (WSOLA, via SoundTouch) for fast tempo/pitch change and SBSMS for "high-quality" pitch/tempo change (which is an order of magnitude and a half slower). Neither supports real-time operation (probably only relevant for pitch change), so they'd need to be re-implemented for the upcoming stackable effects (#992) anyway.
As far as I understand the matter, WSOLA-style algorithms work best for music and leave much to be desired for speech. For speech, Time-Domain Pitch-Synchronous Overlap-and-Add (TD-PSOLA) algorithms work better, with PICOLA being the most widely used one (via sonic, which is what YouTube uses). Better still is ESOLA (Epoch-Synchronous Overlap-and-Add), which according to its introductory study is both significantly faster and better than TD-PSOLA, and Fuzzy ESOLA, which improves upon ESOLA for speech according to its authors. All of these share the same overlap-add core, sketched below; they differ in how the analysis frames are aligned.
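To make the family relationship concrete, here is a minimal overlap-add (OLA) skeleton in C++ (a hypothetical stand-alone sketch, not Audacity code; frame and hop sizes are illustrative). WSOLA, TD-PSOLA and ESOLA all build on this structure and differ only in how the next analysis position is chosen: maximum waveform similarity, pitch periods, or glottal epochs, respectively:

```cpp
// Minimal overlap-add (OLA) time stretch: the shared skeleton of WSOLA,
// TD-PSOLA and ESOLA. Hypothetical sketch; mono float samples assumed.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> olaStretch(const std::vector<float>& in, double stretch,
                              int frame = 1024)
{
    const double kPi = 3.14159265358979323846;
    const int synthesisHop = frame / 2;                         // 50% overlap
    const int analysisHop =
        std::max(1, static_cast<int>(synthesisHop / stretch));  // stretch > 1 => slower

    // Hann window so overlapping frames cross-fade smoothly.
    std::vector<float> win(frame);
    for (int i = 0; i < frame; ++i)
        win[i] = 0.5f - 0.5f * static_cast<float>(std::cos(2.0 * kPi * i / (frame - 1)));

    const size_t numFrames =
        in.size() < static_cast<size_t>(frame)
            ? 0 : (in.size() - frame) / analysisHop + 1;
    const size_t outLen = numFrames ? (numFrames - 1) * synthesisHop + frame : 0;
    std::vector<float> out(outLen, 0.0f);
    std::vector<float> norm(outLen, 1e-9f);  // running window sum per output sample

    for (size_t f = 0; f < numFrames; ++f) {
        // Fixed analysis positions; this is where WSOLA would instead search
        // for the best-matching offset, and (E)PSOLA would snap to periods/epochs.
        const size_t aPos = f * analysisHop;
        const size_t sPos = f * synthesisHop;
        for (int i = 0; i < frame; ++i) {
            out[sPos + i] += in[aPos + i] * win[i];
            norm[sPos + i] += win[i];
        }
    }
    for (size_t i = 0; i < outLen; ++i)
        out[i] /= norm[i];                   // undo the accumulated window gain
    return out;
}
```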
Ideas for the future
- There may be different implementations of the same algorithm, e.g. esolafast for a faster ESOLA implementation than the reference.
- If there's no clear winner (e.g. "WSOLA is best for music, FESOLA is best for speech"), provide a toggle in the app that makes clear what each algorithm is good at ("optimized for music" / "optimized for speech").
Benefits
Using better tempo/pitch algorithms might give Audacity a qualitative edge over commercial applications in this regard. Further, a faster "high-quality" algorithm than SBSMS means less waiting. Real-time pitch change would also be nice.
Question
What algorithm or library to choose? Discuss.