Sing along solutions – automatic lyric transcription for the music streaming age

Emir Demirel

PhD researcher

“Excuse me while I kiss this guy...” Jimi Hendrix’s lyrics to Purple Haze are frequently misheard, and they raise the question: if the lyrics of a song are not written down, how are we to know what they are? PhD student Emir Demirel, working with his supervisor Professor Simon Dixon, set out to solve this problem.

Around ninety percent of recorded songs do not have transcribed lyrics. Lyrics help listeners to understand and appreciate music. Accurate lyric transcription also helps composers to create sheet music of their songs from an audio recording.

Speech recognition software is widely used and transcribes the spoken word with a high degree of accuracy, but it performs poorly when tested on sung words. Can a high-quality automatic lyrics transcription (ALT) system be developed, one that creates a text file as close as possible to the original words of the song, or to those captured by a human transcriber?

How does current speech recognition technology work?

There are two main approaches to recognising human speech and transcribing it:

Deep neural network - hidden Markov model (DNN-HMM)

This system recognises phonemes, the basic building blocks of speech. All words consist of sequences of phonemes. The system “listens” to an audio file and maps it against the phonemes it knows, creating a number of phoneme possibilities for each sound. It then uses a pronunciation model to map these phoneme sequences against the words it knows and their probabilities.

Once all the probabilities are in place, the program creates a graph with all the possible word alternatives. This is then decoded using an algorithm such as beam search, which chooses the most likely next steps in a sequence.
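
As a rough illustration of the decoding step, here is a minimal beam search sketch in Python. It is not the decoder used in the research. The word candidates, the acoustic probabilities and the word-pair (bigram) probabilities are all invented for this example, and a real system would search a far larger graph.

```python
import math

# Hypothetical acoustic scores: for each position in the line,
# candidate words and how well each matches the audio.
STEPS = [
    {"kiss": 1.0},
    {"the": 0.4, "this": 0.6},
    {"sky": 0.5, "guy": 0.5},
]

# Hypothetical language model: probability of a word given the previous word.
BIGRAM = {
    ("<s>", "kiss"): 1.0,
    ("kiss", "the"): 0.5,
    ("kiss", "this"): 0.5,
    ("the", "sky"): 0.9,
    ("the", "guy"): 0.1,
    ("this", "sky"): 0.1,
    ("this", "guy"): 0.4,
}


def beam_search(steps, bigram, beam_width=2):
    """Return the best word sequences as (words, log-probability) pairs."""
    beams = [(["<s>"], 0.0)]  # start-of-line marker, log-probability 0
    for candidates in steps:
        expanded = []
        for words, score in beams:
            for word, acoustic_prob in candidates.items():
                # Unseen word pairs get a small fallback probability.
                lm_prob = bigram.get((words[-1], word), 0.01)
                new_score = score + math.log(acoustic_prob) + math.log(lm_prob)
                expanded.append((words + [word], new_score))
        # Keep only the highest-scoring partial sequences.
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    # Drop the start marker before returning.
    return [(words[1:], score) for words, score in beams]


wide = beam_search(STEPS, BIGRAM, beam_width=2)
narrow = beam_search(STEPS, BIGRAM, beam_width=1)
print(" ".join(wide[0][0]))
print(" ".join(narrow[0][0]))
```

With these made-up numbers, a beam width of 2 recovers “kiss the sky”. A beam width of 1 commits to “this” too early and ends on “kiss this guy”, which shows why the decoder keeps several alternatives alive at each step.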

End-to-end model

This approach, in contrast, uses a single model built entirely from neural networks. A neural network is a computer system inspired by the human brain and nervous system. This allows researchers to model the type of information that is processed in human perception and cognition systems. End-to-end models consist purely of neural networks and can learn to process information given an objective and a sufficient amount of training data. In the context of speech recognition, they don’t require a pronunciation model, so they don’t need prior human or linguistic expertise.

What issues did the researchers face?

Demirel’s research received funding from Horizon 2020, the EU's funding programme for research and innovation. He set out to see how best to adapt state-of-the-art automatic speech recognition software to recognise singing data. He wanted to identify how sung lyrics differ from spoken words in terms of pronunciation. He would then use this knowledge to improve word recognition. His research focused on building a robust system that can operate well in varying acoustic conditions, such as different music styles, instruments or recording conditions.

There are existing lyrics transcription models, but the researchers noted a significant drop in effectiveness when monophonic (single-voice, a cappella) models are used to transcribe polyphonic recordings, where the singing is accompanied, and vice versa. Would it be possible to construct a single model that worked across all possible domains?

Adapting speech recognition software to singing data

On their first attempt, Demirel and his co-researchers used a state-of-the-art DNN-HMM speech recognition framework, trained on Stanford University’s Digital Archive of Mobile Performances (DAMP) dataset. This is the benchmark collection of monophonic recordings used in the study of lyrics transcription.

They added their own novel acoustic model and used neural networks to build language models. With these adaptations, they were able to reduce the word error rate, the standard measure of transcription mistakes, by up to fifteen percent.
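
For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the system's output into the reference text, divided by the number of words in the reference. The Python sketch below shows that standard calculation. It is an illustration only, not the evaluation code used in the study, and the example sentences are chosen for this article.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "excuse me while I kiss the sky",
    "excuse me while I kiss this guy",
)
print(f"{wer:.1%}")
```

Here two of the seven reference words are wrong, so the word error rate is 2/7, or about 28.6 percent.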

How do sung words differ from spoken?

Unsurprisingly, when singing, people form longer vowel sounds. They also sometimes omit the final consonant in a word. Demirel embedded these observations into a pronunciation dictionary and created a singer-adapted lexicon which can be used specifically for lyrics transcription. This again led to a marginal but consistent improvement in the word error rate.
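
To make the idea concrete, the sketch below shows one way such observations could be turned into extra dictionary entries. It is a simplified guess at the principle, not Demirel's actual lexicon. The phoneme symbols follow the ARPAbet convention, and the two rules (repeat each vowel, drop a final consonant) and the tiny word list are assumptions made for this example.

```python
# ARPAbet-style vowel symbols (stress markers omitted for simplicity).
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def sung_variants(phonemes):
    """Return the spoken pronunciation plus hypothetical sung variants."""
    spoken = list(phonemes)
    variants = [spoken]

    # Rule 1 (assumed): model a lengthened vowel by repeating it.
    lengthened = []
    for phoneme in spoken:
        lengthened.append(phoneme)
        if phoneme in VOWELS:
            lengthened.append(phoneme)
    if lengthened != spoken:
        variants.append(lengthened)

    # Rule 2 (assumed): allow the final consonant to be omitted.
    if len(spoken) > 1 and spoken[-1] not in VOWELS:
        variants.append(spoken[:-1])

    return variants


# Toy spoken-word lexicon, invented for this example.
lexicon = {
    "kiss": ["K", "IH", "S"],
    "sky": ["S", "K", "AY"],
}
adapted = {word: sung_variants(phones) for word, phones in lexicon.items()}

for word, pronunciations in adapted.items():
    for phones in pronunciations:
        print(word, " ".join(phones))
```

Each word keeps its spoken pronunciation and gains alternatives, so a recogniser consulting the adapted dictionary can still match “kiss” when the vowel is held or the final “s” is dropped.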

A new system for transcribing lyrics

The research led to the development of a new neural network architecture, Multistreaming Acoustic Modelling for Automatic Lyrics Transcription (MSTRE-Net), an innovative system that attempts to model the way the human ear processes auditory information.

Previously, transcription models were trained either on a cappella singing or on polyphonic recordings, each using a particular data set: either the DAMP dataset mentioned above, or the DALI dataset from IRCAM in Paris, which contains more than 5,000 polyphonic recordings. Demirel merged the two data sets and used them to train a single cross-domain model.

The researchers also taught the model to recognise silent sections and instrumental sections (where only non-vocal music is playing) in order to improve lyrics transcription performance.

Lyrics transcription visualised, showing lyrics from a Coldplay song

What has MSTRE-Net achieved?

Demirel and his co-researchers have presented five publications at peer-reviewed conferences. Their open-source software is available on GitHub.

The software has also been turned into the first commercial application of lyrics transcription technology. Developed in collaboration with Doremir, Scorecloud Songwriter allows composers to sing and play into a single microphone and get a lead sheet with lyrics and chords.

The technology is also used in music practice software such as Moises App, and it can be used to transcribe lyrics for the thousands of songs that are released each week.

What’s next?

Demirel hopes to make further use of rhyming information and to extend the system to lyrics in languages other than English. He will also be challenging the system with more complex cases, such as the “brutal” lyrics in death metal, opera singing and custom words.
