Rencon: a "Turing Test for musical expression" (by Olivia Solon)

Dubbed a "Turing Test for music", Rencon is a competition that pits different computer systems against each other in a battle of musical expression.



At the end of July, the Speech, Music and Hearing Centre at Stockholm's KTH Royal Institute of Technology played host to an extraordinary music competition. A small audience -- which included one of Sweden's most highly acclaimed pianists, Lucia Negro, and the head of Sweden's Royal College of Music, Stefan Bojsten -- listened to 12 different performances played on a grand piano. They were listening out for technical control, musicality, expressive variations and one other attribute: humanness.

This is because each of the performances was generated by a computer system and played by a laptop linked up to a Disklavier -- an electromechanically controlled grand piano that can operate the keys and pedals without a human operator.

These ghostly performances were in aid of an annual competition called Rencon, which pits different computer systems against each other in a battle of musical expression. It is considered to be a musical Turing Test of sorts; the aim is to create a system that can play music in a manner that is indistinguishable from a human. The ultimate goal is to have a machine win the Chopin Competition, one of the most respected accolades in classical music.

Rencon [Musical Performance RENdering CONtest] was launched in July 2002, as a workshop run by three different research groups at an international conference in Kyoto, Japan. In a bid to advance the field more rapidly, Rencon was repositioned as an international competition in the same vein as RoboCup. A decade on, there are now around 10 research groups working in the field who regularly compete against each other. This year's event was organised by Anders Friberg, along with Roberto Bresin and colleagues from the KTH Sound and Music Computing Group.

Teams take very different approaches to the software that they develop. In simple terms, each system takes the original score, reads all of the symbolic notations -- the grammar -- and then interprets it to produce an expressive variation, emphasising certain parts more than others.

"You follow the grammar when you read text, stopping at the end of a phrase, breathing and then starting the next phrase. Otherwise you don't have speech. But then you can say the same word to communicate different emotions. It depends on how you emphasise or pronounce the word. The same applies to music," explains Roberto Bresin from the KTH Sound and Music Computing Group. "On top of that you can change the performance style. Performance style follows a grammar but you can exaggerate or dampen some acoustic tonalities."

"If you take a musical score and play it exactly as it's written -- without changing the duration or the loudness of the notes -- then this is a deadpan performance," says Erica Bisesi, a senior postdoctoral researcher at the University of Graz, adding that the resulting performance would sound like a mobile phone ringtone.

Top-down or bottom-up?
There are, broadly, two different approaches to the same problem. The first is the grammatical approach, whereby you teach a computer first of all how to read the melody and its associated notations, and then use certain parameters to influence the tempo and dynamics of the piece, with a view to mimicking some of the creative flourishes that we see in human piano performances.

The second approach involves machine learning -- heavy statistical analysis of a database of previous human musical performances and their scores, which is used to train the system. The idea is that you can then input a new piece of music and it will draw on elements from previous human performances. There are, of course, also hybrid systems.

Then there are fully automatic systems -- where you feed in the score and it spits out a performance -- and human-conducted systems, where the computer follows the rhythm set by a person. In Rencon, these two sorts of systems are divided into different categories.

Teams that enter Rencon are given two pieces of music about an hour before they -- or rather their systems' output -- go on stage. In July's competition, these were Sonata K.466 by Domenico Scarlatti and Prelude No.3 by Nino Rota.

Once given the MIDI files of the deadpan performances, the teams can then fine-tune their systems to make sure they are optimised for the appropriate style of music -- each system might, for example, have a different set of rules for baroque, classical, romantic or modern pieces. They then hand a new MIDI file back to the organisers, and this is played on the Disklavier in front of an audience.

Erica Bisesi and her colleagues Richard Parncutt and Anders Friberg -- from the University of Graz and KTH -- took a grammatical, bottom-up approach to the challenge. The system is based on a previous formulation called Director Musices, a performance rendering system (developed at KTH) that introduces expressive deviations to input scores. The team starts by introducing phrases to the score -- taking big phrases and dividing them into sub-phrases and sub-sub-phrases. These phrases can be modelled by varying the tempo and the dynamics with musical features such as ritardando (slowing down), accelerando (speeding up), diminuendo (getting quieter) and crescendo (getting louder).

"We extended the previous formulation by relating expressive features of a performance not only to global or intermediate structural properties, but also accounting for local events. Previously, once you segmented a piece and applied the rules to all of the phrases, they looked the same along the duration of the piece. Now each phrase can be rendered differently," explains Bisesi.

In order to achieve this, Bisesi and colleagues introduced accents to the model. Accents are, broadly, local events that tend to capture the attention of listeners. These could be metric, melodic or harmonic events -- anything that captures attention. A melodic accent, for example, occurs at the highest and lowest tones of a melody and at local peaks and valleys, while a harmonic accent is related to roughness and harmonic ambiguity -- an unexpected chord or cadence, say -- or to harmonic familiarity or expectedness. The team's system can identify these accents automatically, assign a salience to each one, and model the volume and tempo around these events as a curve.
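One of those accent types is simple enough to sketch. The snippet below is a hedged toy version of just the melodic case -- the team's real detector also handles metric and harmonic accents and assigns a salience to each -- flagging local peaks and valleys in a pitch contour:

```python
def melodic_accents(pitches):
    """Return indices of local peaks and valleys in a MIDI-pitch contour.

    A note is flagged as a melodic accent when it is strictly higher or
    strictly lower than both of its neighbours.
    """
    accents = []
    for i in range(1, len(pitches) - 1):
        prev, cur, nxt = pitches[i - 1], pitches[i], pitches[i + 1]
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            accents.append(i)
    return accents

melody = [60, 64, 67, 65, 62, 60, 65, 64]   # invented contour for illustration
print(melodic_accents(melody))               # a peak, a valley, then a peak
```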

The curves are designed to mimic human movement. When people move their hand, be it to pick up a cup or to press onto piano keys, they tend to accelerate and then decelerate in a fluid manner. Were you to plot the acceleration on a graph, it would be parabolic. It is this biological motion that the Rencon competitors must mimic if the music is to avoid sounding robotic.
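A minimal sketch of that parabolic shaping, assuming a simple quadratic scaling applied around an event -- the 15 per cent depth is an invented, illustrative figure, not a value from any team's system:

```python
def parabolic_curve(n, depth=0.15):
    """Return n scaling factors forming a parabola: 1.0 at the edges,
    peaking at 1.0 + depth in the middle.

    Applied to note loudness or local tempo, this gives a smooth
    rise-and-fall rather than an abrupt jump, mimicking the
    accelerate-then-decelerate profile of human movement.
    """
    factors = []
    for i in range(n):
        x = 2.0 * i / (n - 1) - 1.0      # map index i onto [-1, 1]
        factors.append(1.0 + depth * (1.0 - x * x))
    return factors

print([round(f, 3) for f in parabolic_curve(5)])
```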

Bisesi explains: "I am a pianist. When I play the piano and am playing an ascending melody and want to increase the sound from the start to the end, I have to press with more weight on the key. The shape of this exertion is mapped to a movement. Our perception is comfortable with this movement."

Listeners -- at least musically literate ones -- can "hear" these shapes. This is precisely why human-conducted systems generally sound better: the machines don't have to create these shapes, they can simply follow those of a real person.

The winning system in the human-conducted category was VirtualPhilharmony, developed by Takashi Baba from Kwansei Gakuin University along with colleagues at Soai University. It involved using a theremin as a sensor to capture the movements of a conductor. These movements were then used to control the tempo and dynamics of the music. The team that scored the highest in the automatic category (not conducted by a human) came from the Nagoya Institute of Technology in Japan. Their system took a similar approach to Bisesi and colleagues, modelling deviations from the score with various curves, depending on the particular events within the music that they wanted to highlight.

Human-machine collaboration
Baba said that the human element in his team's performance gave them an advantage. He added that neither the case-based, machine-learning approach nor the rule-based, bottom-up approach is sufficient for creating a truly expressive performance.

He explains this by talking about the way both systems might approach a particular notation in the score of a tune, using the "f" symbol, which means forte (loudly), as an example.

In the rule-based approach, a system might search for all parts of the music marked with "f" and play them with a loud volume. In the case-based approach, a system would scan the database of previous human performances looking for samples where this particular piece of music was played. The more samples there are where the note is played with loud volume, the higher the probability that the note in the new piece will be played with loud volume.
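In code, the contrast Baba draws might look something like this. It is a toy sketch: the velocity table and the "database" of past performances are invented for illustration, not drawn from any real system.

```python
from collections import Counter

# Rule-based: every note marked "f" gets one fixed loud velocity.
def rule_based_velocity(marking):
    rules = {"pp": 33, "p": 49, "mf": 64, "f": 96, "ff": 112}
    return rules.get(marking, 64)

# Case-based: look up how human performers played notes with this marking
# in past recordings, and take the most common velocity among them.
def case_based_velocity(marking, performance_db):
    samples = [vel for m, vel in performance_db if m == marking]
    if not samples:
        return rule_based_velocity(marking)   # fall back to the rule
    return Counter(samples).most_common(1)[0][0]

db = [("f", 92), ("f", 101), ("f", 92), ("p", 50)]  # invented samples
print(rule_based_velocity("f"))        # the same answer for every "f"
print(case_based_velocity("f", db))    # the most frequent past choice
```

Neither function knows *why* a given "f" is there -- which is Baba's point about the black box between the two approaches.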

"But there are many kinds of 'f'," Baba explains, adding that a human would never play every one with the same volume. Each 'f' has a different purpose, depending on its context and the performer. But these aren't marked in the score. It doesn't say "this f is designed to convey anger" or "this is a triumphant f". How each one is played comes down to the way that humans interpret the music, the way they try to tell the story on the partiuclar day that they are performing. "There is a black box between rule-based and case-based approaches. This is where you find art, sensitivity, creativity and humanity."

The interactive systems have an advantage over the automatic systems because a "human can intervene in this black box". That is to say that the conductor can help to bring some of the creativity and emotion to the performance. Baba believes that by improving these interactive systems -- and better understanding the connection between the score, the conductor and the performance -- we'll be able to start unpicking what's going on inside the black box. Bresin adds that a really talented pianist will surprise the audience with deviations that they are not expecting. "Conducting allows for these sorts of deviations. It's hard to mimic in an automatic system."

Sometimes it's precisely an attempt at such a deviation that identifies a performance as computer-generated, even if it is expressive. Bisesi gives the example of a computer placing the wrong emphasis on the classic "da-da-da-dum" opening of Beethoven's Fifth Symphony -- something a human wouldn't mess with.

Alexis Kirke, a member of the University of Plymouth's Interdisciplinary Centre for Computer Music Research and co-editor of the Guide to Computing for Expressive Music Performance, sat through all of the Rencon performances. He picked up on some more subtle giveaways. In the Scarlatti, "the left hand and the right hand at times seemed to interact in ways that didn't make sense. The notes of the lead melody wouldn't coincide with the notes in the left hand. It's not so unusual but it didn't feel human."

"Sometimes things sounded pleasant but the metronome was too perfect. It was too much in time. Human performances are slightly erroneous, but if you make it too erroneous it can sound like a mistake," he adds.

However, he adds: "Although I have heard Disklaviers before I had never been in a room with a Disklavier being controlled by a computer system which is meant to make it sound human. There was something quite profound in the moment for me, coming face-to-face with this piano-embodiment of a robot 'expressing itself' with our main language of emotion -- music."

The AI challenge
Rencon is not simply a fun competition to see if we can differentiate between a robot and a human performer: it represents a major AI challenge -- or, as Baba puts it, a chance to clarify what's going on inside the black box.

"We don't really understand human expressive performance fully," adds Kirke. "You can take the same piece of music and ask a trained pianist to play it happily, sadly or angrily and they can do. In order to simulate performance, we need to tap into who we are at the deepest level, what it means to be human and to experience beauty."

Bresin adds: "If you can model how human beings express themselves, you have a better insight into human expression in general."

Music has the added advantage of being a universal language. Research into musical expression could also be applied to other fields, such as robotics or general interactions between humans and machines.

For Bisesi, the main challenges for Rencon relate to non-verbal communication and education. "It's not really about trying to replace human performers with a computer. The aim is to have a system which allows us to understand just one aspect of human communication. If we can communicate and explain music in a systematic way, it can be useful for educational purposes."

Baba believes that systems like VirtualPhilharmony could also be applied to musical therapy, by allowing people with limited musical skills to express themselves. He suggests that children, people with dementia and those with learning disabilities could find pleasure through musical performance -- playing complex music by simply waving their hands. This, in turn, may "activate their brains".

While there is anecdotal evidence to suggest that there may be therapeutic benefits to musical therapy, very few rigorous trials have been carried out.

Flawed methodology
Despite being dubbed the "Turing Test for music", Rencon suffers from a fairly major methodological flaw: the judges know that none of the pieces are played by humans. In the standard Turing Test for artificial intelligence, the judge is presented with both real people and machines.

"Knowing that all of the performances were machine-generated was a problem. It meant that I was looking for mistakes. When one note sounds unnatural people will throw away the whole performance," Kirke says.

So powerful is stage presence in our evaluation of musical recitals that a study published in August found that people shown silent videos of piano competitions could pick out the winners more often than those who could also hear the music.

The study concludes that both novices and experts make judgements about music performance quickly and automatically on the basis of visual information. Novices were able to quickly identify the actual competition winners at very high rates based on the visual passion that the performers displayed.

The Rencon setup -- a laptop plugged into a Disklavier -- makes for a pretty bland visual accompaniment. One solution would be to get a pianist to sit at the piano and fake the movement of playing the computer-generated piece. This sort of performance could be interspersed with real performances in order to reduce the bias of prior knowledge.

Bresin believes that current technology provides a performance on a par with a pianist who has played for between five and seven years. "But we can go further, to almost diploma level [10 years]."

However, improvements may become more challenging beyond that, as it's at that point that pianists start to develop personality. (Olivia Solon, 02.09.2013)

Source Wired UK · Date Sep 2, 2013