Austria (FWF) and Taiwan (MOST):
Culture- and Location-aware Music Recommendation and Retrieval
Research and development in music information retrieval (MIR) and music
recommender systems has seen a remarkable increase during the last
couple of years, driven by academia and industry alike (e.g.,
Spotify, Deezer, or Pandora). While the need for user-centric
approaches in these fields was highlighted several times, little effort
has been devoted to the study of culture- and location-specific
differences between users when accessing possibly huge music
collections. Investigating such differences will likely pave the way
for improvements in respective music access systems. Addressing this
lack of research, the joint seminar will foster the
exchange of ideas and will shed light on the subject matter from
several perspectives. In particular, we will contemplate novel
approaches in music retrieval and music recommender systems in the
context of the different cultural backgrounds and locations of users.
|Johannes Kepler University, Linz|
|University of Innsbruck|
|Vienna University of Technology|
|Academia Sinica, Taipei|
|National Taiwan University, Taipei|
|National Chengchi University, Taipei|
Day 1: Workshop
|09:00 - 09:10||Opening and Welcome|
|09:10 - 09:40||Andreas Rauber,
Vienna University of Technology
Repeatability Challenges in MIR Research
Repeatability of experimental science is an essential ingredient for establishing trust in eScience processes. Only if we are able to verify these processes and have verified components available can we integrate them to perform increasingly sophisticated, transparent research that builds on each other's developments. Such validation of repeatability and trusted reuse is particularly challenging in MIR research. In contrast to many other settings, raw data (i.e., music files subject to copyright restrictions) cannot be shared easily, rendering the comparison of different approaches a difficult endeavor. Secondly, researchers increasingly rely on highly dynamic data sources such as social media and dynamic audio databases, making precise identification of the data used in a particular study another challenging task. Thirdly, signal processing algorithms employed in MIR, even when following standardized descriptions, may lead to differing results due to minute variations in the specific implementation of signal processing routines. All these characteristics render repeatability and validation of MIR a highly desired but hard to attain goal. In particular, we will focus on a detailed analysis of these challenges and on approaches for achieving repeatability: benchmark data sharing via time-stamped and versioned data sources; data identification via time-stamped queries, as recommended by the Research Data Alliance (RDA); means to capture an experiment's execution context; and validation data to enable ex-post validation of experiments.
|09:40 - 10:00||Alexander Schindler,
Vienna University of Technology
Music Video Analysis and Retrieval
In the second half of the last century, visual representation became a vital part of music. Album covers became a visual mnemonic for the music they enclosed. Music videos distinctively influenced our pop culture and became a significant part of it. Music video production makes use of a wide range of filmmaking techniques and roles, such as screenplay, directors, producers, and directors of photography. The effort applied creates enough information that many music genres can be guessed from the moving pictures alone. Over decades, stylistic elements emerged into prototypical visual descriptions of genre-specific properties, including elements related to fashion, scenery, and dance moves. Advances in visual computing provide means to exploit this information to tackle open music information retrieval problems in a multimodal way. This workshop is intended to provide an overview of the application of image and video analysis to music information retrieval tasks. By introducing the technologies, toolsets, and evaluation results, the following questions will be addressed:
|10:00 - 10:20||Rudolf Mayer, Vienna University of Technology
Music and Lyrics - Multi-modal Analysis of Music
Multimedia data by definition comprises several different content modalities. Music specifically comprises, e.g., audio at its core, text in the form of lyrics, images by means of album covers, and video in the form of music videos. Yet, in many Music Information Retrieval applications, only the audio content is utilised. Recent studies have shown the usefulness of incorporating other modalities; often, textual information in the form of song lyrics or artist biographies was employed. The lyrics of music may be orthogonal to its sound, and they differ greatly from other texts regarding their (rhyme) structure. Lyrics can thus be analysed in many different ways: by standard bag-of-words approaches, or by approaches that also take style and rhyme into account.
The exploitation of these properties has potential for typical music information retrieval tasks such as musical genre classification. Particularly useful can be the combination of features extracted from lyrics with the audio content, or with further modalities.
Lyrics can also be interesting from a cross-language perspective - sometimes there are cover versions with lyrics in a different language. In these cover versions, the message of the song, and the mood and emotions created by the lyrics, might differ from the original version. This offers further interesting research opportunities for multi-modal analysis, and calls for a stronger focus on lyrics analysis.
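As an illustration of the multi-modal combination described above, here is a minimal early-fusion sketch that concatenates a bag-of-words lyrics vector with audio features (toy vocabulary and hypothetical audio descriptors, not the features used in the studies mentioned):

```python
import numpy as np

def bow_vector(lyrics, vocabulary):
    """Bag-of-words term counts for one song's lyrics."""
    tokens = lyrics.lower().split()
    return np.array([tokens.count(term) for term in vocabulary], dtype=float)

def fuse_features(lyrics, audio_features, vocabulary):
    """Early fusion: concatenate the lyrics and audio feature vectors."""
    return np.concatenate([bow_vector(lyrics, vocabulary), audio_features])

vocab = ["love", "night", "dance"]      # toy vocabulary
audio = np.array([0.42, 0.13])          # hypothetical timbre/rhythm descriptors
x = fuse_features("Dance dance all night", audio, vocab)
# x can now feed any standard classifier, e.g. for genre classification
```

Rhyme- and style-aware lyrics features would simply extend the lyrics part of the fused vector.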
|10:20 - 10:50||Break|
|10:50 - 11:20||Jen-Tzung
Chien, National Chiao Tung University
Bayesian Learning for Singing-Voice Separation
This talk presents a Bayesian nonnegative matrix factorization (NMF) approach to extracting the singing voice from background music accompaniment. In this approach, the NMF-based likelihood function is represented by a Poisson distribution, and the NMF parameters, consisting of basis and weight matrices, are characterized by exponential priors. A variational Bayesian expectation-maximization algorithm is developed to learn the variational parameters and model parameters for monaural source separation. A clustering algorithm is performed to establish two groups of bases: one for the singing voice and the other for the background music. Model complexity is controlled by adaptively selecting the number of bases for different mixed signals according to the variational lower bound. Model regularization is tackled through uncertainty modeling via variational inference based on the marginal likelihood. We will show experimental results on the MIR-1K database.
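A Poisson likelihood for NMF corresponds to minimising the generalised Kullback-Leibler divergence. As a rough, non-Bayesian sketch of that core factorization (maximum-likelihood multiplicative updates only, without the variational priors or the basis clustering described in the talk):

```python
import numpy as np

def nmf_kl(V, n_bases, n_iter=300, seed=0):
    """Maximum-likelihood NMF under a Poisson observation model,
    i.e. multiplicative updates for the generalised KL divergence.

    V: nonnegative (n_freq, n_frames) magnitude spectrogram.
    Returns basis matrix W (n_freq, n_bases) and weights H (n_bases, n_frames).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_bases)) + 1e-3
    H = rng.random((n_bases, n_frames)) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```

In the talk's approach, the learned bases would additionally be clustered into voice and accompaniment groups, with the number of bases selected via the variational lower bound.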
|11:20 - 11:40||Yu-Ren
Chien, Academia Sinica
Alignment of Lyrics With Accompanied Singing Audio Based on Acoustic-Phonetic Vowel Likelihood Modeling
Here at the SLAM lab, I have been working on the task of aligning lyrics with accompanied singing recordings. With a vowel-only representation of lyric syllables, my approach evaluates likelihood scores of vowel types with glottal pulse shapes and formant frequencies extracted from a small set of singing examples. The proposed vowel likelihood model is used in conjunction with a prior model of frame-wise syllable sequence in determining an optimal evolution of syllabic position. New objective performance measures are introduced in the evaluation to provide further insight into the quality of alignment. Use of glottal pulse shapes and formant frequencies is shown by a controlled experiment to account for a 0.07 difference in average normalized alignment error. Another controlled experiment demonstrates that, with a difference of 0.03, F0-invariant glottal pulse shape gives a lower average normalized alignment error than does F0-invariant spectrum envelope, the latter being assumed by MFCC-based timbre models.
|11:40 - 12:00||Thomas
Chan, Academia Sinica
Hypercomplex and Informed Source Separation for Machine Listening and Brain-Computer Music Interfacing
As an introvert who could not communicate with people in a (British) pub, I began investigating the cocktail party problem, also known as source separation, in 2002 (http://mpeg7ease.sourceforge.net/). Source separation entails the recovery of the original signals given only the mixed signals. In particular, musical signal separation and brain signal separation are difficult problems because there are more sources than sensors. For better separation, additional guidance is needed in the form of source models and side information. In this talk, we will present our current and future work on guided source separation as applied to musical and brain signals, with many potential applications for music information retrieval. As I am also an amateur singer and pianist (http://www.mutopiaproject.org/cgibin/piece-info.cgi?id=368), this talk will mostly concentrate on our singing voice separation work.
|12:00 - 12:20||Zhe-Cheng
Fan, National Taiwan University
Singing Voice Separation and Pitch Extraction from Monaural Polyphonic Audio Music via DNN and Adaptive Pitch Tracking
With the explosive growth of audio music everywhere over the Internet, it is becoming more important to be able to classify or retrieve audio music based on its key components, such as the vocal pitch of common popular music. In this talk, I am going to describe an effective two-stage approach to singing pitch extraction, which involves singing voice separation and pitch tracking for monaural polyphonic audio music. The approach was submitted to the singing voice separation and audio melody extraction tasks of the Music Information Retrieval Evaluation eXchange (MIREX) in 2015. The results of the competition show that the proposed approach is superior to the other submitted algorithms, which demonstrates the feasibility of the method for further applications in music processing.
|12:20 - 14:00||Lunch|
|14:00 - 14:30||Markus
Schedl, Johannes Kepler University Linz
Music Retrieval and Recommendation via Social Media Mining
Social media represent an unprecedented source of information about every topic of our daily lives. Since music plays a vital role for almost everyone, information about music items and artists is found in abundance in user-generated data. In this talk, I will report on our recent research on exploiting social media to extract music-related information, aiming to improve music retrieval and recommendation. More precisely, I will elaborate on the following questions:
|14:30 - 15:00||Peter
Knees, Johannes Kepler University Linz
Only Personalized Retrieval can be Semantic Retrieval or: What Music Producers Want from Retrieval and Recommender Systems
Sample retrieval remains a central problem in the creative process of making electronic music. In this talk, I am going to describe the findings from a series of interview sessions involving users working creatively with electronic music. In the context of the GiantSteps project, we conducted in-depth interviews with expert users on location at the Red Bull Music Academy. When asked about their wishes and expectations for future technological developments in interfaces, most participants mentioned very practical requirements for storing and retrieving files. It becomes apparent that for music interfaces for creative expression, traditional requirements and paradigms for music and audio retrieval differ from those of consumer-centered MIR tasks such as playlist generation and recommendation, and that new paradigms need to be considered. Although all technical aspects are controllable by the experts themselves, searching for sounds to use in composition remains a largely semantic process. The desired systems need to exhibit a high degree of adaptability to the individual needs of creative users.
|15:00 - 15:30||Eva
Zangerle, University of Innsbruck
The #nowplaying Dataset in the Context of Recommender Systems
The recommendation of musical tracks to users has been tackled by research from various angles. Recently, incorporating contextual information in the process of eliciting recommendation candidates has proven to be useful. In this talk, we report on our analyses on the effectiveness of affective contextual information for ranking track recommendation candidates. Particularly, we performed an evaluation of such an approach based on a dataset gathered from so-called #nowplaying tweets and looked into how incorporating affective information extracted from hashtags within these tweets can contribute to a better ranking of music recommendation candidates. We model the given data as a graph and subsequently exploit latent features computed based on this graph. We find that exploiting affective information about the user's mood can improve the performance of the ranking function substantially.
|15:30 - 16:00||Break|
|16:00 - 16:20||Andreu
Vall, Johannes Kepler University Linz
Fusing Web and Audio Predictors to Localize the Origin of Music Pieces for Geospatial Retrieval
Localizing the origin of a music piece around the world enables some interesting possibilities for geospatial music retrieval, for instance, location-aware music retrieval or recommendation for travelers, or exploring non-Western music -- a task neglected for a long time in music information retrieval (MIR). While previous approaches to determining the origin of music focused solely on exploiting either the audio content or web resources, we propose a method that fuses features from both sources in a way that outperforms stand-alone approaches. To this end, we propose the use of block-level features inferred from the audio signal to model music content. We show that these features outperform the timbral and chromatic features previously used for the task. On the other hand, we investigate a variety of strategies to construct web-based predictors from web pages related to music pieces. We assess different parameters for this kind of predictor (e.g., the number of web pages considered) and define a confidence threshold for prediction. Fusing the proposed audio- and web-based methods by a weighted Borda rank aggregation technique, we show on a previously used dataset of music from 33 countries around the world that the median placing error can be substantially reduced using K-nearest neighbor regression.
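A weighted Borda rank aggregation of two predictors can be sketched as follows (hypothetical country labels and weights; the weights used in the actual fusion are not reproduced here):

```python
def weighted_borda(rankings, weights):
    """Fuse several rankings of the same candidates by weighted Borda count.

    rankings: list of lists, each ordering the candidates from best to worst.
    weights:  one weight per ranking, e.g. favouring the audio predictor.
    """
    scores = {}
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for pos, cand in enumerate(ranking):
            # Borda points: the best candidate gets n-1 points, the worst gets 0
            scores[cand] = scores.get(cand, 0.0) + w * (n - 1 - pos)
    return sorted(scores, key=scores.get, reverse=True)

audio_rank = ["AT", "DE", "TW", "FR"]   # hypothetical audio-based prediction
web_rank   = ["TW", "AT", "FR", "DE"]   # hypothetical web-based prediction
fused = weighted_borda([audio_rank, web_rank], weights=[0.6, 0.4])
```

The fused ranking can then feed a K-nearest neighbor regression over candidate locations.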
|16:20 - 16:40||Ken-Shin
Yeh, National Tsing Hua University
AutoRhythm: A Music Game With Automatic Hit-Timing Generation And Percussion Identification
In this talk, we will introduce a music rhythm game called AutoRhythm, which can automatically generate the hit timing for a rhythm game from a given piece of music and identify user-defined percussion sounds of real objects in real time. More specifically, AutoRhythm can automatically generate the beat timing of a piece of music via server-based computation, such that users can use any song from their personal music collection in a rhythm game. Moreover, to make such games more realistic, AutoRhythm allows users to interact with the game via any object that can produce a percussive sound, such as a pen or a chopstick hitting a table. AutoRhythm identifies these percussion hits in real time while the music is playing. The identification is based on the power spectrum of each frame of the filtered recording obtained via active noise cancellation, in which the estimated noisy playback music is subtracted from the original recording.
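The subtraction-based detection idea can be sketched as follows (toy spectrogram frames and an arbitrary threshold; a real system would first estimate and align the playback signal, as described above):

```python
import numpy as np

def detect_hits(recording_mags, playback_mags, threshold=1.0):
    """Flag frames whose residual energy, after subtracting the estimated
    playback music spectrum from the recording spectrum, suggests a
    user-made percussion hit.

    Both arguments are (n_frames, n_bins) magnitude spectrograms.
    """
    # Residual spectrum: what remains once the playback music is removed
    residual = np.maximum(recording_mags - playback_mags, 0.0)
    # Per-frame power of the residual
    energy = (residual ** 2).sum(axis=1)
    return energy > threshold
```

A per-user calibration step could learn the spectral profile of the chosen object instead of a fixed energy threshold.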
|16:40 - 17:00||Chun-Ta
Chen, National Tsing Hua University
Polyphonic Audio-To-Score Alignment Using Onset Detection And Constant Q Transform
We propose an innovative method that aligns a polyphonic audio recording of music to its corresponding symbolic score. In the first step, we perform onset detection and then apply a constant Q transform around each onset. A similarity matrix is computed using a scoring function that evaluates the similarity between notes in the music score and onsets in the audio recording. Finally, we use dynamic programming to extract the best alignment path from the similarity matrix. We compared two onset detectors and two note matching methods. Our method is more efficient and has higher precision than the traditional chroma-based DTW method; our algorithm achieved the best precision, 10% higher than the compared traditional algorithm with a tolerance window of 50 ms.
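The dynamic-programming step can be sketched as follows, assuming the similarity matrix has already been computed (toy move set of (1,0), (0,1), (1,1) steps; the actual scoring function and path constraints may differ):

```python
import numpy as np

def best_alignment_path(S):
    """Extract a maximum-score monotonic alignment path through a
    similarity matrix S (rows: score notes, columns: audio onsets)
    by dynamic programming."""
    n, m = S.shape
    D = np.full((n, m), -np.inf)
    D[0, 0] = S[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Best predecessor among the three allowed moves
            best_prev = max(
                D[i - 1, j - 1] if i > 0 and j > 0 else -np.inf,
                D[i - 1, j] if i > 0 else -np.inf,
                D[i, j - 1] if j > 0 else -np.inf,
            )
            D[i, j] = S[i, j] + best_prev
    # Backtrack greedily from the end cell to (0, 0)
    path = [(n - 1, m - 1)]
    i, j = n - 1, m - 1
    while (i, j) != (0, 0):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        candidates = [(a, b) for a, b in candidates if a >= 0 and b >= 0]
        i, j = max(candidates, key=lambda ij: D[ij])
        path.append((i, j))
    return path[::-1]
```

Each path entry pairs a score note with the audio onset it is aligned to.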
|17:00 - 17:30||Li Su, Academia Sinica
Music Technology of the Next Generation: Automatic Music Transcription and Beyond
To date, music remains an under-explored area in modern digital multimedia applications. When it comes to music, people have many desired, unrealized, but forgotten dreams: Is it possible for me to learn music efficiently and happily, without expensive tuition fees? Can tools make my singing voice beautiful? Can I produce my own album by myself, or learn to write songs easily? Solutions to these general desires are either unavailable or used only by a small group of professional musicians. User-centered, personalized, portable, and ubiquitous applications of smart music processing, with uses in music appreciation, music education, music gaming, music production, and even the preservation and revitalization of musical cultural heritage, are becoming the arena of cutting-edge music technologies and will be at the heart of the future digital music market. Automatic music transcription (AMT), one of the most challenging problems in machine listening of music, will play a significant role in the next wave of music technology from several perspectives. With strong research and development capacity in music information retrieval (MIR) and a large number of musicians and music producers, Taiwan has a niche for launching the next revolution in music technology, succeeding the decade-long wave of global online music streaming, by incorporating MIR technology, augmented reality (AR), the Internet of Things (IoT), new signal processing and machine learning techniques, and creative ideas. This emerging field offers a new opportunity for Taiwan, where people are seeking the future position of the information technology and cultural and creative industries, both of which are now facing great challenges.
Day 2: Forum
|09:00 - 09:30||Yi-Hsuan Yang, Academia Sinica
Quantitative Study of Music Listening Behavior in a Social and Affective Context
A scientific understanding of emotion experience requires information on the contexts in which the emotion is induced. Moreover, as one of the primary functions of music is to regulate the listener's mood, an individual's short-term music preference may reveal that individual's emotional state. In light of these observations, this talk presents the first scientific study that exploits an online repository of social data to investigate the connections between a blogger's emotional state, the user context manifested in the blog articles, and the content of the music titles the blogger attached to the post. A number of computational models are developed to evaluate the accuracy of different content and context cues in predicting emotional state, using 40,000 music listening records collected from the social blogging website LiveJournal. Our study shows that it is feasible to computationally model the latent structure underlying music listening and mood regulation. The average area under the receiver operating characteristic curve (AUC) for the content-based and context-based models attains 0.5462 and 0.6851, respectively. The association among user mood, music emotion, and the individual's personality is also identified.
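The AUC figures quoted above can be computed directly from model scores; a minimal sketch of the rank-based definition (the probability that a randomly drawn positive example is scored above a randomly drawn negative one, ties counting half):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve from raw classifier scores.

    scores_pos: scores assigned to positive examples.
    scores_neg: scores assigned to negative examples.
    """
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 0.5 corresponds to chance; the context-based model's 0.6851 thus marks a clear improvement over the content-based model's 0.5462.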
|09:30 - 10:00||Martin Pichl,
University of Innsbruck
Towards a Context-Aware Music Recommendation Approach
Collaborative filtering based recommender systems utilizing social media data have proven useful. However, collaborative filtering is a rather general approach. In order to provide better music recommendations, we adapt this standard approach so that it incorporates information about the music consumption context. This enables us to develop recommender systems that better fit the field of music recommendation. In the seminar, we will elaborate on how to match social media data with data crawled from music streaming platforms, as well as how this combined data can be utilized. To be precise, we focus on:
|10:00 - 10:30||Bruce Ferwerda, Johannes Kepler University Linz
Creating Personalized (Music) Experiences Through Personality and Affect
Personality is the coherent patterning of affect, cognition, and desires (goals) as they lead to behavior. To study personality is to study how people feel, how they think, what they want, and finally, what they do. In this talk I present some of our recent personality-based work and how these findings can help create personalized (music) experiences. The talk will elaborate on the following topics:
|10:30 - 11:00||Chih-Ming Chen & Ming-Feng Tsai, National Chengchi University
Network Embedding Methods for Query-based Recommendations
A common goal of recommendation systems is to predict the users' unseen, but potentially liked, items based on their previous preferences. As a result, users can only passively receive the generated recommendations. In this study, we propose a query-based recommendation approach that enables users to start a music radio station from a given query. The task thereby becomes an information-retrieval-like problem, which we solve using network embedding methods on the user-to-feature listening graph. We will present a live demo of the system developed so far.
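Once queries and tracks are embedded in a common space, the retrieval step can be sketched as a cosine-similarity ranking (toy embeddings; the actual graph-based embedding method is not reproduced here):

```python
import numpy as np

def query_recommend(query_vec, item_embeddings, top_k=3):
    """Rank items by cosine similarity to a query embedding - the
    retrieval-style step a query-based music radio could use once
    queries and tracks live in the same embedding space."""
    names = list(item_embeddings)
    M = np.stack([item_embeddings[n] for n in names])
    # Normalise rows so the dot product equals cosine similarity
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = M @ q
    order = np.argsort(-sims)[:top_k]
    return [names[i] for i in order]
```

In a query-based radio, the top-ranked tracks would seed the station and be refreshed as listening feedback arrives.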
|11:00 - 11:30||Chia-Hao Chung & Homer Chen, National Taiwan University
User Listening Behavior Analysis by Latent Representation
Understanding the listening behavior of users is important for a music streaming provider, and a suitable analysis tool is needed. In this presentation, we will talk about how representation learning methods can be applied as such a tool to analyse the listening behavior of users. Moreover, we will show some preliminary results of the analysis on a real-world listening log.
|11:30 - 12:00||Fernando Calderon & Yi-Hsin Chen, National Tsing Hua University
Multilingual Emotion Classifier using Unsupervised Pattern Extraction from Micro-blog Data
The connected society we live in has allowed users to willingly share opinions online at an unprecedented scale. In recent years there has been a growing understanding that if these opinions are analyzed and interpreted correctly, they can provide useful information, such as how people feel or react towards a specific topic. This has made it crucial to devise algorithms that efficiently identify the emotions expressed in opinionated content. Traditional opinion-based classifiers require extracting high-dimensional feature representations, which are computationally expensive to process and can misrepresent or deteriorate the accuracy of a classifier. This work proposes an unsupervised graph-based algorithm to extract emotion-bearing patterns from micro-blog posts. The extracted patterns then become the features on which the classifier is built, avoiding dependency on predefined emotional dictionaries, lexicons, or ontologies. The system also considers that posts may be written in multiple languages and takes advantage of the pattern extraction method to perform successfully across different languages, domains, and data sets. Experimental results demonstrate that the extracted patterns are effective in identifying emotions in English, Spanish, and French Twitter streams. We also provide additional experiments and an extended version of our algorithm to support the classification of Indonesian microblog posts. Overall, the results indicate that the proposed approach bears desirable characteristics such as accuracy, generality, adaptability, minimal supervision, and coverage.
|12:30 - 13:00||Xiao Hu, University of Hong Kong
Cross-Cultural Music Affect Recognition
In music mood prediction, regression models are built to predict values on several mood-representing dimensions such as valence (level of pleasure) and arousal (level of energy). Many studies have shown that music mood is generally predictable from music acoustic features, but these experiments were mostly conducted on datasets with homogeneous music. Little research has been done to explore the generalizability of mood regression models across datasets, especially those with music from different cultures. In the increasingly global market of music listening, generalizable models are highly desirable for the automated processing, searching, and managing of music collections with heterogeneous characteristics. In this study, we evaluated mood regression models built on fifteen acoustic features covering five mood-related musical aspects, with a focus on cross-dataset generalizability. Specifically, three distinct datasets were involved in a series of five experiments examining the effects of dataset size, annotation reliability, and the cultural backgrounds of music and annotators on mood regression performance and model generalizability. The results reveal that the size of the training dataset and the annotation reliability of the testing dataset affect mood regression performance. When both factors are controlled, regression models are generalizable between datasets sharing a common cultural background of music or annotators.
|13:00 - 13:30||Alexander Schindler, Vienna University of Technology
Cross-Cultural Music Perception and Analysis
Music Information Retrieval research of the past decade has a commonly acknowledged overemphasis on Western music. Although various studies indicate that people from different cultural backgrounds may perceive music in different ways, such aspects have been widely disregarded in MIR studies. This long-standing issue is starting to gain attention from the research community. Recent research focuses on cross-cultural music perception and analysis across various cultural backgrounds. This talk will provide an overview of the challenges state-of-the-art music analysis faces concerning cross-culturalism.
|13:30 - 15:00||Thomas Lidy, Vienna University of Technology
From Music Feature Extraction to Music Feature Learning
Over the past 15 years, Music Information Retrieval (MIR) research has made remarkable progress, designing new algorithms for notation-based and audio-based analysis of musical material with the purpose of solving a multitude of problems around music recognition: identification of key, mode, instruments, beats, notes, onsets, artists, composers, genre, mood, and more. In the audio-based domain, feature analysis algorithms that extract relevant information from the audio signal have been devised in order to model these musical concepts. These features have been “handcrafted”, i.e. designed by research scientists with a background in music, signal processing and/or psycho-acoustics, or adapted from other domains, such as the frequently used MFCC features that originated in speech recognition. By contrast, automatic feature learning methods try to infer the features automatically from the input, i.e. the audio signal. After the successful application of Deep Neural Networks in image retrieval, Deep Learning approaches are becoming more and more utilized in music retrieval. This talk is intended to revisit the traditional approaches in MIR and discuss questions around how automatic feature learning will influence research and applications in the near future:
|15:00 - 17:30||Open discussion|
Day 3: Networking
|09:00 - 13:30||Visit to National Taiwan University (Host: Homer Chen)|
|13:30 - 17:00||Visit to National
Chengchi University (Host: Ming-Feng Tsai)|
Johannes Kepler University Linz (JKU), Department of Computational Perception
E-Mail: markus [dot] schedl [at] jku [dot] at
Academia Sinica, Research Center for IT Innovation (CITI), Music and Audio Computing (MAC) Lab
E-Mail: yang [at] citi [dot] sinica [dot] edu [dot] tw