Conference Agenda

Session Overview
B03: Data from Video and Music Platforms
Thursday, 07/Mar/2019: 12:00 - 1:00

Session Chair: Simon Kühne, Bielefeld University, Germany
Location: Room 158
TH Köln – University of Applied Sciences


Why not to use popularity scores from platforms. The hidden biases of YouTube data

Merja Mahrt

Heinrich-Heine-Universität Düsseldorf, Germany

Relevance & Research Question: Many social media platforms give access to information such as the number of likes, shares, or views of their content items, facilitating large-scale analyses of content popularity. These data, however, are likely to contain “hidden biases” (Crawford, 2013), partly because of opaque algorithmic decisions that determine what is included in such data (Gillespie, 2012). For YouTube content, how do popularity scores from the platform itself compare to other data sources?

Methods & Data: A content analysis of the daily YouTube top 10 in Germany examines popularity, genre, and author distribution (1,164 unique videos collected from March to August 2014). In an online survey in September 2014, 1,665 respondents were asked about their use of online video sites and their awareness of then-recent popular online videos. Clickstream data from Nielsen Germany for June 2014 were likewise analyzed for content popularity (data on 8,147 YouTube users who watched around 250,000 different videos were acquired from Nielsen).

Results: The three data sets show very different sides of what was popular in spring and summer 2014 in Germany. Both the survey and the clickstream data reveal clear biases in YouTube use by age, gender, education, income, and apparent content interest. The most popular videos from YouTube’s own top 10 and Nielsen’s clickstream data only partly match: gaming dominates according to YouTube itself, while Nielsen users concentrated much more on then-current music videos.

Added Value: The comparison of three data sources highlights that platform scores for popularity are essentially black boxes that may contain numerous biases due to over- or underrepresentation of users and opaque platform decisions about what is included or even advertised to the larger user base. Research that relies solely on platform data thus risks systematic biases that call into question the validity of this approach to social research.


Crawford, K. (2013, April 1). The hidden biases in big data. Harvard Business Review Blog. Retrieved from

Gillespie, T. (2012). Can an algorithm be wrong? Limn, n.v.(2). Retrieved from


Rank eater versus Muggle: The impact of the two consumer orientations on the ranking in the digital music market

Junmo Song, Eehyun Kim

Yonsei University, Republic of Korea (South Korea)

Relevance & Research Question:

How do consumers behave in the digital music market? Previous discussions of the music market assume consumers who respond to rational factors such as the quality and cost of a song. In the real music market, however, consumers often show unconditional preference for certain singers or songs, and they are organized into groups called fandoms, which also exert social influence. The aim of this study is therefore to show the presence of a ‘collective pick’ carried out by organized consumers and to demonstrate its impact on the music market.

Methods & Data:

This research is based on hourly ranking data for 1,898 songs from ‘Melon’, ‘Genie’, ‘Mnet’, and ‘Bugsmusic’, which together account for 70 percent of the streaming market in South Korea, from May 27 to December 31, 2018. Hourly data enable the following operational definitions: the ranking early in the morning is highly dependent on the repetitive streaming of organized consumers, and the ranking at the end of the day reflects the impact of general consumers. In other words, the difference in rank between the two times of day represents the boosting effect of the collective pick. In this study, we analyze the effect of the collective pick on the short- and long-term performance of a song through fixed-effects panel analysis, sequence analysis, and multinomial logistic regression.
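The operational definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the record layout, the choice of hour 1 as "early morning" and hour 23 as "end of day", and the function name `collective_pick_boost` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical hourly chart records: (song_id, date, hour, rank).
# Lower rank numbers mean a better chart position.
EARLY_HOUR = 1   # assumed proxy for "early in the morning"
LATE_HOUR = 23   # assumed proxy for "end of the day"


def collective_pick_boost(records, early_hour=EARLY_HOUR, late_hour=LATE_HOUR):
    """Return {(song_id, date): late_rank - early_rank}.

    A positive value means the song ranked better in the early hour
    than at the end of the day, read here as a fan-driven boost.
    """
    by_song_day = defaultdict(dict)
    for song_id, date, hour, rank in records:
        by_song_day[(song_id, date)][hour] = rank

    boost = {}
    for key, ranks in by_song_day.items():
        # Skip song-days that lack either of the two observations.
        if early_hour in ranks and late_hour in ranks:
            boost[key] = ranks[late_hour] - ranks[early_hour]
    return boost


# Made-up example data, for illustration only.
records = [
    ("song_a", "2018-06-01", 1, 3),
    ("song_a", "2018-06-01", 23, 15),
    ("song_b", "2018-06-01", 1, 40),
    ("song_b", "2018-06-01", 23, 38),
    ("song_c", "2018-06-01", 1, 7),   # no end-of-day observation
]
result = collective_pick_boost(records)
print(result)
```

A per-song-day difference of this kind could then serve as a predictor in panel or survival-style models; how the authors actually constructed and modeled the variable is not specified in the abstract.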


Results:

Depending on the characteristics of the singer, song, and platform, the consumers' collective pick works differently. The chart preemption effect of this fan boosting allows a song to outperform at the level of the daily life cycle. At the level of the long-term life cycle, fan boosting has a positive effect on the song's survival in the chart. Boosting by fans has a significant effect at both the short- and long-term levels, even when controlling for other factors such as the song's initial entry rank and the characteristics of the singer.

Added value:

This study is meaningful in that it confirms that the performance of a song in the digital music market is determined not simply by the quality and cost of the music but also by the collective pick of fans.


Methods and Tools for the Automatic Sampling and Analysis of YouTube Comments

M. Rohangis Mohseni (1), Johannes Breuer (2), Julian Kohne (2)

(1) TU Ilmenau, Germany; (2) GESIS – Leibniz Institute for the Social Sciences, Germany

Relevance & Demand: YouTube is currently the largest and most important video platform on the Internet. For young people, YouTube is already partly replacing television. While there exist a number of studies on the use of YouTube, there are comparatively few quantitative empirical studies that deal specifically with user comments. As suggested by Thelwall (2017), we believe that part of the reason for this is that researchers are not aware of the potential of the YouTube API or do not know how to use it for their research.

Methods/Tools: We compared different tools for sampling and analyzing YouTube comments, paying special attention to the functionality and usability of the options in order to arrive at a practical decision-making aid for various use cases. We investigated the possibilities of automated evaluation (sentiment analysis of text and emojis, content analysis, topic modelling), but also the problems and limitations associated with typical YouTube comments (e.g., use of irony, slang, unusual words, creative spelling). In this process, we especially focused on challenges in working with emojis (parsing; display differences between platforms; cultural and inter-individual differences in the use and interpretation of emojis, etc.).

Script Creation & Use: We created an R script (doi:10.17605/OSF.IO/HQSXE) that collects comments via the YouTube API, and parses the comments into a dataset while extracting additional information (e.g. user ID, timestamps, used emojis, number of likes) to prepare them for further analyses. As a potential use-case, the current version also includes a combined sentiment analysis of text and emojis.
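As a rough illustration of the parsing step described above, the following Python sketch flattens one comment record and separates emojis from text. It is not the authors' R script: the flat record keys (`author_id`, `published_at`, `like_count`, `text`), the function name `parse_comment`, and the coarse emoji pattern are all assumptions made for this example, and no API call is shown.

```python
import re

# Coarse emoji matcher covering common pictograph blocks only (an assumption);
# it ignores ZWJ sequences, flags, and variation selectors.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def parse_comment(raw):
    """Flatten one raw comment record into an analysis-ready row."""
    text = raw["text"]
    emojis = EMOJI_PATTERN.findall(text)
    return {
        "user_id": raw["author_id"],
        "timestamp": raw["published_at"],
        "likes": int(raw["like_count"]),
        # Text with matched emojis removed, for text-only sentiment scoring.
        "text_only": EMOJI_PATTERN.sub("", text).strip(),
        "emojis": emojis,
        "n_emojis": len(emojis),
    }


# Hypothetical, simplified record; a real API response is nested differently.
raw_comment = {
    "author_id": "UC_example",
    "published_at": "2019-03-07T12:00:00Z",
    "like_count": "4",
    "text": "Great talk \U0001F602\U0001F44D",
}
row = parse_comment(raw_comment)
print(row["text_only"], row["n_emojis"])
```

Keeping the emojis in their own column allows text and emoji sentiment to be scored separately and then combined, which is the kind of use case the abstract mentions.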

Added Value: Emojis play an important role in YouTube comments. They can strongly influence the meaning of a comment and, thus, the associated sentiment. Nevertheless, they are rarely taken into account in automated analyses of YouTube comments. Our R script includes initial solutions for extracting and preparing emojis for subsequent analyses. We will also provide attendees with a short tutorial on how to use the three tools we discuss in depth (YouTube Data Tools, Webometric Analyst, and the tuber package for R).
