LFM-2b Dataset

Corpus of Music Listening Events for Music Recommendation and Retrieval

Description

This web page used to host the LFM-2b dataset of more than two billion listening events, intended to be used for various music retrieval and recommendation tasks.
The dataset is no longer available for download due to licensing issues.

Dataset Variants

For the sake of reproducibility, we used to host both the full version of the dataset and the version released with the paper "Investigating gender fairness of recommendation algorithms in the music domain", published in the journal Information Processing & Management (IP&M). We also used to host the 2020 Subset of the data, which was used in the RecSys'22 paper "ProtoMF: Prototype-based Matrix Factorization for Effective and Explainable Recommendations".

Joined Dataset

The joined dataset used to contain the roughly two billion music listening events in a single .tsv file with the following columns:

Field	Type	Meaning
User Id	Integer	Unique user Id
Country	String	Two-letter country code of the user
Age	Integer	Age of the user
Gender	String	Gender of the user as specified on Last.fm
Track Name	String	Name of the track
Artist Name	String	Name of the artist
Timestamp	Timestamp <YYYY-MM-DD 00:00:00>	Timestamp of the listening event
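
For readers who still hold a local copy, the following is a minimal sketch of parsing rows with this column layout using only the Python standard library. It assumes the columns appear in the order of the table above, that the file has no header row, and that timestamps follow the pattern YYYY-MM-DD HH:MM:SS. The names ListeningEvent and read_events are hypothetical, and the sample rows are invented for illustration.

```python
import csv
import io
from datetime import datetime
from typing import Iterator, NamedTuple, TextIO


class ListeningEvent(NamedTuple):
    """One row of the joined dataset (hypothetical container type)."""
    user_id: int
    country: str
    age: int
    gender: str
    track_name: str
    artist_name: str
    timestamp: datetime


def read_events(handle: TextIO) -> Iterator[ListeningEvent]:
    """Yield one ListeningEvent per tab-separated row.

    Assumes no header row and the column order of the table above.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        user_id, country, age, gender, track, artist, ts = row
        yield ListeningEvent(
            int(user_id), country, int(age), gender, track, artist,
            datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        )


# Invented sample rows, for illustration only.
sample = io.StringIO(
    "1\tAT\t25\tm\tSome Track\tSome Artist\t2019-05-01 13:45:10\n"
    "2\tUS\t31\tf\tOther Track\tOther Artist\t2020-01-15 08:00:00\n"
)
events = list(read_events(sample))
print(events[0].artist_name, events[0].timestamp.year)
```

Reading lazily through a generator matters at this scale: two billion rows will not fit in memory, so a real run would pass an open file handle and consume the events one at a time.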

Adapted for Track Recommendation

The dataset also used to be available as three separate files for easier integration into recommender systems:

user_track_playcount.zip (~8.6GB)
Also available as a sparse matrix (user x track) in COOrdinate format:
user_track_playcount_coo_matrix.npz (~6.3GB)

Field	Type	Meaning
User Id	Integer	User Id
Track Id	Integer	Id of the pair Track Name and Artist Name
Playcount	Integer	Number of times the user listened to the Track overall
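
The COOrdinate format stores a sparse matrix as three parallel arrays: row indices, column indices, and values. As a rough sketch of how the playcount table maps onto that layout, the snippet below aggregates individual (user, track) listening events into such triplets using only the standard library. The function name to_coo and the sample events are hypothetical, and the sketch assumes that user and track Ids can be used directly as matrix indices.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def to_coo(
    events: Iterable[Tuple[int, int]],
) -> Tuple[List[int], List[int], List[int]]:
    """Aggregate (user_id, track_id) events into COO triplets.

    Returns (rows, cols, data), where data[i] is the number of times
    user rows[i] listened to track cols[i].
    """
    counts = Counter(events)
    rows, cols, data = [], [], []
    # Sorting keeps the output deterministic; COO itself imposes no order.
    for (user_id, track_id), playcount in sorted(counts.items()):
        rows.append(user_id)
        cols.append(track_id)
        data.append(playcount)
    return rows, cols, data


# Invented events, for illustration only: (user_id, track_id).
events = [(0, 2), (0, 2), (1, 0), (0, 1), (1, 0), (1, 0)]
rows, cols, data = to_coo(events)
print(list(zip(rows, cols, data)))
```

If the .npz file was written with SciPy's save_npz (an assumption, since the page does not say how it was produced), scipy.sparse.load_npz should be able to read it back as a sparse matrix.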

song_ids.zip (~2.6 GB)

Field	Type	Meaning
Track Id	Integer	Track Id
Artist Name	String	Full artist name
Track Name	String	Full track title

user_demographics.zip (4.0 MB)

Field	Type	Meaning
User Id	Integer	User Id
Country	String	Two-letter country code of the user
Age	Integer	Age of the user
Gender	String	Gender of the user as specified on Last.fm
User registration date	Timestamp <YYYY-MM-DD 00:00:00>	The creation date of the corresponding Last.fm account

Adapted for Artist Recommendation

For artist recommendation, we also used to provide the following files:

user_artist_playcount.zip (~1.8GB)
Also available as a sparse matrix (user x artist) in COOrdinate format:
user_artist_playcount_coo_matrix.npz (~1.2GB)

Field	Type	Meaning
User Id	Integer	User Id
Artist Id	Integer	Unique Id of the Artist
Playcount	Integer	Number of times the user listened to the Artist's tracks overall

artist_ids.zip (~170MB)

Field	Type	Meaning
Artist Id	Integer	Artist Id
Artist Name	String	Full name of the artist

Available Files

File	Size	md5sum	Records	Fields
albums.tsv.bz2	245M	938e232f0d5d7a9162487088829378ed	24,237,348	album_id, album_name, artist_name
artists.tsv.bz2	45M	0fcaea92c8c2fb1e247c5a3d0d4e8e3e	5,159,580	artist_id, artist_name
listening-events.tsv.bz2	14G	cebd1047535d562a67801377ca2db0e4	2,014,164,872	user_id, track_id, album_id, timestamp
spotify-uris.tsv.bz2	49M	7ac4d2e0a6845cd3a658a0a5e486a602	2,378,113	track_id, uri
tracks.tsv.bz2	641M	d89ca3c4d5344a6166da5cf5305e71fe	50,813,373	track_id, artist_name, track_name
listening-counts.tsv.bz2	2.3G	9c761797d89640e2137670598031f577	519,293,333	user_id, track_id, count
users.tsv.bz2	797K	0aca8ab3a67ea71b1422d28c6c76d834	120,322	user_id, country, age, gender, creation_time
lyrics-features.json.bz2	3.8G	500f36096bf2f06119dd3a0d15141ca8	1,266,554	features{...}
tags.json.bz2	142M	f49891c59a7028a1fb3b798544e6edeb	2,230,814	<tag, weight>+
tags-micro-genres.json.bz2	38M	e937e2d0317323c77e3e4f02b4d5be5b	1,638,468	<micro-genre, weight>+
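
The md5sum column lets holders of an earlier download check that their copies are intact. A minimal sketch of such a check with Python's hashlib follows; the helper names md5sum and verify are hypothetical, and the demonstration runs on a small temporary file rather than on a dataset file.

```python
import hashlib
import os
import tempfile


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with the value listed in the table."""
    return md5sum(path) == expected.strip().lower()


# Demonstration on a throwaway file, not on a dataset file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
demo_digest = md5sum(demo_path)
demo_ok = verify(demo_path, demo_digest.upper())
os.remove(demo_path)
print(demo_digest, demo_ok)
```

For a local copy, a call such as verify("users.tsv.bz2", "0aca8ab3a67ea71b1422d28c6c76d834") would compare the file against the checksum listed above.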

Format Clarifications

artists: name of 5,159,580 artists.
albums: names of 24,237,348 albums, accompanied by the names of their artists.
tracks: names of 50,813,373 tracks, accompanied by the names of their artists.
users: information on 120,322 users, comprising country, age, gender, and creation time. Country is given as an ISO 3166-1 alpha-2 country code; empty if unknown. Age is the age of the user; -1 if unknown. Gender is either "m" (male), "f" (female), or "n" (neutral); empty when no gender information is present. Creation time is the time at which the user profile was created.
listening-events: 2,014,164,872 listening events (LEs), where each record consists of the ID of the user, the IDs of the track and the album, and the timestamp of the event. The artist can be looked up in tracks via the track_id column.
listening-counts: has 519,293,333 records, containing the number of times a user has listened to a certain track.
spotify-uris: the URI of 2,378,113 tracks is provided, which can be used for crawling audio features or additional metadata from Spotify. Note that URIs are only specified for the tracks in the LFM-2b which are also included in Spotify's catalog.
lyrics-features: provides 1,266,554 records, containing the lexical features, compression ratios, entropy values, and vector embeddings of the lyrics of the subset of tracks for which we could retrieve lyrics.
tags: for a subset of 2,230,814 tracks, the user-generated tags are provided. Each of these tracks is annotated by users with one or more tags in the form of <tag, weight> pairs (tags = <tag, weight>+). Weights are values between 1 and 100, rounded to the nearest integer. The tag with the most annotations for a given track gets a weight of 100, and every other tag's weight is its number of annotations expressed as a percentage of that of the most common tag. Overall, there are 1,041,819 unique tags in the dataset.
tags-micro-genres: we also provide a subset of tags, containing 1,638,468 records exclusively with the information of micro-genres, fine-grained indications of musical genres or styles.
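
The tag weighting described above can be illustrated with a short sketch. The dataset itself provides only the final weights, so the raw annotation counts below are invented, and the function name tag_weights is hypothetical. Clamping to a minimum of 1 is an assumption made to match the stated range of 1 to 100.

```python
from typing import Dict


def tag_weights(annotation_counts: Dict[str, int]) -> Dict[str, int]:
    """Scale per-tag annotation counts so that the most common tag gets 100.

    Every other tag gets its count as a percentage of the top count,
    rounded to the nearest integer. The lower clamp to 1 is an assumption.
    Note that Python's round() rounds exact halves to the nearest even integer.
    """
    top = max(annotation_counts.values())
    return {
        tag: max(1, round(100 * count / top))
        for tag, count in annotation_counts.items()
    }


# Invented annotation counts for a single track.
weights = tag_weights({"rock": 50, "indie": 20, "live": 1})
print(weights)
```

Because weights are relative to the most common tag of each track, they are comparable within a track but not across tracks.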

The 2020 Subset is a subset of the CHIIR dataset that contains only the listening events recorded in 2020 (from 01/01/2020 to 20/03/2020), together with the users, tracks, and albums that appear in those events.
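
As a rough sketch of the date filter implied by this description, the snippet below checks whether a listening-event timestamp falls inside the stated range. It assumes timestamps of the form YYYY-MM-DD HH:MM:SS and that 20 March 2020 is included in full; neither detail is confirmed by the page, and the names START, END, and in_2020_subset are hypothetical.

```python
from datetime import datetime

START = datetime(2020, 1, 1)
# Exclusive upper bound, so that all of 20 March 2020 is included (assumption).
END = datetime(2020, 3, 21)


def in_2020_subset(timestamp: str) -> bool:
    """Return True if the timestamp lies within 2020-01-01 .. 2020-03-20."""
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return START <= moment < END


# Invented timestamps, for illustration only.
samples = [
    "2019-12-31 23:59:59",
    "2020-01-01 00:00:00",
    "2020-03-20 23:59:59",
    "2020-03-21 00:00:00",
]
kept = [ts for ts in samples if in_2020_subset(ts)]
print(kept)
```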

Available Files

File	Size	md5sum	Records	Fields
albums.tsv.bz2	75M	a14dc60880f7bfd33dacdc383d100fc7	1,677,077	album_id, artist_name, album_name
tracks.tsv.bz2	180M	2ee8abf44877df790bdd9633e35416ff	4,082,530	track_id, artist_name, track_name
listening_events.tsv.bz2	1.3G	c21a76483b0fc4ab0de48281ccc79c32	30,357,786	user_id, track_id, album_id, timestamp
users.tsv.bz2	0.5M	c903989c9791dcf3834be8467bf7175e	15,258	user_id, country, age, gender, creation_time

Code

The code used in the submission to the Information Processing & Management journal can be found on GitHub.