LFM-2b Dataset

Corpus of Music Listening Events for Music Recommendation and Retrieval

Description

This web page hosts the LFM-2b dataset of more than two billion listening events, intended to be used for various music retrieval and recommendation tasks.

Dataset Variants

For reproducibility, we host both the full version of the dataset and the version released with the paper "Investigating gender fairness of recommendation algorithms in the music domain", published in the journal Information Processing & Management (IP&M).

Joined Dataset

You can download the LFM-2b dataset here: LFM-2b.zip (~132.7GB, uncompressed).
A preview of the LFM-2b dataset can be found here: LFM-2b-preview.tsv (~5.8MB).
The dataset contains the ~2b music listening events in .tsv format, with the following columns:
Field        Type                             Meaning
User Id      Integer                          Unique user Id
Country      String                           Two-letter country code of the user
Age          Integer                          Age of the user
Gender       String                           Gender of the user as specified on Last.fm
Track Name   String                           Name of the track
Artist Name  String                           Name of the artist
Timestamp    Timestamp <YYYY-MM-DD 00:00:00>  Timestamp of the listening event
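As a minimal sketch of how the joined file could be consumed, the following streams rows one at a time rather than loading the full ~132.7GB into memory. The column order follows the table above. The absence of a header row, the assumption that integer fields are always populated, and the names ListeningEvent and iter_events are illustrative assumptions, not part of the dataset documentation.

```python
import csv
import io
from collections import Counter
from datetime import datetime
from typing import Iterator, NamedTuple, TextIO


class ListeningEvent(NamedTuple):
    user_id: int
    country: str
    age: int
    gender: str
    track_name: str
    artist_name: str
    timestamp: datetime


def iter_events(handle: TextIO, has_header: bool = False) -> Iterator[ListeningEvent]:
    """Yield one ListeningEvent per tab-separated line."""
    # QUOTE_NONE: assume track and artist names may contain raw quote characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # assumption: set this if the file starts with a header
    for row in reader:
        if len(row) != 7:
            continue  # skip malformed lines rather than guessing their layout
        user_id, country, age, gender, track, artist, ts = row
        yield ListeningEvent(
            int(user_id), country, int(age), gender, track, artist,
            datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        )


# Tiny in-memory sample standing in for LFM-2b-preview.tsv (values are made up).
sample = io.StringIO(
    "1\tAT\t25\tm\tSong A\tArtist X\t2019-05-01 12:30:00\n"
    "1\tAT\t25\tm\tSong B\tArtist Y\t2019-05-01 12:34:10\n"
    "2\tUS\t31\tf\tSong A\tArtist X\t2019-05-02 08:00:00\n"
)
events_per_user = Counter(event.user_id for event in iter_events(sample))
print(events_per_user)
```

For the real file, the same generator could be fed an open text handle in place of the in-memory sample.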

Adapted for Track Recommendation

The dataset is also available as three separate files for easier integration into recommender systems:

user_track_playcount.zip (~8.6GB)
Also available as a sparse matrix (user x track) in COOrdinate format:
user_track_playcount_coo_matrix.npz (~6.3GB)
Field      Type     Meaning
User Id    Integer  User Id
Track Id   Integer  Id of the pair Track Name and Artist Name
Playcount  Integer  Number of times the user listened to the Track overall
song_ids.zip (~2.6 GB)
Field        Type     Meaning
Track Id     Integer  Track Id
Artist Name  String   Full artist name
Track Name   String   Full track title
user_demographics.zip (4.0 MB)
Field                   Type                             Meaning
User Id                 Integer                          User Id
Country                 String                           Two-letter country code of the user
Age                     Integer                          Age of the user
Gender                  String                           Gender of the user as specified on Last.fm
User registration date  Timestamp <YYYY-MM-DD 00:00:00>  The creation date of the corresponding Last.fm account
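The COOrdinate format mentioned above stores a sparse matrix as three parallel arrays (row indices, column indices, values), which maps directly onto the (User Id, Track Id, Playcount) triples. The sketch below builds those arrays in plain Python. Treating the Ids as matrix indices is an assumption, and the helper names triples_to_coo and coo_to_user_rows are hypothetical. If the .npz file was written with scipy.sparse.save_npz (not confirmed here), scipy.sparse.load_npz should be able to read it directly.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[int, int, int]  # (user_id, track_id, playcount)


def triples_to_coo(triples: Iterable[Triple]) -> Tuple[List[int], List[int], List[int]]:
    """Split triples into the three parallel COO arrays: rows, cols, values."""
    rows: List[int] = []
    cols: List[int] = []
    values: List[int] = []
    for user_id, track_id, playcount in triples:
        rows.append(user_id)    # assumption: User Id doubles as the row index
        cols.append(track_id)   # assumption: Track Id doubles as the column index
        values.append(playcount)
    return rows, cols, values


def coo_to_user_rows(
    rows: List[int], cols: List[int], values: List[int]
) -> Dict[int, Dict[int, int]]:
    """Group COO entries by user; duplicate (row, col) entries are summed."""
    by_user: Dict[int, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for row, col, value in zip(rows, cols, values):
        by_user[row][col] += value
    return {user: dict(tracks) for user, tracks in by_user.items()}


# Made-up triples standing in for rows of user_track_playcount.
triples = [(0, 10, 3), (0, 42, 1), (1, 10, 7)]
rows, cols, values = triples_to_coo(triples)
user_rows = coo_to_user_rows(rows, cols, values)
print(user_rows[0])
```

The per-user dictionaries are convenient for small experiments; for the full matrix, a dedicated sparse-matrix library is likely the better fit.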

Adapted for Artist Recommendation

For artist recommendation, we provide the analogous files:

user_artist_playcount.zip (~1.8GB)
Also available as a sparse matrix (user x artist) in COOrdinate format:
user_artist_playcount_coo_matrix.npz (~1.2GB)
Field      Type     Meaning
User Id    Integer  User Id
Artist Id  Integer  Unique Id of the Artist
Playcount  Integer  Number of times the user listened to the Artist's tracks overall
artist_ids.zip (~170MB)
Field        Type     Meaning
Artist Id    Integer  Artist Id
Artist Name  String   Full name of the artist
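The two artist files are linked through Artist Id. As an illustration, the following sketch joins made-up playcount triples with an Id-to-name mapping to list one user's most-played artists. The function name top_artists and the in-memory data structures are assumptions chosen for the example, not something the dataset prescribes.

```python
import heapq
from typing import Dict, Iterable, List, Tuple


def top_artists(
    playcounts: Iterable[Tuple[int, int, int]],  # (user_id, artist_id, playcount)
    artist_names: Dict[int, str],                # artist_id -> artist name
    user_id: int,
    k: int = 3,
) -> List[Tuple[str, int]]:
    """Return the k most-played artists of one user as (name, playcount)."""
    own = ((count, artist_id) for uid, artist_id, count in playcounts if uid == user_id)
    best = heapq.nlargest(k, own)  # sorted by playcount, highest first
    return [(artist_names.get(artist_id, "<unknown>"), count) for count, artist_id in best]


# Made-up rows standing in for artist_ids and user_artist_playcount.
artist_names = {1: "Artist X", 2: "Artist Y", 3: "Artist Z"}
playcounts = [(7, 1, 12), (7, 2, 40), (7, 3, 5), (8, 1, 99)]
favourites = top_artists(playcounts, artist_names, user_id=7, k=2)
print(favourites)
```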

Available Files

File                        Size  md5sum                            Records        Fields
albums.tsv.bz2              245M  938e232f0d5d7a9162487088829378ed  24,237,348     album_id, album_name, artist_name
artists.tsv.bz2             45M   0fcaea92c8c2fb1e247c5a3d0d4e8e3e  5,159,580      artist_id, artist_name
listening-events.tsv.bz2    14G   cebd1047535d562a67801377ca2db0e4  2,014,164,872  user_id, track_id, album_id, timestamp
spotify-uris.tsv.bz2        94M   218ee272401dd4e554755096e0bdaa9f  4,624,359      track_id, uri
tracks.tsv.bz2              641M  d89ca3c4d5344a6166da5cf5305e71fe  50,813,373     track_id, artist_name, track_name
listening-counts.tsv.bz2    2.3G  9c761797d89640e2137670598031f577  519,293,333    user_id, track_id, count
users.tsv.bz2               797K  0aca8ab3a67ea71b1422d28c6c76d834  120,322        user_id, country, age, gender, creation_time
lyrics-features.json.bz2    4.3G  0e5074ac809b1a0495841af5c3991a7f  1,266,554      features{...}
tags.json.bz2               142M  f49891c59a7028a1fb3b798544e6edeb  2,230,814      <tag, weight>+
tags-micro-genres.json.bz2  38M   e937e2d0317323c77e3e4f02b4d5be5b  1,638,468      <micro-genre, weight>+
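The md5sum column can be used to check that a download completed intact. A minimal sketch using Python's standard hashlib follows. It reads the file in chunks so that multi-gigabyte archives need not fit in memory. The helper names md5_of_file and matches_listed_md5 are hypothetical, and the demo runs on a small temporary file rather than on a real archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a value copied from the table above."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Demo on a throwaway file; replace with a downloaded archive in practice.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_digest = md5_of_file(demo_path)
demo_ok = matches_listed_md5(demo_path, demo_digest.upper() + "\n")
os.remove(demo_path)
print(demo_digest, demo_ok)
```

With a real download, a call such as matches_listed_md5("users.tsv.bz2", "0aca8ab3a67ea71b1422d28c6c76d834") would compare the local file against the checksum listed in the table.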

Format Clarifications

artists
names of 5,159,580 artists.
albums
names of 24,237,348 albums, together with the names of their artists.
tracks
names of 50,813,373 tracks, together with the names of their artists.
users
information on 120,322 users: country, age, gender, and creation time. Country is given as an ISO 3166-1 alpha-2 country code and is empty if unknown. Age is the age of the user, or -1 if unknown. Gender is "m" (male), "f" (female), or "n" (neutral), and is empty when no gender information is available. Creation time is the time at which the user profile was created.
listening-events
2,014,164,872 listening events (LEs), each consisting of the user ID, the track ID, the album ID, and the timestamp of the event. The artist can be looked up in tracks via the track_id column.
listening-counts
519,293,333 records, each giving the number of times a user has listened to a given track.
spotify-uris
the Spotify URIs of 4,624,359 tracks, which can be used for crawling audio features or additional metadata from Spotify. Note that URIs are only provided for those LFM-2b tracks that are also included in Spotify's catalog.
lyrics-features
1,266,554 records containing the lexical features, compression ratios, entropy values, and vector embeddings of the lyrics, covering the subset of tracks for which we could retrieve lyrics.
tags
for a subset of 2,230,814 tracks, the user-generated tags are provided. Each of these tracks is annotated by users with one or more tags in the form of <tag, weight> pairs (tags = <tag, weight>+). Weights are integers between 1 and 100. The tag with the most annotations for a given track gets a weight of 100; the weight of every other tag is its annotation count expressed as a percentage of that most common tag's count, rounded to the nearest integer. Overall, there are 1,041,819 unique tags in the dataset.
tags-micro-genres
a subset of tags with 1,638,468 records that contains only micro-genres, i.e. fine-grained indications of musical genres or styles.
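To make the tag weighting described under tags concrete, the sketch below scales hypothetical raw annotation counts so that the most common tag receives 100. The raw counts are not part of the released files, so the input here is made up. The function name tag_weights, the clamp to a minimum of 1, and the use of Python's built-in round (which rounds exact halves to the nearest even integer) are assumptions; the original computation may differ in such details.

```python
from typing import Dict


def tag_weights(annotation_counts: Dict[str, int]) -> Dict[str, int]:
    """Scale raw annotation counts so the most common tag gets weight 100."""
    if not annotation_counts:
        return {}
    top = max(annotation_counts.values())
    if top <= 0:
        return {}  # no usable annotations, nothing to normalize
    weights: Dict[str, int] = {}
    for tag, count in annotation_counts.items():
        scaled = round(100 * count / top)  # percentage of the most common tag
        weights[tag] = max(1, scaled)      # assumption: clamp to the documented minimum of 1
    return weights


# Made-up annotation counts for a single track.
counts = {"rock": 80, "indie": 40, "live": 1}
weights = tag_weights(counts)
print(weights)
```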

Code

The code used in the submission to the Information Processing & Management journal can be found on GitHub.

last edited by ol on 2021-11-03