Finally!

Finally, after more than 7 months, the Mario Maker 2 data I collected over a full month in February 2022 has been released. As I mentioned in the API blog post, I collected this data mainly in anticipation of doing data analysis and archiving data that might otherwise be lost. Under my care, and as a 2.3 terabyte SQLite file, neither of those things could effectively happen. The dataset was:

Too big to iterate over with code: a full sweep of every level took a day on file IO alone
Too big to effectively share with people, making it useless as an archive

Previous attempts

I've made two repositories, one solely for graphs (it contained the original list of 2019 uncleared levels played by YouTubers) and the other a replay of every ninji (a project I'm still working on). Both still hit the massive issue of file IO being slow on such a big database. Progress was slow and not rewarding. No one could enjoy the data, especially when I, the creator, couldn't enjoy it!

Too big?

The eventual, inevitable conclusion was that the database was too big! Well, none of the fields are truly useless, right? Right? And it's true: every column was useful in its own way. One column in particular, though, had to go: the level thumbnails. They were very important, but they took up over 90% of the database's file size and could possibly be regenerated with level viewers. It was a hard loss, but the appeal of mass usage is higher than that of having a bunch of pretty pictures.
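For anyone wanting to check how much of a SQLite file a single column accounts for, here is a minimal sketch using only Python's standard library. The table name `level` and the columns `name` and `thumbnail` are hypothetical stand-ins, and the result only compares the summed payload bytes of the listed columns (it ignores indexes and page overhead). It is an illustration, not the measurement behind the 90% figure above.

```python
import sqlite3


def column_byte_share(conn, table, columns):
    """Return each column's fraction of the summed payload bytes of `columns`."""
    totals = {}
    for col in columns:
        # LENGTH() counts bytes for BLOBs; casting makes TEXT count bytes too.
        # SUM() skips NULLs, and COALESCE covers an empty table.
        query = (
            f'SELECT COALESCE(SUM(LENGTH(CAST("{col}" AS BLOB))), 0) '
            f'FROM "{table}"'
        )
        totals[col] = conn.execute(query).fetchone()[0]
    grand_total = sum(totals.values()) or 1  # avoid dividing by zero
    return {col: size / grand_total for col, size in totals.items()}


# Tiny in-memory demo with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE level (name TEXT, thumbnail BLOB)")
conn.execute("INSERT INTO level VALUES (?, ?)", ("abcd", bytes(96)))
shares = column_byte_share(conn, "level", ["name", "thumbnail"])
print(shares)
```

On a multi-terabyte file this still has to read every row once, so it is the kind of query to run a single time before deciding what to drop.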

Machine learning?

So now that the database is manageable in size, how do I distribute it? And to whom? Two solutions stuck out to me:

Create a torrent and spread it around, hoping the parts being seeded over time would allow for mass adoption without increasing my internet bill
Upload the dataset to Huggingface and appeal to data scientists, who would know best about how to make the data useful

Both were fine, but the latter seemed especially interesting to me. The dataset being officially on Huggingface lends it some credibility and (hopefully) establishes a precedent of having video game data available to data scientists. I'll mention below why I think data scientists could benefit from more video game data, but first: how does this fragile SQLite database become a clean, easily accessible dataset on Huggingface?

Getting it on there

Data science is dominated by two file formats:

CSV: Massively inefficient, doesn't natively support binary data
Apache Parquet: One of the most efficient file formats on the planet, supports binary data

Okay, so the choice seems obvious... what's the holdup? Well, SQLite and Parquet are very different. SQLite is a traditional database with tables and other constructs. Parquet is column-based and stores only one table per file, with no option to append to it. The two file formats are in completely different universes, and finding tools to convert from one to the other was difficult.

I eventually settled on this project with 1 GitHub star... it filled an invaluable niche, am I the only one who's ever seen this??? In any case, I let the program run for a couple of days and it spat out 15 pristine Parquet files, ready to be uploaded to Huggingface. The world could finally enjoy and use the biggest collection of video game data designed for data analysis (or at least the biggest publicly accessible one).
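To make the row-versus-column gap concrete, here is a minimal sketch of the core step any such converter has to perform: reading rows out of SQLite in bounded batches and pivoting each batch into per-column lists. It is not the tool mentioned above; the function name `iter_column_batches` and the demo table are hypothetical, and only the standard library is used.

```python
import sqlite3


def iter_column_batches(conn, table, batch_size=50_000):
    """Yield {column_name: [values, ...]} dicts, one per batch of rows."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    names = [description[0] for description in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)  # keeps memory use bounded
        if not rows:
            break
        # zip(*rows) transposes the row tuples into one tuple per column.
        yield {name: list(values) for name, values in zip(names, zip(*rows))}


# Tiny in-memory demo with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(i, f"level{i}") for i in range(1, 6)])
for batch in iter_column_batches(conn, "demo", batch_size=2):
    print(batch)
```

Assuming the pyarrow library is available, each yielded dict has the shape `pyarrow.Table.from_pydict` accepts, and a `pyarrow.parquet.ParquetWriter` can write such tables one after another while the output file is still open, which works around the lack of appending to a finished file.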

The datasets

We've got 15 datasets here, 15 datasets that were originally created 7 months ago and are now accessible to all of you!

TheGreatRambler/mm2_level: The most useful dataset. Contains metadata and level data, both of which can be used for analysis or even generation!
TheGreatRambler/mm2_user: Contains every player at the time of scraping. Players' level preferences and online performance could be used to build a picture of what certain kinds of players prefer.
TheGreatRambler/mm2_ninji: Contains every ninji replay and its metadata from all 21 ninji courses. Seeing as Nintendo probably has no plans to ever host a ninji event again, this list is actually somewhat definitive. Player movements could be used to train a neural network to play optimally... combined with other input of course.
TheGreatRambler/mm2_world: Contains every super world, including its thumbnail and the as-yet-unknown file format that encodes the world map structure. It's a surprisingly creative feature that could be useful for determining how players tend to group levels.
TheGreatRambler/mm2_level_comments: Every comment on every level. Text comments and custom image comments are fantastic contenders for transformers and image synthesis respectively.
TheGreatRambler/mm2_level_played: Simply which players have played which levels. The largest dataset in the entire collection by rows: over 1 billion. None of the others come close. It's relatively simple, but it could probably be analyzed with graphs to make complex connections between every player in the game.
TheGreatRambler/mm2_level_deaths: Another simple dataset, just the locations of deaths in levels, but also very large at 300+ million rows. It can be used to determine what makes a level truly hard, and maybe even to train a neural network to avoid those spots when playing by making connections between dangerous level elements and safer ones.
TheGreatRambler/mm2_user_liked: Liked levels displayed on a user's profile in-game.
TheGreatRambler/mm2_user_posted: Uploaded levels displayed on a user's profile in-game. The row count should equal the total number of levels but is slightly smaller.
TheGreatRambler/mm2_user_first_cleared: First clears displayed on a user's profile in-game.
TheGreatRambler/mm2_user_played: Played levels displayed on a user's profile in-game.
TheGreatRambler/mm2_user_world_record: World records displayed on a user's profile in-game.
TheGreatRambler/mm2_user_badges: All user badges at the time of archival.
TheGreatRambler/mm2_ninji_level: The 21 ninji levels, for use with the mm2_ninji dataset.
TheGreatRambler/mm2_world_levels: Every super world level.
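Since several of these datasets are far too large to download casually, a streaming read is a reasonable way to get a first look. The sketch below assumes the Huggingface `datasets` library is installed and that the data sits in a split named `train` (the usual default, but an assumption here). The helper names `take_rows` and `stream_levels` are made up for this example.

```python
from itertools import islice


def take_rows(rows, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(rows, n))


def stream_levels(repo_id="TheGreatRambler/mm2_level", split="train"):
    """Open a dataset lazily; the split name is an assumption."""
    # Imported here so take_rows() works even without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the remote files instead of
    # downloading the whole dataset first.
    return load_dataset(repo_id, split=split, streaming=True)
```

Calling `take_rows(stream_levels(), 3)` would then fetch only the first three rows, which is enough to inspect the column names before committing to a full download.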

Schema

This is how the various datasets are related to one another:

Datasets schema

Uses

The way I see it, we've never had something so comprehensive from a video game with user-created content before. Minecraft skins could indeed be considered more comprehensive, but Mario Maker 2 allows for a lot more creativity than a skin image. Some uses I can see are:

Generating levels with a specific difficulty and an emphasis on certain enemies/blocks
Generating custom comment image art by training on data from mostly inexperienced artists
Sentiment analysis on comments to see what tends to anger players
Training a neural network to play Mario based on the limited ninji data

And some even more interesting concepts:

What do casual video game players prefer?
How do relationships between players manifest in video games with user-created content?
What aspects of game design are most engaging, keeping players returning for much longer than others do?
What events influence the popularity of a video game and can they be controlled?
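As one possible starting point for the question about relationships between players, here is a minimal sketch that turns (level, player) pairs, the kind of record mm2_level_played holds, into a weighted co-play graph. The pair layout and the IDs in the demo are hypothetical. The number of pairs grows quadratically with the number of players per level, so on the full billion-row dataset this would need sampling or a proper graph tool.

```python
from collections import defaultdict
from itertools import combinations


def coplay_edges(plays):
    """Count, for each pair of players, how many levels both have played."""
    players_by_level = defaultdict(set)
    for level_id, player_id in plays:
        players_by_level[level_id].add(player_id)

    weights = defaultdict(int)
    for players in players_by_level.values():
        # sorted() gives every pair one canonical order, so (a, b) and
        # (b, a) are counted as the same edge.
        for a, b in combinations(sorted(players), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical sample: two levels, three players.
sample = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c")]
print(coplay_edges(sample))
```

Edges with high weights connect players who keep turning up in the same levels, which is one crude signal of shared taste.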

I invite anyone to answer these questions, and your own if you want to. I hope you enjoy!

Legality

Firstly, the EULA for Mario Maker 2 states:

you grant Nintendo and its affiliates and subsidiaries a worldwide,
royalty-free, irrevocable, perpetual, non-exclusive, and
sublicenseable license to use, reproduce, modify, adapt, publish,
translate, distribute, perform, and display all or any portion of your
User Generated Content, and to incorporate your User-Generated
Content in other works, in any form, media or technology now
known or later developed, including for promotional or marketing
purposes, without any payment to you.

While Nintendo does have ownership over this data, there is precedent for publishing it in this way. In addition, no game data from Mario Maker 2 itself, such as in-game assets, is present in this dataset. There is also precedent for file formats not being copyrightable. Distributing reverse-engineered files, as level data is, should not be a problem.

Citation

The Mario Maker 2 Dataset Collection is available to researchers and data scientists under the Attribution-NonCommercial-ShareAlike 4.0 International license. In case of publication and/or public use, including any dataset derived from it, one should acknowledge its creators by citing this blog post. A BibTeX entry is as follows:

@misc{mariomaker2:dataset,
    author       = {TheGreatRambler},
    title        = {Mario Maker 2 Datasets},
    day          = {24},
    month        = {sep},
    year         = {2022},
    url          = {https://tgrcode.com/posts/mario_maker_2_datasets}
}

Questions?

Visit my Discord to ask questions about these datasets in particular, any of my other projects, or for any reason at all!

Credits

Kinnay, for providing NintendoClients which made this all possible
Wizulus, for providing expertise in hosting large APIs and safely sharing big data
Cakiki, for providing expertise and help in putting this data on Huggingface
Tatiaus, for providing initial help in designing the API
Gible_V
Funnier04
Warspyking
AppleSinger
TO4Dw