Created Sunday, September 25, 2022
Finally, after over 7 months, the Mario Maker 2 data I collected over a full month in February 2022 is released. As I mentioned in the API blog post, I collected this data mainly in anticipation of doing data analysis and archiving data that might otherwise be lost. Under my care, and as a 2.3 terabyte SQLite file, neither of those things could effectively happen. The dataset was:
I’ve made two repositories, one solely for graphs (it contained the original list of 2019 uncleared levels played by YouTubers) and the other a replay of every ninji (a project I’m still working on). Both still hit the massive issue of file IO being slow on such a big database. Progress was slow and not rewarding. No one could enjoy the data, especially when I, the creator, couldn’t enjoy it!
The eventual, inevitable conclusion was that the database was too big! Well, none of the fields are truly useless, right? Right? Well yeah, actually, every column was useful in its own way. One column in particular was very important, but the level thumbnails took up over 90% of the database file size and could possibly be regenerated with level viewers, so they had to go. It was a hard loss, but the appeal of mass usage is higher than having a bunch of pretty pictures.
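For anyone facing the same problem, here is a minimal sketch of how one might measure which column dominates a SQLite file and then copy the table without it. The `level` table and its `data_id`, `name`, and `thumbnail` columns are hypothetical stand-ins, not the real schema, and the byte counts are only payload estimates that ignore SQLite's page overhead.

```python
import sqlite3

# Hypothetical schema: a `level` table with a large `thumbnail` BLOB column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE level (data_id INTEGER PRIMARY KEY, name TEXT, thumbnail BLOB)"
)
conn.executemany(
    "INSERT INTO level VALUES (?, ?, ?)",
    [(i, f"level {i}", bytes(1000)) for i in range(100)],
)


def column_bytes(conn, table, column):
    """Approximate payload bytes held by one column (ignores page overhead)."""
    # Identifiers cannot be bound as parameters, so only pass trusted names.
    # CAST to BLOB makes LENGTH() count bytes instead of characters.
    row = conn.execute(
        f'SELECT COALESCE(SUM(LENGTH(CAST("{column}" AS BLOB))), 0) FROM "{table}"'
    ).fetchone()
    return row[0]


# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute('PRAGMA table_info("level")')]
sizes = {c: column_bytes(conn, "level", c) for c in columns}
share = sizes["thumbnail"] / sum(sizes.values())
print(f"thumbnail share of payload: {share:.1%}")

# Copy every column except the thumbnails into a slimmer table, then drop
# the original. VACUUM rebuilds the file so the freed pages are released.
kept = [c for c in columns if c != "thumbnail"]
quoted = ", ".join(f'"{c}"' for c in kept)
conn.execute(f"CREATE TABLE level_slim AS SELECT {quoted} FROM level")
conn.execute("DROP TABLE level")
conn.commit()  # VACUUM cannot run inside an open transaction
conn.execute("VACUUM")
```

On a multi-terabyte file the same queries would take a long time, and `VACUUM` needs enough free disk space for a temporary copy, so copying the kept columns into a fresh database file may be the more practical route.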
So now that the database is manageable in size, how do I distribute it? And to whom? Two solutions stuck out to me:
Both were fine, but the latter seemed especially interesting to me. The dataset being officially on Huggingface lends some credence to it and (hopefully) establishes a precedent of having video game data available to data scientists. I’ll mention below why I think data scientists could benefit from more video game data, but first: how does this fragile SQLite database become a clean, easily accessible dataset on Huggingface?
Data science is dominated by two file formats:
Okay, so the choice seems obvious… what’s the hold up? Well, SQLite and Parquet are very different. SQLite is a traditional database with tables and other constructs, while Parquet is column-based and stores only one table per file, without the option to append to it. The two file formats are in completely different universes, and finding tools to convert from one to the other was difficult.
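To make the mismatch concrete, here is a small sketch (not the tool I ended up using) of the reshaping any such converter has to do: walk each SQLite table separately and turn batches of rows into per-column lists. The `user` and `level` tables and their columns are made-up examples, and the function names are mine.

```python
import sqlite3


def table_names(conn):
    """List user tables; names starting with 'sqlite' are SQLite's own."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    return [r[0] for r in rows]


def iter_column_batches(conn, table, batch_size=50_000):
    """Yield one dict per batch of rows, mapping column name -> list of values."""
    # Identifiers cannot be bound as parameters, so only pass trusted names.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    names = [d[0] for d in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # zip(*rows) transposes row tuples into one tuple per column.
        yield dict(zip(names, (list(col) for col in zip(*rows))))


# Tiny made-up database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (pid INTEGER, name TEXT)")
conn.execute("CREATE TABLE level (data_id INTEGER, uploader_pid INTEGER)")
conn.executemany("INSERT INTO user VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO level VALUES (?, ?)", [(10, 1), (11, 3)])

# One output "file" per table, filled batch by batch.
for table in table_names(conn):
    for batch in iter_column_batches(conn, table, batch_size=2):
        print(table, batch)
```

Because a finished Parquet file cannot be appended to, a real converter would keep one writer open per table and write each batch as it arrives (for example with `pyarrow.parquet.ParquetWriter`), closing the file only once the table is exhausted.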
I eventually settled on this project with 1 GitHub star… it filled an invaluable niche, am I the only one who’s ever seen this??? In any case, I let the program run for a couple of days and it spat out 15 pristine Parquet files, ready to be uploaded to Huggingface so the world could finally enjoy and use the biggest collection of video game data designed for data analysis (or at least the biggest publicly accessible one).
We’ve got 15 datasets here, 15 datasets that were originally created 7 months ago and are now accessible to all of you!
This is how the various datasets are related to one another:
(Figure: datasets schema)
The way I see it, we’ve never had something so comprehensive from a video game with user-created content before. Minecraft skins could indeed be considered more comprehensive, but Mario Maker 2 allows for a lot more creativity than a skin image. Some uses I can see are:
And some even more interesting concepts:
I invite anyone to answer these questions, and your own if you want to. I hope you enjoy!
Firstly, the EULA for Mario Maker 2 states:
you grant Nintendo and its affiliates and subsidiaries a worldwide, royalty-free, irrevocable, perpetual, non-exclusive, and sublicenseable license to use, reproduce, modify, adapt, publish, translate, distribute, perform, and display all or any portion of your User Generated Content, and to incorporate your User-Generated Content in other works, in any form, media or technology now known or later developed, including for promotional or marketing purposes, without any payment to you.
While Nintendo does have ownership over this data, there is precedent for publishing it in this way. In addition, no game data from Mario Maker 2 itself, like assets in the game, is present in this dataset. There is also precedent for file formats not being copyrightable. Distributing reverse-engineered files, as level data is, should not be a problem.
The Mario Maker 2 Dataset Collection is available to researchers and data scientists under the Attribution-NonCommercial-ShareAlike 4.0 International license. In case of publication and/or public use, including any dataset derived from it, one should acknowledge its creators by citing this blog post. A BibTeX entry is as follows:
@inproceedings{mariomaker2:dataset,
  author = {TheGreatRambler},
  title = {Mario Maker 2 Datasets},
  days = {24},
  month = {sep},
  year = {2022},
  url = {https://tgrcode.com/posts/mario_maker_2_datasets}
}
Visit my Discord to ask questions about these datasets in particular, about any of my other projects, or for any reason at all!