A Mario ML Diffusion Model

Created Tuesday, May 23, 2023

Motivation

It is clear that generative AI has improved immensely in the last few years, even months. Mario Maker 2, however, poses an interesting problem for current technology. A level generator:

  • Must produce a playable level
  • Must stay within entity limits

And most important of all:

  • Must produce discrete object ID values

This last point poses an interesting problem. Current diffusion models, the most promising class of generative image models, start with noise and nudge each pixel's color in small increments, easing the image into something usable. Some of the benefits of this approach include:

  • Configurable starting noise, so images are effectively seedable
  • Configurable iteration count, so generation can stop early for performance

Diffusion models depend on the ability to tweak values within the matrix. In Mario Maker 2, object IDs are integers; there is no gradual tweaking towards an object ID. Can I solve this problem while still preserving the properties of seedability and configurable iterations?

Model

The model is a neural network with an input of i × j neurons, where i is the number of possible object IDs and j is the number of adjacent objects included in the training data. The output is a probability distribution over the i object IDs, and the highest-probability ID within the allowable range is chosen. The number of adjacent objects to train on is configurable and is called the context size in this model. The visual below demonstrates how a context size of 3 is trained, where 3 refers to the 3x3 square surrounding the object ID the model is trying to output.

[Figure: Model context size]
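To make the architecture above concrete, here is a minimal sketch of such a network, assuming a small PyTorch MLP over one-hot encoded inputs. The names (ContextModel, NUM_IDS, CONTEXT_SIZE) and the hidden-layer width are my own illustrative choices, not details from the actual model.

```python
import torch
import torch.nn as nn

NUM_IDS = 64        # i: number of possible object IDs (hypothetical value)
CONTEXT_SIZE = 3    # c: side length of the square context window
NUM_CELLS = CONTEXT_SIZE * CONTEXT_SIZE  # j: adjacent objects per sample

class ContextModel(nn.Module):
    def __init__(self, num_ids=NUM_IDS, num_cells=NUM_CELLS):
        super().__init__()
        # Input: each context cell one-hot encoded over the i object IDs,
        # flattened to i * j values. Output: logits over the i object IDs.
        self.net = nn.Sequential(
            nn.Linear(num_ids * num_cells, 256),
            nn.ReLU(),
            nn.Linear(256, num_ids),
        )

    def forward(self, context_onehot):
        # context_onehot: (batch, num_cells * num_ids)
        return self.net(context_onehot)  # logits; softmax gives the distribution
```

One-hot encoding keeps the object IDs categorical, so the network never treats the integer values as magnitudes, which sidesteps exactly the discreteness problem raised above.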

Inference Idea

The theory behind inference in this model comes from a background in cellular automata, where each cell's next state is decided by the objects around it. This approach was applied to Mario Maker 2 by training on a configurable context size of adjacent objects and learning what the expected middle object should be. The level starts as random noise drawn from the allowable objects, and this inference is performed for every object in the level, where any context that stretches off the side of the level is treated as air. This is called one iteration. The same process is then performed on the objects generated by the previous iteration until n iterations have been reached. This approach is a blend of cellular automata and diffusion, sketched below.
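Here is a sketch of that loop, building on the previous snippet (ContextModel, NUM_IDS, CONTEXT_SIZE). AIR_ID and the helper names are assumptions of mine; the out-of-bounds-as-air rule and the air-probability noise parameter come from the post (see the example captions below).

```python
import random
import torch
import torch.nn.functional as F

AIR_ID = 0  # hypothetical ID used for air

def random_level(width, height, allowed_ids, air_probability, rng):
    # Starting noise: each cell is air with probability air_probability,
    # otherwise a uniformly random allowed object.
    return [[AIR_ID if rng.random() < air_probability else rng.choice(allowed_ids)
             for _ in range(width)] for _ in range(height)]

def context_window(level, x, y, c):
    # Gather the c x c window centered on (x, y); cells that fall off the
    # edge of the level are treated as air, per the scheme above.
    h, w = len(level), len(level[0])
    r = c // 2
    return [level[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else AIR_ID
            for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

@torch.no_grad()
def iterate(model, level, allowed_ids, num_ids, c):
    # One iteration: re-predict every cell from the previous iteration's level.
    h, w = len(level), len(level[0])
    out = [[AIR_ID] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = torch.tensor(context_window(level, x, y, c))
            onehot = F.one_hot(window, num_ids).float().flatten().unsqueeze(0)
            probs = model(onehot).softmax(dim=-1).squeeze(0)
            # Highest-probability ID within the allowable set. Whether air
            # itself is selectable is a configuration detail; it is excluded here.
            out[y][x] = max(allowed_ids, key=lambda i: probs[i].item())
    return out

rng = random.Random(1234)  # fixed seed, so the level is reproducible
level = random_level(100, 30, [4, 5, 8], air_probability=0.85, rng=rng)
model = ContextModel()     # trained weights would be loaded here
for _ in range(30):        # n iterations, as in the examples below
    level = iterate(model, level, [4, 5, 8], NUM_IDS, CONTEXT_SIZE)
```

Because the starting noise comes from a seeded rng and the iteration count is a plain loop bound, the seedability and configurable-iterations properties of diffusion carry over directly.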

Results

This approach inevitably began to prioritize not the objects that actually fit a specific circumstance but the objects that simply appeared most often in the training data. Multiple iterations did not improve the output; they just caused a small number of the allowed objects to crowd out all other potential objects. One approach I experimented with was capping the number of training samples per object, so that each object contributed the same number of samples. This did not fix the overcrowding; it just led to rarer objects becoming dominant instead. A sketch of that experiment follows.
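For reference, here is a sketch of that balancing step, assuming the training data is held as a flat list of (context, target ID) pairs; the function and parameter names are hypothetical.

```python
import random
from collections import defaultdict

def cap_samples_per_target(samples, limit, rng):
    # samples: list of (context, target_id) training pairs.
    # Keep at most `limit` randomly chosen samples per target ID, so that
    # no single object ID dominates the training set.
    by_target = defaultdict(list)
    for context, target in samples:
        by_target[target].append((context, target))
    capped = []
    for group in by_target.values():
        rng.shuffle(group)
        capped.extend(group[:limit])
    rng.shuffle(capped)
    return capped
```

Setting `limit` at or below the rarest ID's sample count makes every ID contribute equally, which is the equal-samples condition described above.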

Examples

7x7, no limit, no air, 5000 levels, width 100, height 30, iterations 30, air probability 0.85, allowed objects [4, 5, 8]

[Figure: Example generation 1]

7x7, no limit, no air, 5000 levels, width 100, height 30, iterations 30, air probability 0.85, allowed objects [4, 5, 6, 8]

[Figure: Example generation 2]

Questions?

Use the Contact button to the side or join my Discord.