Predicting Roblox Popularity: Why Dislikes Matter More Than Likes

Predicting Roblox popularity means using data like player counts, likes, dislikes, and ratings to figure out how many visits a game will get. Researchers built a machine learning model that correctly explains about 78% of why some Roblox games blow up while others don't.

The Quick Answer: What Actually Makes a Roblox Game Popular?

Here's the thing that surprised everyone: dislikes matter more than likes.

The Random Forest model found these importance scores:

Feature	Importance Score
Dislikes	0.48
Likes	0.25
Active Players	0.22
Favorites	0.03
Rating	0.02

So dislikes have almost twice the predictive power of likes. Why? Games that get lots of dislikes also tend to get lots of attention. They're controversial. They're talked about. And controversy drives traffic, even when it's negative.

The Dataset: 1,000 Top Roblox Games

The study pulled data from the top 1,000 games on Roblox. Each game had 8 columns of data:

Rank (position in the top 1000)
Name (game title)
Active (current players online)
Visits (total visits, this is what we're predicting)
Favourites (how many people saved it)
Likes
Dislikes
Rating (calculated from likes vs dislikes)

The top game, Blox Fruits, had 483,372 active players, 41.3 billion visits, 13.6 million favourites, and 8.5 million likes at the time of data collection.

How Random Forest Works (The Math)

Random Forest is basically a bunch of decision trees that vote on the answer. You train 150 different trees, each one looks at the data slightly differently, and then you average their predictions.

The Core Formula

The expected prediction for a new input X looks like this:

Where:

A_n(X,Θ) is the terminal region (leaf node) containing X under randomization parameter Θ
N_n(X,Θ) is the number of training samples falling within that region
E_Θ is the expectation taken over all possible randomizations of tree structures

In plain English: the model finds all training games that landed in the same "bucket" as your new game, then averages their visit counts.

Why It Actually Works

The theoretical guarantee is that as your dataset grows infinitely large, the Random Forest estimator converges to the true regression function:

This means the predictions get closer and closer to reality as you feed more data into the model.

Building Each Tree

For each tree t in the forest:

Draw a bootstrap sample D_t from the original dataset
Construct a decision tree using randomized feature selection
For each terminal node, calculate the local prediction:

Aggregate all tree predictions to get the ensemble output:

The study used 150 trees with a maximum depth of 15, which balances accuracy against overfitting.

The Correlations That Matter

Before building the model, the researchers checked which features actually correlate with visits.

Strong positive correlations:

Likes and Visits: 0.77
Active Players and Visits: 0.75
Dislikes and Visits: 0.81

Interesting negative correlation:

Likes and Dislikes: -0.87

That last one makes sense. Games people love don't usually get hated. But here's the weird part: both likes AND dislikes correlate positively with visits. More feedback of any kind means more traffic.

Weak correlations:

Rating and Visits: 0.04
Rating and Likes: 0.18

Rating barely matters for predicting visits. A game with an 80% rating and a game with a 95% rating can have wildly different visit counts.

Model Performance Numbers

The model was trained on 800 games (80%) and tested on 200 games (20%).

R² Score: 0.7814

This means the model explains 78.14% of the variance in visit counts. That's pretty solid for predicting something as chaotic as game popularity.

Out of Bag Score: 0.5676

Random Forest has a built in validation trick. Each tree only sees about 63% of the data due to bootstrap sampling, so you can test each tree on the data it didn't see. The OOB score of 0.57 shows the model learned real patterns, not just memorized the training data.

Error Metrics:

Mean Squared Error (MSE): 544,899,627,096,328,000
Root Mean Squared Error (RMSE): 738,173,169
Mean Absolute Error (MAE): 208,532,159

These numbers look massive because visits range from tens of thousands to billions. The MAE tells you that on average, predictions are off by about 208.5 million visits. For games with billions of visits, that's acceptable. For smaller games, the model struggles more.

Why Dislikes Beat Everything Else

This is the controversial finding. Dislikes had an importance score of 0.48, nearly double the next feature.

Think about it this way: games that attract enough attention to get lots of dislikes are games people actually played. They're games people have opinions about. And in the Roblox algorithm, engagement is engagement.

A game with 100,000 likes and 50,000 dislikes has way more momentum than a game with 10,000 likes and 200 dislikes. The first game is being talked about. The second is being ignored.

This doesn't mean you should try to make a bad game. But it does mean that playing it safe and making something nobody has strong feelings about is worse for visibility than making something polarizing.

The Distribution Problem

Most Roblox games are invisible. The data shows a heavily right skewed distribution:

Mean active players: 5,736
Standard deviation: 26,584
Range: from a few hundred to 483,372

A few games dominate player engagement completely. The histogram shows most games clustered near zero, with a long tail stretching toward the outliers.

Same pattern for visits:

Mean visits: 636 million
Range: 28,198 to 55.6 billion
Standard deviation: 2.9 billion

The top games have billions of visits. Most games have millions. The gap between "successful" and "mega successful" is enormous.

Limitations You Should Know About

The model struggles with extremes. Games with very high or very low visit counts get predicted less accurately. The scatter plot of actual vs predicted visits shows more spread at the tails.

The data is only publicly available metrics. Session duration, in game purchases, player retention rates, and other internal metrics weren't included. Roblox doesn't share those publicly, and they probably matter a lot.

Data sparsity for smaller games. The dataset only includes the top 1,000 games, so we don't know how well this model would work for predicting success of brand new games or games ranked 5,000th.

Correlation isn't causation. Dislikes correlate with visits, but getting more dislikes won't magically boost your visits. Popular games get both likes and dislikes because they're popular, not the other way around.

What Game Developers Should Actually Do

Based on the feature importance analysis:

Focus on generating strong reactions. Neutral games get ignored. Games that make people feel something, even if some of that feeling is negative, get talked about.
Active player count matters. It's the third most important feature. Games that can maintain concurrent players signal to the algorithm (and to potential new players) that something worth playing is happening.
Don't obsess over your rating percentage. Rating had an importance score of 0.02. A 92% rated game doesn't automatically beat an 85% rated game in visits.
Favourites don't drive traffic much. Importance score of 0.03. People favourite things and forget about them.

The Technical Setup

For anyone wanting to replicate this:

Python libraries used:

pandas for data loading
scikit learn for the Random Forest Regressor
matplotlib for visualization
joblib for model serialization

Model configuration:

n_estimators = 150 (number of trees)
max_depth = 15 (how deep each tree can grow)
n_jobs = -1 (use all CPU cores)
random_state = fixed seed for reproducibility

Data preprocessing steps:

Remove non numeric characters (commas, hashtags) from numeric columns
Convert strings to integers/floats
Drop rows with missing values using dropna()
Exclude Name and Rank columns from features (they don't predict anything useful)

Future Research Directions

The paper suggests several ways to build on this work:

Add player behavior data like average session length and return rates
Test other algorithms like Gradient Boosting or Neural Networks
Include time series data to see how games grow over time instead of just snapshots
Analyze genre effects to see if certain game types have different success patterns

The Bottom Line

You can predict Roblox game popularity with decent accuracy using publicly available metrics. The Random Forest model explains about 78% of why games succeed. And the biggest surprise is that negative feedback (dislikes) is the strongest predictor, probably because it indicates engagement rather than quality.

The math works. The model is reproducible. And the insight about dislikes challenges the assumption that you need to make everyone happy to succeed on Roblox.