How Many Copies Will Dwarf Fortress Sell?
A forecasting experiment in statistics and sales, using Steam wishlists
This analysis was performed by a 3rd party economist, systemchalk, out of professional courtesy and curiosity. Kitfox hosts their analysis with permission in a spirit of knowledge-sharing and transparency, knowing it may prove incorrect. Please treat all parties involved with similar kindness and intellectual respect. It was originally written November 22, 2022.
tl;dr using wishlists and treating Dwarf Fortress like a past Kitfox release, Dwarf Fortress could sell roughly 160,000 units on Steam in the first two months.
Introduction and motivation
Wishlists are a popular measure to forecast sales, but have a mixed track record of success. With the upcoming release of Dwarf Fortress, now is an opportune time to analyze how successful wishlists have been at forecasting Kitfox’s previous releases, and to apply the most promising methods towards forecasting Dwarf Fortress’ sales. The results indicate that forecasts based on wishlists should be thought of as the “best available” source of forecasts, rather than a good source of forecasts, but there are some patterns that show promise.
What is the best method for forecasting?
Seven of the eight games Kitfox Games has published on Steam were considered using their net wishlists prior to release. A common method for forecasting sales from wishlists involves multiplying wishlists by some number (usually a ratio of total sales to total wishlists from past releases). However, there are multiple candidates for what to estimate, what wishlist information to use, and what time periods to consider.
Each combination of time period, sales, and wishlist measure was estimated and then ranked on its accuracy and on the spread between the lower and upper bounds of its confidence interval. In addition to total sales, median sales were considered, and total sales were then back cast from the median using patterns observed in previous releases.
The results from the 117 candidates were mixed. The lowest error was 52.52% of sales, while the worst performing method had an error of 99.67%. The best performing estimate that was credible enough to use in a useful forecast had an error of 56.39% and was the result of back casting from the median.
While the candidates used to forecast Dwarf Fortress are closer to the bottom end of the errors, it is important to recognize that these error rates exceed the common contingency of 30% frequently recommended to novice developers.
Key Takeaways: There is some evidence that better forecasts can be obtained by considering median sales instead of totals. However, the error rates of even the best forecasts reaffirm they are highly uncertain.
How many units will Dwarf Fortress sell?
Dwarf Fortress is close to release (December 6, 2022 at the time of this writing) and so the available wishlists should be reasonably close to the total prior to release. While the missing weeks of wishlists may bias the estimate downward, this is a useful case to see how well the method applies in practice without knowledge of what the actual sales are.
Four methods were considered and are described by the ratio they use. Each method forecasts the first two months of sales for Dwarf Fortress and reports the estimate, lower bound, upper bound, and the difference (spread) between the upper and lower bounds. The methods are ranked by the accuracy of the method for prior Kitfox releases, with the most accurate (57.72%) at the top, and the least (67.42%) at the bottom. Each method is labeled by the ratio used to calculate the estimate with sales as the numerator and wishlists as the denominator.
For example, Total/Mean means total sales are calculated based off the mean of wishlists, while Median/Total means median sales are calculated using total wishlists (and then back cast in the case of medians). Results:
Again, the results are mixed. Two of the results (including the one with the lowest error) estimate around 160,000 units sold and the average of all estimates is 163,979. Unfortunately, even with a tighter confidence interval than would be standard, there is quite a bit of variability in the results, with the spread for each estimate representing millions of dollars of revenue if Dwarf Fortress is priced comparably to other Kitfox releases.
While the spread in table 1 is wide, this is the simpler of the two cases. The danger when using the narrower confidence interval is that rarer but possible outcomes with both upward and downward surprises fall outside this range. When using the standard confidence interval, the range is even larger:
Table 2 covers a wide range of possible scenarios, but is almost as bad as no forecast at all. Even the most precise estimate has a range that spans over three quarters of a million units sold. This reflects the uncertainty present in any forecast based off of wishlists, especially based off of such a small sample.
What is important to note is that the middle values (estimate column) remain the same, and it is the range of values that shifts (with a wider range, we are more confident the true value will fall within it). Estimates like this are often more useful when reported with some kind of variance or confidence interval to express just how precise the estimate is. However, people often want a single figure and so it is common to report the middle value.
Key Takeaways: Based on historical analysis of wishlists, Dwarf Fortress is forecast to sell 162,905 units in its first two months, give or take 40,000 units (the upside is, in fact, much higher, as seen in Table 1). The wide ranges of forecasts reflect the significant uncertainty inherent in forecasting based off wishlists. This variability only considers the ‘best case’ for forecasting based off of wishlists and does not consider other factors such as Dwarf Fortress being a special case due to its size, availability as a game off of Steam, or time available as a wishlist.
Epilogue / Editor’s Note
Editor’s note, from Tanya of Kitfox:
This could also be headlined “So what IS the effect of the Steam algorithm in a ‘snowball’ effect, exactly?”, because I think we’re about to find out.
When I shared the estimation on Twitter, multiple people expressed they felt it was quite low. I agreed, and I’m likely to believe 200k is more likely than 120k, primarily due to the way that Steam appears to promote successfully selling games, and causes them to become increasingly more successful.
When I mentioned my feeling to systemchalk, they replied “even if it was 256,879 after two months it’d be considered in the range […] this is basically like trying to drive with the rearview mirror.”
Meanwhile, many smart folks believe Steam Followers is a more accurate tool to predict sales (in 2019 the factor prescribed was ~2.5, which came down to 2 at some point). And our Steam followers are presently around 120k, which comes out to something higher, but not too far off the statistical range. Just food for thought.
So there you have it! Let’s see how things go!
And for the exceptionally eager, here’s a more technical breakdown of the methods.
Technical Appendix for Nerds
This section is optional and goes into the specifics of what I did. The value in this is both to ‘check the work’ and to communicate some of the reasoning, rather than just have tables appear out of thin air.
What games were used?
The specific games used were: Shattered Planet, Moon Hunters, The Shrouded Isle, Six Ages: Ride Like the Wind, Lucifer Within Us, Boyfriend Dungeon, and Pupperazzi. Fit for a King was removed from the data set since its short wishlist period was considered inappropriate for comparison.
There are limitations with the data set that should be confronted directly. This is a sample of 7 games that span 8 years of a marketplace that has changed considerably. For example, Steam introduced refunds over this period (albeit early in the period, 2015). Since refunds are intended to make players more willing to buy and try games, including games prior to the introduction of refunds potentially drags the estimate down, as players after the change are expected to be more likely to buy. There are other concerns, but this is a suitable illustration as to why the problems extend beyond simply using a small convenience sample.
While the analysis includes estimates of variability (which is expected to be high given the small sample size), the options really do seem to be to deal with a severely limited data set or to abandon any hope of analysis altogether. Given that similar ratios (ones that performed rather poorly in the historical tests no less) have been used in the past, it seemed worthwhile to report the findings, but it must be stressed that conclusions should be considered suggestive and a motivation for more research, rather than the basis for a major decision.
What estimates were tested?
The main exercise was to test a wide range of candidates for wishlist forecasts to see which ones fit Kitfox’s historical performance best, and then to rank them based on accuracy.
The following measures of wishlists were considered:
- Total: Net wishlists up to release. Intuition: total interest for the game prior to release that will be contacted when the game is available.
- Average: Arithmetic mean of daily net wishlists. Intuition: average interest in the game prior to release acts as a proxy for the interest in the game when it is available for purchase.
- Median: Median (50th percentile) of daily net wishlists. Intuition: similar to the intuition for average, but the median is less responsive to extreme values and so is a proxy for ‘core’ interest in the game that is not driven by exceptional events (PAX etc.)
- Deciles: The 10th to 90th percentile of net daily wishlists. Intuition: similar to the median, but allowing for the possibility that the representative value (for forecasting sales) is not necessarily the middle value.
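As a rough illustration of the four wishlist measures above, the following Python sketch computes them for a made-up series. It assumes the input is a list of daily net wishlist changes (additions minus removals), so that the total is simply their sum. The function name `wishlist_measures`, the sample numbers, and the choice of the inclusive quantile method are all assumptions for illustration and are not taken from the analysis or from Kitfox data.

```python
import statistics


def wishlist_measures(daily_net_wishlists):
    """Summarise a pre-release series of daily net wishlist changes.

    Hypothetical helper: the analysis does not specify how its
    percentiles were computed, so the 'inclusive' method is an assumption.
    """
    # With n=10, statistics.quantiles returns 9 cut points:
    # the 10th, 20th, ..., 90th percentiles.
    deciles = statistics.quantiles(daily_net_wishlists, n=10, method="inclusive")
    return {
        "total": sum(daily_net_wishlists),
        "average": statistics.mean(daily_net_wishlists),
        "median": statistics.median(daily_net_wishlists),
        "deciles": deciles,
    }


# Made-up daily figures with one spike (for example, a convention day).
example = [10, 12, 11, 13, 250, 14, 12, 11, 10, 13, 12]
measures = wishlist_measures(example)

print("total:  ", measures["total"])
print("average:", round(measures["average"], 2))
print("median: ", measures["median"])
print("deciles:", measures["deciles"])
```

In this toy series the single spike pulls the average far above the median, which is the intuition given above for treating the median as a proxy for ‘core’ interest.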
The following measures of sales were considered:
- Total: Gross sales after release. Intuition: This is the most direct measure of what a developer is interested in. Net sales would not be appropriate since returns will be due to factors unrelated to wishlisting.
- Average: Arithmetic mean of gross daily sales. Intuition: a representative value of daily sales may be a better fit for most (all but the total) of the wishlist values.
- Median: Median (50th percentile) of gross daily sales. Intuition: there is a significant difference between sales on release day and two weeks after release. The justification is similar to the average, but better addresses the variability of daily game sales.
Candidate ratios were generated by taking all the different combinations from these lists, regardless of how credible they were a priori. For example, total sales / total wishlists follows the common intuition about wishlist forecasts, while the median / 2nd decile would estimate median sales based off some of the worst performing days for wishlists. The median would then be back cast using methods previously developed for Kitfox in another project.
Each combination was considered over different time periods from 1 week to 13 weeks (covering the first quarter of release). As with ratios, some time periods are more intuitive and useful to developers than others, although shorter periods may be of analytical interest and so were not removed.
Each combination of ratio and time period was then calculated using the seven published Kitfox games. Specifically, the ratio was calculated for each individual game and then an estimator was calculated using the harmonic mean. In addition, a measure of likely ranges and an accuracy measure were calculated.
The harmonic mean was used as it is more appropriate when averaging ratios. Wikipedia offers examples of the calculation, but for the purposes of this discussion the choice overcomes a potential problem in other calculations of this ratio, namely the use of the arithmetic mean. The arithmetic mean is not appropriate in this case, for the same reason that the results don’t work out if you try to calculate your average speed on a run using the arithmetic mean.
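The running analogy and the ratio estimator can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the per-game ratios and the wishlist count are made-up numbers rather than Kitfox data, and `forecast_sales` is a hypothetical helper that simply multiplies wishlists by the sales/wishlists ratio.

```python
import statistics

# Running example: two legs of equal distance (10 km each) at different speeds.
speeds_kmh = [10.0, 5.0]
true_average = 20.0 / (10.0 / 10.0 + 10.0 / 5.0)  # total distance / total time
arithmetic = statistics.mean(speeds_kmh)           # overstates the true pace
harmonic = statistics.harmonic_mean(speeds_kmh)    # agrees with distance / time

print("true average speed:", round(true_average, 3))
print("arithmetic mean:   ", round(arithmetic, 3))
print("harmonic mean:     ", round(harmonic, 3))

# The same idea applied to per-game sales/wishlists ratios.
# These ratios are invented for illustration only.
per_game_ratios = [0.2, 0.5, 1.0]
estimator = statistics.harmonic_mean(per_game_ratios)


def forecast_sales(wishlists, ratio):
    """Hypothetical helper: forecast = wishlists * (sales per wishlist)."""
    return wishlists * ratio


print("estimator:", round(estimator, 3))
print("forecast for 100,000 wishlists:", round(forecast_sales(100_000, estimator)))
```

With these invented ratios the arithmetic mean would be about 0.57, noticeably higher than the harmonic mean, because the harmonic mean gives less weight to unusually large ratios.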
How were the estimates evaluated?
The first criterion an estimate was graded on was the mean absolute percent error (MAPE). A percentage error was used because the candidates considered means, medians, and totals, so it was not appropriate to compare raw errors directly: an error between medians is almost certain to be smaller than an error in total sales.
The MAPE involves calculating the percent difference from the true value in absolute terms (i.e. ignoring the positive or negative) and then calculating the average of those errors. It follows that a lower MAPE means that, on average, the forecast of sales was more accurate than one with a higher MAPE. However, given that the errors are roughly 50% to 100%, even the best forecasts are very inaccurate forecasts.
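The MAPE calculation described above can be written out as a short Python sketch. The function name `mape` and the sample figures are hypothetical and are not drawn from the Kitfox data.

```python
def mape(actuals, forecasts):
    """Mean absolute percent error, returned in percent.

    Each error is |forecast - actual| / |actual|; the errors are then averaged.
    """
    if len(actuals) != len(forecasts) or not actuals:
        raise ValueError("actuals and forecasts must be non-empty and equal length")
    errors = [abs(f - a) / abs(a) for a, f in zip(actuals, forecasts)]
    return 100.0 * sum(errors) / len(errors)


# Made-up figures: one over-forecast, one under-forecast, one exact hit.
actual_sales = [1000, 2000, 4000]
forecast = [1500, 1000, 4000]
print(round(mape(actual_sales, forecast), 2))

# Scaling both series by the same factor leaves the MAPE unchanged, which is
# why it allows errors in medians and errors in totals to be compared.
scaled_actual = [a * 10 for a in actual_sales]
scaled_forecast = [f * 10 for f in forecast]
print(round(mape(scaled_actual, scaled_forecast), 2))
```

Note that the over-forecast and the under-forecast do not cancel out, since the sign of each error is discarded before averaging.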
The second criterion was a narrower range of likely values. The measure of likely ranges outlined in the article corresponds to the confidence interval. My background is in economics and so I started with a 95% confidence interval, which is standard in that context. The 95% confidence interval is also described as covering outcomes within roughly two standard deviations. It was clear that these confidence intervals were too wide to be of any practical use.
Before discussing the change, it may be helpful to provide an intuition about confidence intervals. One interpretation of the confidence interval is to say that if an experiment were to be repeated 100 times (in this case, 100 parallel dimensions where Dwarf Fortress is released), 95 of the results would fall within the confidence interval. The alternative I chose, the 68% confidence interval, shows the benefit and drawback of this choice: a tighter range of values, but now only 68 of the parallel dimensions would fall within them.
There are two justifications for loosening the restrictions. First, the dilemma was similar to choosing to work with a small sample from one developer in the first case: report something that is as useful as nothing, or report something more actionable and clearly indicate where the compromise was made.
Second, the 68% confidence interval (which corresponds to one standard deviation instead of the two of the 95%) appears to be more acceptable in the gaming context, as it appeared in some of the exhibits for Epic Games v. Apple. The choice of confidence interval always involves tradeoffs between how willing we are to throw out useful results and how willing we are to tolerate errors. Games forecasting probably does not justify the kind of strictness that, say, education policy evaluation does, and so a relaxing of standards may be justified, although caution is still warranted.
If a hypothetical developer were to find the rest of the analysis sound but want to rely more heavily on forecasts in making major decisions, they should consider the stricter confidence interval, and more data would generally be expected to shrink the range.
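To make the tradeoff between the two interval widths concrete, here is a small Python sketch. It assumes a symmetric interval around the estimate under a normal approximation, which is a simplification: the analysis does not state how its intervals were computed, and the note that the upside is much higher suggests the reported intervals are not symmetric. The function name `normal_interval` and the numbers are hypothetical.

```python
from statistics import NormalDist


def normal_interval(estimate, std_error, confidence):
    """Symmetric interval under a normal approximation (an assumption here)."""
    # Two-sided critical value: about 0.99 for 68% and about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return estimate - z * std_error, estimate + z * std_error


# Made-up estimate and standard error, chosen only to show relative widths.
low68, high68 = normal_interval(100.0, 10.0, 0.68)
low95, high95 = normal_interval(100.0, 10.0, 0.95)

print("68% interval:", round(low68, 1), "to", round(high68, 1))
print("95% interval:", round(low95, 1), "to", round(high95, 1))
```

Under this approximation the 95% interval is roughly twice as wide as the 68% interval around the same middle value, which matches the description of one versus two standard deviations.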
Notes on the results
Not all 117 of the estimates calculated were serious candidates for forecasting. The two-month results tended to cluster together and appeared to be good candidates for a forecast. Another factor in favour of two-month results was that they aligned with prior work that identified patterns in daily sales over the first two months of the game. This is why the estimate with the lowest credible error (MAPE) is not one that is used in calculating the forecasts. The lowest credible error belonged to a 12-week forecast, which does suggest that 1st quarter forecasts may be worth looking into, but it was not determined to provide enough of a benefit to ignore the better fit with the prior work on 2-month periods.
It should be noted that the backcasting method relies on results from games that are also used in the forecasting work. This creates a difficulty in that the estimates represent an ‘ideal’ case and that the backcasting introduces more error than expected into the sales estimates. Some spot checks to account for this showed the error only increased by an amount comparable to the difference between two good candidate estimators (about 1 or 2%), but it does introduce another caution when considering median estimates.
One general weakness to this method overall is that the best wishlist metrics tend to be total wishlists and mean wishlists. This is a shame since total and mean wishlists become more useful the closer they are to release (although means may be informative so long as the outliers are large enough to keep the omitted days from altering the mean too much). An ideal method would be to use a wishlist measure that could be obtained as early in a game’s development as possible and remains a target for future
Why report results you’re doubtful about?
Throughout the original article and this technical note there have been cautions and compromises. If asked for my personal feelings about forecasting based off of wishlists I would say I am skeptical but would not rule out the possibility. However, my added value does not come from my opinion but my analysis. By presenting the work, the aim is to provide additional information and hopefully provoke further research on the most promising avenues of research.
It is fair to say that the relatively low point estimator for the Dwarf Fortress sales forecast annoyed some people already. What is interesting is that most of the alternatives people proposed fell within the confidence interval in table 1. This may reflect less familiarity with confidence intervals. Measures of variability are not yet common in popular data articles (although it is my hope that this may change in the future). What is more important is to ask where these alternatives came from. Intuitions are good and useful checks on our estimates, but if they were a good long term forecasting tool, there wouldn’t be so much effort put in developing alternatives.
The aim has not only been to forecast Dwarf Fortress on its own, but evaluate wishlists as a forecasting method more generally. Lying underneath the result that got attention are 100+ potential forecasts that were not used. The purpose in forecasting Dwarf Fortress was to give a ‘live’ example of the best candidates from a broad examination of wishlist measures as a whole and to present it in a way that could not rely on what was “obvious” only in hindsight. Criticism is not a negative here but rather a desirable outcome (provided it is constructive).
There are different potential sources of error. One may be that it is inappropriate to compare Dwarf Fortress to Kitfox’s previous releases (as seems likely to be the case). This is different than thinking the use of the harmonic mean is mistaken, or that a minimum MAPE of 50% is too large to make any meaningful statement. This last casts doubt on the ability to derive meaningful forecasts from wishlists at all.
The hope is that, by presenting the reasoning and quantifying the uncertainty surrounding these forecasts, responses may go beyond simple objection and instead offer their own reasoning, or promote some reflection on the practice of forecasting and the sharing of best practices.