The conundrum about Story Points — pointless or not?
A mathematical & statistical point of view
Introduction
This article examines the nature of Story Points from a mathematical & statistical point of view, and how that nature relates to using Story Points to estimate and forecast work in a knowledge-work environment such as the IT software development industry.
The article is broken down into three main parts:
- Part 1 — Analysis of the Story Point concept and the core problem with Story Points
- Part 2 — How Story Points as a concept relates to mathematics & statistics
- Part 3 — Summary and conclusion
The origin of Story Points is beyond the scope of this article. For that, please refer to other material available online, such as the YouTube recording “The Genesis of Story Points” by Codebots, articles by Ron Jeffries, and similar sources.
Links to the resources that the author used while writing this article are listed near the end of the article.
PART 1 — Analysis of the Story Point concept and the core problem with Story Points
Story Points
Story Points are arguably a popular tool for relative estimation of work items. As such, they do not provide any definitive value for the work item being estimated, whether in terms of time, money, or anything else that we may want to estimate, such as risk or effort.
One of the most popular methods of estimating with Story Points is Planning Poker, where participants in an estimation activity use values taken from the Fibonacci sequence (0, 1, 2, 3, 5, 8, 13, 21, 34, etc.) or sometimes from a pseudo-Fibonacci sequence, where instead of 21, 34, 55, etc. we have 20, 40, 60, etc.
The item under scrutiny is estimated in abstract, relative terms and is given one of those sequence numbers as its value, which in turn represents…
The core problem with Story Points
The core problem with Story Points (in the author’s opinion) is that if something has a numerical value (3, 5, 8, 69, 555, 2137, 12345, or whatever other value comes to mind), then people in general (author’s assumption) treat that value as a cardinal value. That is, they treat it as quantifiable on a 1:1 basis with other values and comparable with them, without understanding the underlying statistics.
Let’s assume a simple situation, where we have three items:
- a watermelon (a cardinal value = 1 entity)
- an orange (a cardinal value = 1 entity)
- 100g of Pistachio nuts (also a cardinal value = 1 abstract unit of Pistachio nuts)
How to compare such entities?
By size?
Well, it’s rather simple considering that one assumes that each one of the entities in this given data set represents some set values such as:
- a ripe watermelon weighing approx 2kg
- a ripe orange weighing approx 300g
- a set of fresh Pistachio nuts, just after the harvest, weighing 100g
In such a case the ordering would be: a watermelon is bigger than an orange, which is bigger than a loose handful of Pistachio nuts.
Yet if we’d like to compare those different entity classes via some other scale then we may come to different conclusions.
Assume that we’d like to compare the amount of water (or just the size) that each entity in this data set has:
- clearly, the watermelon has the highest amount of water in it due to its sheer size
- orange comes second
- Pistachio nuts come third
Yet if we’d like to compare the entities by caloric content (calories, glucose, etc.), then the ordering may be a bit different. Compared on the basis of calories (kcal) per 100g, the ordering looks like this:
- 100g of Pistachio nuts with shells
- 1 orange — per 100g
- 1 watermelon — per 100g
Though taking into consideration that people rather consume whole units of oranges or watermelons, then it is clear that watermelons are more caloric per 1 full ripe unit.
It all depends on which factor we want to measure. Each scale uses a different parameter, and only once that parameter is chosen can a position in an ordinal ranking be tied to a cardinal value.
The comparison needs to be done with an assumption in mind; otherwise it is just a random comparison of apples to oranges (a category error), with results that are just as random and therefore meaningless.
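The fruit comparison above can be sketched in a few lines of Python. The weights come from the text; the kcal-per-100g figures are rough values assumed purely for illustration, and the names `ITEMS` and `rank_by` are hypothetical. The only point of the sketch is that the same three items produce a different ordering for each measure chosen.

```python
# Weights (grams) are taken from the text above.
# The kcal-per-100g figures are rough, assumed values for illustration only.
ITEMS = {
    "watermelon": {"weight_g": 2000, "kcal_per_100g": 30},
    "orange": {"weight_g": 300, "kcal_per_100g": 47},
    "pistachio nuts": {"weight_g": 100, "kcal_per_100g": 560},
}


def rank_by(key_func):
    """Return item names ordered from largest to smallest by the given measure."""
    return sorted(ITEMS, key=lambda name: key_func(ITEMS[name]), reverse=True)


by_weight = rank_by(lambda item: item["weight_g"])
by_kcal_per_100g = rank_by(lambda item: item["kcal_per_100g"])
by_kcal_per_unit = rank_by(
    lambda item: item["weight_g"] * item["kcal_per_100g"] / 100
)

print("by weight:       ", by_weight)
print("by kcal per 100g:", by_kcal_per_100g)
print("by kcal per unit:", by_kcal_per_unit)
```

With these assumed figures, each of the three measures yields a different ranking, which is why a comparison without a stated measure says very little.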
PART 2 — How Story Points as a concept relates to mathematics & statistics
Before we begin the proper comparison we need to establish a common understanding of the following terms:
- Nominal, cardinal and ordinal values
- Qualitative vs quantitative measures
- Abstract estimates (e.g. Story Points) vs detailed estimates (e.g. time estimates, money estimates)
- Calibration & historical data
- Business perspective
- The mathematical alchemy
- The Flaw of Averages
Nominal, cardinal and ordinal values
Nominal values denote a label of a thing without giving it a quantitative value. They show neither quantity nor rank.
Examples: a car. A mug of beer. A computer.
Cardinal numbers denote quantity, not an order of size.
Examples: you weigh 100kg. You took 10 days off to travel. You have 10 books on your shelf.
Ordinal numbers denote the order of values. The difference between the values is not known.
Examples: you are taller than your colleague Bob. It takes you more time than the average to read a particular book. You can lift more than your colleagues at the gym.
Qualitative vs quantitative measures
Qualitative measures denote an opinion on something rather than a quantitative amount. Qualitative measures may be processed as quantitative measures provided that the range of answers is finite (such as scales from -3 to +3, NPS, and similar) or that the answers to qualitative questions are taken from a finite set, even though they may not have a numerical value assigned to them (e.g. how do you find this new dark chocolate: is it good, better than other ones, or the best of the whole set?).
Examples: most of the clients think that our pizzas are good. People think that prices for bus tickets are too high. I like my work.
Quantitative measures denote a hard number of something rather than an opinion. Quantitative measures need to be finely tuned to the purpose of the research; otherwise we may fall into the various biases that arise when quantitative measures are used without considering the context. At this point we do not know the relative difference between the items in the data set.
Examples: a team did 10 work items per 2 weeks. Bob ate 2 kebabs last night. Sam sold 3 cars last month.
Abstract estimates (e.g. Story Points) vs detailed estimates (e.g. time estimates, money estimates)
Abstract estimates such as Story Points are supposedly a tool to measure the relative “something” parameter of any work item, where the “something” parameter may refer to effort (with various definitions of what effort is), time (with various interpretations of what to measure, when to measure and in what way to measure), complexity (with various definitions of what complexity consists of), any combination of those parameters, or something else altogether.
Once we assume that Story Points are a container for that “something” parameter measured in “some” way, we can move on to the “Velocity” measurement where things are going to be pretty interesting in one way or the other.
Velocity is a measure of how many (this is going to be important in a moment) things can be done in a given amount of time, expressed in Story Points projected onto a timeline. (Businesses operate in two values, time and money, and unless you get paid in Story Points you would probably agree.) It is based on historical data about completed work items with an assigned Story Point value, tracked with tools such as burn-up/burn-down charts or calculated manually in a spreadsheet.
The term “Velocity” can be a source of confusion, especially if one learned from physics that velocity is speed in a given direction (a vector).
Possibly you may see the problem with Story Points right now.
There is also an additional layer of complexity to be taken into account.
Given that:
- Story Points may have various parameters ascribed to them (such as complexity, time, risk, etc.)
- Story Points are a value assigned to a work item
- a collection of work items (of unknown relative size and required effort) that are done/ready produces a sum of ascribed Story Point values
- the sum of ascribed Story Point values is called Velocity
- Velocity is projected over any n amount of time, such as a Sprint (with a length of 1 to 4 weeks, or even more), a month, a quarter, half a year, and so on
- predictions are made about how many Story Points in Velocity one or the other work-team can deliver during the next n amount of time
There are so many cognitive biases and category errors in such an approach that they provide material for a separate article; explaining them is beyond the scope of this one.
Detailed estimates such as time & money require a method to project upcoming work onto time, as time equals money, considering Full Time Equivalents, Man Days, and similar measures of money in relation to time.
One can say — “It’ll take me 5 days to complete this work”.
Such a statement can be considered problematic due to many factors such as:
- what’s the probability that the work will take 5 days, and no more?
- on what basis does one conclude that such a statement is of high probability?
- what is the quality of such statements?
- what other factors are taken or omitted when considering time? Meetings, context switches, emergencies, life situations (e.g. someone getting ill from the common cold)
- etc.
Because it is quite hard to estimate work precisely in units of time, given all the variables that may be in play, time estimates have gained a reputation for being inaccurate.
Yet since businesses operate on time & money, it becomes clear why detailed estimates are preferred over abstract estimates and why abstract estimates tend to be taken as detailed estimates, even when they are not accurate.
What is a Story Point (singular)?
A singular Story Point is nothing in itself. It is an abstract unit to be used on an ordinal scale. It does not have any cardinal value assigned to it.
What are Story Points (plural)?
Story Points are ordinal numbers, meaning that they show a relative ordering of items in a given data set without denoting by how much of any given factor (time, risk, effort, or others) one story or task is bigger or smaller than any other in that data set.
As with any ordinal numbers, you will be able to see that Story #5 is subjectively bigger than Story #7, yet not as big as Story #8, but there is no definitive ruler or referential scale against which those Stories can be measured.
So the natural conclusion is that the ordering of those stories would be #7, #5, #8.
And that’s OK as long as one accepts the resulting high variance in time, and as we know, time is money in a business context.
Thus we cannot conclude whether Story #5 is twice as big as Story #7 or whether Story #8 is twice as big as Story #5. We also cannot conclude whether Story #8 is two, three, or more times bigger than Story #7.
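To make this concrete, here is a minimal sketch with three invented sets of durations (in days). All of them respect the same ordering of #7, #5, #8, yet the ratios between the Stories differ widely. The numbers and the helper `ordering` are hypothetical and serve only as an illustration.

```python
# Three hypothetical sets of durations (in days), all invented for illustration.
# Each one respects the same ordinal ranking: Story #7 < Story #5 < Story #8.
scenarios = [
    {"#7": 1.0, "#5": 2.0, "#8": 4.0},
    {"#7": 3.0, "#5": 3.5, "#8": 20.0},
    {"#7": 0.5, "#5": 9.0, "#8": 10.0},
]


def ordering(durations):
    """Return story labels sorted from smallest to largest duration."""
    return sorted(durations, key=durations.get)


for durations in scenarios:
    ratio_5_to_7 = durations["#5"] / durations["#7"]
    ratio_8_to_5 = durations["#8"] / durations["#5"]
    print(
        ordering(durations),
        f"#5/#7 = {ratio_5_to_7:.2f}",
        f"#8/#5 = {ratio_8_to_5:.2f}",
    )
```

Every scenario prints the same ordering, so the ordinal information alone cannot tell the scenarios apart.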
This brings us to the concept of calibration.
Calibration & historical data
There are two more things we need to cover. The first is the calibration part.
Considering that Story Points are abstract measures, it would be reasonable to consider calibrating them to time, based on the historical data of all our combined work.
Calibrating Story Points to a time scale is a rather simple exercise. It can be covered in one or two workshops lasting no more than several hours.
Bring your experts together, facilitate a workshop, discuss the differences in understanding, and come to a unified conclusion.
Such a workshop would result in having a calibrated price list of what a particular value of Story Points means when it comes to the time scale and what it means when it comes to the scope of what is incorporated in any particular values in Fibonacci’s (or pseudo-Fibonacci’s) sequence.
Unless such Stories are calibrated to a definitive ruler (or a referential scale) such as time then such ordinal values have absolutely no meaning on their own.
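The text describes calibration as a workshop exercise. As a complement, a “price list” of this kind could also be cross-checked against historical records. The sketch below assumes a hypothetical history of (Story Points, actual work-days) pairs; the data and the function name `build_price_list` are invented for illustration and are not the author’s method.

```python
from collections import defaultdict
from statistics import median

# Hypothetical history of completed work items: (story points, actual work-days).
# The numbers are invented for illustration; use your own team's records.
history = [
    (1, 0.5), (1, 1.0), (1, 1.5),
    (3, 2.0), (3, 3.0), (3, 5.0),
    (5, 4.0), (5, 6.0), (5, 9.0),
    (8, 7.0), (8, 12.0), (8, 15.0),
]


def build_price_list(records):
    """Group actual durations by Story Point value and summarize each group."""
    by_points = defaultdict(list)
    for points, days in records:
        by_points[points].append(days)
    return {
        points: {"min": min(days), "median": median(days), "max": max(days)}
        for points, days in sorted(by_points.items())
    }


price_list = build_price_list(history)
for points, summary in price_list.items():
    print(
        f"{points} SP: {summary['min']}-{summary['max']} days, "
        f"median {summary['median']}"
    )
```

If the ranges of neighboring point values overlap heavily, as they do in this invented data, that is a sign the scale is not yet calibrated well enough to be read as time.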
Thus we come to a sort of a taboo in the whole Agile movement — time.
Some random quotes that you may have heard:
“Developers cannot properly forecast in time.”
“Story Points are bereft of time measurements to free forecasts from whatever biases may be there.”
“Time is such a fluid measure that it cannot be forecasted.”
And similar.
We can assume that such a point of view would work perfectly fine if one lived in a static environment where time is nothing to worry about.
In the real world, in so-called “real time”, people are born, mature, live, do various things, and ultimately pass away, as that is the natural cycle of life, at least in the times we are living in (who knows what the future will bring; maybe some transhumanist ideas will come to life, but that is just a digression).
Also, in the real world, people do operate their businesses, invest money and expect returns on their investments. And sometimes they lose money.
Such investments are calculated on two constant factors — money (a cardinal number) and time (also a cardinal number). Money invested over time is expected to bring a Return on Investment.
So how do Story Points fit into that grand picture of real life?
Historical data
Historical data is something that we are sure of. It happened, it cannot be changed.
Let’s assume that one work item (on average) takes 4 days to complete. If that time varies, we can research the causes of the variance.
Possible causes:
- the work item was not properly refined according to the Rightsizing philosophy (so it was not calibrated to the team standard of what is an “average” work item that they can start working with)
- there were many intrusions into the work process — such as “emergency tickets” that need to be done ASAP (for whatever reason) over the normal plan of work for any given amount of time
- maybe the so-called Definition of Ready & Definition of Done parameters were not taken seriously enough
- risk management was absent at that time
- company standards were loose enough to accept that
- etc.
Considering that historical data is ever-growing, one can use statistical measures such as the mean, mode, and median to describe the range of what is possible.
Methods such as linear regression & Monte Carlo may also be of help, provided that the data is of high enough quality and of the right category to fit into those models.
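As a rough illustration of the statistics mentioned above, the sketch below computes the mean, median, and mode of a hypothetical list of cycle times and then runs a simple Monte Carlo simulation by resampling that history. The data, the function names, and the assumption that items are worked on one after another are all invented for the example; a real forecast would need data of sufficient quality, as noted above.

```python
import random
from statistics import mean, median, mode

# Hypothetical cycle times (work-days) of completed items; invented for illustration.
cycle_times = [2, 3, 3, 4, 4, 4, 5, 6, 8, 13]

print(
    "mean:", mean(cycle_times),
    "median:", median(cycle_times),
    "mode:", mode(cycle_times),
)


def simulate_total_days(history, n_items, n_runs=10_000, seed=42):
    """Monte Carlo: resample past cycle times to simulate n_items done one by one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    totals = [sum(rng.choices(history, k=n_items)) for _ in range(n_runs)]
    return sorted(totals)


def percentile(sorted_values, p):
    """Simple index-based percentile of an already sorted list (0 <= p <= 100)."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * p / 100))
    return sorted_values[index]


totals = simulate_total_days(cycle_times, n_items=10)
p50, p85, p95 = (percentile(totals, p) for p in (50, 85, 95))
print(f"10 items: 50% of runs within {p50} days, 85% within {p85}, 95% within {p95}")
```

The result is a range with probabilities attached rather than a single number, which addresses the earlier question of how likely it is that the work will take 5 days and no more.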
Business perspective
Businesses operate on two constants — time and money.
Money value can be derived from the time value given that the costs are stable.
Of course, there can be additional known & hidden costs due to various things such as:
- licenses for the software needed to actually start the work
- costs related to the needed infrastructure
- costs related to “freelance services”, when the business decides it’s more optimal to buy skills from the market (for example, people who work freelance) instead of developing them in-house
- pure randomness, such as the lead developer leaving or going on sick leave for a prolonged period of time
- etc.
The mathematical alchemy
Considering all of the above, we need to ask one important question:
How is it viable to “transmute” ordinal values such as Story Points into cardinal values such as a Velocity measure?
It is not. It is a category error to turn different entities in a given data set into cardinal numbers in the resulting data set without considering the differences between the categories and between the items within each category.
How could it be possible to average items of different categories, magically (via mathematical alchemy), into a definitive set of cardinal values just because those composite ordinal values are written as numbers?
It may seem reasonable that if something has a numerical value, then we can use that value in a straightforward way and transmute it (an ordinal value) into a cardinal, quantifiable value. That intuition is exactly the trap.
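One way to see why summing ordinal labels is shaky: if only the order of the labels carries meaning, then any relabeling that preserves the order should be equally valid, yet the resulting sums can point in opposite directions. The sketch below uses two hypothetical Sprints and two hypothetical labelings of the same scale; none of the numbers come from real data.

```python
# Two hypothetical Sprints, each a list of Story Point values of completed items.
sprint_a = [8, 8]           # two "large" items
sprint_b = [3, 3, 3, 3, 3]  # five "medium" items

# Two labelings of the same ordinal scale. Both preserve the order
# small < medium < large < very large; only the spacing between labels differs.
fibonacci = {1: 1, 3: 3, 8: 8, 13: 13}
rank_only = {1: 1, 3: 2, 8: 3, 13: 4}


def velocity(sprint, scale):
    """Sum the labels of the completed items under the given labeling."""
    return sum(scale[points] for points in sprint)


for name, scale in (("fibonacci", fibonacci), ("rank only", rank_only)):
    a, b = velocity(sprint_a, scale), velocity(sprint_b, scale)
    print(f"{name}: sprint A = {a}, sprint B = {b}, A > B is {a > b}")
```

Under one labeling Sprint A has the higher Velocity, under the other Sprint B does, although the ordering of item sizes never changed. The sum depends on the spacing of the labels, which is exactly the information an ordinal scale does not contain.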
This brings us to the next point.
The Flaw of Averages
The Flaw of Averages describes a situation where one averages quantities of things from different categories (e.g. comparing 1 apple to 1 car, or saying that on average a human and a dog have 3 legs) without taking into consideration the variation between items within one category and the variation between the categories included in the averaging calculation.
In this example, it is not known how long 1 Sprint takes.
It is also not known how long Stories #1, #2, and #3 took to complete in the given timeline of 1 Sprint.
The relative size of any work item and the effort that was required to complete it are not known either.
It is also not known whether Story #3 took approximately two, three, or more times as long as Story #1, as there is no referential scale between the items.
What is possible:
- that Story #1 took the most time due to underestimating the task at hand or due to the randomness of life (e.g. the expert that was going to finish it quickly got sick and no one else was able to do the task as fast)
- that Story #2 was similar in size and required effort to Stories #1 and #3
- that Story #3 took the least amount of time due to overestimating a task with many unknowns taken into consideration
- anything is possible
To add more flaws, one can project such a Velocity value over the next n periods of time. Let’s assume 4 examples:
- given that the Velocity in the 1st Sprint equals 21 Story Points, it would be rational to assume that the Velocity in the 2nd Sprint would also equal 21 Story Points, right?
- the randomness of life happened and the Velocity in the next Sprint equals 7 Story Points
- in the next Sprint the work team managed to complete work equating to 33 Story Points
- the Velocity counts of 10 Sprints are as follows: 5, 7, 12, 15, 18, 21, 25, 26, 33, 50
The total Velocity over 10 Sprints equals:
5 + 7 + 12 + 15 + 18 + 21 + 25 + 26 + 33 + 50 = 212
Thus the average Velocity per 1 Sprint, based on the 10-Sprint sample, equals:
212/10 = 21.2 Story Points
Thus in 6 of the 10 Sprints the Velocity was lower than the average by varying amounts, and in the other 4 it was higher by varying amounts.
Given that 21.2 Story Points is an average per 1 Sprint, we can assume that if the Sprint takes 10 days then the average Velocity per 1 day would be:
21.2/10 = 2.12 Story Points
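The arithmetic above can be reproduced with a short Python sketch. It uses the ten Velocity values from the text and, in addition, reports the median, the spread, and how many Sprints fall below and above the average. The 10-day Sprint length is the assumption already made in the text.

```python
from statistics import mean, median, pstdev

# Velocity values of the 10 Sprints listed in the text.
velocities = [5, 7, 12, 15, 18, 21, 25, 26, 33, 50]

total = sum(velocities)
average = mean(velocities)
below = [v for v in velocities if v < average]
above = [v for v in velocities if v > average]

print("total:", total)
print("average per Sprint:", average)
print("average per day (10-day Sprint):", round(average / 10, 2))
print("median:", median(velocities))
print("population standard deviation:", round(pstdev(velocities), 2))
print("Sprints below / above average:", len(below), "/", len(above))
print("range:", min(velocities), "to", max(velocities))
```

The single average hides a range that runs from 5 to 50 Story Points per Sprint, so a plan built on the average alone would have missed most of the individual Sprints by a wide margin.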
Readers should by now understand what the core problem with Story Points and averages is.
Examples of dubious comparisons used in the context of Story Points relative value
Animal “size”
Which one of those animals is bigger?
Without a scale to compare to it is not possible to compare different animals from different categories (in this example — species).
In order to properly compare animals from different categories, a comparison scale in relation to a parameter is required.
Size in centimeters:
- Saint Bernard is the smallest
- Hippo comes second
- The elephant comes third
- Giraffe is the tallest one
Yet the problem is with averages. What is “a hippo” for example? A baby hippo? A male hippo? An adult hippo?
Adding a further layer of granularity in the form of an additional parameter, “age”, we can compare entities in different classes:
- an adult Saint Bernard (on average) is bigger than a baby hippo (on average)
- an adult hippo (on average) is bigger than a baby elephant (on average)
- baby hippos, elephants, and giraffes are taller than a Saint Bernard pup
Also, “bigger” may refer to weight in kilograms rather than height in centimeters, or it may refer to any other definition of “big” used in a given context.
Food & beverages comparison
Another example compares relative Story Point values to food & beverages, where e.g. an espresso equates to 1 Story Point, a pizza equates to 5 Story Points, a full-course meal equates to 8 Story Points, and an all-you-can-eat buffet equates to 13 Story Points.
Foods & beverages cannot be treated as one category as there are many factors that can be taken into account:
- time to prepare (varies by skill of the cook)
- time to eat (I guess that one cannot realistically compete with Samantha Ramsdell or similar people, also what’s the point of gobbling up delicious food?)
- the amount of food & beverages being eaten/drunk (a heavy-weight bodybuilder may need three times the calories that you need)
- caloric value & nutritional value (let’s say that 1 beer may be more caloric than 1 bowl of veggies, yet less nutritional than those veggies)
- quality (varies by ingredients available)
- weight in grams/kilograms
- the number of pieces in a set (one can have 8 chicken wings on a plate or 200g of fries, etc.)
- etc.
Thus treat this as a humorous mention.
PART 3 — Summary and conclusion
Back to Story Points
Considering all of the above, it is quite reasonable to state that the concept of Story Points (an ordinal value) is quite abstract and does not relate to any cardinal value at all.
Any attempt to equate Story Points (an ordinal value) to a cardinal value such as time or money is futile unless Story Points are calibrated to time or money (or any other factor) beforehand.
Now, a simple question:
- If Story Points were calibrated to time, would they then turn into time estimates and thus make the whole hassle of using Story Points redundant?
This is a good question.
Story Points as time estimates (in ranges)
Considering that Story Points can be calibrated towards the cardinal value of time (so how long any given item may take to complete, presented in e.g. work-days) it may not be that far-fetched to conduct such an exercise in the reader’s own context.
Of course, one may end up with random results such as:
- the “standard work item” in team A equals 3 Points, while the “standard work item” in team B equals 5 points
This may be OK if you are willing to accept the statistical biases that may come into play in the resulting comparison, such as 1 point on team A’s scale equating to roughly 1.67 points (5/3) on team B’s scale (a “pricing list” of sorts).
Such calculations and comparisons are possible yet they entail an amount of manual spreadsheet number crunching that you may or may not be willing to do.
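The spreadsheet number crunching mentioned above could look roughly like the following sketch. It takes the “standard work item” values from the text (3 points in team A, 5 points in team B) and converts between the two scales. The function `convert` is hypothetical, and it assumes that both scales are linear and anchored at zero, which is the kind of assumption that introduces the biases discussed here.

```python
# From the text: a "standard work item" is 3 points in team A and 5 in team B.
# Everything else here is a hypothetical sketch.
STANDARD_ITEM = {"A": 3, "B": 5}


def convert(points, from_team, to_team):
    """Convert points between team scales using one shared reference item.

    Assumes both scales are linear and anchored at zero, which ordinal
    Story Point scales do not guarantee.
    """
    factor = STANDARD_ITEM[to_team] / STANDARD_ITEM[from_team]
    return points * factor


print("1 point in A, on B's scale:", round(convert(1, "A", "B"), 2))
print("8 points in A, on B's scale:", round(convert(8, "A", "B"), 2))
print("5 points in B, on A's scale:", round(convert(5, "B", "A"), 2))
```

Note that the converted values generally do not land on either team’s estimation sequence, which is one more hint that the two scales are not truly interchangeable.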
Yet take a look at the picture below.
Do you see the potential problem that may arise in mixing A) various measurement scales and B) various orders of magnitude between two or more contexts in which such orders of magnitude are not calibrated to time, risk, complexity, etc., and C) mixing options A & B?
There is also a second option available, yet the viability of such an option may be a bit questionable.
One may consider calibrating the work of ALL teams to a single scale, which may or may not result in a definitive, comparable scale that would allow one to compare the workload vs. delivery of every team taken into consideration.
The author has no data to back up such a claim, yet hypothetically, using dedicated calculations one would, once again hypothetically, be able to compare each team’s potential using a 1:1 basis given that one knows constants in the given context.
Another possibility to use Story Points is to treat them as buckets (or T-shirt sizes), not numbers
One can consider using Fibonacci’s sequence (or pseudo-sequence) as a way to categorize items into abstract sizes of small, medium, large, or anything in between, including expanding that scale to extra-small, extra-large, or extra-extra-large (XXL), and so on, in a way mimicking the sizing of clothing or cup/mug/bucket sizes.
The problem is that unless abstract entities such as buckets or T-shirt sizes are calibrated to cardinal values, they do not have any real value. They are just like Story Points, only with the numbers from Fibonacci’s sequence (or the pseudo-sequence) replaced by abstract S, M, L, XL, etc. categories.
The subjective conclusion
Story Points are of use only under the following conditions. They:
- are calibrated to time (where time = money)
- are calibrated to the relative amount of work of any given task at any given order of magnitude in the pseudo-Fibonacci sequence 1 2 3 5 8 13 20 20+ (so that tasks in any given subset of tasks are statistically similar, or have a standard deviation that is subject to the law of large numbers)
- come with a shared and unified understanding of what is actually being estimated with those points (effort, time, complexity, dependencies, risks, etc., all of which are pretty abstract by themselves, unless described in detail)
If all of those conditions are met, then you can use Story Points as time estimates with some degree of precision.
If all of those conditions are met by all of the work teams using Story Points, then arguably one can compare Story Points between teams.
If any of those conditions is not met, then Story Points are pointless (pun intended).
While Story Points do provide some value as a tool to start a discussion on any given work item, they cannot be used as an input to any mathematical operations regarding statistics.
The rationale
Calibration of story points to time and to the relative (yet still within some x-y ranges) amount of work that any given task/story/ticket/etc. may entail is required in order to have any kind of control over work & delivery planning, work execution, monitoring of progress, risk management, and similar practices related to running a business.
Otherwise, if any task at any given order of magnitude in the pseudo-Fibonacci sequence 1 2 3 5 8 13 20 40+ may take anywhere from 1h to 160h, then any attempt to plan work over a roadmap or any other kind of schedule is subject to randomness and is therefore imprecise.
Flip a coin instead, or roll dice; it takes less energy than scheduling and has similar chances of hitting the estimates, which are random anyway.
Also, unless everyone doing the estimating has the same idea of what they are estimating, the resulting estimates are random and thus meaningless when it comes to planning work. The author would argue that anyone using such vague numbers and mixing cardinal and ordinal numbers is just engaging in magical, wishful thinking, where everything is averaged and, as a result, it does not really matter.
Or in other words — we may call it a “mathematical alchemy” as described before.
Of course, there are other things that we may include in the calculation such as mixing the development work counted in story points with maintenance work not estimated in story points, but that’s of little importance to this article.
If the foundations of any given estimation approach are botched, then any result of using such an estimation approach is also botched by design, resulting in meaningless data.
Fortunately, there are methods more grounded in mathematics & statistics that can be of value to estimate work and have a point, yet that’s another story for another time.
Thank you for reading this article. I am curious about your point of view. Do you agree, disagree or have a totally different view on the topic?
The author would like to thank all of the following reviewers listed in alphabetical order for their valuable advice:
Thank you so much. Your contribution was highly valuable.
Citation:
Please cite this article in press as: Jarosz Maciej; The conundrum about Story Points — are they pointless or not? (2022); https://medium.com/p/b5d715180c96/
Sources:
https://herdingcats.typepad.com/my_weblog/2015/12/the-story-point-problem.html/
https://herdingcats.typepad.com/my_weblog/2015/08/the-flaw-of-averages.html
https://herdingcats.typepad.com/my_weblog/2015/03/flaw-of-averages.html
https://herdingcats.typepad.com/my_weblog/2017/04/some-dark-sides-to-agile.html
https://herdingcats.typepad.com/my_weblog/2017/04/velocity-versus-speed.html
https://medium.com/geekculture/story-points-are-dumb-just-use-hours-eccad9bb5c6f
https://www.bennorthrop.com/Essays/2012/velocity-and-story-points-they-dont-add-up.php
https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
https://en.wikipedia.org/wiki/Box%E2%80%93Jenkins_method
https://en.wikipedia.org/wiki/Ordinal_number
http://swreflections.blogspot.com/2012/06/agile-estimating-story-points-and-decay.html
https://www.mountaingoatsoftware.com/blog/why-i-dont-use-story-points-for-sprint-planning
https://ronjeffries.com/articles/019-01ff/story-points/Index.html
https://medium.com/in-the-pocket-insights/stop-estimating-start-forecasting-b275f9f81c45
https://www.youtube.com/watch?v=UEDvKAqOuZU
https://tasks.illustrativemathematics.org/content-standards/HSN/Q/A/tasks/1696
http://www.twitlonger.com/show/n_1sn8h2t
https://getnave.com/blog/story-point-estimation/
https://getnave.com/blog/thin-tailed-vs-fat-tailed-distribution/
https://observablehq.com/@troymagennis/story-point-velocity-or-throughput-forecasting-does-it-mat