The textbooks always show beautiful, perfect data. It is always completely there, and it never has to be scrubbed. Then you get to your first job. In the real world, bad data is one problem, but you also have to deal with the fact that sometimes you simply do not have the data; it is missing, and you have to live with it. Statisticians have some tricks for filling in missing data, with some cautionary notes. The general name for these techniques in statistics is “imputation”.
Mother Celko’s Ice Cream and Good Times Emporium
This sort of article is often better done with a made-up example that is intuitively obvious to the reader. I am going to pick a chain of ice cream parlors, with many stores in very different neighborhoods and a central warehouse that distributes goods (ice cream flavors, various sizes of cups and cones, spoons, and the all-important napkins) to the stores in the chain.
Each store is expected to turn in a weekly report of its sales. The stores get their ice cream by the liter, and sell it by cups and cones (by the count). In short, this is a very vanilla sample business (sorry, the pun was too easy). What do we do when a store fails to turn in its weekly report? What if the weekly report is not complete?
Quick and Dirty
The real quick and dirty way to handle missing data is to simply ignore it. Much as I hate to admit it, sometimes this works. If you have a large sample, it is possible that the missing data will not distort the summary data. But there is an underlying assumption here that the missing data is in some sense negligible. In our case, we can look at our inventory and get estimates of future purchases and sales without the weekly details. What we have lost by not having the real sales information is trends and particulars. If, say, store #5 loses all the weekly paperwork, we at least have the information about how many cones, cups, and liters of ice cream (by flavor) we sent to this store and what was in the local inventory. With a little math, the known data will tell us the weekly totals, but it does not tell us which days of the week were strongest or weakest in sales, nor which days of the week favored a particular flavor. We might guess that sales of vanilla are very uniform on every weekday; but was the sale of 20 liters of “Chunky Linoleum” the result of one customer coming in and making a huge purchase? Or did some celebrity endorse this horrible flavor, with the effect that we now have an established demand with lots of customers? That is, should we change our buying patterns?
Other quick and dirty tricks would be to use a minimum, so we will not overestimate; or a maximum, so we will not underestimate. We could also use an average, on the assumption that the average is representative for each day. But perhaps the median would be a better measure of central tendency if we have outliers. In short, even with a “quick and dirty” missing value, there are a lot of adjustments to be made.
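The choices above can be sketched in a few lines. This is only an illustration: the daily cone counts are invented, and which candidate you pick depends on the judgment calls just described.

```python
import statistics

# Hypothetical daily vanilla-cone counts from the weeks we do have;
# all figures are invented for illustration.
prior_weeks = [120, 135, 128, 150, 110, 143, 131]

# Each "quick and dirty" candidate for the missing value:
candidates = {
    "minimum": min(prior_weeks),       # will not overestimate
    "maximum": max(prior_weeks),       # will not underestimate
    "mean": statistics.mean(prior_weeks),
    "median": statistics.median(prior_weeks),  # robust against outliers
}
for name, value in candidates.items():
    print(f"{name}: {value}")
```

Notice that with an outlier-free sample like this one the mean and median agree; one very bad week in the data would pull the mean down while leaving the median alone.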
Some missing values will be local to particular data sources. Assume we have an ice cream parlor in Koreatown; we know that it is the only store that sells “Kimchi Ripple”, so we need to fill in its missing values differently from a regular ice cream parlor that is more concerned with Chocolate, Vanilla, Rocky Road, and Butter Pecan.
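A sketch of that idea: impute from the store's own history, not a chain-wide figure. The store names, flavors, and counts below are all invented for illustration.

```python
# Weekly sales history per store, per flavor (invented figures).
history = {
    "Koreatown": {"Kimchi Ripple": [60, 58, 63], "Vanilla": [40, 42, 41]},
    "Midtown": {"Kimchi Ripple": [0, 0, 0], "Vanilla": [200, 190, 210]},
}

def impute(store, flavor):
    # Fill the gap from this store's own past, not the chain's average.
    past = history[store][flavor]
    return sum(past) / len(past)

print(impute("Koreatown", "Kimchi Ripple"))
print(impute("Midtown", "Kimchi Ripple"))
```

A chain-wide average of Kimchi Ripple would misstate both stores: it would undercount Koreatown and invent sales in Midtown, which has never sold a scoop of it.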
In SQL, we have the NULL as a placeholder for missing data. But it has its own computational rules, which may or may not make sense for your particular data model. I am going to assume most readers are SQL programmers and are familiar with these problems. Essentially, the original aggregate functions in SQL were set up to use only the known values and drop the NULLs.
Most new SQL programmers know that the ANSI/ISO standards require that an aggregate function drop the NULLs, then perform the computation. But the standard also requires a warning that NULLs have been dropped. The reason we put this in the SQL language was so that code using aggregate functions can know whether or not it is working with complete data. The usual example I give when I am teaching classes is a sales team report, where one of the salesmen has failed to turn in his paperwork, so his team's totals are incomplete. But because the other members of his team did their jobs, your query will produce an aggregated team score anyway.
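You can see this behavior with SQLite from Python. Note that SQLite, like many products, drops the NULLs silently rather than raising the warning the standard calls for, so comparing COUNT(*) to COUNT(column) is a practical way to detect the dropped rows yourself. The table and figures are invented.

```python
import sqlite3

# An in-memory table of weekly store reports; store #3 lost its paperwork.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weekly_sales (store_nbr INTEGER, cones_sold INTEGER)")
conn.executemany(
    "INSERT INTO weekly_sales VALUES (?, ?)",
    [(1, 400), (2, 350), (3, None), (4, 500)],
)

# SUM() drops the NULL row before computing, so this "total" covers
# only three of the four stores.
(total,) = conn.execute("SELECT SUM(cones_sold) FROM weekly_sales").fetchone()
print(total)  # 1250

# COUNT(*) counts rows; COUNT(cones_sold) counts only non-NULL values.
# The difference is how many rows the aggregate silently ignored.
all_rows, known_rows = conn.execute(
    "SELECT COUNT(*), COUNT(cones_sold) FROM weekly_sales"
).fetchone()
print(all_rows - known_rows)  # 1 store with missing data
```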
But using DEFAULT has a different set of problems. The default value we pick has to be absolutely safe, and therefore is probably not representative. And it has to be a constant! Replacing the missing data with a constant (usually zero or one) always feels fundamentally wrong. In particular, using zero came from the first versions of FORTRAN. When the input data was on punch cards, an un-punched (blank) column was read as zeros. This led to some interesting things. Those of you who work with medical records use the ICD codes (International Classification of Diseases). In the early versions of that standard, all zeroes was cholera. So when the diagnosis was missing, we left the punch card columns blank and ran the data through FORTRAN-based statistical programs. The result was that about a third of all the patients in our hospital studies showed up as cholera instead of “undiagnosed”. (The current ICD-10 code is “A00.0”, if you happen to have this disease.)
Getting back to the topic, we already know that the week for a weekly report is a unique, particular week. We just do not know how many cones were sold in that week, but we are absolutely certain that the week itself existed. Once we have this temporal dimension, any calculations have to deal with the missing data. We cannot simply skip to the aggregate level and ignore the details which created the aggregation.
Does the Data Actually Exist?
While this sounds obvious, there is a tendency to look at the prior situation in terms of the current encoding schemes and measurements. One of my favorite examples is projecting back in time to get quantities of products that did not yet exist. If we have a seasonal pumpkin ice cream, and we know it does not go on sale until October, then we know we have exactly zero sales of it in September. However, a shipment of the new flavor might have been made in September in order to set up for a Halloween promotion. Because pumpkin ice cream exists in the warehouse, you can be fooled into counting it as sales.
If you switched from US customary units to metric, then missing data from before the changeover should be imputed in US customary units. We are trying to get a consistent view of our data over time, but first we need to have that old data. Another version of this is providing missing data for packaging that did not yet exist or had ceased to exist. The principle is fairly simple but can be tricky to implement: an estimate must be expressed as it would have existed at the time of its creation.
The next question is, “At what point do we not have enough data to do anything?” If you do not have enough data, then you are simply pulling things out of thin air. The rule of thumb, and that is exactly what it is, says that if you have over 5% missing data, then you cannot make good approximations. This rule of thumb can be overridden by the particulars of your situation, but for legal reasons, it might be a good idea to put a note in your reports and tell the auditors who come through about possible errors.
Waldo Tobler was an American geographer who pioneered analytical cartography and computational geography. His first law of geography, “everything is related to everything else, but near things are more related than distant things,” applies to other data as well. What changes is the definition of “space” and “neighborhood,” which become more general concepts involving time and measurements.
In the case of our ice cream parlors, we have a temporal dimension. The easiest estimate of the sales of vanilla ice cream cones for the missing week is probably the total sales of the same product from the prior week. But a better way is to look for trends in the data and extrapolate from the prior weeks. If the report week is in the summer, we would expect consumption of ice cream to increase slightly week by week. Likewise, if the week is in the winter, then we expect it to smooth out to whatever the winter levels are.
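One simple way to extrapolate such a trend is to fit a least-squares line through the known weekly totals and read off the missing week. The weekly figures below are invented summer sales, rising slightly week by week.

```python
# Known weekly totals of vanilla cones (invented), by week number.
weeks = [1, 2, 3, 4]
sales = [400, 425, 445, 470]

# Ordinary least-squares fit of a straight line, done by hand:
# slope = covariance(x, y) / variance(x).
n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, sales)) / sum(
    (x - mean_x) ** 2 for x in weeks
)
intercept = mean_y - slope * mean_x

# Extrapolate the fitted line to the missing week.
missing_week = 5
estimate = intercept + slope * missing_week
print(estimate)  # 492.5
```

This is still an estimate, not a measurement; if the trend bends (say, summer tapering into autumn), a straight line will overshoot.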
There is a story about a German statistics professor in Nazi Germany who weighed his bread rations. He found that the average weight was less than it was supposed to be. The expectation was that the weight should have a normal, or Gaussian, distribution, because the factors that go into making bread are pretty much independent.
He inferred that this meant his baker was using a bread mold that was shorting the customers. When confronted, the baker promised to fix the situation. The professor measured his bread rations again and found that most (not all) of his rations were at or above the required weight; a skewed distribution! He then went back and found that it was a “ragged right half” of a normal distribution, and he determined that the actual average weight of the bread had not changed. The baker was simply giving the professor the larger pieces. The professor then turned the baker over to the authorities for fraud.
Finding the missing values given a normal (Gaussian) distribution can be done by building a histogram. Once you have that, you do a simple curve fit under the assumption that the high point of the histogram is the mean of the actual population.
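The first step of that trick can be sketched as follows: bin the observed values, find the tallest bin, and (assuming the population really is roughly normal) take its midpoint as the estimated mean. The weights and bin width are invented for illustration; a real curve fit would go on to estimate the spread as well.

```python
from collections import Counter

# Observed measurements (invented), e.g. bread weights in grams.
weights = [96, 98, 99, 100, 100, 101, 101, 100, 102, 99, 103, 100, 97, 101]

# Build a histogram with fixed-width bins; each value maps to the
# lower edge of its bin.
bin_width = 2
histogram = Counter((w // bin_width) * bin_width for w in weights)

# The tallest bin's midpoint is our estimate of the population mean.
tallest_bin = max(histogram, key=histogram.get)
estimated_mean = tallest_bin + bin_width / 2
print(estimated_mean)  # 101.0
```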
The normal distribution is not the only option in real data. Probably the next most common one in the real world is a Zipfian or Pareto distribution. What this says is that the most common things are very, very common, and the rare things are, well, very rare. We can probably be sure that we will sell more vanilla or chocolate cones than we will sell “Chunky Linoleum” cones.
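In the classic Zipfian form, the expected frequency of an item is proportional to the reciprocal of its popularity rank. A quick sketch, with invented flavor ranks and an invented chain-wide total:

```python
# Flavors ordered by popularity rank (invented ordering).
flavors = ["Vanilla", "Chocolate", "Butter Pecan", "Rocky Road", "Chunky Linoleum"]
total_cones = 10_000

# Zipf's law: share of rank r is (1/r) divided by the harmonic normalizer,
# so the shares across all ranks sum to 1.
harmonic = sum(1 / r for r in range(1, len(flavors) + 1))
expected = {
    flavor: total_cones * (1 / rank) / harmonic
    for rank, flavor in enumerate(flavors, start=1)
}
for flavor, cones in expected.items():
    print(f"{flavor}: {cones:.0f}")
```

Under this assumption, rank-1 Vanilla should sell five times as many cones as rank-5 Chunky Linoleum, which is the sort of prior you would want before imputing a missing week for a rare flavor.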
Finally, we have the Poisson distribution. This occurs when there is a queue that backs up a little bit. In our ice cream parlor example, the queue would be liters of ice cream in the freezer, waiting to be served. The distribution takes its shape from the average service time of elements in the queue. If the queue moves fast, the distribution looks pretty flat; but in the real world, the delay in service time means it tends to bunch up toward the low end and then gradually flatten out as elements are processed.
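The shape is easy to see by printing the Poisson probability mass function for a small assumed rate. The rate of 3 below is invented; the standard formula is P(k) = e^(-λ) λ^k / k!.

```python
import math

def poisson_pmf(k, lam):
    # Probability of exactly k events in an interval with average rate lam.
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0  # assumed average number of orders waiting (invented)
for k in range(8):
    bar = "#" * round(poisson_pmf(k, lam) * 100)
    print(f"{k:2d} {bar}")
```

With a small rate like this, the mass bunches up around k = 2 or 3 and then tails off gradually, which is the "bunch toward the low end, then flatten out" shape described above.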
This gets more complex than I wish to get in a short column. If you like to read math books, you might find this list interesting for starters.
- “Statistical Analysis with Missing Data” by Roderick J. A. Little and Donald B. Rubin (ISBN 0-471-80254-9). This is actually aimed more at medical data than commercial data, but the principles still apply.
- “Interpolation and Approximation” by Philip J. Davis (ISBN 978-048662495-2).
- “Interpolation: Second Edition” by J. F. Steffensen (ISBN 978-048645009-4).
The last two books are available from Dover Publications. They are more general, but they are still math books. Interpolation used to be much more popular than it is today because, before cheap calculators, we had to use slide rules and lookup tables. Basically, given two consecutive values of a function, an interpolation makes an approximation of a value that lies between them.
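Linear interpolation, the kind done with a printed lookup table, can be sketched in a few lines. The table values below are invented.

```python
# A tiny "lookup table" of (x, f(x)) pairs, sorted by x (invented values).
table = [(1.0, 10.0), (2.0, 14.0), (3.0, 20.0)]

def interpolate(x, table):
    # Find the two consecutive entries that bracket x, then take the
    # proportional point between their function values.
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)  # how far along the interval x lies
            return y0 + t * (y1 - y0)
    raise ValueError("x is outside the table's range")

print(interpolate(1.5, table))  # 12.0, halfway between 10.0 and 14.0
```

The same bracketing idea is what you would use to impute a mid-week figure from two known surrounding days, under the assumption that the quantity changes roughly linearly between them.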