As I was going up the stair
I met a man who was not there!
He was not there again today,
Oh how I wish he’d go away!
– “Antigonish” (1899) by William Hughes Mearns.
It is also known as ‘The Little Man Who Was Not There’ and was made into a hit song by Glen Miller.
Missing data is really hard to handle. SQL programmers are a little bit spoiled because our language has NULLs in it and beginning SQL programmers tend to assume that this is the way to handle all missing data. This is wrong and when the column is NULL-able, you need to define exactly what that NULL means in context. In the second version of his relational model Dr. Codd had two kinds of NULLs; one for the case when the value of an attribute was missing, and one when the attribute itself was missing (and therefore, it could never have a value).
Why is a Value Missing?
You can be several reasons we do not have a value. It can be completely random and have nothing to do with the data itself. For example, there was a fire or some natural event that destroyed our data. This is often referred to as the ‘the dog ate my homework’ excuse. Sometimes this is actually true.
Data can also be missing due to the nature of the data itself. For example, I can say with absolute certainty that data about my pregnancies is missing because it’s affected by my gender (Dr. Codd’s second kind of NULL). While this example is going to be 100% dependent on gender, other missing values may be less certain. For example, many people are hesitant about reporting their income or age.
I have to make a decision about male pregnancies. Do I report them with zeros, so that summations will work easier when I pass data to a non-SQL package? Or do I use the NULL, to mark them as an impossibility? At the time I’m writing this (2018 June), there’s been a posting on a SQL Server forum which asks about how to handle year-to-date running totals. If a particular month has already occurred, then you can get a running total pretty easily. However, if I need a total for a month that has not yet occurred, the poster wanted to use a NULL in the table. But then he wanted to update all of the ‘future NULLs’ with the last known total; this is obviously invalid, unless you are clairvoyant.
Deleting Missing Data
The easiest way to handle missing data is simply to delete it. But you have to decide if you want to delete the entity involved (that is, delete rows) or to delete the attribute (that is, delete columns).
Entity deletion is done quite a lot. Some of the incoming data is simply a mess. The bad news with this method is sometimes you really cannot afford to throw it all out. As an example of this, McGraw-Hill Educational Testing Services gets warehouses of Scantron forms. Many of these forms are seriously damaged the point they do not scan. Rather than tell a kid that his test disappeared, they pass it on to human beings who try to figure out what was intended and reconstruct it manually. It’s actually a small percentage of the total volume the handle, but it’s a pretty impressive absolute number. But some of the forms are such a mess you really cannot do anything with them; they look like somebody mopped the floor with them.
Attribute deletion is rarer than entity deletion. Dropping the column goes across all the rows in a table. But you probably put that column in the original schema DDL for a reason. And now you cannot use it! A good question is when an attribute (column) it so messed up becomes useless. In my experience, the most common reason for this is different data sources. They do not have to be intentionally bad or negligent, but they have different precision, or they have different scales of measurement. One of the classic stories in IT was when Disneyland Japan was using metric measures and they had some of their parts supplied in US customary units. I’ll discuss this problem shortly.
Replacing Missing Data
Statisticians have replaced missing data with various techniques for a long time. Longitudinal data is the same attribute measured over time, using a series of observations. However, there can be missing values in the series.
Several common techniques presume that things don’t change very much over time. This means you can look to the value before or after the gap and decide to carry the last known value forward, or the next known value backwards. The bad news is if there’s a trend, either increasing or decreasing, these methods will hide it.
A better method is to look at the boundary values and interpolate a value between them in the gap. It tends to make for smoother data. The tradeoff is that cyclic or seasonal data can suffer. The easiest way is to assume that all data is linear, and simply computed value with a simple linear fit. This is an assumption a very strong one! You better have a good reason for it. It’s also quite possible that data is growing or decaying exponentially over time. Again, anytime you provide a missing value, you’re making a leap of faith.
It is easy to compute the average with AVG() for each instance, and assume that it can be used to represent the missing value. It will smooth out very nicely in other aggregate statistics when you analyze the whole set. But it also means you reduced the variance in the data. Then you have to decide whether or not the mean was the right choice. The other two common measures of central tendency in the data set are the median (or better, a weighted median) and the mode. However, there are other statistics, such as the geometric mean that might be appropriate.
Notice that so far, we only considered continuous values. If the attribute is categorical, we have some other choices. One way to do this is to make a ‘miscellaneous’ or ‘missing’ category in your nominal scale. If you start to get too many occurrences of such values, then your data is pretty dirty, and you need to rethink what you’re doing.
Another approach is to assume that the mode (most frequently occurring value) is the best choice, especially if the data is skewed in some way. For example, we might assume that the sex code for this particular population (a database of NFL football players versus a database of NFL cheerleaders) is ‘male’ and routinely plug it in. This obviously has some problems. If we’re lucky, we might be able to use other attributes to make a better educated guess. If the first name of this entity is ‘Mary’, then we can reasonably assume the sex code should be ‘female’; if the first name is ‘Vladimir,’ then we can reasonably assume the sex code should be ‘male’ instead. There are also people who only use their initials for a first name. However, the old kids television show ‘The Secret World of Alex Mack’ had the scripts before they had the lead actor. They deliberately picked a gender ambiguous name for the series.
A better way, but it takes more computing, is to look for multiple attribute values. For example, if you’re looking for sex code, we might include the first name, the occupational code, age, and whatever other attributes have. Each value for each attribute is assigned a weight, and the weights are added to give a total score for how likely each replacement value is. There some statistical tools for doing this, and it can be surprisingly accurate. The gimmick is any unusual entity in your database can get lost in the assumption that it’s not an outlier. For example, a 70+ year-old high school teacher is usually going to be female; this means you’re going to miss all older retired male volunteer teachers that just came into the database.
Mixed Scales and Data Loss
Things like length and temperature are measured on what are called ratio scales. This means they are expressed as multiples or subdivisions of one standard unit, and that you can convert among such scales with a little algebra (for example, °F = °C × 9/5 + 32 is the formula for converting Celsius to Fahrenheit temperatures). You might want to look up a prior article of mine on scales and measurements.
In spite of the fact that there is a formula for doing the conversions, there is no guarantee that each measurement is kept to the same number of decimal places after conversion. In the United States, we tend to keep Fahrenheit temperatures in whole numbers, however Celsius temperatures tend to be recorded to at least one decimal place. Thus, some entities in the database are being measured by a different instrument with different scales. You now have data loss from rounding errors!
Nominal Scale Loss
Many years ago, I helped set up a database for the sales department of a pharmaceutical company. Part of the database included the medical specialties of the doctors, which they obtained by buying commercial data. The problem was they had two sources for this data; one of which used an alphabetic encoding system and the other used a numeric system. There were apparently no industry standards at the time. The real problem was the scales did not match. For example, one provider thought pediatric cancer doctors belonged under pediatrics, while the other provider put them as a specialty in oncology. Even when you could do a one-for-one translation, the principal specialty of the same doctor would vary.
The real problem was that depending who loaded which tape into the database, the encoding scheme used would change from one to the other. Oh, did I mention that someone gave everybody full admin privileges on the database? Developers made scripts that would do ALTER TABLE statements that matched whichever vendor they favored when they loaded a tape. The database kept getting overwritten unpredictably.
It took up a few months before the salesman noticed that the printouts they were getting for their sales calls didn’t match each other. This sounds obvious if you’re a database person and are very aware of data. But pharmaceutical salesman are not database people, so it took a little while.
The problem then became converting the specialty codes which did not match. We simply made arbitrary decisions, based what we felt was the best match. The result was data loss, even after we decided on one system.
There’s an old proverb that “a man with two wristwatches can never be sure exactly what time it is.” And this was a really good example of it.
The bottom line is that there is no universal, magic technique for putting in missing data. Sorry. No magic. You have to make these decisions on a case-by-case basis. But as you make these decisions, the aggregate of the data is affected. Obviously, the best solution is not to allow missing data at all and to get perfect inputs. If you have a way to do that, please let me know. The best I’ve been able to do is reduce the amount of bad input after 40 years in IT.