The SQL of Textonyms

The task of finding textonyms in SQL involves importing a list of common words and doing transformations on every word to convert it into what you'd need to type into the numeric keypad of your mobile phone to get that word. It's not that hard to do, but what is the quickest and most efficient way of doing it? Phil Factor investigates.

Textonyms

When you typed in words on a mobile phone’s digital pad to ‘text’ someone, there was always the possibility of a particular combination of digits being ambiguous in that it corresponds to more than one word. Either you had ‘autocorrect’ on, and you chose which word you want, or you had to press repeatedly to cycle through the characters as you typed the word.

These alternative words for the same string of numbers are called textonyms. What percentage of common words in English would suffer from this sort of ambiguity, what are the strings of numbers that represent a large number of textonyms? What ambiguous text strings match words where all the characters are different? What is the largest group of words represented by a texting digit string?

This a very easy problem to solve in SQL Server. The only irritation is that, although the code is relatively simple, and you can go on to make all sorts of interesting discoveries about possible misunderstandings when texting without checking the screen, the initial setup is not quick, because you are having to transform around 60,000 words. In truth this really doesn’t matter a bit but as we so often have to do other similar transformations on data when inserting it into a table, I thought it would be interesting to explore the process to see where one could speed this up.

We’ll assume that the digit keys are mapped to letters as follows:

Finding Textonyms using SQL

Creating the table

We’ll assume that you have a table with all the common words in it. If you haven’t got one of these, the data is in a zipped file with a link at the end of the article. One of the Blog posts linked to in the references section has some hints on how to read this into a database table of common words.

Our first task is to create the table with the 600,000 common words of the English language in it in one column, and the numeric ‘text’ version in the other

Now. Let’s consider how we convert a string into a text version. Doing a single string is pretty trivial.

There is, of course, some disguised iteration here, and this is not a simple expression. Experienced database developers will, if nobody else is looking, get a faster transformation by doing many nested REPLACE functions, one for each letter of the alphabet. This turns it into a simple expression that is far more versatile.

Just to save space, we will imagine for a moment that there were only seven letters in the alphabet. If this were the case it would look like this

The language has become rather weird when restricted to these seven characters but the full expression for the entire alphabet is tiresome to display in the article. Incidentally, we can find just those words that use these seven characters very simply with this expression.

Now, can this first method translate easily to around 60,000 common words in the English language?

We convert both of these into a function and fill the table

Hunting for Textonyms in SQL

Once we have our table, the rest is trivial. We can do all sorts of analysis on the results. First we’ll find what percentage of common words of the English language are unique in terms of their text digits.

85% is a pretty high proportion which is why ‘Texting’ became so popular with phone users.

You can easily find the text digits with the most ambiguity, which represent the most textonyms. The SQL is a bit more complex because there could be more than one in first place.

You can list all the ambiguous strings of text digits if you wish.

We can find all palindromic text versions for words that aren’t palindromes.

We can find all the duplicates that do not have a single letter in common …

… or we can find every pair of textonym that have no character in common.

Alternative ways of doing data transformations. 

There is a lot we can do once we have our table. Before we got started with exploring textonyms, we weren’t bothered about performance as it was a one-off process and though it took several seconds, it was of little concern. The same goes for the SQL that explored the textonyms. However, what is the speediest way of creating a table of common words and their text digits? I didn’t like that function: it seemed as if it would be slow. I’m often creating ETL routines for importing text and doing transformations on them so now we have a decent-sized sample, I thought it would be amusing to find out.

To time an import routine, you really have to time the whole period that it takes from the creation of the table to getting a result. In my case, I needed to time the entire period from start, having nothing but the table of common words, to the point at which we have it confirmed that 85% of them are unique and unambiguous. I tried out five different approaches to the task, but I’ve kept the code separate as a download because it is otherwise a bit distracting. (I actually tried out some other approaches such as a Quirky Update and XML-based approach that were both impossibly slow.)

The first thing to try out was to take away the function entirely, and do the transformation within the SQL expression that inserted the words into our table.

Remember that expression that only did the first seven letters of the alphabet? We expand it to do them all. It’s big and it’s ugly, but it reduces the end-to-end time by around 20%. We reckon that we can do better by doing this simple expression in a table-valued function. Well, it is actually about the same time.

What about moving the logic of the data transformation into the table itself as a calculated column? We can do this easily with a simple expression and by doing so we deduce the time by a half. This is a saving worth pursuing. You can, of course, use a user-defined function in a computed column expression but the drawback to this is that SQL Server chooses a sequential plan rather than the faster parallel plan. Also they are difficult in temporary tables and you have to make them deterministic so that you can then specify constraints on the column or make them persisted. All in all, it is better to keep them to sql scalar expressions if possible. However, scalar functions can be made to work.

Using scalar functions in computed columns

Here is a simple example of this technique

Imagine that we want to have postcode information extracted from an address. We could use a calculated column to do the work so that the column automatically remains correct after any type of changes to the data.

We might do this

We’ve made the function deterministic by using WITH SCHEMABINDING and this has allowed us to persist the calculated data. We could put an index on this column as well.

Trying out the Technique to find Anagrams

We can us the technique to find words that are anagrams of each other.  From end to end is a ten-second wait, but once you have the table then there there is a lot of  ways you can explore the results. I was somewhat surprised to find that the longest number of words that are all anagrams of each other is seven.

Which will give this…

 anagrams

Conclusion.

With any functionality that you put into a database, you have a number of possibilities. Even with this very simple task, importing a table of common words and doing a transformation on it, there were many different ways of doing it. In this case, the use of a computed column turned out to be the best approach but you can’t assume that this is necessarily the general case. If there is a general rule it is that one should avoid duplicating functionality if it can be avoided. However, if code like this goes into a trigger or a computed column, it can be exasperatingly tricky to find in a hurry. The great advantage, though, is that data can be entered via a simple sql expression without any thought to applying the transformation to each row as part of the process. It happens auto-‘magically’. My instinct is to place this type of transformation as close as possible to the data.

References

SQLPromptSmall