Data Science Laboratory System – Testing the Text Tools and Sample Data

Anyone who is frequently faced with preparing data for processing needs to be familiar with some industry-standard text-manipulation tools. Awk, join, sed, find, grep and cat are the classics, and Buck Woody takes them for a spin in his Data Science Laboratory.

This is the second in a series on setting up a Data Science Laboratory server – the first is located here. My plan is to set up a system that allows me to install, test and experiment with various methods of storing, processing and delivering data. These systems range from simple text manipulation to Relational Databases and distributed file and compute environments. Where possible, I will install and configure both the platforms and code locally, while still leveraging a “cloud” platform. My goal is to work with the following types of data systems:

  • Text Systems
  • Interactive Data Tools
  • Programming and Scripting Languages
  • Relational Database Management Systems
  • Key/Value Pair
  • Document Store Databases
  • Graph Databases
  • Object-Oriented Databases
  • Distributed File and Compute Data Handling Systems

I’ll repeat a disclaimer I made in the previous article:

Note: In this article and the ones that follow the information presented is not an endorsement or recommendation to use any particular vendor, software or platform; it is an explanation of the factors that influenced my choices. You can choose any other platform, cloud provider or other software that you like – the only requirement is that it fits your needs. As I always say – use what works for you. You can examine the choices I’ve made here, change the decisions to fit your needs and come up with your own system. The choices here are illustrative only, and not meant to sell you on a software package or vendor.

In the previous article I explained my choices for the platform and the operating system, and I also covered a few tools that I selected for text-based data interaction. In this article, I thought it would make sense to demonstrate some of those text-based tools. This series isn’t intended to be a full tutorial on how to use each of the tools I cover – but examples are essential for some types of learning, so I’ll include them from time to time. It helps to set context with concrete examples.

The purpose of this system is to learn – and to that end, I’ll explain some of the examples I use to do that. There is no substitute for reading up on the tools you install, and experimenting with them until you’re comfortable with what they do. Only then will you know which tool(s) to use to solve a given problem set. Along the way, follow a scientific process: document your controls, variables and experiments carefully, so that others can repeat and verify your work. That documentation also serves to educate others – you could even place it in a blog for quick retrieval and wide publication.

As part of the data lifecycle, one of the first steps is to ensure you understand the data you’re working with. Many of these tools help you do that, starting with a solid understanding of text-based data. I’ll cover that in this series, and then follow up in later articles with examples of testing data using statistical testing, variance, confirmation testing, data spread and so on.

Text Tools Examples

In the last article I installed several tools using the Cygwin suite. In this section I’ll introduce a few that I use most often, and I’ll use them again in future articles as I go.

Regardless of your data sources, you’ll invariably run into situations where you need to work directly with text. Almost every data platform allows you to input, output and manipulate text, and if you’re familiar with working in large database platforms, you might be surprised at the wide array of tasks you can accomplish quickly and easily in a text file.

In all of the examples that follow, you’ll want to read up first about “regular expressions”. This is no trivial task, but almost every text utility you’re liable to work with depends on them. A good place to start is here: http://en.wikipedia.org/wiki/Regular_expression and then follow the links there to learn even more. It will take you some time, but working with regular expressions is a very valuable skill to learn.

What follows is not a tutorial on the utilities I use, but a set of examples of moderate (or perhaps slightly more than moderate) complexity, meant only to pique your interest to learn more. My normal learning process is to take a task I want to accomplish, and then experiment with ways to do that. To that end, I’ll start with the “cat” utility.

The cat utility takes a file and sends it to the standard output, such as the screen. It’s often compared to the MS-DOS command “TYPE”. But even this simple utility has uses for the data scientist when combined with other tools.

Recently I saw a documentary that explored why the detective novelist Agatha Christie was so popular. By analyzing her stories, the researchers were able to find many interesting clues. As part of their analysis, they noticed she wrote in a fairly predictable way – for instance, they found that whenever she mentioned a road as the first method of travel in the book, the murderer was always male. They also found that when she wanted you to really get a point, she used the same word three times in a long sentence. The researchers used a “word frequency count” – a useful tool for showing the number of times each word in a work is used. Never assume that because a utility seems to have a limited scope, it can’t provide big results.

I took the persondata_en.nq file that I mentioned in the last article and ran the cat utility along with the tr, uniq and sort utilities (links to those in the last article) to get a count of how many times each word is used:
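The pipeline looks something like this – tr breaks the text into one word per line, and the sort/uniq/sort combination counts the distinct words and orders them (your exact tokenizing and sort options may differ):

  cat persondata_en.nq | tr -cs '[:alpha:]' '\n' | sort | uniq -c | sort -n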

Here’s a sample of the output:

  • 1696021 ontology
  • 1745043 Reference
  • 1769720 w3
  • 1769805 www
  • 20541043 http
  • 2802116 link
  • 3317302 External

You can see that even in this simple example, the word counts show interesting information. Actually, even though this works, it took far too long – 1 hour and 11 minutes on my test system, to be precise. I’ll fix that later, but it illustrates that even simple tools can be useful for data analysis.

If you’re curious about the command structure I used in this example, see if you can break it down. Along the way you’ll find that there are other ways – actually several other ways – to get this same result. That exploration is part of the purpose of this system.

As I worked through the sample files I downloaded in the last article, I found that I also wanted to work with tables from a Relational Database Management System (RDBMS) product, to show where the tools and concepts I’ll work with on this new system are similar to an RDBMS and where they differ.

I decided to use a very small but relatively complete database – the “pubs” sample database from Sybase and SQL Server. I exported the various tables in that database (without headers) to simulate outputs I might receive from another system. In the appendix at the end of this article I’ll show you the process I followed to get the pubs tables into text files.

The pubs database is a sample of a publishing company, with books, authors, sales and stores. It has multiple joining conditions possible on various key fields, and small, understandable data. I’ll use that here to demonstrate some of the utilities that I find useful for text files. Again – this article isn’t meant as a complete tutorial on these tools, merely an example to show what you can accomplish using only text files.

Here’s a list of the files I created after I followed the process detailed in the appendix:

  • authors.tsv
  • employee.tsv
  • jobs.tsv
  • keysort.authors.tsv
  • keysort.titleauthor.tsv
  • keysort.titles.tsv
  • pub_info.tsv
  • publishers.tsv
  • roysched.tsv
  • sales.tsv
  • store.tsv
  • stores.tsv
  • tempsales.tsv
  • temptitles.tsv
  • titleauthor.tsv
  • titles.tsv
  • schema.txt

The “tables” here have no headers – that’s on purpose, since that simulates a lot of the output I receive from other systems in my day-to-day environments. That’s where the last file, “schema.txt”, comes into play – it’s like the descriptor files that accompanied the Wikipedia and government data downloads in the last article. The contents of schema.txt are:

filename columnname DataType Length numberscale numberprecision
authors.tsv contract bit NULL NULL NULL
authors.tsv au_id varchar 11 NULL NULL
authors.tsv au_lname varchar 40 NULL NULL
authors.tsv au_fname varchar 20 NULL NULL
authors.tsv phone char 12 NULL NULL
authors.tsv address varchar 40 NULL NULL
authors.tsv city varchar 20 NULL NULL
authors.tsv state char 2 NULL NULL
authors.tsv zip char 5 NULL NULL
discounts.tsv lowqty smallint NULL 0 5
discounts.tsv highqty smallint NULL 0 5
discounts.tsv discount decimal NULL 2 4
discounts.tsv discounttype varchar 40 NULL NULL
discounts.tsv stor_id char 4 NULL NULL
employee.tsv job_id smallint NULL 0 5
employee.tsv job_lvl tinyint NULL 0 3
employee.tsv hire_date datetime NULL NULL NULL
employee.tsv pub_id char 4 NULL NULL
employee.tsv emp_id char 9 NULL NULL
employee.tsv fname varchar 20 NULL NULL
employee.tsv minit char 1 NULL NULL
employee.tsv lname varchar 30 NULL NULL
jobs.tsv job_id smallint NULL 0 5
jobs.tsv min_lvl tinyint NULL 0 3
jobs.tsv max_lvl tinyint NULL 0 3
jobs.tsv job_desc varchar 50 NULL NULL
pub_info.tsv logo image 2147483647 NULL NULL
pub_info.tsv pr_info text 2147483647 NULL NULL
pub_info.tsv pub_id char 4 NULL NULL
publishers.tsv pub_id char 4 NULL NULL
publishers.tsv pub_name varchar 40 NULL NULL
publishers.tsv city varchar 20 NULL NULL
publishers.tsv state char 2 NULL NULL
publishers.tsv country varchar 30 NULL NULL
roysched.tsv lorange int NULL 0 10
roysched.tsv hirange int NULL 0 10
roysched.tsv royalty int NULL 0 10
roysched.tsv title_id varchar 6 NULL NULL
sales.tsv ord_date datetime NULL NULL NULL
sales.tsv qty smallint NULL 0 5
sales.tsv payterms varchar 12 NULL NULL
sales.tsv title_id varchar 6 NULL NULL
sales.tsv stor_id char 4 NULL NULL
sales.tsv ord_num varchar 20 NULL NULL
stores.tsv stor_id char 4 NULL NULL
stores.tsv stor_name varchar 40 NULL NULL
stores.tsv stor_address varchar 40 NULL NULL
stores.tsv city varchar 20 NULL NULL
stores.tsv state char 2 NULL NULL
stores.tsv zip char 5 NULL NULL
titleauthor.tsv au_ord tinyint NULL 0 3
titleauthor.tsv royaltyper int NULL 0 10
titleauthor.tsv au_id varchar 11 NULL NULL
titleauthor.tsv title_id varchar 6 NULL NULL
titles.tsv price money NULL 4 19
titles.tsv advance money NULL 4 19
titles.tsv royalty int NULL 0 10
titles.tsv ytd_sales int NULL 0 10
titles.tsv pubdate datetime NULL NULL NULL
titles.tsv notes varchar 200 NULL NULL
titles.tsv title_id varchar 6 NULL NULL
titles.tsv title varchar 80 NULL NULL
titles.tsv type char 12 NULL NULL
titles.tsv pub_id char 4 NULL NULL

With these files in hand, I’ll examine how you can use them in a fashion similar to working with data in an RDBMS. It’s pretty rare to do that, by the way – but it is an interesting exercise nonetheless.

grep

The grep utility searches for patterns. One of the simplest uses is to look for a word in a file. For instance, let’s look for the last name “Ringer” in the authors.tsv file:
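The command itself is a single line – something like this, with authors.tsv in the current directory:

  # show every line in authors.tsv that contains the string Ringer
  grep "Ringer" authors.tsv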

Output:

Note that searching for “ringer” brings back a different result:

There are ways to make the command case-insensitive and more, but for now, you can see that grep searches for characters in a file.

There are some other interesting things you can do with grep, however. I’ll combine it here with the find utility (which finds files) and search all of the files in the pubs directory to see where the book code PS2091 shows up.

The find utility allows you to locate a file and then do something with it – in this case, using grep to search each file (the {} part) and then to show the results:
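The combined command looks something like this – find hands each file name to grep in place of the {} placeholder, grep’s -n switch adds the line number of each match, and -print lists the file name afterwards only when grep found something:

  # search every file under the current directory for PS2091, showing line numbers,
  # then print the name of each file that contained a match
  find . -type f -exec grep -n "PS2091" {} \; -print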

Running that command in the pubs directory returns this output:

  • 16:PS2091 0 1000 10
  • 17:PS2091 1001 5000 12
  • 18:PS2091 5001 10000 14
  • 19:PS2091 10001 50000 16
  • ./roysched.tsv
  • 2:6380 722a 1994-09-13 00:00:00.000 3 Net 60 PS2091
  • 4:7066 QA7442.3 1994-09-13 00:00:00.000 75 ON invoice PS2091
  • 5:7067 D4482 1994-09-14 00:00:00.000 10 Net 60 PS2091
  • 9:7131 N914008 1994-09-14 00:00:00.000 20 Net 30 PS2091
  • ./sales.tsv
  • 2:6380 722a 1994-09-13 00:00:00.000 3 Net 60 PS2091
  • 4:7066 QA7442.3 1994-09-13 00:00:00.000 75 ON invoice PS2091
  • 5:7067 D4482 1994-09-14 00:00:00.000 10 Net 60 PS2091
  • 9:7131 N914008 1994-09-14 00:00:00.000 20 Net 30 PS2091
  • ./tempsales.tsv
  • 12:PS2091 Is Anger the Enemy? psychology 0736 10.9500 2275.0000 12 2045 Carefully researched study of the effects of strong emotions on the body. Metabolic charts included. 1991-06-15 00:00:00.000
  • ./temptitles.tsv
  • 23:899-46-2035 PS2091 2 50
  • 24:998-72-3567 PS2091 1 50
  • ./titleauthor.tsv
  • 12:PS2091 Is Anger the Enemy? psychology 0736 10.95 2275.00 12 2045 Carefully researched study of the effects of strong emotions on the body. Metabolic charts included. 1991-06-15 00:00:00.000
  • ./titles.tsv

Each line containing the search term is displayed, and the name of the matching file appears just below its group of results. Very powerful!

You can read more about the grep utility here: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html.

sed

The sed utility also searches within files, but in addition it can change one value to another. You can use regular-expression searches, and it sends the result to standard output, which by default is the screen.

This example, for instance, finds the string “Albert” in the authors.tsv file, then replaces Albert’s social security number with 555-255-5555. In this case, it doesn’t actually permanently change anything – it just shows that on the screen:
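One way to write it is shown below: the /Albert/ address selects lines containing that name, the substitution swaps anything shaped like a social security number (three digits, two digits, four digits, separated by dashes) for the new value, and -n with the trailing p prints only the lines that were changed:

  sed -n '/Albert/ s/[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}/555-255-5555/p' authors.tsv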

Trimmed output:

You can learn more about the sed utility here: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html

join

The join utility joins two files together based on a common set of characters. It’s important to understand that this isn’t a full JOIN operation in an RDBMS – it’s a simple string match between two files.

But there are a couple of (very important) caveats in this process, and they illustrate why we have things like an RDBMS. The join utility requires sorted input – it’s a simple pattern match. Not only that, simply sorting a file that mixes character types and other content doesn’t always work the way you might expect.

For instance, to join the authors.tsv file and the titleauthor.tsv file, I first had to sort them – but even that didn’t produce the results I wanted, because the first “field” is actually a set of numbers with dashes in them. As a quick workaround, I replaced the tabs with the tilde character (~) and piped all that out to another file, using the sed utility:
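The conversion and sort can be done in one pass per file – something like this, writing to the keysort.* files listed above:

  # convert tabs to tildes, sort, and write the result to a new file
  sed 's/\t/~/g' authors.tsv | sort > keysort.authors.tsv
  sed 's/\t/~/g' titleauthor.tsv | sort > keysort.titleauthor.tsv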

This produces a horribly ugly output, but I’ll fix it in a moment. Now I can use the join utility and get what I want:
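With both files converted and sorted on the author id in their first field, the join looks something like this:

  # join the two files on their first ~-separated field
  join -t '~' keysort.authors.tsv keysort.titleauthor.tsv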

And to clean it I’ll use sed again:
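Piping the join through sed once more turns the tildes back into tabs – along these lines:

  join -t '~' keysort.authors.tsv keysort.titleauthor.tsv | sed 's/~/\t/g'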

Is this the most efficient way to join files? No! There are lots of far better ways to do this, but in this little experiment I learned more about the source of the data than simply looking at it. That’s the point of building a lab system – it also demonstrates the power of these small utilities, and that often you need to layer them to get what you’re after.

You can learn more about the join utility here: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/join.html

awk

I’ve explained that grep is a simple search tool, and sed can search and manipulate data. For power-searching, however, you need something that supports more programmatic structures, such as looping. For that, the awk utility is quite useful.

The awk utility can take files or even streams of data and perform actions on each line. It’s also very useful for working with columnar data – something I deal with all the time. I’ll show an example of that in a moment.

Learning to use the awk utility well can be a bit daunting – but the general pattern is this:
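In outline, it’s a pattern and an action applied to a file or stream – something like this:

  # "pattern" selects which input lines to act on; "action" says what to do with them
  awk "pattern { action }" inputfile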

(single quotes can be used as well as double quotes: ‘ )

These programs often begin with the word BEGIN, whose action runs before any input is read and doesn’t need a condition at all – so a small Hello World program would look like this:
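Something along these lines prints the greeting and exits without reading any input:

  awk 'BEGIN { print "Hello World" }'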

Of course that’s not terribly useful. But let’s return to that join operation from before. Because awk loops over every input line on its own, I can use columns to join on. To be consistent, I’ll join on the first columns of the titleauthor.tsv and authors.tsv files again. This time, I don’t need to change the files at all – awk can handle columns separated by tab characters.

In this example, I’ll use two interesting constructs from awk for the join. NR is a variable holding the number of records (lines) read so far across all input files; FNR holds the record number within the current file only. While the first file is being read the two are equal, so a comparison like NR==FNR acts as an “outside” condition that tells awk which file it is working on. The $1, $2 and $0 variables refer to the current record – $1 and $2 are the first and second columns, and $0 is the whole line being compared.

So I’ll use all that info to:

  • Use titleauthor.tsv as the “master” comparison
  • Use authors.tsv as the data being compared
  • Show all info from the authors.tsv file
  • Show only the title number (2nd field) from the titleauthor.tsv file

Here’s the command:
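Assuming au_id is the first tab-separated column in both files, it looks something like this (note that this simple form stores only one title_id per author, so an author with several titles keeps just the last one read):

  # first pass (NR==FNR): read titleauthor.tsv and remember the title_id ($2) for each au_id ($1)
  # second pass: for each authors.tsv line whose au_id is in the array, print the whole
  # line ($0) followed by the stored title_id
  awk -F'\t' 'NR==FNR { title[$1] = $2; next } $1 in title { print $0, title[$1] }' titleauthor.tsv authors.tsv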

Can you work out from this example what is happening line by line? Can you think of a way to join more than two tables using awk? (Hint: it’s messy.) Do you have a more efficient way of doing this?

If you want to learn more about awk, start here: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html.

In the next article I’ll turn on some instrumentation for the workstation – not to measure performance directly, but to find which program, construct, process or set of processes places the smallest load on the system. It will also show how to evaluate which system components are being exercised the most, such as CPU, memory, I/O or network. I’ll also introduce a few interactive tools that I use to work with data.

Appendix: Creating the pubs Text Files (or just download them)

I wanted to be able to compare some data operations across all the systems I’ll use – from text, to an RDBMS, NoSQL, Hadoop and other platforms. There’s a bit of a danger there, in that each of these tools has advantages and disadvantages for working with various types of data, but this is a learning exercise.

If you do have access to SQL Server, you can read more about the pubs database here, and in that article you’ll find out where to obtain it.

With the pubs database installed on your database server Instance, drop to a command-line on that server and run the following commands – replace INSTANCE with the name of your server/instance. It assumes you have a c:\temp directory as well, and that you’re using a Windows account that has access to the Instance and the database:

  • BCP pubs..authors out c:\temp\authors.tsv -c -T -SINSTANCE
  • BCP pubs..discounts out c:\temp\discounts.tsv -c -T -SINSTANCE
  • BCP pubs..employee out c:\temp\employee.tsv -c -T -SINSTANCE
  • BCP pubs..jobs out c:\temp\jobs.tsv -c -T -SINSTANCE
  • BCP pubs..pub_info out c:\temp\pub_info.tsv -c -T -SINSTANCE
  • BCP pubs..publishers out c:\temp\publishers.tsv -c -T -SINSTANCE
  • BCP pubs..roysched out c:\temp\roysched.tsv -c -T -SINSTANCE
  • BCP pubs..sales out c:\temp\sales.tsv -c -T -SINSTANCE
  • BCP pubs..store out c:\temp\store.tsv -c -T -SINSTANCE
  • BCP pubs..titleauthor out c:\temp\titleauthor.tsv -c -T -SINSTANCE
  • BCP pubs..titles out c:\temp\titles.tsv -c -T -SINSTANCE

This brings the data out without headers – a common way to receive data. But I do want to know what is in each of these files, so I opened SQL Server Management Studio (SSMS) and ran the following query. When I was done, I saved the results as schema.txt:
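A query against INFORMATION_SCHEMA.COLUMNS returns that kind of information; something along these lines, run in the pubs database, produces columns matching the schema.txt header:

  -- one row per column in every table of the current database
  SELECT  TABLE_NAME + '.tsv'       AS filename,
          COLUMN_NAME               AS columnname,
          DATA_TYPE                 AS DataType,
          CHARACTER_MAXIMUM_LENGTH  AS Length,
          NUMERIC_SCALE             AS numberscale,
          NUMERIC_PRECISION         AS numberprecision
  FROM    INFORMATION_SCHEMA.COLUMNS
  ORDER BY TABLE_NAME, ORDINAL_POSITION;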

Save these files in your data directory on your lab system and you’ll be ready to work with the examples I explain in this series.