Data Science Laboratory System – Testing the Text Tools and Sample Data

Anyone who is frequently faced with preparing data for processing needs to be familiar with some industry-standard text-manipulation tools. Awk, join, sed, find, grep and cat are the classics, and Buck Woody takes them for a spin in his Data Science Laboratory.

This is the second in a series on setting up a Data Science Laboratory server – the first is located here. My plan is to set up a system that allows me to install, test and experiment with various methods of storing, processing and delivering data. These systems range from simple text manipulation to Relational Databases and distributed file and compute environments. Where possible, I will install and configure both the platforms and code locally, while still leveraging a “cloud” platform. My goal is to work with the following types of data systems:

  • Text Systems
  • Interactive Data Tools
  • Programming and Scripting Languages
  • Relational Database Management Systems
  • Key/Value Pair
  • Document Store Databases
  • Graph Databases
  • Object-Oriented Databases
  • Distributed File and Compute Data Handling Systems

I’ll repeat a disclaimer I made in the previous article:

Note: In this article and the ones that follow the information presented is not an endorsement or recommendation to use any particular vendor, software or platform; it is an explanation of the factors that influenced my choices. You can choose any other platform, cloud provider or other software that you like – the only requirement is that it fits your needs. As I always say – use what works for you. You can examine the choices I’ve made here, change the decisions to fit your needs and come up with your own system. The choices here are illustrative only, and not meant to sell you on a software package or vendor.

In the previous article I explained my choices for the platform and the operating system, and I also covered a few tools that I selected for text-based data interaction. In this article, I thought it would make sense to demonstrate some of those text-based tools. This series isn’t intended to be a full tutorial on how to use each of the tools I cover – but examples are essential for some types of learning, so I’ll include them from time to time. It helps to set context with concrete examples.

The purpose of this system is to learn – and to that end, I’ll explain some of the examples I use to do that. There is no substitute for reading up on the tools you install, and experimenting with them until you’re comfortable with what they do. Only then will you know which tool(s) to use to solve a given problem set. Along the way, follow a scientific process: document your controls, variables and experiments carefully, so that others can repeat and verify your work. That documentation also serves to educate others – you could even place it in a blog for quick retrieval and wide publication.

As part of the data lifecycle, one of the first steps is to ensure you understand the data you’re working with. Many of these tools help you do that, starting with a solid understanding of text-based data. I’ll cover that in this series, and then follow up in later articles with examples of testing data using statistical testing, variance, confirmation testing, data spread and so on.

Text Tools Examples

In the last article I installed several tools using the Cygwin suite. In this section I’ll introduce a few that I use most often, and I’ll use them again in future articles as I go.

Regardless of your data sources, you’ll invariably run into situations where you need to work directly with text. Almost every data platform allows you to input, output and manipulate text, and if you’re familiar with working in large database platforms, you might be surprised at the wide array of tasks you can accomplish quickly and easily in a text file.

In all of the examples that follow, you’ll want to read up first about “regular expressions”. This is no trivial task, but almost every text utility you’re liable to work with depends on them. A good place to start is here: http://en.wikipedia.org/wiki/Regular_expression and then follow the links there to learn even more. It will take you some time, but working with regular expressions is a very valuable skill to learn.

What follows is not a tutorial on the utilities I use, but a set of examples of moderate (or perhaps slightly more than moderate) complexity, meant only to pique your interest to learn more. My normal learning process is to take a task I want to accomplish, and then experiment with ways to do that. To that end, I’ll start with the “cat” utility.

The cat utility takes a file and sends it to the standard output, such as the screen. It’s often compared to the MS-DOS command “TYPE”. But even this simple utility has uses for the data scientist when combined with other tools.

Recently I saw a documentary that explored why the detective novelist Agatha Christie was so popular. By analyzing her stories, the researchers were able to find many interesting clues. As part of their analysis, they noticed she wrote in a fairly predictable way – for instance, they found that whenever she mentioned a road as the first method of travel in the book, the murderer was always male. They also found that when she wanted you to really get a point, she used the same word three times in a long sentence. The researchers used a “word frequency count” – a useful tool for showing the number of times each word in a work is used. Never assume that because a utility seems to have a limited scope, it can’t provide big results.

I took the persondata_en.nq file that I mentioned in the last article and ran the cat utility along with the tr, uniq and sort utilities (links to those in the last article) to get a count of how many times each word is used:
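The pipeline looks something like this – tr breaks the text into one word per line, and the sort/uniq/sort combination counts the distinct words and orders them (your exact tokenizing and sort options may differ):

  cat persondata_en.nq | tr -cs '[:alpha:]' '\n' | sort | uniq -c | sort -n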

Here’s a sample of the output:

  • 1696021 ontology
  • 1745043 Reference
  • 1769720 w3
  • 1769805 www
  • 20541043 http
  • 2802116 link
  • 3317302 External

You can see that even in this simple example, the word counts show interesting information. Actually, even though this works, it took far too long – 1 hour and 11 minutes on my test system, to be precise. I’ll fix that later, but it illustrates that even simple tools can be useful for data analysis.

If you’re curious about the command structure I used in this example, see if you can break it down. Along the way you’ll find that there are other ways – actually several other ways – to get this same result. That exploration is part of the purpose of this system.

As I worked through the sample files I downloaded in the last article, I found that I also wanted to work with tables from a Relational Database Management System (RDBMS) product, to show where the tools and concepts I’ll work with on this new system are similar to an RDBMS and where they differ.

I decided to use a very small but relatively complete database – the “pubs” sample database from Sybase and SQL Server. I exported the various tables in that database (without headers) to simulate outputs I might receive from another system. In the appendix at the end of this article I’ll show you the process I followed to get the pubs tables into text files.

The pubs database is a sample of a publishing company, with books, authors, sales and stores. It has multiple joining conditions possible on various key fields, and small, understandable data. I’ll use that here to demonstrate some of the utilities that I find useful for text files. Again – this article isn’t meant as a complete tutorial on these tools, merely an example to show what you can accomplish using only text files.

Here’s a list of the files I created after I followed the process detailed in the appendix:

  • authors.tsv
  • employee.tsv
  • jobs.tsv
  • keysort.authors.tsv
  • keysort.titleauthor.tsv
  • keysort.titles.tsv
  • pub_info.tsv
  • publishers.tsv
  • roysched.tsv
  • sales.tsv
  • store.tsv
  • stores.tsv
  • tempsales.tsv
  • temptitles.tsv
  • titleauthor.tsv
  • titles.tsv
  • schema.txt

The “tables” here have no headers – that’s on purpose, since that simulates a lot of the output I receive from other systems in my day-to-day environments. That’s where the last file, “schema.txt”, comes into play – it’s like the descriptor files that accompanied the Wikipedia and government data downloads in the last article. The contents of schema.txt are:

filename columnname DataType Length numberscale numberprecision
authors.tsv contract bit NULL NULL NULL
authors.tsv au_id varchar 11 NULL NULL
authors.tsv au_lname varchar 40 NULL NULL
authors.tsv au_fname varchar 20 NULL NULL
authors.tsv phone char 12 NULL NULL
authors.tsv address varchar 40 NULL NULL
authors.tsv city varchar 20 NULL NULL
authors.tsv state char 2 NULL NULL
authors.tsv zip char 5 NULL NULL
discounts.tsv lowqty smallint NULL 0 5
discounts.tsv highqty smallint NULL 0 5
discounts.tsv discount decimal NULL 2 4
discounts.tsv discounttype varchar 40 NULL NULL
discounts.tsv stor_id char 4 NULL NULL
employee.tsv job_id smallint NULL 0 5
employee.tsv job_lvl tinyint NULL 0 3
employee.tsv hire_date datetime NULL NULL NULL
employee.tsv pub_id char 4 NULL NULL
employee.tsv emp_id char 9 NULL NULL
employee.tsv fname varchar 20 NULL NULL
employee.tsv minit char 1 NULL NULL
employee.tsv lname varchar 30 NULL NULL
jobs.tsv job_id smallint NULL 0 5
jobs.tsv min_lvl tinyint NULL 0 3
jobs.tsv max_lvl tinyint NULL 0 3
jobs.tsv job_desc varchar 50 NULL NULL
pub_info.tsv logo image 2147483647 NULL NULL
pub_info.tsv pr_info text 2147483647 NULL NULL
pub_info.tsv pub_id char 4 NULL NULL
publishers.tsv pub_id char 4 NULL NULL
publishers.tsv pub_name varchar 40 NULL NULL
publishers.tsv city varchar 20 NULL NULL
publishers.tsv state char 2 NULL NULL
publishers.tsv country varchar 30 NULL NULL
roysched.tsv lorange int NULL 0 10
roysched.tsv hirange int NULL 0 10
roysched.tsv royalty int NULL 0 10
roysched.tsv title_id varchar 6 NULL NULL
sales.tsv ord_date datetime NULL NULL NULL
sales.tsv qty smallint NULL 0 5
sales.tsv payterms varchar 12 NULL NULL
sales.tsv title_id varchar 6 NULL NULL
sales.tsv stor_id char 4 NULL NULL
sales.tsv ord_num varchar 20 NULL NULL
stores.tsv stor_id char 4 NULL NULL
stores.tsv stor_name varchar 40 NULL NULL
stores.tsv stor_address varchar 40 NULL NULL
stores.tsv city varchar 20 NULL NULL
stores.tsv state char 2 NULL NULL
stores.tsv zip char 5 NULL NULL
titleauthor.tsv au_ord tinyint NULL 0 3
titleauthor.tsv royaltyper int NULL 0 10
titleauthor.tsv au_id varchar 11 NULL NULL
titleauthor.tsv title_id varchar 6 NULL NULL
titles.tsv price money NULL 4 19
titles.tsv advance money NULL 4 19
titles.tsv royalty int NULL 0 10
titles.tsv ytd_sales int NULL 0 10
titles.tsv pubdate datetime NULL NULL NULL
titles.tsv notes varchar 200 NULL NULL
titles.tsv title_id varchar 6 NULL NULL
titles.tsv title varchar 80 NULL NULL
titles.tsv type char 12 NULL NULL
titles.tsv pub_id char 4 NULL NULL

With these files in hand, I’ll examine how you can use them in a fashion similar to working with data in an RDBMS. It’s pretty rare to do that, by the way – but it is an interesting exercise nonetheless.

grep

The grep utility searches for patterns. One of the simplest uses is to look for a word in a file. For instance, let’s look for the last name “Ringer” in the authors.tsv file:
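The command itself is a single line – something like this, with authors.tsv in the current directory:

  # show every line in authors.tsv that contains the string Ringer
  grep "Ringer" authors.tsv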

Output:

Note that searching for “ringer” brings back a different result:

There are ways to make the command case-insensitive and more, but for now, you can see that grep searches for characters in a file.

There are some other interesting things you can do with grep, however. I’ll combine it here with the find utility (which finds files) and search all of the files in the pubs directory to see where the book code PS2091 shows up.

The find utility allows you to locate a file and then do something with it – in this case, using grep to search each file (the {} part) and then to show the results:
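The combined command looks something like this – find hands each file name to grep in place of the {} placeholder, grep’s -n switch adds the line number of each match, and -print lists the file name afterwards only when grep found something:

  # search every file under the current directory for PS2091, showing line numbers,
  # then print the name of each file that contained a match
  find . -type f -exec grep -n "PS2091" {} \; -print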

Running that command in the pubs directory returns this output:

  • 16:PS2091 0 1000 10
  • 17:PS2091 1001 5000 12
  • 18:PS2091 5001 10000 14
  • 19:PS2091 10001 50000 16
  • ./roysched.tsv
  • 2:6380 722a 1994-09-13 00:00:00.000 3 Net 60 PS2091
  • 4:7066 QA7442.3 1994-09-13 00:00:00.000 75 ON invoice PS2091
  • 5:7067 D4482 1994-09-14 00:00:00.000 10 Net 60 PS2091
  • 9:7131 N914008 1994-09-14 00:00:00.000 20 Net 30 PS2091
  • ./sales.tsv
  • 2:6380 722a 1994-09-13 00:00:00.000 3 Net 60 PS2091
  • 4:7066 QA7442.3 1994-09-13 00:00:00.000 75 ON invoice PS2091
  • 5:7067 D4482 1994-09-14 00:00:00.000 10 Net 60 PS2091
  • 9:7131 N914008 1994-09-14 00:00:00.000 20 Net 30 PS2091
  • ./tempsales.tsv
  • 12:PS2091 Is Anger the Enemy? psychology 0736 10.9500 2275.0000 12 2045 Carefully researched study of the effects of strong emotions on the body. Metabolic charts included. 1991-06-15 00:00:00.000
  • ./temptitles.tsv
  • 23:899-46-2035 PS2091 2 50
  • 24:998-72-3567 PS2091 1 50
  • ./titleauthor.tsv
  • 12:PS2091 Is Anger the Enemy? psychology 0736 10.95 2275.00 12 2045 Carefully researched study of the effects of strong emotions on the body. Metabolic charts included. 1991-06-15 00:00:00.000
  • ./titles.tsv

Each line containing the search term is displayed, and the name of the matching file appears just below its group of results. Very powerful!

You can read more about the grep utility here: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html.

sed

The sed utility also searches within files, but in addition it can change one value to another. You can use regular-expression searches, and it sends the result to standard output, which by default is the screen.

This example, for instance, finds the string “Albert” in the authors.tsv file, then replaces Albert’s social security number with 555-255-5555. In this case, it doesn’t actually permanently change anything – it just shows that on the screen:
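One way to write it is shown below: the /Albert/ address selects lines containing that name, the substitution swaps anything shaped like a social security number (three digits, two digits, four digits, separated by dashes) for the new value, and -n with the trailing p prints only the lines that were changed:

  sed -n '/Albert/ s/[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}/555-255-5555/p' authors.tsv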

Trimmed output:

You can learn more about the sed utility here: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html

join

The join utility joins two files together based on a common set of characters. It’s important to understand that this isn’t a full JOIN operation in an RDBMS – it’s a simple string match between two files.

But there are a couple of (very important) caveats in this process, and they illustrate why we have things like an RDBMS. The join utility requires sorted input – it’s a simple pattern match. Not only that, simply sorting a file that mixes character types and other content doesn’t always work the way you might expect.

For instance, to join the authors.tsv file and the titleauthor.tsv file, I first had to sort them – but even that didn’t produce the results I wanted, because the first “field” is actually a set of numbers with dashes in them. As a quick workaround, I replaced the tabs with the tilde character (~) and piped all that out to another file, using the sed utility:
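The conversion and sort can be done in one pass per file – something like this, writing to the keysort.* files listed above:

  # convert tabs to tildes, sort, and write the result to a new file
  sed 's/\t/~/g' authors.tsv | sort > keysort.authors.tsv
  sed 's/\t/~/g' titleauthor.tsv | sort > keysort.titleauthor.tsv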

This produces a horribly ugly output, but I’ll fix it in a moment. Now I can use the join utility and get what I want:
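With both files converted and sorted on the author id in their first field, the join looks something like this:

  # join the two files on their first ~-separated field
  join -t '~' keysort.authors.tsv keysort.titleauthor.tsv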

And to clean it I’ll use sed again:
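Piping the join through sed once more turns the tildes back into tabs – along these lines:

  join -t '~' keysort.authors.tsv keysort.titleauthor.tsv | sed 's/~/\t/g'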

Is this the most efficient way to join files? No! There are lots of far better ways to do this, but in this little experiment I learned more about the source of the data than simply looking at it. That’s the point of building a lab system – it also demonstrates the power of these small utilities, and that often you need to layer them to get what you’re after.

You can learn more about the join utility here: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/join.html

awk

I’ve explained that grep is a simple search tool, and sed can search and manipulate data. For power-searching, however, you need something that supports more programmatic structures, such as looping. For that, the awk utility is quite useful.

The awk utility can take files or even streams of data and perform actions on each line. It’s also very useful for working with columnar data – something I deal with all the time. I’ll show an example of that in a moment.

Learning to use the awk utility well can be a bit daunting – but the general pattern is this:
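In outline, it’s a pattern and an action applied to a file or stream – something like this:

  # "pattern" selects which input lines to act on; "action" says what to do with them
  awk "pattern { action }" inputfile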

(single quotes can be used as well as double quotes: ‘ )

These programs often begin with the word BEGIN, whose action runs before any input is read and doesn’t need a condition at all – so a small Hello World program would look like this:
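Something along these lines prints the greeting and exits without reading any input:

  awk 'BEGIN { print "Hello World" }'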

Of course that’s not terribly useful. But let’s return to that join operation from before. Because awk loops over every input line on its own, I can use columns to join on. To be consistent, I’ll join on the first columns of the titleauthor.tsv and authors.tsv files again. This time, I don’t need to change the files at all – awk can handle columns separated by tab characters.

In this example, I’ll use two interesting constructs from awk for the join. NR is a variable holding the number of records (lines) read so far across all input files; FNR holds the record number within the current file only. While the first file is being read the two are equal, so a comparison like NR==FNR acts as an “outside” condition that tells awk which file it is working on. The $1, $2 and $0 variables refer to the current record – $1 and $2 are the first and second columns, and $0 is the whole line being compared.

So I’ll use all that info to:

  • Use titleauthor.tsv as the “master” comparison
  • Use authors.tsv as the data being compared
  • Show all info from the authors.tsv file
  • Show only the title number (2nd field) from the titleauthor.tsv file

Here’s the command:
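Assuming au_id is the first tab-separated column in both files, it looks something like this (note that this simple form stores only one title_id per author, so an author with several titles keeps just the last one read):

  # first pass (NR==FNR): read titleauthor.tsv and remember the title_id ($2) for each au_id ($1)
  # second pass: for each authors.tsv line whose au_id is in the array, print the whole
  # line ($0) followed by the stored title_id
  awk -F'\t' 'NR==FNR { title[$1] = $2; next } $1 in title { print $0, title[$1] }' titleauthor.tsv authors.tsv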

Can you work out from this example what is happening line by line? Can you think of a way to join more than two tables using awk? (Hint: it’s messy.) Do you have a more efficient way of doing this?

If you want to learn more about awk, start here: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html.

In the next article I’ll turn on some instrumentation for the workstation – not to measure performance directly, but to find which program, construct, process or set of processes places the smallest load on the system. It will also show how to evaluate which system components are being exercised the most, such as CPU, memory, I/O or network. I’ll also introduce a few interactive tools that I use to work with data.

Appendix: Creating the pubs Text Files (or just download them)

I wanted to be able to compare some data operations across all the systems I’ll use – from text, to an RDBMS, NoSQL, Hadoop and other platforms. There’s a bit of a danger there, in that each of these tools has advantages and disadvantages for working with various types of data, but this is a learning exercise.

If you do have access to SQL Server, you can read more about the pubs database here, and in that article you’ll find out where to obtain it.

With the pubs database installed on your database server Instance, drop to a command-line on that server and run the following commands – replace INSTANCE with the name of your server/instance. It assumes you have a c:\temp directory as well, and that you’re using a Windows account that has access to the Instance and the database:

  • BCP pubs..authors out c:\temp\authors.tsv -c -T -SINSTANCE
  • BCP pubs..discounts out c:\temp\discounts.tsv -c -T -SINSTANCE
  • BCP pubs..employee out c:\temp\employee.tsv -c -T -SINSTANCE
  • BCP pubs..jobs out c:\temp\jobs.tsv -c -T -SINSTANCE
  • BCP pubs..pub_info out c:\temp\pub_info.tsv -c -T -SINSTANCE
  • BCP pubs..publishers out c:\temp\publishers.tsv -c -T -SINSTANCE
  • BCP pubs..roysched out c:\temp\roysched.tsv -c -T -SINSTANCE
  • BCP pubs..sales out c:\temp\sales.tsv -c -T -SINSTANCE
  • BCP pubs..store out c:\temp\store.tsv -c -T -SINSTANCE
  • BCP pubs..titleauthor out c:\temp\titleauthor.tsv -c -T -SINSTANCE
  • BCP pubs..titles out c:\temp\titles.tsv -c -T -SINSTANCE

This brings the data out without headers – a common way to receive data. But I do want to know what is in each of these files, so I opened SQL Server Management Studio (SSMS) and ran the following query. When I was done, I saved the results as schema.txt:
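A query against INFORMATION_SCHEMA.COLUMNS returns that kind of information; something along these lines, run in the pubs database, produces columns matching the schema.txt header:

  -- one row per column in every table of the current database
  SELECT  TABLE_NAME + '.tsv'       AS filename,
          COLUMN_NAME               AS columnname,
          DATA_TYPE                 AS DataType,
          CHARACTER_MAXIMUM_LENGTH  AS Length,
          NUMERIC_SCALE             AS numberscale,
          NUMERIC_PRECISION         AS numberprecision
  FROM    INFORMATION_SCHEMA.COLUMNS
  ORDER BY TABLE_NAME, ORDINAL_POSITION;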

Save these files in your data directory on your lab system and you’ll be ready to work with the examples I explain in this series.