Before you report your conclusions about your data, have you checked whether your 'actionable' figures occurred by chance? The Kruskal-Wallis test is a safe way of determining whether samples come from the same population, because it is simple and doesn't rely on a normal distribution in the population. This gives you a measure of confidence that your results are 'significant'. Phil Factor explains how to do it.
Distributed File Databases manage large amounts of unstructured or semi-structured data. They are designed on the principle of splitting the data across multiple locations, and then placing the code that processes each fragment close to, or directly on, that location. Buck Woody shows how to install Hadoop in your Data Science lab to experiment with an example of the breed.
Object-Oriented Databases (OOD) avoid the object-relational impedance mismatch altogether by integrating so tightly with the user-level OOP code that they are simply an engine that ships with the code itself. The developer is able to instantiate OOD objects directly in the code. Buck Woody explores the Object-Oriented breed of database in his Data Science lab.
A Document Store Database (DSD) is similar to a Relational Database Management System, except that a DSD allows for unstructured data and for sharding a single database across multiple machines. So when or why would you choose a document database over a relational one? Buck Woody has the answer and an example using the DSD MongoDB on his lab system.
Though the Key/Value pair paradigm is common to almost every computer language, there is no clear agreement yet on the definition of a Key/Value pair database. However, Key/Value pair databases are valuable for special applications where speed of writing data is more important than searching and general versatility. They are certainly worth experimenting with in a data science lab.
There is no better way of understanding new data processing, retrieval, analysis or visualisation techniques than actually trying things out in a lab system. Buck Woody continues his series by explaining why an RDBMS is essential for a lab, what that is, and how to install SQL Server into the lab.
Although every computer language is suitable for data, some languages lend themselves especially well to working with certain types or sources of data, or to processing the data in certain ways, and so are of particular use to the data scientist.
Data tools interact directly with data and are great for automating data acquisition, but they aren't always the best way to prototype or pilot a process. Interactive data tools also allow you to test and refine the process until it is ripe for automation.
Anyone who is frequently faced with preparing data for processing needs to be familiar with some industry-standard text-manipulation tools. Awk, join, sed, find, grep and cat are the classics, and Buck Woody takes them for a spin in his Data Science Laboratory.
Hadoop and MapReduce have good prospects for adoption as a standard for big data analysis, especially since their adoption by Microsoft. They are ideal for Cloud usage, since one can spin up nodes when required and pay only for storage and compute services whilst they are running. Roger Jennings describes how to get it running on Azur…
If you are seeking to analyse very large sets of data, and need a highly parallel, rapid way of doing it that scales to your requirements, then 'Cloud Numerics' from Microsoft may be the answer to your prayers.