Storage 101: RAID

RAID has been around since the 90s to ensure performance and reliability of storage. Robert Sheldon explains the history and theory behind RAID.

The series so far:

  1. Storage 101: Welcome to the Wonderful World of Storage
  2. Storage 101: The Language of Storage
  3. Storage 101: Understanding the Hard-Disk Drive 
  4. Storage 101: Understanding the NAND Flash Solid State Drive
  5. Storage 101: Data Center Storage Configurations
  6. Storage 101: Modern Storage Technologies
  7. Storage 101: Convergence and Composability 
  8. Storage 101: Cloud Storage
  9. Storage 101: Data Security and Privacy 
  10. Storage 101: The Future of Storage
  11. Storage 101: Monitoring storage metrics
  12. Storage 101: RAID

Organizations have been turning to RAID (redundant array of independent disks) since the 1990s to support their application storage. A RAID device comprises multiple disk drives that provide a unified storage solution capable of delivering greater performance and fault tolerance than an individual disk. The use of multiple drives makes it possible to employ techniques such as striping, mirroring, and parity, while delivering an integrated platform that an operating system (OS) sees as a single logical drive.

The idea of distributing data across multiple disks has been around for many decades, but it wasn’t until 1988 that the concept of RAID was formalized by David A. Patterson, Garth Gibson, and Randy H. Katz in their seminal report, A Case for Redundant Arrays of Inexpensive Disks (RAID). The report introduced five levels of RAID and described their relative cost and performance.

Soon after publication, RAID quickly solidified as a valuable option for efficiently storing data. However, it also became apparent that the arrays were not as inexpensive as the original title suggested, so the founders soon adopted the name redundant array of independent disks. From these beginnings, RAID’s popularity quickly grew. Between 1990 and 2002, vendors sold over $150 billion in RAID storage devices, according to Katz.

RAID’s popularity continues to this day. Although it doesn’t replace a comprehensive backup strategy, most levels offer some level of redundancy for providing fault tolerance. In addition, RAID can also help improve performance, depending on the configuration and supported workloads. Plus, RAID makes it possible to deliver greater capacities because it uses multiple drives. Together, these advantages continue to afford RAID a prominent place in data centers and other environments where data is stored.

Digging into RAID

RAID works by spreading data across multiple disks and presenting those disks as a single logical drive. The way in which data is distributed depends on the RAID configuration, which is indicated by the RAID level, such as RAID 1 or RAID 5. Each level uses one or more of the following technologies to provide fault tolerance or improve performance:

  • Striping. Logically sequential data such as a file is segmented into multiple blocks of a specific size and distributed across the disks in the array. The data can also be split at the bit or byte level, rather than block. Because the data is distributed, it can be read from and written to multiple disks simultaneously, significantly improving read and write performance, depending on the workloads.
  • Mirroring. Data is replicated to two or more disks during write operations, providing redundancy and ensuring availability in the event of disk failure or data corruption on one of the disks. Mirroring can often improve read performance, but it generally has minimal impact on write performance.
  • Parity. The storage controller performs exclusive OR (XOR) comparisons on the striped data across an array’s drives and stores the results of those calculations either on the same drives or on a separate drive. The parity data can then be used to reconstruct the primary data if one of the drives fails. Although parity requires extra disk space, it’s typically less than what’s needed for mirroring, and it still offers fault tolerance. However, write performance is compromised owing to the requirements of the parity calculation—for each application write, four physical I/Os are required, two reads and two writes. This is commonly referred to as the “RAID 5 write penalty”.
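
The parity technique described above can be illustrated with a minimal Python sketch. It assumes a single stripe of three data blocks plus one parity block; the helper name xor_blocks and the sample block contents are hypothetical, and a real controller works on much larger blocks in hardware or in the OS storage stack.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks, each held on a different (hypothetical) disk.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate losing the disk that held the second block, then rebuild that
# block by XOR-ing the surviving data blocks with the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, any single missing block in the stripe can be recovered from the remaining blocks, which is why single parity tolerates one drive failure but not two.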

For the most part, mirroring and parity are mutually exclusive, which means a RAID level will use only one or two of the distribution technologies, but not three. However, data distribution is only part of the story. A RAID device can also be implemented as either a hardware-based or software-based solution:

  • Hardware-based RAID. A dedicated hardware controller manages and processes all storage operations. The controller might be a separate RAID card or built into the motherboard. Hardware-based RAID is more expensive to implement than software-based RAID, but it performs better, is compatible with various operating systems, and supports more RAID levels.
  • Software-based RAID. The host OS manages and processes the storage operations, making it cheaper and easier to set up than hardware-based RAID. However, it might not perform as well, especially if it’s competing with other server operations. In addition, an OS might support only specific RAID levels. Software-based RAID is generally not suited to complex RAID configurations.

Although hardware-based and software-based RAID are the primary ways in which RAID is implemented, there are also other approaches. For example, you might find references to firmware-based RAID, driver-based RAID, hybrid RAID, or other forms. The pros and cons of these other forms fall somewhere between software-based and hardware-based RAID.

RAID Levels

One of the primary ways in which RAID devices are distinguished from one another is by their configuration levels, which determine how data is distributed across the drives. Each level represents a storage configuration that employs various combinations of striping, mirroring, and parity to improve performance, fault tolerance, or both.

The original RAID taxonomy introduced a numbering scheme for labeling each level, and that scheme has continued to this day. Each level is characterized by tradeoffs between usable capacity, availability, and performance. The following six levels are the five from the original report plus RAID 0, which was added later:

  • RAID 0: This level uses striping to split data into blocks and distribute those blocks across two or more disks. RAID 0 can improve performance because data can be written to or read from all the disks simultaneously, but it provides no redundancy and therefore no fault tolerance. If one disk fails, the entire stripe is unreadable, making RAID 0 ill-suited for most business applications. However, RAID 0 maximizes capacity usage, so it can be a cost-effective option for non-critical or non-persistent workloads. (RAID 0 was not mentioned in the 1988 report.)
  • RAID 1: This level uses mirroring to duplicate data to two or more drives. In case of disk failure, data can be read from one of the other disks, providing reliable fault tolerance. RAID 1 improves read performance because data can be read from multiple disks simultaneously, and for most workloads write performance does not suffer. Organizations implement RAID 1 primarily for its fault tolerance and availability, making it well-suited for mission-critical applications. RAID 1 also lowers usable capacity, resulting in a higher cost per GB.
  • RAID 5: This level uses block-level striping similar to RAID 0 but also adds parity. However, the parity data is distributed across the array’s drives, rather than using a dedicated drive like RAID 3 or RAID 4. Implementing RAID 5 requires at least three disks. In addition, the parity reduces usable drive space and impacts write performance because of the added complexity of writing data. However, RAID 5 can tolerate a single drive failure without losing data.
  • RAID 2, RAID 3, and RAID 4: Although these RAID levels are described by Patterson et al. (in the paper mentioned above), the levels are obsolete and are included here merely for reference. Their primary distinctions are how striping and parity are implemented. RAID 2 leverages bit-level striping, RAID 3 uses byte-level striping, and RAID 4 implements block-level striping. RAID 2 relies on Hamming code, a linear form of error-correcting code, to protect against data loss, whereas RAID 3 and RAID 4 each store XOR parity on a dedicated disk.
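
To make the difference between plain striping and distributed parity more concrete, the following Python sketch maps a logical block number to a disk under RAID 0 and under RAID 5. The function names are hypothetical, and the RAID 5 rotation shown (parity starting on the last disk and moving back one disk per row) is only one of several layouts that implementations use, so treat it as an assumption rather than a standard.

```python
def raid0_location(logical_block, num_disks):
    """Return (disk, row) for a logical block under plain striping."""
    return logical_block % num_disks, logical_block // num_disks


def raid5_location(logical_block, num_disks):
    """Return (data_disk, parity_disk, row) with parity rotating one disk per row."""
    data_per_row = num_disks - 1  # one block per row is reserved for parity
    row, index_in_row = divmod(logical_block, data_per_row)
    # Assumed rotation: parity starts on the last disk and moves back each row.
    parity_disk = (num_disks - 1 - row) % num_disks
    # Data blocks fill the row in order, skipping over the parity disk.
    data_disk = index_in_row if index_in_row < parity_disk else index_in_row + 1
    return data_disk, parity_disk, row


for block in range(6):
    data_disk, parity_disk, row = raid5_location(block, num_disks=3)
    print(f"block {block}: row {row}, data on disk {data_disk}, parity on disk {parity_disk}")
```

In the three-disk example, each disk takes a turn holding parity, so no single drive becomes the bottleneck that a dedicated parity disk can be in RAID 3 or RAID 4.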

RAID 0, RAID 1, and RAID 5 are commonly used today, and other RAID configurations have been added since the early days. For example, RAID 6 extends RAID 5 by including another layer of parity. As a result, a RAID 6 array requires at least four disks, but it can handle two simultaneous disk failures without losing data.

Another approach to RAID is to nest configurations to address limitations in any one level. By far the most common nested implementation is RAID 10, which is also written as RAID 1+0 because it combines RAID 0 and RAID 1. In this configuration, the array uses striping and mirroring to deliver both performance and fault tolerance. RAID 10 requires a minimum of four drives and comes at a greater cost per GB as a result of the higher redundancy.

To help make sense of these configurations, the following table lists several common RAID levels in use today and some of the main differences between them.

| Feature | RAID 0 | RAID 1 | RAID 5 | RAID 6 | RAID 10 |
|---|---|---|---|---|---|
| Data distribution technology | Striping | Mirroring | Striping and parity | Striping and double parity | Striping and mirroring |
| Minimum disks | 2 | 2 | 3 | 4 | 4 |
| Disk utilization based on minimum number of disks | 100% | 50% | Varies, but typically more than 50% | Varies, but typically more than 50% | 50% |
| Data protection | No fault tolerance | Provides fault tolerance | Provides fault tolerance | Provides fault tolerance | Provides fault tolerance |
| Performance | Good read and write | Good read, average write | Good read, below average write | Good read, below average write | Good read and write |

The table is meant only to provide a high-level overview of how RAID configurations compare, particularly when it comes to performance. Differences in storage products and controllers, types of workloads, network capabilities, and other variables can all impact how well a RAID storage device performs.
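
The disk utilization figures can be approximated with simple arithmetic. The Python sketch below is a rough illustration only: the function name usable_fraction is hypothetical, all disks are assumed to be the same size, RAID 1 is assumed to keep a full copy on every disk, and RAID 10 is assumed to use two-way mirrors.

```python
MINIMUM_DISKS = {"0": 2, "1": 2, "5": 3, "6": 4, "10": 4}


def usable_fraction(level, num_disks):
    """Return the fraction of raw capacity available for data at a RAID level."""
    if num_disks < MINIMUM_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MINIMUM_DISKS[level]} disks")
    if level == "0":
        return 1.0                          # no redundancy at all
    if level == "1":
        return 1 / num_disks                # every disk holds a full copy
    if level == "5":
        return (num_disks - 1) / num_disks  # one disk's worth of parity
    if level == "6":
        return (num_disks - 2) / num_disks  # two disks' worth of parity
    if num_disks % 2:
        raise ValueError("RAID 10 with two-way mirrors needs an even disk count")
    return 0.5                              # assumed two-way mirrors


for level, disks in [("5", 3), ("5", 8), ("6", 4), ("6", 8), ("10", 8)]:
    print(f"RAID {level} with {disks} disks: {usable_fraction(level, disks):.0%} usable")
```

Under these assumptions, parity-based levels become more space-efficient as disks are added, while RAID 6 at its four-disk minimum works out to exactly half of the raw capacity.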

Typical SQL Server use cases for RAID include:

  • RAID 0: Stream analytics or other transient, non-persistent data
  • RAID 1: OS, SQL log
  • RAID 5: SQL data for which high write performance is not required
  • RAID 10: SQL log for which high availability is required along with either high capacity or high performance, or SQL data for which high write performance is required

One of the most common implementation blunders is choosing RAID 5 rather than RAID 10. There is a time-worn project management maxim: Good, fast, cheap: pick two. In the context of RAID levels, this can be refashioned as: Available, fast, cheap: pick two. Both RAID 5 and RAID 10 provide availability. However, IT teams commonly choose the higher usable capacity of RAID 5, which requires fewer disks than a RAID 10 implementation for the same usable capacity and is therefore less expensive in terms of raw hardware costs. Yet doing so comes at the expense of performance. Owing to the “RAID 5 write penalty” described earlier, high write workloads on RAID 5 will suffer significantly relative to RAID 10.
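
A back-of-the-envelope calculation shows how much the write penalty can matter. The Python sketch below uses a common rule of thumb rather than a measured result: it takes the penalty of four I/Os per write for RAID 5 described earlier, assumes a penalty of two for RAID 10 (one write to each side of a mirror), and uses hypothetical figures of eight disks at 200 IOPS each. The function name effective_iops is also hypothetical.

```python
def effective_iops(num_disks, iops_per_disk, write_penalty, write_fraction):
    """Estimate front-end IOPS using the write-penalty rule of thumb."""
    raw_iops = num_disks * iops_per_disk
    read_fraction = 1.0 - write_fraction
    # Each read costs one back-end I/O; each write costs write_penalty I/Os.
    return raw_iops / (read_fraction + write_fraction * write_penalty)


RAID5_PENALTY = 4   # two reads and two writes per application write
RAID10_PENALTY = 2  # assumed: one write to each disk in a mirrored pair

for write_fraction in (0.5, 1.0):
    raid5 = effective_iops(8, 200, RAID5_PENALTY, write_fraction)
    raid10 = effective_iops(8, 200, RAID10_PENALTY, write_fraction)
    print(f"{write_fraction:.0%} writes: RAID 5 ~{raid5:.0f} IOPS, RAID 10 ~{raid10:.0f} IOPS")
```

With these assumed numbers, a write-only workload gets roughly half the throughput on RAID 5 that it gets on RAID 10 with the same number of disks. Controller caching and full-stripe writes can narrow the gap in practice, so the estimate is only a starting point.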

Storage solutions might also implement other configurations. For example, RAID 01 (RAID 0+1) is similar to RAID 10 except that the data is first striped and then mirrored, rather than the other way around. Some RAID configurations also require proprietary hardware, such as RAID 7, a RAID level whose name was trademarked by the Storage Computer Corporation. RAID 7 is based on RAID 3 and RAID 4 but adds caching capabilities.

RAID’s Uncertain Future

In recent years, industry pundits have been discussing the extent to which RAID will have a place in the next-generation data center, but despite the uncertainty, its presence remains ubiquitous. Yet storage continues to evolve and requirements grow more demanding, which calls into question how long RAID will remain relevant.

For example, RAID was not designed to handle many of today’s larger hard disk drives (HDDs). A disk failure can result in lengthy data rebuild times, significantly impacting performance and availability. And the longer it takes to rebuild a drive, the greater the chances another one will fail. At the same time, many organizations are turning to object storage, which can offer more efficient mechanisms for ensuring redundancy.
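
A rough estimate shows why larger drives lengthen rebuilds. The Python sketch below assumes the rebuild must rewrite the entire replacement drive at a sustained rate; the function name rebuild_hours, the drive sizes, and the 100 MB/s rate are all hypothetical, and real rebuilds are often slower because the array keeps serving application I/O.

```python
def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Estimate the hours needed to rewrite a whole drive at a sustained rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return capacity_mb / rebuild_mb_per_s / 3600


for capacity_tb in (2, 8, 16):
    hours = rebuild_hours(capacity_tb, rebuild_mb_per_s=100)
    print(f"{capacity_tb} TB drive at 100 MB/s: about {hours:.1f} hours")
```

Because the rebuild rate does not scale with capacity in this model, the estimated time grows in direct proportion to drive size, and so does the window during which a second failure could occur.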

Some also cite the greater reliance on solid state drives (SSDs) as another reason for RAID’s demise. Today’s SSDs are more reliable than HDDs and offer performance orders of magnitude greater than HDDs (even when compared with RAID 0). For this reason, some argue that RAID is no longer necessary with all-flash storage. However, most experts agree that mission-critical workloads demand storage tier redundancy.

Most experts also believe that SSD-based RAID is a viable alternative to traditional systems, especially for the additional fault tolerance. In fact, several vendors now offer all-flash storage solutions that support standard RAID or proprietary forms of RAID. For example, the Hewlett Packard Enterprise (HPE) 3PAR storage systems support several RAID configurations, including RAID 10, RAID 50 (RAID 5+0), and RAID MP (multiple parity and striping).

The future of RAID remains uncertain, but the advent of SSD-based RAID and proprietary RAID implementations suggests that RAID, like most storage technologies, will continue to evolve to meet the needs of modern workloads. Whether the end result will look anything like the RAID of 1990 is yet to be seen. But no matter what we end up with, there will always be a demand for storage that can deliver the performance and fault tolerance necessary to support whatever workloads are thrown our way, which is what RAID is all about.