21 January 2011

Storage on the cheap - lessons learned (originally posted to StorageMonkeys July 11, 2009)

Having purchased, assembled, configured, and turned up quite a number of storage arrays where total cost was a major concern, I've come up with something of a checklist of best practices.


Use cheap, commodity, desktop SATA drives. They're as good as, if not better than, "enterprise" models. They're certainly cheaper per unit of performance.

If advanced administration, failover, or clustering features, such as those from Veritas, are needed, use SAS HBAs.

Otherwise, use SAS RAID cards. They tend to support more attached devices and may even be cheaper.

Make sure to buy disks from multiple batches for use within a RAID. That is, have a mix of drive models and sub-models, manufacturers, and even end vendors.

Bad batch syndrome is potentially the most catastrophic failure mode. Corollary: Don't buy models so new that there's only one batch in existence.

Buy only drives which support NCQ. The price premium, if any, is negligible.
Even if there's no performance gain for a particular use case, there's no downside to having it turned on everywhere.
To that end, turn NCQ on for all new adapters connecting new disks.

If coming into a legacy environment, turn off NCQ unless absolutely certain that all existing disks support it.
Problems/corruption can be insidiously subtle.
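
For concreteness, here's a rough Python sketch for auditing NCQ state across attached disks. It's Linux-specific; the sysfs paths and the queue_depth heuristic are my assumptions, not from the original post.

    #!/usr/bin/env python3
    """Report the NCQ queue depth of each sd* block device via sysfs.
    A queue_depth of 1 effectively means NCQ is off; ~31 is typical
    when NCQ is active."""
    import glob
    import os

    for dev in sorted(glob.glob("/sys/block/sd*")):
        qd_path = os.path.join(dev, "device", "queue_depth")
        try:
            with open(qd_path) as f:
                depth = int(f.read().strip())
        except (OSError, ValueError):
            continue
        state = "NCQ likely active" if depth > 1 else "NCQ off (depth 1)"
        print(f"{os.path.basename(dev)}: queue_depth={depth} -> {state}")

    # On a legacy box where NCQ must stay off, one would (as root) write 1 back:
    #   echo 1 > /sys/block/sdX/device/queue_depth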

Before use, write (zeros are fastest) to the entire device. This will trigger any bad blocks to be reallocated.
After that, run a SMART scan on the whole device and check for clean results. This will catch any (very rare) infant mortality.
It also indelibly "stamps" the drive as having been tested.
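
A minimal burn-in sketch, assuming GNU dd and smartmontools are installed; the device name is a placeholder and the whole thing is destructive.

    #!/usr/bin/env python3
    """Zero-fill a new disk, then start a SMART long self-test and check
    overall health.  DESTRUCTIVE -- /dev/sdX is a placeholder."""
    import subprocess
    import sys

    DEV = "/dev/sdX"  # placeholder: the disk under test

    # 1. Write zeros across the whole device; any bad blocks get reallocated.
    #    dd exits non-zero when it hits the end of the device, so no check=True.
    subprocess.run(["dd", "if=/dev/zero", f"of={DEV}", "bs=1M", "oflag=direct"])

    # 2. Kick off an offline long self-test (runs in the drive's background).
    subprocess.run(["smartctl", "-t", "long", DEV], check=True)

    # 3. Hours later, review the self-test log and the overall health verdict.
    subprocess.run(["smartctl", "-l", "selftest", DEV], check=True)
    if subprocess.run(["smartctl", "-H", DEV]).returncode != 0:
        sys.exit(f"{DEV}: SMART health not clean -- do not deploy")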

Install smartmontools on all servers. It's small and otherwise takes no resources.
Running the smartd daemon is another matter. That's a monitoring concern.

Turn on all the idle/background SMART tests supported by each device.
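
As a sketch of what "turning them on" might look like with smartctl's -s/-o/-S switches (run as root; the sd* device naming is an assumption):

    #!/usr/bin/env python3
    """Enable SMART, automatic offline (background) data collection, and
    attribute autosave on every sd* device found in sysfs."""
    import glob
    import os
    import subprocess

    for dev in sorted(glob.glob("/sys/block/sd*")):
        node = "/dev/" + os.path.basename(dev)
        # -s on : enable SMART support
        # -o on : enable automatic offline data collection
        # -S on : enable attribute autosave across power cycles
        subprocess.run(["smartctl", "-s", "on", "-o", "on", "-S", "on", node])
        print(f"SMART background collection enabled on {node}")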

Discard (permanently stop using) a disk at the first sign of trouble.
A SMART error or even warning is trouble.
A write error is trouble.
A read error (assuming the disk has been zeroed) is trouble.
A timeout, unless positively isolated to the disk itself, is not trouble.
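
One way to encode that policy is a small check script. The attribute names watched and the simplistic parsing of smartctl -A output below are my assumptions about typical drives, not a definitive rule.

    #!/usr/bin/env python3
    """Sketch of the 'discard at first sign of trouble' rule."""
    import subprocess
    import sys

    WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

    def disk_is_trouble(dev: str) -> bool:
        # Non-zero exit from `smartctl -H` means the health check isn't clean.
        if subprocess.run(["smartctl", "-H", dev],
                          capture_output=True).returncode != 0:
            return True
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 10 and fields[1] in WATCHED:
                if fields[9].isdigit() and int(fields[9]) > 0:  # RAW_VALUE
                    return True
        return False

    if __name__ == "__main__":
        dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdX"  # placeholder
        print(f"{dev}: {'DISCARD' if disk_is_trouble(dev) else 'no trouble seen'}")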

For external connectors, use only the screw-on type. For SAS, that's SFF-8470.
This does mean spending more money.
Often, one must use internal connections (e.g. SFF-8087) with an adapter.
The latching connectors are all too easily disconnected (sometimes only partially, which can be worse than fully) and/or too fragile.

Locate equipment such that storage cables can be short but have enough slack.
Always provide good strain relief on all external cables. This means cable ties at strategic points.
Test for adequate slack and clearance by sliding all connected and neighboring equipment.

Add between 3% and 6% (of active disks) hot spares. That should last 2-3 years without human intervention.
By then, replace all the disks, not just the failed ones, as your failure rate will otherwise accelerate heavily.
Time your transition to take advantage of technology and/or price improvements but assume closer to 2 years than 3.
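
The arithmetic, for a few typical shelf sizes:

    # Hot-spare sizing at 3-6% of active disks, rounded up so even a small
    # array gets at least one spare.
    import math

    def hot_spares(active_disks: int, fraction: float) -> int:
        return max(1, math.ceil(active_disks * fraction))

    for n in (12, 24, 48, 96):
        print(f"{n} active disks -> {hot_spares(n, 0.03)} to {hot_spares(n, 0.06)} spares")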

RAID1(+0) is far more flexible and simpler than RAID5. It performs much better in degraded and recovery modes.
A good implementation can nearly double read performance, especially on contentious operations.
It costs only 60% more than a 4 column (+1 parity) RAID5 or an 8 column RAID6.
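
The 60% figure falls out of the disks-purchased-per-usable-capacity arithmetic:

    # Disks purchased per unit of usable capacity, for the layouts above.
    def disks_per_usable(data_columns: int, redundancy_columns: int) -> float:
        return (data_columns + redundancy_columns) / data_columns

    raid10  = disks_per_usable(1, 1)   # each data disk mirrored -> 2.00
    raid5_4 = disks_per_usable(4, 1)   # 4 data + 1 parity       -> 1.25
    raid6_8 = disks_per_usable(8, 2)   # 8 data + 2 parity       -> 1.25

    print(f"RAID1+0 vs 4+1 RAID5: {raid10 / raid5_4 - 1:.0%} more disks")  # 60%
    print(f"RAID1+0 vs 8+2 RAID6: {raid10 / raid6_8 - 1:.0%} more disks")  # 60%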

Don't oversubscribe the system bus.
PCI-X 64bit@133MHz is only 1067MB/s half-duplex. (i.e. could be adequate for highly asymmetric read/write)
PCIe x4 is 1000MB/s full-duplex.
SAS 4-lane is 1200MB/s full-duplex.
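
A quick sanity check with those era-appropriate figures (PCIe 1.x lanes and 3Gb/s SAS, matching the numbers above):

    # Bandwidth figures from the list above (PCIe 1.x, 3Gb/s SAS era).
    PCIE1_LANE_MB = 250            # MB/s per PCIe 1.x lane, each direction
    SAS_3G_LANE_MB = 300           # MB/s per 3Gb/s SAS lane (after 8b/10b)
    PCIX_64_133_MB = 8 * 133.33    # 64-bit @ 133MHz, shared bus, ~1067 MB/s

    pcie_x4 = 4 * PCIE1_LANE_MB    # 1000 MB/s full-duplex
    sas_x4  = 4 * SAS_3G_LANE_MB   # 1200 MB/s full-duplex

    print(f"PCI-X 64/133: {PCIX_64_133_MB:.0f} MB/s half-duplex")
    print(f"PCIe x4     : {pcie_x4} MB/s full-duplex")
    print(f"SAS 4-lane  : {sas_x4} MB/s full-duplex")
    print(f"PCIe:SAS    ~ {pcie_x4 / sas_x4:.2f} (roughly the 4:5 ratio noted below)")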

Once everything is assembled, measure these maximum throughputs. Do so at each layer, including the HBA/RAID card and each spindle.
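
A rough per-spindle measurement sketch, assuming dd with direct I/O is available; device names are placeholders, and the same run against the array/volume device gives that layer's ceiling.

    #!/usr/bin/env python3
    """Measure raw sequential read throughput per spindle with dd + direct I/O
    (bypasses the page cache)."""
    import re
    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb"]   # placeholders: list your spindles

    for dev in DEVICES:
        result = subprocess.run(
            ["dd", f"if={dev}", "of=/dev/null", "bs=1M", "count=1024",
             "iflag=direct"],
            capture_output=True, text=True)
        # dd's summary line ends with something like "..., 5.2 s, 206 MB/s"
        match = re.search(r"[\d.]+ [A-Za-z]*B/s\s*$", result.stderr.strip())
        print(f"{dev}: {match.group(0) if match else result.stderr.strip()}")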

At each layer with a dirty region log (DRL) and/or journaling option, opt to use it.
If practical, "waste" a whole spindle on it. Otherwise, locate it somewhere highly contentious or low-demand, such as the boot disk.
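
As one concrete example of such an option: on Linux md (an assumption; other volume managers expose their own DRL/journal knobs) the analogous feature is a write-intent bitmap, which can live on the members or be pushed off to a separate, low-demand disk. Array and file names below are placeholders.

    #!/usr/bin/env python3
    """Add a write-intent bitmap so recovery only touches dirty regions."""
    import subprocess

    ARRAY = "/dev/md0"                     # placeholder
    EXTERNAL_BITMAP = "/var/md0-bitmap"    # placeholder file on a quiet disk

    # Internal bitmap, stored on the member disks themselves:
    subprocess.run(["mdadm", "--grow", ARRAY, "--bitmap=internal"], check=True)

    # ...or, to "waste" a separate low-demand disk on it instead:
    # subprocess.run(["mdadm", "--grow", ARRAY, f"--bitmap={EXTERNAL_BITMAP}"],
    #                check=True)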

Similarly, try simulating a failure at each layer and measure the recovery time. That will be the minimum under no load.
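
A sketch of that test for one layer, again assuming Linux md; the array and member names are placeholders.

    #!/usr/bin/env python3
    """Fail a member, re-add it, and time the unloaded rebuild by polling
    /proc/mdstat."""
    import subprocess
    import time

    ARRAY, MEMBER = "/dev/md0", "/dev/sdc1"   # placeholders

    subprocess.run(["mdadm", ARRAY, "--fail", MEMBER], check=True)
    subprocess.run(["mdadm", ARRAY, "--remove", MEMBER], check=True)
    subprocess.run(["mdadm", ARRAY, "--add", MEMBER], check=True)

    start = time.time()
    while True:
        with open("/proc/mdstat") as f:
            mdstat = f.read()
        if "recovery" not in mdstat and "resync" not in mdstat:
            break
        time.sleep(10)
    print(f"rebuild took {time.time() - start:.0f} s with no load")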

If the block size an application or database uses can be tuned, raise it to the highest value possible.
Conversely, use the smallest supported stripe unit size.
Set the number of columns such that the full stripe width is an even multiple (or, better yet, a factor) of the block size.
For RAID5, this usually means 4 (plus parity), 8, or (rarely) 16.
4 columns plus parity is particularly well suited to PCIe-to-SAS hardware RAID5, since there's a 4:5 PCIe:SAS bandwidth ratio.
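
A worked example of the alignment rule, with illustrative numbers of my own choosing:

    # Check that the full stripe width and the application block size
    # divide evenly one way or the other.
    def full_stripe_kib(columns: int, stripe_unit_kib: int) -> int:
        return columns * stripe_unit_kib

    app_block_kib   = 64     # e.g. a database tuned up to 64 KiB I/O
    stripe_unit_kib = 16     # smallest supported stripe unit

    for columns in (3, 4, 8, 16):          # data columns, excluding parity
        stripe = full_stripe_kib(columns, stripe_unit_kib)
        aligned = stripe % app_block_kib == 0 or app_block_kib % stripe == 0
        print(f"{columns} cols x {stripe_unit_kib} KiB = {stripe} KiB stripe: "
              f"{'aligned' if aligned else 'misaligned'} vs {app_block_kib} KiB blocks")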

For redundant components (e.g. cables, expanders, power supplies), test hot-swappability.
Do so at different "duty" (simulated outage) cycles and flap rates.
Test flip-flopping between the two components.

If you can ever check all these off, I'll be impressed. Still, I hope it helps other cheapskates out there avoid a few pitfalls.
