We’re currently building a new storage server to hold low-priority data (tertiary backups, etc.). One of the requirements for the project is that it needs to sit on cheap storage (as opposed to expensive enterprise SAN/NAS). After some research we decided to build a Backblaze pod. Backblaze used 3TB Hitachi drives in their system, but the ones they listed in their blog post are discontinued, and the reviews for every other 3TB+ drive were terrible, so we went with Samsung ST2000DL004 2TB 7200 RPM drives.
Like Backblaze, we’re going with software RAID, but a good first step seemed to be deciding which RAID level we want, and whether to use the mdadm/LVM mish-mosh Backblaze uses or something simpler. For my testing I created a RAID6 of all 45 drives with a single XFS filesystem on top (XFS’s size limit is ~8 exabytes vs. ext4’s 16TB). Ext4 might offer some performance advantages, but the overhead of managing multiple sub-16TB volumes glued together with LVM probably isn’t worth it in our case.
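For reference, the layered approach means stacking LVM on top of several smaller md arrays instead of building one big one. A rough sketch of what that could look like on a 45-drive box, purely illustrative (volume names made up, and not necessarily how Backblaze actually carves theirs up):
# Illustrative only: three 15-drive RAID6 arrays glued together with LVM.
# The shell globs expand to 15 whole-disk devices each on this box.
mdadm --create /dev/md0 --level=raid6 -c 256K --raid-devices=15 /dev/sd[a-o]
mdadm --create /dev/md1 --level=raid6 -c 256K --raid-devices=15 /dev/sd[p-z] /dev/sda[a-d]
mdadm --create /dev/md2 --level=raid6 -c 256K --raid-devices=15 /dev/sda[e-s]
pvcreate /dev/md0 /dev/md1 /dev/md2            # each array becomes an LVM physical volume
vgcreate vg_backup /dev/md0 /dev/md1 /dev/md2  # pool them into one volume group
lvcreate -l 100%FREE -n lv_backup vg_backup    # one logical volume spanning all three arrays
mkfs.xfs /dev/vg_backup/lv_backup
Compared to that, a single md device with one filesystem on top is a lot less to keep track of.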
So, this is just a simple benchmark comparing RAID0 (striping with no parity) as a baseline, RAID5 (striping with one disk's worth of parity), and RAID6 (striping with two disks' worth of parity) across 45 total spindles. For all tests I used Linux software RAID (mdadm).
To test, I have three scripts: makeraid0.sh, makeraid5.sh, and makeraid6.sh. Each one does what its name implies. The RAID0 uses 43 disks, the RAID5 uses 44, and the RAID6 uses 45, so each test has 43 “data” disks. The system is a Protocase “Backblaze-inspired” chassis with a Core i3 540 CPU, 8 GB of memory, CentOS 6.3 x64, and 45x Samsung ST2000DL004 2TB drives. We’re just using this box for backups, and it gives us about 79 TB usable, which is still plenty, so the 2TB drives aren’t a big problem.
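As a quick sanity check on that 79 TB figure: RAID6 sets aside two drives' worth of capacity for parity, so the 45-drive test array leaves 43 × 2 TB of data space, which lines up with the 79T df reports further down. A back-of-the-envelope version in shell:
# Rough usable-capacity check; assumes the marketing 2 TB = 2*10^12 bytes per drive
drives=45; parity=2; per_drive=$(( 2 * 1000**4 ))
echo "$(( (drives - parity) * per_drive / 1024**4 )) TiB usable"   # prints 78 TiB, in the same ballpark as the 79T df shows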
makeraid?.sh for array creation (the RAID6 version is shown; the RAID0 and RAID5 versions just change the level and device count):
#!/bin/bash
# Create a 45-device RAID6 with a 256K chunk size
mdadm --create /dev/md0 --level=raid6 -c 256K --raid-devices=45 \
    /dev/sda  /dev/sdb  /dev/sdc  /dev/sdd  /dev/sde  \
    /dev/sdf  /dev/sdg  /dev/sdh  /dev/sdi  /dev/sdj  \
    /dev/sdk  /dev/sdl  /dev/sdm  /dev/sdn  /dev/sdo  \
    /dev/sdp  /dev/sdq  /dev/sdr  /dev/sds  /dev/sdt  \
    /dev/sdu  /dev/sdv  /dev/sdw  /dev/sdx  /dev/sdy  \
    /dev/sdz  /dev/sdaa /dev/sdab /dev/sdac /dev/sdad \
    /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai \
    /dev/sdaj /dev/sdak /dev/sdal /dev/sdam /dev/sdan \
    /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas
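Creating an array this size kicks off a long initial sync before it's fully redundant. If that needs watching or hurrying along, the usual knobs are /proc/mdstat and the md resync speed limits, something along these lines (the limit values here are just examples):
# Watch the initial build/resync progress
watch -n 5 cat /proc/mdstat

# Raise the md resync throttles if the box is otherwise idle (KB/s per device)
echo 50000  > /proc/sys/dev/raid/speed_limit_min
echo 500000 > /proc/sys/dev/raid/speed_limit_max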
Filesystem:
[root@Protocase ~]# mkfs.xfs -f /dev/md0
meta-data=/dev/md0               isize=256    agcount=79, agsize=268435392 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=21000267072, imaxpct=1
         =                       sunit=64     swidth=2752 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@Protocase ~]# mount /dev/md0 /raid0/
[root@Protocase ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdat2 289G 3.2G 271G 2% /
tmpfs 3.9G 260K 3.9G 1% /dev/shm
/dev/sdat1 485M 62M 398M 14% /boot
/dev/md0 79T 35M 79T 1% /raid0
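One nice thing in the mkfs.xfs output above: it picked up the stripe geometry from md on its own. sunit=64 blocks × 4 KB = the 256 KB chunk size, and swidth=2752 blocks = 64 × 43 data disks. If the geometry ever had to be set by hand (say, with LVM in between), it would look something like this:
# Only needed when XFS can't detect the md geometry itself:
# su = RAID chunk size, sw = number of data disks (45 devices - 2 parity = 43)
mkfs.xfs -f -d su=256k,sw=43 /dev/md0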
RAID0:
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 25.1944 s, 416 MB/s
[root@Protocase ~]# rm -f /raid0/zeros.dat
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 25.1922 s, 416 MB/s
[root@Protocase ~]# rm -f /raid0/zeros.dat
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 24.7665 s, 423 MB/s
RAID5:
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 25.2239 s, 416 MB/s
[root@Protocase ~]# rm -f /raid0/zeros.dat
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 24.7427 s, 424 MB/s
[root@Protocase ~]# rm -f /raid0/zeros.dat
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 24.2434 s, 433 MB/s
RAID6:
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 26.9032 s, 390 MB/s
[root@Protocase ~]# rm -f /raid0/zeros.dat
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 26.5255 s, 395 MB/s
[root@Protocase ~]# rm -f /raid0/zeros.dat
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 26.4338 s, 397 MB/s
I found it pretty strange that RAID5 seemed to outperform RAID0, but I tested it several times and RAID5 consistently averaged 10-15 MB/s faster than RAID0. Maybe a bug in the kernel? I also tried other block sizes for dd, ranging from 60KB to 4MB, and the results were pretty consistent (a sketch of that block-size sweep is included after the script below). In the end it looks like I’m going to go with a RAID6 of 43 drives plus 2 hot spares, which still yields ~400 MB/s throughput and 75 TB usable:
#!/bin/bash
# Create a RAID6 with 43 active devices plus 2 hot spares, 256K chunk size
mdadm --create /dev/md0 --level=raid6 -c 256K -n 43 -x 2 \
    /dev/sda  /dev/sdb  /dev/sdc  /dev/sdd  /dev/sde  \
    /dev/sdf  /dev/sdg  /dev/sdh  /dev/sdi  /dev/sdj  \
    /dev/sdk  /dev/sdl  /dev/sdm  /dev/sdn  /dev/sdo  \
    /dev/sdp  /dev/sdq  /dev/sdr  /dev/sds  /dev/sdt  \
    /dev/sdu  /dev/sdv  /dev/sdw  /dev/sdx  /dev/sdy  \
    /dev/sdz  /dev/sdaa /dev/sdab /dev/sdac /dev/sdad \
    /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai \
    /dev/sdaj /dev/sdak /dev/sdal /dev/sdam /dev/sdan \
    /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas
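For what it’s worth, the block-size sweep mentioned above was nothing fancy, roughly the loop below (the file path and sizes are just examples; adding conv=fdatasync folds the final cache flush into dd’s timing, which the quick runs above don’t):
#!/bin/bash
# Quick sequential-write sweep across dd block sizes, ~10 GiB per run
for bs_kb in 60 256 1024 4096; do
    count=$(( 10 * 1024 * 1024 / bs_kb ))   # keep the total size roughly constant
    echo "== bs=${bs_kb}K =="
    dd if=/dev/zero of=/raid0/zeros.dat bs=${bs_kb}K count=$count conv=fdatasync 2>&1 | tail -n1
    rm -f /raid0/zeros.dat
done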
Update: A coworker suggested looking into a write-intent bitmap to speed up resyncs (e.g. after an unclean shutdown or a re-added drive). After adding an internal bitmap with a 256 MB chunk, write performance didn’t degrade much, so it looks like a good addition to the configuration:
[root@Protocase ~]# mdadm -G --bitmap-chunk=256M --bitmap=internal /dev/md0
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 25.8157 s, 406 MB/s
[root@Protocase ~]# rm -fv /raid0/zeros.dat
removed `/raid0/zeros.dat'
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 26.4233 s, 397 MB/s
[root@Protocase ~]# rm -fv /raid0/zeros.dat
removed `/raid0/zeros.dat'
[root@Protocase ~]# dd if=/dev/zero of=/raid0/zeros.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 26.2593 s, 399 MB/s
[root@Protocase ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdat2 289G 3.2G 271G 2% /
tmpfs 3.9G 88K 3.9G 1% /dev/shm
/dev/sdat1 485M 62M 398M 14% /boot
/dev/md0 75T 9.8G 75T 1% /raid0
[root@Protocase ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdas[44](S) sdar[43](S) sdaq[42] sdap[41] sdao[40] sdan[39] sdam[38] sdal[37] sdak[36] sdaj[35] sdai[34] sdah[33] sdag[32] sdaf[31] sdae[30] sdad[29] sdac[28] sdab[27] sdaa[26] sdz[25] sdy[24] sdx[23] sdw[22] sdv[21] sdu[20] sdt[19] sds[18] sdr[17] sdq[16] sdp[15] sdo[14] sdn[13] sdm[12] sdl[11] sdk[10] sdj[9] sdi[8] sdh[7] sdg[6] sdf[5] sde[4] sdd[3] sdc[2] sdb[1] sda[0]
80094041856 blocks super 1.2 level 6, 256k chunk, algorithm 2 [43/43] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]
bitmap: 2/4 pages [8KB], 262144KB chunk
unused devices: <none>
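The payoff from the bitmap shows up when a member drops out and comes back: a re-added drive only has to resync the chunks that changed while it was gone instead of rebuilding 2 TB from scratch. The bitmap can be inspected on any member, for example:
# Dump the internal write-intent bitmap from one of the member drives
mdadm --examine-bitmap /dev/sda

# After a drive is pulled and re-inserted, re-adding it resyncs only the
# dirty chunks recorded in the bitmap rather than the whole disk
mdadm /dev/md0 --re-add /dev/sda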