Optimal Cluster sizes for SQL Server on SSDs
– are 64K still important?
Lately, with the advance of SSDs, more and more administrators are wondering: Is the old rule to use a 64K Cluster size (aka “File Allocation Unit size”) for SQL Server still applicable on today’s hard drives, especially SSDs or even Flash? And: Does it apply to all RAID-types equally? This summer I took some time to conduct some IO benchmarks on SSDs under different RAID configurations (RAID 0, RAID 10, RAID 50), of which I am now finally finding the time to share the results.
First things first: Whenever you get recommendations from the internet, it is very unlikely that your system is configured identically. The drives I tested were HGST Ultrastar SSD800MH.B Enterprise MLC 12Gb/s SAS Solid-State Drives (HGST is subsidiary of Western Digital).
At this point a big Thank You to DB Netz AG that graciously allowed me to use their brand-new hardware for those tests.)
Ideally you are able to do your own benchmarks. But if you don’t find time, this will hopefully help many.
I was lucky to have 46 of those SSDs, shared equally (2×23) among the 2 HBAs in the server (4x Intel E7-8891 v3 à 10 cores, 20 threads, 2TB RAM) and thus was able to prepare 6 volumes with the following RAID-configuration all at once, before the actual IO tests. The IO tests of course were performed sequentially. The bullet points are using the drive letters from the test Server:
(N) RAID 10 consisting of 4 disks
mirrored via Storage Pools and then striped via Volume Management.
(O) RAID 10 consisting of 6 disks
(M) RAID 10 consisting of 10 disks
(T) RAID 0 (Stripe Set) consisting of 4 disks
striped via Storage Pools
(S) RAID 0 (Stripe Set) consisting of 4 disks
striped via Storage Pools and striped again via Volume Management.
(R) RAID 50 consisting of 6 disks (2×3)
All of the volumes were formatted and tested with a Cluster size of 4 and 64 K after another. Also, some tests on 8 K Cluster size were made.
This is how the final volumes looked like in the Server:
(The final setup of course would be just two or 3, larger, volumes, not that many.)
The IO Tests
Diskspd was used for the load-testing, and the following 6 IO-Test-configurations were used for all tests:
|Name||Path||Block size||Operation||Read / Write||Outstanding IO||Threads|
|Read Random 8K||M:\||8K||Random||100% Read||8||8|
|Read Random 64K||M:\||64K||Random||100% Read||8||8|
|Write Random 8K||M:\||8K||Random||100% Write||8||8|
|Read sequential 64K||M:\||64K||Sequential||100% Read||8||8|
|Write sequential 8K||M:\||8K||Sequential||100% Write||8||8|
|Read sequential 256K||M:\||256K||Sequential||100% Read||8||8|
Not all drive configurations were tested thoroughly with 8 K Cluster size, as I realized that the differences to 64 K were often neglectable.
The tests yielded some unexpected results, and on the other hand some less clear differences than expected.
My tests show the following:
- 64 K Cluster Size
a. performs better for Almost all Raid-Configurations + Workloads
- 8K Random Reads remarkably bad (8 and even 4 K Cluster size performs better here) – this does not apply to 8 K Random Writes though, where 8 and 64 K Cluster both perform (equally) better
- Read Random 64K
- Read sequential 64K
- 8K Sequential writes no differences -> for OLTP Pattern 8 K might be better (4K not generally)
- Raid10 with 4 Disks: no relevant difference
- 8 K Cluster Size performance is almost the same as for 64 K
- 4 K Cluster Size performs better for Random 8K Reads in RAID 50 only
- Raid 10 of 10 disks is substantially faster for 64K Random Reads than Sequential Reads! – Boost after 6 disks. For Sequential reads and random writes, no significant differences were measured.
- The Stripe Sets do NOT outperform the other RAID configurations. They are only substantially faster for Writes (Write Random 8K) than Raid10. For Reads they are sometimes even a bit slower than Raid10 or Raid 50.
a. This debunks the myth that a Stripe Set outperforms a real redundant Raid by nature!
- Raid50 6 disks, net 2.89TB, is much faster than a Raid10 of 6 disks, net 2.16 TB.
- A Stripe set (4 disks) constructed solely via Storage Pool is faster than a Stripe Set constructed via StoragePool and Volume Management.
a. Except for 4 K results!
- 64 K Cluster Size indeed still make sense as a standard. Although it depends a lot on the actual drive configuration the exceptions will apply to a minority of servers-
- It is worthwhile to think out of the RAID 5 and Raid 10 “standards”.
- It is not generally worth to go for Stripe Sets because of Speed and risk the much higher probability of failures.
- A cluster size of 4 K in my eyes is not worth considering according to my measurements and the slightly higher risk of integrity problems – unless your storage vendor proves you so!
- If you need to tune a specific workload until the last IO or have an extraordinarily fine Piece of hardware, you should measure accurately and not go by general recommendations (like in this very Blog-post 😉 ) and determine the best configuration for your very Scenario
You can download the complete results to draw your own conclusions here: 1711_Disk-Benchmark-Results.zip
I hope you find this useful in your next discussions and implementation plans.