Optimal Cluster sizes for SQL Server on SSDs

– is 64K still important?

Lately, with the advance of SSDs, more and more administrators are wondering: is the old rule to use a 64 K cluster size (aka “file allocation unit size”) for SQL Server still applicable on today’s drives, especially SSDs and flash storage? And does it apply to all RAID types equally? This summer I took the time to run IO benchmarks on SSDs under different RAID configurations (RAID 0, RAID 10, RAID 50), and I am now finally finding the time to share the results.

First things first: whenever you get recommendations from the internet, it is very unlikely that your system is configured identically. The drives I tested were HGST Ultrastar SSD800MH.B Enterprise MLC 12 Gb/s SAS solid-state drives (HGST is a subsidiary of Western Digital).

At this point, a big thank-you to DB Netz AG, which graciously allowed me to use their brand-new hardware for these tests.

Ideally you will run your own benchmarks. But if you cannot find the time, I hope this will help many of you.

The Setup

I was lucky to have 46 of those SSDs, shared equally (2 × 23) between the two HBAs in the server (4× Intel Xeon E7-8891 v3 with 10 cores / 20 threads each, 2 TB RAM), and was thus able to prepare six volumes with the following RAID configurations all at once, before the actual IO tests. The IO tests themselves were of course run sequentially. The bullet points use the drive letters from the test server:

(N) RAID 10 consisting of 4 disks

mirrored via Storage Pools and then striped via Volume Management.

(O) RAID 10 consisting of 6 disks

(M) RAID 10 consisting of 10 disks

(T) RAID 0 (Stripe Set) consisting of 4 disks

striped via Storage Pools

(S) RAID 0 (Stripe Set) consisting of 4 disks

striped via Storage Pools and striped again via Volume Management.

(R) RAID 50 consisting of 6 disks (2×3)

All of the volumes were formatted and tested with cluster sizes of 4 K and 64 K, one after the other. In addition, some tests with an 8 K cluster size were run.
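For reference, here is a minimal PowerShell sketch of how one of the Storage Pools mirror volumes could be created and then formatted with a 64 K cluster size. The pool and disk names, the drive letter and the number of disks are placeholders of mine, not the exact commands used for these tests, and the additional striping via Volume Management is omitted:

    # Pick four physical disks that are available for pooling (the selection is a placeholder).
    $disks = Get-PhysicalDisk -CanPool $true | Select-Object -First 4

    # Create a storage pool from those disks on the local storage subsystem.
    $subsys = Get-StorageSubSystem | Select-Object -First 1
    New-StoragePool -FriendlyName "SqlPool" `
        -StorageSubSystemFriendlyName $subsys.FriendlyName `
        -PhysicalDisks $disks

    # Create a mirrored virtual disk – the "mirrored via Storage Pools" part of volume (N).
    New-VirtualDisk -StoragePoolFriendlyName "SqlPool" -FriendlyName "SqlData" `
        -ResiliencySettingName Mirror -ProvisioningType Fixed -UseMaximumSize

    # Initialize, partition and format the new disk with a 64 K allocation unit size (cluster size).
    Get-VirtualDisk -FriendlyName "SqlData" | Get-Disk |
        Initialize-Disk -PartitionStyle GPT -PassThru |
        New-Partition -DriveLetter N -UseMaximumSize |
        Format-Volume -FileSystem NTFS -AllocationUnitSize 65536 -NewFileSystemLabel "SqlData"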

This is how the final volumes looked in the server:

(The final production setup would of course consist of just two or three larger volumes, not that many.)

The IO Tests

Diskspd was used for the load testing, and the following six IO test configurations were run against each volume:

Name                  Path  Block size  Operation   Read/Write  Outstanding IO  Threads
Read Random 8K        M:\   8K          Random      100% Read   8               8
Read Random 64K       M:\   64K         Random      100% Read   8               8
Write Random 8K       M:\   8K          Random      100% Write  8               8
Read Sequential 64K   M:\   64K         Sequential  100% Read   8               8
Write Sequential 8K   M:\   8K          Sequential  100% Write  8               8
Read Sequential 256K  M:\   256K        Sequential  100% Read   8               8

Not all drive configurations were tested thoroughly with an 8 K cluster size, as I realized that the differences to 64 K were often negligible.
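To give an idea of what such a test invocation looks like, here is a Diskspd command line roughly matching the first row of the table (“Read Random 8K”). The test file path, file size and duration are assumptions of mine; only the block size, access pattern, outstanding IO and thread count come from the table:

    diskspd.exe -b8K -r -w0 -o8 -t8 -d120 -Sh -L -c50G M:\iotest.dat

Here -w0 means 0 % writes (i.e. 100 % reads), -Sh disables software caching and hardware write caching so that the drives rather than the cache are measured, -L collects latency statistics, and -c50G creates a 50 GB test file.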

The Results

The tests yielded some unexpected results and, on the other hand, some differences that were less pronounced than expected.
My tests show the following:

  1. A 64 K cluster size performs better for almost all RAID configurations and workloads.
    Exceptions:
    a. RAID 50:
      1. 8 K random reads are remarkably bad (an 8 K and even a 4 K cluster size perform better here) – this does not apply to 8 K random writes, though, where 8 K and 64 K clusters both perform equally better.
      2. Read Random 64K
      3. Read Sequential 64K
      4. 8 K sequential writes show no difference -> for an OLTP pattern, 8 K might therefore be better (4 K not in general).
    b. RAID 10 with 4 disks: no relevant difference.
  2. 8 K cluster size performance is almost the same as with 64 K.
  3. A 4 K cluster size performs better only for 8 K random reads on RAID 50.
  4. RAID 10 of 10 disks is substantially faster for 64 K random reads than for sequential reads! The boost appears beyond 6 disks. For sequential reads and random writes, no significant differences were measured.
  5. The stripe sets do NOT outperform the other RAID configurations. They are only substantially faster than RAID 10 for writes (Write Random 8K). For reads they are sometimes even a bit slower than RAID 10 or RAID 50.
    a. This debunks the myth that a stripe set by nature outperforms a truly redundant RAID!
  6. RAID 50 of 6 disks (2.89 TB net) is much faster than RAID 10 of 6 disks (2.16 TB net).
  7. A stripe set (4 disks) constructed solely via Storage Pools is faster than a stripe set constructed via Storage Pools plus Volume Management.
    a. Except for the 4 K results!

Conclusions summary

  • A 64 K cluster size indeed still makes sense as the standard. Although it depends a lot on the actual drive configuration, the exceptions will apply to only a minority of servers (a quick way to check a volume’s current setting is shown after this list).
  • It is worthwhile to think outside the RAID 5 and RAID 10 “standards”.
  • It is generally not worth going for stripe sets for the sake of speed while risking the much higher probability of failure.
  • In my eyes, a cluster size of 4 K is not worth considering, given my measurements and the slightly higher risk of integrity problems – unless your storage vendor proves otherwise!
  • If you need to tune a specific workload down to the last IO, or have an extraordinarily fine piece of hardware, you should measure accurately rather than go by general recommendations (like the ones in this very blog post 😉) and determine the best configuration for your specific scenario.
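As mentioned in the first bullet, here is a quick way to check which cluster size an existing volume is currently using (the drive letter is just an example):

    # PowerShell: AllocationUnitSize is the cluster size in bytes (65536 = 64 K).
    Get-Volume -DriveLetter M | Select-Object DriveLetter, FileSystemLabel, AllocationUnitSize

    # Classic alternative: look for "Bytes Per Cluster" in the output.
    fsutil fsinfo ntfsinfo M: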

You can download the complete results to draw your own conclusions here: 1711_Disk-Benchmark-Results.zip

I hope you find this useful in your next discussions and implementation plans.

Andreas
