Performance of different blocksizes on a Samsung Pro SSD using zfs

I got a Samsung 990 Pro NVMe SSD, which by default reports a block size of 512B. The purpose of this device is to host virtual machine workloads on the zfs filesystem. Because of performance and replication considerations, I was wondering if switching from the 512B to a 4k block size (as used in most modern HDDs) would have a performance impact, so I ran some simulations.

TL;DR: Stick with the device-provided block size if performance is an objective. The impact is likely small, and if replication is another objective, you need to stay consistent between the pools.

You can also find all results in my storage-fio repository.

Results

I start with the results for the impatient reader; see the simulation and setup descriptions below.

I ran three kinds of simulations: a baseline, a simulation of mostly reading VMs, and a simulation of mostly writing VMs. Those simulations suggest that sticking with the drive-provided 512B block size has a performance advantage over a 4k block size. In both VM scenarios I observed higher overall bandwidth and lower latency with the 512B configuration. However, I also observed that the 4k configuration behaves more predictably: the standard deviation in all scenarios with a 4k block size is considerably smaller. Given that the standard deviation for the 512B workloads is significantly larger than for 4k, a direct comparison is therefore not entirely fair.

Those simulations provide an argument for using a 512B block size for the simulated VM workloads, but the actual numbers are to be taken with a grain of salt: the argument holds qualitatively rather than quantitatively.

See the results.txt file for the extracted data table and data.tar.gz for the raw data files from fio.

Simulation stats

Simulation 2 (mostly reading VMs) shows on average 20% more throughput and latency for 512B, although also a standard deviation that is about 20% larger. Interestingly, the write throughput is 20% higher for the 4k case here, so the two somewhat compensate each other.

In simulation 3 (mostly writing VMs) the 512B configuration outperforms the 4k one by about 30% in read bandwidth and 60% in write bandwidth, but again with a 30% and 60% larger standard deviation, respectively. The latency is also about 30% lower for the 512B configuration. Here 512B really shows increased performance compared to my 4k configuration.

Looking at both the bandwidth and the runtime stats of the whole simulation gives results consistent with the previous statements:

  • simulation1 (baseline) is fairly similar for both configurations and not that interesting
  • simulation2 (mostly reading VMs) gives 45.3/24.4MiB/s (read/write) at a runtime of 23s for 512B vs. 38.9/20MiB/s at a runtime of 27s for 4k
  • simulation3 (mostly writing VMs) gives 15.7/29.1MiB/s at 36.5s for 512B and 12.2/22.6MiB/s at 47s for 4k

Please note that these values are group stats and not directly translatable to NVMe performance characteristics.

Given the larger standard deviation, I find the overall runtime the more robust metric to look at. Here I see a runtime increase from 23s to 27s, which is consistent with the roughly 20% shown earlier, and from 36.5s to 47s, which is consistent with the roughly 30% described for simulation 3. One could also look at the histogram distributions that fio provides, but that is beyond what I want to do here.
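If you want to pull the same aggregate numbers out of the saved fio output yourself, the group status lines can simply be grepped. The path below assumes the results layout used later in the Setup section:

# Print the aggregated READ/WRITE group bandwidth lines from a saved fio run
grep -E '(READ|WRITE): bw=' results/zfs/ashift12/simulation3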

Simulations

Find here the descriptions of the various simulations. Read:write ratios are given as e.g. 60:40, meaning that 60% of the workload is read and 40% is write.

I set a data limit of 4 GiB for each simulation job to provide a statistically large enough sample.

Simulation 1: Baseline

Simulation 1 acts as a generic baseline using 4 mixed read and write workloads. It simulates a 4 VM scenario with 3 VMs acting as mostly reading and 1 VM acting as a balanced 50:50 read/write workload on sequential data.

The 3 reading VMs have a read/write ratio of 90:10. The writing VM has a read/write ratio of 50:50. Reads and writes happen sequentially.
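The exact job files are in the storage-fio repository. As a rough illustration only, a simulation-1-style fio job file could look something like the sketch below; the dataset path (the default mountpoint of the dataset created in the Setup section) and the group_reporting option are assumptions on my side, not necessarily what the repository uses:

[global]
# Default mountpoint of the dataset created in the Setup section (assumption)
directory=/ashift12/storage-test
# 4 GiB data limit per job
size=4g
# Sequential mixed reads and writes
rw=rw
# Aggregate per-group stats (assumption, matches the group stats discussed above)
group_reporting=1

[mostly-reading-vms]
# The 3 mostly reading VMs, 90:10 read/write ratio
numjobs=3
rwmixread=90

[balanced-vm]
# The balanced VM, 50:50 read/write ratio
numjobs=1
rwmixread=50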

Simulation 2: Mixed read/write, mostly read

Simulation 2 simulates a 4 VM scenario with mostly read-intensive workloads on a pattern of random mixed reads and writes. It simulates 4 VMs, where 3 VMs are read-intensive (80:20) and one VM is write-intensive (20:80).

Reads and writes happen randomly.
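Again only as a sketch, compared to the simulation 1 example above roughly only the access pattern and the mixes change (only the differing options are shown):

[global]
# directory, size and group_reporting as in the simulation 1 sketch
# Random mixed reads and writes
rw=randrw

[read-intensive-vms]
# The 3 read-intensive VMs, 80:20 read/write ratio
numjobs=3
rwmixread=80

[write-intensive-vm]
# The write-intensive VM, 20:80 read/write ratio
numjobs=1
rwmixread=20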

Simulation 3: Mixed read/write, mostly write

Simulation 3 is the same as simulation 2, but with 1 VM acting as mostly reading and 3 VMs acting as mostly writing.

Reads and writes happen randomly.

Setup

I use a Turing Pi RK1 (aarch64, RK3588 CPU) with 32 GiB RAM. The NVMe SSD under test is the Samsung 990 Pro. The reported block size is 512B, as shown by parted:

Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 2000GB
Sector size (logical/physical): 512B/512B
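For reference, the same logical and physical block sizes can also be read directly from sysfs:

cat /sys/block/nvme0n1/queue/logical_block_size    # reports 512
cat /sys/block/nvme0n1/queue/physical_block_size   # reports 512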

The system used for this test is openSUSE Leap 15.6 with fio-3.23 and zfs 2.2.4-1:

zfs-2.2.4-1
zfs-kmod-2.2.4-1

The kernel used for all of these tests was 6.4.0-150600.16-default.

Creation of the zpools happened via the following commands:

# zpool create ashift /dev/nvme0n1                          # Use default blocksize
# zpool create -f -o ashift=12 ashift12 /dev/nvme0n1        # 4k blocksize

To check the corresponding ashift value (the block size in use, expressed as a power of 2), I used:

# zdb -C | grep ashift
ashift12:
    name: 'ashift12'
            ashift: 12

To create the datasets I used a simple zfs create command; no additional parameters were passed:

zfs create ashift12/storage-test

fio is run with the provided simulation files, and the output is piped to a text file using tee, e.g.

fio simulations/simulation3 | tee results/zfs/ashift12/simulation3
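To run all three simulations against one pool in a single go, a small shell loop like the following can be used; this is just a convenience sketch, with the directory layout matching the tee example above:

# Run every simulation against the 4k pool and keep the raw fio output
mkdir -p results/zfs/ashift12
for sim in simulation1 simulation2 simulation3; do
    fio "simulations/${sim}" | tee "results/zfs/ashift12/${sim}"
done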

Summary

In short: if performance is your objective, it appears that using the device defaults is the way to go. I noticed a performance difference in both tested workloads (mostly reading, mostly writing) in favor of 512B over 4k. While the 4k configuration seems to have a more predictable performance characteristic due to its much smaller standard deviation, the overall runtimes (see Simulation stats) are a more robust metric and support the previous statement.

In my setup here, I will however still use a 4k block size, because I want zfs replication to work together with my NAS, where I also use 4k. Also, given that the expected workload is mostly idling, I do not think I will notice any difference in practical terms. So while the arguments stated here hold, I don't think they translate into any meaningful performance difference for the expected workload, because there will simply never be such a high load on the NVMe disk.
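For context, the replication I have in mind is plain snapshot-based zfs send/receive; the snapshot name, NAS host and target dataset below are placeholders, not my actual setup:

# Snapshot the dataset and replicate it to the NAS (hypothetical names)
zfs snapshot ashift12/storage-test@replica-1
zfs send ashift12/storage-test@replica-1 | ssh nas zfs receive tank/backup/storage-test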

Still, if you end up in a similar situation as I did, consider running your own experiments using e.g. fio, try to simulate the expected workload as well as you can, and then make an educated estimate of what might be best in your case.

If in doubt, the provided defaults are typically not wrong.

Weblinks