Categories
AI Blog Admin

Using ChatGPT to fight spam on WordPress

Like all other WordPress blogs, this one attracts a good number of spam comments. I usually get 5-10 per day, but yesterday there were around 30. Almost all of them contain Cyrillic characters:

screenshot showing spam comments containing cyrillic characters

Since I hold all comments until approved, that means I need to either approve, trash, or mark as spam every single comment.

Enter ChatGPT

I use ChatGPT (specifically GPT-4) for a number of minor coding tasks. I find it helpful. It is not perfect, but that doesn’t mean it isn’t useful. I decided to have it ponder this issue. I work with Python a lot at work and it’s typically my scripting language of choice. My initial request was as follows:

write a python script to log into a wordpress site as an admin, get the top 5 comments, see if there are any Cyrillic characters in them, and if there are, delete said comment

It was understandably unhappy about potentially being asked to “hack” a WordPress site, so I had to reassure it that I was the owner of said site:

yes, I have my own blog. I am the owner. I have the admin credentials. please proceed with how to delete those comments

It happily complied and spit out some very usable code:

chatgpt conversation around a python script to delete blog spam comments

After a bit more back and forth:

does this get comments in a pending state? I don't let them be published instantly because most of them are spam

I was informed there are 5 different comment states: approved, hold, spam, trash, unapproved.

perfect. can you please adjust the script to get the pending, unapproved, and hold comments. also make it top 20

It ran perfectly after copy + pasting the Python. Unfortunately, I had created an application password for my main login on this site and forgot to change the delete URL, so the script happily sent my application password and username to yourwebsite.com. After revoking that password and realizing there should be a base URL:

please split out the site url (https://austinsnerdythings.com) from base_url for both retrieving the comments as well as deleting

I was left with a 100% functional script. This took 3-4 minutes of back and forth with GPT-4. I definitely could’ve coded this up myself with the basic structure in 15 minutes or so, but I would’ve had to work out the JSON format for comments and all that. It is so much easier to just test what ChatGPT provides and alter as necessary:

import requests
import json
import re

def has_cyrillic(text):
    return bool(re.search('[\u0400-\u04FF]', text))

site_url = "https://austinsnerdythings.com"
base_url = f"{site_url}/wp-json/wp/v2/comments?per_page=20&status="
statuses = ['pending', 'hold', 'unapproved']
auth = ('yourusername', 'yourpassword')

for status in statuses:
    url = base_url + status
    response = requests.get(url, auth=auth)
    comments = json.loads(response.text)

    cyrillic_comments = []

    for comment in comments:
        if has_cyrillic(comment['content']['rendered']):
            cyrillic_comments.append(comment)

    # delete comments with Cyrillic characters
    for comment in cyrillic_comments:
        delete_url = f"{site_url}/wp-json/wp/v2/comments/" + str(comment['id'])
        response = requests.delete(delete_url, auth=auth)
        if response.status_code == 200:
            print(f"Successfully deleted comment with id {comment['id']}")
        else:
            print(f"Failed to delete comment with id {comment['id']}. Response code: {response.status_code}")

Finishing touches

The other finishing touches I did were as follows:

  • Created a user specifically for comment moderation. I used the ‘Members’ plugin to create a very limited role (the only permissions granted are the necessary ones: Moderate Comments, Read, Edit Posts, Edit Others’ Posts, Edit Published Posts) and assigned said user to it. This greatly limits the potential for abuse if the account password falls into the wrong hands.
  • Copied the script to the web host running the blog
  • Set it to be executed hourly via crontab
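For reference, the hourly crontab entry looks something like this (the script path, Python binary, and log location are illustrative – adjust for wherever you copied the script on your host):

```shell
# run the spam-comment sweep at the top of every hour
0 * * * * /usr/bin/python3 /path/to/delete_spam_comments.py >> /var/log/spam_sweep.log 2>&1
```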

Now I have a fully automated script that deletes any blog comments with any Cyrillic characters!

You may be asking yourself why I don’t use Akismet or reCAPTCHA or anything like that. I found the speed tradeoff to not be worthwhile: they definitely slowed down my site for minimal benefit. It only took a couple minutes a day to delete the spam comments manually. But now it takes no time at all because it’s automated!

Here’s the link to the full ChatGPT conversation:

https://chat.openai.com/share/aad6a095-9b90-42c5-b1ca-de2a18828ba2

Results

I created a spam comment and ran the script (after adding a print line to show the comment). Here’s the output:

And the web logs showing the 3 statuses being retrieved via GET and the DELETE for the single spam comment:

I am quite satisfied with this basic solution. It took me far longer to type up this blog post than it did to get the script working.

Categories
LifeProTips

How a Travel Router can Save you Money and Share Wi-Fi on Flights

Introduction

I was on a flight from Denver to Phoenix last Thursday, and after I got my travel router all set up and shared with the family, I realized that people may not know how much money travel routers can save on in-flight Wi-Fi. Despite being a self-proclaimed nerd (on a blog titled Austin’s Nerdy Things, no less), I had never purchased in-flight Wi-Fi until January this year on a flight from Denver to Orlando. For that four-hour flight, I brought along my little GL.iNet device and a small battery pack to power it, and shared the $10 Wi-Fi with my own phone, my wife’s phone, our daughter’s iPad, and both my mom’s and dad’s phones. That’s $50 worth of Wi-Fi for 5 devices ($10×5) on a single $10 Wi-Fi purchase. It paid for itself in a single flight.

Update 2023-04-18: I was also made aware that recent Pixel and Samsung phones have this same capability! A few capable devices are listed below with the travel routers.

GL.iNet AR750S-EXT sitting on an airplane tray with a small USB battery pack rebroadcasting wi-fi
GL.iNet AR750S-EXT sitting on an airplane tray rebroadcasting the in-flight Wi-Fi

What is a travel router?

A travel router is a portable, compact Wi-Fi device (see picture above) that allows you to create your own wireless network. It works by connecting to an existing Wi-Fi network, such as the one available on a plane, and then sharing that connection with multiple devices. This means you can connect your laptop, smartphone, tablet, and other devices to the internet simultaneously without needing to purchase individual Wi-Fi passes for each one. The travel router appears as a single device to the host Wi-Fi network and funnels all of your devices’ traffic through that one connection.

Where else can you use a travel router?

You can use a travel router anywhere you pay for Wi-Fi, or anywhere that provides a Wi-Fi signal that must be signed into. I also use the same travel router at hotels. There are a couple of benefits:

  • The travel router has external antennas which provide a bit more gain than the internal one in devices. It can also be located where the Wi-Fi signal is strongest and repeat it further into the room/unit.
  • All devices know the travel router SSID and don’t need to be signed into the hotel Wi-Fi separately
  • Some hotels limit the number of devices per room/name combo, which isn’t an issue with a travel router

How much can you save on in-flight Wi-Fi with a travel router?

Let’s say you are a family of four. All four of you have a phone, one has an extra tablet, and one has a work laptop. That’s a total of 6 devices. Using all six would cost $60 per flight at United’s current rate of $10 per device per flight. If you use a travel router to rebroadcast the in-flight Wi-Fi, you are only spending $10 per flight for the router to gain Wi-Fi access and then sharing it among your own devices. That’s a savings of $50 per flight for a relatively standard family of four. Do that a few times a year and you can upgrade your room for a couple nights, or bump up to the next level of rental car.
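The math above is simple enough to sketch out; here is a quick sanity check in Python (using United’s $10/device example rate, not a universal price):

```python
per_device_cost = 10   # USD per device per flight (United's example rate)
devices = 6            # four phones + a tablet + a work laptop

without_router = per_device_cost * devices  # every device pays its own way
with_router = per_device_cost * 1           # only the router's connection is paid for
savings = without_router - with_router

print(f"Without a travel router: ${without_router}/flight")
print(f"With a travel router:    ${with_router}/flight")
print(f"Savings:                 ${savings}/flight")
```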

What are some good travel routers?

I personally have a GL.iNet GL-AR750S-EXT. It appears this is no longer manufactured/sold, but GL.iNet has plenty of other devices. They all run an open source networking software called OpenWrt, which is a very popular OS and runs on hundreds of millions of devices. They’re also named after rocks/minerals which my geologist wife enjoys.

A couple considerations for getting a travel router:

  • Buy one with at least two radios (often marked as “dual band”). This ensures you can connect to the host Wi-Fi on one band and rebroadcast your own Wi-Fi on the other band
  • USB power input – so they play nice with USB battery packs
  • External antenna – external antennas have a bit more gain than internal antennas so they have a longer range
  • Do you need to share files? If so, get one with an SD card slot.
  • Processor speed – directly influences how fast any VPN connections would be. Slower processors can’t encrypt/decrypt packets as fast as fast processors. Faster processors also consume more power.
  • Some are their own battery pack, which means no need to carry both a travel router and battery pack! Example: GL.iNet GL-E750, which has a 7000 mAh battery inside.

Here are a few options (I am not being paid by GL.iNet, I just like their devices):

  • GL.iNet GL-SFT1200 (Opal) – this would be a great introductory travel router so you can get your feet wet and play around for not much money. It is dual band with external antennas and will be fast enough for casual browsing. Note that this model does not use a fully open-source version of OpenWrt.
  • GL.iNet GL-MT1300 (Beryl) – a step up from the Opal device, with a USB 3 port instead of USB 2 and a more powerful processor. Both have 3x gigabit ethernet ports in case you’re looking for wired connectivity.
  • GL.iNet GL-AXT1800 (Slate AX) – supports the latest Wi-Fi standard (Wi-Fi 6, or 802.11ax), and has the fastest processor. If you use WireGuard, it can do up to 550 Mbps for VPN, or 120 Mbps for OpenVPN. I would expect this travel router to be future-proofed for many years, and it would actually do well for an in-home router as well.
  • Recent Samsung and Pixel phones (running Android 10 or newer) such as the Pixel 6, Pixel 7, Galaxy S22, Galaxy S23, and others

You’ll also need a battery pack. The MoKo pack we’ve used for years appears to also not be manufactured/sold anymore. Here are some other battery packs. Ensure you select the correct USB type (you probably want USB-C for everything at this point in 2023).

Using a GL.iNet device with United Wi-Fi (and related nuances)

I have found that quite a few host Wi-Fi networks have some nuance to them. United Wi-Fi specifically does not work if connecting over the 2.4 GHz band to the aircraft’s access point. It will broadcast the network over 2.4 GHz and allow you to connect, but nothing will actually work. So make sure you connect on the 5 GHz band and then rebroadcast your own Wi-Fi on the 2.4 GHz band. Some networks are the other way around, like the Residence Inn we stayed at in Phoenix this past weekend.

United Wi-Fi is surprisingly quick. There isn’t much waiting at all for casual browsing, and all social media apps work as expected.

Below will be a few screenshots of how I do things. TravelCat is the SSID I use for our devices on the travel router. I have a TravelCat set up on both bands and enable/disable as necessary to switch bands.

Screenshot of GL.iNet connected to United in-flight Wi-Fi on radio0 (5 GHz band) and broadcasting TravelCat on radio1 (2.4 GHz band)
Screenshot showing the GL.iNet device connected to “Unitedwifi.com” BSSID on radio0 (wlan0) and my iPhone, my wife’s iPhone, and our daughter’s iPad connected to TravelCat SSID on radio1/wlan1.

How to set up a travel router on United Wi-Fi

This is how I set up the travel router on United Wi-Fi. I’m guessing most other airlines/hotels are similar. Steps 1 and 2 can be completed prior to your flight and only need to be done once.

  1. On the travel router, navigate to DNS settings and uncheck “rebind protection”. This setting is generally useful and protects against malicious attacks, but it breaks the captive portals that sign you into various Wi-Fi networks. Just disable it; you’ll be fine.
  2. Set up your SSID on both 2.4 GHz and 5 GHz bands. One must be enabled at all times or you’ll need to plug in via ethernet or reset the device to access it again.
  3. Connect to the host Wi-Fi on the 5 GHz band if possible. There should be a “scan” button. Tap it and select the network with the right name that has the negative value closest to 0 (for example, -40 dBm is better than -60 dBm).
  4. Open the captive portal page directly if you know its address, for example unitedwifi.com. If you don’t, just try to go to google.com or yahoo.com or something boring like that and you should be redirected to complete the login process.
  5. Pay if necessary.
  6. All done! Start browsing as usual!
Travel router in seat back pocket with battery pack. You could also just leave it in your suitcase/backpack for the flight.

Conclusion

Investing in a travel router can pay for itself in a single flight (depending on family size), making it an essential piece of tech for any flyer. By sharing a Wi-Fi connection among multiple devices and splitting the cost with travel companions, you can save money and stay connected while traveling. So, on your next flight, consider bringing along a travel router and enjoy the convenience and cost savings it offers. Not gonna lie, I wish I had started using a travel router sooner, coughing up the $8-10 per flight to keep myself entertained with something more than endless games of 2048 or Chess or Catan. Besides, what self-respecting nerd doesn’t like playing with new technology?

Disclosure: Some of the links in this post are affiliate links. This means that, at zero cost to you, I will earn an affiliate commission if you click through the link and finalize a purchase.

Categories
AI

Stable Diffusion Tutorial – Nvidia GPU Installation

Like most other internet-connected people, I have seen the increase in AI-generated content in recent months. ChatGPT is fun to use and I’m sure there are plenty of useful use cases for it, but I’m not sure I have the imagination required to use it to its full potential. The AI art fad of a couple months ago was cool too. In the back of my mind, I kept thinking “where will AI take us in the next couple years?” I still don’t know the answer to that. The only “art” I am good at is pottery (thanks to high-school pottery class – I took 4 semesters of it and had a great time, but that’s a whole different story). But now I’m able to generate my own AI art thanks to a guide I found the other day on /g/. I am re-writing it here with screenshots and a bit more detail to try to make it more accessible to general users.

NOTE: You need a decent/recent Nvidia GPU to follow this guide. I have an RTX 2080 Super with 8GB of VRAM. There are low-memory workarounds, but I haven’t tested them yet. The absolute minimum is 2GB of VRAM and a GTX 7xx (Maxwell architecture) or newer GPU.

Stable Diffusion Tutorial Contents

  1. Installing Python 3.10
  2. Installing Git (the source control system)
  3. Clone the Automatic1111 web UI (this is the front-end for using the various models)
  4. Download models
  5. Adjust memory limits & enable listening outside of localhost
  6. First run
  7. Launching the web UI
  8. Generating Stable Diffusion images

Video version of this install guide

Coming soon. I always do the written guide first, then record based off the written guide. Hopefully by end of day (mountain time) Feb 24.

1 – Installing Python 3.10

This is relatively straightforward. To check your Python version, go to a command line and enter:

python --version

If you already have Python 3.10.x installed (as seen in the screenshot below), you’re good to go (minor version doesn’t matter).

Python 3.10 installed for Stable Diffusion

If not, go to the Python 3 download page and select the most recent 3.10 version. As of writing, the most recent is 3.10.10. Download the x64 installer and install. Ensure the “add python.exe to PATH” checkbox is checked. Adding python.exe to PATH means it can be called with only python at a command prompt instead of the full path, which is something like c:/users/whatever/somedirectory/moredirectories/3.10.10/python.exe.

Installing python and adding python.exe to PATH

2 – Installing Git (the source control system)

This is easier than Python – just install it from https://git-scm.com/downloads. Check for presence and version with git --version:

git installed and ready to go for Stable Diffusion

3 – Clone the Automatic1111 web UI (this is the front-end for using the various models)

With Git, clone means to download a copy of the code repository. When you clone a repo, a new directory is created in whatever directory the command is run in. Meaning that if you navigate to your desktop, and run git clone xyz, you will have a new folder on your desktop named xyz with the contents of the repository. To keep things simple, I am going to create a folder for all my Stable Diffusion stuff in the C:/ root named sd and then clone into that folder.

Open a command prompt and enter

cd c:\

Next create the sd folder and enter it:

mkdir sd
cd sd

Now clone the repository while in your sd folder:

git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui

After the git clone completes, there will be a new directory called ‘stable-diffusion-webui’:

stable-diffusion-webui cloned and ready to download models

4 – Download models

“Models” are what actually generate the content based on provided prompts. Generally, you will want to use pre-trained models. Luckily, there are many ready to use. Training your own model is far beyond the scope of this basic installation tutorial. Training your own models generally also requires huge amounts of time crunching numbers on very powerful GPUs.

As of writing, Stable Diffusion 1.5 (SD 1.5) is the recommended model. It can be downloaded (note: this is a 7.5GB file) from huggingface here.

Take the downloaded file, and place it in the stable-diffusion-webui/models/Stable-diffusion directory and rename it to model.ckpt (it can be named anything you want but the web UI automatically attempts to load a model named ‘model.ckpt’ upon start). If you’re following along with the same directory structure as me, this file will end up at C:\sd\stable-diffusion-webui\models\Stable-diffusion\model.ckpt.

Another popular model is Deliberate. It can be downloaded (4.2GB) here. Put it in the same folder as the other model. No need to rename the 2nd (and other) models.

After downloading both models, the directory should look like this:

Stable Diffusion 1.5 (SD 1.5) and Deliberate_v11 models ready for use

5 – Adjust memory limits & enable listening outside of localhost (command line arguments)

Inside the main stable-diffusion-webui directory live a number of launcher files and helper files. Find webui-user.bat and edit it (.bat files can be right-clicked -> edit).

Add --medvram (two dashes) after the equals sign of COMMANDLINE_ARGS. If you also want the UI to listen on all IP addresses instead of just localhost (don’t do this unless you know what that means), also add --listen.

webui-user.bat after edits

@echo off

set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--listen --medvram

call webui.bat

6 – First run

The UI tool (developed by automatic1111) will automatically download a variety of requirements upon first launch. It will take a few minutes to complete. Double-click the webui-user.bat file we just edited. It calls a few .bat files and eventually launches a Python file. The .bat files are essentially glue to stick a bunch of stuff together for the main file.

The very first thing it does is create a Python venv (virtual environment) to keep the Stable Diffusion packages separate from your other Python packages. Then it pip installs a bunch of packages related to CUDA/PyTorch/NumPy/etc. so Python can interact with your GPU.

webui-user.bat using pip to install necessary python packages like cuda

After everything is installed and ready to go, you will see a line that says: Running on local URL: http://127.0.0.1:7860. That means the Python web server UI is running on your own computer on port 7860 (if you added --listen to the launch args, it will show 0.0.0.0:7860, which means it is listening on all IP addresses and can be accessed by external machines).

stable-diffusion-webui launched and ready to load

7 – Launching the web UI

With the web UI server running, it can be accessed via browser on the same computer running the Python at http://127.0.0.1:7860. That link should work for you if you click it.

Note that if the Python process closes for whatever reason (you close the command window, your computer reboots, etc), you need to double-click webui-user.bat to relaunch it and it needs to be running any time you want to access the web UI.

Automatic1111 stable diffusion web UI up and running

As seen in the screenshot, there are a ton of parameters/settings. I’ll highlight a few in the next section.

8 – Generating Stable Diffusion images

This is the tricky part. The prompts make or break your generation. I am still learning. The prompt is where you enter what you want to see. Negative prompt is where you enter what you don’t want to see.

Let’s start simple, with cat in the prompt. Then click generate. A very reasonable-looking cat should soon appear (typically takes a couple seconds per image):

AI-generated cat with stable diffusion 1.5 with default settings

To highlight a few of the settings/sliders:

  • Stable diffusion checkpoint – model selector. Note that it’ll take a bit to load a new model (the multi-GB files need to be read in their entirety and ingested).
  • Prompt – what you want to see
  • Negative prompt – what you don’t want to see
  • Sampling method – various methods to sample new points
  • Sampling steps – how many iterations to use for image generation for a single image
  • Width – width of image to generate (in pixels). NOTE, you need a very powerful GPU with a ton of VRAM to go much higher than the default 512
  • Height – height of image to generate (in pixels). Same warning applies as width
  • Batch count – how many batches of images to generate sequentially
  • Batch size – how many images to generate in parallel within each batch (larger sizes need more VRAM)
  • CFG Scale – this slider tells the models how specific they need to be for the prompt. Higher is more specific. Really high values (>12ish) start to get a bit abstract. Probably want to be in the range of 3-10 for this slider.
  • Seed – random number generator seed. -1 means use a new seed for every image.
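As an aside, the same settings can be driven from a script: the web UI can expose a REST API if you add --api to COMMANDLINE_ARGS in webui-user.bat. The sketch below just builds a request payload mirroring the sliders described above; the endpoint path and field names are what recent versions of the web UI expose at /sdapi/v1/txt2img, but check your install’s /docs page to confirm before relying on them.

```python
import json

# Payload fields mirror the UI sliders described above.
payload = {
    "prompt": "cat",
    "negative_prompt": "ugly, deformed, lowres",
    "steps": 20,        # sampling steps
    "width": 512,       # going much higher needs a lot more VRAM
    "height": 512,
    "cfg_scale": 7,     # prompt adherence; ~3-10 is a sane range
    "seed": -1,         # -1 = new random seed for every image
    "batch_size": 1,
}

# To actually generate, POST it to the running server, e.g.:
#   import requests
#   r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
#   # generated images come back base64-encoded in r.json()["images"]
print(json.dumps(payload, indent=2))
```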

Some thoughts on prompt/negative prompt

From my ~24 hours using this tool, it is very clear that the prompt/negative prompt make or break your generation. I think ability as a pre-AI artist would come in handy here. I am no artist, so I have a hard time putting what I want to see into words. Take this example prompt: valley, fairytale treehouse village covered, matte painting, highly detailed, dynamic lighting, cinematic, realism, realistic, photo real, sunset, detailed, high contrast, denoised, centered. I would’ve said “fairytale treehouse” and stopped there. Compare the two results below: the detailed prompt first, and the basic “fairytale treehouse” prompt after it:

AI-generated “fairytale treehouse” via stable diffusion. Prompt: valley, fairytale treehouse village covered, matte painting, highly detailed, dynamic lighting, cinematic, realism, realistic, photo real, sunset, detailed, high contrast, denoised, centered
AI-generated “fairytale treehouse” via stable diffusion. Prompt: fairytale treehouse

One of these looks perfectly in place for a fantasy story. The other you could very possibly see in person in a nearby forest.

Both positive and negative prompts can get very long very quickly. Many of the AI-generated artifacts prevalent over the last month or two can be eliminated with negative prompt entries.

Example negative prompt: ugly, deformed, malformed, lowres, mutant, mutated, disfigured, compressed, noise, artifacts, dithering, simple, watermark, text, font, signage, collage, pixel

I will not pretend to know what works well versus what doesn’t. I believe that “prompt engineering” will be very important in AI’s future. Google is your friend here.

Conclusion

AI-generated content is here, and it is not going away. Even if it were outlawed, the code is already out there. AI will be a huge part of our future whether you want it or not. As the saying goes: Pandora’s box is now open.

I figured it was worth trying. The guide this is based on made it relatively easy for me (though I do have above-average computer skills), and I wanted to make it even easier. Hopefully you found this ‘how to set up Stable Diffusion’ guide easy to use as well. Please let me know in the comments section if you have any questions/comments/feedback – I check at least daily!

Resources

Huge shout out to whoever wrote the guide (“all anons”) at https://rentry.org/voldy. That is essentially where this entire guide came from.

Categories
ZFS

Intel Optane P1600X & P4800X as ZFS SLOG/ZIL

As a follow-up to my last post (ZFS SLOG Performance Testing of SSDs including Intel P4800X to Samsung 850 Evo), I wanted to focus specifically on the Intel Optane devices in a slightly faster test machine. These are incredible devices, especially at $59 (as of my latest check) for the P1600X 118GB in M.2 form factor. Hopefully you enjoy this quick review/benchmark.

What is ZFS & SLOG/ZIL

A one-sentence summary is as follows: ZFS is a highly advanced & adaptable file system with multiple features to enhance performance, including the SLOG/ZIL (Separate Log Device/ZFS Intent Log), which essentially functions as a write cache for synchronous writes. For a more detailed write-up, see Jim Salter’s ZFS sync/async ZIL/SLOG guide.

Now to jump right into the performance.

A different (faster) test machine

I popped the P1600X into an M.2 slot of a different machine I have here at home and was very surprised at how much faster it was than in the previous test box. I know the Xeon D series isn’t exactly known for speed, but people always say processor speed doesn’t really matter for storage. I guess being in the major leagues with these Optane devices means that processor speed does in fact matter. The single-thread speed of the 2678v3 isn’t much higher than the D-1541’s (same generation of Xeon), but multi-thread is ~40% faster.

Test machine specs:

  • Somewhat generic 1U case with 4×3.5″ bays
  • AsrockRack EPC612D8
  • Intel Xeon E5-2678v3
  • 2x32GB 2400 MHz (running at 2133 MHz due to v3)
  • Consumer Samsung NVMe boot drive (256GB)
  • 1x Intel D3-S4610 as ZFS test-pool
  • 2x Samsung PM853T 480GB as boot mirror for Proxmox (not used)
  • 1x Intel Optane DC P1600X 58GB in the first M.2 slot
  • 1x Intel Optane DC P4800X in the 2nd PCIe slot (via x16 riser)

Machine info:

root@truenas-1u[~]# uname -a
FreeBSD truenas-1u.home.fluffnet.net 13.1-RELEASE-p2 FreeBSD 13.1-RELEASE-p2 n245412-484f039b1d0 TRUENAS amd64
root@truenas-1u[~]# zfs -V
zfs-2.1.6-1
zfs-kmod-v2022101100-zfs_2a0dae1b7
root@truenas-1u[~]# fio -v
fio-3.28
root@truenas-1u[~]#

The test drives

Intel did provide the Optane devices but did not make any demands on what to write about or how to write it, nor did they review any of these posts before publishing.

Intel Optane DC P1600X 58GB M.2

The Intel P1600X placed in the first M.2 slot of the AsrockRack EPC612D8

Intel Optane DC P4800X 375GB AIC

This drive is “face down” when installed so here’s a picture of it on my desk in its shipping tray. The other side is populated with many memory chips.

Intel Optane DC P4800X in its shipping tray

TrueNAS diskinfo for Intel P1600X

I wanted to re-run this test in a machine where the M.2 slot wasn’t limited to x1 lane width. The results are far better, even for latency, which I wasn’t expecting. For the 4k size, the latency is 13.6 microseconds per write, which calculates out to roughly 71.8 kIOPS – great for this simple test (I believe it is QD=1). In the Xeon D-1541 machine with an x1 M.2 slot, the latencies were roughly double, and the throughput topped out at 374 MB/s. Quick comparison: spinning hard drives typically have sync write latencies in the 13 millisecond range, meaning the Optane gets writes committed roughly 1000x faster than a spinning hard drive.

root@truenas-1u[/mnt/test-pool]# diskinfo -wS /dev/nvd1
/dev/nvd1
        4096            # sectorsize
        58977157120     # mediasize in bytes (55G)
        14398720        # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        INTEL SSDPEK1A058GA     # Disk descr.
        PHOC209200Q5058A        # Disk ident.
        nvme1           # Attachment
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Synchronous random writes:
           4 kbytes:     13.6 usec/IO =    287.2 Mbytes/s
           8 kbytes:     17.8 usec/IO =    439.8 Mbytes/s
          16 kbytes:     26.6 usec/IO =    586.9 Mbytes/s
          32 kbytes:     43.9 usec/IO =    711.4 Mbytes/s
          64 kbytes:     79.4 usec/IO =    786.7 Mbytes/s
         128 kbytes:    151.5 usec/IO =    825.1 Mbytes/s
         256 kbytes:    280.8 usec/IO =    890.2 Mbytes/s
         512 kbytes:    545.3 usec/IO =    917.0 Mbytes/s
        1024 kbytes:   1064.4 usec/IO =    939.5 Mbytes/s
        2048 kbytes:   2105.2 usec/IO =    950.0 Mbytes/s
        4096 kbytes:   4199.2 usec/IO =    952.6 Mbytes/s
        8192 kbytes:   8367.5 usec/IO =    956.1 Mbytes/s
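The ~1000x claim above is easy to verify from the numbers (assuming the ~13 ms figure quoted for a spinning disk’s sync write):

```python
hdd_latency = 13e-3       # ~13 ms typical sync write latency on a spinning hard drive
optane_latency = 13.6e-6  # 13.6 us for a 4k sync write, from the diskinfo output above

speedup = hdd_latency / optane_latency
print(f"Optane commits a 4k sync write ~{speedup:.0f}x faster than a hard drive")
```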

TrueNAS diskinfo for Intel P4800X

The P4800X was also faster in the Xeon E5-2678v3 machine, with latencies 8-10% better and throughput up to 270 MB/s higher. Makes me wonder what an even faster machine could do. The v3 stuff is getting quite old at this point, but it is still an excellent value for homelab usage.

The 4k latency is 14.1 microseconds, which is curiously 0.5 us slower than the P1600X. At every other size it was both quicker (lower latency) and faster (higher throughput).

root@truenas-1u[/mnt/test-pool]# diskinfo -wS /dev/nvd0
/dev/nvd0
        4096            # sectorsize
        375083606016    # mediasize in bytes (349G)
        91573146        # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        INTEL SSDPED1K375GA     # Disk descr.
        PHKS750500G2375AGN      # Disk ident.
        nvme0           # Attachment
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Synchronous random writes:
           4 kbytes:     14.1 usec/IO =    276.8 Mbytes/s
           8 kbytes:     16.0 usec/IO =    489.3 Mbytes/s
          16 kbytes:     21.2 usec/IO =    737.6 Mbytes/s
          32 kbytes:     30.3 usec/IO =   1032.6 Mbytes/s
          64 kbytes:     48.9 usec/IO =   1278.3 Mbytes/s
         128 kbytes:     86.1 usec/IO =   1451.3 Mbytes/s
         256 kbytes:    151.1 usec/IO =   1655.0 Mbytes/s
         512 kbytes:    277.8 usec/IO =   1800.0 Mbytes/s
        1024 kbytes:    536.0 usec/IO =   1865.6 Mbytes/s
        2048 kbytes:   1048.3 usec/IO =   1907.9 Mbytes/s
        4096 kbytes:   2070.8 usec/IO =   1931.6 Mbytes/s
        8192 kbytes:   4120.3 usec/IO =   1941.6 Mbytes/s
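Incidentally, diskinfo's throughput column is derived straight from its latency column: block size divided by the per-IO time, reported in MiB/s (despite the "Mbytes/s" label). A quick sanity check in Python, using the 4k figure from the output above:

```python
def usec_per_io_to_mib_s(block_bytes: int, usec_per_io: float) -> float:
    """Reproduce diskinfo's throughput column from its latency column."""
    bytes_per_sec = block_bytes / (usec_per_io * 1e-6)
    return bytes_per_sec / 2**20  # diskinfo's "Mbytes/s" is MiB-based

# 4k row from the P4800X output above: 14.1 usec/IO
print(f"{usec_per_io_to_mib_s(4096, 14.1):.1f} MiB/s")  # ~277, vs the reported 276.8
```

The tiny mismatch is just rounding in the reported latency.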

fio command for performance testing

I varied runtime (between 5 and 30s – for the fast devices, a 30s run is limited by the ~500MB/s write speed of the single backing disk, so I reduced the runtime to the default txg commit interval of 5s) and iodepth (1, 4, 16, 64).

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=8g --iodepth=1 --runtime=30 --time_based
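To cover the whole queue-depth sweep without retyping, a throwaway helper along these lines works (hypothetical wrapper, not the exact script I used – pipe the output to sh, or swap print for subprocess, to actually run it):

```python
# Emit the fio command for each queue depth tested in this post.
BASE = ("fio --name=random-write --ioengine=posixaio --rw=randwrite "
        "--bs=4k --numjobs=1 --size=8g --runtime=30 --time_based")

commands = [f"{BASE} --iodepth={qd}" for qd in (1, 4, 16, 64)]
for cmd in commands:
    print(cmd)
```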

Results

I’m just going to copy + paste (and link to) the Excel table I created, since I don’t really have a good way to do tables in my current WordPress install.

ZFS benchmark results with Intel Optane P1600X & P4800X

Conclusion

These P1600X drives are very, very quick with their 4k write latencies in the low teens. With the recent price drops, and the increasing prevalence of M.2 slots in storage devices, adding one as a SLOG is a super-cheap method to drastically improve sync write performance. I am always on eBay trying to find used enterprise drives with high performance to price ratios and I think this one tops the charts (Amazon link: P1600X 118GB M.2). The write endurance is plenty high for many use cases at 1.3PB (ServeTheHome.com highlights that recycled data center SSDs rarely have more than 1PB written in their article Used enterprise SSDs: Dissecting our production SSD population). The fact that this M.2 doesn’t take a full drive bay makes it even more appealing for storage chassis with only 4 drive bays.

Categories
ZFS

ZFS SLOG Performance Testing of SSDs including Intel P4800X to Samsung 850 Evo

I’ve been meaning to type this post up for a couple months, but with a 7 month old and a 29 month old at home, a new job, and a half-renovated basement, it took a bit of a back seat.

First question you may have:

What is ZFS?

ZFS stands for ‘zettabyte file system’. It has a long history that started in the early 2000s as part of Sun Microsystems’ Solaris operating system. Since then, it has undergone a decent glow-up, broken free of Oracle/Sun licensing via a rebranding as OpenZFS (I will not pretend to know half the details of how this happened), and is now a default option in at least two somewhat mainstream operating systems (Ubuntu 20.04+ and Proxmox). It just so happens that those are the two primary Linux-based operating systems I use. It also just so happens that I use ZFS as my main filesystem for all of my Proxmox hosts, as well as my network storage systems running FreeNAS/TrueNAS Core.

Why do I use ZFS?

A huge benefit of ZFS is data integrity. Long story very short, each block written to disk is written with a checksum. That checksum is checked upon reading, and if it doesn’t match, ZFS knows that the data became corrupted somewhere between writing it to disk and reading it back (could be an issue with the disk, cables, disk controller, or even a flipped bit from a cosmic ray). Luckily, ZFS has a variety of options that make this a non-issue. For anything important, I do at least a mirror, meaning two disks with the same exact data written to each. If a checksum is invalid from one disk, the other is tried. If the other checks out, the first is corrected. ZFS also runs a “scrub” at regular intervals (default is usually monthly), where it will read every single block in a pool and verify the checksums for integrity. This prevents bitrot (the tendency for data to go bad if left sitting for years). I will admit, none of my data is super super important, but I like to pretend it is.
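The read-repair logic is simple enough to sketch as a toy model (ZFS actually uses fletcher4 or SHA-256 checksums per block, and would also rewrite the bad copy; crc32 here is just for brevity):

```python
import zlib

# Toy model of mirror self-healing: return the first copy whose checksum
# matches the one stored at write time.
def read_with_repair(copies: list, expected_crc: int) -> bytes:
    for block in copies:
        if zlib.crc32(block) == expected_crc:
            return block
    raise IOError("all copies failed checksum (unrecoverable)")

data = b"important bits"
checksum = zlib.crc32(data)
flipped = b"important bitz"  # simulated bitrot on disk 1
assert read_with_repair([flipped, data], checksum) == data
```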

Performance of ZFS

ZFS can be extremely performant, if the right hardware is used. A trade-off of the excellent data integrity is the overhead involved with checksums and all that jazz. ZFS is also a copy-on-write filesystem, which means snapshots are instant, and blocks aren’t changed – just the reference to the block. Where things get slow is when you have a pool of spinning, mechanical hard drives and you request sync writes, meaning that ZFS will not say the write is completed until it is safely committed to disk. That takes 8-15 milliseconds for spinning disks. 8-15 milliseconds may not seem like a long time, but it is an eternity compared to CPU L1/L2/L3 cache, and a tenth (don’t quote me on that) of an eternity compared to fast NVMe SSDs. To overcome the latency of mechanical disk access, loads of people think, “Oh I can just slap in this old SSD from my last build as a read/write cache and call it good”. That’s what we’ll examine today.
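You can feel the sync-write penalty even outside ZFS. A rough sketch timing fsync-per-write against buffered writes (absolute numbers depend entirely on whatever device backs your temp directory):

```python
import os
import tempfile
import time

def timed_writes(n: int, sync: bool) -> float:
    """Write n 4k blocks; optionally fsync after each, like sync=always."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(n):
            os.write(fd, b"\0" * 4096)
            if sync:
                os.fsync(fd)  # don't return until the block is on stable storage
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)

print(f"buffered: {timed_writes(200, False):.4f}s  fsync each: {timed_writes(200, True):.4f}s")
```

On a mechanical disk the fsync variant is orders of magnitude slower; on a fast NVMe drive the gap shrinks dramatically, which is the whole point of this post.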

Benchmarking ZFS sync writes

The test system is something I picked up from r/homelabsales a few months ago for $150 in a Datto case. I figured that compared to current Synology/QNAP 4-bay NAS devices, it was 1) a steal, and 2) far more capable:

  • Intel Xeon D-1541 on a rebranded ASRock D1541D4U-2T8R
  • 4x32GB DDR4 2400MHz
  • 2x 10G from Intel X540
  • LSI3008 SAS3 controller
  • IPMI
  • 4 bay 1U chassis
  • 2x 4TB WD Red as the base mirror.
root@truenas-datto[~]# uname -a
FreeBSD truenas-datto.home.fluffnet.net 13.1-RELEASE-p2 FreeBSD 13.1-RELEASE-p2 n245412-484f039b1d0 TRUENAS amd64

root@truenas-datto[~]# zfs -V
zfs-2.1.6-1
zfs-kmod-v2022101100-zfs_2a0dae1b7

I stopped at 2x 4TB WD Reds instead of four so I could easily swap different SSDs into one of the other bays for benchmarking. I figure a basic 2 HDD mirror is decently close to many r/homelab setups. Of course, performance will roughly double for the base zpool if another mirror is added.

The guy on homelabsales had a bunch of similarly spec’d systems, all of which sold for very cheap – let me know if you got one!

The fio command used

fio is a standard benchmarking tool. I kept it simple and used the same script for each test. The following does random writes with a 4k block size (to test IOPS instead of pure bandwidth/throughput), an IO depth of 16, a 30 second duration, and a sync after each write.

Note: I set sync=always on the dataset, so I’m not specifying --fsync=1 here.

Note 2: iodepth=16 is a bit strong for home situations, but I’ve already redone this benchmark series twice, so I’m not doing it again with qd=1 for every drive. QD=1 is covered for the three fastest drives at the end.

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=30 --time_based

I will be testing a few different drives as SLOG devices (“write cache”):

  • Intel DC S3500 120GB
  • Samsung 850 Evo 500GB
  • HGST HUSSL4020ASS600 200GB
  • HGST Ultrastar SS300 HUSMM3240ASS205 400GB
  • Intel Optane P1600X M.2 form factor 58GB
  • Intel Optane P4800X AIC (PCI-e form factor) 375GB

The Intel devices were graciously provided by Intel with no expectation other than running some benchmarks on them. Apparently their Optane business is winding down, which is unfortunate because it is an amazing technology. ServeTheHome.com has confirmed the P4800X AIC form-factor is being discontinued.

A short word on “write cache” for ZFS – go read Jim Salter’s ZFS sync/async ZIL/SLOG guide. He is a ZFS expert and has written many immensely helpful guides on it. Long story short, a SLOG is a write cache device, but only for sync writes. It needs to have consistent, low latency to function well. Old enterprise SSDs are ideal for this.

FreeBSD’s diskinfo command results

I briefly wanted to show the diskinfo results for the Intel devices:

Intel Optane P4800X 375GB

Intel Optane DC P4800X in tray
This is the single most expensive piece of computer hardware I’ve ever held in my own hands – the Intel Optane SSD DC P4800X 375GB in the AIC (PCIe) form-factor.

Bottoms out at 15 microseconds per IO for 259.6 MB/s at 4k blocksize. This is extremely fast, and is considered to be one of the best SLOG devices currently available (top 5 easily). Make no mistake, the Intel P4800X is one of the highest performing solid state drives in existence. Here’s a quote from the StorageReview.com review of the device: “For low-latency workloads, there is currently nothing that comes close to the Intel Optane SSD DC P4800X.”

There are a few form factors available, with price tags to match.

root@truenas[~]# diskinfo -wS /dev/nvd0
/dev/nvd0
        4096            # sectorsize
        375083606016    # mediasize in bytes (349G)
        91573146        # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        INTEL SSDPED1K375GA     # Disk descr.
        PHKS750500G2375AGN      # Disk ident.
        nvme0           # Attachment
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Synchronous random writes:
           4 kbytes:     15.0 usec/IO =    259.6 Mbytes/s
           8 kbytes:     18.3 usec/IO =    427.6 Mbytes/s
          16 kbytes:     23.4 usec/IO =    667.7 Mbytes/s
          32 kbytes:     34.6 usec/IO =    904.3 Mbytes/s
          64 kbytes:     55.9 usec/IO =   1118.8 Mbytes/s
         128 kbytes:    121.1 usec/IO =   1032.2 Mbytes/s
         256 kbytes:    197.8 usec/IO =   1263.7 Mbytes/s
         512 kbytes:    352.2 usec/IO =   1419.5 Mbytes/s
        1024 kbytes:    651.0 usec/IO =   1536.2 Mbytes/s
        2048 kbytes:   1237.5 usec/IO =   1616.1 Mbytes/s
        4096 kbytes:   2413.5 usec/IO =   1657.4 Mbytes/s
        8192 kbytes:   4772.1 usec/IO =   1676.4 Mbytes/s

Intel Optane P1600X 58GB diskinfo

The P4800X’s little brother is no slouch. In fact, it is quicker latency-wise than any SATA or SAS SSD currently in existence. The diskinfo shows that in a slightly more capable system (Xeon E5-2678v3, 2x32GB 2133 MHz, ASRock EPC612D8), the latency is even lower than in the P4800X. The latency is an astounding 13.3 microseconds per IO at 4k size. These are now available on Amazon (as a 118GB version) for $88 as of writing – Intel Optane P1600X 118GB M.2. Note: the original version of this post had the numbers from a P1600X in a M.2 slot limited to a single PCIe lane (x1). The below numbers are from a full x4 slot.

root@truenas[~]# diskinfo -wS /dev/nvd0
/dev/nvd0
        512             # sectorsize
        58977157120     # mediasize in bytes (55G)
        115189760       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        INTEL SSDPEK1A058GA     # Disk descr.
        PHOC209200Q5058A        # Disk ident.
        nvme0           # Attachment
        Yes             # TRIM/UNMAP support
        0               # Rotation rate in RPM

Synchronous random writes:
         0.5 kbytes:      9.2 usec/IO =     52.9 Mbytes/s
           1 kbytes:      9.3 usec/IO =    105.2 Mbytes/s
           2 kbytes:     10.9 usec/IO =    179.6 Mbytes/s
           4 kbytes:     13.3 usec/IO =    293.3 Mbytes/s
           8 kbytes:     19.5 usec/IO =    400.3 Mbytes/s
          16 kbytes:     35.2 usec/IO =    444.1 Mbytes/s
          32 kbytes:     62.4 usec/IO =    500.6 Mbytes/s
          64 kbytes:    116.9 usec/IO =    534.8 Mbytes/s
         128 kbytes:    218.6 usec/IO =    571.7 Mbytes/s
         256 kbytes:    413.4 usec/IO =    604.7 Mbytes/s
         512 kbytes:    806.9 usec/IO =    619.6 Mbytes/s
        1024 kbytes:   1570.5 usec/IO =    636.7 Mbytes/s
        2048 kbytes:   3095.1 usec/IO =    646.2 Mbytes/s
        4096 kbytes:   5889.7 usec/IO =    679.1 Mbytes/s
        8192 kbytes:  12175.3 usec/IO =    657.1 Mbytes/s

On to the benchmarking results

The base topology of the zpool is a simple mirrored pair of WD Reds.

root@truenas-datto[~]# zpool create -o ashift=12 bench-pool mirror /dev/da1 /dev/da3
root@truenas-datto[~]# zfs set recordsize=4k bench-pool
root@truenas-datto[~]# zfs set compress=lz4 bench-pool
root@truenas-datto[~]# zfs set sync=always bench-pool

Base 2x 4TB WD Red Results

Sync write results: 894 IOPS, 3.6 MB/s. It is 2023 – do not use uncached mechanical hard drives for sync workloads. Median completion latency of 12 milliseconds, average of 17.8 milliseconds.

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=30 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=5545KiB/s][w=1386 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=4014: Mon Jan 30 11:13:18 2023
  write: IOPS=894, BW=3578KiB/s (3664kB/s)(105MiB/30035msec); 0 zone resets
    slat (nsec): min=1187, max=148381, avg=4123.14, stdev=4314.32
    clat (msec): min=11, max=139, avg=17.85, stdev=16.63
     lat (msec): min=11, max=139, avg=17.86, stdev=16.63
    clat percentiles (msec):
     |  1.00th=[   12],  5.00th=[   12], 10.00th=[   12], 20.00th=[   12],
     | 30.00th=[   12], 40.00th=[   12], 50.00th=[   12], 60.00th=[   12],
     | 70.00th=[   12], 80.00th=[   12], 90.00th=[   49], 95.00th=[   61],
     | 99.00th=[   73], 99.50th=[   81], 99.90th=[  120], 99.95th=[  140],
     | 99.99th=[  140]
   bw (  KiB/s): min=  871, max= 5619, per=100.00%, avg=3594.69, stdev=2066.03, samples=59
   iops        : min=  217, max= 1404, avg=898.29, stdev=516.49, samples=59
  lat (msec)   : 20=86.12%, 50=4.53%, 100=9.05%, 250=0.30%
  cpu          : usr=0.25%, sys=0.58%, ctx=2221, majf=0, minf=1
  IO depths    : 1=2.1%, 2=5.7%, 4=18.5%, 8=63.3%, 16=10.4%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=92.8%, 8=1.7%, 16=5.5%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,26864,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

With sync=disabled

This effectively tests how fast the pool can write without regard for sync semantics. Writes are essentially buffered in memory (as dirty data in the open transaction group) and flushed to disk every 5 seconds by default.

IOPS = 48.3k, median latency = 400 microseconds, avg latency = 321 microseconds. Note that the max IOPS recorded was 288.8k, which was likely for the first few seconds as the in-memory buffer filled up. As the dirty data started to flush to disk, the writes slowed down to keep pace with the slowness of the disks writing. This figure is the maximum this system can generate, regardless of what disks are used.

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=30 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=137MiB/s][w=35.0k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5388: Mon Jan 30 11:57:29 2023
  write: IOPS=48.3k, BW=189MiB/s (198MB/s)(5658MiB/30001msec); 0 zone resets
    slat (nsec): min=981, max=176175, avg=2249.36, stdev=1835.18
    clat (usec): min=10, max=1486, avg=319.71, stdev=267.96
     lat (usec): min=12, max=1490, avg=321.96, stdev=267.89
    clat percentiles (usec):
     |  1.00th=[   21],  5.00th=[   25], 10.00th=[   29], 20.00th=[   37],
     | 30.00th=[   52], 40.00th=[   68], 50.00th=[  400], 60.00th=[  461],
     | 70.00th=[  498], 80.00th=[  578], 90.00th=[  644], 95.00th=[  693],
     | 99.00th=[ 1004], 99.50th=[ 1090], 99.90th=[ 1172], 99.95th=[ 1205],
     | 99.99th=[ 1270]
   bw (  KiB/s): min=61860, max=1155592, per=100.00%, avg=194107.64, stdev=242822.10, samples=59
   iops        : min=15465, max=288898, avg=48526.51, stdev=60705.52, samples=59
  lat (usec)   : 20=0.59%, 50=29.12%, 100=12.99%, 250=2.63%, 500=25.14%
  lat (usec)   : 750=26.72%, 1000=1.74%
  lat (msec)   : 2=1.05%
  cpu          : usr=12.47%, sys=24.78%, ctx=671732, majf=0, minf=1
  IO depths    : 1=0.1%, 2=0.5%, 4=12.1%, 8=56.8%, 16=30.6%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=96.0%, 8=1.4%, 16=2.6%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1448382,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

With Samsung Evo 850 500GB

This was a very popular SSD a few years ago, with great performance per dollar when it was released. This was the 2nd SSD I ever bought (the first was a Crucial M4 256GB).

IOPS = 6306, median latency = 1.9 milliseconds, avg latency = 2.5 ms, standard deviation = 1.6 ms. These are not great performance numbers. Consumer drives typically don’t deal well with high queue depth operations.
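As a sanity check on any fio run, IOPS and latency should agree via Little's law (sustained IOPS ≈ queue depth / mean completion latency). With the Evo 850 numbers from the fio output below:

```python
# Little's law sanity check against the Evo 850 run
iodepth = 16
avg_latency_s = 2513.68e-6           # fio's "lat ... avg" line, in seconds
estimated_iops = iodepth / avg_latency_s
print(round(estimated_iops))         # ~6365, in line with fio's reported 6306
```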

root@truenas-datto[~]# smartctl -a /dev/da0
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 500GB

root@truenas-datto[~]# zpool add bench-pool log /dev/da0
root@truenas-datto[~]# zpool status bench-pool
  pool: bench-pool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        bench-pool  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da3     ONLINE       0     0     0
        logs
          da0       ONLINE       0     0     0

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=30 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=9141KiB/s][w=2285 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=4532: Mon Jan 30 11:21:33 2023
  write: IOPS=6306, BW=24.6MiB/s (25.8MB/s)(739MiB/30005msec); 0 zone resets
    slat (nsec): min=1088, max=250418, avg=4202.64, stdev=3509.42
    clat (usec): min=1038, max=14358, avg=2509.48, stdev=1683.17
     lat (usec): min=1044, max=14362, avg=2513.68, stdev=1683.00
    clat percentiles (usec):
     |  1.00th=[ 1418],  5.00th=[ 1713], 10.00th=[ 1762], 20.00th=[ 1811],
     | 30.00th=[ 1860], 40.00th=[ 1893], 50.00th=[ 1926], 60.00th=[ 1975],
     | 70.00th=[ 2008], 80.00th=[ 2073], 90.00th=[ 6259], 95.00th=[ 6849],
     | 99.00th=[ 8848], 99.50th=[ 9634], 99.90th=[12518], 99.95th=[12780],
     | 99.99th=[13173]
   bw (  KiB/s): min= 8928, max=32191, per=100.00%, avg=25461.85, stdev=8650.57, samples=59
   iops        : min= 2232, max= 8047, avg=6365.05, stdev=2162.61, samples=59
  lat (msec)   : 2=67.18%, 4=20.48%, 10=12.06%, 20=0.28%
  cpu          : usr=2.26%, sys=5.10%, ctx=35707, majf=0, minf=1
  IO depths    : 1=0.1%, 2=0.3%, 4=6.6%, 8=75.7%, 16=17.4%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=91.3%, 8=4.8%, 16=3.9%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,189227,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

With a 10 year old Intel DC S3500 120GB

This was the first enterprise SSD I acquired for use in my homelab. I bought two, for a read/write cache on my Xpenology machine (DSM requires 2 drives in a mirror for write cache). The datasheet indicates a write latency of 65 us, which is pretty quick – much faster than the 1900 us median latency of the Evo 850 in the previous section. Unfortunately for the 120GB drives, sequential writes are indicated at 135 MB/s, which is fine for gigabit filesharing. These drives (and many, many other “enterprise” SSDs) feature capacitors for power loss protection. The drive can acknowledge writes very quickly because the capacitors store enough energy for the drive to commit in-flight writes to NAND in the event of a power loss.

IOPS = 32.5k, median latency = 437 us, 5x faster than the Evo 850. Standard deviation = 0.14 ms, 10x more consistent than the Evo 850.

root@truenas-datto[~]# smartctl -a /dev/da0
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BB120G4

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=30 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=128MiB/s][w=32.9k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=4782: Mon Jan 30 11:26:32 2023
  write: IOPS=32.5k, BW=127MiB/s (133MB/s)(3811MiB/30001msec); 0 zone resets
    slat (nsec): min=1226, max=12095k, avg=3314.28, stdev=13914.23
    clat (usec): min=156, max=17646, avg=468.20, stdev=137.34
     lat (usec): min=214, max=17648, avg=471.52, stdev=138.30
    clat percentiles (usec):
     |  1.00th=[  363],  5.00th=[  388], 10.00th=[  400], 20.00th=[  416],
     | 30.00th=[  429], 40.00th=[  437], 50.00th=[  445], 60.00th=[  457],
     | 70.00th=[  478], 80.00th=[  506], 90.00th=[  570], 95.00th=[  611],
     | 99.00th=[  725], 99.50th=[  807], 99.90th=[ 1057], 99.95th=[ 1254],
     | 99.99th=[ 3032]
   bw (  KiB/s): min=109752, max=133120, per=100.00%, avg=130157.39, stdev=4451.81, samples=59
   iops        : min=27438, max=33280, avg=32539.07, stdev=1112.98, samples=59
  lat (usec)   : 250=0.01%, 500=78.78%, 750=20.44%, 1000=0.64%
  lat (msec)   : 2=0.11%, 4=0.02%, 20=0.01%
  cpu          : usr=6.55%, sys=13.26%, ctx=186714, majf=0, minf=1
  IO depths    : 1=0.3%, 2=4.6%, 4=23.3%, 8=59.6%, 16=12.3%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.1%, 8=0.3%, 16=5.5%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,975522,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

With an 11 year old HGST SSD400S.B 200GB

This is a SAS2 SSD with SLC NAND, and power loss protection. It has a rated endurance of 18 petabytes. This is effectively infinite write endurance for any home use situation (and even many enterprise/commercial use situations).
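To put "effectively infinite" in numbers – assuming a continuous 100 MB/s write stream, which is already far beyond a realistic sustained home workload:

```python
# Back-of-envelope: how long 18 PB of rated endurance lasts at 100 MB/s nonstop
endurance_bytes = 18e15
write_rate_bytes_s = 100e6
years = endurance_bytes / write_rate_bytes_s / (3600 * 24 * 365)
print(f"{years:.1f} years of nonstop writing")  # ~5.7
```

Nearly six years of writing at full tilt, 24/7. Any realistic homelab duty cycle stretches that into decades.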

IOPS = 27.4k, median latency = 486 us, standard dev = 238 us. Very good numbers.

root@truenas-datto[~]# smartctl -a /dev/da0
=== START OF INFORMATION SECTION ===
Vendor:               HITACHI
Product:              HUSSL402 CLAR200

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=30 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=107MiB/s][w=27.5k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5064: Mon Jan 30 11:31:09 2023
  write: IOPS=27.4k, BW=107MiB/s (112MB/s)(3208MiB/30001msec); 0 zone resets
    slat (nsec): min=1234, max=1589.4k, avg=3528.85, stdev=7290.11
    clat (usec): min=159, max=17706, avg=559.45, stdev=238.22
     lat (usec): min=207, max=17710, avg=562.98, stdev=238.44
    clat percentiles (usec):
     |  1.00th=[  363],  5.00th=[  396], 10.00th=[  416], 20.00th=[  445],
     | 30.00th=[  457], 40.00th=[  469], 50.00th=[  486], 60.00th=[  519],
     | 70.00th=[  578], 80.00th=[  652], 90.00th=[  791], 95.00th=[  930],
     | 99.00th=[ 1172], 99.50th=[ 1287], 99.90th=[ 4146], 99.95th=[ 4359],
     | 99.99th=[ 4948]
   bw (  KiB/s): min=82868, max=135401, per=99.87%, avg=109358.92, stdev=15198.36, samples=59
   iops        : min=20717, max=33850, avg=27339.37, stdev=3799.66, samples=59
  lat (usec)   : 250=0.01%, 500=55.35%, 750=32.72%, 1000=8.72%
  lat (msec)   : 2=3.07%, 4=0.03%, 10=0.12%, 20=0.01%
  cpu          : usr=6.02%, sys=11.55%, ctx=158396, majf=0, minf=1
  IO depths    : 1=0.3%, 2=4.5%, 4=23.5%, 8=59.5%, 16=12.3%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.2%, 8=0.3%, 16=5.5%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,821298,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

HGST Ultrastar SS300 HUSMM3240 400GB

This is a more modern SAS3 MLC SSD, spec’d for write-intensive use cases. As far as I can tell, this drive is about as close to NVMe performance as you can get from a SAS3 interface per the datasheet specs (200k IOPS, 2050 MB/s throughput, 85 us latency max). Endurance is 10 drive writes per day for 5 years, which is 7.3 PB.
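The endurance arithmetic checks out (using the spec sheet's decimal 400 GB capacity):

```python
# 10 drive writes per day on a 400 GB drive over 5 years, in petabytes
dwpd, capacity_gb, years = 10, 400, 5
total_pb = dwpd * capacity_gb * 365 * years / 1e6  # GB written -> PB
print(total_pb)  # 7.3
```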

Our pool now does 43.0k IOPS (max 49.8k), with a median sync latency of 306 us, standard dev of 244 us

root@truenas-datto[~]# smartctl -a /dev/da0
=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUSMM3240ASS205

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=30 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=107MiB/s][w=27.5k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5609: Mon Jan 30 12:01:13 2023
  write: IOPS=43.0k, BW=168MiB/s (176MB/s)(5045MiB/30001msec); 0 zone resets
    slat (nsec): min=1190, max=300805, avg=3316.20, stdev=6306.27
    clat (usec): min=86, max=33197, avg=350.86, stdev=244.72
     lat (usec): min=129, max=33199, avg=354.17, stdev=244.57
    clat percentiles (usec):
     |  1.00th=[  212],  5.00th=[  247], 10.00th=[  265], 20.00th=[  281],
     | 30.00th=[  289], 40.00th=[  297], 50.00th=[  306], 60.00th=[  314],
     | 70.00th=[  330], 80.00th=[  347], 90.00th=[  379], 95.00th=[  486],
     | 99.00th=[ 1369], 99.50th=[ 1418], 99.90th=[ 1532], 99.95th=[ 1565],
     | 99.99th=[ 1631]
   bw (  KiB/s): min=106970, max=199121, per=100.00%, avg=173154.63, stdev=33688.29, samples=59
   iops        : min=26742, max=49780, avg=43288.37, stdev=8422.14, samples=59
  lat (usec)   : 100=0.01%, 250=5.86%, 500=89.35%, 750=0.62%, 1000=0.05%
  lat (msec)   : 2=4.11%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=7.38%, sys=17.49%, ctx=318098, majf=0, minf=1
  IO depths    : 1=0.3%, 2=3.9%, 4=21.5%, 8=61.2%, 16=13.1%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.9%, 8=0.8%, 16=5.3%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1291513,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Intel Optane DC P1600X 58GB

This is one of the drives sent to me by Intel. Optane is not NAND technology. I don’t really keep up on the details of persistent storage, so you’ll have to look up the specifics. What I do know is that the technology allows for much faster writes, in terms of latency. At the top end, the throughput isn’t as high as modern NVMe, but the latency is much lower at low queue depths. As a reminder, all the tests so far are with queue depth = 16. At these queue depths, Optane doesn’t look as fast as it really is, unless you’re working with the P1600X’s big brother (next section).

IOPS = 39.2k (max 57.6k), median latency = 322 us, std dev = 385 us.

root@truenas-datto[~]# smartctl -a /dev/nvme1 | head -n 6
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEK1A058GA
Serial Number:                      PHOC209200KC058A

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=30 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=98.7MiB/s][w=25.3k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5670: Mon Jan 30 12:03:22 2023
  write: IOPS=39.2k, BW=153MiB/s (160MB/s)(4591MiB/30001msec); 0 zone resets
    slat (nsec): min=1096, max=337336, avg=3261.19, stdev=6144.90
    clat (usec): min=26, max=72627, avg=387.96, stdev=385.67
     lat (usec): min=75, max=72631, avg=391.22, stdev=385.50
    clat percentiles (usec):
     |  1.00th=[  163],  5.00th=[  200], 10.00th=[  221], 20.00th=[  265],
     | 30.00th=[  289], 40.00th=[  306], 50.00th=[  322], 60.00th=[  334],
     | 70.00th=[  351], 80.00th=[  408], 90.00th=[  619], 95.00th=[  857],
     | 99.00th=[ 1385], 99.50th=[ 1500], 99.90th=[ 2245], 99.95th=[ 2343],
     | 99.99th=[ 2671]
   bw (  KiB/s): min=51568, max=230682, per=100.00%, avg=157659.42, stdev=51886.29, samples=59
   iops        : min=12892, max=57670, avg=39414.47, stdev=12971.59, samples=59
  lat (usec)   : 50=0.01%, 100=0.04%, 250=16.53%, 500=67.34%, 750=10.13%
  lat (usec)   : 1000=1.55%
  lat (msec)   : 2=4.31%, 4=0.11%, 10=0.01%, 50=0.01%, 100=0.01%
  cpu          : usr=7.55%, sys=17.05%, ctx=310458, majf=0, minf=1
  IO depths    : 1=0.2%, 2=3.3%, 4=18.4%, 8=63.4%, 16=14.8%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.6%, 8=1.5%, 16=4.8%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1175375,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Intel Optane DC P4800X 375GB

This is the other drive that Intel sent me. This drive is fast. Many review sites use verbiage such as “this drive has no comparison” or “this is the fastest drive we’ve ever tested”.

My first test was for 30 seconds. I realized the drive was doing its thing (caching sync writes) much faster than the HDD mirror could flush to disk. Regardless, the initial results are below. Note the max IOPS of 84.6k.

root@truenas-datto[~]# smartctl -a /dev/nvme0
=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPED1K375GA
Serial Number:                      PHKS750500G2375AGN

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=30 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=105MiB/s][w=26.9k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5733: Mon Jan 30 12:05:04 2023
  write: IOPS=38.9k, BW=152MiB/s (159MB/s)(4560MiB/30001msec); 0 zone resets
    slat (nsec): min=1138, max=14924k, avg=4410.98, stdev=17210.79
    clat (usec): min=15, max=38034, avg=389.12, stdev=465.14
     lat (usec): min=52, max=38036, avg=393.54, stdev=464.47
    clat percentiles (usec):
     |  1.00th=[   81],  5.00th=[  110], 10.00th=[  129], 20.00th=[  149],
     | 30.00th=[  161], 40.00th=[  172], 50.00th=[  180], 60.00th=[  190],
     | 70.00th=[  210], 80.00th=[  465], 90.00th=[ 1221], 95.00th=[ 1287],
     | 99.00th=[ 1467], 99.50th=[ 1549], 99.90th=[ 2343], 99.95th=[ 2409],
     | 99.99th=[ 2737]
   bw (  KiB/s): min=61716, max=338557, per=100.00%, avg=156442.86, stdev=111950.13, samples=59
   iops        : min=15429, max=84639, avg=39110.37, stdev=27987.58, samples=59
  lat (usec)   : 20=0.01%, 50=0.02%, 100=3.42%, 250=72.03%, 500=4.63%
  lat (usec)   : 750=0.26%, 1000=0.10%
  lat (msec)   : 2=19.20%, 4=0.35%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=8.83%, sys=17.06%, ctx=377621, majf=0, minf=1
  IO depths    : 1=0.2%, 2=2.4%, 4=15.2%, 8=66.6%, 16=15.7%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.0%, 8=2.6%, 16=4.4%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1167247,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
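
As a sanity check, fio’s bandwidth and IOPS figures are tied together by the 4k block size; a quick back-of-the-envelope conversion:

```shell
# bandwidth (MiB/s) = IOPS * block size / 1 MiB
awk 'BEGIN { iops = 38900; bs = 4096; printf "%.0f MiB/s\n", iops * bs / (1024 * 1024) }'
```

which matches the 152 MiB/s fio reported above.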

Now that we’re hitting the HDD mirror limits

Recall that the default txg flush interval is 5 seconds. I used a 10-second test, thinking that the SLOG would cache writes for the first 5 seconds, and for the next 5 seconds the disks would be constantly flushing that cache to disk while the SLOG was still ingesting new writes.
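
On TrueNAS CORE (FreeBSD-based), that flush interval is exposed as a sysctl. A sketch, assuming the stock OpenZFS tunable name:

```shell
# Read the current txg flush interval (defaults to 5 seconds):
sysctl vfs.zfs.txg.timeout
# Temporarily change it (does not persist across reboots):
sysctl vfs.zfs.txg.timeout=10
```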

HGST SS300 – 10 second test

IOPS = 47.7k (max = 49.0k), median latency = 306 us, std dev = 125 us

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=10 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=172MiB/s][w=44.0k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5798: Mon Jan 30 12:06:40 2023
  write: IOPS=47.7k, BW=186MiB/s (195MB/s)(1864MiB/10001msec); 0 zone resets
    slat (nsec): min=1069, max=2552.4k, avg=3405.36, stdev=7482.06
    clat (usec): min=85, max=18461, avg=314.21, stdev=124.91
     lat (usec): min=156, max=18465, avg=317.61, stdev=125.01
    clat percentiles (usec):
     |  1.00th=[  225],  5.00th=[  253], 10.00th=[  273], 20.00th=[  285],
     | 30.00th=[  289], 40.00th=[  297], 50.00th=[  306], 60.00th=[  314],
     | 70.00th=[  326], 80.00th=[  343], 90.00th=[  363], 95.00th=[  392],
     | 99.00th=[  486], 99.50th=[  510], 99.90th=[  586], 99.95th=[  635],
     | 99.99th=[ 1860]
   bw (  KiB/s): min=167409, max=196215, per=100.00%, avg=191025.95, stdev=6820.67, samples=19
   iops        : min=41852, max=49053, avg=47756.05, stdev=1705.08, samples=19
  lat (usec)   : 100=0.01%, 250=4.23%, 500=95.12%, 750=0.62%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=7.83%, sys=19.74%, ctx=125318, majf=0, minf=1
  IO depths    : 1=0.2%, 2=4.0%, 4=22.3%, 8=60.6%, 16=12.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.1%, 8=0.5%, 16=5.4%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,477196,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Intel P1600X – 10 second test

IOPS = 49.0k (max = 56.0k), median latency = 306 us, std dev = 151 us

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=10 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=192MiB/s][w=49.1k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5846: Mon Jan 30 12:07:30 2023
  write: IOPS=49.0k, BW=191MiB/s (201MB/s)(1913MiB/10001msec); 0 zone resets
    slat (nsec): min=1149, max=2444.8k, avg=3780.45, stdev=8423.91
    clat (usec): min=17, max=19475, avg=304.18, stdev=151.42
     lat (usec): min=92, max=19478, avg=307.96, stdev=151.36
    clat percentiles (usec):
     |  1.00th=[  159],  5.00th=[  196], 10.00th=[  210], 20.00th=[  247],
     | 30.00th=[  277], 40.00th=[  293], 50.00th=[  306], 60.00th=[  318],
     | 70.00th=[  330], 80.00th=[  343], 90.00th=[  375], 95.00th=[  420],
     | 99.00th=[  519], 99.50th=[  537], 99.90th=[  594], 99.95th=[  635],
     | 99.99th=[ 1205]
   bw (  KiB/s): min=181572, max=224247, per=100.00%, avg=196069.00, stdev=9980.14, samples=19
   iops        : min=45393, max=56061, avg=49016.74, stdev=2494.90, samples=19
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.05%, 250=20.89%, 500=77.45%
  lat (usec)   : 750=1.59%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 20=0.01%
  cpu          : usr=9.61%, sys=21.87%, ctx=146140, majf=0, minf=1
  IO depths    : 1=0.2%, 2=3.5%, 4=18.5%, 8=63.4%, 16=14.4%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.6%, 8=1.6%, 16=4.8%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,489805,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Intel P4800X – 8 second test

IOPS = 80.7k (max = 86.3k, so we might still be hitting HDD mirror limits), median latency = 172 us, std dev = 86 us. These numbers are fantastic, especially for a 4k block size. The throughput is 331MB/s. You will have a hard time getting your hands on a faster device unless you have $xx,xxx to spend.

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=16 --runtime=8 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=277MiB/s][w=70.9k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5906: Mon Jan 30 12:09:02 2023
  write: IOPS=80.7k, BW=315MiB/s (331MB/s)(2522MiB/8001msec); 0 zone resets
    slat (nsec): min=1264, max=13894k, avg=4853.54, stdev=20889.14
    clat (usec): min=14, max=14173, avg=174.95, stdev=84.46
     lat (usec): min=55, max=14176, avg=179.81, stdev=86.27
    clat percentiles (usec):
     |  1.00th=[   84],  5.00th=[  116], 10.00th=[  135], 20.00th=[  149],
     | 30.00th=[  159], 40.00th=[  165], 50.00th=[  172], 60.00th=[  178],
     | 70.00th=[  184], 80.00th=[  192], 90.00th=[  210], 95.00th=[  233],
     | 99.00th=[  367], 99.50th=[  441], 99.90th=[  644], 99.95th=[  734],
     | 99.99th=[ 1037]
   bw (  KiB/s): min=275345, max=345029, per=100.00%, avg=328385.13, stdev=16226.53, samples=15
   iops        : min=68836, max=86259, avg=82095.93, stdev=4056.60, samples=15
  lat (usec)   : 20=0.01%, 50=0.02%, 100=2.65%, 250=93.56%, 500=3.47%
  lat (usec)   : 750=0.25%, 1000=0.03%
  lat (msec)   : 2=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=15.90%, sys=33.75%, ctx=232026, majf=0, minf=1
  IO depths    : 1=0.2%, 2=2.9%, 4=18.1%, 8=64.7%, 16=14.1%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.3%, 8=1.8%, 16=4.9%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,645590,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Queue Depth = 1 tests

Where the Optane drives really shine is low-queue-depth operations. I reran the tests at QD=1 (a 5-second test to simulate a burst of writes) for the P4800X, the P1600X, and the SS300.

P4800X queue depth = 1

IOPS = 20.2k, median latency = 45 us, std dev = 129 us.

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=5 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=73.7MiB/s][w=18.9k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5920: Mon Jan 30 12:09:42 2023
  write: IOPS=20.2k, BW=78.8MiB/s (82.6MB/s)(394MiB/5001msec); 0 zone resets
    slat (nsec): min=1413, max=266338, avg=1671.21, stdev=1557.28
    clat (usec): min=16, max=40903, avg=47.37, stdev=128.99
     lat (usec): min=43, max=40905, avg=49.04, stdev=129.01
    clat percentiles (usec):
     |  1.00th=[   43],  5.00th=[   44], 10.00th=[   44], 20.00th=[   44],
     | 30.00th=[   44], 40.00th=[   44], 50.00th=[   45], 60.00th=[   45],
     | 70.00th=[   46], 80.00th=[   47], 90.00th=[   50], 95.00th=[   68],
     | 99.00th=[   85], 99.50th=[   92], 99.90th=[  115], 99.95th=[  133],
     | 99.99th=[  245]
   bw (  KiB/s): min=68448, max=85351, per=100.00%, avg=80968.44, stdev=6282.37, samples=9
   iops        : min=17112, max=21337, avg=20241.78, stdev=1570.39, samples=9
  lat (usec)   : 20=0.01%, 50=89.91%, 100=9.81%, 250=0.27%, 500=0.01%
  lat (msec)   : 50=0.01%
  cpu          : usr=4.02%, sys=7.56%, ctx=101209, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,100846,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
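
A quick sanity check on these QD=1 numbers: with only one I/O in flight, IOPS and mean latency are reciprocals of each other (20.2k is the P4800X figure above):

```shell
# At queue depth 1: mean latency (us) ~= 1,000,000 / IOPS
awk 'BEGIN { printf "%.1f us\n", 1e6 / 20200 }'
```

which lines up with the ~49 us average latency fio reported.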

P1600X queue depth = 1

IOPS = 13.7k, median latency = 63 us, std dev = 134 us

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=5 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=55.0MiB/s][w=14.1k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=5969: Mon Jan 30 12:10:43 2023
  write: IOPS=13.7k, BW=53.5MiB/s (56.1MB/s)(268MiB/5001msec); 0 zone resets
    slat (nsec): min=1344, max=82121, avg=1741.86, stdev=1171.68
    clat (usec): min=45, max=34906, avg=70.64, stdev=134.05
     lat (usec): min=61, max=34908, avg=72.38, stdev=134.11
    clat percentiles (usec):
     |  1.00th=[   62],  5.00th=[   62], 10.00th=[   62], 20.00th=[   63],
     | 30.00th=[   63], 40.00th=[   63], 50.00th=[   63], 60.00th=[   64],
     | 70.00th=[   64], 80.00th=[   71], 90.00th=[   95], 95.00th=[  117],
     | 99.00th=[  123], 99.50th=[  125], 99.90th=[  135], 99.95th=[  147],
     | 99.99th=[  176]
   bw (  KiB/s): min=55067, max=56348, per=100.00%, avg=55753.33, stdev=498.73, samples=9
   iops        : min=13766, max=14087, avg=13937.89, stdev=124.76, samples=9
  lat (usec)   : 50=0.01%, 100=92.75%, 250=7.25%
  lat (msec)   : 50=0.01%
  cpu          : usr=2.88%, sys=5.60%, ctx=68708, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,68540,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

SS300 queue depth = 1

Back to reality with a very high-performing SAS-3 SSD.

IOPS = 6.4k, median latency = 151 us, std dev = 93 us

The P4800X is 3x faster than this very, very fast SSD; the P1600X is 2x faster.
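
To make those multipliers precise, from the QD=1 IOPS figures above (P4800X 20.2k, P1600X 13.7k, SS300 6.4k):

```shell
# Ratio of Optane QD=1 IOPS to the SS300's QD=1 IOPS
awk 'BEGIN { printf "P4800X: %.1fx, P1600X: %.1fx\n", 20200 / 6400, 13700 / 6400 }'
```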

root@truenas-datto[/bench-pool]# rm -f random-write.0.0 && fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=5 --time_based
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=24.9MiB/s][w=6363 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=6023: Mon Jan 30 12:11:23 2023
  write: IOPS=6379, BW=24.9MiB/s (26.1MB/s)(125MiB/5001msec); 0 zone resets
    slat (nsec): min=1350, max=49617, avg=2134.61, stdev=1266.22
    clat (usec): min=125, max=16182, avg=153.86, stdev=93.01
     lat (usec): min=129, max=16183, avg=155.99, stdev=93.23
    clat percentiles (usec):
     |  1.00th=[  131],  5.00th=[  133], 10.00th=[  133], 20.00th=[  133],
     | 30.00th=[  135], 40.00th=[  135], 50.00th=[  151], 60.00th=[  155],
     | 70.00th=[  159], 80.00th=[  169], 90.00th=[  200], 95.00th=[  202],
     | 99.00th=[  206], 99.50th=[  208], 99.90th=[  235], 99.95th=[  255],
     | 99.99th=[  537]
   bw (  KiB/s): min=25434, max=25887, per=100.00%, avg=25632.00, stdev=152.43, samples=9
   iops        : min= 6358, max= 6471, avg=6407.56, stdev=38.13, samples=9
  lat (usec)   : 250=99.94%, 500=0.05%, 750=0.01%, 1000=0.01%
  lat (msec)   : 20=0.01%
  cpu          : usr=1.72%, sys=3.26%, ctx=31941, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,31903,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Last test, using a P1600X as a ZFS special allocation device (special vdev)

The Intel engineer I was working with requested I demonstrate the potential of a P1600X as a ZFS special allocation device (special device/special vdev); they did not have a specific test in mind. A special device stores metadata about the files on the pool, and can additionally store small blocks (as set with the special_small_blocks property of the dataset). If you set special_small_blocks to 16k, for example, any file smaller than 16k is written directly to the special device instead of the main pool disks. As you might imagine, this can really speed up transfers of small files. Note: since the special device stores all metadata about data on the pool, and potentially data itself, it is critical that it be at least a mirror (some use 3-way mirrors). If you lose the special vdev, you lose the entire pool! Since I am just benchmarking, I do not have a mirrored special vdev; I can remove it whenever I want because the underlying pool is composed of mirrors.

The test procedure is as follows:

  1. Reboot machine to clear ARC
  2. Run a command that reads metadata for every file on a dataset. The specific command prints out a distribution of file sizes, which I then used to calculate what I should use for special_small_blocks.
  3. Note the time it took to run said command
  4. Wipe pool
  5. Add special device to pool
  6. Transfer all data back to pool
  7. Rerun command and note time difference.

The “wipe pool” and “transfer all data back to pool” steps make this test something you don’t want to repeat (as with many other ZFS settings, such as compression type or recordsize, special device metadata is only written on new writes; it will not backfill).

Command used for the special_small_blocks calculation

I found this on level1techs.com:

find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
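
For readability, here is the same pipeline reformatted (identical binning logic; the final human-units awk stage is dropped, so bin sizes print in raw bytes):

```shell
# Bin every file in the current directory tree by power-of-two size.
find . -type f -print0 | xargs -0 ls -l | awk '
  {
    n = int(log($5) / log(2))   # $5 = file size in bytes; n = floor(log2(size))
    if (n < 10) n = 10          # clamp everything under 1 KiB into the 1k bin
    size[n]++                   # count files per bin
  }
  END {
    for (i in size) printf("%d %d\n", 2^i, size[i])
  }' | sort -n
```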

The one-liner prints a list of sizes and how many entities fall into each size bin. I put ‘time’ in front to see how long it takes to run, and ran it once per dataset. This is my media dataset (movies, music, TV shows, etc.):

root@truenas-datto[/mnt/test-pool/share-media]# time find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:  14336
  2k:    759
  4k:   1194
  8k:   4188
 16k:   5609
 32k:   3498
 64k:   2882
128k:    552
256k:    194
512k:    103
  1M:     58
  2M:      6
  4M:      2
  8M:     14
 16M:     98
 32M:    253
 64M:    793
128M:   1130
256M:    529
512M:    322
  1G:    185
  2G:     60
  4G:      9
  8G:      3
 16G:      1
find . -type f -print0  0.15s user 2.10s system 6% cpu 37.210 total
xargs -0 ls -l  1.23s user 0.24s system 3% cpu 37.213 total
awk   0.07s user 0.02s system 0% cpu 37.213 total
sort -n  0.00s user 0.00s system 0% cpu 37.213 total
awk   0.00s user 0.00s system 0% cpu 37.212 total

37.2 seconds total to run.

And on my ‘data’ dataset (backups, general file storage, etc.):

root@truenas-datto[/mnt/test-pool/share-data]# time find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k: 106714
  2k:  23595
  4k:  20593
  8k:  25994
 16k:  20264
 32k:  28506
 64k:  28759
128k:  26909
256k:  20027
512k:  19708
  1M:  24780
  2M:  34587
  4M:  20427
  8M:  15888
 16M:   2388
 32M:    983
 64M:    558
128M:    272
256M:    260
512M:    179
  1G:     93
  2G:     69
  4G:     36
  8G:     17
 16G:      5
 32G:      1
128G:      4
256G:      1
find . -type f -print0  0.36s user 5.46s system 3% cpu 2:43.07 total
xargs -0 ls -l  10.46s user 2.55s system 7% cpu 2:43.13 total
awk   0.78s user 0.02s system 0% cpu 2:43.13 total
sort -n  0.00s user 0.00s system 0% cpu 2:43.13 total
awk   0.00s user 0.00s system 0% cpu 2:43.13 total

2m43s. We have a baseline.

I also threw together a spreadsheet to do a cumulative sum on how much space is used by each bin. I decided to go with 16k, which should only use 0.87 GB of the 58 GB Optane P1600X for small files.
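
The spreadsheet math can also be sketched as a one-off awk script. The counts below are the ≤16k bins from the two listings above (media and data combined); the estimate treats every file as occupying its full bin size, so it is an upper bound:

```shell
# Upper-bound estimate of special vdev space consumed by small files:
# sum over bins <= special_small_blocks of (file count * bin size).
awk 'BEGIN {
  split("1024 2048 4096 8192 16384", size)            # bin sizes in bytes (1k..16k)
  split("121050 24354 21787 30182 25873", count)      # media + data file counts per bin
  for (i = 1; i <= 5; i++) total += count[i] * size[i]
  printf "%.2f GiB\n", total / (1024 ^ 3)
}'
```

which lines up with the 0.87 GB figure from the spreadsheet.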

ZFS Special device size calculation spreadsheet table. Link to .xlsx file below.

Rebuilding the pool with a special allocation device

I use syncoid exclusively to synchronize ZFS pools across machines; sanoid generates the snapshots. Both are great tools written by Jim Salter. I transferred with the receive option special_small_blocks=16k so the property was set immediately (syncoid won’t write into an empty, already-created dataset). It took a while to transfer 3.7TB, even over a 10G link.

# add 2x more WD Reds, wipe pools, recreate pools, remove P1600X as SLOG, re-add as special vdev
root@truenas-datto[~]# zpool destroy bench-pool

zpool create -o ashift=12 test-pool mirror /dev/da0 /dev/da1 mirror /dev/da2 /dev/da3
zpool add -f test-pool special /dev/nvd1

root@truenas-datto[~]# smartctl -a /dev/nvme1
=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEK1A058GA
Serial Number:                      PHOC209200KC058A

root@truenas-datto[~]# zpool status test-pool
  pool: test-pool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        test-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
        special
          nvd1      ONLINE       0     0     0

errors: No known data errors

syncoid --no-sync-snap --recvoptions="o special_small_blocks=16k" [email protected]:big/share-media test-pool/share-media
root@truenas-datto[~]# zfs get special_small_blocks test-pool/share-media
NAME                   PROPERTY              VALUE                 SOURCE
test-pool/share-media  special_small_blocks  16K                   local

# NOTE: this zpool list was captured approximately 1 minute into the data transfer, hence the nvd1 ALLOC of only 24.7M
root@truenas-datto[~]# zpool list -v test-pool
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test-pool     7.30T  15.1G  7.29T        -         -     0%     0%  1.00x    ONLINE  /mnt
  mirror-0    3.62T  7.22G  3.62T        -         -     0%  0.19%      -    ONLINE
    da0       3.64T      -      -        -         -      -      -      -    ONLINE
    da1       3.64T      -      -        -         -      -      -      -    ONLINE
  mirror-1    3.62T  7.87G  3.62T        -         -     0%  0.21%      -    ONLINE
    da2       3.64T      -      -        -         -      -      -      -    ONLINE
    da3       3.64T      -      -        -         -      -      -      -    ONLINE
special           -      -      -        -         -      -      -      -  -
  nvd1        54.9G  24.7M  54.5G        -         -     0%  0.04%      -    ONLINE

Special device metadata reading results

The same command as before for the ‘media’ dataset:

root@truenas-datto[/mnt/test-pool/share-media]# time find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k:  14336
  2k:    759
  4k:   1194
 ...
  4G:      9
  8G:      3
 16G:      1
find . -type f -print0  0.14s user 1.53s system 51% cpu 3.254 total
xargs -0 ls -l  1.22s user 0.23s system 43% cpu 3.370 total
awk   0.08s user 0.00s system 2% cpu 3.370 total
sort -n  0.00s user 0.00s system 0% cpu 3.370 total
awk   0.00s user 0.00s system 0% cpu 3.369 total

And for the ‘data’ dataset:

root@truenas-datto[/mnt/test-pool/share-media]# cd ../share-data
root@truenas-datto[/mnt/test-pool/share-data]# time find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
  1k: 106714
  2k:  23595
  4k:  20593
 ...
 32G:      1
128G:      4
256G:      1
find . -type f -print0  0.24s user 3.80s system 24% cpu 16.282 total
xargs -0 ls -l  10.29s user 2.39s system 77% cpu 16.345 total
awk   0.78s user 0.02s system 4% cpu 16.345 total
sort -n  0.00s user 0.00s system 0% cpu 16.345 total
awk   0.00s user 0.00s system 0% cpu 16.345 total

To summarize the special device metadata access times (table below, all times in seconds): the special vdev reduced the time required to traverse all file metadata by roughly 90%. That said, I saw a similar reduction when running the command multiple times in a row without a special vdev, because ZFS’s ARC caches the metadata. It would be trivial to write a cron task that iterates over the files on a regular basis to keep that metadata warm in ARC.

                       share-media   share-data
no special vdev (s)          37.2       163.07
with special vdev (s)        3.25        16.28
delta (s)                  -33.95      -146.79
delta %                    -91.3%       -90.0%
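
The cron idea could be as simple as a nightly metadata walk (hypothetical path and schedule; any command that stats every file works):

```shell
# Hypothetical root crontab entry: stat every file nightly at 03:00 so the
# metadata stays warm in ARC; output is discarded.
0 3 * * * find /mnt/test-pool -type f -exec ls -ld {} + > /dev/null 2>&1
```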

Special device size utilization

As far as I can tell, there isn’t really a good way to calculate in advance how much space the special device will use for metadata. In my case, 3.74TB of data with special_small_blocks=16k resulted in 11.9GB being used on the special device:

root@truenas-datto[/mnt/test-pool/share-data]# zpool list -v test-pool
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test-pool     7.30T  3.74T  3.57T        -         -     0%    51%  1.00x    ONLINE  /mnt
  mirror-0    3.62T  1.78T  1.84T        -         -     0%  49.1%      -    ONLINE
    da0       3.64T      -      -        -         -      -      -      -    ONLINE
    da1       3.64T      -      -        -         -      -      -      -    ONLINE
  mirror-1    3.62T  1.94T  1.68T        -         -     0%  53.6%      -    ONLINE
    da2       3.64T      -      -        -         -      -      -      -    ONLINE
    da3       3.64T      -      -        -         -      -      -      -    ONLINE
special           -      -      -        -         -      -      -      -  -
  nvd1        54.9G  11.9G  42.6G        -         -     1%  21.8%      -    ONLINE
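
As a rough rule of thumb from this one pool (an observation, not a general guarantee), special vdev usage came out to about 0.3% of the data stored:

```shell
# 11.9 GiB used on the special vdev for 3.74 TiB of pool data:
awk 'BEGIN { printf "%.2f%%\n", 11.9 / (3.74 * 1024) * 100 }'
```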

Conclusion

If you’ve made it this far, congrats. I didn’t intend to write this many words on ZFS benchmarks, but here we are. Hopefully you found the data interesting and learned something from it. At a minimum, you should now be able to estimate how much space you need if you want to utilize a special vdev. Special thanks to Intel for sending me Optane samples to play around with and benchmark.

Here’s a table with a summary of the results:

ZFS SLOG/ZIL benchmarks for select devices

Please let me know in the comments of this post if you’d like me to re-run any benchmarks, or have any questions/comments/concerns about the process!