I intend to use this site to document my journey down the path of nerdiness (past, present, and future). I’ve been learning over the years from various sites like what I hope this one becomes, and I want to give back. I have a wide variety of topics I’d like to cover. At a minimum, posting about my activities will help me document what I learned so I can refer back to it in the future. I’ll also post about projects we do ourselves around the house instead of hiring professionals, saving big $$$$ in the process. Hope you enjoy the journey with me!
Below are some topics I plan on covering (I’ve already done something with every one of these and plan on documenting it):
RTL-SDRs (receiving signals from your electric meter, ADS-B, general radio stuff)
Virtual machines and my homelab setup
Home automation / smart home (Home Assistant, Tasmota, Philips Hue bulbs, automating various tasks throughout the house)
My mini solar setup (2x300W panels) and not-so-mini battery backup (8x272Ah LiFePO4 batteries – should yield 7ish kWh of storage)
Remote control aircraft running Arduplane with video downlink and two-way telemetry
General computer stuff (building them, what I use mine for, Hyper-V)
Home network (Ubiquiti setup, VLANs, wiring the house with CAT6, IP security cameras on Blue Iris)
Formation of my LLC if anyone wants to hear about that
Math delta (left), EQ delta (center), and combined delta (right) across all 667 (i,j) layer duplication configs. The three-phase encode/reason/decode anatomy is clearly visible at 4B scale.
I’ve been messing around with local LLMs on my 3090 for a while now — I have a growing collection of Qwen models on D:\LLM that I probably should be embarrassed about. A few weeks ago I stumbled across David Noel Ng’s LLM Neuroanatomy blog posts, where he showed that you can take a pretrained transformer and literally just re-run some of its middle layers a second time at inference, no retraining needed, and get meaningfully better outputs.
The D:\LLM folder. I should probably be embarrassed about this.
The idea is wild: the model’s weights don’t change. You just tell it “hey, run layers 15 through 21 again” and the hidden state gets another pass through those same weights. Ng showed this working on Qwen3.5-27B (a 64-layer model) with up to +15.6% improvement on combined math and emotional reasoning benchmarks.
Naturally, I wanted to know if this works on smaller models too. Welcome to Austin’s Nerdy Things, where we perform brain surgery on 4-billion-parameter language models to make them think twice.
Background: What Is RYS?
Ng’s technique is called RYS and the core concept is surprisingly simple. A normal transformer forward pass goes:
Process input through layers 0, 1, 2, …, N-1 sequentially
Done
With RYS, you pick a contiguous block in the middle — layers i through j-1 — and after the model finishes layer j-1, you jump back to layer i and run from there to the end. Those middle layers run twice on the evolving hidden state.
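The whole trick fits in a few lines. Here is a minimal sketch of that modified forward pass, with plain Python callables standing in for transformer layers (this is my own illustration, not Ng's actual implementation):

```python
def rys_forward(layers, hidden, i, j):
    """Forward pass with the contiguous block [i, j) executed twice.

    `layers` is a list of callables standing in for transformer layers;
    the real thing operates on hidden-state tensors.
    """
    order = list(range(len(layers)))   # normal pass: 0 .. N-1
    order[j:j] = range(i, j)           # splice in i .. j-1 a second time
    for idx in order:
        hidden = layers[idx](hidden)
    return hidden
```

With (i, j) = (21, 22), the splice adds a single entry to the execution order, which is why the overhead can be so small.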
The reason this can work is that transformer layers aren’t all doing the same thing. Ng’s work showed models have a recognizable three-phase anatomy:
Early layers (~0-15% depth): Encoding. Converting tokens into contextualized representations. Repeating these produces garbage — the model tries to re-encode already-encoded stuff.
Middle layers (~20-60% depth): Reasoning. The actual thinking. Repeating these is like giving the model extra time to work through the problem.
Late layers (~70-100% depth): Decoding. Converting internal representations back into token predictions. Repeating these also produces garbage.
Ng found the sweet spot consistently in the middle, and his RYS repo provides all the tooling to test this — layer duplication wrappers, benchmark probe sets, the whole thing.
But his experiments were on a 27B model with 64 layers. I wanted to know: does this three-phase anatomy even exist at 4B scale? Can you exploit it on consumer hardware?
The Setup
Model: Qwen3-4B. I picked this one specifically because it’s a pure dense transformer (36 layers, 2560 hidden dim, GQA with 32 Q / 8 KV heads, RoPE, BF16). The Qwen3.5-2B has hybrid linear/full attention which would complicate things, and Qwen3-4B is in the same model family as Ng’s 27B target, which makes cross-scale comparison cleaner.
Hardware: My trusty RTX 3090 (24 GB VRAM). The model takes about 8.1 GB at baseline, which leaves plenty of room for the KV cache overhead from layer duplication.
Benchmarks: I used Ng’s probe sets from the RYS repo:
Math-16: 16 hard math questions (square roots, cube roots, big multiplications) requiring single-integer answers. Scored with digit-level partial credit. No chain-of-thought allowed. Greedy decoding, 64 max new tokens.
EQ-16: 16 EQ-Bench scenarios — complex social dialogues where the model predicts 4 emotion intensities on a 0-10 scale. Max 256 new tokens.
I used /no_think to disable Qwen3’s thinking mode so we’re measuring raw single-pass capability, and greedy decoding (do_sample=False), which I verified is perfectly deterministic across 5 runs on the same input. No need for multi-run variance testing.
The sweep: All 667 valid (i, j) configurations for a 36-layer model, including baseline (0, 0). Every config runs all 32 probe questions. I added early stopping that triggers if the first 2 math probes both produce garbage (saves about 30% of wall time on broken configs). The scanner saves results to JSON after every single config — resume-friendly for when Windows decides it’s update time.
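That 667 checks out for a 36-layer model: 36 × 37 / 2 = 666 contiguous (i, j) blocks plus the baseline. A quick sketch of the enumeration (my own illustration, not the actual scanner script):

```python
def sweep_configs(n_layers=36):
    """All contiguous (i, j) duplication blocks plus the (0, 0) baseline."""
    configs = [(0, 0)]  # baseline: no duplication
    for i in range(n_layers):
        for j in range(i + 1, n_layers + 1):
            configs.append((i, j))
    return configs

print(len(sweep_configs()))  # 667
```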
Total sweep time: about 9 hours on a single 3090. Claude helped me write the scanner script (with me providing the architecture decisions and Ng’s RYS library doing the heavy lifting on layer manipulation).
Baseline Scores
Before messing with anything, Qwen3-4B scores:
| Probe | Score |
|---|---|
| Math-16 | 0.305 |
| EQ-16 | 0.749 |
| Combined | 1.054 |
The math score looks low, but these are genuinely hard problems (like “what is the cube root of 1019330085047 times 31?”) and the scorer gives partial credit for getting digits right. The EQ score is actually solid — Qwen3-4B is pretty decent at predicting emotional dynamics even without chain-of-thought.
The Heatmaps
Here’s where it gets fun. I swept all 667 configs and plotted the results as heatmaps. Each cell is one (i, j) configuration. Red means improvement over baseline. Blue means degradation. The x-axis is j (where the repeated block ends) and the y-axis is i (where it starts).
Left: math delta. Center: EQ delta. Right: combined delta. Red = improvement, blue = degradation. 667 configs, 36 layers.
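Each heatmap cell is just that config's score minus baseline. A sketch of how such a grid can be built for plotting, assuming results keyed by (i, j) tuples (illustrative names, not my actual plotting script):

```python
import numpy as np

def delta_grid(results, n_layers=36):
    """Turn an {(i, j): combined score} dict into an i-by-j delta grid.

    Cells with no valid config stay NaN, so they render blank.
    """
    baseline = results[(0, 0)]
    grid = np.full((n_layers, n_layers + 1), np.nan)
    for (i, j), score in results.items():
        if j > i:  # skip the (0, 0) baseline sentinel
            grid[i, j] = score - baseline
    return grid  # hand to plt.imshow(grid, cmap="coolwarm") from here
```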
Three things jumped out immediately.
1. The three-phase anatomy is clearly present at 4B scale
The top-left corner (early layers duplicated with wide spans) is deep blue — that’s the encoding zone. The bottom-right corner (late layers) is also blue — that’s the decoding zone. The productive region runs diagonally through the middle. This is exactly the encode / reason / decode structure Ng found at 27B.
Layers 0-6 are the encoding wall. Repeat anything starting before layer 5 with a wide span and the model outputs garbage. Layers 30+ are decoding territory — also garbage if you repeat there. The productive zone lives between layers ~5 and ~27, spanning roughly 60% of the model.
2. Math and EQ have different hot zones
This was something I wasn’t expecting. The math heatmap shows gains across a broad band from mid-stack to upper layers. The EQ heatmap’s gains concentrate in a tighter region around layers 7-16. The combined heatmap shows three distinct hot zones:
Zone A (layers 7-15, ~19-42% depth): Strong EQ gains, moderate math
Zone B (layers 15-20, ~42-56% depth): Balanced improvement on both
Zone C (layers 21-27, ~58-75% depth): Strong math gains, EQ roughly neutral
So the model’s “emotional processing” lives slightly earlier in the stack than its “mathematical processing.” That’s a cool finding — different kinds of reasoning occupy different layer ranges even in a small model.
3. The encoding wall is a cliff, not a slope
The transition from “productive duplication” to “catastrophic failure” happens over 1-2 layers. Layer 5 duplication helps. Layer 3 duplication tanks the model. There’s basically no gradient — it’s a cliff edge. Ng observed something similar at 27B but it’s even more pronounced at 4B scale.
The Pareto Frontier
Not all improvements are worth the extra latency. Each extra layer traversal costs time. The practical question is: how much bang per buck?
X-axis: overhead (%). Y-axis: combined score. The curve is sharply concave — almost all the benefit comes from the first 1-2 extra layers.
| Size | Config (i,j) | Extra layers | Overhead | Combined | Improvement |
|---|---|---|---|---|---|
| XS | (2,3) | 1 | 2.8% | 1.090 | +3.4% |
| S | (5,6) | 1 | 2.8% | 1.154 | +9.6% |
| M | (21,22) | 1 | 2.8% | 1.179 | +11.9% |
| L | (21,23) | 2 | 5.6% | 1.192 | +13.2% |
| XL | (14,36) | 22 | 61.1% | 1.202 | +14.0% |
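The overhead column follows directly from the layer counts: each extra layer traversal costs 1/36 of a baseline forward pass.

```python
def overhead(extra_layers, n_layers=36):
    """Latency overhead of repeated layers as a fraction of one forward pass."""
    return extra_layers / n_layers

print(f"{overhead(1):.1%}, {overhead(2):.1%}, {overhead(22):.1%}")  # 2.8%, 5.6%, 61.1%
```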
The winner is (21,22): just repeat layer 21 once. That’s +11.9% combined improvement at 2.8% latency overhead. One single extra layer forward pass. That’s it.
Going from 1 extra layer to 2 buys another 1.3 percentage points. Going from 2 extra layers to 22 — literally 10x the overhead — buys only 0.8 more. The returns collapse fast. Look at that Pareto chart — the curve is basically flat after the first couple of points.
Single-Layer Repeats: The 4B Surprise
This is where things get really interesting, and where the results diverge most from Ng’s 27B findings.
Ng reported that single-layer repeats at 27B “almost never help.” You need to duplicate a contiguous block of at least 2-3 layers to see meaningful improvement at that scale.
At 4B? 14 out of 35 single-layer repeats beat baseline. Here are the top performers:
| Layer | Config | Combined delta |
|---|---|---|
| 21 | (21,22) | +0.126 |
| 5 | (5,6) | +0.101 |
| 24 | (24,25) | +0.100 |
| 26 | (26,27) | +0.073 |
| 19 | (19,20) | +0.071 |
| 22 | (22,23) | +0.063 |
| 20 | (20,21) | +0.057 |
| 17 | (17,18) | +0.049 |
Layers 5 and 17-26 — nearly the entire mid-to-late stack — all produce meaningful gains when repeated individually. That’s a wide, diffuse productive zone spanning about 60% of the model.
My interpretation: smaller models have less specialized layers. At 27B with 64 layers, each layer does something specific enough that repeating just one doesn’t help much — you need a coherent block. At 4B with 36 layers, individual layers carry more general-purpose reasoning capacity. A single extra pass through one of them is already enough to bump quality.
This is arguably the most practically useful finding from the whole experiment. For small models, even the simplest possible intervention works.
How This Compares to Ng’s 27B Results
| Property | Qwen3-4B (36 layers) | Qwen3.5-27B (64 layers) |
|---|---|---|
| Three-phase anatomy | Yes, clearly visible | Yes |
| Encoding wall | Layers 0-6 (~0-17%) | ~first 15% |
| Best single-layer | (21,22) = +11.9% | Rarely productive |
| Best absolute | (14,36) = +14.0% at 61% overhead | ~+15.6% at ~15.6% overhead |
| Best efficiency | (21,22) = +11.9% at 2.8% overhead | Layer 33 = +1.5% |
| Productive single layers | 14/35 (40%) | Rare |
| Efficiency curve shape | Sharply concave | Roughly linear to ~10 layers |
The biggest difference is the shape of the efficiency curve. At 27B, adding more repeated layers gives roughly linear improvement up to about 10 extra layers — there’s a real reason to invest in multi-layer duplication. At 4B, the curve is sharply concave. Almost all the benefit comes from the very first extra layer. After that, you’re paying a lot of overhead for very little gain.
This makes intuitive sense. A bigger model has more specialized layers where repetition compounds — each one contributes something distinct. A smaller model gets most of its benefit from a single extra pass through its most general-purpose reasoning layer, and additional passes hit diminishing returns because those layers are doing similar work.
What This Means If You Want to Use It
If you’re deploying a small dense model and want better reasoning at minimal cost:
Find the model’s “layer 21.” Run a quick single-layer sweep on your target model. It takes minutes per config.
Repeat that one layer. At 2.8% latency overhead, this is basically free.
Don’t over-invest in multi-layer duplication at small scale. The second extra layer buys way less than the first.
For framework implementers: this is a ~10-line change to a model’s forward pass. No weight changes, no retraining, no meaningful VRAM increase. It should be a first-class inference option in llama.cpp, vLLM, ExLlama, etc.
Caveats
I want to be upfront about what this doesn’t prove:
One model, one family. These results are Qwen3-4B specific. The three-phase anatomy probably generalizes (Ng showed it on multiple architectures), but the exact layer numbers won’t. Every model needs its own sweep.
Small probe sets. 16 math + 16 EQ questions. Enough for relative ordering of configs, but the absolute scores have meaningful variance. Validate on larger benchmarks before deploying.
Greedy decoding only. Sampling might interact differently with layer duplication. I haven’t tested that.
No multi-block compositions. Ng’s beam search finds configs that repeat two different blocks (e.g., layers 30-34 AND 43-45). I only tested single contiguous blocks. The multi-block space at 4B is unexplored.
RoPE positions aren’t adjusted. The model sees the same position IDs on the repeated pass. This works empirically but the theoretical interaction is unclear.
Reproducing This
Everything runs on a single 3090 (or any 24GB+ GPU).
The scanner loads the model once, pre-tokenizes all probes, then iterates through configs. Each config wraps the base model with a layer-index remapping (no weight copies, just pointer rearrangement), runs all probes greedy, scores, and saves. Resume works by checking which config keys already exist in the results JSON.
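The remapping itself can be sketched in one line. This is my own illustration of the pointer-rearrangement idea, not Ng's RYS code; with HF transformers you'd wrap the result back into an nn.ModuleList:

```python
def remap_layers(layers, i, j):
    """Duplicate block [i, j) by reference.

    The repeated entries point at the same layer objects, so no weights
    are copied -- only the execution order changes.
    """
    return layers[:j] + layers[i:j] + layers[j:]
```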
What’s Next
A few obvious follow-ups I’m thinking about:
Multi-block beam search at 4B. Does combining layers 5-6 and 21-22 compound the gains?
Cross-scale comparison. Run the same sweep on Qwen3.5-2B (hybrid attention), Qwen3.5-9B, maybe a non-Qwen model. See how the efficiency curve changes with scale.
Train-time loop exposure. Train a small model where specific layers are looped during training, compare with inference-time-only duplication.
Integration with inference frameworks. llama.cpp, vLLM, and ExLlama already manage layer weights — adding a “repeat layer N” flag should be pretty straightforward.
The broader takeaway is that transformer layers aren’t interchangeable. They have structure, and that structure is legible even at small scale. You can exploit it at inference time with zero retraining, and the cost is basically nothing.
Layer 21 thinks twice. The model gets smarter. That’s the whole trick.
I’ve been building SkySpottr, an AR app that overlays aircraft information on your phone’s screen. It uses your device’s location, orientation, and incoming aircraft data (ADS-B) to predict where planes should appear on screen, then uses a YOLO model to lock onto the actual aircraft and refine the overlay. YOLOv8 worked great for this… until I actually read the license.
Welcome to Austin’s Nerdy Things, where we train entire neural networks from scratch to avoid talking to lawyers.
The Problem with Ultralytics
YOLOvWhatever is excellent. Fast, accurate, easy to use, great documentation. But Ultralytics licenses it under AGPL-3.0, which means if you use it in a product, you either need to open-source your entire application or pay for a commercial license. For a side project AR app that I might eventually monetize? That’s a hard pass.
Enter YOLOX from Megvii (recommended by either ChatGPT or Claude, can’t remember which, as an alternative). MIT licensed. Do whatever you want with it. The catch? You have to train your own models from scratch instead of using Ultralytics’ pretrained weights and easy fine-tuning pipeline. I have since learned there are some pretrained models. I didn’t use them.
So training from scratch is what I did. Over a few late nights in December 2025, I went from zero YOLOX experience to running custom-trained aircraft detection models in my iOS app. Here’s how it went.
The Setup
Hardware: RTX 3090 on my Windows machine, COCO2017 dataset on network storage (which turned out to be totally fine for training speed), and way too many terminal windows open.
I started with the official YOLOX repo and the aircraft class from COCO2017. The dataset has about 3,000 training images with airplanes, which is modest but enough to get started.
The first training run failed immediately because I forgot to install YOLOX as a package. Classic. Then it failed again because I was importing a class that didn’t exist in the version I had. Claude (who was helping me through this, and hallucinated said class) apologized and fixed the import. We got there eventually.
Training Configs: Nano, Tiny, Small, and “Nanoish”
YOLOX has a nice inheritance-based config system. You create a Python file, inherit from a base experiment class, and override what you want. I ended up with four different configs:
yolox_nano_aircraft.py – The smallest. 0.9M params, 1.6 GFLOPs. Runs on anything.
yolox_tiny_aircraft.py – Slightly bigger with larger input size for small object detection.
yolox_small_aircraft.py – 5M params, 26 GFLOPs. The “serious” model.
yolox_nanoish_aircraft.py – My attempt at something between nano and tiny.
The “nanoish” config was my own creation where I tried to find a sweet spot. I bumped the width multiplier from 0.25 to 0.33 and… immediately got a channel mismatch error because 0.33 doesn’t divide evenly into the architecture. Turns out you can’t just pick arbitrary numbers. I am a noob at these things. Lesson learned.
After some back-and-forth, I settled on a config with 0.3125 width (which is 0.25 × 1.25, mathematically clean) and 512×512 input. This gave me roughly 1.2M params – bigger than nano, smaller than tiny, and it actually worked.
Here’s the small model config – the one that ended up in production. The key decisions are width = 0.50 (2x wider than nano for better feature extraction), 640×640 input for small object detection, and full mosaic + mixup augmentation:
```python
class Exp(MyExp):
    def __init__(self):
        super(Exp, self).__init__()

        # Model config - YOLOX-Small architecture
        self.num_classes = 1  # Single class: airplane
        self.depth = 0.33
        self.width = 0.50  # 2x wider than nano for better feature extraction

        # Input/output config - larger input helps small object detection
        self.input_size = (640, 640)
        self.test_size = (640, 640)
        self.multiscale_range = 5  # Training will vary from 480-800

        # Data augmentation
        self.mosaic_prob = 1.0
        self.mosaic_scale = (0.1, 2.0)
        self.enable_mixup = True
        self.mixup_prob = 1.0
        self.flip_prob = 0.5
        self.hsv_prob = 1.0

        # Training config
        self.warmup_epochs = 5
        self.max_epoch = 400
        self.no_aug_epochs = 100
        self.basic_lr_per_img = 0.01 / 64.0
        self.scheduler = "yoloxwarmcos"

    def get_model(self):
        from yolox.models import YOLOX, YOLOPAFPN, YOLOXHead

        in_channels = [256, 512, 1024]
        # Small uses standard convolutions (no depthwise)
        backbone = YOLOPAFPN(self.depth, self.width, in_channels=in_channels, act=self.act)
        head = YOLOXHead(self.num_classes, self.width, in_channels=in_channels, act=self.act)
        self.model = YOLOX(backbone, head)
        return self.model
```
And the nanoish config for comparison – note the depthwise=True and the width of 0.3125 (5/16) that I landed on after the channel mismatch debacle:
```python
class Exp(MyExp):
    def __init__(self):
        super(Exp, self).__init__()
        self.num_classes = 1
        self.depth = 0.33
        self.width = 0.3125  # 5/16 - halfway between nano (0.25) and tiny (0.375)
        self.input_size = (512, 512)
        self.test_size = (512, 512)

        # Lighter augmentation than small - this model is meant to be fast
        self.mosaic_prob = 0.5
        self.mosaic_scale = (0.5, 1.5)
        self.enable_mixup = False

    def get_model(self):
        from yolox.models import YOLOX, YOLOPAFPN, YOLOXHead

        in_channels = [256, 512, 1024]
        backbone = YOLOPAFPN(self.depth, self.width, in_channels=in_channels,
                             act=self.act, depthwise=True)  # Depthwise = lighter
        head = YOLOXHead(self.num_classes, self.width, in_channels=in_channels,
                         act=self.act, depthwise=True)
        self.model = YOLOX(backbone, head)
        return self.model
```
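For reference, the training invocation for these configs looked roughly like this, using YOLOX's standard tools/train.py entry point (exact paths are a sketch, not copied from my terminal history):

```shell
# Fine-tune from YOLOX's pretrained COCO checkpoint on a single GPU.
# -f experiment config, -d GPU count, -b batch size, --fp16 mixed precision,
# -c checkpoint to start from.
python tools/train.py -f exps/yolox_small_aircraft.py -d 1 -b 16 --fp16 -c yolox_s.pth
```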
The -c yolox_s.pth loads YOLOX’s pretrained COCO weights as a starting point (transfer learning). The -d 1 is one GPU, -b 16 is batch size 16 (about 8GB VRAM on the 3090 with fp16), and --fp16 enables mixed precision training.
The Small Object Problem
Here’s the thing about aircraft detection for an AR app: planes at cruise altitude look tiny. A 747-8 at 37,000 feet is maybe 20-30 pixels on your phone screen if you’re lucky, even with the 4x optical zoom of the newest iPhones (8x for the 12MP weird zoom mode). Standard YOLO models are tuned for reasonable-sized objects, not specks in the sky. The COCO dataset has aircraft that are reasonably sized, like when you’re sitting at your gate at an airport and take a picture of the aircraft 100 ft in front of you.
My first results were underwhelming. The nano model was detecting larger aircraft okay but completely missing anything at altitude. The evaluation metrics looked like this:
AP for airplane = 0.234
AR for small objects = 0.089
Not great. The model was basically only catching aircraft on approach or takeoff.
For the small config, I made some changes to help with tiny objects:
Increased input resolution to 640×640 (more pixels = more detail for small objects)
Enabled full mosaic and mixup augmentation (helps the model see varied object scales)
Switched from depthwise to regular convolutions (more capacity)
(I’ll be honest, I was leaning heavily on Claude for the ML-specific tuning decisions here)
This pushed the model to 26 GFLOPs though, which had me worried about phone performance.
Here’s what the small model’s accuracy looked like broken down by object size. You can see AP for small objects climbing from ~0.45 to ~0.65 over training, while large objects hit ~0.70. Progress, but small objects remain the hardest category – which tracks with the whole “specks in the sky” problem.
Will This Actually Run on a Phone?
The whole point of this exercise was to run inference on an iPhone. So here is some napkin math:
| Model | GFLOPs | Estimated Phone Inference |
|---|---|---|
| Nano | 1.6 | ~15ms, smooth 30fps easy |
| Nanoish | 3.2 | ~25ms, still good |
| Small | 26 | ~80ms, might be sluggish |
| YOLOv8n (for reference) | 8.7 | ~27ms |
My app was already running YOLOv8n at 15fps with plenty of headroom. So theoretically even the small model should work, but nano/nanoish would leave more room for everything else the app needs to do.
The plan: train everything, compare accuracy, quantize for deployment, and see what actually works in practice.
Training Results (And a Rookie Mistake)
After letting things run overnight (300 epochs takes a while even on a 3090), here’s what I got:
The nanoish model at epoch 100 was already showing 94% detection rate on test images, beating the fully-trained nano model. And it wasn’t even done training yet.
Quick benchmark on 50 COCO test images with aircraft (RTX 3090 GPU inference – not identical to phone, but close enough for the smaller models to be representative):
| Model | Detection Rate | Avg Detections/Image | Avg Inference (ms) | FPS |
|---|---|---|---|---|
| YOLOv8n | 58.6% | 0.82 | 33.6 | 29.7 |
| YOLOX nano | 74.3% | 1.04 | 14.0 | 71.4 |
| YOLOX nanoish | 81.4% | 1.14 | 15.0 | 66.9 |
| YOLOX tiny | 91.4% | 1.28 | 16.5 | 60.7 |
| YOLOX small | 92.9% | 1.30 | 17.4 | 57.4 |
| Ground Truth | – | 1.40 | – | – |
YOLOv8n getting beaten by every single YOLOX variant while also being slower was… not what I expected. Here’s the mAP comparison across all the models over training – you can see the hierarchy pretty clearly:
The big takeaway: more capacity = better accuracy, but with diminishing returns. The jump from nano to nanoish is huge, nanoish to small is solid, and tiny lands somewhere in between depending on the epoch. (You’ll notice two extra lines in the chart – a large model and a self-sourced variant. I kept training after this post’s story ends. More on the self-sourced pipeline later. You can also see the large model is clearly overfitting past epoch ~315 – loss keeps decreasing but mAP starts dropping. My first time overfitting a model.)
The nanoish model hit a nice sweet spot. Faster than YOLOv8n, better small object detection than pure nano, and still lightweight enough for mobile.
And here is the output from my plot_training.py script:
But there was a problem I didn’t notice until later: my training dataset had zero images without aircraft in them. Every single training image contained at least one airplane. This is… not ideal if you want your model to learn what an airplane isn’t. More on that shortly.
How It Actually Works in the App
Before I get to results, here’s what the ML is actually doing in SkySpottr. The app combines multiple data sources to track aircraft:
ADS-B data tells us where aircraft are in 3D space (lat, lon, altitude)
Device GPS and orientation tell us where the phone is and which way it’s pointing
Physics-based prediction places aircraft overlays on screen based on all the above
That prediction is usually pretty good, but phone sensors drift and aircraft positions are slightly delayed. So the overlays can be off by a couple degrees. This is where YOLO comes in.
The app runs the model on each camera frame looking for aircraft. When it finds one within a threshold distance of where the physics engine predicted an aircraft should be, it “snaps” the overlay to the actual detected position. The UI shows an orange circle around the aircraft and marks it as “SkySpottd” – confirmed via machine learning.
I call this “ML snap” mode. It’s the difference between “there’s probably a plane somewhere around here” and “that specific bright dot is definitely the aircraft.”
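The matching step is simple nearest-neighbor gating in screen space. A hedged Python sketch of the idea (the names and the 80-pixel threshold are illustrative, not SkySpottr's actual values):

```python
import math

def snap_overlay(predicted, detections, max_px=80.0):
    """Pick the detection closest to the physics-predicted screen position,
    if any falls inside the snap threshold.

    `predicted` and each detection are (x, y) screen coordinates.
    Returns None when nothing is close enough, in which case the app
    keeps the pure ADS-B prediction.
    """
    best, best_dist = None, max_px
    for det in detections:
        dist = math.hypot(det[0] - predicted[0], det[1] - predicted[1])
        if dist < best_dist:
            best, best_dist = det, dist
    return best
```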
The model runs continuously on device, which is why inference time matters so much. Even at 15fps cap, that’s still 15 inference cycles per second competing with everything else the app needs to do (sensor fusion, WebSocket data, AR rendering, etc.). Early on I was seeing 130%+ CPU usage on my iPhone, which is not great for battery life. Every millisecond saved on inference is a win.
Getting YOLOX into CoreML
One thing the internet doesn’t tell you: YOLOX and Apple’s Vision framework don’t play nice together.
YOLOv8 exports to CoreML with a nice Vision-compatible interface. You hand it an image, it gives you detections. Easy. YOLOX expects different preprocessing – it wants pixel values in the 0-255 range (not normalized 0-1), and the output tensor layout is different.
The conversion pipeline goes PyTorch → TorchScript → CoreML. Here’s the core of it:
```python
import torch
import coremltools as ct
from yolox.models import YOLOX, YOLOPAFPN, YOLOXHead

# Build model (same architecture as training config)
backbone = YOLOPAFPN(depth=0.33, width=0.50, in_channels=[256, 512, 1024], act="silu")
head = YOLOXHead(num_classes=1, width=0.50, in_channels=[256, 512, 1024], act="silu")
model = YOLOX(backbone, head)

# Load trained weights
ckpt = torch.load("yolox_small_best.pth", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()
model.head.decode_in_inference = True  # Output pixel coords, not raw logits

# Trace and convert
dummy = torch.randn(1, 3, 640, 640)
traced = torch.jit.trace(model, dummy)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="images", shape=(1, 3, 640, 640))],
    outputs=[ct.TensorType(name="output")],
    minimum_deployment_target=ct.target.iOS15,
    convert_to="mlprogram",
)
mlmodel.save("yolox_small_aircraft.mlpackage")
```
The decode_in_inference = True is crucial — without it, the model outputs raw logits and you’d need to implement the decode head in Swift. With it, the output is [1, N, 6] where 6 is [x_center, y_center, width, height, obj_conf, class_score] in pixel coordinates.
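Consuming that [1, N, 6] tensor is then a few lines of filtering. A NumPy sketch (the app does this in Swift, and a real pipeline would still run NMS on the surviving boxes):

```python
import numpy as np

def filter_detections(output, conf_thresh=0.5):
    """Keep boxes from a decoded YOLOX [1, N, 6] output.

    Columns: x_center, y_center, width, height, obj_conf, class_score.
    For a single-class model, confidence = obj_conf * class_score.
    Returns the [M, 4] box coordinates above the threshold.
    """
    preds = output[0]
    conf = preds[:, 4] * preds[:, 5]
    return preds[conf > conf_thresh, :4]
```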
On the Swift side, Claude ended up writing a custom detector that bypasses the Vision framework entirely. Here’s the preprocessing — the part that was hardest to get right:
```swift
/// Convert pixel buffer to MLMultiArray [1, 3, H, W] with 0-255 range
private func preprocess(pixelBuffer: CVPixelBuffer) -> MLMultiArray? {
    // GPU-accelerated resize via Core Image
    let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
    let scaleX = CGFloat(inputSize) / ciImage.extent.width
    let scaleY = CGFloat(inputSize) / ciImage.extent.height
    let scaledImage = ciImage.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))

    // Reuse pixel buffer from pool (memory leak fix #1)
    var resizedBuffer: CVPixelBuffer?
    CVPixelBufferPoolCreatePixelBuffer(kCFAllocatorDefault, pool, &resizedBuffer)
    guard let buffer = resizedBuffer else { return nil }
    ciContext.render(scaledImage, to: buffer)

    // Reuse pre-allocated MLMultiArray (memory leak fix #2)
    guard let array = inputArray else { return nil }

    CVPixelBufferLockBaseAddress(buffer, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(buffer, .readOnly) }

    let bytesPerRow = CVPixelBufferGetBytesPerRow(buffer)
    let pixels = CVPixelBufferGetBaseAddress(buffer)!.assumingMemoryBound(to: UInt8.self)
    let arrayPtr = array.dataPointer.assumingMemoryBound(to: Float.self)
    let channelStride = inputSize * inputSize

    // BGRA → RGB, keep 0-255 range (YOLOX expects unnormalized pixels)
    // Direct pointer access is ~100x faster than MLMultiArray subscript
    for y in 0..<inputSize {
        let rowOffset = y * bytesPerRow
        let yOffset = y * inputSize
        for x in 0..<inputSize {
            let px = rowOffset + x * 4
            let idx = yOffset + x
            arrayPtr[idx] = Float(pixels[px + 2])                  // R
            arrayPtr[channelStride + idx] = Float(pixels[px + 1])  // G
            arrayPtr[2 * channelStride + idx] = Float(pixels[px])  // B
        }
    }
    return array
}
```
The two key gotchas: (1) BGRA byte order from the camera vs RGB that the model expects, and (2) YOLOX wants raw 0-255 pixel values, not the 0-1 normalized range that most CoreML models expect. If you normalize, everything silently breaks — the model runs, returns garbage, and you spend an evening wondering why.
For deployment, I used CoreML’s INT8 quantization (coremltools.optimize.coreml.linear_quantize_weights). This shrinks the model by about 50% with minimal accuracy loss. The small model went from ~17MB to 8.7MB, and inference time improved slightly.
Real World Results (Round 1)
I exported the nanoish model and got it running in SkySpottr. The good news: it works. The ML snap feature locks onto aircraft, the orange verification circles appear, and inference is fast enough that I don’t notice any lag.
The less good news: false positives. Trees, parts of houses, certain cloud formations – the model occasionally thinks these are aircraft. Remember that rookie mistake about no negative samples? Yeah.
I later set up a 3-way comparison to visualize exactly this kind of failure. The three panels show my COCO-only trained model (red boxes), a later model trained on self-sourced images (green boxes – I’ll explain this pipeline shortly), and YOLO26-X as a ground truth oracle (right panel, no boxes means no detection). The COCO-only model confidently detects an “aircraft” that is… a building. The other two correctly ignore it.
The app handles this gracefully because of the matching threshold. Random false positives in empty sky don’t trigger the snap because there’s no predicted aircraft nearby to match against. But when there’s a tree branch right next to where a plane should be, the model sometimes locks onto the wrong thing.
The even less good news: it still struggles with truly distant aircraft. A plane at 35,000 feet that’s 50+ miles away is basically a single bright pixel. No amount of ML is going to reliably detect that. For those, the app falls back on pure ADS-B prediction, which is usually good enough to get the overlay in the right general area.
But when it works, it works. I’ll show some examples of successful detections in the self-sourced section below.
The Memory Leak Discovery (Fun Debugging Tangent)
While testing the YOLOX integration, I was also trying to get RevenueCat working for subscriptions. Had the app running for about 20 minutes while I debugged the in-app purchase flow. Noticed it was getting sluggish, opened Instruments, and… yikes.
Base memory for the app is around 200MB. After 20 minutes of continuous use, it had climbed to 450MB. Classic memory leak pattern.
The culprit was AI-induced and AI-resolved: the capture pipeline was creating a new CVPixelBuffer and MLMultiArray for every single frame. At 15fps, that's 900 allocations per minute that weren't getting cleaned up fast enough.
The fix was straightforward – use a CVPixelBufferPool for the resize buffers and pre-allocate a single MLMultiArray that gets reused. Memory now stays flat even after hours of use.
(The RevenueCat thing? I ended up ditching it entirely and going with native StoreKit2. RevenueCat is great, but keeping debug and release builds separate was more hassle than it was worth for a side project. StoreKit2 is actually pretty nice these days if you don’t need the analytics. I’m at ~80 downloads, and not a single purchase. First paid app still needs some fine tuning, clearly, on the whole freemium thing.)
Round 2: Retraining with Negative Samples
After discovering the false positive issue, I went back and retrained. This time I made sure to include images without aircraft – random sky photos, clouds, trees, buildings, just random COCO2017 stuff. The model needs to learn what’s NOT an airplane just as much as what IS one.
Here’s the extraction script that handles the negative sampling. The key insight: you need to explicitly tell the model what empty sky looks like:
import json
import random

AIRPLANE_CATEGORY_ID = 5  # COCO category id for "airplane"

def extract_airplane_dataset(split="train", negative_ratio=0.2, seed=42):
    """Extract airplane images from COCO, with negative samples."""
    random.seed(seed)  # reproducible negative sampling
    with open(f"instances_{split}2017.json") as f:
        coco_data = json.load(f)
    # Find all images WITH airplanes
    airplane_image_ids = set()
    for ann in coco_data['annotations']:
        if ann['category_id'] == AIRPLANE_CATEGORY_ID:
            airplane_image_ids.add(ann['image_id'])
    # Find images WITHOUT airplanes for negative sampling
    all_ids = {img['id'] for img in coco_data['images']}
    negative_ids = all_ids - airplane_image_ids
    # Add 20% negative images (no airplanes = teach model what ISN'T a plane)
    num_negatives = int(len(airplane_image_ids) * negative_ratio)
    sampled_negatives = random.sample(sorted(negative_ids), num_negatives)
    # ... copy images and annotations to output directory
I also switched from nanoish to the small model. The accuracy improvement on distant aircraft was worth the extra compute, and with INT8 quantization the inference time came in at around 5.6ms on an iPhone – way better than my napkin math predicted. Apple’s Neural Engine is impressive.
The final production model: YOLOX-Small, 640×640 input, INT8 quantized, ~8.7MB on disk. It runs at 15fps with plenty of headroom for the rest of the app on my iPhone 17 Pro.
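For perspective on "plenty of headroom," the napkin math:

```python
# At 15 fps each frame has a ~66.7 ms budget; a 5.6 ms inference uses
# well under a tenth of it, leaving the rest for the AR overlay work.
inference_ms = 5.6
frame_budget_ms = 1000 / 15
headroom = frame_budget_ms / inference_ms
print(round(headroom, 1))  # ~11.9x
```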
Round 3: Self-Sourced Images and Closing the Loop
So the model works, but it was trained entirely on COCO2017 – airport tarmac photos, stock images, that kind of thing. My app is pointing at the sky from the ground. Those are very different domains.
I added a debug flag to SkySpottr for my phone that saves every camera frame where the model fires a detection. Just flip it on, walk around outside for a while, and the app quietly collects real-world training data. Over a few weeks of casual use, I accumulated about 2,000 images from my phone.
The problem: these images don’t have ground truth labels. I’m not going to sit there and manually draw bounding boxes on 2,000 sky photos. So I used YOLO26-X (Ultralytics’ latest and greatest, which I’m fine using as an offline tool since it never ships in the app) as a teacher model. Run it on all the collected images, take its high-confidence detections as pseudo-labels, convert to COCO annotation format, and now I have a self-sourced dataset to mix in with the original COCO training data.
Here’s the pseudo-labeling pipeline. First, run the teacher model on all collected images:
from pathlib import Path
from tqdm import tqdm
from ultralytics import YOLO

AIRPLANE_CLASS_ID = 4  # 0-indexed COCO class for "airplane" in Ultralytics models

model = YOLO("yolo26x.pt")  # Big model, accuracy over speed

detections = []  # one entry per detected box
# image_paths: the camera frames collected by the debug flag
for img_path in tqdm(image_paths, desc="Processing images"):
    results = model(str(img_path), conf=0.5, verbose=False)
    boxes = results[0].boxes
    airplane_boxes = boxes[boxes.cls == AIRPLANE_CLASS_ID]
    for box in airplane_boxes:
        xyxy = box.xyxy[0].cpu().numpy().tolist()
        x1, y1, x2, y2 = xyxy
        detections.append({
            "bbox_xywh": [x1, y1, x2 - x1, y2 - y1],  # COCO format
            "confidence": float(box.conf[0]),
        })
Then convert those detections to COCO annotation format so YOLOX can train on them:
import json
from pathlib import Path
from PIL import Image

def convert_to_coco(detections):
    """Convert YOLO26 detections to COCO training format."""
    coco_data = {
        "images": [], "annotations": [],
        "categories": [{"id": 1, "name": "airplane", "supercategory": "vehicle"}],
    }
    image_id = 0
    ann_id = 0
    for uuid, data in detections.items():
        img_path = Path(data["image_path"])
        width, height = Image.open(img_path).size
        if width > 1024 or height > 1024:  # Skip oversized images
            continue
        image_id += 1
        coco_data["images"].append({"id": image_id, "file_name": f"{uuid}.jpg",
                                    "width": width, "height": height})
        for det in data["detections"]:
            ann_id += 1
            coco_data["annotations"].append({
                "id": ann_id, "image_id": image_id, "category_id": 1,
                "bbox": det["bbox_xywh"], "area": det["bbox_xywh"][2] * det["bbox_xywh"][3],
                "iscrowd": 0,
            })
    with open("instances_train.json", "w") as f:
        json.dump(coco_data, f)
Finally, combine both datasets in the training config using YOLOX’s ConcatDataset:
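I won't paste my whole exp file, but the override has roughly this shape. Treat it as a sketch: the exact get_dataset signature and COCODataset arguments vary across YOLOX versions, and the directory names here are placeholders, not my actual paths:

```python
# In the custom exp file: return a concatenation of the COCO airplane
# subset and the self-sourced pseudo-labeled dataset.
from yolox.data import COCODataset, TrainTransform
from yolox.data.datasets import ConcatDataset

def get_dataset(self, cache=False, cache_type="ram"):
    coco = COCODataset(
        data_dir="datasets/coco_airplanes",      # placeholder path
        json_file="instances_train.json",
        img_size=self.input_size,
        preproc=TrainTransform(max_labels=50),
    )
    self_sourced = COCODataset(
        data_dir="datasets/self_sourced",        # placeholder path
        json_file="instances_train.json",
        img_size=self.input_size,
        preproc=TrainTransform(max_labels=50),
    )
    return ConcatDataset([coco, self_sourced])
```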
Out of 2,000 images, YOLO26-X found aircraft in about 108 of them at a 0.5 confidence threshold – a roughly 5% hit rate, which makes sense since most frames are just empty sky between detections. I filtered out anything over 1024px and ended up with a nice supplementary dataset of aircraft-from-the-ground images.
The 3-way comparison images I showed earlier came from this pipeline. Here’s what successful detections look like – the COCO-only model (red), self-sourced model (green), and YOLO26-X (right panel, shown at full resolution so you can see what we’re actually detecting):
That’s maybe 30 pixels of airplane against blue sky, detected with 0.88 and 0.92 confidence by the two YOLOX variants.
And here’s one I particularly like – aircraft spotted through pine tree branches. Real-world conditions, not a clean test image. Both YOLOX models nail it, while YOLO26-X misses at this confidence threshold:
And a recent one from February 12, 2026 – a pair of what appear to be F/A-18s over Denver at 4:22 PM MST, captured at 12x zoom. The model picks up both jets at 73-75% confidence, plus the bird in the bottom-right at 77% (a false positive the app filters out via ADS-B matching). Not bad for specks against an overcast sky.
I also trained a full YOLOX-Large model (depth 1.0, width 1.0, 1024×1024 input) on the combined dataset, just to see how far I could push it. Too heavy for phone deployment, but useful for understanding the accuracy ceiling.
Conclusion
Was this worth it to avoid Ultralytics’ licensing? It took an afternoon plus a couple evenings of vibe-coding, so yes – switching wasn’t hard. And the payoff wasn’t just that MIT is cleaner than AGPL: I learned a ton about how these models actually work. The Ultralytics ecosystem is so polished that it’s easy to treat it as a black box. Building from YOLOX forced me to understand the model’s internals, the training configs, and the tradeoffs between model size and accuracy.
Plus, I can now say I trained my own object detection model from scratch. That’s worth something at parties. Nerdy parties, anyway.
SkySpottr is live on the App Store if you want to see the model in action – point your phone at the sky and watch it lock onto aircraft in real-time.
The self-sourced pipeline is still running. Every time I use the app with the debug flag on, it collects more training data. The plan is to periodically retrain as the dataset grows – especially now that I’m getting images from different weather conditions, times of day, and altitudes. The COCO-only model was a solid start, but a model trained on actual ground-looking-up images of aircraft at altitude? That’s the endgame.
But there was a problem. Despite having a stable PPS reference, my NTP server’s frequency drift was exhibiting significant variation over time. After months (years) of monitoring the system with Grafana dashboards, I noticed something interesting: the frequency oscillations seemed to correlate with CPU temperature changes. The frequency would drift as the CPU heated up during the day and cooled down at night, even though the PPS reference remained rock-solid.
Like clockwork (no pun intended), I somehow get sucked back into trying to improve my setup every 6-8 weeks. This post is the latest on that never-ending quest.
This post details how I achieved an 81% reduction in frequency variability and 77% reduction in frequency standard deviation through a combination of CPU core pinning and thermal stabilization. Welcome to Austin’s Nerdy Things, where we solve problems that 99.999% of people (and 99% of datacenters) don’t have.
The Problem: Thermal-Induced Timing Jitter
Modern CPUs, including those in Raspberry Pis, use dynamic frequency scaling to save power and manage heat. When the CPU is idle, it runs at a lower frequency (and voltage). When load increases, it scales up. This is great for power efficiency, but terrible for precision timekeeping.
Why? Because timekeeping (with NTP/chronyd/others) relies on a stable system clock to discipline itself against reference sources. If the CPU frequency is constantly changing, the system clock’s tick rate varies, introducing jitter into the timing measurements. Even though my PPS signal was providing a mostly perfect 1-pulse-per-second reference, the CPU’s frequency bouncing around made it harder for chronyd to maintain a stable lock.
But here’s the key insight: the system clock is ultimately derived from a crystal oscillator, and crystal oscillator frequency is temperature-dependent. The oscillator sits on the board near the CPU, and as the CPU heats up and cools down throughout the day, so does the crystal. Even a few degrees of temperature change can shift the oscillator’s frequency by parts per million – exactly what I was seeing in my frequency drift graphs. The CPU frequency scaling was one factor, but the underlying problem was that temperature changes were affecting the crystal oscillator itself. By stabilizing the CPU temperature, I could stabilize the thermal environment for the crystal oscillator, keeping its frequency consistent.
Looking at my Grafana dashboard, I could see the frequency offset wandering over a range of about 1 PPM (parts per million) as the Pi warmed up and cooled down throughout the day. The RMS offset was averaging around 86 nanoseconds, which isn’t terrible (it’s actually really, really, really good), but I knew it could be better.
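To put that 1 PPM wander in perspective, here's the napkin math (just unit conversion, nothing fancy):

```python
# A clock running f ppm off frequency accumulates f microseconds of error
# every second if nothing corrects it.
def drift_ns(ppm, seconds):
    """Time error in nanoseconds accumulated at a given frequency offset."""
    return ppm * 1e-6 * seconds * 1e9

# 1 PPM of wander, left uncorrected for one second, is 1000 ns of error --
# an order of magnitude larger than the ~86 ns RMS offset, so chronyd is
# constantly chasing the drift.
print(round(drift_ns(1, 1)))  # 1000
```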
The Discovery
After staring at graphs for longer than I’d like to admit, I had an idea: what if I could keep the CPU at a constant temperature? If the temperature (and therefore the frequency) stayed stable, maybe the timing would stabilize too.
The solution came in two parts:
1. CPU core isolation – Dedicate CPU 0 exclusively to timing-critical tasks (chronyd and PPS interrupts)
2. Thermal stabilization – Keep the other CPUs busy to maintain a constant temperature, preventing frequency scaling
Here’s what happened when I turned on the thermal stabilization system on November 17, 2025 at 09:10 AM:
Same-ish graph, but with CPU temp also plotted:
The vertical red line on the first plot marks when I activated the “time burner” process. Notice how the frequency oscillations immediately dampen and settle into a much tighter band? Let’s dive into how this works.
The Solution Part 1: CPU Core Pinning and Real-Time Priority
The first step is isolating timing-critical operations onto a dedicated CPU core. On a Raspberry Pi (4-core ARM), this means:
CPU 0: Reserved for chronyd and PPS interrupts
CPUs 1-3: Everything else, including our thermal load
I had AI (probably Claude Sonnet 4 ish, maybe 4.5) create a boot optimization script that runs at system startup:
#!/bin/bash
# PPS NTP Server Performance Optimization Script
# Sets CPU affinity, priorities, and performance governor at boot
set -e

echo "Setting up PPS NTP server performance optimizations..."

# Wait for system to be ready
sleep 5

# Set CPU governor to performance mode
echo "Setting CPU governor to performance..."
cpupower frequency-set -g performance

# Pin PPS interrupt to CPU0 (may fail if already pinned, that's OK)
echo "Configuring PPS interrupt affinity..."
echo 1 > /proc/irq/200/smp_affinity 2>/dev/null || echo "PPS IRQ already configured"

# Wait for chronyd to start
echo "Waiting for chronyd to start..."
timeout=30
while [ $timeout -gt 0 ]; do
    chronyd_pid=$(pgrep chronyd 2>/dev/null || echo "")
    if [ -n "$chronyd_pid" ]; then
        echo "Found chronyd PID: $chronyd_pid"
        break
    fi
    sleep 1
    ((timeout--))
done

if [ -z "$chronyd_pid" ]; then
    echo "Warning: chronyd not found after 30 seconds"
else
    # Set chronyd to real-time priority and pin to CPU 0
    echo "Setting chronyd to real-time priority and pinning to CPU 0..."
    chrt -f -p 50 $chronyd_pid
    taskset -cp 0 $chronyd_pid
fi

# Boost ksoftirqd/0 priority
echo "Boosting ksoftirqd/0 priority..."
ksoftirqd_pid=$(ps aux | grep '\[ksoftirqd/0\]' | grep -v grep | awk '{print $2}')
if [ -n "$ksoftirqd_pid" ]; then
    renice -n -10 $ksoftirqd_pid
    echo "ksoftirqd/0 priority boosted (PID: $ksoftirqd_pid)"
else
    echo "Warning: ksoftirqd/0 not found"
fi

echo "PPS NTP optimization complete!"

# Log current status
echo "=== Current Status ==="
echo "CPU Governor: $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"
echo "PPS IRQ Affinity: $(cat /proc/irq/200/effective_affinity_list 2>/dev/null || echo 'not readable')"
if [ -n "$chronyd_pid" ]; then
    echo "chronyd Priority: $(chrt -p $chronyd_pid)"
fi
echo "======================"
What this does:
Performance Governor: Forces all CPUs to run at maximum frequency, disabling frequency scaling
PPS IRQ Pinning: Ensures PPS interrupt (IRQ 200) is handled exclusively by CPU 0
Chronyd Real-Time Priority: Sets chronyd to SCHED_FIFO priority 50, giving it preferential CPU scheduling
Chronyd CPU Affinity: Pins chronyd to CPU 0 using taskset
ksoftirqd Priority Boost: Improves priority of the kernel softirq handler on CPU 0
This script can be added to /etc/rc.local or as a systemd service to run at boot.
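If you go the systemd route, a minimal unit might look like the following – the unit name, script path, and chrony service name are my assumptions, so adjust for your distro:

```ini
[Unit]
Description=PPS NTP server performance optimizations
After=chrony.service
Wants=chrony.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/pps-ntp-optimize.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now` on whatever you name the unit. The After=/Wants= ordering matters here, since the script needs chronyd running before it can pin it.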
The Solution Part 2: PID-Controlled Thermal Stabilization
Setting the performance governor helps, but on a Raspberry Pi, even at max frequency, the CPU temperature will still vary based on ambient conditions and load. Temperature changes affect the CPU’s actual operating frequency due to thermal characteristics of the silicon.
The solution? Keep the CPU at a constant temperature using a PID-controlled thermal load. I call it the “time burner” (inspired by CPU burn-in tools, but with precise temperature control).
As a reminder of what we’re really doing here: we’re maintaining a stable thermal environment for the crystal oscillator. The RPi 3B’s 19.2 MHz oscillator is physically located near the CPU on the Raspberry Pi board, so by actively controlling CPU temperature, we’re indirectly controlling the oscillator’s temperature. Since the oscillator’s frequency is temperature-dependent (this is basic physics of quartz crystals), keeping it at a constant temperature means keeping its frequency stable – which is exactly what we need for precise timekeeping.
Here’s how it works:
Read CPU temperature from /sys/class/thermal/thermal_zone0/temp
PID controller calculates how much CPU time to burn to maintain target temperature (I chose 54°C)
Three worker processes run on CPUs 1, 2, and 3 (avoiding CPU 0)
Each worker alternates between busy-loop (MD5 hashing) and sleeping based on PID output
Temperature stabilizes at the setpoint, preventing thermal drift
Here’s the core implementation (simplified for readability):
#!/usr/bin/env python3
import time
import argparse
import multiprocessing
import hashlib
import os
from collections import deque


class PIDController:
    """Simple PID controller with output clamping and anti-windup."""

    def __init__(self, Kp, Ki, Kd, setpoint, output_limits=(0, 1), sample_time=1.0):
        self.Kp = Kp
        self.Ki = Ki
        self.Kd = Kd
        self.setpoint = setpoint
        self.output_limits = output_limits
        self.sample_time = sample_time
        self._last_time = time.time()
        self._last_error = 0.0
        self._integral = 0.0
        self._last_output = 0.0

    def update(self, measurement):
        """Compute new output of PID based on measurement."""
        now = time.time()
        dt = now - self._last_time
        if dt < self.sample_time:
            return self._last_output
        error = self.setpoint - measurement
        # Proportional
        P = self.Kp * error
        # Integral with anti-windup
        self._integral += error * dt
        I = self.Ki * self._integral
        # Derivative
        derivative = (error - self._last_error) / dt if dt > 0 else 0.0
        D = self.Kd * derivative
        # Combine and clamp
        output = P + I + D
        low, high = self.output_limits
        output = max(low, min(high, output))
        self._last_output = output
        self._last_error = error
        self._last_time = now
        return output


def read_cpu_temperature(path='/sys/class/thermal/thermal_zone0/temp'):
    """Return CPU temperature in Celsius."""
    with open(path, 'r') as f:
        temp_str = f.read().strip()
    return float(temp_str) / 1000.0


def burn_cpu(duration):
    """Busy-loop hashing for 'duration' seconds."""
    end_time = time.time() + duration
    m = hashlib.md5()
    while time.time() < end_time:
        m.update(b"burning-cpu")


def worker_loop(worker_id, cmd_queue, done_queue):
    """
    Worker process:
    - Pins itself to CPU 1, 2, or 3 (avoiding CPU 0)
    - Burns CPU based on commands from main process
    """
    available_cpus = [1, 2, 3]
    cpu_to_use = available_cpus[worker_id % len(available_cpus)]
    os.sched_setaffinity(0, {cpu_to_use})
    print(f"Worker {worker_id} pinned to CPU {cpu_to_use}")
    while True:
        cmd = cmd_queue.get()
        if cmd is None:
            break
        burn_time, sleep_time = cmd
        burn_cpu(burn_time)
        time.sleep(sleep_time)
        done_queue.put(worker_id)


# Main control loop (simplified)
def main():
    target_temp = 54.0      # degrees Celsius
    control_window = 0.20   # 200ms cycle time
    pid = PIDController(Kp=0.05, Ki=0.02, Kd=0.0,
                        setpoint=target_temp,
                        sample_time=0.18)

    # Start 3 worker processes
    workers = []
    cmd_queues = []
    done_queue = multiprocessing.Queue()
    for i in range(3):
        q = multiprocessing.Queue()
        p = multiprocessing.Process(target=worker_loop, args=(i, q, done_queue))
        p.start()
        workers.append(p)
        cmd_queues.append(q)

    try:
        while True:
            # Measure temperature
            current_temp = read_cpu_temperature()
            # PID control: output is fraction of time to burn (0.0 to 1.0)
            output = pid.update(current_temp)
            # Convert to burn/sleep times
            burn_time = output * control_window
            sleep_time = control_window - burn_time
            # Send command to all workers
            for q in cmd_queues:
                q.put((burn_time, sleep_time))
            # Wait for workers to complete
            for _ in range(3):
                done_queue.get()
            print(f"Temp={current_temp:.2f}C, Output={output:.2f}, "
                  f"Burn={burn_time:.2f}s")
    except KeyboardInterrupt:
        for q in cmd_queues:
            q.put(None)
        for p in workers:
            p.join()


if __name__ == '__main__':
    main()
The full implementation includes a temperature filtering system to smooth out sensor noise and command-line arguments for tuning the PID parameters.
PID Tuning Notes:
Kp=0.05: Proportional gain – responds to current error
Ki=0.02: Integral gain – eliminates steady-state error
Kd=0.0: Derivative gain – set to zero because temperature changes slowly
The target temperature of 54°C was chosen empirically – high enough to keep the CPU from idling down, but low enough to avoid thermal throttling (which starts around 80°C on Raspberry Pi).
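To get a feel for how those gains behave, here's a toy closed-loop simulation against a made-up first-order thermal model. The gains, cycle time, and 54°C setpoint are from the post, but the heating/cooling constants are invented purely for illustration – don't read real Pi thermal behavior into it:

```python
# Toy sanity check: a fake "CPU" that heats in proportion to burn duty and
# cools toward ambient (Newtonian cooling). NOT the real Pi thermals.
def simulate(setpoint=54.0, ambient=50.0, steps=2000):
    Kp, Ki, dt = 0.05, 0.02, 0.2   # gains and 200ms cycle time from the post
    temp, integral = ambient, 0.0
    for _ in range(steps):
        error = setpoint - temp
        integral += error * dt
        duty = max(0.0, min(1.0, Kp * error + Ki * integral))  # burn fraction
        # made-up plant: full duty can lift the die up to 30C over ambient
        temp += (duty * 30.0 - (temp - ambient)) * 0.01
    return temp

print(round(simulate(), 1))  # settles at the setpoint
```

Even in this toy model you can see the division of labor: the small Kp barely moves the duty cycle on its own, and it's the integral term that ends up holding the steady-state burn.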
The Results: Numbers Don’t Lie
The improvement was immediately visible. Here are the statistics comparing performance before and after the optimization:
A note on ambient conditions: The Raspberry Pi lives in a project enclosure in our master bedroom (chosen for its decent GPS reception and ADS-B coverage for a new aircraft AR overlay app idea I’m working on also running on this Pi). While the time burner maintains the CPU die temperature at 54°C, the enclosure is still subject to ambient temperature swings. Room temperature cycles from a low of 66°F (18.9°C) at 5:15 AM to a peak of 72°F (22.2°C) at 11:30 AM – a 6°F daily swing from our heating schedule. The fact that we see such dramatic frequency stability improvements despite this ambient variation speaks to how effective the thermal control is. The CPU’s active heating overwhelms the environmental changes, maintaining consistent silicon temperature where it matters most.
Frequency Stability
Metric               Before      After       Improvement
Mean RMS Offset      85.44 ns    43.54 ns    49.0% reduction
Median RMS Offset    80.13 ns    37.93 ns    52.7% reduction
The RMS offset is chronyd’s estimate of the timing uncertainty. Cutting this nearly in half means the system is maintaining significantly better time accuracy.
Setup Instructions
Want to replicate this? Here’s the step-by-step process:
Prerequisites
You need a working GPS PPS NTP server setup. If you don’t have one yet, follow my 2025 NTP guide first.
# Verify CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should output: performance
# Check chronyd CPU affinity and priority
ps -eo pid,comm,psr,ni,rtprio | grep chronyd
# Should show psr=0 (CPU 0) and rtprio=50
# Check time burner processes
ps aux | grep time_burner
# Should show 4 processes (1 main + 3 workers)
# Monitor NTP performance
chronyc tracking
Example output from chronyc tracking:
Reference ID : 50505300 (PPS)
Stratum : 1
Ref time (UTC) : Sun Nov 24 16:45:23 2025
System time : 0.000000038 seconds fast of NTP time
Last offset : -0.000000012 seconds
RMS offset : 0.000000035 seconds
Frequency : 1.685 ppm slow
Residual freq : -0.001 ppm
Skew : 0.002 ppm
Root delay : 0.000000001 seconds
Root dispersion : 0.000010521 seconds
Update interval : 16.0 seconds
Leap status : Normal
Notice the RMS offset of 35 nanoseconds – this is the kind of accuracy you can achieve with thermal stabilization.
Step 6: Monitor Over Time
(Topic for a future post)
Set up Grafana dashboards to monitor:
Frequency offset (PPM)
RMS offset (nanoseconds)
CPU temperature
System time offset
You’ll see the frequency stabilize within a few hours as the PID controller locks onto the target temperature.
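If you want those numbers in Grafana without a full exporter, one low-effort approach is scraping `chronyc tracking` and parsing the fields. A minimal sketch – the parsing approach is my choice for illustration, not my actual dashboard plumbing:

```python
import re

def parse_tracking(text):
    """Parse `chronyc tracking` output into {field: float} (seconds / ppm)."""
    metrics = {}
    for line in text.splitlines():
        if ':' not in line:
            continue
        key, _, value = line.partition(':')
        m = re.search(r'-?\d+\.?\d*', value)
        if m:
            # note: lines with dates (e.g. "Ref time") grab the first number;
            # filter those keys out before shipping to your metrics DB
            metrics[key.strip()] = float(m.group())
    return metrics

sample = """Frequency       : 1.685 ppm slow
RMS offset      : 0.000000035 seconds"""
print(parse_tracking(sample))
```

One caveat: chronyc reports direction as the words "slow"/"fast", so this captures magnitude only – handle the sign separately if it matters for your graphs.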
Monitoring and Troubleshooting
Real-Time Monitoring
Watch chronyd tracking in real-time:
watch -n 1 "chronyc tracking"
Check time burner status:
sudo systemctl status time-burner.service
View time burner output:
sudo journalctl -u time-burner.service -f
Common Issues
Temperature overshoots or oscillates:
Adjust PID gains – reduce Kp if the temperature oscillates, increase Ki if there’s a persistent steady-state error
Try different target temperatures (50-60°C range)
High CPU usage (obviously):
This is intentional – the time burner uses ~90% of 3 cores
Not suitable for Pis running other workloads
Chronyd not pinned to CPU 0:
Check that the optimization script runs after chronyd starts
Adjust the timing in the systemd service dependencies
Trade-offs and Considerations
Let’s be honest about the downsides:
Power Consumption
The time burner keeps 3 cores at ~30% average utilization. My Pi now draws about 3-4W continuously (vs 1-2W idle). Over a year, that’s an extra 15-25 kWh, or about $2-3 in electricity (depending on your rates).
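The math behind that estimate, if you want to plug in your own numbers:

```python
# ~2 W of extra continuous draw (3-4 W loaded vs 1-2 W idle), over a year.
extra_watts = 2
kwh_per_year = extra_watts * 24 * 365 / 1000
print(round(kwh_per_year, 1))  # 17.5
```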
Heat
Running at 54°C means the Pi is warm to the touch. This is well within safe operating temperature (thermal throttling doesn’t start until 80°C), but you might want to ensure adequate ventilation. I added a small heatsink just to be safe.
CPU Resources
You’re dedicating 3 of 4 cores to burning cycles. This is fine for a dedicated NTP server, but not suitable if you’re running other services on the same Pi. That said, I am also running the feeder to my new ADS-B aircraft visualization app on it. My readsb instance regularly gets to 1200 msg/s with 200+ aircraft.
Is It Worth It?
For 99.999% of use cases: absolutely not.
Most applications don’t need better than millisecond accuracy, let alone the 35-nanosecond RMS offset I’m achieving. Even for distributed systems, microsecond-level accuracy is typically overkill.
When this might make sense:
Precision timing applications (scientific instrumentation, radio astronomy)
Distributed systems research requiring tight clock synchronization
Network testing where timing precision affects results
Because you can (the best reason for any homelab project)
For me, this falls squarely in the “because you can” category. I had the monitoring infrastructure in place, noticed the thermal correlation, and couldn’t resist solving the problem. Plus, I learned a lot about PID control, CPU thermal characteristics, and Linux real-time scheduling.
Future Improvements
Some ideas I’m considering:
Adaptive PID Tuning
The current PID gains are hand-tuned for a specific ambient temperature range. The fairly low P value is to avoid spikes when some load on the Pi kicks up the temp. The I is a balance to keep long term “burn” relatively consistent. Implementing an auto-tuning algorithm (like Ziegler-Nichols) or adaptive PID could handle seasonal temperature variations better.
Hardware Thermal Control
Instead of software thermal control, I could add an actively cooled heatsink with PWM fan control. This might achieve similar temperature stability while using less power overall.
Oven-Controlled Crystal Oscillator (OCXO)
For the ultimate in frequency stability, replacing the Pi’s crystal with a temperature-controlled OCXO would eliminate thermal drift at the source. This is how professional timing equipment works. I do have a BH3SAP GPSDO sitting next to me (subject to a future post)… Then again, I’m the person who just wrote 4000 words about optimizing a $50 time server, so who am I kidding?
Conclusions
Through a combination of CPU core isolation and PID-controlled thermal stabilization, I achieved:
81% reduction in frequency variability
77% reduction in frequency standard deviation
74% reduction in frequency range
49% reduction in RMS offset
The system now maintains 38-nanosecond median RMS offset from the GPS PPS reference, with frequency drift that’s barely detectable in the noise. The CPU runs at a constant 54°C, and in steady state, the frequency offset stays within a tight ±0.14 PPM band (compared to ±0.52 PPM before optimization).
Was this necessary? No. Did I learn a bunch about thermal management, PID control, and Linux real-time scheduling? Yes. Would I do it again? Absolutely.
Resource
I did come across a “burn” script that was the basis for this thermal management. I can’t find it at the moment, but when I do I’ll link it here.
Have questions or suggestions? Drop a comment below. I’m particularly interested to hear if anyone has tried alternative thermal management approaches or has experience with OCXO modules for Raspberry Pi timing applications.
In the last two PPS posts (the original in 2021 and the revisit in 2025), we explored how to get microsecond-accurate time with a Raspberry Pi and a GPS module that outputs a once-per-second pulse (PPS). That project was a ton of fun—and borderline overkill for most home setups—but it got us into the realm of microseconds! Now we’re going to shoot for yet another SI prefix leap and aim for nanosecond accuracy. That’s 1 ns = 0.000000001 seconds (alternatively, it means there are 1 billion nanoseconds in one second).
How? By using the Precision Time Protocol (PTP, IEEE 1588). PTP is designed for high-precision time synchronization over a network, commonly used in financial trading, industrial control, and telecom environments. With the right hardware and configuration, you can synchronize clocks across your devices to within hundreds of nanoseconds with common homelab gear. Is the title a little misleading? Maybe, but technically it still makes sense to use the nano prefix for the numbers that we’re talking about here (anything >1000 nanoseconds should probably be referred to in microseconds).
To be clear, the nanoseconds here refer to the synchronization between devices on your network! Depending on how your Pi is set up and the quality of its oscillator, it is unlikely that your Pi’s absolute timing, as determined by the PPS signals, will be as accurate or precise as the PTP synchronization.
As always, do you need nanosecond-level timing at home? Absolutely, 100% no. But this is Austin’s Nerdy Things, so here we are (again)!
Why would you need time this accurate at home?
You don’t, at all. Even microsecond-level accuracy is already overkill for home usage. But there are some niche use cases:
Amateur radio or signal processing that needs super-tight phase alignment.
High-speed data acquisition where you want to correlate measurements with precise timestamps.
Simply pushing the limits of what’s possible because (if you read far enough back in my about me) the last four digits of my phone number spell NERD (seriously. and I’ve had my phone number since I was 15.)
PTP can outperform NTP by a few orders of magnitude if everything is set up correctly with hardware timestamping. With PTP, your network cards (and potentially switches) handle timestamps in hardware, avoiding much of the jitter introduced by the kernel and software layers.
Diagram showing the various places timestamping can occur in the processing of an Ethernet packet – the closer to the link, the better for timing purposes. Source: https://networklessons.com/ip-services/introduction-to-precision-time-protocol-ptp
Disclaimer
My experiments appear to be relatively successful but I need to get this out of the way: this level of timing is solidly into the realm of experts. I kinda sorta understand most of what’s going on here but there are a ton of super detailed nuances that go way over my head. Pretty sure some people spend a lifetime on this kind of stuff (particularly at places like the US National Institute of Standards and Technology – NIST, which is “up the street” from where I live and is one of the NTP sources I use). Nanoseconds are being reported but I have no way to verify.
Materials needed
Two machines/computers with a NIC (network interface card) that supports hardware timestamping – many server NICs have this, quite a few “prosumer” Intel NICs do too (examples: i210, i340, i350, some i225/i226), and, essential for the revisited PPS NTP post, so do Raspberry Pi 5s. PTP is also known as IEEE 1588 (the name of the standard), so you may see either on datasheets.
A very local area network. From what I’ve read, this won’t work well over a WAN, especially if there is asymmetric latency (a typical homelab network, even across a couple switches, will be fine)
A machine with highly accurate time (perhaps from PPS GPS sync) to be used as the “grandmaster”, which is PTP-speak for server.
Procedure
The general procedure will be to set up the server first, which involves syncing the PHC (physical hardware clock) of the NIC to the system clock, which is itself disciplined from elsewhere. After the PHC is synchronized to the system clock, we will use linuxptp (ptp4l) to act as a server. After that, we will essentially do the opposite on any client machines – synchronize the PHC from the PTP grandmaster, and then sync the system clock from the PHC.
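In command form, the plan looks roughly like this – a sketch only, using my interface names, with the flags explained as we go through the steps (ptp4l’s -H selects hardware timestamping and -s restricts the client to slave-only mode):

```shell
# --- grandmaster (Raspberry Pi 5, system clock already disciplined by GPS/PPS) ---
sudo phc2sys -s CLOCK_REALTIME -c eth0 -O 0 --step_threshold=0.5 -m   # system clock -> NIC PHC
sudo ptp4l -i eth0 -H -m                                              # serve PTP from the PHC

# --- client (the Optiplex) ---
sudo ptp4l -i enp0s31f6 -H -s -m                                      # NIC PHC <- grandmaster
sudo phc2sys -s enp0s31f6 -c CLOCK_REALTIME -O 0 --step_threshold=0.5 -m  # PHC -> system clock
```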
0 – Ensure your NIC supports hardware timestamps
Run ethtool to check if your NIC supports hardware timestamps. The format is ethtool -T [nic name]. My NIC is named enp0s31f6 so I will use that. This is an I219-LM in a Dell Optiplex 7040, which is not exactly new but works very well as a Proxmox Backup Server.
ethtool -T enp0s31f6
root@pbs:~# ethtool -T enp0s31f6
Time stamping parameters for enp0s31f6:
Capabilities:
hardware-transmit
software-transmit
hardware-receive
software-receive
software-system-clock
hardware-raw-clock
PTP Hardware Clock: 0
Hardware Transmit Timestamp Modes:
off
on
Hardware Receive Filter Modes:
none
all
ptpv1-l4-sync
ptpv1-l4-delay-req
ptpv2-l4-sync
ptpv2-l4-delay-req
ptpv2-l2-sync
ptpv2-l2-delay-req
ptpv2-event
ptpv2-sync
ptpv2-delay-req
root@pbs:~# ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 48:4d:7e:db:98:6b brd ff:ff:ff:ff:ff:ff
root@pbs:~# lspci | grep Ether
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
The lines to look for are in the capabilities section:
hardware-transmit
hardware-receive
We have those so we’re good to go on the client side. I haven’t explored those hardware receive filter modes yet but they look interesting.
The server is the Raspberry Pi 5 which shows similar output:
austin@raspberrypi5:~ $ ethtool -T eth0
Time stamping parameters for eth0:
Capabilities:
hardware-transmit
software-transmit
hardware-receive
software-receive
software-system-clock
hardware-raw-clock
PTP Hardware Clock: 0
Hardware Transmit Timestamp Modes:
off
on
onestep-sync
Hardware Receive Filter Modes:
none
all
1 – Synchronize the hardware clock
First, install linuxptp on both server and client
sudo apt install linuxptp
With linuxptp installed, we will use phc2sys to synchronize the various clocks. Despite the name, phc2sys can be used to synchronize either direction (from PHC to system clock or from system clock to PHC).
With that out of the way, let’s get to the command:
# s = source
# c = destination, replace with your NIC name
# O = offset. PTP traditionally uses TAI, which doesn't use leap seconds and, as of Feb 2025, is 37 seconds off of UTC. 0 means use whatever the system clock is using
# step_threshold means any delta above this amount should just be jumped instead of slowly slewed by speeding up/slowing down the clock
# m = print out status messages
sudo phc2sys -s CLOCK_REALTIME -c eth0 -O 0 --step_threshold=0.5 -m
And the results:
screenshot of phc2sys synchronizing the PHC of a Raspberry Pi 5 NIC with the system clock
Here we see three fields with numbers (offset/delay in nanoseconds and freq in parts per billion (ppb)):
Offset is how far off the PHC is from the realtime clock (starting at 3.4 million nanoseconds = 3.4 milliseconds and then stepping down to 28 nanoseconds)
Frequency is the frequency adjustment of the destination clock (in this case, the eth0 NIC PHC)
Delay is the estimated amount of time to get the message from the source to destination (which is suspiciously high for this NIC, other machines typically show much lower numbers)
Leave this running (we’ll daemon-ize things at the end).
2 – Tune the Raspberry Pi 5 NIC driver to reduce latency
The Raspberry Pi Ethernet driver coalesces packets (collects them for a period of time before processing), which is 49 microseconds by default.
Raspberry Pi showing 49 microseconds of packet coalescing
We can reduce that to the driver minimum of 4 microseconds:
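The exact command didn’t make it into this draft; a sketch using ethtool’s coalescing flags (assuming the NIC is eth0 – check what your driver actually supports with ethtool -c first):

```shell
# show the current interrupt coalescing settings
ethtool -c eth0
# reduce receive coalescing from the default 49 us to the driver minimum of 4 us
sudo ethtool -C eth0 rx-usecs 4
```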
3 – Serve the time via PTP over the network
Next up is to use ptp4l to serve the time via PTP over your network.
We need a configuration file to give to ptp4l. This isn’t strictly necessary – most config items can be passed as command-line arguments – but I like config files.
Call this file whatever (perhaps ptp-gm.conf, for precision time protocol grandmaster):
[global]
# extra logging
verbose 1
# use hardware timestamping (alternative is software, which isn't nearly as accurate/precise)
time_stamping hardware
# you can specify a "domain number", which is analogous to a VLAN
#domainNumber 0
# force this node to act as a master (won't revert to slave).
masterOnly 1
# priority settings, 128 is default. lower numbers are higher priority in case there are multiple grandmasters
priority1 128
# clockClass=6 for GNSS reference
# other classes = https://documentation.nokia.com/srlinux/24-10/books/network-synchronization/ieee-1588-ptp.html
clockClass 6
# timeSource is where time comes from - 0x10 is "atomic clock" which is a bit sus for us but not ultimately wrong
# https://support.spirent.com/csc30/s/article/FAQ14011
timeSource 0x10
# log output to a file, summary interval is 2^x, so 1 = 2^1 = every 2 seconds
# can also output with -m
# summary_interval 1
# logfile /var/log/ptp4l.log
Now run ptp4l as well:
sudo ptp4l -f ptp-gm.conf -i eth0
You’ll see some output as things get set up and running. Key things to look for are “selected local clock … as best master” and “assuming grand master role”. The MAC shown is actually from the NIC.
Raspberry Pi 5 acting as PTP grandmaster, using the physical hardware clock of the NIC as the “local clock”, which is synchronized with the realtime clock via phc2sys which is synchronized via PPS/GPS.
Now we are ready to serve this time to clients.
4 – Receive PTP over the network
To get PTP over the network, you can use a NIC that only supports software timestamping, but we’re going for higher accuracy/precision than that, so select a machine with a NIC that supports PTP/IEEE 1588 (see step 0 for reference).
Setting the system time via PTP is really a two-step process – synchronizing the NIC PHC with PTP, and then using phc2sys to synchronize the system clock with the PHC. If you are thinking this sounds similar to the end of step 2, you are correct – it is just the reverse for the clients.
Diagram showing the source -> PHC -> system clock -> PTP -> network -> PTP -> PHC -> system clock flow. Source: https://latency-matters.medium.com/be-back-on-time-b3267f62d76a
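The client-side ptp4l invocation isn’t shown in my notes; a minimal sketch, assuming your client NIC is named enp0s31f6 and the grandmaster is on the default domain:

```shell
# -i: interface, -s: slave-only mode (never try to become master), -m: print status messages
sudo ptp4l -i enp0s31f6 -s -m
```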
And you will start seeing some init messages followed by some statistics scrolling past:
ptp4l as slave showing double-digit nanosecond synchronization
The output looks a bit different if there are more requests/polls than summary outputs – RMS will be added, which is root mean squared error, along with max error, and some +/- indicators on the frequency and delay. That delay is still suspicious…
We see here that we have achieved double-digit nanosecond synchronization across the network!
Now compare to a Supermicro Xeon v4 server running Intel i350 NICs synchronizing to an OSA 5401 SyncPlug – super stable and tight precision.
ptp4l as slave showing single-digit nanosecond synchronization
The OSA 5401 has an oscillator rated to 1 ppb, and is exceptionally stable. That is half of the equation – the i350 is certainly better than the i219, but probably not by orders of magnitude like the OSA 5401 is.
Oscilloquartz OSA 5401 SyncPlug in a Brocade switch with GPS antenna connected to its SMA port, showing the once-per-second green LED lit. Bet you’ve never seen a GPS antenna port on an SFP module before. This is a complete computer in the SFP module.
Actually, I can try synchronizing the i219-LM to the SyncPlug. Long story short: I use Ethernet layer 2 on this system (not layer 3) because the Proxmox hosts share their NICs and that’s just how I set it up originally. I also use domain number 24 because that’s what the OSA 5401 came with from eBay.
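For reference, the layer-2 / domain-24 setup amounts to two extra lines in the ptp4l config (these values are specific to my network – domain 24 is just what the SyncPlug shipped with):

```
[global]
network_transport L2
domainNumber 24
```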
We can see it is a little bit better, but still not nearly as good as the i350. I am now tempted to try my Solarflare/Mellanox NICs, especially since the one I just looked at in my cupboard has u.fl connectors for both PPS in and out… topic for a future post.
With the PHC of the client system synchronized, we are 3/4 of the way to completion. Last up – setting the system clock from the PHC.
5 – Setting client clock from PHC
I originally just used the PHC as a source in Chrony. This works well. Through my research for this post, I saw that it is also possible to share a memory segment from ptp4l to Chrony. I like just using the PHC, so we’ll use that approach here.
Add this line to your chrony config:
refclock PHC /dev/ptp0 poll 0 dpoll -5 tai
poll 0 means poll the source every second (2^0 = 1), dpoll -5 means query the source many times per second (2^-5 s = 32 Hz), and tai tells Chrony this source is on TAI, which is offset 37 seconds from UTC.
Restart chrony
sudo systemctl restart chrony
After a minute or so, check Chrony’s sources with chronyc sources:
We can see that Chrony has successfully selected the PHC and has synchronized the system clock with it to within one single nanosecond!
6 – Serving PTP to other clients
You can of course repeat the process for N other clients. Alternatively, you can just have Chrony use hardware timestamps and enable the experimental F323 extension field, which enables some NTPv5 features that help a ton with synchronization. There is also an F324 field, which I haven’t tried, that appears to run PTP over NTP packets.
The relevant lines from my Chrony config:
peer 10.98.1.172 minpoll 0 maxpoll 0 iburst xleave extfield F323
peer 10.98.1.174 minpoll 0 maxpoll 0 iburst xleave extfield F323
# use tai if synchronized to a true TAI source, which ignores leap seconds
refclock PHC /dev/ptp0 poll 0 dpoll -5 tai prefer trust
allow all
hwtimestamp *
And if you don’t want to mess with Chrony and just want to synchronize your system clock directly from the PHC on your clients – /etc/systemd/system/phc2sys-client.service
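The unit files themselves didn’t make it into this draft; here is a sketch of what they could look like (the paths, NIC name enp0s31f6, and option choices are assumptions – adjust to your setup). /etc/systemd/system/ptp4l-client.service:

```
[Unit]
Description=ptp4l PTP client (syncs NIC PHC from grandmaster)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/sbin/ptp4l -i enp0s31f6 -s
Restart=always

[Install]
WantedBy=multi-user.target
```

And /etc/systemd/system/phc2sys-client.service (note the direction is reversed vs the server: source is the NIC PHC, destination is the system clock; -w waits for ptp4l and picks up the TAI offset from it):

```
[Unit]
Description=phc2sys (syncs system clock from NIC PHC)
After=ptp4l-client.service
Requires=ptp4l-client.service

[Service]
ExecStart=/usr/sbin/phc2sys -s enp0s31f6 -c CLOCK_REALTIME -w
Restart=always

[Install]
WantedBy=multi-user.target
```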
sudo systemctl daemon-reload
sudo systemctl enable ptp4l-client.service
sudo systemctl enable phc2sys-client.service # If not using Chrony
sudo systemctl start ptp4l-client.service
sudo systemctl start phc2sys-client.service # If not using Chrony
Conclusion
We’ve come a long way in our pursuit of precise timing – from using GPS PPS signals for microsecond accuracy to achieving nanosecond-level synchronization with PTP. While this level of precision is absolutely overkill for most home setups (as was the microsecond timing from our previous adventures), it demonstrates what’s possible with relatively accessible hardware like the Raspberry Pi 5 and common Intel NICs.
The key takeaways from this exploration:
PTP with hardware timestamping can achieve double-digit nanosecond synchronization even with consumer-grade hardware
The quality of your network interface cards matters significantly – as we saw comparing the i219-LM, i350, and the OSA 5401
Simple optimizations like adjusting packet coalescing can have meaningful impacts on timing precision
Modern tools like Chrony make it relatively straightforward to integrate PTP into your existing time synchronization setup
For those interested in pushing timing precision even further, there are still frontiers to explore – from specialized timing NICs to advanced PTP profiles. But for now, I think I’ll stop here and enjoy my massively overengineered home time synchronization setup. At least until the next timing-related rabbit hole comes along…
ublox LEA-M8T sending PPS signals to Raspberry Pi 5 for extremely precise timing. Testing various Chrony/ublox configurations the evening of Feb 18 2025 – achieved 1 nanosecond (1 nanosecond = 0.000000001 seconds) accuracy! This was a somewhat lucky capture, but the Pi is routinely in the single digits, and skew hovers around 0.001-0.002 ppm error, which is 1-2 ppb error on the Pi clock.
Original Introduction
Lots of acronyms in that title. If I expand them out, it says – “microsecond-accurate network time protocol with a Raspberry Pi and global positioning system pulse per second”. What it means is you can get super accurate timekeeping (1 microsecond = 0.000001 seconds) with a Raspberry Pi and a GPS receiver that spits out pulses every second. By following this guide, you will have your very own Stratum 1 NTP server at home!
Why would you need time this accurate at home?
You don’t. There aren’t many applications for this level of timekeeping in general, and even fewer at home. But this blog is called Austin’s Nerdy Things so here we are. Using standard, default internet NTP these days will get your computers to within 2-4 milliseconds of actual time (1 millisecond = 0.001 seconds). Pretty much every internet connected device these days has a way to get time from the internet. PPS gets you to the next SI prefix in terms of accuracy (milli -> micro), which means 1000x more accurate timekeeping. With some other tricks, you can get into the nanosecond range (also an upcoming post topic!).
Materials Needed
Raspberry Pi 5 – the 3’s Ethernet hangs off a USB connection, so while the 3 itself can keep great time, it is severely limited in how accurately other machines can sync to it. A Raspberry Pi 4 would work decently. But the Raspberry Pi 5 supports Precision Time Protocol (PTP), which can get synchronization down to double-digit nanoseconds. So get the 5. Ideally, your Pi isn’t doing much other than keeping time, so no need to get one with lots of memory.
A timing-specific GPS module – these have algorithms tuned to provide extremely precise PPS signals. For example, by default they prefer satellites with higher elevations, and they have special fixed-position modes where they know they aren’t moving, so they focus on providing the best time possible. u-blox devices, for instance, have a “survey-in” mode where positions are essentially averaged over a specified amount of time, to specified standard deviations, down to a singular fixed location. Other options:
project box to stuff it all in – temperature stability is super important for accurate time. There is a reason some of the most accurate oscillators are called oven-controlled crystal oscillators (OCXO) – they are extremely stable. This box keeps airflow from minutely cooling/heating the Pi.
I did say “stuffed”, right? Not joking here… I stuffed some newspaper on top to minimize airflow, then closed it up. Caption: Raspberry Pi for timing with PPS GPS NTP in project box with DS18B20 temperature sensor
Steps
0 – Update your Pi and install packages
This NTP guide assumes you have a Raspberry Pi ready to go.
You should update your Pi to the latest packages before basically any project. We will install some other packages as well: pps-tools helps us check that the Pi is receiving PPS signals from the GPS module. We also need GPSd for decoding both time and position from the GPS (and for ubxtool, which we will use to survey-in). I use Chrony instead of NTPd because it seems to sync faster in most instances and also handles PPS without compiling from source (the default Raspbian NTP doesn’t do PPS). Installing chrony will remove ntpd.
sudo apt update
sudo apt upgrade
# this isn't really necessary, maybe if you have a brand new pi
# sudo rpi-update
sudo apt install pps-tools gpsd gpsd-clients chrony
1 – Add GPIO and module info where needed
In /boot/firmware/config.txt (changed from last post), add ‘dtoverlay=pps-gpio,gpiopin=18’ to a new line. This is necessary for PPS. If you want to get the NMEA data from the serial line, you must also enable UART and set the initial baud rate.
########## NOTE: at some point, the config file changed from /boot/config.txt to /boot/firmware/config.txt
sudo bash -c "echo '# the next 3 lines are for GPS PPS signals' >> /boot/firmware/config.txt"
sudo bash -c "echo 'dtoverlay=pps-gpio,gpiopin=18' >> /boot/firmware/config.txt"
sudo bash -c "echo 'enable_uart=1' >> /boot/firmware/config.txt"
sudo bash -c "echo 'init_uart_baud=9600' >> /boot/firmware/config.txt"
In /etc/modules, add ‘pps-gpio’ to a new line.
sudo bash -c "echo 'pps-gpio' >> /etc/modules"
Reboot
sudo reboot
Let’s also disable a bunch of stuff we don’t need:
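The exact list depends on your image and needs; these are typical candidates on a dedicated timekeeping Pi (an assumption on my part – skip any service you actually use):

```shell
# none of these are needed on a box that only keeps time
sudo systemctl disable --now bluetooth avahi-daemon triggerhappy
# if you're on wired ethernet, wifi can go too
sudo rfkill block wifi
```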
2 – wire up the GPS module to the Pi
Disclaimer – I am writing this guide with a combination of Raspberry Pi 4/5 and Adafruit Ultimate GPS module, but will swap out with the LEA-M8T when it arrives.
Pin connections:
GPS PPS to RPi pin 12 (GPIO 18)
GPS VIN to RPi pin 2 or 4
GPS GND to RPi pin 6
GPS RX to RPi pin 8
GPS TX to RPi pin 10
see 2nd picture for a visual
Adafruit Ultimate GPS Breakout V3
3 – enable serial hardware port
Run raspi-config -> 3 – Interface options -> I6 – Serial Port -> Would you like a login shell to be available over serial -> No. -> Would you like the serial port hardware to be enabled -> Yes.
screenshot showing raspberry serial port (UART) enabled
4 – verify PPS
First, check that PPS is loaded. You should see a single line showing pps_gpio:
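The check is a standard one-liner:

```shell
# should print a line containing pps_gpio if the overlay loaded
lsmod | grep pps
```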
Now check for the actual PPS pulses. NOTE: you need at least 4 satellites locked for a PPS signal. The GPS module essentially has 4 unknowns – x, y, z, and time. You need three satellites minimum to solve for x, y, and z, and a fourth for time. Exception for the timing modules – if they know their x, y, z via survey-in or a fixed set location, they only need a single satellite for time!
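To watch the pulses, pps-tools provides ppstest:

```shell
# should print a new "assert" timestamp line roughly every second
sudo ppstest /dev/pps0
```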
There are a couple options we need to tweak with GPSd to ensure it is available upon boot. This isn’t strictly necessary for PPS only operation, but if you want the general NMEA time information (i.e. not just the exact second marker from PPS), this is necessary.
Edit /etc/default/gpsd:
# USB might be /dev/ttyACM0
# serial might be /dev/ttyS0
# on raspberry pi 5 with raspberry pi os based on debian 12 (bookworm)
DEVICES="/dev/ttyAMA0 /dev/pps0"
# -n means start without a client connection (i.e. at boot)
GPSD_OPTIONS="-n"
# also start in general
START_DAEMON="true"
# Automatically hot add/remove USB GPS devices via gpsdctl
USBAUTO="true"
I’m fairly competent at using systemd and such in a Debian-based system, but there’s something about GPSd that’s a bit odd and I haven’t taken the time to figure out yet. So instead of enabling/restarting the service, reboot the whole Raspberry Pi.
sudo reboot
5 – check GPS for good measure
To ensure your GPS has a valid position, you can run gpsmon or cgps to check satellites and such. This check also ensures GPSd is functioning as expected. If your GPS doesn’t have a position solution, you won’t get a good time signal. If GPSd isn’t working, you won’t get any updates on the screen. The top portion will show the analyzed GPS data and the bottom portion will scroll by with the raw GPS sentences from the GPS module.
gpsmon is a bit easier to read for timing info, cgps is a bit easier to read for satellite info (and TDOP, timing dilution of precision, a measure of how accurate the GPS’s internal time determination is).
Here’s a screenshot from cgps showing the current status of my Adafruit Ultimate GPS inside my basement. There are 10 PRNs (satellites) seen, 8 used. It is showing “3D DGPS FIX”, which is the highest accuracy this module offers. The various *DOPs show the estimated errors. Official guides/docs usually say anything < 2.0 is ideal but lower is better. For reference, Arduplane (autopilot software for RC drones, planes) has a limit of 1.4 for HDOP. It will not permit takeoff with a value greater than 1.4. It is sort of a measure of how spread out the satellites are for that given measure. Evenly distributed around the sky is better for location, closer together is better for timing.
cgps
cgps showing 8 satellites used for this position determination, and a TDOP (time dilution of precision) of 1.31, which is decent. notably, cgps does not show the PPS offset
And gpsmon shows both the TOFF, which is the time offset of the NMEA $GPZDA sentence (which will always come in late due to how long it takes to transmit the dozens of bytes over serial – for example, a 79-byte sentence over a 9600 bit-per-second link, which is super common for GPS modules, takes 79 × (8 bits per byte + 1 start bit + 1 stop bit) / 9600 = 82.3 milliseconds), as well as the PPS offset. This particular setup is not actually using PPS at the moment. It also shows satellites and a few *DOPs, but notably lacks TDOP.
gpsmon
gpsmon showing 8 satellites used for the position with HDOP of 1.27. This indicates a decent position solution, but doesn’t say anything about the time solution.
Both gpsmon and cgps will stream the sentences received from the GPS module.
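That serial-delay arithmetic generalizes to any sentence length and baud rate; a quick sanity check (the function name is mine, just for illustration):

```python
def nmea_transmit_ms(n_bytes: int, baud: int) -> float:
    """Time to shift an NMEA sentence out a UART, in milliseconds.

    Each byte on the wire is 10 bits: 8 data + 1 start + 1 stop (8N1).
    """
    return n_bytes * 10 / baud * 1000

print(f"{nmea_transmit_ms(79, 9600):.1f} ms")   # 79-byte sentence at 9600 bps -> 82.3 ms
print(f"{nmea_transmit_ms(79, 38400):.1f} ms")  # same sentence at 38400 bps -> 20.6 ms
```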
6 – configure chrony to use both NMEA and PPS signals
Now that we know our Raspberry Pi is receiving both the precision second marker (via PPS), as well as the time of day (TOD) data (via the NMEA $GPMRC and $GPZDA sentences), let’s set up chrony to use both sources for accurate time.
This can be done as a one step process, but it is better to gather some statistics about the delay on your own NMEA sentences. So, let’s add our reference sources and also enable logging for chrony.
In the chrony configuration file (/etc/chrony/chrony.conf), add the following near the existing server directives
# SHM refclock is shared memory driver, it is populated by GPSd and read by chrony
# it is SHM 0
# refid is what we want to call this source = NMEA
# offset = 0.000 means we do not yet know the delay
# precision is how precise this source is. note: 1e-3 = 1 millisecond, so not very precise
# poll 0 means poll every 2^0 seconds = 1 second poll interval
# filter 3 means take the average/median (forget which) of the 3 most recent readings. NMEA can be jumpy so we're averaging here
refclock SHM 0 refid NMEA offset 0.000 precision 1e-3 poll 0 filter 3
# PPS refclock is PPS specific, with /dev/pps0 being the source
# refid PPS means call it the PPS source
# lock NMEA means this PPS source will also lock to the NMEA source for time of day info
# offset = 0.0 means no offset... this should probably always remain 0
# poll 3 = poll every 2^3=8 seconds. polling more frequently isn't necessarily better
# trust means we trust this time. the NMEA will be kicked out as false ticker eventually, so we need to trust the combo
refclock PPS /dev/pps0 refid PPS lock NMEA offset 0.0 poll 3 trust
# also enable logging by uncommenting the logging line
log tracking measurements statistics
Restart chrony
sudo systemctl restart chrony
Now let’s check to see what Chrony thinks is happening:
chronyc sources
This screenshot was taken seconds after restarting chrony. The * in front of NMEA means that’s the currently selected source. This makes sense since the PPS source hasn’t even been polled yet (see the 0 in the reach column). The ? in front of PPS means it isn’t sure about it yet.
Wait a minute or two and try again.
Now Chrony has selected PPS as the current source, with the * in front. The NMEA source has been marked as a “false ticker” with the x in front. But since we trusted the PPS source, it’ll remain the preferred source. Having only two sources is usually not advisable with general internet NTP servers – if they disagree, Chrony can’t know which one is right, hence >2 is recommended.
The relatively huge estimated error is because Chrony used the NMEA source first, which was quite a bit off of the PPS precise second marker (i.e. >100 milliseconds off), and it takes time to average down to a more realistic number.
Since we turned on statistics, we can use that to set an exact offset for NMEA. After waiting a bit (an hour or so), you can cat /var/log/chrony/statistics.log:
We are interested in the ‘Est offset’ (estimated offset) for the NMEA “IP Address”. Here’s a python script to run some numbers for you – just copy + paste the last 100 or so lines from the statistics.log file into a file named ‘chrony_statistics.log’ in the same directory as this python file:
import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO


def parse_chrony_stats(file_path):
    """
    Parse chrony statistics log file and return a pandas DataFrame
    """
    # read file contents first
    with open(file_path, 'r') as f:
        file_contents = f.readlines()

    # skip the header/separator lines (they start with '=' or ' ')
    file_contents = [line for line in file_contents
                     if not line.startswith('=') and not line.startswith(' ')]
    # exclude lines that include 'PPS'
    file_contents = [line for line in file_contents if 'PPS' not in line]

    # Use StringIO to create a file-like object from the filtered contents
    csv_data = StringIO(''.join(file_contents))

    # Read the filtered data using pandas
    # (sep=r'\s+' replaces the deprecated delim_whitespace=True)
    df = pd.read_csv(csv_data,
                     sep=r'\s+',
                     names=['Date', 'Time', 'IP_Address', 'Std_dev', 'Est_offset',
                            'Offset_sd', 'Diff_freq', 'Est_skew', 'Stress',
                            'Ns', 'Bs', 'Nr', 'Asym'])

    # Combine Date and Time columns into a datetime column
    df['timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
    return df


def plot_est_offset(df):
    """
    Create a plot of Est_offset vs time for each IP address
    """
    plt.figure(figsize=(12, 6))

    # Plot each IP address as a separate series
    for ip in df['IP_Address'].unique():
        ip_data = df[df['IP_Address'] == ip]
        plt.plot(ip_data['timestamp'], ip_data['Est_offset'],
                 marker='o', label=ip, linestyle='-', markersize=4)

    plt.xlabel('Time')
    plt.ylabel('Estimated Offset (seconds)')
    plt.title('Chrony Estimated Offset Over Time by IP Address')
    plt.legend()
    plt.grid(True)

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45)
    # Adjust layout to prevent label cutoff
    plt.tight_layout()
    return plt


def analyze_chrony_stats(file_path):
    """
    Main function to analyze chrony statistics
    """
    # Parse the data
    df = parse_chrony_stats(file_path)

    # Create summary statistics
    summary = {
        'IP Addresses': df['IP_Address'].nunique(),
        'Time Range': f"{df['timestamp'].min()} to {df['timestamp'].max()}",
        'Average Est Offset by IP': df.groupby('IP_Address')['Est_offset'].mean().to_dict(),
        'Max Est Offset by IP': df.groupby('IP_Address')['Est_offset'].max().to_dict(),
        'Min Est Offset by IP': df.groupby('IP_Address')['Est_offset'].min().to_dict(),
        'Median Est Offset by IP': df.groupby('IP_Address')['Est_offset'].median().to_dict()
    }

    # Create the plot
    plot = plot_est_offset(df)
    return df, summary, plot


# Example usage
if __name__ == "__main__":
    file_path = "chrony_statistics.log"  # Replace with your file path
    df, summary, plot = analyze_chrony_stats(file_path)

    # Print summary statistics
    print("\nChrony Statistics Summary:")
    print("-" * 30)
    print(f"Number of IP Addresses: {summary['IP Addresses']}")
    print(f"Time Range: {summary['Time Range']}")
    print("\nAverage Estimated Offset by IP:")
    for ip, avg in summary['Average Est Offset by IP'].items():
        print(f"{ip}: {avg:.2e}")
    print("\nMedian Estimated Offset by IP:")
    for ip, median in summary['Median Est Offset by IP'].items():
        print(f"{ip}: {median:.2e}")

    # Show the plot
    plt.show()
We get a pretty graph (and by pretty, I mean ugly – this is highly variable; at the slow default 9600 bits per second, the timing will actually be influenced by the number of seen/tracked satellites, since we haven’t changed which sentences are output) and some outputs.
matplotlib chart for chrony offset for NMEA source running at 9600 bps
And the avg/median offset:
Chrony Statistics Summary:
------------------------------
Number of IP Addresses: 1
Time Range: 2025-02-14 17:33:55 to 2025-02-14 17:38:26
Average Estimated Offset by IP:
NMEA: -2.71e-01
Median Estimated Offset by IP:
NMEA: -2.65e-01
So we need to pick a number here for the offset. They do not differ by much – 271 milliseconds vs 265. Let’s just split the difference at 268. Very scientific. With this number, we can change the offset in the chrony config for the NMEA source. Make it positive to offset the delay.
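With that measured value, the NMEA refclock line from earlier becomes (0.268 is my number – yours will differ):

```
refclock SHM 0 refid NMEA offset 0.268 precision 1e-3 poll 0 filter 3
```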
This usually works, but I wasn’t getting good results, so please refer to the previous post for how this should look. Turns out that with the default sentences, some of the timing was arriving 900-1000 milliseconds late, meaning the Pi was synchronizing a full second later than actual. A couple of options to resolve this: increase the baud rate, and/or reduce/eliminate unnecessary NMEA sentences. I increased the baud rate below, which won’t be necessary for any module whose default baud rate is higher than 9600. If you don’t care about monitoring the GPS status, disable all sentences except ZDA (time info).
I took an hour or so detour here to figure out how to change the baudrate on the MTK chip used in the Adafruit GPS module.
Long story short on the baudrate change:
austin@raspberrypi5:~ $ cat gps-baud-change.sh
#!/bin/bash
# Stop gpsd service and socket
sudo systemctl stop gpsd.service gpsd.socket
# Set the baud rate
sudo gpsctl -f -x "$PMTK251,38400*27\r\n" /dev/ttyAMA0
# Start gpsd back up
sudo systemctl start gpsd.service
#gpsd -n -s 38400 /dev/ttyAMA0 /dev/pps0
sudo systemctl restart chrony
How to automate this via systemd or whatever is the topic for another post. The GPS module will keep the baudrate setting until it loses power (so it’ll persist through a Pi reboot!).
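If you want to build other PMTK commands (for a different baud rate, say), the checksum after the * is just the XOR of every character between the $ and the *; a small helper (function name is mine):

```python
def pmtk_checksum(body: str) -> str:
    """XOR of all characters between '$' and '*', as two uppercase hex digits."""
    csum = 0
    for ch in body:
        csum ^= ord(ch)
    return f"{csum:02X}"

# reproduce the baud-rate command used above
body = "PMTK251,38400"
print(f"${body}*{pmtk_checksum(body)}")  # -> $PMTK251,38400*27
```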
Turns out that the offset needs to be 511 milliseconds for my Pi/Adafruit GPS at 38400 bps:
Now we can check what Chrony is using for sources with
chronyc sources
# or if you want to watch it change as it happens
watch -n 1 chronyc sources
chronyc sources showing a lock on PPS (denoted with *) and false ticker on NMEA (denoted with x), which is the expected and desired status after a couple minutes
Many people asked how to get both time/NMEA and PPS from a single GPS receiver (i.e. without a 3rd source) and this is how. The keys are the lock directive as well as the trust directive on the PPS source.
8 – results
Check what chrony thinks of the system clock with
chronyc tracking
Here we see a few key items:
System time – updated every second with the difference between what Chrony thinks the time is and what the system clock says; here we are showing 64 nanoseconds
Last offset – how far off the system clock was at the last update (from whatever source is selected). I got lucky with this capture, which shows 0 nanoseconds off
RMS offset – a long term average of error. I expect this to get to low double-digit nanoseconds. Decreasing further is the topic of the next post.
Frequency – the drift of the system clock. This number can kind of be whatever, as long as it is stable, but the closer to zero, the better. There is always a correlation between oscillator temperature and frequency. This is what chrony is constantly correcting.
Residual frequency – difference from what the frequency is and what it should be (as determined by the selected source)
Skew – error in the frequency – lower is better. Less than 0.05 is very stable.
Root delay/dispersion – basically how far your chrony is from the “source” of truth
Update interval – self explanatory
9 – Grafana dashboard showing Chrony stats
And to track the results over time, I feed the Chrony data to InfluxDB via Telegraf. Another topic for a future post. The dashboard looks like this:
Here we can see a gradual increase in the frequency of the Raspberry Pi 5 system clock. The offsets are almost always within 1 microsecond, with an average of 16.7 nanoseconds. The spikes in skew correspond to the spikes in offsets. Something is happening on the Pi that probably spikes CPU load (even though I have the CPU throttled to powersave mode), which speeds things up and affects the timing via power-state transitions, oscillator temperature changes, or both.
Conclusion
In 2025, a GPS sending PPS to Raspberry Pi is still a great way to get super accurate time. In this Chrony config, I showed how to get time of day, as well as precision seconds without an external source. Our offsets are well under one microsecond.
And for the post after that – here’s a preview of using PTP from an Oscilloquartz OSA 5401 SyncPlug. Note the standard deviations and offsets. This device has an OCXO – oven-controlled crystal oscillator – with frequency stability measured in ppb (parts per billion). It also has a NEO-M8T timing chip, from the same u-blox M8T timing family I mentioned at the beginning of this post.
screenshot showing three terminals up – 1) chrony sources, with a PHC (physical hardware clock) in the NIC. error is shown as +/- 38 nanoseconds. 2) chrony source stats, showing a standard deviation of 2 nanoseconds for that same source and 3) linuxptp synchronizing the PHC from the OSA 5401 SyncPlug with all numbers shown in nanoseconds. rms is error of the PHC from the SyncPlug, max is max offset, freq is pretty bad for this oscillator at -49xxx PPM, delay is ethernet delay (3.1 microseconds)
The OSA 5401 SyncPlug is quite difficult to come by (I scored mine for $20 – shoutout to the servethehome.com forums! this device likely had a list price in the thousands) so I’ll also show how to just set up a PTP grandmaster (that’s the official term) on your Raspberry Pi.
Next Steps
Document commands to set ublox module to 16 Hz timepulses
Document commands to set ublox to survey-in upon power on
Document commands to set ublox to use GPS + Beidou + Galileo