Tag: gpu

One Layer, +12%: What 667 Configs Reveal About Small LLM Anatomy

Post author By Austin
Post date April 14, 2026
No Comments on One Layer, +12%: What 667 Configs Reveal About Small LLM Anatomy

Math delta (left), EQ delta (center), and combined delta (right) across all 667 (i,j) layer duplication configs. The three-phase encode/reason/decode anatomy is clearly visible at 4B scale.

I’ve been messing around with local LLMs on my 3090 for a while now — I have a growing collection of Qwen models on D:\LLM that I probably should be embarrassed about. A few weeks ago I stumbled across David Noel Ng’s LLM Neuroanatomy blog posts, where he showed that you can take a pretrained transformer and literally just re-run some of its middle layers a second time at inference, no retraining needed, and get meaningfully better outputs.

$Windows Explorer showing my D:\LLM folder stuffed with Qwen models and GGUF files$

The D:\LLM folder. I should probably be embarrassed about this.

The idea is wild: the model’s weights don’t change. You just tell it “hey, run layers 15 through 21 again” and the hidden state gets another pass through those same weights. Ng showed this working on Qwen3.5-27B (a 64-layer model) with up to +15.6% improvement on combined math and emotional reasoning benchmarks.

Naturally, I wanted to know if this works on smaller models too. Welcome to Austin’s Nerdy Things, where we perform brain surgery on 4-billion-parameter language models to make them think twice.

Background: What Is RYS?

Ng’s technique is called RYS and the core concept is surprisingly simple. A normal transformer forward pass goes:

Process input through layers 0, 1, 2, …, N-1 sequentially
Done

With RYS, you pick a contiguous block in the middle — layers i through j-1 — and after the model finishes layer j-1, you jump back and re-execute layers i through N-1. Those middle layers run twice on the evolving hidden state.

The reason this can work is that transformer layers aren’t all doing the same thing. Ng’s work showed models have a recognizable three-phase anatomy:

Early layers (~0-15% depth): Encoding. Converting tokens into contextualized representations. Repeating these produces garbage — the model tries to re-encode already-encoded stuff.
Middle layers (~20-60% depth): Reasoning. The actual thinking. Repeating these is like giving the model extra time to work through the problem.
Late layers (~70-100% depth): Decoding. Converting internal representations back into token predictions. Repeating these also produces garbage.

Ng found the sweet spot consistently in the middle, and his RYS repo provides all the tooling to test this — layer duplication wrappers, benchmark probe sets, the whole thing.

But his experiments were on a 27B model with 64 layers. I wanted to know: does this three-phase anatomy even exist at 4B scale? Can you exploit it on consumer hardware?

The Setup

Model: Qwen3-4B. I picked this one specifically because it’s a pure dense transformer (36 layers, 2560 hidden dim, GQA with 32 Q / 8 KV heads, RoPE, BF16). The Qwen3.5-2B has hybrid linear/full attention which would complicate things, and Qwen3-4B is in the same model family as Ng’s 27B target, which makes cross-scale comparison cleaner.

Hardware: My trusty RTX 3090 (24 GB VRAM). The model takes about 8.1 GB at baseline, which leaves plenty of room for the KV cache overhead from layer duplication.

Benchmarks: I used Ng’s probe sets from the RYS repo:

Math-16: 16 hard math questions (square roots, cube roots, big multiplications) requiring single-integer answers. Scored with digit-level partial credit. No chain-of-thought allowed. Greedy decoding, 64 max new tokens.
EQ-16: 16 EQ-Bench scenarios — complex social dialogues where the model predicts 4 emotion intensities on a 0-10 scale. Max 256 new tokens.

I used /no_think to disable Qwen3’s thinking mode so we’re measuring raw single-pass capability, and greedy decoding (do_sample=False), which I verified is perfectly deterministic across 5 runs on the same input. No need for multi-run variance testing.

The sweep: All 667 valid (i, j) configurations for a 36-layer model, including baseline (0, 0). Every config runs all 32 probe questions. I added early stopping that triggers if the first 2 math probes both produce garbage (saves about 30% of wall time on broken configs). The scanner saves results to JSON after every single config — resume-friendly for when Windows decides it’s update time.

Total sweep time: about 9 hours on a single 3090. Claude helped me write the scanner script (with me providing the architecture decisions and Ng’s RYS library doing the heavy lifting on layer manipulation).

Baseline Scores

Before messing with anything, Qwen3-4B scores:

Probe	Score
Math-16	0.305
EQ-16	0.749
Combined	1.054

The math score looks low, but these are genuinely hard problems (like “what is the cube root of 1019330085047 times 31?”) and the scorer gives partial credit for getting digits right. The EQ score is actually solid — Qwen3-4B is pretty decent at predicting emotional dynamics even without chain-of-thought.

The Heatmaps

Here’s where it gets fun. I swept all 667 configs and plotted the results as heatmaps. Each cell is one (i, j) configuration. Red means improvement over baseline. Blue means degradation. The x-axis is j (where the repeated block ends) and the y-axis is i (where it starts).

Left: math delta. Center: EQ delta. Right: combined delta. Red = improvement, blue = degradation. 667 configs, 36 layers.

Three things jumped out immediately.

1. The three-phase anatomy is clearly present at 4B scale

The top-left corner (early layers duplicated with wide spans) is deep blue — that’s the encoding zone. The bottom-right corner (late layers) is also blue — that’s the decoding zone. The productive region runs diagonally through the middle. This is exactly the encode / reason / decode structure Ng found at 27B.

Layers 0-6 are the encoding wall. Repeat anything starting before layer 5 with a wide span and the model outputs garbage. Layers 30+ are decoding territory — also garbage if you repeat there. The productive zone lives between layers ~5 and ~27, spanning roughly 60% of the model.

2. Math and EQ have different hot zones

This was something I wasn’t expecting. The math heatmap shows gains across a broad band from mid-stack to upper layers. The EQ heatmap’s gains concentrate in a tighter region around layers 7-16. The combined heatmap shows three distinct hot zones:

Zone A (layers 7-15, ~19-42% depth): Strong EQ gains, moderate math
Zone B (layers 15-20, ~42-56% depth): Balanced improvement on both
Zone C (layers 21-27, ~58-75% depth): Strong math gains, EQ roughly neutral

So the model’s “emotional processing” lives slightly earlier in the stack than its “mathematical processing.” That’s a cool finding — different kinds of reasoning occupy different layer ranges even in a small model.

3. The encoding wall is a cliff, not a slope

The transition from “productive duplication” to “catastrophic failure” happens over 1-2 layers. Layer 5 duplication helps. Layer 3 duplication tanks the model. There’s basically no gradient — it’s a cliff edge. Ng observed something similar at 27B but it’s even more pronounced at 4B scale.

The Pareto Frontier

Not all improvements are worth the extra latency. Each extra layer traversal costs time. The practical question is: how much bang per buck?

X-axis: overhead (%). Y-axis: combined score. The curve is sharply concave — almost all the benefit comes from the first 1-2 extra layers.

Size	Config (i,j)	Extra layers	Overhead	Combined	Improvement
XS	(2,3)	1	2.8%	1.090	+3.4%
S	(5,6)	1	2.8%	1.154	+9.6%
M	(21,22)	1	2.8%	1.179	+11.9%
L	(21,23)	2	5.6%	1.192	+13.2%
XL	(14,36)	22	61.1%	1.202	+14.0%

The winner is (21,22): just repeat layer 21 once. That’s +11.9% combined improvement at 2.8% latency overhead. One single extra layer forward pass. That’s it.

Going from 1 extra layer to 2 buys another 1.3 percentage points. Going from 2 extra layers to 22 — literally 10x the overhead — buys only 0.8 more. The returns collapse fast. Look at that Pareto chart — the curve is basically flat after the first couple of points.

Single-Layer Repeats: The 4B Surprise

This is where things get really interesting, and where the results diverge most from Ng’s 27B findings.

Ng reported that single-layer repeats at 27B “almost never help.” You need to duplicate a contiguous block of at least 2-3 layers to see meaningful improvement at that scale.

At 4B? 14 out of 35 single-layer repeats beat baseline. Here are the top performers:

Layer	Config	Combined delta
21	(21,22)	+0.126
5	(5,6)	+0.101
24	(24,25)	+0.100
26	(26,27)	+0.073
19	(19,20)	+0.071
22	(22,23)	+0.063
20	(20,21)	+0.057
17	(17,18)	+0.049

Layers 5 and 17-26 — nearly the entire mid-to-late stack — all produce meaningful gains when repeated individually. That’s a wide, diffuse productive zone spanning about 60% of the model.

My interpretation: smaller models have less specialized layers. At 27B with 64 layers, each layer does something specific enough that repeating just one doesn’t help much — you need a coherent block. At 4B with 36 layers, individual layers carry more general-purpose reasoning capacity. A single extra pass through one of them is already enough to bump quality.

This is arguably the most practically useful finding from the whole experiment. For small models, even the simplest possible intervention works.

How This Compares to Ng’s 27B Results

Property	Qwen3-4B (36 layers)	Qwen3.5-27B (64 layers)
Three-phase anatomy	Yes, clearly visible	Yes
Encoding wall	Layers 0-6 (~0-17%)	~first 15%
Best single-layer	(21,22) = +11.9%	Rarely productive
Best absolute	(14,36) = +14.0% at 61% overhead	~+15.6% at ~15.6% overhead
Best efficiency	(21,22) = +11.9% at 2.8% overhead	Layer 33 = +1.5%
Productive single layers	14/35 (40%)	Rare
Efficiency curve shape	Sharply concave	Roughly linear to ~10 layers

The biggest difference is the shape of the efficiency curve. At 27B, adding more repeated layers gives roughly linear improvement up to about 10 extra layers — there’s a real reason to invest in multi-layer duplication. At 4B, the curve is sharply concave. Almost all the benefit comes from the very first extra layer. After that, you’re paying a lot of overhead for very little gain.

This makes intuitive sense. A bigger model has more specialized layers where repetition compounds — each one contributes something distinct. A smaller model gets most of its benefit from a single extra pass through its most general-purpose reasoning layer, and additional passes hit diminishing returns because those layers are doing similar work.

What This Means If You Want to Use It

If you’re deploying a small dense model and want better reasoning at minimal cost:

Find the model’s “layer 21.” Run a quick single-layer sweep on your target model. It takes minutes per config.
Repeat that one layer. At 2.8% latency overhead, this is basically free.
Don’t over-invest in multi-layer duplication at small scale. The second extra layer buys way less than the first.

For framework implementers: this is a ~10-line change to a model’s forward pass. No weight changes, no retraining, no meaningful VRAM increase. It should be a first-class inference option in llama.cpp, vLLM, ExLlama, etc.

Caveats

I want to be upfront about what this doesn’t prove:

One model, one family. These results are Qwen3-4B specific. The three-phase anatomy probably generalizes (Ng showed it on multiple architectures), but the exact layer numbers won’t. Every model needs its own sweep.
Small probe sets. 16 math + 16 EQ questions. Enough for relative ordering of configs, but the absolute scores have meaningful variance. Validate on larger benchmarks before deploying.
Greedy decoding only. Sampling might interact differently with layer duplication. I haven’t tested that.
BF16 only. Quantized models might behave differently — layer duplication could amplify quantization errors.
No multi-block compositions. Ng’s beam search finds configs that repeat two different blocks (e.g., layers 30-34 AND 43-45). I only tested single contiguous blocks. The multi-block space at 4B is unexplored.
RoPE positions aren’t adjusted. The model sees the same position IDs on the repeated pass. This works empirically but the theoretical interaction is unclear.

Reproducing This

Everything runs on a single 3090 (or any 24GB+ GPU):

Model: Qwen/Qwen3-4B in BF16 (~8 GB VRAM)
Tooling: Ng’s RYS repo for the layer duplicator and probe datasets
Scanner: Custom rys_scan.py — single file, ~460 lines, resume-friendly
Analysis: Custom rys_analyze.py — heatmaps, Pareto frontier, markdown summary
Dependencies: PyTorch, transformers, matplotlib, numpy
Wall time: ~9 hours for the full 667-config sweep

The scanner loads the model once, pre-tokenizes all probes, then iterates through configs. Each config wraps the base model with a layer-index remapping (no weight copies, just pointer rearrangement), runs all probes greedy, scores, and saves. Resume works by checking which config keys already exist in the results JSON.

What’s Next

A few obvious follow-ups I’m thinking about:

Multi-block beam search at 4B. Does combining layers 5-6 and 21-22 compound the gains?
Cross-scale comparison. Run the same sweep on Qwen3.5-2B (hybrid attention), Qwen3.5-9B, maybe a non-Qwen model. See how the efficiency curve changes with scale.
Train-time loop exposure. Train a small model where specific layers are looped during training, compare with inference-time-only duplication.
Integration with inference frameworks. llama.cpp, vLLM, and ExLlama already manage layer weights — adding a “repeat layer N” flag should be pretty straightforward.

The broader takeaway is that transformer layers aren’t interchangeable. They have structure, and that structure is legible even at small scale. You can exploit it at inference time with zero retraining, and the cost is basically nothing.

Layer 21 thinks twice. The model gets smarter. That’s the whole trick.

This work builds directly on David Noel Ng’s RYS and LLM Neuroanatomy series — go read those if you haven’t, they’re excellent. Related academic work: Reasoning with Latent Thoughts and LoopFormer. Data, scanner script, and full results available on GitHub.

Tags gpu, inference, LLM, machine learning, python, qwen

Stable Diffusion Tutorial – Nvidia GPU Installation

Post author By Austin
Post date February 24, 2023
1 Comment on Stable Diffusion Tutorial – Nvidia GPU Installation

screenshot of UI showing AI-generated cat using stable diffusion 1.5 via automatic1111/stable-diffusion-webui with default settings

Like most other internet-connected people, I have seen the increase in AI-generated content in recent months. ChatGPT is fun to use and I’m sure there are plenty of useful use cases for it but I’m not sure I have the imagination required to use it to it’s full potential. The AI art fad of a couple months ago was cool too. In the back of my mind, I kept thinking “where will AI take us in the next couple years”. I still don’t know the answer to that. The only “art” I am good at is pottery (thanks to high-school pottery class – I took 4 semesters of it and had a great time doing so, whole different story). But now I’m able to generate my own AI art thanks to a guide I found the other day on /g/. I am re-writing it here with screenshots and a bit more detail to try and make it more accessible to general users.

NOTE: You need a decent/recent Nvidia GPU to follow this guide. I have a RTX 2080 Super with 8GB of VRAM. There are low-memory workarounds but I haven’t tested them yet. An absolute limit is 2GB VRAM, and a GTX 7xx (Maxwell architecture) or newer GPU.

Stable Diffusion Tutorial Contents

Installing Python 3.10
Installing Git (the source control system)
Clone the Automatic1111 web UI (this is the front-end for using the various models)
Download models
Adjust memory limits & enable listening outside of localhost
First run
Launching the web UI
Generating Stable Diffusion images

Video version of this install guide

Coming soon. I always do the written guide first, then record based off the written guide. Hopefully by end of day (mountain time) Feb 24.

1 – Installing Python 3.10

This is relatively straight-forward. To check your Python version, go to a command line and enter

python --version

If you already have Python 3.10.x installed (as seen in the screenshot below), you’re good to go (minor version doesn’t matter).

Python 3.10 installed for Stable Diffusion

If not, go to the Python 3 download page and select the most recent 3.10 version. As of writing, the most recent is 3.10.10. Download the x64 installer and install. Ensure the “add python.exe to PATH” checkbox is checked. Adding python.exe to PATH means it can be called with only python at a command prompt instead of the full path, which is something like c:/users/whatever/somedirectory/moredirectories/3.10.10/python.exe.

Installing python and adding python.exe to PATH

2 – Installing Git (the source control system)

This is easier than Python – just install it – https://git-scm.com/downloads. Check for presence and version with git –version:

git installed and ready to go for Stable Diffusion

3 – Clone the Automatic1111 web UI (this is the front-end for using the various models)

With Git, clone means to download a copy of the code repository. When you clone a repo, a new directory is created in whatever directory the command is run in. Meaning that if you navigate to your desktop, and run git clone xyz, you will have a new folder on your desktop named xyz with the contents of the repository. To keep things simple, I am going to create a folder for all my Stable Diffusion stuff in the C:/ root named sd and then clone into that folder.

Open a command prompt and enter

cd c:\

Next create the sd folder and enter it:

mkdir sd
cd sd

Now clone the repository while in your sd folder:

git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui

After the git clone completes, there will be a new directory called ‘stable-diffusion-webui’:

stable-diffusion-webui cloned and ready to download models

4 – Download models

“Models” are what actually generate the content based on provided prompts. Generally, you will want to use pre-trained models. Luckily, there are many ready to use. Training your own model is far beyond the scope of this basic installation tutorial. Training your own models generally also requires huge amounts of time crunching numbers on very powerful GPUs.

As of writing, Stable Diffusion 1.5 (SD 1.5) is the recommended model. It can be downloaded (note: this is a 7.5GB file) from huggingface here.

Take the downloaded file, and place it in the stable-diffusion-webui/models/Stable-diffusion directory and rename it to model.ckpt (it can be named anything you want but the web UI automatically attempts to load a model named ‘model.ckpt’ upon start). If you’re following along with the same directory structure as me, this file will end up at C:\sd\stable-diffusion-webui\models\Stable-diffusion\model.ckpt.

Another popular model is Deliberate. It can be downloaded (4.2GB) here. Put it in the same folder as the other model. No need to rename the 2nd (and other) models.

After downloading both models, the directory should look like this:

Stable Diffusion 1.5 (SD 1.5) and Deliberate_v11 models ready for use

5 – Adjust memory limits & enable listening outside of localhost (command line arguments)

Inside the main stable-diffusion-webui directory live a number of launcher files and helper files. Find webui-user.bat and edit it (.bat files can be right-clicked -> edit).

Add –medvram (two dashes) after the equals sign of COMMANDLINE_ARGS. If you also want the UI to listen on all IP addresses instead of just localhost (don’t do this unless you know what that means), also add –listen.

webui-user.bat after edits

@echo off

set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--listen --medvram

call webui.bat

6 – First run

The UI tool (developed by automatic1111) will automatically download a variety of requirements upon first launch. It will take a few minutes to complete. Double-click the webui-user.bat file we just edited. It calls a few .bat files and eventually launches a Python file. The .bat files are essentially glue to stick a bunch of stuff together for the main file.

The very first thing it does is creates a Python venv (virtual environment) to keep the Stable Diffusion packages separate from your other Python packages. Then it pip installs a bunch of packages related to cuda/pytorch/numpy/etc so Python can interact with your GPU.

webui-user.bat using pip to install necessary python packages like cuda

After everything is installed and ready to go, you will see a line that says: Running on local URL: http://127.0.0.1:7860. That means the Python web server UI is running on your own computer on port 7860 (if you added –listen to the launch args, it will show 0.0.0.0:7860, which means it is listening on all IP addresses and can be accessed by external machinse).

stable-diffusion-webui launched and ready to load

7 – Launching the web UI

With the web UI server running, it can be accessed via browser on the same computer running the Python at http://127.0.0.1:7860. That link should work for you if you click it.

Note that if the Python process closes for whatever reason (you close the command window, your computer reboots, etc), you need to double-click webui-user.bat to relaunch it and it needs to be running any time you want to access the web UI.

Automatic1111 stable diffusion web UI up and running

As seen in the screenshot, there are a ton of parameters/settings. I’ll highlight a few in the next section

8 – Generating Stable Diffusion images

This is the tricky part. The prompts make or break your generation. I am still learning. The prompt is where you enter what you want to see. Negative prompt is where you enter what you don’t want to see.

Let’s start simple, with cat in the prompt. Then click generate. A very reasonable-looking cat should soon appear (typically takes a couple seconds per image):

AI-generated cat with stable diffusion 1.5 with default settings

To highlight a few of the settings/sliders:

Stable diffusion checkpoint – model selector. Note that it’ll take a bit to load a new model (the multi-GB files need to be read in their entirety and ingested).
Prompt – what you want to see
Negative prompt – what you don’t want to see
Sampling method – various methods to sample new points
Sampling steps – how many iterations to use for image generation for a single image
Width – width of image to generate (in pixels). NOTE, you need a very powerful GPU with a ton of VRAM to go much higher than the default 512
Height – height of image to generate (in pixels). Same warning applies as width
Batch count – how many images to include in a batch generation
Batch size – haven’t used yet, presumably used to specify how many batches to generate
CFG Scale – this slider tells the models how specific they need to be for the prompt. Higher is more specific. Really high values (>12ish) start to get a bit abstract. Probably want to be in the range of 3-10 for this slider.
Seed – random number generator seed. -1 means use a new seed for every image.

Some thoughts on prompt/negative prompt

From my ~24 hours using this tool, it is very clear that prompt/negative prompts are what make or break your generation. I think that your ability as a pre-AI artist would come in handy here. I am no artist so I have a hard time putting what I want to see into words. Take example prompt: valley, fairytale treehouse village covered, matte painting, highly detailed, dynamic lighting, cinematic, realism, realistic, photo real, sunset, detailed, high contrast, denoised, centered. I would’ve said “fairytale treehouse” and stopped at that. Compare the two prompts below with the more detailed prompt directly below and the basic “fairytale treehouse” prompt after that:

AI-generated “fairytale treehouse” via stable diffusion. Prompt: valley, fairytale treehouse village covered, matte painting, highly detailed, dynamic lighting, cinematic, realism, realistic, photo real, sunset, detailed, high contrast, denoised, centered

AI-generated “fairytale treehouse” via stable diffusion. Prompt: fairytale treehouse

One of these looks perfectly in place for a fantasy story. The other you could very possibly see in person in a nearby forest.

Both positive and negative can get very long very quickly. Many of the AI-generated artifacts present over the last month or two can be eliminated with negative prompt entries.

Example negative prompt: ugly, deformed, malformed, lowres, mutant, mutated, disfigured, compressed, noise, artifacts, dithering, simple, watermark, text, font, signage, collage, pixel

I will not pretend to know what works well vs not. Google is your friend here. I believe that “prompt engineering” will be very important in AI’s future. Google is your friend here.

Conclusion

AI-generated content is here. It will not be going away. Even if it is outlawed, the code is out there. AI will be a huge part of our future, regardless of if you want it or not. As the saying goes – pandora’s box is now open.

I figured it was worth trying. The guide this is based off made it relatively easy for me (but I do have above-average computer skill), and I wanted to make it even easier. Hopefully you found this ‘how to set up stable diffusion’ guide easy to use as well. Please let me know in the comments section if you have any questions/comments/feedback – I check at least daily!

Resources

Huge shout out to whoever wrote the guide (“all anons”) at https://rentry.org/voldy. That is essentially where this entire guide came from.

Disclosure: When you click on links to various merchants in this post and make a purchase, this can result in this site earning a commission. Affiliate programs and affiliations include, but are not limited to, the eBay Partner Network.

Tags ai, gpu, nvidia, python, stable-diffusion, tutorial