<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Michael Leung</title>
<link>https://mikkeyboi.github.io/</link>
<atom:link href="https://mikkeyboi.github.io/index.xml" rel="self" type="application/rss+xml"/>
<description>ML engineering, post-training, and notes from a neuroscientist-turned-modeller.</description>
<generator>quarto-1.6.42</generator>
<lastBuildDate>Sat, 02 May 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Why SFT learned the words but GRPO learned the rules</title>
  <dc:creator>Michael Min Wah Leung</dc:creator>
  <link>https://mikkeyboi.github.io/posts/01-sft-grpo-hvac/</link>
  <description><![CDATA[ 




<section id="the-three-letter-problem" class="level2">
<h2 class="anchored" data-anchor-id="the-three-letter-problem">The three-letter problem</h2>
<p>After supervised fine-tuning, our Phi-4 model could recite every tag in our naming table. Ask it the canonical question, <em>“what is the tag for the supply air static pressure setpoint?”</em>, and it would answer correctly. Ask it the <em>inverse</em>, <em>“give me the supply air static pressure setpoint”</em> without the word “tag”, and it would confidently emit:</p>
<blockquote class="blockquote">
<p><strong><code>SupplyAirStaticPressureSetpoint</code></strong></p>
</blockquote>
<p>A perfectly reasonable BACnet-style name. Discoverable, self-documenting, and completely absent from our system. The correct answer was three letters: <strong><code>SPS</code></strong>.</p>
<p>This post is about how I closed that gap with roughly 250 lines of reward function and a quarter-epoch of GRPO, and what the experience taught me about RL on language models that I had not picked up from reading papers.</p>
</section>
<section id="the-setting" class="level2">
<h2 class="anchored" data-anchor-id="the-setting">The setting</h2>
<p>I work on an internal AI assistant for an industrial domain that has its own naming taxonomy. The tags are short and opinionated, and they look nothing like the open-source conventions an LLM has seen during pretraining. Think <code>SPS</code>, <code>DAT</code>, <code>ZN-2_RHC</code> instead of the verbose, hierarchical strings that public BMS tutorials and BACnet documentation are full of.</p>
<p>The first instinct is RAG: index the table, retrieve the right row at query time. We tried it. It works for <em>direct lookup</em> (tag to description) and breaks for almost everything else: paraphrases, partial matches, descriptions that do not quote the table verbatim, anything that requires the model to <em>reason</em> about the structure of the names rather than recall a row.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/01-sft-grpo-hvac/images/01-rag-vs-finetuning.png" class="img-fluid figure-img"></p>
<figcaption>RAG retrieves examples; fine-tuning internalises the rules. For a fixed, rule-based vocabulary, the right tool is the one that learns the structure, not the one that looks it up.</figcaption>
</figure>
</div>
<p>The choice was straightforward. Teach the model the vocabulary directly. SFT got us most of the way; it did not get us all the way, and the gap was instructive.</p>
</section>
<section id="what-sft-got-right-and-what-it-didnt" class="level2">
<h2 class="anchored" data-anchor-id="what-sft-got-right-and-what-it-didnt">What SFT got right, and what it didn’t</h2>
<p>SFT (QLoRA on Phi-4, ~3 epochs over ~5k synthetic examples covering 7 scenario types) gave us a model that could:</p>
<ul>
<li>Recall the table verbatim when asked directly.</li>
<li>Answer multiple-choice distractor questions with the right tag.</li>
<li>Tolerate moderate paraphrasing in the <em>direct</em> direction (description provided, tag returned).</li>
</ul>
<p>It still failed at:</p>
<ul>
<li><strong>Reverse lookup.</strong> Given a description without the cue word “tag”, it would invent a plausible BACnet-style name instead of using ours.</li>
<li><strong>Refusing the unknown.</strong> Asked about equipment that was not in the taxonomy, it would confidently produce <em>something</em>, usually a tag from a related family, rather than acknowledging it did not know. Phi-4 SFT alone scored <strong>60% on <code>unknown_tag</code> refusal</strong>; SFT+GRPO took it to <strong>86.7%</strong>.</li>
<li><strong>Generic-naming drift.</strong> Given a typo or an ambiguous phrasing, it would back off to the verbose, English-sounding form it had seen during pretraining. SFT scored 40% on <code>typo_robustness</code>; SFT+GRPO took it to <strong>60%</strong>.</li>
</ul>
<p>These are not accuracy failures you fix with more SFT data. They are <em>preference</em> failures: the model’s distribution over plausible answers is wrong in a way that more cross-entropy loss does not address. Cross-entropy rewards being close to the target token. It does not punish a confident, fluent answer that happens to be drawn from the wrong vocabulary.</p>
<p>That is what RL is for.</p>
</section>
<section id="why-grpo-not-ppo" class="level2">
<h2 class="anchored" data-anchor-id="why-grpo-not-ppo">Why GRPO, not PPO</h2>
<p>I went with GRPO over PPO for the standard reason (no critic) and one less-standard reason that mattered more in practice. GRPO’s group-relative advantage gave me a much cleaner signal for reward shaping. Every prompt produces a small group of completions; advantages are normalised within the group. That means the absolute scale of the reward function matters less than the <em>ordering</em> it induces, which is exactly what I wanted when iterating on reward design.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/01-sft-grpo-hvac/images/02-grpo-vs-ppo.png" class="img-fluid figure-img"></p>
<figcaption>GRPO sidesteps the critic by computing advantages relative to a group of sampled responses. For a small post-training run on a domain task, that is a real engineering simplification: fewer moving parts, less memory, less to debug.</figcaption>
</figure>
</div>
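<p>The mechanics are compact enough to show. A minimal sketch of the group-relative advantage computation, with illustrative names; this is the textbook normalisation, not TRL internals:</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -&gt; np.ndarray:
    """Normalise rewards within each group of G completions per prompt.

    rewards: shape (num_prompts, G). Only the within-group ordering
    survives the normalisation, which is why the absolute scale of the
    reward function matters so much less than the ordering it induces.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Four completions for one prompt: any monotone rescaling of these
# rewards yields the same ordering of advantages.
print(group_relative_advantages(np.array([[1.0, -0.8, -0.8, 0.3]])))</code></pre></div>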
</section>
<section id="designing-a-reward-function-that-punishes-the-right-things" class="level2">
<h2 class="anchored" data-anchor-id="designing-a-reward-function-that-punishes-the-right-things">Designing a reward function that punishes the right things</h2>
<p>This is the section that matters. Most public GRPO write-ups use a one-line reward, a correctness flag or a regex match. Mine is ~250 lines and has <em>seven</em> scenarios with calibrated reward bands, because the failures I was trying to fix were qualitatively different from each other and a scalar correctness signal could not distinguish them.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/01-sft-grpo-hvac/images/03-grpo-reward-flow.png" class="img-fluid figure-img"></p>
<figcaption>The training loop on the left, the reward function’s per-scenario decision tree on the right. The asymmetric <code>reverse_lookup</code> band (+1.0 / -0.8) and the <code>unknown_tag</code> trap (+1.0 for refusal, -1.0 for confidently naming a real tag the model wasn’t asked about) are where most of the behavioural shift came from.</figcaption>
</figure>
</div>
<p>The reward bands, abbreviated:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Scenario</th>
<th>Correct</th>
<th>Wrong</th>
<th>Why this shape</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Reverse lookup (description to tag)</td>
<td><strong>+1.0</strong></td>
<td><strong>-0.8</strong></td>
<td>High-stakes, asymmetric. A wrong tag here is the failure mode I was trying to fix.</td>
</tr>
<tr class="even">
<td>Direct lookup (tag to description)</td>
<td>+0.7 / +0.3 (partial)</td>
<td>-0.5</td>
<td>SFT was already good here. Light reinforcement only.</td>
</tr>
<tr class="odd">
<td>Context reasoning (physical vs.&nbsp;virtual)</td>
<td>+0.6 + bonus</td>
<td>-0.4</td>
<td>Reward correct <em>reasoning keywords</em>, not just final answer.</td>
</tr>
<tr class="even">
<td>Unknown-tag refusal</td>
<td><strong>+1.0</strong></td>
<td><strong>-1.0</strong> (trap)</td>
<td>Confidently naming a real tag when asked about something out-of-table is the <em>worst</em> failure. The asymmetric trap forces the model to learn “I don’t know” as a high-reward action.</td>
</tr>
<tr class="odd">
<td>Typo robustness</td>
<td>+1.2 (recover)</td>
<td>-0.5</td>
<td>Edit-distance-based intended-tag recovery. Reward <em>correcting</em>, not refusing.</td>
</tr>
<tr class="even">
<td>Hedging penalty</td>
<td>n/a</td>
<td>-0.2</td>
<td>“I think it might be…” is worse than a confident wrong answer here, because hedging is what masked the failure during SFT eval.</td>
</tr>
<tr class="odd">
<td>Format / concision bonus</td>
<td>+0.1 each</td>
<td>n/a</td>
<td>Short, well-formatted answers preferred.</td>
</tr>
</tbody>
</table>
<p>A few specific reward-design decisions I’d defend:</p>
<p><strong>The <code>unknown_tag</code> trap (+1.0 vs -1.0) is the single most important band.</strong> It is what taught the model that “I don’t know” is an answer. Without the asymmetric penalty, the model would fall back to a related-family tag: fluent, plausible, wrong. With it, refusal becomes the high-reward action and the model stops gambling.</p>
<p><strong>The <code>reverse_lookup</code> penalty is asymmetric (-0.8 against +1.0)</strong> because the failure mode it targets, inventing a BACnet-style name, is a <em>fluent, confident</em> failure that an SFT eval set will under-measure. Symmetric rewards would let the model trade off these failures against easy wins on direct lookup. Asymmetric rewards make that trade unprofitable.</p>
<p><strong>The hedging penalty is small (-0.2) and intentional.</strong> It is not punishing the model for being uncertain; it is punishing it for <em>expressing</em> uncertainty in cases where the answer is recoverable. The right move on a typo is to recover and answer, not to hedge.</p>
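<p>For concreteness, a reduced sketch of the dispatcher. Every name here is illustrative rather than the production code (the real refusal check is more involved than a phrase list), but the structure, scenario routing into the calibrated bands above, is faithful:</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def reward_fn(prompt: str, completion: str, meta: dict) -&gt; float:
    """Scenario-routed reward. `meta` carries the scenario label and gold tag."""
    scenario, gold = meta["scenario"], meta.get("gold_tag")
    answer = completion.strip()
    reward = 0.0

    if scenario == "reverse_lookup":
        # High-stakes and asymmetric: inventing a BACnet-style name is
        # exactly the failure this band exists to make unprofitable.
        reward = 1.0 if answer == gold else -0.8
    elif scenario == "unknown_tag":
        # The trap: refusal is the high-reward action; confidently naming
        # a real tag the model was not asked about is the worst outcome.
        refused = any(p in answer.lower() for p in ("not in", "no tag", "unknown"))
        reward = 1.0 if refused else -1.0
    elif scenario == "typo_robustness":
        # Reward recovering the intended tag, not refusing.
        reward = 1.2 if answer == gold else -0.5
    # ... direct_lookup and context_reasoning bands elided ...

    if "might be" in answer.lower():    # hedging penalty
        reward -= 0.2
    if len(answer.split()) &lt;= 12:       # format / concision bonus
        reward += 0.1
    return reward</code></pre></div>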
</section>
<section id="conservative-grpo-refinement-not-relearning" class="level2">
<h2 class="anchored" data-anchor-id="conservative-grpo-refinement-not-relearning">Conservative GRPO: refinement, not relearning</h2>
<p>Most GRPO failure stories I have read come from the same place. Too much learning rate, too little KL anchor, too many epochs, and the model drifts off the SFT distribution into a degenerate reward-hacking mode that scores well on the reward function and is useless in production.</p>
<p>My hyperparameters were deliberately conservative:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Param</th>
<th>Value</th>
<th>Rationale</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>LoRA <code>r</code></td>
<td>4</td>
<td>Tiny adapter. Refine, do not relearn.</td>
</tr>
<tr class="even">
<td>LoRA <code>α</code></td>
<td>8</td>
<td>2× <code>r</code>, conventional.</td>
</tr>
<tr class="odd">
<td>Learning rate</td>
<td>5e-7</td>
<td>An order of magnitude below typical SFT.</td>
</tr>
<tr class="even">
<td>KL <code>β</code></td>
<td><strong>0.2</strong></td>
<td>Strong anchor to the SFT distribution.</td>
</tr>
<tr class="odd">
<td>Epochs</td>
<td><strong>0.25</strong></td>
<td>Stop <em>before</em> drift sets in.</td>
</tr>
<tr class="even">
<td>Group size <code>G</code></td>
<td>4</td>
<td>Smallest group that gives a meaningful relative advantage.</td>
</tr>
</tbody>
</table>
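<p>In TRL terms the run shape looks roughly like this. A sketch only: argument names churn across TRL versions (see the versioning note near the end of this post), and the checkpoint path and dataset are stand-ins:</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

peft_config = LoraConfig(r=4, lora_alpha=8, task_type="CAUSAL_LM")

config = GRPOConfig(
    learning_rate=5e-7,      # an order of magnitude below typical SFT
    beta=0.2,                # strong KL anchor to the SFT policy
    num_generations=4,       # group size G
    num_train_epochs=0.25,   # stop before drift sets in
)

trainer = GRPOTrainer(
    model="path/to/phi4-sft-checkpoint",   # hypothetical checkpoint path
    reward_funcs=reward_fn,                # wrapped to TRL's batched signature in practice
    args=config,
    train_dataset=train_dataset,           # assumed prepared elsewhere
    peft_config=peft_config,
)
trainer.train()</code></pre></div>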
<p>The thesis: <em>SFT already knows the tags. GRPO’s job is to reshape the model’s preferences over how to use them, not to teach it new ones.</em> That framing, RLHF as preference reshaping with a hard KL leash, is consistent with what the InstructGPT and Anthropic HH-RLHF papers describe in their own runs, and it is what the production behaviour confirmed. The model did not get smarter; it got <em>opinionated</em> in the right direction.</p>
</section>
<section id="when-the-residual-failures-wouldnt-budge-targeted-dpo" class="level2">
<h2 class="anchored" data-anchor-id="when-the-residual-failures-wouldnt-budge-targeted-dpo">When the residual failures wouldn’t budge: targeted DPO</h2>
<p>GRPO closed most of the gap. One residual failure remained: on a specific subclass of reverse-lookup queries (those that paraphrased a description in a way that overlapped lexically with public BACnet conventions), the model would still occasionally drift to the verbose form. The reward function could not distinguish those queries cleanly enough at the group level.</p>
<p>So I switched tools. I generated targeted preference pairs (correct tag preferred over the BACnet-style hallucination), oversampled the failing subclass with <code>reverse_weight=3</code>, and ran DPO on top of the GRPO checkpoint. The pattern, <em>use GRPO for broad behavioural shaping, use DPO for surgical fixes on residual failures</em>, felt right and is something I would reach for again.</p>
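<p>The pair construction is simple enough to sketch. <code>reverse_weight</code> is the real knob from my run; the rest of the names are reconstructed for the post:</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def build_preference_pairs(failures, reverse_weight=3):
    """Turn logged residual failures into DPO preference pairs, with the
    failing reverse-lookup subclass oversampled so DPO spends its capacity
    where GRPO's group-level signal could not reach."""
    pairs = []
    for f in failures:
        pair = {
            "prompt": f["query"],
            "chosen": f["gold_tag"],         # e.g. "SPS"
            "rejected": f["model_output"],   # the BACnet-style hallucination
        }
        n = reverse_weight if f["scenario"] == "reverse_lookup" else 1
        pairs.extend([pair] * n)
    return pairs</code></pre></div>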
<p>That story deserves its own post. For now, it is enough to say that the combination held the SFT+GRPO gains while quietly closing more residual cases.</p>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<p>I evaluated seven model variants: base SFT, SFT+GRPO, and SFT+DPO across two open-weights families (Phi-4 14B, Mistral 7B), plus a Phi-4-only GRPO baseline (no SFT first) and a closed-source comparator (GPT-4.1-mini fine-tuned via Azure on the same data). All numbers below are from the same held-out eval, all scored against the same scenario harness. Models ran as quantized GGUF for inference parity.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/01-sft-grpo-hvac/overall_accuracy.png" class="img-fluid figure-img"></p>
<figcaption>Overall accuracy across the seven model variants on the held-out eval. <strong>Phi-4 SFT+GRPO leads at 83.3%</strong>, with SFT+DPO close behind at 82.2%. The 15.5-point lift over Phi-4 SFT alone (67.8%) is the headline. Phi-4 GRPO without SFT first (58.9%) underperforms even SFT alone, confirming the conservative “refine, do not relearn” thesis. The Azure-fine-tuned GPT-4.1-mini at 20% is a useful sanity check: a generic fine-tuning API on a stronger base model loses badly to a thoughtfully post-trained 14B open model on this task.</figcaption>
</figure>
</div>
<p>The per-scenario breakdown is where the reward design earns its keep:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/01-sft-grpo-hvac/accuracy_by_scenario.png" class="img-fluid figure-img"></p>
<figcaption>Per-scenario accuracy. The two scenarios that GRPO and DPO target most directly, <code>unknown_tag</code> refusal and <code>typo_robustness</code>, are exactly where the largest lifts show up. SFT alone hits 60% on <code>unknown_tag</code>; SFT+GRPO and SFT+DPO both hit 86.7%. On <code>typo_robustness</code>, SFT’s 40% becomes 60% with GRPO. The “easy” scenarios (direct lookup, multiple choice, physical/virtual classification) saturate near 100% across most variants, which is the point: the asymmetric reward bands deliberately spent learning capacity on the hard scenarios without trading away the easy ones.</figcaption>
</figure>
</div>
<p>A few specific things I’d call out from these numbers:</p>
<ul>
<li><strong>The <code>unknown_tag</code> lift is the most important result in the post.</strong> It is the failure mode the asymmetric trap (+1.0 / -1.0) was designed for, and the +26.7-point delta on Phi-4 is the cleanest evidence that the reward shaping worked as intended rather than as a happy accident.</li>
<li><strong>Phi-4 GRPO without SFT (58.9%) underperforms Phi-4 SFT alone (67.8%).</strong> Same model, same data, same RL recipe, minus the SFT warm-start. RL on language models without a strong supervised initialisation is a different (and harder) problem; this row is the empirical version of “SFT first, then RL” as a recipe rather than a slogan.</li>
<li><strong>GPT-4.1-mini via Azure fine-tuning at 20%</strong> is the comparator that surprised me most. The fine-tuning API does not expose enough of the loss function to express the asymmetric preferences this task needs; a more capable base model with a less expressive post-training surface loses to a smaller open model with a more expressive one. <em>Reward design is the work</em> is not just a slogan I picked for the section heading.</li>
</ul>
<p>One qualitative example to ground the table:</p>
<blockquote class="blockquote">
<p><strong>Prompt:</strong> <em>“give me the supply air static pressure setpoint”</em> <strong>Phi-4 SFT:</strong> <code>SupplyAirStaticPressureSetpoint</code> <strong>Phi-4 SFT+GRPO:</strong> <code>SPS</code> <strong>Ground truth:</strong> <code>SPS</code></p>
</blockquote>
<p>The post is built around that one example because every story I have about this project bottoms out in some version of it. A fluent, plausible, confident answer, drawn from the wrong vocabulary, that no SFT eval set was going to flag.</p>
</section>
<section id="a-small-detail-that-signals-this-shipped" class="level2">
<h2 class="anchored" data-anchor-id="a-small-detail-that-signals-this-shipped">A small detail that signals “this shipped”</h2>
<p>One implementation note that did not make the narrative but that I would put in a sidebar: TRL’s <code>GRPOConfig</code> and <code>GRPOTrainer</code> APIs have churned across versions (<code>max_new_tokens</code> vs <code>max_completion_length</code>, <code>processing_class</code> vs <code>tokenizer</code>, <code>reward_funcs</code> vs <code>reward_function</code>). My <code>_build_grpo_config()</code> introspects <code>GRPOConfig.__init__</code> at runtime and picks the right kwargs for the installed version. It is three small <code>inspect.signature</code> checks. It saved me twice across upgrades and is the kind of thing you only write after you have shipped something for real.</p>
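<p>A minimal reconstruction of the shim (the real <code>_build_grpo_config()</code> handles three kwarg renames the same way):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import inspect
from trl import GRPOConfig

def _pick_kwarg(preferred, fallback, value, params):
    """Return {name: value} for whichever kwarg the installed TRL accepts."""
    return {preferred if preferred in params else fallback: value}

def build_grpo_config(**overrides):
    params = inspect.signature(GRPOConfig.__init__).parameters
    kwargs = {}
    # Newer TRL renamed max_new_tokens to max_completion_length.
    kwargs.update(_pick_kwarg("max_completion_length", "max_new_tokens", 256, params))
    kwargs.update(overrides)
    return GRPOConfig(**kwargs)</code></pre></div>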
</section>
<section id="what-id-do-next" class="level2">
<h2 class="anchored" data-anchor-id="what-id-do-next">What I’d do next</h2>
<ul>
<li><strong>Step-DPO</strong> for the multi-step reasoning scenarios (physical vs.&nbsp;virtual classification), where the failure is in the chain, not the answer.</li>
<li><strong>An RLAIF critic</strong> trained on a held-out slice of the taxonomy, to remove the manual-reward-tuning bottleneck for new scenarios.</li>
<li><strong>Reward-model regularisation.</strong> Measuring how much of the post-GRPO behaviour is reward-hacking the specific bands vs.&nbsp;genuine preference shift. The Anthropic Constitutional AI paper’s diagnostic ideas would translate cleanly here.</li>
</ul>
</section>
<section id="what-this-taught-me" class="level2">
<h2 class="anchored" data-anchor-id="what-this-taught-me">What this taught me</h2>
<p>A few things I did not get from reading RLHF papers and only got from running this:</p>
<ol type="1">
<li><strong>Reward design is the work.</strong> The training loop is mechanical; the reward function is where the research is. Every hour I spent on hyperparameters returned less than every hour I spent on the asymmetric reward bands.</li>
<li><strong>KL is not a tuning knob, it is a leash.</strong> Keeping <code>β</code> high and learning rate low felt unambitious until I tried lowering them, watched the model drift into reward-hacking, and put them back.</li>
<li><strong>The model already knows.</strong> SFT had the information. GRPO did not add knowledge; it changed which knowledge the model preferred to use. That distinction, between teaching and reshaping, is most of what makes RLHF feel different from SFT in practice.</li>
</ol>
<hr>
<p><em>The codebase that backs this post lives in a private repo; a sanitized public version is in progress. If you’re working on similar problems, domain taxonomies, post-training for vocabulary control, asymmetric reward design, I’d love to compare notes.</em></p>


</section>

 ]]></description>
  <category>post-training</category>
  <category>GRPO</category>
  <category>RLHF</category>
  <category>LLMs</category>
  <guid>https://mikkeyboi.github.io/posts/01-sft-grpo-hvac/</guid>
  <pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate>
  <media:content url="https://mikkeyboi.github.io/posts/01-sft-grpo-hvac/images/01-rag-vs-finetuning.png" medium="image" type="image/png" height="79" width="144"/>
</item>
<item>
  <title>From consuming a pretrained model to training my own</title>
  <dc:creator>Michael Min Wah Leung</dc:creator>
  <link>https://mikkeyboi.github.io/posts/03-cslt-seq2seq/</link>
  <description><![CDATA[ 




<section id="an-organisation-wide-hackathon-a-partner-and-a-ceiling-i-didnt-expect" class="level2">
<h2 class="anchored" data-anchor-id="an-organisation-wide-hackathon-a-partner-and-a-ceiling-i-didnt-expect">An organisation-wide hackathon, a partner, and a ceiling I didn’t expect</h2>
<p>My teammate Sharon and I entered an organisation-wide hackathon with over 11,000 project submissions. We set out to build a sign-language Copilot. Sharon owned the agent integration and tooling: Microsoft Agent Framework, the WorkIQ MCP for M365, the RAG-augmented translator that turns recognised gloss into clean English, and the Qt UI’s higher-level event handling. I owned the part you don’t see: the modelling and inference engine that turns webcam frames into something the agent can actually act on. The project went on to win first place in the challenge.</p>
<p>The reason this was a real project and not a clever demo: as far as we could find, there is no published methodology for using sign language as an interface to an AI agent, and there are no foundational models designed for continuous or conversational signing. Production sign-language systems are isolated-gloss recognisers; the research literature is dominated by isolated-sign benchmarks. Conversational signing, the way a deaf user would actually dictate to a Copilot, is an open problem. Our project was a novel attempt to train a continuous-signing model from scratch, including a new processing workflow for the multi-angle video clips the dataset ships in.</p>
<p>We shipped a working demo by Friday night, built around a strong pretrained isolated-gloss classifier (the Kaggle 1st-place ASL TFLite model). It could recognise “hello”, “thank you”, “schedule” with high confidence. By Sunday afternoon I had realised it could only ever do one sign at a time. <em>“Schedule a meeting with Sharon next Tuesday”</em> was structurally impossible: an isolated-gloss model has no notion of <em>sequence</em>. It recognises one sign per stable window and concatenates the recognitions post-hoc, which is a fundamentally different object from a continuous-signing decoder.</p>
<p>This post is about what I did next. It covers a small encoder-decoder Transformer trained from scratch on How2Sign in two backends, the architectural choice of attention over recurrence and why it matters for signing specifically, a subtle masked-cross-entropy detail that’s the difference between a model that stops talking and a model that doesn’t, and a hybrid runtime that knows when to defer to the simpler model. The combination, not any single component, is what reached <strong>93.6% sentence-level recognition</strong> on our internal evaluation, where the trained model alone struggles on long sentences and the pretrained model alone can’t attempt them.</p>
</section>
<section id="the-pipeline-end-to-end" class="level2">
<h2 class="anchored" data-anchor-id="the-pipeline-end-to-end">The pipeline, end to end</h2>
<p>The system is five threaded stages: webcam at ~30 FPS, MediaPipe Holistic landmarks (468 face + 21 left hand + 33 pose + 21 right hand = 543 landmarks × 3 coords per frame), an inference stage that turns landmark sequences into ASL gloss tokens, a translator that turns gloss into English with a small RAG-augmented LLM call, and a Copilot agent on top with M365 tooling. A custom event bus of bounded queues coordinates the threads and exposes a clean signing-state signal to the UI.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/03-cslt-seq2seq/images/01-architecture.png" class="img-fluid figure-img"></p>
<figcaption>End-to-end architecture for the sign-language Copilot. Webcam frames are fed through MediaPipe Holistic to produce a 543-landmark stack per frame; the inference engine turns landmark sequences into gloss tokens; a RAG-augmented translator converts gloss to English; a Copilot agent with WorkIQ MCP handles M365 tasks. The middle two stages, where modelling and inference live, are what this post focuses on.</figcaption>
</figure>
</div>
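<p>The per-frame feature is just the four landmark groups flattened in a fixed order. A sketch of that step against MediaPipe Holistic’s result object (the zero-fill policy for dropped detections is an assumption; some policy for missing groups is unavoidable):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np

# (attribute on the Holistic result, landmark count), in concatenation order
GROUPS = [("face_landmarks", 468), ("left_hand_landmarks", 21),
          ("pose_landmarks", 33), ("right_hand_landmarks", 21)]

def frame_features(results) -&gt; np.ndarray:
    """Flatten one MediaPipe Holistic result into a (543 * 3,) = (1629,) vector."""
    parts = []
    for name, n in GROUPS:
        group = getattr(results, name)
        if group is None:    # detection dropped this group for this frame
            parts.append(np.zeros((n, 3), dtype=np.float32))
        else:
            parts.append(np.array([[lm.x, lm.y, lm.z] for lm in group.landmark],
                                  dtype=np.float32))
    return np.concatenate(parts).reshape(-1)</code></pre></div>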
<p>Two stages are mine: the inference engine and the model that drives it. The agent layer, MCP integration, RAG provider, and Qt UI are my partner’s work, and they deserve their own write-ups.</p>
</section>
<section id="what-an-isolated-gloss-model-cant-do" class="level2">
<h2 class="anchored" data-anchor-id="what-an-isolated-gloss-model-cant-do">What an isolated-gloss model can’t do</h2>
<p>The original inference engine wraps the pretrained TFLite classifier with the obvious scaffolding: a sliding window over the landmark stream, a top-1 prediction smoothed over a five-deep history, and a stability filter that only emits a sign once it’s been the smoothed top-1 for three consecutive predictions. The threshold tuning, the idle-frame <code>&lt;END&gt;</code> token, the dual vocab-format dispatcher, all of it works.</p>
<p>It’s also fundamentally a one-sign-at-a-time machine. Continuous signing produces overlapping windows whose stable top-1 changes mid-phrase, and the inference engine has no representation of a <em>phrase</em>. There’s no path from “schedule” + “meeting” + “Priya” + “next” + “Tuesday” to a coherent sentence, because the model never sees those signs as a sequence, only as five separate top-1 calls. For “hello” or “thank you” this is fine. For anything an actual user would dictate to a Copilot, it isn’t.</p>
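<p>A reduced sketch of that scaffolding (thresholds are the real ones; the class shape is illustrative). Note what is missing: no state anywhere represents a phrase.</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from collections import Counter, deque

class StabilityFilter:
    """Emit a sign once it has been the smoothed top-1 for 3 straight predictions."""

    def __init__(self, smooth_n=5, stable_n=3):
        self.history = deque(maxlen=smooth_n)   # recent raw top-1 predictions
        self.stable_n = stable_n
        self.streak, self.last = 0, None

    def update(self, top1):
        self.history.append(top1)
        smoothed = Counter(self.history).most_common(1)[0][0]
        self.streak = self.streak + 1 if smoothed == self.last else 1
        self.last = smoothed
        if self.streak == self.stable_n:   # emit exactly once per stable sign
            return smoothed
        return None   # every emission is independent: no phrase-level state</code></pre></div>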
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/03-cslt-seq2seq/images/02-landmarks.gif" class="img-fluid figure-img"></p>
<figcaption>A short sample of the continuous-signing input the new model has to handle. Each frame is a 543-landmark stack; the <em>sequence</em> of these stacks carries the sentence-level meaning, not any single frame. A pretrained isolated-gloss classifier evaluates frame windows in isolation and has no representation of the sequence beyond a sliding aggregator on top.</figcaption>
</figure>
</div>
</section>
<section id="why-an-attention-based-encoder-decoder-and-not-an-rnn-gru-or-lstm" class="level2">
<h2 class="anchored" data-anchor-id="why-an-attention-based-encoder-decoder-and-not-an-rnn-gru-or-lstm">Why an attention-based encoder-decoder, and not an RNN, GRU, or LSTM</h2>
<p>The model is a small encoder-decoder Transformer with multi-head self-attention in the encoder, multi-head self-attention plus cross-attention in the decoder, and <em>no recurrence anywhere</em>. The encoder consumes the 1629-dimensional flattened landmark stack per frame; the decoder produces English tokens autoregressively from a 5000-token vocabulary trained on How2Sign captions. Sizes are deliberately modest: <code>embed_dim=256</code>, <code>dense_dim=512</code>, <code>num_heads=4</code>, two encoder layers, two decoder layers, around 10–11M parameters. The intent isn’t SOTA on How2Sign; it’s a model small enough to overfit gracefully on the available budget while being a structurally honest sequence-to-sequence model.</p>
<p>The choice of attention over recurrence wasn’t aesthetic. Continuous signing has two properties that an RNN, GRU, or LSTM handles poorly:</p>
<p><em>Variable signing rate.</em> The same sentence can take 1.5 seconds or 4 seconds depending on the signer. Recurrent encoders fold information through a hidden state at a fixed cadence and develop a strong recency bias; a fast signer’s early signs decay before the decoder ever attends to them, and a slow signer’s signs blur into each other through the gating. Self-attention has no recency bias by construction. Every encoder position attends to every other position with learned weights, so a sign that takes ten frames and a sign that takes thirty frames are weighted on content, not on temporal distance from the decoder’s current step.</p>
<p><em>Spatial structure that evolves over time.</em> A frame is 543 spatial landmarks, and the <em>configuration</em> of those landmarks (the relative positions of the right-hand keypoints to the face keypoints to the left-hand keypoints) is what carries the gloss. Recurrent models have no native way to factor “what’s spatially co-occurring inside this frame” from “how is the spatial pattern changing across frames”. An encoder built on attention treats the per-frame landmark stack as a single token whose embedding can carry the spatial structure, and lets the inter-frame attention layer carry the temporal structure separately. The two axes get their own machinery, which is exactly what a problem with non-trivial spatial <em>and</em> temporal structure needs.</p>
<p>The non-obvious architectural choice that follows is the <strong>dual positional embedding</strong>. The encoder uses <code>FramePositionalEmbedding</code> over continuous frame indices; the decoder uses ordinary <code>PositionalEmbedding</code> over discrete token positions. Modalities have different position semantics, and trying to share one positional code across them produces a model that’s confused about which axis is which. The cleanest way to phrase the framing is that <strong>MediaPipe Holistic is a frozen perceptual frontend</strong>, and the encoder is learning the temporal-linguistic mapping between landmark sequences and language. That’s the same shape as freezing the vision tower and training only the projector in modern multimodal LLMs.</p>
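<p>The dual positional embedding is only a few lines in either backend. A sketch of the two embedding paths in MLX (layer and argument names are illustrative; the dims are the real ones):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import mlx.core as mx
import mlx.nn as nn

class FramePositionalEmbedding(nn.Module):
    """Encoder side: project the 1629-dim landmark stack per frame, then add
    a learned embedding over continuous frame indices."""
    def __init__(self, max_frames=512, embed_dim=256, input_dim=1629):
        super().__init__()
        self.proj = nn.Linear(input_dim, embed_dim)
        self.pos = nn.Embedding(max_frames, embed_dim)

    def __call__(self, frames):   # frames: (batch, T, 1629)
        return self.proj(frames) + self.pos(mx.arange(frames.shape[1]))

class TokenPositionalEmbedding(nn.Module):
    """Decoder side: ordinary token + discrete-position embeddings."""
    def __init__(self, vocab_size=5000, max_len=64, embed_dim=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, embed_dim)
        self.pos = nn.Embedding(max_len, embed_dim)

    def __call__(self, tokens):   # tokens: (batch, S) int ids
        return self.tok(tokens) + self.pos(mx.arange(tokens.shape[1]))</code></pre></div>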
</section>
<section id="the-masked-cross-entropy-detail-that-actually-matters" class="level2">
<h2 class="anchored" data-anchor-id="the-masked-cross-entropy-detail-that-actually-matters">The masked cross-entropy detail that actually matters</h2>
<p>Most of the implementation is mechanical. One detail isn’t.</p>
<p>In a teacher-forced sequence-to-sequence loss, you have to mask the padding positions so the model isn’t penalised for what it predicts after the real sequence ends. The reflexive way to write the mask is “wherever the target token is the padding ID, mask”. This is wrong, and it’s wrong in a way that produces a model that <em>never learns to stop</em>.</p>
<p>The fix is to mask the <code>[padding → padding]</code> transitions but <em>keep</em> the first <code>[real_token → padding]</code> transition trainable. That single transition is where the model learns where end-of-sequence lives. Mask it and the decoder produces fluent, plausible, infinite output at inference time. Keep it and the decoder learns to terminate.</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Wrong: mask all positions where the target is padding.</span></span>
<span id="cb1-2">mask <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (targets <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Right: keep the first padding position (the EOS transition) trainable;</span></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># mask only padding-on-padding.</span></span>
<span id="cb1-6">mask <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (targets <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> (mx.roll(targets, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div>
<p>It’s a one-line difference in the loss function. Catching it took a wasted training run and a confused afternoon staring at sample outputs. It’s the kind of subtle bug that doesn’t show up in a unit test and only surfaces when you actually look at what the trained model produces. I include it here because it’s representative of the class of problem this work surfaced: not “import the right HF class”, but “make teacher-forced training do what teacher-forced training is supposed to do”.</p>
</section>
<section id="two-backends-one-architecture" class="level2">
<h2 class="anchored" data-anchor-id="two-backends-one-architecture">Two backends, one architecture</h2>
<p>The same model is implemented twice: once in Keras, once in Apple MLX. The MLX path is what I used to actually train, on Apple Silicon, with a <code>@mx.compile</code>’d step and an async batch prefetcher built on <code>concurrent.futures</code> that keeps the GPU fed during data preparation. The Keras path is what runs by default if MLX weights aren’t available, and it’s also what the data pipeline (<code>How2SignDataset</code>) uses, with a manual upfront vectorisation step that works around a macOS TF threading deadlock that ate a non-trivial amount of debugging time.</p>
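<p>The prefetcher is a generic pattern worth reproducing. A minimal sketch, assuming a <code>make_batch(i)</code> callable that does the CPU-bound batch assembly:</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from concurrent.futures import ThreadPoolExecutor

def prefetched_batches(make_batch, num_batches, depth=2):
    """Yield batch i while batches i+1..i+depth are assembled in background
    threads, so the compiled training step never waits on data preparation."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        futures = [pool.submit(make_batch, i)
                   for i in range(min(depth, num_batches))]
        for i in range(num_batches):
            batch = futures.pop(0).result()
            if i + depth &lt; num_batches:
                futures.append(pool.submit(make_batch, i + depth))
            yield batch</code></pre></div>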
<p>The dual-backend setup wasn’t theoretical. It’s what made it possible to train the model at all on the hardware I had during the hackathon, and it’s what made the model portable enough to ship inside the Qt application my partner had built the agent integration on top of.</p>
</section>
<section id="hybrid-inference-the-runtime-as-a-model-decision" class="level2">
<h2 class="anchored" data-anchor-id="hybrid-inference-the-runtime-as-a-model-decision">Hybrid inference: the runtime as a model decision</h2>
<p>Here’s where the engineering ends up doing more than the model alone can.</p>
<p>The trained Seq2Seq is genuinely good at continuous sentences and genuinely weak on the kind of short conversational gestures (“hello”, “thank you”) that aren’t well-represented in How2Sign. The pretrained isolated-gloss model has the opposite profile. The right move isn’t to pick one. It’s to dispatch.</p>
<p>The inference engine became a hybrid: if the frame buffer holds fewer than 16 frames, fall back to the isolated TFLite classifier; if it holds 16 or more, run autoregressive Seq2Seq decode on the trained model. The 16-frame threshold is roughly half a second of signing at 30 FPS, which empirically separates “single sign” from “phrase”. Both models share the MediaPipe-Holistic frontend, so the dispatcher is purely a buffer-length decision.</p>
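<p>The dispatcher really is that small (a sketch; the two recogniser callables stand in for the real engine’s methods):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">PHRASE_THRESHOLD = 16   # ~0.5 s of signing at 30 FPS: "phrase", not "gesture"

def recognise(frame_buffer, isolated_top1, seq2seq_decode):
    """Route on buffer length: short bursts go to the isolated-gloss
    classifier, longer buffers to autoregressive Seq2Seq decode.
    Both callables consume the same MediaPipe landmark frontend."""
    if len(frame_buffer) &lt; PHRASE_THRESHOLD:
        return isolated_top1(frame_buffer)    # pretrained TFLite path
    return seq2seq_decode(frame_buffer)       # trained Transformer path</code></pre></div>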
<p>This gets framed as engineering taste, but it’s actually a <em>modelling</em> choice. The same way mixture-of-experts gating, retrieval-augmented vs.&nbsp;parametric, and small-model-as-router-for-large-model patterns are modelling choices. The right phrasing is: don’t make one model do two jobs when the runtime can choose between two specialists.</p>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<p>The honest version of the model-only numbers looks like this:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/03-cslt-seq2seq/images/03-modeling-results.png" class="img-fluid figure-img"></p>
<figcaption>Validation results from our internal evaluation set of roughly 300 conversational-signing video clips, grouped by common sentence-starting phrases (“I’m going to…”, “you want to…”, and similar). Each row is a phrase grouping; the bars show the trained model’s recognition rate within that group. The bottom row, <em>Anchor Words</em>, contains the isolated glosses (“hi”, “okay”, “good”) and as expected scores well, since per-frame visual primitives are the easy case. Multi-word continuous sentences are substantially harder; for sentences starting with several signs in tight succession, the model’s autoregressive decode often fragments or stalls. This is the empirical motivation for the hybrid runtime.</figcaption>
</figure>
</div>
<p>A standalone Transformer Seq2Seq trained on a hackathon budget on How2Sign isn’t going to match a production-trained gloss recognition system on isolated glosses, and on long continuous sentences it pays for the small-data regime in exactly the way the figure shows. Most published continuous sign-language models struggle with the same setup, and I don’t claim mine is special.</p>
<p>What changes is what the <em>pipeline</em> achieves once the trained model is composed with the Qt threading layer’s temporal sliding window and the isolated-gloss fallback. End-to-end, on our internal evaluation of conversational signing scenarios, the composed system reached <strong>93.6% sentence-level recognition</strong>. The split of where that number comes from is roughly: short-burst gestures handled by the fallback (high precision, narrow scope), continuous phrases handled by the Seq2Seq with sliding-window temporal smoothing on top, and the dispatcher choosing between them on a buffer-length signal that’s empirically clean.</p>
<p>The reason that number is a useful signal rather than a vanity metric is that none of the three components reaches it on its own. The model alone is in the figure above. The fallback alone tops out wherever isolated-gloss accuracy tops out and can’t say sentences. The dispatcher alone is twenty lines of buffer-length logic. The composition is what works, and the composition is where the engineering taste lives.</p>
<p>Honest negatives, since this isn’t a paper:</p>
<ul>
<li>Long sentences (&gt;10 signs) still degrade. Beam search at decode and a CTC head as an alternative to teacher-forcing are the obvious next moves.</li>
<li>The model has no notion of <em>speaker</em>. Different signers have different rest poses, hand sizes, and signing speeds; a speaker-conditional encoder would help.</li>
<li>The hybrid threshold (16 frames) is empirical. A learned dispatcher would be the principled version.</li>
</ul>
</section>
<section id="the-wider-point" class="level2">
<h2 class="anchored" data-anchor-id="the-wider-point">The wider point</h2>
<p>The first version of the system shipped a UI around someone else’s model. The version that actually worked for continuous signing required training a different model. Both were the right call at their moment. The interesting work was in the <em>transition</em>: recognising the pretrained model’s ceiling, designing a different architecture, training it in two backends, getting the masked-CE detail right, and engineering the runtime that lets the two models coexist instead of replacing one with the other.</p>
<p>That loop, see-the-ceiling, design, train, integrate, is what I want to spend the next decade doing. Most of the time, in production, it’ll happen at smaller resolution than this; sometimes at much larger. The pattern is the same.</p>
</section>
<section id="whats-next" class="level2">
<h2 class="anchored" data-anchor-id="whats-next">What’s next</h2>
<ul>
<li><strong>Connectionist Temporal Classification</strong> as an alternative training objective. Teacher forcing isn’t the right inductive bias for a problem where the input and output stream lengths are decoupled and there’s no natural alignment.</li>
<li><strong>Beam search at decode</strong>, with length normalisation. Greedy autoregressive decode is leaving recall on the table, especially on phrases where one early token error cascades.</li>
<li><strong>Speaker-conditional encoder.</strong> Conditioning on a small per-signer embedding learned from a calibration sequence would close most of the cross-signer drift.</li>
<li><strong>Encoder pretraining on raw ASL video.</strong> The current encoder learns from labelled translation pairs only, which is a tiny slice of the available signal. A self-supervised pretraining stage on unlabelled signing video, masking or contrastive, is the obvious unlock.</li>
</ul>
<hr>
<p><em>Hackathon collaboration. Credit to Sharon for the agent integration, MCP work, RAG provider, and Qt UI scaffolding; my contribution centres on the modelling and inference layer described above. The project won first place in an organisation-wide hackathon with 11,000+ submissions. How2Sign is a CMU-released dataset under CC-BY-NC-4.0; trained weights derived from it are subject to the dataset’s non-commercial terms.</em></p>


</section>

 ]]></description>
  <category>seq2seq</category>
  <category>sign-language</category>
  <category>MLX</category>
  <category>hybrid-inference</category>
  <guid>https://mikkeyboi.github.io/posts/03-cslt-seq2seq/</guid>
  <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
  <media:content url="https://mikkeyboi.github.io/posts/03-cslt-seq2seq/images/01-architecture.png" medium="image" type="image/png" height="79" width="144"/>
</item>
<item>
  <title>Patient-specific filters as biomarkers</title>
  <dc:creator>Michael Min Wah Leung</dc:creator>
  <link>https://mikkeyboi.github.io/posts/02-bci-spatial-filters/</link>
  <description><![CDATA[ 




<section id="a-filter-fit-to-the-patient-is-a-biomarker-of-the-patient" class="level2">
<h2 class="anchored" data-anchor-id="a-filter-fit-to-the-patient-is-a-biomarker-of-the-patient">A filter fit to the patient is a biomarker of the patient</h2>
<p>The conventional view of EEG preprocessing treats spatial and spectral filters as a kind of janitorial work. Clean the data, then do the science on whatever’s left. My graduate research convinced me of the inverse: the filter parameters <em>are</em> the science. They quotient out the structure that varies idiosyncratically across people, and the parameters they learn while doing it carry the individual’s signature. The features that remain are comparable across people only because the filter has eaten the variance that wasn’t.</p>
<p>This post is about three filters I worked with, what each one removes, what its parameters reveal, and why the same intuition shows up every time I read a mechanistic interpretability paper.</p>
</section>
<section id="the-setting" class="level2">
<h2 class="anchored" data-anchor-id="the-setting">The setting</h2>
<p>The thesis platform was a saccade-based stop-signal task in VR with simultaneous 32-channel scalp EEG, on seven healthy subjects, designed as the pilot validation for a Parkinson’s disease biomarker study. Subjects fixated on a central point, were cued to prepare a prosaccade or antisaccade, and on 20% of trials had to <em>cancel</em> the planned movement when the fixation point turned green. The neural signature of successful cancellation, frontal theta synchronisation followed by motor beta desynchronisation, is well-described in the reach literature; the question was whether a robust classifier could detect that signature on a per-subject basis with a small amount of data.</p>
<p>Off-the-shelf decoders generalise poorly here, and not for the reason most ML readers expect. The problem isn’t sample size or label noise. It’s that the relevant frequency band is genuinely different in different brains. Subject 1’s task-modulating beta peak sits at 29–32 Hz. Subject 2’s sits at 15–20 Hz. Subject 4’s is 13–19 Hz. If your classifier filters the signal at 13–30 Hz “because that’s beta”, you are doing two completely different operations on those subjects and pretending it’s the same one.</p>
<p>The same problem shows up in a much harsher form in the parallel work I did on intraoperative microelectrode recordings from deep brain structures. Different patient, different anatomy, different background spectra, and you only get the recording window the surgeon gives you. There’s no luxury of a population-fit pipeline; the filter has to work <em>on this person, today</em>.</p>
</section>
<section id="three-filters-three-levels-of-structure-removed" class="level2">
<h2 class="anchored" data-anchor-id="three-filters-three-levels-of-structure-removed">Three filters, three levels of structure removed</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/02-bci-spatial-filters/images/01-pipeline.png" class="img-fluid figure-img"></p>
<figcaption>Per-subject signal-processing pipeline, reproduced from the thesis (Figure 7). The task-modulating component is selected from ICA, its spatial filter is applied to the session, the power spectrum is computed, and FOOOF separates aperiodic 1/f from narrow oscillatory peaks. The peak parameters become the per-subject β band that the downstream CSP estimator uses.</figcaption>
</figure>
</div>
<section id="ica-remove-statistical-mixtures" class="level3">
<h3 class="anchored" data-anchor-id="ica-remove-statistical-mixtures">ICA: remove statistical mixtures</h3>
<p>Independent Component Analysis assumes the recorded channels are linear mixtures of statistically independent sources, <img src="https://latex.codecogs.com/png.latex?X%20=%20AS">, and estimates the de-mixing matrix <img src="https://latex.codecogs.com/png.latex?W"> such that <img src="https://latex.codecogs.com/png.latex?U%20=%20WX"> recovers components that are as independent as possible. In EEG this works because ocular, muscle, and cardiac artifacts genuinely <em>are</em> statistically independent of cortical activity at the scales we care about, and they have stereotyped topologies (frontal-symmetric for blinks, lateral for muscle).</p>
<p>The thesis-relevant property of ICA is that it’s <em>unsupervised</em>. There’s no experimenter bias in choosing what’s “signal”. You decompose, you look at the topologies, and you keep the component whose ERP and scalp distribution match what neurophysiology says response inhibition should look like. The component is not “the signal”; the component is <em>a basis vector</em> in a decomposition the data itself proposed.</p>
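<p>In scikit-learn terms the decomposition looks like this; a sketch with synthetic data, not the thesis pipeline, but the point about the mixing matrix carries over directly:</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np
from sklearn.decomposition import FastICA

# Stand-in for a recording: (time_samples, n_channels), 32 scalp channels.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 32))

ica = FastICA(n_components=20, random_state=0)
U = ica.fit_transform(X)   # recovered sources, one column per component
A = ica.mixing_            # (32, 20): X ≈ U @ A.T + mean

# The subject-specific object is A itself: column k is component k's
# scalp topology, which is what you inspect before deciding to keep it.
candidate_topology = A[:, 0]</code></pre></div>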
</section>
<section id="fooof-remove-the-aperiodic-background" class="level3">
<h3 class="anchored" data-anchor-id="fooof-remove-the-aperiodic-background">FOOOF: remove the aperiodic background</h3>
<p>The brain’s resting power spectrum is dominated by an aperiodic 1/f component on top of which narrow oscillatory peaks live. FOOOF (<em>Fitting Oscillations and One Over F</em>, Donoghue et al.) fits the aperiodic background as a linear function in log-log space, subtracts it, and parameterises whatever peaks remain by their centre frequency, bandwidth, and amplitude.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/02-bci-spatial-filters/images/02-fooof-model.png" class="img-fluid figure-img"></p>
<figcaption>FOOOF model fit for one subject’s task-modulating component. The aperiodic 1/f line is the background; the peaks above it are the periodic components that survive subtraction. The subject’s β peak is what the next stage uses.</figcaption>
</figure>
</div>
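<p>The extraction is a few lines with the <code>fooof</code> package, assuming <code>freqs</code> and <code>spectrum</code> arrays from the ICA stage; settings here are illustrative. What matters is that the per-subject bands in the table below come out of the fitted peak parameters, not out of a canonical band definition:</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from fooof import FOOOF

# freqs, spectrum: the ICA component's power spectrum from the previous stage.
fm = FOOOF(peak_width_limits=(1, 8), max_n_peaks=4)
fm.fit(freqs, spectrum, freq_range=(2, 40))

# Each fitted peak is (centre frequency, power above aperiodic, bandwidth);
# the subject-specific band is read off the centre frequency and bandwidth.
for cf, pw, bw in fm.peak_params_:
    print(f"peak {cf:.1f} Hz, band {cf - bw / 2:.1f}-{cf + bw / 2:.1f} Hz")</code></pre></div>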
<p>The number that mattered for the thesis is in this table:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th style="text-align: right;">Subject</th>
<th style="text-align: right;">α range (Hz)</th>
<th style="text-align: right;">β range (Hz)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">1</td>
<td style="text-align: right;">11–13</td>
<td style="text-align: right;"><strong>29–32</strong></td>
</tr>
<tr class="even">
<td style="text-align: right;">2</td>
<td style="text-align: right;">9–13</td>
<td style="text-align: right;"><strong>15–20</strong></td>
</tr>
<tr class="odd">
<td style="text-align: right;">3</td>
<td style="text-align: right;">none</td>
<td style="text-align: right;"><strong>28–32</strong></td>
</tr>
<tr class="even">
<td style="text-align: right;">4</td>
<td style="text-align: right;">7–10</td>
<td style="text-align: right;"><strong>13–19</strong></td>
</tr>
<tr class="odd">
<td style="text-align: right;">5</td>
<td style="text-align: right;">5–9</td>
<td style="text-align: right;"><strong>23–28</strong></td>
</tr>
<tr class="even">
<td style="text-align: right;">6</td>
<td style="text-align: right;">7–13</td>
<td style="text-align: right;"><strong>17–24</strong></td>
</tr>
<tr class="odd">
<td style="text-align: right;">7</td>
<td style="text-align: right;">8–12</td>
<td style="text-align: right;"><strong>20–25</strong></td>
</tr>
</tbody>
</table>
<p>Seven subjects, seven different β bands. “13–30 Hz beta” is not one phenomenon; it’s a population-level smear that hides the structure. Worse, in some subjects the narrow peak is only detectable <em>after</em> ICA has stripped out an artifact component that was dragging power into a different region of the spectrum. The peaks live in a basis the raw signal doesn’t expose.</p>
</section>
<section id="csp-remove-between-class-variance-you-dont-care-about" class="level3">
<h3 class="anchored" data-anchor-id="csp-remove-between-class-variance-you-dont-care-about">CSP: remove between-class variance you don’t care about</h3>
<p>Common Spatial Patterns finds spatial filters that maximise variance for one class while minimising it for the other. Given band-passed EEG matrices <img src="https://latex.codecogs.com/png.latex?X_H"> and <img src="https://latex.codecogs.com/png.latex?X_F"> for two classes, CSP solves a generalised eigenvalue problem on their normalised covariances, and the resulting filters project the data into a subspace where the two classes are maximally separable in variance.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/02-bci-spatial-filters/images/03-csp-topo.png" class="img-fluid figure-img"></p>
<figcaption>CSP scalp topologies (filter-bank CSP, reproduced from the thesis). Centrally-weighted patterns suggest the model is using motor/sensorimotor sources rather than ocular or muscle artifacts. Source localisation isn’t the point of the figure; <em>which</em> features the classifier ends up using is.</figcaption>
</figure>
</div>
<p>Two things matter here. First, CSP is parameterised by the band-pass it operates in, and that’s where the FOOOF output enters. Feeding CSP the subject-specific narrow band, instead of the conventional 13–30 Hz, lets it find spatial filters that actually correspond to task-relevant activity rather than whatever broad-band variance happens to dominate. Second, the resulting topologies become a <em>check on the rest of the pipeline</em>. A CSP filter that places its weight at the temples is using muscle, not cortex. A filter that places it centrally over sensorimotor cortex is doing what we wanted.</p>
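<p>The whole of CSP is one generalised eigendecomposition. A sketch of the textbook formulation in NumPy/SciPy (the thesis ran the filter-bank variant on top of this; names are illustrative):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np
from scipy.linalg import eigh

def csp_filters(X_go, X_stop, n_pairs=3):
    """X_*: band-passed trials, shape (n_trials, n_channels, n_samples).
    Returns 2*n_pairs spatial filters, shape (n_filters, n_channels)."""
    def mean_cov(X):
        # Trace-normalised covariance per trial, averaged over trials.
        covs = [x @ x.T / np.trace(x @ x.T) for x in X]
        return np.mean(covs, axis=0)

    C1, C2 = mean_cov(X_go), mean_cov(X_stop)
    # Generalised eigenvalue problem: C1 w = λ (C1 + C2) w. Filters at the
    # two ends of the eigenvalue spectrum are the most class-discriminative.
    eigvals, eigvecs = eigh(C1, C1 + C2)
    order = np.argsort(eigvals)
    picks = np.concatenate([order[:n_pairs], order[-n_pairs:]])
    return eigvecs[:, picks].T</code></pre></div>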
</section>
</section>
<section id="the-reframe-filter-parameters-are-features" class="level2">
<h2 class="anchored" data-anchor-id="the-reframe-filter-parameters-are-features">The reframe: filter parameters are features</h2>
<p>Notice the move that’s happening across all three stages. ICA gives you a mixing matrix <img src="https://latex.codecogs.com/png.latex?A"> that’s specific to this subject’s recording; the columns of <img src="https://latex.codecogs.com/png.latex?A"> are this person’s source topologies. FOOOF gives you a triple <img src="https://latex.codecogs.com/png.latex?(f_c,%20%5Ctext%7Bbw%7D,%20a)"> that’s specific to this subject’s resting spectrum. CSP gives you spatial filter weights conditioned on this subject’s data and frequency band.</p>
<p>The conventional pipeline treats these as preprocessing artefacts, things you need to fit but can throw away once you have the post-filter signal. The reframe: the parameters <em>are</em> the features you actually want. They encode the individual in a low-dimensional, interpretable way. The post-filter signals become comparable across people exactly because the parameters have absorbed whatever was idiosyncratic.</p>
<p>This is the same intuition as <strong>whitening followed by per-instance layer-norm in a transformer</strong>: a per-sample reparameterisation that makes the rest of the pipeline well-conditioned, without which the downstream layers are doing different operations on different inputs and pretending it’s the same operation. The whitening matrix is sample-specific; the post-whitened space is shared.</p>
</section>
<section id="the-empirical-payoff" class="level2">
<h2 class="anchored" data-anchor-id="the-empirical-payoff">The empirical payoff</h2>
<p>Across the seven subjects, classifying Go versus successful Stop trials with a Random Forest over CSP features showed a consistent pattern: <strong>using subject-specific β bands from FOOOF, instead of the conventional 13–30 Hz broad band, lifted stop-trial recall by an average of ~10 percentage points and up to +13.8 points</strong> (subject 01). The lift is largest where it should be: subjects whose narrow β is far from the centre of the broad band gain the most, because for them the broad-band filter is paying for off-band noise.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mikkeyboi.github.io/posts/02-bci-spatial-filters/images/04-confusion-matrix.png" class="img-fluid figure-img"></p>
<figcaption>Subject-level confusion matrices for Go vs Stop classification, reproduced from the thesis. The diagonal lift between broad-band β and subject-specific β is visible across most subjects; the qualitative pattern (Stop is the harder, lower-prevalence class) is preserved.</figcaption>
</figure>
</div>
<p>This isn’t a story about elegance. It’s about whether the classifier is learning anything useful at all. When the filter band is wrong, the post-CSP features mix task signal with off-band drift, and the classifier ends up modelling whichever happens to dominate on the training day. When the filter band is right, the features become comparable across subjects, and a small amount of per-subject calibration generalises to held-out sessions. The thesis showed exactly that: decoding performance <em>improved</em> across successive recording sessions on the same subject when the per-subject features were used, where a generic pipeline would degrade as the resting state drifted.</p>
</section>
<section id="the-bridge-to-mechanistic-interpretability" class="level2">
<h2 class="anchored" data-anchor-id="the-bridge-to-mechanistic-interpretability">The bridge to mechanistic interpretability</h2>
<p>Modern transformers don’t have a 1/f spectrum to subtract, and they don’t have ICA-style independent sources to recover linearly. The math is genuinely different. The <em>question</em>, though, is the same: where in this signal does the structure live, and what filter recovers it?</p>
<p>Sparse autoencoders for mechanistic interpretability are doing roughly this. The MLP activation at a transformer layer is a dense, polysemantic mixture; an SAE finds an overcomplete sparse basis where individual directions correspond to interpretable features. The SAE <em>parameters</em> (the dictionary) are model-specific in the same way ICA’s mixing matrix was subject-specific. The post-decomposition features become legible exactly because the dictionary has absorbed the polysemantic structure.</p>
<p>The framing I find most useful is the one I left graduate school with:</p>
<ul>
<li><strong>ICA is the linear, statistically-independent ancestor of the SAE.</strong> Both find a basis for a noisy mixture in which the components are individually meaningful.</li>
<li><strong>FOOOF is a domain-specific structural prior.</strong> It says “we know there’s a 1/f component, fit it explicitly, model the residual.” The transformer analogue is recent work that explicitly subtracts low-rank “background” structure from activations before looking for sparse features.</li>
<li><strong>CSP is task-conditioned dimensionality reduction.</strong> It’s most analogous to probing classifiers: find a subspace where the labels are linearly separable, and inspect what the subspace responds to.</li>
</ul>
<p>None of these are perfect analogies. The relevant signal in a transformer isn’t the brain’s, and the failure modes of spectral filtering don’t translate directly to representation learning. But the <em>willingness to treat the individual sample as the unit of investigation</em>, rather than averaging straight to a population estimator, is a habit of mind that travels.</p>
</section>
<section id="what-this-taught-me" class="level2">
<h2 class="anchored" data-anchor-id="what-this-taught-me">What this taught me</h2>
<p>A few things I took out of this work that still shape how I think about modern ML:</p>
<ol type="1">
<li><strong>Per-sample parameters are not overhead; they’re often the answer.</strong> The reflex to “throw away the calibration” is wrong when the calibration is what makes the rest of the pipeline well-conditioned.</li>
<li><strong>Decomposition before classification.</strong> When the input is mixed, fit a decomposition and let the classifier work in the unmixed basis. This is true for EEG, true for vision-language fusion, and increasingly true for transformer interpretability.</li>
<li><strong>Trust the topology more than the accuracy.</strong> A model that achieves high accuracy by attending to artifact channels is failing in a way the validation set won’t tell you about. CSP scalp topologies are the cheapest sanity check I’ve ever shipped, and the equivalent in modern ML, looking at <em>which features the model is actually using</em>, is one of the few things I expect to keep doing for the next decade.</li>
</ol>
<hr>
<p><em>The full thesis is open-access at the <a href="https://ruor.uottawa.ca/items/9d22c2da-12c4-432a-87f6-6ae6c4d11f4f">University of Ottawa repository</a> for anyone who wants the complete methods or the per-subject figures. If you’re working on per-instance reparameterisation in language models, or on bridges between classical signal-decomposition and modern interpretability, I’d love to compare notes.</em></p>


</section>

 ]]></description>
  <category>neuroscience</category>
  <category>signal-processing</category>
  <category>interpretability</category>
  <guid>https://mikkeyboi.github.io/posts/02-bci-spatial-filters/</guid>
  <pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://mikkeyboi.github.io/posts/02-bci-spatial-filters/images/01-pipeline.png" medium="image" type="image/png" height="62" width="144"/>
</item>
</channel>
</rss>
