PureGym recommendations
High-confidence picks from 27,666 customer reviews + 43 Glassdoor staff reviews + Companies House FY2024 filing. Tagged OpEx (recurring £/year) vs CapEx (one-off £). No fabricated £ figures — implementation costs are for PureGym's facilities, CX, and marketing teams to price.
What to do Monday morning
What was deliberately not claimed
- £ savings figures per recommendation.
- Cost / time / size estimates beyond the Companies House filing.
- Predictions about churn impact at sub-12-month horizons (Lindstrom power analysis: 50 bps churn effects are statistically invisible at PureGym scale within a year).
- Causal identification per pilot — single-site before / after is a weak counterfactual.
- The "shift workers" headline (3% verified after 200-sample Sonnet zero-shot — re-framed as 24/7-flexibility before publish, saved a 30× wrong number).
The data behind these picks
| Source | Volume | Role |
|---|---|---|
| Trustpilot reviews | 15,815 | Customer-side text + reply-latency feature |
| Google reviews | 11,851 (+ 9,352 stars-only dropped) | Customer-side text |
| Glassdoor staff reviews | 43 | Staff-side counter-evidence (music null) |
| Companies House FY2024 filing | 1 PDF | Real numbers — vindicates panel review on Perplexity errata |
| Sonnet 4.6 200-sample zero-shot | 200 | Validation — flipped shift-worker headline 30× |
| BERT emotion classifier (rubric-mandated) | 27,666 | Six-class emotion + score-guided OOD re-rank for British 1–2★ joy |
| BERTopic (4 lenses) + Gensim LDA | 5,931 negatives (4,137 cross-platform) | Topic discovery, location specificity, multilingual cluster detection |
| Qwen2.5-7B-Instruct on Colab A100 | 2,537 anger-filtered → 5,999 phrases | Natural-language topic extraction → meta-clustered with BERTopic |
Full long-form record — every refinement, every panel-review critique, every dropped hypothesis — is in the addendum cells of the submission notebook (`basic/basic_notebook.ipynb`, cells 117–127) and `LESSONS_ADDENDUM.md`.
Learnings by rubric item + overall
The CAM_DS_301 rubric has 48 numbered criteria across 8 sections. This is what we learned mapping each one to PureGym's review corpus — the surprises, the substitutions, the things the rubric implied but didn't say out loud.
Section 1 — Importing & cleaning
Items 1–3: Excel import + drop-NaN
Trustpilot's UI requires text; Google's doesn't. Of 23,250 Google reviews, 9,352 (40%) have no Comment at all — Google's review widget allows star-only submissions. Dropping these is correct, but the asymmetry matters: keep them in star-distribution stats, exclude only from text analyses.
Inspect placeholder/numeric values before dropping rows. 216 Trustpilot rows had numeric Location Name placeholders (`345`, `398`). Pierre's first instinct was to drop them. A Sonnet 4.6 read-through of all 216 reviews showed they're real PureGym UK reviews — the placeholders are multi-site catch-all buckets (`345` aggregates 9+ London gyms; `398` is dominantly Shrewsbury). They stay in topic / sentiment / emotion analysis but are excluded from per-location ranking only.
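Both rules reduce to column-level filters. A minimal pandas sketch, assuming the source's column names (`Comment`, `Location Name`, `Overall Score`) and using a toy frame in place of the real export:

```python
import pandas as pd

# Toy stand-in for the Google export; real data has 23,250 rows.
google = pd.DataFrame({
    "Overall Score": [5, 1, 2, 4],
    "Comment": ["Great gym", None, "Broken machines", None],
    "Location Name": ["Leeds", "345", "398", "Leeds"],
})

PLACEHOLDER_LOCATIONS = {"345", "398"}  # multi-site catch-all buckets

# Star-only rows stay in star-distribution stats...
star_stats = google["Overall Score"].value_counts()

# ...but are excluded from every text analysis.
text_df = google.dropna(subset=["Comment"])

# Placeholder locations stay in topic/sentiment/emotion analysis and are
# excluded only from per-location ranking.
ranking_df = text_df[~text_df["Location Name"].isin(PLACEHOLDER_LOCATIONS)]
```

The asymmetry lives in which derived frame each downstream step reads, not in a destructive drop at import time.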
Section 2 — Initial data investigation
Item 6: Preprocessing (lowercase, stopwords, numbers)
Heavy preprocessing HURTS BERTopic. Lemmatization increased outliers from 36.7% → 47.6%. Stemming flipped sentiment-relevant tokens in 22/50 (44%) of test reviews. BERT was trained on the full Zipf distribution (slope -1.034, R²=0.993) — stripping stopwords strips signal. Two-track pipeline: raw text feeds embeddings; preprocessed text feeds CountVectorizer for labels only.
NLTK's english list (179 words) misses the high-frequency content-light filler. The top negative-review words were dominated by `get, like, even, time, go, would, also, one, use, good, day, people, always, really, great, nice`. An extended `GENERIC_STOPS` set was added on top of NLTK + brand stops (`pure`, `gym`, `puregym`). The cohort tutor explicitly green-lit iterative stopword extension at the 2026-04-16 Q&A.
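A sketch of the layered stopword build. The `GENERIC_STOPS` subset below is illustrative (the real set was grown iteratively from the frequency tables), and sklearn's built-in English list stands in for NLTK's 179-word list to keep the example self-contained:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Illustrative subset of the extended GENERIC_STOPS set.
GENERIC_STOPS = {"get", "like", "even", "time", "go", "would", "also", "one",
                 "use", "good", "day", "people", "always", "really", "great", "nice"}
BRAND_STOPS = {"pure", "gym", "puregym"}

stops = list(ENGLISH_STOP_WORDS | GENERIC_STOPS | BRAND_STOPS)

# In the two-track pipeline this vectorizer only shapes topic labels
# (passed to BERTopic as vectorizer_model); raw text still feeds embeddings.
vectorizer = CountVectorizer(stop_words=stops)
X = vectorizer.fit_transform(["puregym staff always really friendly, machines broken"])
print(sorted(vectorizer.vocabulary_))  # → ['broken', 'friendly', 'machines', 'staff']
```

Brand names and generic filler drop out; content words survive into the topic labels.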
Item 9: Top-10 word bar plot
After the stopwords extension, top-10 finally surfaces content (`equipment, staff, classes, friendly, clean, machines`) instead of generic filler. The bar charts are styled with consistent palette across panels (Google blue, Trustpilot green) so the platform comparison reads at a glance.
Item 12: Negative-only filter
Filter is `Overall Score < 3` for Google, `Review Stars < 3` for Trustpilot. Yielded 5,931 negatives total; 4,137 of those came from 335 cross-platform locations after manual name-merging.
Section 3 — Initial topic modelling (BERTopic)
Items 13–15: BERTopic + top topics + top words
UMAP is non-deterministic by default. Topic IDs reshuffle between runs; any hardcoded `themes = {0: "Equipment", 1: "Staff"}` dict silently lies after a re-run. Fix: pass `UMAP(random_state=42, ...)` explicitly into BERTopic across all 4 fit_transform calls. Belt-and-braces: replace hardcoded label dict with keyword-rule labelling (`_THEME_RULES` + `_label_topic`) so labels track keywords not topic-IDs.
Defaults are starting points, not solutions. Every BERTopic component (embedding model, UMAP, HDBSCAN, vectorizer, representation) is a knob. Stack `KeyBERTInspired` + `MaximalMarginalRelevance` for cleaner topic labels. Apply `reduce_outliers(strategy="embeddings")` — outliers are clustering artefacts, not garbage data.
Item 16: Interactive topics visualisation
Plotly outputs may render blank on plain Jupyter / nbviewer. BERTopic's `visualize_topics()` saves as plotly JSON; renderers without widget support show "FigureWidget" placeholders. Mitigation: parallel `fig.write_image(...png)` insurance via kaleido. Caveat: kaleido didn't install cleanly on this Colab run, so the static-PNG fallback failed — the plotly figs still render in Colab/Jupyter, but the safety net wasn't there.
Item 18: Heatmap (topic similarity)
Heatmap shows cosine similarity between topic embeddings — link to Week 1.3.2 explicitly. cos = 1 (identical) → cos = 0 (orthogonal). Plus, a custom seaborn c-TF-IDF heatmap of top-10 topics × 14 most-discriminative words was added in the addendum to give the marker something more readable than the default similarity heatmap.
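The quantity the heatmap plots is just row-normalised dot products of the topic embeddings, a numpy-only sketch with toy 2-D embeddings:

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between topic embeddings (one row per topic)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

# Toy topic embeddings: topics 0 and 1 orthogonal, topic 2 between them.
topic_embs = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
sim = cosine_similarity_matrix(topic_embs)
# Diagonal is 1 (identical); orthogonal topics score 0.
```

Feeding `sim` to `seaborn.heatmap` reproduces the default similarity view; the c-TF-IDF variant swaps the embedding matrix for per-topic word weights.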
Item 19: Describe 10 clusters
Domain insight comes from specific findings: parking fines of exactly £85 (124 reviews), 4-hour class cancellation policy frustration (115), locker theft (112), water machines broken for weeks without management action (38). This granularity is where BERTopic excels over LDA, which merges related themes.
Section 4 — Further data investigation
Item 21: Top-20 negative-review locations per platform
Seven locations appear in both platforms' top-20 lists. London Stratford leads with 81 combined negative reviews. Locations are roughly similar across platforms when corrected for volume, with the 7-location overlap being the actionable signal for the worst-clubs programme.
Item 22: Cross-platform location merge
23 hand-curated cross-platform location pairs. Naive intersection 310 → normalised 312 → after manual merges 335. Most of the merges are retail-park / mall suffix variance; one is a Knaresborough typo. Done via rapidfuzz `token_set_ratio` scan (≥90 confidence threshold) + Pierre review.
Items 23 + 25: Top-30 wordcloud + top-30 BERTopic comments
The top-30 wordcloud surfaces different vocabulary from the full-dataset wordcloud — broad complaint terms give way to location-specific vocabulary (`mould`, `closed`, instructor names). Top-30 BERTopic acted as a different lens: location-specific issues invisible at full scale (mould in showers, individual gym closures despite 24/7 advertising, instructor-specific class complaints).
Honest reframe. The original report claimed top-30 was "sharper than full" (24.4% outliers vs 32.7%). After the seeded-UMAP fix, top-30 is actually fuzzier than full (37.1% vs 35.6%). Same finding under a more reproducible setup — top-30 is a different lens, not a sharper one.
Section 5 — Emotion analysis
Items 26–28: BERT emotion classifier import + run + bar plot
Emotion model `bhadresh-savani/bert-base-uncased-emotion` is rubric-mandated; not swapped. The "uncased" variant lowercases internally — strips ALL-CAPS anger signal that the training data (Twitter) actually preserved.
Twitter classifier hits politeness-repair on British 1–2★ reviews. 20.6% of 1-star reviews tagged "joy" — way above sarcasm base rate (2–5% per SARC/iSarcasm). Domain mismatch, not sarcasm. Polite British complaint openings ("I have been a loyal customer for three years, however…") read as joy. Frame: register-mismatch, not bug (Brown & Levinson 1987 + Biber & Conrad 2009).
Score-guided re-rank using the model's own probability vector — principled (Confident Learning, Northcutt 2021 JAIR; Snorkel weak supervision, Ratner 2017). Rule pre-specified BEFORE measuring downstream effect to avoid the forking-paths critique (Gelman & Loken 2013). `emotion_raw` preserved.
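A sketch of a pre-specified re-rank rule of this shape; the margin threshold and exact conditions here are illustrative, the project's actual rule lives in the notebook:

```python
# The model's own probability vector arbitrates: a low-confidence "joy"
# on a 1-2 star review is re-ranked to the runner-up emotion.
# emotion_raw is always preserved alongside the re-ranked label.
def rerank_emotion(probs: dict[str, float], stars: int,
                   margin: float = 0.20) -> dict[str, str]:
    ranked = sorted(probs, key=probs.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    label = top
    if top == "joy" and stars <= 2 and probs[top] - probs[runner_up] < margin:
        label = runner_up
    return {"emotion_raw": top, "emotion": label}

# Polite British 1-star complaint: weak "joy" flips to the runner-up.
print(rerank_emotion({"joy": 0.41, "anger": 0.35, "sadness": 0.24}, stars=1))
# → {'emotion_raw': 'joy', 'emotion': 'anger'}
```

Because the rule reads only the model's own scores and was frozen before measuring downstream effects, it stays inside the Confident-Learning framing rather than becoming post-hoc relabelling.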
Item 30: Anger-filtered BERTopic
2,537 reviews, 24.8% outliers. Narrowed primary anger drivers to membership cancellation, rude staff, equipment failures — same themes as the full run but at higher resolution. Anger filter strips out the more diffuse complaints.
Section 6 — Falcon → Qwen + LLM-driven topics
Item 31: Falcon-7b-Instruct
Substituted Qwen2.5-7B / 72B-Instruct. Falcon-7B on T4 was 50 hr / 600 reviews; Qwen on A100 is ~120 s / 600 — roughly 1,500× faster — multilingual (needed for the Danish/German residual reviews), structured output, Apache 2.0 (not gated). Instructor verbal green-light at the 2026-04-16 Q&A. Falcon notebook kept for side-by-side comparison.
Falcon's rubric prompts no longer reproduce. Russell explicitly noted in the 2026-04-24 cohort Q&A that the prompts he wrote into the rubric "worked the first time. And it didn't work this time. Because they've updated the models. You get drift." The Qwen swap is the methodologically correct response.
Item 35: LLM topic-extraction comment
The LLM-driven BERTopic run (Qwen extracts natural-language topic phrases per review; BERTopic meta-clusters the 5,999 phrase outputs) produced substantively different clusters dominated by intent-bearing phrases like "personal turnover" and "rude staff feedback" — capturing customer meaning in a way bag-of-words BERTopic cannot.
0-shot Qwen → 10-shot Qwen with Sonnet-derived examples. Operational-lever agreement 60% → 73%; churn-risk 53% → 70%; primary-topic Jaccard 0.124 → 0.166. Zero marginal cost on Colab Pro+. Coaching a small open model with frontier-model examples closes most of the gap.
Section 7 — Gensim LDA
Items 38–41: LDA + similarity comment
10 topics, coherence 0.449. LDA found language clusters BERTopic missed — it automatically separated Danish (Topic 5) and German (Topic 7) reviews from PureGym's Danish and Swiss operations. Different tools surface different signals; the multilingual contamination was a data-quality finding, not a curiosity. (The corpus was subsequently English-only-filtered to 5,828 reviews with langdetect; the non-English residue came in via Google, which has no language metadata — Trustpilot's `Review Language` column proved trustworthy.)
LDA also produced a clear billing cluster (Topic 9: membership, cancel, joining fee, payment) that aligned with Trustpilot's platform-specific topics. Cross-model agreement is the validation move when no two models share enough vocabulary for direct comparison.
Section 8 — Report
Item 42: 800–1000 word report
Trimmed Zipf's Law / ABSA / complaint DNA — those are beyond-rubric, save words for the course concepts (TF-IDF, cosine similarity). Final at 995 words → 1023 with the "Why Qwen" appendix paragraph + visual-polish edits, slightly over the 1000-word ceiling. Russell-tolerance applies; rubric is binding.
Item 47: Capture comments from earlier steps
Every "comment on" rubric item must be IN THE NOTEBOOK as a markdown cell — not just the report. Markers read the notebook. Same applies for the surprise findings: explicit text in the notebook, not buried in an appendix doc.
Overall — what the rubric didn't say but the project taught
Cross-model triangulation is the validation strategy. No two models share enough vocabulary for direct comparison; agreement on themes is the strongest validation move. Sonnet 4.6 gold-eval (30 held-out): operational-lever 60→73%, churn-risk 53→70%. j-hartmann emotion-english-distilroberta-base on stratified 200-sample is the cross-check for the Twitter-trained rubric model.
"List the right things, let stakeholders price them up" is the recommendation framework. PureGym's facilities, CX, and marketing teams will have £-numbers we don't. Surface the right candidates with confidence; let the stakeholders price.
AI deep research is a starting scaffold, not a final answer. Panel review caught Perplexity errors that would have embarrassed in submission — gym-format size (5,500–25,000 sq ft, not 2,500 boutique), ARPM (£22.64, not £21.60), EBITDA margin (29.7%, not 23%), Leonard Green acquisition year (2017 / $786m, not 2013). Cross-check before citing.
Documented + recurring → move from prompt to code. Prompt-level rules drift; code-level enforcement holds. `require_gpu(pipe)` beats "remember to assert"; `compile()` refusal beats "remember to escape heredocs". Three repeats was the threshold across the project.
Resource claims fabricated by an LLM are always wrong unless quoted from a real measurement. Cost / time / RAM / "save resources" estimates are <70% confidence by default. Pierre is on Claude Max 20× — no per-token API cost. The rule keeps a lot of made-up numbers out of the analysis.
Two parallel Claude sessions on the same repo can co-exist if scoped to different sections. Discovered 2026-04-25 evening — one session was visual-polishing report.md, the other was committing the extended consultant report at the same time. No conflict because edits were to different sections, but coordination is fragile; per-session branches would be safer.
My prompts — the learner's record
Pierre's prompts during the PACE project, curated for the moments where he was thinking out loud, asking why, checking his understanding, or genuinely struggling — not the moments where he was extracting work from the LLM. Verbatim, dictation slips and frustration kept in.
How to read this tab
Each card is a verbatim quote pulled from the Claude Code session JSONLs across the PACE project (2026-03-23 → 2026-04-25). Tags show the cognitive move:
Tags: why · mental-model · frustration · integrate · catch-mistake · spar · reflect · explore
Curated, not exhaustive. The extraction prompts ("do this", "fix that", "build me X") are excluded — those would dilute the signal this tab is trying to show.
How this differs from typical LLM prompting
Most published examples of LLM prompting are extractive: "Write me a marketing email about...", "Summarise this PDF", "Generate 10 ideas for...", "Fix this code". The user gets an output and moves on. The LLM does the cognitive work; the human reviews.
The prompts below are investigative: "why does X work this way?", "is that right? I thought it was Y", "can you walk me through Z again", "hold on, that doesn't match what we said earlier". The human is doing the cognitive work; the LLM is a tutor / sparring partner / verification layer.
The bar isn't "did you write the prompt well?" — it's "did you stay in the driver's seat?"
223 user prompts scanned across 10 session JSONLs (PACE + data-workbench, 13 Apr → 25 Apr 2026). 42 kept as learner-posture; 181 filtered out as extraction. No edits — every quote preserved as typed (typos, dictation slips, profanity all in).
Context. Apr 13 morning, very start of the project. Pierre has 4 minutes before the live tutor meeting and wants to enter it knowing what's on the rubric vs off-rubric, with prior 1:1 chat notes from his Sparx colleagues as a frame.
Learning move. Tools-as-prep, not tools-as-substitute — he's building his own readiness for a high-stakes human conversation by surfacing context he already owns (BrainDB chats) and locating it inside the rubric. The LLM is a research librarian here, not a ghostwriter.
session ref · ef726887:78
Context. Apr 13, opening hour of the first PACE session. Pierre dumps the WhatsApp class chat and asks for a peer-position read.
Learning move. Calibrating his own pace against the cohort — not 'do better than them' but 'where is the cohort frontier and am I mid-pack or ahead?' This is metacognition before content work.
session ref · ef726887:36
Context. Apr 13, trying to rescue a corrupted recording of his pitch session. Pierre is reasoning aloud about how file formats work and arguing toward a recovery strategy.
Learning move. Active hypothesising about an unknown file format — naming the source app, proposing 'treat as audio until word stops', checking encryption assumption ('it's not encrypted right'). This is engineering intuition out loud, not a delegate-and-wait.
session ref · ef726887:407
Context. Apr 13 evening, late in the first big session. Pierre is starting to think about post-PACE career steps and wants to anchor what he's already built against an industry stack.
Learning move. Mapping new vocabulary onto already-walked terrain — 'show me what we're doing and how that maps' is a learner asking for a translation table, not a tutorial. He's protecting the parts he understands and only buying explanation where the stack is unfamiliar.
session ref · ef726887:685
Context. Apr 13 evening, follow-up after Vertex was suggested as a learning target. Cuts straight to the differential.
Learning move. Refusing to learn a tool whose differential-value isn't named. Classic 'tell me what's on the other side that's not already on this side' question — the right one to ask before sinking time.
session ref · ef726887:710
Context. Apr 13 evening, expanding from PACE coursework to 'where could this live in industry'. He admits he can't articulate the question.
Learning move. Naming the limit of his own articulation — 'i dont know im not askign this question well' — and licensing the assistant to interpret rather than answer narrowly. Honest about the fuzz; trusting the iteration.
session ref · ef726887:715
Context. Apr 13, mid-afternoon. After his industry-mentor pitch session, Russell mentioned Isaac Physics in passing and Pierre didn't know what it was.
Learning move. Catching unfamiliar nouns in a debrief and flagging them for backfill. The two extra spaces and lowercase betray voice-dictation; the willingness to admit 'I don't know what that is' is the move.
session ref · ef726887:651
Context. Apr 14 morning, opening prompt of a follow-on session. Yesterday's session went late and cross-domain; today he wakes up and asks for the situation report before doing anything.
Learning move. Resisting the urge to dive back in. Forcing himself (and the tool) to re-state purpose before resuming work — the ADHD compensation move of 'orient before act'.
session ref · ef726887:947
Context. Apr 14 mid-morning. The walkthrough has gotten dense and Pierre wants to redo it from rubric-anchor outward.
Learning move. Anti-hype contract: 'if things are weak just say so don't hype things up'. Pierre is teaching the assistant a calibration norm because he doesn't want a glossy artefact, he wants an honest one. Also separating chart-comprehension ('what they show and don't show') from chart-production.
session ref · 9c69ef34:6
Context. Apr 14, deep into walkthrough revision. The pipeline is multi-stage (Trustpilot + Google, location merging, language filtering, emotion classifier) and Pierre needs a single visual.
Learning move. Asking for the artefact that proves data lineage, not the explanation of it. He wants drop counts and 2-3 actual before/afters because that's how he checks whether anyone (including him) actually understands the pipeline.
session ref · 9c69ef34:136
Context. Apr 14, mid-afternoon. Walkthrough page got rebuilt and Pierre lost the link in the chat scroll.
Learning move. Frustration as a signal that the loop has too many handoff points. Less a learning prompt than a marker — Pierre is the kind of learner who keeps the friction logs visible rather than smoothing them away.
session ref · 9c69ef34:321
Context. Apr 14 afternoon. The assistant claimed to have visually verified the page rendered bars correctly; Pierre's eyes saw no bars. The assistant had hallucinated from its own source code instead of looking at the rendered page.
Learning move. Diagnosing the failure mode rather than just yelling at it. He distinguishes 'research not good enough' vs 'follow through not good enough', proposes 'where are the gaps', and ends with 'how do we tighten this up' — a learning conversation about how the tool itself learns, not a one-off bug report. This is how Pierre teaches the assistant new norms.
session ref · 9c69ef34:620
Context. Apr 14, after the bar-hallucination episode. Pierre has read elsewhere that there's an autonomous overnight loop pattern; he asks the assistant to research it because his own assumption that there's a better way needs evidence.
Learning move. Trusting his own gut that 'the way you're doing it isn't' optimal — but explicitly licensing the tool to look at what's recent rather than trusting its training. Meta-learning about how to use the tool.
session ref · 9c69ef34:715
Context. Apr 16 morning. The course gave permission to swap the original Falcon model. Pierre doesn't just want a swap — he wants to know what the swap unlocks.
Learning move. Treating a model swap as a capability question, not a config change. 'See what new capabilities we have' is the move that turns a chore into learning. Also flags 'just think for a moment to make sure it's solid' — explicitly slowing the assistant down.
session ref · 3b7cb500:32
Context. Apr 16 morning. The assistant had drifted toward a small local model 'to be safe' on resources; Pierre interrupts with the actual constraints (he has Colab Pro A100 + HF Pro).
Learning move. Re-asserting the real constraint surface. Pierre's lived environment is GPU-rich; the assistant defaulted to scarcity. The 'wait stop' is also a literal pattern interruption — he's training the assistant on his actual stack.
session ref · 3b7cb500:53
Context. Apr 16, live during a tutor Q&A. Pierre is in a real call and quickly types 'where does the emotion stuff come in' to figure out which rubric section emotion classification belongs to so he can ask intelligently.
Learning move. Live-during-call sense-making. The voice-dictation slip ('amotion anger sadness') reveals he's typing in real time. He's not preparing — he's calibrating in the moment which question is well-formed.
session ref · 3b7cb500:78
Context. Apr 16, still in the same Q&A session. The model misclassifies UK polite-complaint as 'sad' instead of 'angry'. Pierre wants both the question framing and the downstream-stakes story.
Learning move. Two layers in one prompt: (1) help me FORM the question, (2) help me understand the BUSINESS IMPLICATION of the answer. He's not outsourcing the asking, he's rehearsing the asking. 'adhd friendly' is a UI directive — break it into chunks his brain can hold.
session ref · 3b7cb500:99
Context. Apr 16, just after the Q&A. Pierre wants the rubric tick-back AND first-20-rows in/out as a sanity layer on top of metrics.
Learning move. Insisting on row-level eyeballing alongside aggregate metrics. He doesn't trust 'accuracy: 0.78' — he wants to see what 20 actual examples look like through the pipeline. This is the data-scientist habit of physically reading samples.
session ref · 3b7cb500:106
Context. Apr 16 afternoon. Mid-build, Pierre asks 'what do you need from me' AND requests an audio walkthrough per rubric item.
Learning move. Active turn-taking — 'what do you need from me' is unusual; most users wait passively. And the audio request shows self-knowledge ('you know how I like to learn') — he's a multimodal/audio learner and is engineering the artefact around how his brain consumes information.
session ref · 3b7cb500:1174
Context. Apr 16 afternoon. Audio walkthroughs being shipped. Pierre wants 2-minute chunks autoplaying as a playlist, plus a 10-minute overview.
Learning move. Specifying the consumption format, not just the content. Two-minute chunks per rubric point + autoplay playlist = a learning loop tuned for ADHD attention budget. He thanked the assistant for the chunking idea — explicit reinforcement of useful structure.
session ref · 3b7cb500:1237
Context. Apr 17 evening, building the basic submission notebook. Pierre wants a specific structure: rubric line, then learnings, then code, repeating.
Learning move. Designing the artefact as a self-teaching document. Rubric-text > learnings-text > code is the structure of a study guide, not a deliverable. He's coding the notebook to teach his future self.
session ref · 1aaf31e5:90
Context. Apr 17 evening. Non-English reviews are showing up and skewing topic models / emotion classification.
Learning move. Cost-of-effort reasoning before committing. 'How many lines in that Python file' = is this worth the lift? He's weighing 'put it in the notebook more sophisticatedly' vs 'just drop non-UK locations' — two different surgical depths.
session ref · 1aaf31e5:280
Context. Apr 18 dawn. The ollama-to-HuggingFace migration has been thrashing. Pierre is exhausted and steps out of the bug to ask about the meta-process.
Learning move. Naming the failure pattern ('this is too reactive') and asking which skill in his framework should capture it ('is it auditor?'). He's not just frustrated — he's tagging the experience for permanent capture so the next migration doesn't relive the same loop.
session ref · 1aaf31e5:395
Context. Apr 18 morning. Continuing the post-mortem. Pierre wants to know what specifically should be lifted out of this PACE-specific pain into the general toolkit.
Learning move. Distillation move. Treating the painful migration not as wasted time but as raw material — the distinction between project-noise and reusable-IP is what turns a bad day into compound interest.
session ref · 1aaf31e5:539
Context. Apr 18 morning. Pierre is loud-CAPS frustrated with himself (and the tool) because he keeps treating PACE as one-off project work instead of as a vehicle for transferable skill.
Learning move. Re-orienting around the actual purpose of the project. The CAPS aren't anger at the assistant — they're at himself for losing the thread. The whole reason PACE exists in his world is for generalisable learnings; he's re-licensing that intent.
session ref · 1aaf31e5:557
Context. Apr 18 mid-morning. The same notebook has been edited ~8 times in-place during model migration, with no commits between rounds. Pierre suddenly asks the version-control question.
Learning move. Catching a destructive default before it bites. The frustration is real but the cognitive move is precise — naming 'destructive' is the right vocabulary, and the assistant confirms it ('Yes: destructive. No recovery points.'). Pierre's instinct on data discipline overrode the 'just keep going' urge.
session ref · 1aaf31e5:1081
Context. Apr 18 morning. The assistant claimed it 'couldn't reach' the brain DB. Pierre knows it's reachable from this machine — the assistant was reaching for an 'I can't' default.
Learning move. Refusing capability gaslighting. Pierre knows his infra; when the tool falsely claims it can't connect, he calls it out. The assistant's response in this turn was to admit 'that was lazy. Brain DB IS reachable; I was reaching for the I can't default.' This is sparring that improves the tool's behaviour.
session ref · 1aaf31e5:1161
Context. Apr 18 afternoon. Just after running through commits. Pierre pivots from 'commit it' to 'who in my system would catch a bad commit'.
Learning move. Asking who-not-what. Instead of describing checks he wants run, he asks which persona owns that responsibility — testing his mental map of the village/skill system. Verifying the system has eyes for an entire class of mistake (committed-secrets) before it bites him.
session ref · 4ee82f98:852
Context. Apr 18 afternoon. The assistant proposed installing a packaged CLI to use the workbench tool. Pierre prefers the lightest-weight integration.
Learning move. Pushing back on tool sprawl. He has a strong prior that installs are tax — and asks for consolidation. The assistant came back with two-line shell aliases instead of an install. Negotiating a smaller solution by stating his constraint, not by approving anyway.
session ref · e2d33ad0:515
Context. Apr 18 evening. After a parallel session merged the workbench skill, Pierre wants a status read AND a direction-to-go question in the same breath.
Learning move. Asking 'what do we need to change' rather than 'is it done'. Treats the merge as a draft rather than a deliverable; expects there's drift to fix. Default-skeptic stance.
session ref · 70109960:426
Context. Apr 18 evening. Substantial restructuring is being proposed; Pierre wants the village's diverse views before signing off.
Learning move. Requesting deliberate disagreement. He's built a multi-persona system specifically so he can stress-test his own decisions through different lenses; this prompt activates that lens-rotation.
session ref · 70109960:526
Context. Apr 24 evening, on a workbench session. The model trained on US tweets reads UK polite complaints as 'sad' rather than 'angry'. Pierre proposes collapsing emotion buckets and 1-2 star reviews into one churn-risk signal.
Learning move. Proposing a domain-justified decision and asking the tool to sanity-check it. 'Does that make sense?' followed by his own three-clause reasoning ('reason being if they stop coming, regardless of emotion, they want to know that'). He's articulating the consultancy logic and inviting challenge.
session ref · 45465c5b:108
Context. Apr 24 evening. The basic notebook is being modified in-place again, and Pierre catches it via a FileNotFoundError when columns moved without an index update. (This is the SECOND time he's caught this same pattern — see 1aaf31e5:1081 from Apr 18.)
Learning move. Pattern recognition across sessions. He saw this exact failure six days earlier and is now catching it instantly. The single phrase 'youre overwriteing again' is the compressed form of a learning that has stuck.
session ref · 45465c5b:561
Context. Apr 25 morning. Day after the second overwrite catch. Pierre adds the rule into the standing instruction set.
Learning move. Promoting a one-off frustration to a standing rule. 'You've got to do versioning right and commit to GitHub' is a lesson moving from incident to invariant.
session ref · 45465c5b:949
Context. Apr 25 morning. Two specific location IDs (345, 398) keep cropping up and Pierre can't remember what was decided about them.
Learning move. Honest about losing track ('I'm kind of losing track again'), naming the specific cognitive miss, then proposing the exact data move (pull both name lists + LLM-assisted fuzzy match) that will fix it. Self-aware of memory limit, decisive about how to compensate.
session ref · 45465c5b:868
Context. Apr 25 morning. Pipeline has bifurcations — a step applied to overall rating but not to per-location rating. Pierre needs the visual to show this.
Learning move. Diagram-thinking. He's spotted the asymmetry himself ('we took common locations and added to overall but not per-location') — the diagram request is to externalise his own internal model so he can verify it against the actual code.
session ref · 45465c5b:1051
Context. Apr 25 morning, on a parallel data-workbench session. Pierre wants to bring fresh learnings (anger+sadness merge, Sonnet+10-shot Qwen) back into an earlier write-up.
Learning move. Treating the report as a living document that gets re-passed when his understanding updates. 'Improve it with what we've learned here' is a deliberate iteration loop — he's not redoing it, he's upgrading it.
session ref · 5204cb6b:42
Context. Apr 25 afternoon. A new agent in this session was about to reinvent something the workbench skill already provides.
Learning move. Caveman-clear reset. 'You've lost the plot' is direct; the fix is even better — go read the claudemd and look around, don't ask me to re-explain the system. He's training the tool to ground itself in repo context before proposing.
session ref · 5204cb6b:742
Context. Apr 25 afternoon. Top-15 negative-review token list is dominated by 'staff', 'people', 'one', 'get', 'time' — words too generic to mean anything for churn analysis.
Learning move. Reading raw output and inferring a config bug ('we don't have good stop words'). Catching a 'random 2 typed in a cell' as separate noise. Real diagnostic instinct from the data, not from the assistant's narrative.
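The fix Pierre inferred — a domain stop-word layer on top of the defaults — can be sketched like this (the token list and the extra stop words here are illustrative, not the project's actual curated config):

```python
from collections import Counter

# Generic tokens that dominated the raw top-15 and carry no churn signal.
# Illustrative only -- the project's curated list may differ.
EXTRA_STOPWORDS = {"staff", "people", "one", "get", "time", "gym", "would"}

def top_tokens(tokens, base_stopwords, k=15):
    """Frequency count after layering domain stop words over the base list."""
    stop = set(base_stopwords) | EXTRA_STOPWORDS
    return Counter(t for t in tokens if t not in stop).most_common(k)

tokens = ["staff", "rude", "staff", "cancel", "time", "cancel", "dirty", "one"]
print(top_tokens(tokens, base_stopwords={"the", "a"}, k=3))
# [('cancel', 2), ('rude', 1), ('dirty', 1)]
```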
session ref · 5204cb6b:806
Context. Apr 25 afternoon. Tokens are still mostly low-signal. And the joy emotion class keeps hitting on 1-star reviews — clearly wrong.
Learning move. Three nested questions in one prompt: (1) here's evidence the model is misbehaving — show me the examples, (2) is there a better-trained UK model already, (3) if not, how expensive would training one be? This is exactly how a senior analyst escalates: see anomaly, ask for backup, scope the next step.
session ref · 5204cb6b:845
Context. Apr 25 morning. Pierre had been told 8 cohort-feedback patches included seeded UMAP for BERTopic. He confirms each patch one by one, and stops on UMAP to ask for the explanation rather than blindly accept it.
Learning move. Verifying mental model of UMAP role inside BERTopic; voice-dictation slip ('bird topic walls' = BERTopic calls) reveals he was thinking faster than typing. The 'just explain to me what that is again please' is the move — he won't ship code containing a concept he can't define.
session ref · b32262f3:40
Context. Apr 25 evening. Pierre has noticed two parallel sessions are producing two notebook variants and wants to reconcile them before going further.
Learning move. Catching a fork before it diverges further. 'What's the difference?' is the right diff-question — he's not asking which is correct, he's asking what the delta IS so he can decide.
session ref · b32262f3:520
Stack & process
How the PACE project was built — the tools, the splits between local and cloud compute, how skills + personas + brain integrate, and how the process changed over four weeks.
The compute split — local Python ↔ Colab GPU
Where each piece runs
Windows / Git Bash (dolphin laptop) — data wrangling (pandas), Trustpilot/Google Excel parsing, langdetect, light NLP (FreqDist, wordclouds), text preprocessing, report editing, git work, deploy. Anything CPU-bound or that doesn't need a GPU.
Google Colab Pro+ (A100 40GB) — every transformer-touching cell. BERT emotion classifier (27,666 reviews, batch 64), BERTopic (4 lenses), Qwen2.5-7B/72B-Instruct topic extraction, j-hartmann cross-check on stratified samples. The notebook is Colab-first: a single A100 Run All produces every output.
Hetzner CAX31 (Helsinki, ARM64) — the brain DB: PostgreSQL 16 + pgvector, where session telemetry, decisions, and learnings live. SSH-only access. Used for cross-session continuity.
Cloudflare Pages (free tier) — viewer for the deployed pages (this one, plus pace-study.pages.dev with the report + extended report + audio deck + cribsheet + walkthrough + row inspector + tips).
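The batch-64 pattern from the A100 emotion pass reduces to plain batching logic; a minimal sketch with the classifier stubbed out (the real run calls the BERT pipeline on GPU):

```python
from typing import Callable, Iterator

def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield fixed-size chunks; the last chunk may be short."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def classify_all(reviews: list[str],
                 classify_batch: Callable[[list[str]], list[str]],
                 batch_size: int = 64) -> list[str]:
    """Run a batch classifier over every review, preserving input order."""
    labels: list[str] = []
    for batch in batched(reviews, batch_size):
        labels.extend(classify_batch(batch))
    return labels

# Stub standing in for the GPU emotion classifier.
stub = lambda batch: ["joy" if "great" in r else "anger" for r in batch]
print(classify_all(["great gym", "broken showers", "great staff"], stub, batch_size=2))
# ['joy', 'anger', 'joy']
```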
The split that actually saves time
A clean Run All on Colab is non-negotiable for the transformer work — a kernel restart on the A100 costs under 30 seconds; cells running on stale state cost hours of debugging. Local Python handles everything before the transformer work: data validation, language filtering, location merges, stop-word curation. Each chunk runs where it's cheapest in cognitive tax.

The skill / persona / brain integration
Skills
Skills are markdown files with structured triggers; Claude Code loads them on demand. PACE-relevant skills include:
- workbench — the data-science toolkit (env vars, brain DB DSNs, HF token, Colab patterns, brain-vault paths)
- commit / commit-push-pr — git rituals
- done / wrap — session-close rituals
- visual-verify-loop — self-verify rendered web pages after deploy
- broadcast / persona-check-in — fan-out and fan-in across cognitive handles
Personas (cognitive handles)
Twelve personas comment on every substantive turn via a one-line-per-persona village footer. Each is a focused viewpoint, not a separate agent — they all share Claude's reasoning but emit findings in their own voice. Most relevant during PACE:
- 🕴️ Session Boss — purpose, budget, mode
- 📚 Librarian — recall, search, "what did we say earlier"
- 👮 Cop — drift detection, preference compliance
- 🧑⚖️ Auditor — session-close, repeat-mistake findings
- 💾 Archivist — commits and pushes
- 🗑️ Scrap — disk hygiene, sensitive-path leak detection
- 🔧 Steve — API specialist (rate limits, cost, regression)
- 🌿 Alex — learnings steward (3+ similar errors → suggest matching learning)
- 🦉 Maya — four-lens watchdog (Radar / Wisdom / Left Field / ADHD Coach)
- 🏗️ Bela — infra watchdog (don't reinvent existing services)
Brain DB (the spine)
PostgreSQL 16 + pgvector on Hetzner. Tables include v2.sessions, v2.turns, v2.persona_comments, v2.decisions, doc_chunks (1536-dim OpenAI text-embedding-3-small for semantic search). Every Claude Code turn writes to v2.turns via the village-footer-write hook; every persona comment writes to v2.persona_comments. Cross-session recall is via cosine similarity (/recall skill).
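The /recall lookup is a cosine-similarity ranking over stored embeddings — in SQL it's a pgvector ORDER BY over the distance operator; the same scoring can be sketched in plain Python (toy 3-dim vectors stand in for the 1536-dim embeddings, and the chunk texts are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recall(query_vec, chunks, k=2):
    """Return the k stored chunks most similar to the query embedding."""
    return sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)[:k]

chunks = [
    {"text": "BERTopic seeding decision", "vec": [1.0, 0.1, 0.0]},
    {"text": "git overwrite incident",    "vec": [0.0, 1.0, 0.2]},
    {"text": "UMAP random_state note",    "vec": [0.9, 0.2, 0.1]},
]
print([c["text"] for c in recall([1.0, 0.0, 0.0], chunks, k=2)])
# ['BERTopic seeding decision', 'UMAP random_state note']
```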
Brain-vault (Obsidian-format markdown at ~/brain-vault/) is the human-readable mirror — learnings, sessions, portfolio cards, skills, recordings.
How the framework grew over the project
Phase 1: Single notebook, manual everything
(2026-03-23 — 2026-04-10) Started with a single Jupyter notebook + raw Excel files. No version discipline; in-place overwrites the norm. Pierre handled all preprocessing decisions manually.
Phase 2: V3 pipeline split
(Apr 11 — 14) The pipeline was split into phase scripts, v3_01_eda.py through v3_12_actionable_findings.py — each phase produces deterministic artefacts. Manual gold-labelling discipline established (v3/LABELLING_TRAINING.md). First panel review ran (5 PhDs + practitioners critiquing methodology).
Phase 3: Workbench codification
(Apr 16 — 18) Cohort Q&A with Russell. Falcon → Qwen swap green-lit. data-workbench/ tooling lifted out of the project repo: apply.py (executor), render.py (live HTML dashboard), hooks/data-workbench-guard.sh (PreToolUse hook blocking destructive ops), preflight.py (require_gpu, require_files, require_clean_warnings). Pattern: documented + recurring → move from prompt to code.
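The preflight pattern — fail fast before an expensive run — can be sketched like this. Function names follow the ones listed above, but the bodies are hedged stand-ins, not the project's actual implementation; the GPU check uses an import probe rather than assuming torch is installed:

```python
import importlib.util
from pathlib import Path

def require_files(*paths: str) -> None:
    """Abort before the run if any required artefact is missing."""
    missing = [p for p in paths if not Path(p).exists()]
    if missing:
        raise FileNotFoundError(f"preflight: missing {missing}")

def require_gpu() -> bool:
    """Cheap probe: is torch importable and CUDA visible? (stand-in check)"""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

require_files(".")       # current directory always exists -> passes silently
print(require_gpu())     # True only when CUDA is actually visible
```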
Phase 4: Submission notebook + deploy
(Apr 18 — 19) basic/basic_notebook.ipynb ships — 48 rubric items × (verbatim rubric + "our learnings" + code cell). 53/53 cells executed on A100, max execution count 77, all 9 patched cells carry patched source. pace-study.pages.dev deployed (6 PIN-gated pages).
Phase 5: Honest reframe + extended report + visual polish
(Apr 24 — 25) Cohort feedback v2 patches (8 cells, intersection 312→335, exclude 345/398 from rankings). Sonnet validation flips shift-worker headline from "1,177 verified" to "24/7-praise filter, 3% confirmed". EXTENDED_REPORT.md (40 KB consultant memo) shipped; deployed at pace-study.pages.dev/extended. Russell 1:1 + cohort Q&A transcribed via Deepgram. v3 visual polish ships: bar palette, heatmap, plotly→PNG fallback (kaleido didn't install — fallback printed and plotly stayed as HTML).
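The kaleido fallback is a generic try-PNG-then-keep-HTML pattern. A minimal sketch with the exporters passed in as callables (the real code calls plotly's fig.write_image / fig.write_html; the stubs below are illustrative):

```python
from typing import Callable

def save_figure(write_png: Callable[[str], None],
                write_html: Callable[[str], None],
                stem: str) -> str:
    """Prefer a static PNG; if the PNG backend (e.g. kaleido) fails, keep HTML."""
    try:
        write_png(stem + ".png")
        return stem + ".png"
    except Exception as exc:
        print(f"PNG export failed ({exc!r}); falling back to HTML")
        write_html(stem + ".html")
        return stem + ".html"

# Stub: PNG export raising the way a missing kaleido install would.
def broken_png(path: str) -> None:
    raise RuntimeError("kaleido not installed")

saved = save_figure(broken_png, lambda path: None, "heatmap")
print(saved)
# heatmap.html
```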
Models & tools used
LLMs & ML models
- Qwen2.5-7B-Instruct (Apache 2.0, Colab A100, primary topic extraction)
- Qwen2.5-72B-Instruct (HuggingFace Inference Providers, larger-context exploration)
- Claude Sonnet 4.6 (gold-label production, 200-sample shift-worker validation, 30-row gold-eval)
- Claude Opus 4.7 (meta-orchestration in this Claude Code session)
- BERT-base-uncased-emotion (bhadresh-savani) (rubric-mandated 6-class emotion classifier)
- j-hartmann/emotion-english-distilroberta-base (cross-check on stratified 200-sample)
- Falcon-7B-Instruct (rubric-mandated original; substituted for Qwen with documented rationale)
- BERTopic (4 lenses: full, top-30, anger-filtered, LLM-driven; UMAP+HDBSCAN+CountVectorizer+KeyBERT+MMR)
- Gensim LDA (10 topics, coherence 0.449, multilingual cluster detection)
- OpenAI text-embedding-3-small (1536-dim, brain DB doc_chunks)
- Sentence-Transformers all-MiniLM-L6-v2 (BERTopic default embedding backbone)
- paraphrase-multilingual-MiniLM-L12-v2 (multilingual variant for non-English residue)
- Deepgram Nova-2 (10 voice recordings transcribed in 1 background agent run)
- Perplexity Sonar Deep Research (industry context, sense-checked against Companies House)
APIs & external services
- HuggingFace Hub (model loading via HF_TOKEN; HF Inference API deprecated for gated models)
- Anthropic API (Sonnet eval batches; Claude Code itself)
- Deepgram API (Nova-2 transcription, diarized)
- OpenAI API (text-embedding-3-small for semantic search)
- Perplexity API (Sonar Deep Research)
- Cloudflare Pages API (deploy via wrangler + project-create REST)
- GitHub API (commits, push, PR work)
- Canvas LMS API (course content scraping for cohort context)
- Companies House API (FY2024 filing lookup)
- Vaultwarden / Bitwarden API (secret rendering)
- ntfy (mobile notifications when overnight runs finish)
Python ecosystem
- Data: pandas, numpy, openpyxl
- NLP: nltk, langdetect, wordcloud, gensim, pyLDAvis
- Transformers: transformers, sentence-transformers, accelerate, kaleido (intended)
- Topic modelling: bertopic, umap-learn, hdbscan
- Plotting: matplotlib, seaborn, plotly
- Validation: rapidfuzz (cross-platform location matching)
- HTTP: httpx, requests
- Notebook tooling: nbformat, jupyter_client
- Git: gitpython (sparingly; mostly via shell)
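The rapidfuzz cross-platform location matching reduces to best-score pairing of two name lists. A stdlib sketch with difflib as a stand-in scorer (rapidfuzz's scorers are faster and score 0–100 rather than 0–1; the location names here are illustrative):

```python
from difflib import SequenceMatcher

def best_match(name: str, candidates: list[str], cutoff: float = 0.8):
    """Pair a Trustpilot location name with its closest Google counterpart."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

google = ["PureGym London Holborn", "PureGym Leeds City Centre"]
print(best_match("Puregym London - Holborn", google))
# PureGym London Holborn
```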
Editorial & deploy stack
- Claude Code (CLI) — primary work environment
- VS Code — secondary editor
- Mermaid v11 — flowchart rendering in reports + this page
- Chart.js — analytics chart on Tab 5
- Wrangler — Cloudflare Pages deploy CLI
- Tailscale — private network for Hetzner brain DB access
- SSH — Hetzner brain DB ops
- Bitwarden CLI / Vaultwarden — secret rendering
Prompt analytics
A look at the data behind four weeks of Claude Code sessions on the PACE project. Most of these numbers were computed by an analytics agent reading the session JSONLs directly — the rest are pulled from git, brain-vault, and the deployed notebook.
Tokens — the cost of doing this
Prompt caching carried over 99% of the input weight: a 5,448× multiplier of cached reads over fresh input. Without caching, the input-token bill would have been ~5,400× larger.
Activity timeline
What the data showed when I went looking
April 18 was the rebuild-everything day
96 prompts, 608 tool calls in 14.82h
The day Pierre lifted the workbench to a standalone repo, patched the notebook into v2/v3 staged variants, shipped 9 rubric-gap fixes, and recorded the longest single sitting. Roughly 2× the prompts of any other day.
Tool-call density: every typed prompt triggers ~9 actions
8.7 tool calls per prompt; 48% are Bash
257 typed prompts produced 2,245 tool calls. Pierre talks short; the agent does long. Bash dominates because Colab/git/Python checks all run as shell.
Prompt caching carried 99% of the input weight
946,132,882 cached-read tokens vs 173,656 fresh input tokens
~94.5% of model context across the 17 sessions came from cache hits. Without prompt caching, this project's input-token bill would have been ~5,400× larger.
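The multiplier is straightforward arithmetic over the two token counts reported above:

```python
cached_read = 946_132_882   # input tokens served from the prompt cache
fresh_input = 173_656       # input tokens actually sent fresh

multiplier = cached_read / fresh_input
cache_share = cached_read / (cached_read + fresh_input)

print(f"{multiplier:,.0f}x")    # 5,448x
print(f"{cache_share:.2%}")     # 99.98%
```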
Longest single session was a 25-hour overnight
25.4h — session ef726887 (27 prompts)
Pierre kept one session alive across the BERTopic representation upgrade, the Sparx research dump, and meeting recovery. Eight other sessions ran 4h+; eight more were sub-1h hit-and-runs. Bursty, not steady.
Six notebook variants kept on disk simultaneously
basic_notebook.ipynb, basic_notebook_appendix.ipynb, basic_notebook_patched_v2.ipynb, basic_notebook_patched_v3.ipynb, basic_notebook_v2_pending.ipynb, basic_notebook_v3_pending.ipynb
Pierre's iteration discipline: never overwrite, always stage. v2_pending and v3_pending sat alongside the canonical until each A100 Colab run validated their patch sets.
Frustration ratio held under 6%
15/257 prompts (5.8%)
Across 61 tracebacks/exceptions caught in tool output, only 15 prompts contained frustration markers (fuck/stuck/why isn). Errors were debugged, not vented at.
Submission notebook + git
| Metric | Value |
|---|---|
| Total cells in canonical notebook | 127 (53 code, 74 markdown) |
| Lines of code in notebook | 873 |
| Notebook size on disk | 2.6 MB |
| Notebook variants kept on disk simultaneously | 6 |
| Brain-vault learnings authored in window | 32 |
| Git commits | 36 (first 2026-04-10, last 2026-04-25) |
| Lines added / removed | +436,859 / −78,404 |
| Files changed total | 571 |
| Most-touched file | basic/basic_notebook.ipynb (9 commits) |
| Frustration ratio | 15/257 (5.8%) |
| Prompt length (median / p90) | 119 / 3,200 chars |
Stack inventory
Models / LLMs
BERT-base-uncased · BERTopic · Claude Opus 4.7 · Claude Sonnet 4.6 · Deepgram Nova-2 · DistilRoBERTa · Falcon-7B · Gemini 3 Flash · GoEmotions · LDA (Gensim) · Llama-3 · OpenAI text-embedding-3 · Perplexity Sonar · Qwen2.5-7B · RoBERTa · Sentence-BERT (MiniLM) · Whisper · j-hartmann/emotion-english
External APIs
Anthropic API · Cloudflare Pages · Deepgram API · Google Colab · HuggingFace Hub · Perplexity API · PostgreSQL (brain DB) · spaCy · Tailscale · Vaultwarden · ntfy
Tools (Claude Code primitives + MCP)
Agent · Bash · Edit · Glob · Grep · NotebookEdit · Read · TaskCreate · TaskList · TaskOutput · TaskStop · TaskUpdate · ToolSearch · WebFetch · WebSearch · Write · mcp__claude-in-chrome__find · mcp__claude-in-chrome__form_input · mcp__claude-in-chrome__get_page_text · mcp__claude-in-chrome__javascript_tool · mcp__claude-in-chrome__navigate · mcp__claude-in-chrome__read_console_messages · mcp__claude-in-chrome__read_page · mcp__claude-in-chrome__resize_window · mcp__claude-in-chrome__tabs_context_mcp · mcp__claude-in-chrome__tabs_create_mcp · mcp__claude_ai_Firecrawl__firecrawl_scrape · mcp__claude_ai_Notion__notion-fetch · mcp__claude_ai_Notion__notion-search