PureGym recommendations
High-confidence picks from 27,666 customer reviews + 43 Glassdoor staff reviews + Companies House FY2024 filing. Tagged OpEx (recurring £/year) vs CapEx (one-off £). No fabricated £ figures — implementation costs are for PureGym's facilities, CX, and marketing teams to price.
What to do Monday morning
What was deliberately not claimed
- £ savings figures per recommendation.
- Cost / time / size estimates beyond the Companies House filing.
- Predictions about churn impact at sub-12-month horizons (Lindstrom power analysis: 50 bps churn effects are statistically invisible at PureGym scale within a year).
- Causal identification per pilot — single-site before / after is a weak counterfactual.
- The "shift workers" headline (3% verified after 200-sample Sonnet zero-shot — re-framed as 24/7-flexibility before publish, saved a 30× wrong number).
The data behind these picks
| Source | Volume | Role |
|---|---|---|
| Trustpilot reviews | 15,815 | Customer-side text + reply-latency feature |
| Google reviews | 11,851 (+ 9,352 stars-only dropped) | Customer-side text |
| Glassdoor staff reviews | 43 | Staff-side counter-evidence (music null) |
| Companies House FY2024 filing | 1 PDF | Real numbers — vindicates panel review on Perplexity errata |
| Sonnet 4.6 200-sample zero-shot | 200 | Validation — flipped shift-worker headline 30× |
| BERT emotion classifier (rubric-mandated) | 27,666 | Six-class emotion + score-guided OOD re-rank for British 1–2★ joy |
| BERTopic (4 lenses) + Gensim LDA | 5,931 negatives (4,137 cross-platform) | Topic discovery, location specificity, multilingual cluster detection |
| Qwen2.5-7B-Instruct on Colab A100 | 2,537 anger-filtered → 5,999 phrases | Natural-language topic extraction → meta-clustered with BERTopic |
Full long-form record — every refinement, every panel-review critique, every dropped hypothesis — is in the addendum cells of the submission notebook (`basic/basic_notebook.ipynb`, cells 117–127) and `LESSONS_ADDENDUM.md`.
Learnings by rubric item + overall
The CAM_DS_301 rubric has 48 numbered criteria across 8 sections. This is what we learned mapping each one to PureGym's review corpus — the surprises, the substitutions, the things the rubric implied but didn't say out loud.
Section 1 — Importing & cleaning
Items 1–3: Excel import + drop-NaN
Trustpilot's UI requires text; Google's doesn't. Of 23,250 Google reviews, 9,352 (40%) have no Comment at all — Google's review widget allows star-only submissions. Dropping these is correct, but the asymmetry matters: keep them in star-distribution stats, exclude only from text analyses.
Inspect placeholder/numeric values before dropping rows. 216 Trustpilot rows had numeric Location Name placeholders (`345`, `398`). Pierre's first instinct was to drop them. A Sonnet 4.6 read-through of all 216 reviews showed they're real PureGym UK reviews — the placeholders are multi-site catch-all buckets (`345` aggregates 9+ London gyms; `398` is dominantly Shrewsbury). They stay in topic / sentiment / emotion analysis but are excluded from per-location ranking only.
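Both rules reduce to column-level filters. A minimal pandas sketch, assuming the source's column names (`Comment`, `Location Name`, `Overall Score`) and using a toy frame in place of the real export:

```python
import pandas as pd

# Toy stand-in for the Google export; real data has 23,250 rows.
google = pd.DataFrame({
    "Overall Score": [5, 1, 2, 4],
    "Comment": ["Great gym", None, "Broken machines", None],
    "Location Name": ["Leeds", "345", "398", "Leeds"],
})

PLACEHOLDER_LOCATIONS = {"345", "398"}  # multi-site catch-all buckets

# Star-only rows stay in star-distribution stats...
star_stats = google["Overall Score"].value_counts()

# ...but are excluded from every text analysis.
text_df = google.dropna(subset=["Comment"])

# Placeholder locations stay in topic/sentiment/emotion analysis and are
# excluded only from per-location ranking.
ranking_df = text_df[~text_df["Location Name"].isin(PLACEHOLDER_LOCATIONS)]
```

The asymmetry lives in which derived frame each downstream step reads, not in a destructive drop at import time.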
Section 2 — Initial data investigation
Item 6: Preprocessing (lowercase, stopwords, numbers)
Heavy preprocessing HURTS BERTopic. Lemmatization increased outliers from 36.7% → 47.6%. Stemming flipped sentiment-relevant tokens in 22/50 (44%) of test reviews. BERT was trained on the full Zipf distribution (slope -1.034, R²=0.993) — stripping stopwords strips signal. Two-track pipeline: raw text feeds embeddings; preprocessed text feeds CountVectorizer for labels only.
NLTK's english list (179 words) misses the high-frequency content-light filler. The top negative-review words were dominated by `get, like, even, time, go, would, also, one, use, good, day, people, always, really, great, nice`. An extended `GENERIC_STOPS` set was added on top of NLTK + brand stops (`pure`, `gym`, `puregym`). The cohort tutor explicitly green-lit iterative stopword extension at the 2026-04-16 Q&A.
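A sketch of the layered stopword build. The `GENERIC_STOPS` subset below is illustrative (the real set was grown iteratively from the frequency tables), and sklearn's built-in English list stands in for NLTK's 179-word list to keep the example self-contained:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Illustrative subset of the extended GENERIC_STOPS set.
GENERIC_STOPS = {"get", "like", "even", "time", "go", "would", "also", "one",
                 "use", "good", "day", "people", "always", "really", "great", "nice"}
BRAND_STOPS = {"pure", "gym", "puregym"}

stops = list(ENGLISH_STOP_WORDS | GENERIC_STOPS | BRAND_STOPS)

# In the two-track pipeline this vectorizer only shapes topic labels
# (passed to BERTopic as vectorizer_model); raw text still feeds embeddings.
vectorizer = CountVectorizer(stop_words=stops)
X = vectorizer.fit_transform(["puregym staff always really friendly, machines broken"])
print(sorted(vectorizer.vocabulary_))  # → ['broken', 'friendly', 'machines', 'staff']
```

Brand names and generic filler drop out; content words survive into the topic labels.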
Item 9: Top-10 word bar plot
After the stopwords extension, top-10 finally surfaces content (`equipment, staff, classes, friendly, clean, machines`) instead of generic filler. The bar charts are styled with consistent palette across panels (Google blue, Trustpilot green) so the platform comparison reads at a glance.
Item 12: Negative-only filter
Filter is `Overall Score < 3` for Google, `Review Stars < 3` for Trustpilot. Yielded 5,931 negatives total; 4,137 of those came from 335 cross-platform locations after manual name-merging.
Section 3 — Initial topic modelling (BERTopic)
Items 13–15: BERTopic + top topics + top words
UMAP is non-deterministic by default. Topic IDs reshuffle between runs; any hardcoded `themes = {0: "Equipment", 1: "Staff"}` dict silently lies after a re-run. Fix: pass `UMAP(random_state=42, ...)` explicitly into BERTopic across all 4 fit_transform calls. Belt-and-braces: replace hardcoded label dict with keyword-rule labelling (`_THEME_RULES` + `_label_topic`) so labels track keywords not topic-IDs.
Defaults are starting points, not solutions. Every BERTopic component (embedding model, UMAP, HDBSCAN, vectorizer, representation) is a knob. Stack `KeyBERTInspired` + `MaximalMarginalRelevance` for cleaner topic labels. Apply `reduce_outliers(strategy="embeddings")` — outliers are clustering artefacts, not garbage data.
Item 16: Interactive topics visualisation
Plotly outputs may render blank on plain Jupyter / nbviewer. BERTopic's `visualize_topics()` saves as plotly JSON; renderers without widget support show "FigureWidget" placeholders. Mitigation: parallel `fig.write_image(...png)` insurance via kaleido. Caveat: kaleido didn't install cleanly on this Colab run, so the static-PNG fallback failed — the plotly figs still render in Colab/Jupyter, but the safety net wasn't there.
Item 18: Heatmap (topic similarity)
Heatmap shows cosine similarity between topic embeddings — link to Week 1.3.2 explicitly. cos = 1 (identical) → cos = 0 (orthogonal). Plus, a custom seaborn c-TF-IDF heatmap of top-10 topics × 14 most-discriminative words was added in the addendum to give the marker something more readable than the default similarity heatmap.
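The quantity the heatmap plots is just row-normalised dot products of the topic embeddings, a numpy-only sketch with toy 2-D embeddings:

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between topic embeddings (one row per topic)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

# Toy topic embeddings: topics 0 and 1 orthogonal, topic 2 between them.
topic_embs = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
sim = cosine_similarity_matrix(topic_embs)
# Diagonal is 1 (identical); orthogonal topics score 0.
```

Feeding `sim` to `seaborn.heatmap` reproduces the default similarity view; the c-TF-IDF variant swaps the embedding matrix for per-topic word weights.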
Item 19: Describe 10 clusters
Domain insight comes from specific findings: parking fines of exactly £85 (124 reviews), 4-hour class cancellation policy frustration (115), locker theft (112), water machines broken for weeks without management action (38). This granularity is where BERTopic excels over LDA, which merges related themes.
Section 4 — Further data investigation
Item 21: Top-20 negative-review locations per platform
Seven locations appear in both platforms' top-20 lists. London Stratford leads with 81 combined negative reviews. Locations are roughly similar across platforms when corrected for volume, with the 7-location overlap being the actionable signal for the worst-clubs programme.
Item 22: Cross-platform location merge
23 hand-curated cross-platform location pairs. Naive intersection 310 → normalised 312 → after manual merges 335. Most of the merges are retail-park / mall suffix variance; one is a Knaresborough typo. Done via rapidfuzz `token_set_ratio` scan (≥90 confidence threshold) + Pierre review.
Items 23 + 25: Top-30 wordcloud + top-30 BERTopic comments
The top-30 wordcloud surfaces different vocabulary from the full-dataset wordcloud — broad complaint terms give way to location-specific vocabulary (`mould`, `closed`, instructor names). Top-30 BERTopic acted as a different lens: location-specific issues invisible at full scale (mould in showers, individual gym closures despite 24/7 advertising, instructor-specific class complaints).
Honest reframe. The original report claimed top-30 was "sharper than full" (24.4% outliers vs 32.7%). After the seeded-UMAP fix, top-30 is actually fuzzier than full (37.1% vs 35.6%). Same finding under a more reproducible setup — top-30 is a different lens, not a sharper one.
Section 5 — Emotion analysis
Items 26–28: BERT emotion classifier import + run + bar plot
Emotion model `bhadresh-savani/bert-base-uncased-emotion` is rubric-mandated; not swapped. The "uncased" variant lowercases internally — strips ALL-CAPS anger signal that the training data (Twitter) actually preserved.
Twitter classifier hits politeness-repair on British 1–2★ reviews. 20.6% of 1-star reviews tagged "joy" — way above sarcasm base rate (2–5% per SARC/iSarcasm). Domain mismatch, not sarcasm. Polite British complaint openings ("I have been a loyal customer for three years, however…") read as joy. Frame: register-mismatch, not bug (Brown & Levinson 1987 + Biber & Conrad 2009).
Score-guided re-rank using the model's own probability vector — principled (Confident Learning, Northcutt 2021 JAIR; Snorkel weak supervision, Ratner 2017). Rule pre-specified BEFORE measuring downstream effect to avoid the forking-paths critique (Gelman & Loken 2013). `emotion_raw` preserved.
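A sketch of a pre-specified re-rank rule of this shape; the margin threshold and exact conditions here are illustrative, the project's actual rule lives in the notebook:

```python
# The model's own probability vector arbitrates: a low-confidence "joy"
# on a 1-2 star review is re-ranked to the runner-up emotion.
# emotion_raw is always preserved alongside the re-ranked label.
def rerank_emotion(probs: dict[str, float], stars: int,
                   margin: float = 0.20) -> dict[str, str]:
    ranked = sorted(probs, key=probs.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    label = top
    if top == "joy" and stars <= 2 and probs[top] - probs[runner_up] < margin:
        label = runner_up
    return {"emotion_raw": top, "emotion": label}

# Polite British 1-star complaint: weak "joy" flips to the runner-up.
print(rerank_emotion({"joy": 0.41, "anger": 0.35, "sadness": 0.24}, stars=1))
# → {'emotion_raw': 'joy', 'emotion': 'anger'}
```

Because the rule reads only the model's own scores and was frozen before measuring downstream effects, it stays inside the Confident-Learning framing rather than becoming post-hoc relabelling.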
Item 30: Anger-filtered BERTopic
2,537 reviews, 24.8% outliers. Narrowed primary anger drivers to membership cancellation, rude staff, equipment failures — same themes as the full run but at higher resolution. Anger filter strips out the more diffuse complaints.
Section 6 — Falcon → Qwen + LLM-driven topics
Item 31: Falcon-7b-Instruct
Substituted Qwen2.5-7B / 72B-Instruct. Falcon-7B on T4 was 50 hr / 600 reviews; Qwen on A100 is ~120 s / 600 — roughly 1,500× faster — multilingual (needed for the Danish/German residual reviews), structured output, Apache 2.0 (not gated). Instructor verbal green-light at the 2026-04-16 Q&A. Falcon notebook kept for side-by-side comparison.
Falcon's rubric prompts no longer reproduce. Russell explicitly noted in the 2026-04-24 cohort Q&A that the prompts he wrote into the rubric "worked the first time. And it didn't work this time. Because they've updated the models. You get drift." The Qwen swap is the methodologically correct response.
Item 35: LLM topic-extraction comment
The LLM-driven BERTopic run (Qwen extracts natural-language topic phrases per review; BERTopic meta-clusters the 5,999 phrase outputs) produced substantively different clusters dominated by intent-bearing phrases like "personal turnover" and "rude staff feedback" — capturing customer meaning in a way bag-of-words BERTopic cannot.
0-shot Qwen → 10-shot Qwen with Sonnet-derived examples. Operational-lever agreement 60% → 73%; churn-risk 53% → 70%; primary-topic Jaccard 0.124 → 0.166. Zero marginal cost on Colab Pro+. Coaching a small open model with frontier-model examples closes most of the gap.
Section 7 — Gensim LDA
Items 38–41: LDA + similarity comment
10 topics, coherence 0.449. LDA found language clusters BERTopic missed — it automatically separated Danish (Topic 5) and German (Topic 7) reviews from PureGym's Danish and Swiss operations. Different tools surface different signals; the multilingual contamination was a data-quality finding, not a curiosity. (The corpus was subsequently English-only-filtered to 5,828 reviews with langdetect; the non-English residue came in via Google, which has no language metadata — Trustpilot's `Review Language` column proved trustworthy.)
LDA also produced a clear billing cluster (Topic 9: membership, cancel, joining fee, payment) that aligned with Trustpilot's platform-specific topics. Cross-model agreement is the validation move when no two models share enough vocabulary for direct comparison.
Section 8 — Report
Item 42: 800–1000 word report
Trimmed Zipf's Law / ABSA / complaint DNA — those are beyond-rubric, save words for the course concepts (TF-IDF, cosine similarity). Final at 995 words → 1023 with the "Why Qwen" appendix paragraph + visual-polish edits, slightly over the 1000-word ceiling. Russell-tolerance applies; rubric is binding.
Item 47: Capture comments from earlier steps
Every "comment on" rubric item must be IN THE NOTEBOOK as a markdown cell — not just the report. Markers read the notebook. Same applies for the surprise findings: explicit text in the notebook, not buried in an appendix doc.
Overall — what the rubric didn't say but the project taught
Cross-model triangulation is the validation strategy. No two models share enough vocabulary for direct comparison; agreement on themes is the strongest validation move. Sonnet 4.6 gold-eval (30 held-out): operational-lever 60→73%, churn-risk 53→70%. j-hartmann emotion-english-distilroberta-base on stratified 200-sample is the cross-check for the Twitter-trained rubric model.
"List the right things, let stakeholders price them up" is the recommendation framework. PureGym's facilities, CX, and marketing teams will have £-numbers we don't. Surface the right candidates with confidence; let the stakeholders price.
AI deep research is a starting scaffold, not a final answer. Panel review caught Perplexity errors that would have embarrassed in submission — gym-format size (5,500–25,000 sq ft, not 2,500 boutique), ARPM (£22.64, not £21.60), EBITDA margin (29.7%, not 23%), Leonard Green acquisition year (2017 / $786m, not 2013). Cross-check before citing.
Documented + recurring → move from prompt to code. Prompt-level rules drift; code-level enforcement holds. `require_gpu(pipe)` beats "remember to assert"; `compile()` refusal beats "remember to escape heredocs". Three repeats was the threshold across the project.
Resource claims fabricated by an LLM are always wrong unless quoted from a real measurement. Cost / time / RAM / "save resources" estimates are <70% confidence by default. Pierre is on Claude Max 20× — no per-token API cost. The rule keeps a lot of made-up numbers out of the analysis.
Two parallel Claude sessions on the same repo can co-exist if scoped to different sections. Discovered 2026-04-25 evening — one session was visual-polishing report.md, the other was committing the extended consultant report at the same time. No conflict because edits were to different sections, but coordination is fragile; per-session branches would be safer.
My prompts — the learner's record
Pierre's prompts during the PACE project, curated for the moments where he was thinking out loud, asking why, checking his understanding, or genuinely struggling — not the moments where he was extracting work from the LLM. Verbatim, dictation slips and frustration kept in.
How to read this tab
Each card is a verbatim quote pulled from the Claude Code session JSONLs across the PACE project (2026-03-23 → 2026-04-25). Tags show the cognitive move:
Tags: why · mental-model · frustration · integrate · catch-mistake · spar · reflect · explore
Curated, not exhaustive. The extraction prompts ("do this", "fix that", "build me X") are excluded — those would dilute the signal this tab is trying to show.
How this differs from typical LLM prompting
Most published examples of LLM prompting are extractive: "Write me a marketing email about...", "Summarise this PDF", "Generate 10 ideas for...", "Fix this code". The user gets an output and moves on. The LLM does the cognitive work; the human reviews.
The prompts below are investigative: "why does X work this way?", "is that right? I thought it was Y", "can you walk me through Z again", "hold on, that doesn't match what we said earlier". The human is doing the cognitive work; the LLM is a tutor / sparring partner / verification layer.
The bar isn't "did you write the prompt well?" — it's "did you stay in the driver's seat?"
223 user prompts scanned across 10 session JSONLs (PACE + data-workbench, 13 Apr → 25 Apr 2026). 42 kept as learner-posture; 181 filtered out as extraction. No edits — every quote preserved as typed (typos, dictation slips, profanity all in).
Context. Apr 13 morning, very start of the project. Pierre has 4 minutes before the live tutor meeting and wants to enter it knowing what's on the rubric vs off-rubric, with prior 1:1 chat notes from his Sparx colleagues as a frame.
Learning move. Tools-as-prep, not tools-as-substitute — he's building his own readiness for a high-stakes human conversation by surfacing context he already owns (BrainDB chats) and locating it inside the rubric. The LLM is a research librarian here, not a ghostwriter.
session ref · ef726887:78
Context. Apr 13, opening hour of the first PACE session. Pierre dumps the WhatsApp class chat and asks for a peer-position read.
Learning move. Calibrating his own pace against the cohort — not 'do better than them' but 'where is the cohort frontier and am I mid-pack or ahead?' This is metacognition before content work.
session ref · ef726887:36
Context. Apr 13, trying to rescue a corrupted recording of his pitch session. Pierre is reasoning aloud about how file formats work and arguing toward a recovery strategy.
Learning move. Active hypothesising about an unknown file format — naming the source app, proposing 'treat as audio until word stops', checking encryption assumption ('it's not encrypted right'). This is engineering intuition out loud, not a delegate-and-wait.
session ref · ef726887:407
Context. Apr 13 evening, late in the first big session. Pierre is starting to think about post-PACE career steps and wants to anchor what he's already built against an industry stack.
Learning move. Mapping new vocabulary onto already-walked terrain — 'show me what we're doing and how that maps' is a learner asking for a translation table, not a tutorial. He's protecting the parts he understands and only buying explanation where the stack is unfamiliar.
session ref · ef726887:685
Context. Apr 13 evening, follow-up after Vertex was suggested as a learning target. Cuts straight to the differential.
Learning move. Refusing to learn a tool whose differential-value isn't named. Classic 'tell me what's on the other side that's not already on this side' question — the right one to ask before sinking time.
session ref · ef726887:710
Context. Apr 13 evening, expanding from PACE coursework to 'where could this live in industry'. He admits he can't articulate the question.
Learning move. Naming the limit of his own articulation — 'i dont know im not askign this question well' — and licensing the assistant to interpret rather than answer narrowly. Honest about the fuzz; trusting the iteration.
session ref · ef726887:715
Context. Apr 13, mid-afternoon. After his industry-mentor pitch session, Russell mentioned Isaac Physics in passing and Pierre didn't know what it was.
Learning move. Catching unfamiliar nouns in a debrief and flagging them for backfill. The two extra spaces and lowercase betray voice-dictation; the willingness to admit 'I don't know what that is' is the move.
session ref · ef726887:651
Context. Apr 14 morning, opening prompt of a follow-on session. Yesterday's session went late and cross-domain; today he wakes up and asks for the situation report before doing anything.
Learning move. Resisting the urge to dive back in. Forcing himself (and the tool) to re-state purpose before resuming work — the ADHD compensation move of 'orient before act'.
session ref · ef726887:947
Context. Apr 14 mid-morning. The walkthrough has gotten dense and Pierre wants to redo it from rubric-anchor outward.
Learning move. Anti-hype contract: 'if things are weak just say so don't hype things up'. Pierre is teaching the assistant a calibration norm because he doesn't want a glossy artefact, he wants an honest one. Also separating chart-comprehension ('what they show and don't show') from chart-production.
session ref · 9c69ef34:6
Context. Apr 14, deep into walkthrough revision. The pipeline is multi-stage (Trustpilot + Google, location merging, language filtering, emotion classifier) and Pierre needs a single visual.
Learning move. Asking for the artefact that proves data lineage, not the explanation of it. He wants drop counts and 2-3 actual before/afters because that's how he checks whether anyone (including him) actually understands the pipeline.
session ref · 9c69ef34:136
Context. Apr 14, mid-afternoon. Walkthrough page got rebuilt and Pierre lost the link in the chat scroll.
Learning move. Frustration as a signal that the loop has too many handoff points. Less a learning prompt than a marker — Pierre is the kind of learner who keeps the friction logs visible rather than smoothing them away.
session ref · 9c69ef34:321
Context. Apr 14 afternoon. The assistant claimed to have visually verified the page rendered bars correctly; Pierre's eyes saw no bars. The assistant had hallucinated from its own source code instead of looking at the rendered page.
Learning move. Diagnosing the failure mode rather than just yelling at it. He distinguishes 'research not good enough' vs 'follow through not good enough', proposes 'where are the gaps', and ends with 'how do we tighten this up' — a learning conversation about how the tool itself learns, not a one-off bug report. This is how Pierre teaches the assistant new norms.
session ref · 9c69ef34:620
Context. Apr 14, after the bar-hallucination episode. Pierre has read elsewhere that there's an autonomous overnight loop pattern; he asks the assistant to research it because his own assumption that there's a better way needs evidence.
Learning move. Trusting his own gut that 'the way you're doing it isn't' optimal — but explicitly licensing the tool to look at what's recent rather than trusting its training. Meta-learning about how to use the tool.
session ref · 9c69ef34:715
Context. Apr 16 morning. The course gave permission to swap the original Falcon model. Pierre doesn't just want a swap — he wants to know what the swap unlocks.
Learning move. Treating a model swap as a capability question, not a config change. 'See what new capabilities we have' is the move that turns a chore into learning. Also flags 'just think for a moment to make sure it's solid' — explicitly slowing the assistant down.
session ref · 3b7cb500:32
Context. Apr 16 morning. The assistant had drifted toward a small local model 'to be safe' on resources; Pierre interrupts with the actual constraints (he has Colab Pro A100 + HF Pro).
Learning move. Re-asserting the real constraint surface. Pierre's lived environment is GPU-rich; the assistant defaulted to scarcity. The 'wait stop' is also a literal pattern interruption — he's training the assistant on his actual stack.
session ref · 3b7cb500:53
Context. Apr 16, live during a tutor Q&A. Pierre is in a real call and quickly types 'where does the emotion stuff come in' to figure out which rubric section emotion classification belongs to so he can ask intelligently.
Learning move. Live-during-call sense-making. The voice-dictation slip ('amotion anger sadness') reveals he's typing in real time. He's not preparing — he's calibrating in the moment which question is well-formed.
session ref · 3b7cb500:78
Context. Apr 16, still in the same Q&A session. The model misclassifies UK polite-complaint as 'sad' instead of 'angry'. Pierre wants both the question framing and the downstream-stakes story.
Learning move. Two layers in one prompt: (1) help me FORM the question, (2) help me understand the BUSINESS IMPLICATION of the answer. He's not outsourcing the asking, he's rehearsing the asking. 'adhd friendly' is a UI directive — break it into chunks his brain can hold.
session ref · 3b7cb500:99
Context. Apr 16, just after the Q&A. Pierre wants the rubric tick-back AND first-20-rows in/out as a sanity layer on top of metrics.
Learning move. Insisting on row-level eyeballing alongside aggregate metrics. He doesn't trust 'accuracy: 0.78' — he wants to see what 20 actual examples look like through the pipeline. This is the data-scientist habit of physically reading samples.
session ref · 3b7cb500:106
Context. Apr 16 afternoon. Mid-build, Pierre asks 'what do you need from me' AND requests an audio walkthrough per rubric item.
Learning move. Active turn-taking — 'what do you need from me' is unusual; most users wait passively. And the audio request shows self-knowledge ('you know how I like to learn') — he's a multimodal/audio learner and is engineering the artefact around how his brain consumes information.
session ref · 3b7cb500:1174
Context. Apr 16 afternoon. Audio walkthroughs being shipped. Pierre wants 2-minute chunks autoplaying as a playlist, plus a 10-minute overview.
Learning move. Specifying the consumption format, not just the content. Two-minute chunks per rubric point + autoplay playlist = a learning loop tuned for ADHD attention budget. He thanked the assistant for the chunking idea — explicit reinforcement of useful structure.
session ref · 3b7cb500:1237
Context. Apr 17 evening, building the basic submission notebook. Pierre wants a specific structure: rubric line, then learnings, then code, repeating.
Learning move. Designing the artefact as a self-teaching document. Rubric-text > learnings-text > code is the structure of a study guide, not a deliverable. He's coding the notebook to teach his future self.
session ref · 1aaf31e5:90
Context. Apr 17 evening. Non-English reviews are showing up and skewing topic models / emotion classification.
Learning move. Cost-of-effort reasoning before committing. 'How many lines in that Python file' = is this worth the lift? He's weighing 'put it in the notebook more sophisticatedly' vs 'just drop non-UK locations' — two different surgical depths.
session ref · 1aaf31e5:280
Context. Apr 18 dawn. The ollama-to-HuggingFace migration has been thrashing. Pierre is exhausted and steps out of the bug to ask about the meta-process.
Learning move. Naming the failure pattern ('this is too reactive') and asking which skill in his framework should capture it ('is it auditor?'). He's not just frustrated — he's tagging the experience for permanent capture so the next migration doesn't relive the same loop.
session ref · 1aaf31e5:395
Context. Apr 18 morning. Continuing the post-mortem. Pierre wants to know what specifically should be lifted out of this PACE-specific pain into the general toolkit.
Learning move. Distillation move. Treating the painful migration not as wasted time but as raw material — the distinction between project-noise and reusable-IP is what turns a bad day into compound interest.
session ref · 1aaf31e5:539
Context. Apr 18 morning. Pierre is loud-CAPS frustrated with himself (and the tool) because he keeps treating PACE as one-off project work instead of as a vehicle for transferable skill.
Learning move. Re-orienting around the actual purpose of the project. The CAPS aren't anger at the assistant — they're at himself for losing the thread. The whole reason PACE exists in his world is for generalisable learnings; he's re-licensing that intent.
session ref · 1aaf31e5:557
Context. Apr 18 mid-morning. The same notebook has been edited ~8 times in-place during model migration, with no commits between rounds. Pierre suddenly asks the version-control question.
Learning move. Catching a destructive default before it bites. The frustration is real but the cognitive move is precise — naming 'destructive' is the right vocabulary, and the assistant confirms it ('Yes: destructive. No recovery points.'). Pierre's instinct on data discipline overrode the 'just keep going' urge.
session ref · 1aaf31e5:1081
Context. Apr 18 morning. The assistant claimed it 'couldn't reach' the brain DB. Pierre knows it's reachable from this machine — the assistant was reaching for an 'I can't' default.
Learning move. Refusing capability gaslighting. Pierre knows his infra; when the tool falsely claims it can't connect, he calls it out. The assistant's response in this turn was to admit 'that was lazy. Brain DB IS reachable; I was reaching for the I can't default.' This is sparring that improves the tool's behaviour.
session ref · 1aaf31e5:1161
Context. Apr 18 afternoon. Just after running through commits. Pierre pivots from 'commit it' to 'who in my system would catch a bad commit'.
Learning move. Asking who-not-what. Instead of describing checks he wants run, he asks which persona owns that responsibility — testing his mental map of the village/skill system. Verifying the system has eyes for an entire class of mistake (committed-secrets) before it bites him.
session ref · 4ee82f98:852
Context. Apr 18 afternoon. The assistant proposed installing a packaged CLI to use the workbench tool. Pierre prefers the lightest-weight integration.
Learning move. Pushing back on tool sprawl. He has a strong prior that installs are tax — and asks for consolidation. The assistant came back with two-line shell aliases instead of an install. Negotiating a smaller solution by stating his constraint, not by approving anyway.
session ref · e2d33ad0:515
Context. Apr 18 evening. After a parallel session merged the workbench skill, Pierre wants a status read AND a direction-to-go question in the same breath.
Learning move. Asking 'what do we need to change' rather than 'is it done'. Treats the merge as a draft rather than a deliverable; expects there's drift to fix. Default-skeptic stance.
session ref · 70109960:426
Context. Apr 18 evening. Substantial restructuring is being proposed; Pierre wants the village's diverse views before signing off.
Learning move. Requesting deliberate disagreement. He's built a multi-persona system specifically so he can stress-test his own decisions through different lenses; this prompt activates that lens-rotation.
session ref · 70109960:526
Context. Apr 24 evening, on a workbench session. The model trained on US tweets reads UK polite complaints as 'sad' rather than 'angry'. Pierre proposes collapsing emotion buckets and 1-2 star reviews into one churn-risk signal.
Learning move. Proposing a domain-justified decision and asking the tool to sanity-check it. 'Does that make sense?' followed by his own three-clause reasoning ('reason being if they stop coming, regardless of emotion, they want to know that'). He's articulating the consultancy logic and inviting challenge.
session ref · 45465c5b:108
Context. Apr 24 evening. The basic notebook is being modified in-place again, and Pierre catches it via a FileNotFoundError when columns moved without an index update. (This is the SECOND time he's caught this same pattern — see 1aaf31e5:1081 from Apr 18.)
Learning move. Pattern recognition across sessions. He saw this exact failure six days earlier and is now catching it instantly. The single phrase 'youre overwriteing again' is the compressed form of a learning that has stuck.
session ref · 45465c5b:561
Context. Apr 25 morning. Day after the second overwrite catch. Pierre adds the rule into the standing instruction set.
Learning move. Promoting a one-off frustration to a standing rule. 'You've got to do versioning right and commit to GitHub' is a lesson moving from incident to invariant.
session ref · 45465c5b:949
Context. Apr 25 morning. Two specific location IDs (345, 398) keep cropping up and Pierre can't remember what was decided about them.
Learning move. Honest about losing track ('I'm kind of losing track again'), naming the specific cognitive miss, then proposing the exact data move (pull both name lists + LLM-assisted fuzzy match) that will fix it. Self-aware of memory limit, decisive about how to compensate.
session ref · 45465c5b:868
Context. Apr 25 morning. Pipeline has bifurcations — a step applied to overall rating but not to per-location rating. Pierre needs the visual to show this.
Learning move. Diagram-thinking. He's spotted the asymmetry himself ('we took common locations and added to overall but not per-location') — the diagram request is to externalise his own internal model so he can verify it against the actual code.
session ref · 45465c5b:1051
Context. Apr 25 morning, on a parallel data-workbench session. Pierre wants to bring fresh learnings (anger+sadness merge, Sonnet+10-shot Qwen) back into an earlier write-up.
Learning move. Treating the report as a living document that gets re-passed when his understanding updates. 'Improve it with what we've learned here' is a deliberate iteration loop — he's not redoing it, he's upgrading it.
session ref · 5204cb6b:42
Context. Apr 25 afternoon. A new agent in this session was about to reinvent something the workbench skill already provides.
Learning move. Caveman-clear reset. 'You've lost the plot' is direct; the fix is even better — go read the claudemd and look around, don't ask me to re-explain the system. He's training the tool to ground itself in repo context before proposing.
session ref · 5204cb6b:742
Context. Apr 25 afternoon. Top-15 negative-review token list is dominated by 'staff', 'people', 'one', 'get', 'time' — words too generic to mean anything for churn analysis.
Learning move. Reading raw output and inferring a config bug ('we don't have good stop words'). Catching a 'random 2 typed in a cell' as separate noise. Real diagnostic instinct from the data, not from the assistant's narrative.
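The fix Pierre inferred — a domain stop-word layer on top of the defaults — can be sketched like this (the token list and the extra stop words here are illustrative, not the project's actual curated config):

```python
from collections import Counter

# Generic tokens that dominated the raw top-15 and carry no churn signal.
# Illustrative only -- the project's curated list may differ.
EXTRA_STOPWORDS = {"staff", "people", "one", "get", "time", "gym", "would"}

def top_tokens(tokens, base_stopwords, k=15):
    """Frequency count after layering domain stop words over the base list."""
    stop = set(base_stopwords) | EXTRA_STOPWORDS
    return Counter(t for t in tokens if t not in stop).most_common(k)

tokens = ["staff", "rude", "staff", "cancel", "time", "cancel", "dirty", "one"]
print(top_tokens(tokens, base_stopwords={"the", "a"}, k=3))
# [('cancel', 2), ('rude', 1), ('dirty', 1)]
```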
session ref · 5204cb6b:806
Context. Apr 25 afternoon. Tokens are still mostly low-signal. And the joy emotion class keeps hitting on 1-star reviews — clearly wrong.
Learning move. Three nested questions in one prompt: (1) here's evidence the model is misbehaving — show me the examples, (2) is there a better-trained UK model already, (3) if not, how expensive would training one be? This is exactly how a senior analyst escalates: see anomaly, ask for backup, scope the next step.
session ref · 5204cb6b:845
Context. Apr 25 morning. Pierre had been told 8 cohort-feedback patches included seeded UMAP for BERTopic. He confirms each patch one by one, and stops on UMAP to ask for the explanation rather than blindly accept it.
Learning move. Verifying mental model of UMAP role inside BERTopic; voice-dictation slip ('bird topic walls' = BERTopic calls) reveals he was thinking faster than typing. The 'just explain to me what that is again please' is the move — he won't ship code containing a concept he can't define.
session ref · b32262f3:40
Context. Apr 25 evening. Pierre has noticed two parallel sessions are producing two notebook variants and wants to reconcile them before going further.
Learning move. Catching a fork before it diverges further. 'What's the difference?' is the right diff-question — he's not asking which is correct, he's asking what the delta IS so he can decide.
session ref · b32262f3:520
Stack & process
How the PACE project was built — the tools, the splits between local and cloud compute, how skills + personas + brain integrate, and how the process changed over four weeks.
The compute split — local Python ↔ Colab GPU
Where each piece runs
Windows / Git Bash (dolphin laptop) — data wrangling (pandas), Trustpilot/Google Excel parsing, langdetect, light NLP (FreqDist, wordclouds), text preprocessing, report editing, git work, deploy. Anything CPU-bound or that doesn't need a GPU.
Google Colab Pro+ (A100 40GB) — every transformer-touching cell. BERT emotion classifier (27,666 reviews, batch 64), BERTopic (4 lenses), Qwen2.5-7B/72B-Instruct topic extraction, j-hartmann cross-check on stratified samples. The notebook is Colab-first: a single A100 Run All produces every output.
Hetzner CAX31 (Helsinki, ARM64) — the brain DB: PostgreSQL 16 + pgvector, where session telemetry, decisions, and learnings live. SSH-only access. Used for cross-session continuity.
Cloudflare Pages (free tier) — viewer for the deployed pages (this one, plus pace-study.pages.dev with the report + extended report + audio deck + cribsheet + walkthrough + row inspector + tips).
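The batch-64 pattern from the A100 emotion pass reduces to plain batching logic; a minimal sketch with the classifier stubbed out (the real run calls the BERT pipeline on GPU):

```python
from typing import Callable, Iterator

def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield fixed-size chunks; the last chunk may be short."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def classify_all(reviews: list[str],
                 classify_batch: Callable[[list[str]], list[str]],
                 batch_size: int = 64) -> list[str]:
    """Run a batch classifier over every review, preserving input order."""
    labels: list[str] = []
    for batch in batched(reviews, batch_size):
        labels.extend(classify_batch(batch))
    return labels

# Stub standing in for the GPU emotion classifier.
stub = lambda batch: ["joy" if "great" in r else "anger" for r in batch]
print(classify_all(["great gym", "broken showers", "great staff"], stub, batch_size=2))
# ['joy', 'anger', 'joy']
```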
The split that actually saves time
A clean Run All on Colab is non-negotiable for the transformer work — a kernel restart on the A100 costs under 30 seconds; cells running on stale state cost hours of debugging. Local Python handles everything before the transformer work: data validation, language filtering, location merges, stop-word curation. Each chunk runs where it's cheapest in cognitive tax.

The skill / persona / brain integration
Skills
Skills are markdown files with structured triggers; Claude Code loads them on demand. PACE-relevant skills include:
- workbench — the data-science toolkit (env vars, brain DB DSNs, HF token, Colab patterns, brain-vault paths)
- commit / commit-push-pr — git rituals
- done / wrap — session-close rituals
- visual-verify-loop — self-verify rendered web pages after deploy
- broadcast / persona-check-in — fan-out and fan-in across cognitive handles
Personas (cognitive handles)
Twelve personas comment on every substantive turn via a one-line-per-persona village footer. Each is a focused viewpoint, not a separate agent — they all share Claude's reasoning but emit findings in their own voice. Most relevant during PACE:
- 🕴️ Session Boss — purpose, budget, mode
- 📚 Librarian — recall, search, "what did we say earlier"
- 👮 Cop — drift detection, preference compliance
- 🧑⚖️ Auditor — session-close, repeat-mistake findings
- 💾 Archivist — commits and pushes
- 🗑️ Scrap — disk hygiene, sensitive-path leak detection
- 🔧 Steve — API specialist (rate limits, cost, regression)
- 🌿 Alex — learnings steward (3+ similar errors → suggest matching learning)
- 🦉 Maya — four-lens watchdog (Radar / Wisdom / Left Field / ADHD Coach)
- 🏗️ Bela — infra watchdog (don't reinvent existing services)
Brain DB (the spine)
PostgreSQL 16 + pgvector on Hetzner. Tables include v2.sessions, v2.turns, v2.persona_comments, v2.decisions, doc_chunks (1536-dim OpenAI text-embedding-3-small for semantic search). Every Claude Code turn writes to v2.turns via the village-footer-write hook; every persona comment writes to v2.persona_comments. Cross-session recall is via cosine similarity (/recall skill).
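The /recall lookup is a cosine-similarity ranking over stored embeddings — in SQL it's a pgvector ORDER BY over the distance operator; the same scoring can be sketched in plain Python (toy 3-dim vectors stand in for the 1536-dim embeddings, and the chunk texts are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recall(query_vec, chunks, k=2):
    """Return the k stored chunks most similar to the query embedding."""
    return sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)[:k]

chunks = [
    {"text": "BERTopic seeding decision", "vec": [1.0, 0.1, 0.0]},
    {"text": "git overwrite incident",    "vec": [0.0, 1.0, 0.2]},
    {"text": "UMAP random_state note",    "vec": [0.9, 0.2, 0.1]},
]
print([c["text"] for c in recall([1.0, 0.0, 0.0], chunks, k=2)])
# ['BERTopic seeding decision', 'UMAP random_state note']
```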
Brain-vault (Obsidian-format markdown at ~/brain-vault/) is the human-readable mirror — learnings, sessions, portfolio cards, skills, recordings.
How the framework grew over the project
Phase 1: Single notebook, manual everything
(2026-03-23 — 2026-04-10) Started with a single Jupyter notebook + raw Excel files. No version discipline; in-place overwrites the norm. Pierre handled all preprocessing decisions manually.
Phase 2: V3 pipeline split
(Apr 11 — 14) The pipeline was split into phase scripts, v3_01_eda.py through v3_12_actionable_findings.py — each phase produces deterministic artefacts. Manual gold-labelling discipline established (v3/LABELLING_TRAINING.md). First panel review ran (5 PhDs + practitioners critiquing methodology).
Phase 3: Workbench codification
(Apr 16 — 18) Cohort Q&A with Russell. Falcon → Qwen swap green-lit. data-workbench/ tooling lifted out of the project repo: apply.py (executor), render.py (live HTML dashboard), hooks/data-workbench-guard.sh (PreToolUse hook blocking destructive ops), preflight.py (require_gpu, require_files, require_clean_warnings). Pattern: documented + recurring → move from prompt to code.
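The preflight pattern — fail fast before an expensive run — can be sketched like this. Function names follow the ones listed above, but the bodies are hedged stand-ins, not the project's actual implementation; the GPU check uses an import probe rather than assuming torch is installed:

```python
import importlib.util
from pathlib import Path

def require_files(*paths: str) -> None:
    """Abort before the run if any required artefact is missing."""
    missing = [p for p in paths if not Path(p).exists()]
    if missing:
        raise FileNotFoundError(f"preflight: missing {missing}")

def require_gpu() -> bool:
    """Cheap probe: is torch importable and CUDA visible? (stand-in check)"""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

require_files(".")       # current directory always exists -> passes silently
print(require_gpu())     # True only when CUDA is actually visible
```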
Phase 4: Submission notebook + deploy
(Apr 18 — 19) basic/basic_notebook.ipynb ships — 48 rubric items × (verbatim rubric + "our learnings" + code cell). 53/53 cells executed on A100, max execution count 77, all 9 patched cells carry patched source. pace-study.pages.dev deployed (6 PIN-gated pages).
Phase 5: Honest reframe + extended report + visual polish
(Apr 24 — 25) Cohort feedback v2 patches (8 cells, intersection 312→335, exclude 345/398 from rankings). Sonnet validation flips shift-worker headline from "1,177 verified" to "24/7-praise filter, 3% confirmed". EXTENDED_REPORT.md (40 KB consultant memo) shipped; deployed at pace-study.pages.dev/extended. Russell 1:1 + cohort Q&A transcribed via Deepgram. v3 visual polish ships: bar palette, heatmap, plotly→PNG fallback (kaleido didn't install — fallback printed and plotly stayed as HTML).
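The kaleido fallback is a generic try-PNG-then-keep-HTML pattern. A minimal sketch with the exporters passed in as callables (the real code calls plotly's fig.write_image / fig.write_html; the stubs below are illustrative):

```python
from typing import Callable

def save_figure(write_png: Callable[[str], None],
                write_html: Callable[[str], None],
                stem: str) -> str:
    """Prefer a static PNG; if the PNG backend (e.g. kaleido) fails, keep HTML."""
    try:
        write_png(stem + ".png")
        return stem + ".png"
    except Exception as exc:
        print(f"PNG export failed ({exc!r}); falling back to HTML")
        write_html(stem + ".html")
        return stem + ".html"

# Stub: PNG export raising the way a missing kaleido install would.
def broken_png(path: str) -> None:
    raise RuntimeError("kaleido not installed")

saved = save_figure(broken_png, lambda path: None, "heatmap")
print(saved)
# heatmap.html
```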
Models & tools used
LLMs & ML models
- Qwen2.5-7B-Instruct (Apache 2.0, Colab A100, primary topic extraction)
- Qwen2.5-72B-Instruct (HuggingFace Inference Providers, larger-context exploration)
- Claude Sonnet 4.6 (gold-label production, 200-sample shift-worker validation, 30-row gold-eval)
- Claude Opus 4.7 (meta-orchestration in this Claude Code session)
- BERT-base-uncased-emotion (bhadresh-savani) (rubric-mandated 6-class emotion classifier)
- j-hartmann/emotion-english-distilroberta-base (cross-check on stratified 200-sample)
- Falcon-7B-Instruct (rubric-mandated original; substituted for Qwen with documented rationale)
- BERTopic (4 lenses: full, top-30, anger-filtered, LLM-driven; UMAP+HDBSCAN+CountVectorizer+KeyBERT+MMR)
- Gensim LDA (10 topics, coherence 0.449, multilingual cluster detection)
- OpenAI text-embedding-3-small (1536-dim, brain DB doc_chunks)
- Sentence-Transformers all-MiniLM-L6-v2 (BERTopic default embedding backbone)
- paraphrase-multilingual-MiniLM-L12-v2 (multilingual variant for non-English residue)
- Deepgram Nova-2 (10 voice recordings transcribed in 1 background agent run)
- Perplexity Sonar Deep Research (industry context, sense-checked against Companies House)
APIs & external services
- HuggingFace Hub (model loading via HF_TOKEN; HF Inference API deprecated for gated models)
- Anthropic API (Sonnet eval batches; Claude Code itself)
- Deepgram API (Nova-2 transcription, diarized)
- OpenAI API (text-embedding-3-small for semantic search)
- Perplexity API (Sonar Deep Research)
- Cloudflare Pages API (deploy via wrangler + project-create REST)
- GitHub API (commits, push, PR work)
- Canvas LMS API (course content scraping for cohort context)
- Companies House API (FY2024 filing lookup)
- Vaultwarden / Bitwarden API (secret rendering)
- ntfy (mobile notifications when overnight runs finish)
Python ecosystem
- Data: pandas, numpy, openpyxl
- NLP: nltk, langdetect, wordcloud, gensim, pyLDAvis
- Transformers: transformers, sentence-transformers, accelerate, kaleido (intended)
- Topic modelling: bertopic, umap-learn, hdbscan
- Plotting: matplotlib, seaborn, plotly
- Validation: rapidfuzz (cross-platform location matching)
- HTTP: httpx, requests
- Notebook tooling: nbformat, jupyter_client
- Git: gitpython (sparingly; mostly via shell)
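The rapidfuzz cross-platform location matching reduces to best-score pairing of two name lists. A stdlib sketch with difflib as a stand-in scorer (rapidfuzz's scorers are faster and score 0–100 rather than 0–1; the location names here are illustrative):

```python
from difflib import SequenceMatcher

def best_match(name: str, candidates: list[str], cutoff: float = 0.8):
    """Pair a Trustpilot location name with its closest Google counterpart."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

google = ["PureGym London Holborn", "PureGym Leeds City Centre"]
print(best_match("Puregym London - Holborn", google))
# PureGym London Holborn
```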
Editorial & deploy stack
- Claude Code (CLI) — primary work environment
- VS Code — secondary editor
- Mermaid v11 — flowchart rendering in reports + this page
- Chart.js — analytics chart on Tab 5
- Wrangler — Cloudflare Pages deploy CLI
- Tailscale — private network for Hetzner brain DB access
- SSH — Hetzner brain DB ops
- Bitwarden CLI / Vaultwarden — secret rendering
Prompt analytics
A look at the data behind four weeks of Claude Code sessions on the PACE project. Most of these numbers were computed by an analytics agent reading the session JSONLs directly — the rest are pulled from git, brain-vault, and the deployed notebook.
Tokens — the cost of doing this
Prompt caching carried over 99% of the input weight: a 5,448× multiplier of cached reads over fresh input. Without caching, the input-token bill would have been ~5,400× larger.
Activity timeline
What the data showed when I went looking
April 18 was the rebuild-everything day
96 prompts, 608 tool calls in 14.82h
The day Pierre lifted the workbench to a standalone repo, patched the notebook into v2/v3 staged variants, shipped 9 rubric-gap fixes, and recorded the longest single sitting. Roughly 2× the prompts of any other day.
Tool-call density: every typed prompt triggers ~9 actions
8.7 tool calls per prompt; 48% are Bash
257 typed prompts produced 2,245 tool calls. Pierre talks short; the agent does long. Bash dominates because Colab/git/Python checks all run as shell.
Prompt caching carried 99% of the input weight
946,132,882 cached-read tokens vs 173,656 fresh input tokens
~94.5% of model context across the 17 sessions came from cache hits. Without prompt caching, this project's input-token bill would have been ~5,400× larger.
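The multiplier is straightforward arithmetic over the two token counts reported above:

```python
cached_read = 946_132_882   # input tokens served from the prompt cache
fresh_input = 173_656       # input tokens actually sent fresh

multiplier = cached_read / fresh_input
cache_share = cached_read / (cached_read + fresh_input)

print(f"{multiplier:,.0f}x")    # 5,448x
print(f"{cache_share:.2%}")     # 99.98%
```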
Longest single session was a 25-hour overnight
25.4h — session ef726887 (27 prompts)
Pierre kept one session alive across the BERTopic representation upgrade, the Sparx research dump, and meeting recovery. Eight other sessions ran 4h+; eight more were sub-1h hit-and-runs. Bursty, not steady.
Six notebook variants kept on disk simultaneously
basic_notebook.ipynb, basic_notebook_appendix.ipynb, basic_notebook_patched_v2.ipynb, basic_notebook_patched_v3.ipynb, basic_notebook_v2_pending.ipynb, basic_notebook_v3_pending.ipynb
Pierre's iteration discipline: never overwrite, always stage. v2_pending and v3_pending sat alongside the canonical until each A100 Colab run validated their patch sets.
Frustration ratio held under 6%
15/257 prompts (5.8%)
Across 61 tracebacks/exceptions caught in tool output, only 15 prompts contained frustration markers (fuck/stuck/why isn). Errors were debugged, not vented at.
Submission notebook + git
| Metric | Value |
|---|---|
| Total cells in canonical notebook | 127 (53 code, 74 markdown) |
| Lines of code in notebook | 873 |
| Notebook size on disk | 2.6 MB |
| Notebook variants kept on disk simultaneously | 6 |
| Brain-vault learnings authored in window | 32 |
| Git commits | 36 (first 2026-04-10, last 2026-04-25) |
| Lines added / removed | +436,859 / −78,404 |
| Files changed total | 571 |
| Most-touched file | basic/basic_notebook.ipynb (9 commits) |
| Frustration ratio | 15/257 (5.8%) |
| Prompt length (median / p90) | 119 / 3,200 chars |
Stack inventory
Models / LLMs
BERT-base-uncased · BERTopic · Claude Opus 4.7 · Claude Sonnet 4.6 · Deepgram Nova-2 · DistilRoBERTa · Falcon-7B · Gemini 3 Flash · GoEmotions · LDA (Gensim) · Llama-3 · OpenAI text-embedding-3 · Perplexity Sonar · Qwen2.5-7B · RoBERTa · Sentence-BERT (MiniLM) · Whisper · j-hartmann/emotion-english
External APIs
Anthropic API · Cloudflare Pages · Deepgram API · Google Colab · HuggingFace Hub · Perplexity API · PostgreSQL (brain DB) · spaCy · Tailscale · Vaultwarden · ntfy
Tools (Claude Code primitives + MCP)
Agent · Bash · Edit · Glob · Grep · NotebookEdit · Read · TaskCreate · TaskList · TaskOutput · TaskStop · TaskUpdate · ToolSearch · WebFetch · WebSearch · Write · mcp__claude-in-chrome__find · mcp__claude-in-chrome__form_input · mcp__claude-in-chrome__get_page_text · mcp__claude-in-chrome__javascript_tool · mcp__claude-in-chrome__navigate · mcp__claude-in-chrome__read_console_messages · mcp__claude-in-chrome__read_page · mcp__claude-in-chrome__resize_window · mcp__claude-in-chrome__tabs_context_mcp · mcp__claude-in-chrome__tabs_create_mcp · mcp__claude_ai_Firecrawl__firecrawl_scrape · mcp__claude_ai_Notion__notion-fetch · mcp__claude_ai_Notion__notion-search