Promlo

A concrete, build-it-yourself guide to tracking how often ChatGPT and other LLMs mention your brand — prompt design, batch running, mention extraction, citations, and frequency.

If you've heard of AEO and want to know whether your brand is being mentioned in ChatGPT answers, the naive approach is: type your brand into ChatGPT, see what it says. That tells you almost nothing. This guide walks through the actual workflow — the one used by every AEO tool on the market and that you can absolutely build yourself if you want to.

Prerequisites: you understand what AEO is and why mentions matter. If not, read What is Answer Engine Optimization first.

Why typing your brand into ChatGPT once doesn't work

Three reasons it's not a measurement:

Sample size of one. A model's response is non-deterministic. Run "best CRM for startups" three times, get three different orderings of brands. A single query is a snapshot, not a metric.
Wrong question shape. Searching your own brand name ("Is Mixpanel any good?") only tells you about branded sentiment. The high-leverage queries are unbranded ones where buyers don't yet know you exist.
No engine coverage. ChatGPT is one of at least four mainstream answer engines. Each has different training cuts, different search providers, different citation rules. You need them all.

You're not measuring "what does ChatGPT think of my brand". You're measuring "what share of relevant buyer questions surface my brand across the AI surfaces my buyers actually use, and how is that trending".

Step 1 — Build your prompt list

This is the most underrated step. Tools and dashboards are downstream of prompt quality.

A useful prompt list has 50–200 prompts and three categories:

Category	Share	Example
Unbranded category	~50%	"best email marketing tool for Shopify stores"
Comparison / vs	~25%	"Klaviyo vs Mailchimp for ecommerce"
Branded	~25%	"is Klaviyo worth the price", "Klaviyo alternatives"

The 75/25 unbranded-to-branded split is the rough industry default. Tools like Profound use a similar ratio. The reason: branded queries surface sentiment problems (good to know), but unbranded queries are where the category lives. If you only track branded prompts, you'll never discover that your buyer asks "best [adjacent category] tool" and gets a list that doesn't include you.

A cheap way to seed the list:

Open Google Search Console for your site. Pull queries with impressions but no clicks. These are queries your audience asks but where you aren't winning. Many will translate directly into AEO prompts.
Talk to sales / support. What do prospects say when they email "we were considering you and X — what's the difference?" That's a comparison prompt.
Ask an LLM to expand. Feed it your category, three competitors, and your USP, and ask for 50 buyer questions. Then manually review and cut the 30% that aren't realistic. Don't ship the raw output.

Save the list as a flat file: prompt_id, prompt_text, category columns. You'll iterate on this for months.

Step 2 — Run prompts at scale

You need to run each prompt through each engine, ideally weekly, and store every full response. Three options:

Option A — Direct provider APIs. Call OpenAI's chat.completions endpoint for ChatGPT, Anthropic's messages endpoint for Claude, Google's Gemini API, and Perplexity's chat.completions. Each has its own auth, rate limits, and pricing. Realistic cost as of 2026: roughly $0.001–$0.01 per prompt for small/medium models, $0.02–$0.10 per prompt if you use the flagship reasoning models with web access.

Option B — OpenRouter. A single API that proxies to most major model providers, including OpenAI, Anthropic, Google, Perplexity, and several open models. One API key, one billing relationship, consistent response format. The convenience tax is roughly 5–10% over direct pricing. For most teams it pays for itself in saved engineering time.

Option C — Browser automation. Playwright/Puppeteer scripts hitting chat.openai.com, claude.ai, etc. Don't do this for production tracking — TOS gray zone, brittle to UI changes, bot detection will eventually catch you.

Cost math for a real example: 100 prompts × 6 engines × 4 weekly runs = 2,400 calls/month. At an average of $0.025/call (mix of light and heavy models), that's $60/month in pure API cost. Add headroom and you're at $100/month. This is why most teams either use a tracking SaaS or build with OpenRouter to keep the bill predictable.

A few practical notes:

Always store the full raw response, not just an extracted summary. You will want to re-extract later when you change your extraction logic.
Use temperature=0 if the engine supports it. Reduces noise across runs.
For ChatGPT specifically, decide explicitly whether you want web-search-on (more current, slower, more expensive) or off (uses model's training data only). Most teams want web on; some want both, tracked separately.
Rate-limit yourself. Don't burst 100 calls in 5 seconds — providers will throttle and your data quality drops.

Step 3 — Extract brand mentions reliably

This is where the naive approach breaks worst.

The wrong way: regex or substring match. "klaviyo" in response.lower() looks fine until you hit:

Mentions inside a competitor's quote: "Klaviyo's CEO said..." — counts as a mention even if the answer is about something else.
Mentions in a "not recommended" context: "Avoid Klaviyo if you have under 1,000 contacts."
Brand names that collide with common words. "Notion" matches the word "notion".
Spelling/casing variants: "Klaviyo", "klaviyo", "Klavio" (typo by the model).

The right way: a small structured-output LLM call per response. Feed the response, your brand name, and known competitors to a cheap model (gpt-4o-mini, Claude Haiku, Gemini Flash) with a JSON schema:

{
  "mentions": [
    {
      "brand": "Klaviyo",
      "sentiment": "positive | neutral | negative",
      "context": "recommended | mentioned | dismissed",
      "quote": "verbatim sentence from the response"
    }
  ],
  "ranked_brands": ["Klaviyo", "Mailchimp", "Brevo"]
}

This costs roughly $0.0005 per extraction with a small model — negligible on top of the prompt-running cost. Accuracy is markedly better than regex: in our internal tests it eliminated about 30% of false positives (collision and quote cases) and caught roughly 10% more true mentions (typos and morphological variants).

Validate by sampling. Take 50 random extractions a week, eyeball them, and feed errors back into the extraction prompt. After 2–3 iterations you'll be at >95% precision and >90% recall. That's good enough — chasing 99% costs more than it returns.

Step 4 — Track citations and cited sources

When an answer engine cites a source (Perplexity does this aggressively, ChatGPT with web on does it sometimes, Gemini increasingly does), capture three fields per citation:

Domain — reddit.com, g2.com, yourcompetitor.com. Domains aggregated across all your prompts tell you which sources answer engines trust in your category.
URL — for traceability and gap analysis later.
Linked-from-mention — was this citation attached to your brand's mention, a competitor's, or generic?

The high-leverage move: find domains that get cited often in your category but where you have no presence. If reddit.com/r/SaaS is cited in 40% of your category's answers and you've never posted there, that's a content gap. The engineering reward of being able to answer "which third-party domains do I need to show up on" specifically by data is enormous.

Step 5 — Decide your frequency

Hot brands (active launch, new pricing, repositioning): daily for the first 4 weeks, then drop to weekly.
Steady-state tracking: weekly. Anything more frequent is noise — the underlying training data and search index don't change daily, so day-to-day deltas mostly reflect model temperature, not real movement.
Spot checks during a campaign: ad-hoc, but log them in the same store as your scheduled runs.

Plot a 4-week rolling average of mentions and SoV. Avoid reacting to single-week spikes — they're usually noise.

What this looks like in practice

If you build all of the above, you'll end up with roughly:

A prompts table (~100 rows, manually curated)
A runs table (one row per prompt × engine × timestamp, with full raw response)
A mentions table (extracted, JSON output of the small-model extraction step)
A citations table (domain + URL + per-mention linkage)
A weekly cron, a small dashboard, and ~$80–150/month in API costs

Total engineering time to get to a working v1: 1–2 weeks for one engineer. Total ongoing maintenance: a few hours per week to refresh prompts, debug extraction edge cases, and keep model versions current.

If that sounds like more plumbing than you want to own, Promlo does exactly this — 6 LLMs, automated extraction, weekly digest, starting at $29/month. The article is honest about the workflow because the workflow is the same whether you build it or buy it; what we sell is "you don't have to maintain it".

Common mistakes

A short list of things teams get wrong on the first build:

Tracking branded prompts only. You'll feel productive but learn nothing about category positioning.
Using a flagship model for extraction. GPT-4o is overkill at $0.005/extraction when Haiku does the job at $0.0005.
Storing only the extracted JSON, not the raw response. Six months later you'll change extraction logic and want to re-run, and the raw text won't be there.
Comparing absolute mention counts month over month without normalizing for prompt-list changes. If you added 20 prompts, mentions go up. SoV is the metric that doesn't lie.
Treating ChatGPT as the only engine. Perplexity has citations, Gemini owns the Google AI Overviews surface, Claude is rising fast in B2B. All four matter.

For region-specific tactics — especially around which engines actually have user share in Hong Kong and Taiwan — see AEO for SaaS founders shipping out of Hong Kong and Taiwan.

How to track ChatGPT mentions of your brand (the practical guide)