2026-02-10 | model of labs for tkwa

```mermaid
flowchart LR
    classDef choice fill:#fff2cc,stroke:#d6b656,stroke-width:2px,color:#000;

    subgraph pretraining["Pretraining (\$11B/year)"]
        rdsalaries["Pretraining R&D<br>(\$3B/yr)"] --> pretraining_algorithm
        experiment_compute["Pretraining experimental<br>compute (\$3B/yr)"] --> pretraining_algorithm
        pretraining_algorithm([pretraining algorithm]) --> prequality((base model<br>quality))
        pretraining_compute["Pretraining compute<br>(\$4B/yr)"] --> prequality
        pretraining_data["Data acquisition<br>(\$1B/yr)"] --> prequality
        model_size["Model size"] --> prequality
        model_size --> inference_cost
        prequality --> perplexity([perplexity])

        class model_size,rdsalaries,experiment_compute,pretraining_data,pretraining_compute choice
    end

    subgraph posttraining["Posttraining (\$4B/year)"]
        prequality --> quality((final model<br>quality))
        posttraining_experiments["Post-training experiment<br>compute (\$1B/yr)"] --> posttraining_algorithm
        posttraining_rd["Post-training R&D<br>(\$1B/yr)"] --> posttraining_algorithm
        posttraining_algorithm(["Post-training algorithm"]) --> quality
        posttraining_compute["Post-training final run compute<br>(\$1B/yr)"] --> quality
        posttraining_data["Post-training data<br>(\$1B/yr)"] --> quality
        quality --> benchmark([benchmark<br>scores])
        quality --> rmscore([reward model score])
        quality --> abmetrics([AB-test metrics])

        class posttraining_rd,posttraining_compute,posttraining_data,posttraining_experiments choice
    end

    subgraph market_strategy["Market Strategy"]
        quality --> demand([demand])
        price["Price"] --> demand
        price --> revenue(["Revenue<br>(\$20B/yr)"])
        demand --> revenue

        demand --> cost(["Total inference cost<br>(\$13B/yr)"])
        inference_cost(["Inference cost"]) --> cost

        other_costs["Other operating costs<br>(\$7B/yr)"]

        class price,other_costs choice
    end

    subgraph LEGEND
        c["Choice variable"]
        s(["Observed outcome"])
        t(("Unobserved outcome"))
    end
    class c choice;

    style pretraining fill:#f9f9f9,stroke:#ccc
    style posttraining fill:#f9f9f9,stroke:#ccc
    style market_strategy fill:#f9f9f9,stroke:#ccc
    style LEGEND fill:#f9f9f9,stroke:#ccc
```


All figures are annualized run-rates as of end-2025, not full-year totals. This is an operating-expense view — it excludes capital investment (data-center buildout, GPU purchases, etc.), which OpenAI currently accesses via cloud contracts rather than owning outright. Large one-off training runs are treated as annualized steady-state spending; in practice they are lumpy and might more properly be amortized over the useful life of the resulting model.
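
As a toy illustration of the amortization point, here is a minimal sketch; the $4B run cost and 18-month serving life are hypothetical placeholders, not estimates from this note:

```python
# Toy amortization sketch: a lumpy one-off training run vs. its run-rate.
# Both inputs are hypothetical placeholders, not figures from the table below.
run_cost_b = 4.0          # one-off final training run, in $B (hypothetical)
useful_life_months = 18   # assumed serving life of the resulting model

# Spreading the one-off cost over the model's useful life gives the
# steady-state spending it represents, rather than booking it all at once.
amortized_b_per_yr = run_cost_b * 12 / useful_life_months
print(f"~${amortized_b_per_yr:.1f}B/yr amortized")  # -> ~$2.7B/yr
```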

Confidence: 🟢 = reported/leaked figure, 🟡 = triangulated from partial data, 🔴 = rough guess.

| Line item | Estimate | Source / justification |
|---|---|---|
| Revenue (ARR) | $20B/yr | 🟢 CFO statement, late 2025 (Reuters). Consistent with H1 revenue of $4.3B growing to a $13B full-year target ([The Information][6]) |
| **Pretraining** | | |
| &nbsp;&nbsp;R&D staff & overhead | $3B/yr | 🟡 ~4,000 employees (WSJ); H1 R&D was $6.7B incl. compute ([The Information][6]); $3B is the residual after subtracting the compute items below |
| &nbsp;&nbsp;Experimental compute (exploratory runs) | $3B/yr | 🟡 In 2024, ~$5B R&D compute, mostly experiments rather than final training (Epoch AI); scaled up for 2025 |
| &nbsp;&nbsp;Frontier pretraining compute | $4B/yr | 🟡 Residual so that the pretraining subtotal = $11B; consistent with Epoch AI and Fortune magnitudes (Fortune, Epoch AI) |
| &nbsp;&nbsp;Data acquisition & storage | $1B/yr | 🟡 Known deals: News Corp $250M/5yr, Reddit $60M, plus AP, FT, Axel Springer (Reuters); total licensing likely $200–300M/yr; the remainder is scraping, curation, and storage |
| **Pretraining subtotal** | **$11B/yr** | Sum of the above |
| **Post-training** | | |
| &nbsp;&nbsp;Post-training R&D (algorithms, tooling) | $1B/yr | 🔴 No public breakdown; no leaked figures on post-training sub-costs |
| &nbsp;&nbsp;Post-training compute (SFT, RLHF, evals) | $2B/yr | 🔴 No direct disclosure; assumed large but smaller than pretraining compute |
| &nbsp;&nbsp;Post-training data (labeling, preference) | $1B/yr | 🔴 No public number; consistent with the scale of human + synthetic feedback pipelines |
| **Post-training subtotal** | **$4B/yr** | Sum of the above |
| Inference / serving | $13B/yr | 🟡 $8.65B Azure payments in the first 9 months (TechCrunch); $13B extrapolates the end-2025 run-rate. Caveat: compute margin reached ~70% by Oct 2025 (CoinCodex), implying lower unit cost, so $13B may overstate the Q4 run-rate |
| Other opex (S&M, G&A, rev-share, etc.) | $7B/yr | 🟡 H1 S&M alone was $2B ([The Information][6]) → ~$5B/yr run-rate with ramp; plus G&A, legal, compliance, and the 20% Microsoft rev-share (~$4B on $20B). Total plausibly $7–9B |
| **Total costs** | **$35B/yr** | 11 + 4 + 13 + 7. Cross-check: Fortune reports $22B full-year spending vs $13B sales (Fortune); the run-rate exceeds the in-period average because costs ramped through the year (see the sketch below) |
| **Revenue − costs** | **−$15B/yr** | Consistent with the projected $9B net loss for 2025 (Fortune) scaling to a higher end-of-year run-rate |
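
The arithmetic behind the table can be sanity-checked in a few lines. A sketch, using only the estimates above; the annualization and rev-share lines mirror the triangulations noted in the rows:

```python
# Sanity-check the arithmetic behind the estimates above (all in $B/yr).
pretraining = {"r&d_staff": 3, "experimental_compute": 3,
               "frontier_compute": 4, "data": 1}
posttraining = {"r&d": 1, "compute": 2, "data": 1}
inference, other_opex, revenue = 13, 7, 20

assert sum(pretraining.values()) == 11   # pretraining subtotal
assert sum(posttraining.values()) == 4   # post-training subtotal

total_costs = (sum(pretraining.values()) + sum(posttraining.values())
               + inference + other_opex)
assert total_costs == 35                 # 11 + 4 + 13 + 7
print(f"revenue - costs = {revenue - total_costs} $B/yr")  # -> -15

# Triangulations referenced in the table rows:
azure_9mo = 8.65                         # Azure payments, first 9 months of 2025
print(f"naive annualization: ~{azure_9mo * 12 / 9:.1f}")   # ~11.5; $13B assumes a Q4 ramp
print(f"20% rev-share on $20B: ~{0.20 * 20:.0f}")          # ~4, the Microsoft rev-share
```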

Sources

- [1] Reuters: CFO says annualized revenue crossed $20B in late 2025. (link)
- [2] TechCrunch: leaked docs; $8.65B inference payments to Azure in the first 9 months of 2025. (link)
- [3] Epoch AI: 2024 compute breakdown of ~$7B total (~$5B R&D, ~$2B inference); most R&D compute went to experiments, not final training. (data insight, substack)
- [5] Fortune: internal docs shared with investors; $22B spending vs $13B sales for full-year 2025; $9B net loss. (link)
- [6] The Information (via Reuters/ainvest): H1 2025 financials ($4.3B revenue, $6.7B R&D, $2B S&M, $2.5B SBC, $2.5B cash burn). (Reuters, ainvest)
- [7] WSJ: OpenAI pays employees more than any major tech startup; average SBC ~$1.5M/employee; ~4,000 headcount. (link)
- [8] Reuters/NYT: data licensing deals (News Corp $250M/5yr, Reddit $60M, plus AP, FT, Axel Springer). (Reuters, NYT)
- [9] CoinCodex: compute margin reached ~70% by Oct 2025, up from 52% at end-2024; overall gross margin ~48%. (link)