TranscriptAgent
TRANSCRIPTAGENT.AI · transcript analysis

Building AlphaGo from scratch – Eric Jang

Channel: Dwarkesh Patel · Published: 2026-05-15 11:20

Dwarkesh Patel interviews Eric Jang about rebuilding AlphaGo from scratch, focusing on the core mechanics of Go, Monte Carlo tree search, value/policy networks, and how search turns a raw model into a much stronger player. The conversation then broadens into what AlphaGo implies for scaling laws, reinforcement learning, distillation, off-policy training, and automated AI research.


Detailed summary

This is a long-form technical interview with Eric Jang, introduced as the former VP of AI at 1X Technologies and previously a senior research scientist at Google DeepMind Robotics, discussing his sabbatical project of reconstructing and improving AlphaGo. He starts with why AlphaGo fascinated him: it solved a seemingly intractable search problem with deep learning, and modern tooling has made a project that once required a large research team and millions of dollars achievable on rented compute. A large portion of the discussion carefully explains Go itself, including the board, stone capture, suicide rules, Tromp-Taylor rules, endgame scoring, and why computer Go uses an unambiguous ruleset. From there, Eric builds the conceptual bridge to AlphaGo: a deterministic game with enormous branching factor and depth, where naive search is impossible. …
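
To make the "unambiguous ruleset" point concrete, here is a minimal sketch of Tromp-Taylor area scoring: a player's score is the number of points of their color plus the empty points that reach only their color. The board encoding (`.`, `X`, `O`) and the function name `tromp_taylor_score` are assumptions made for illustration, komi is left out, and this is not code from Eric's project.

```python
from collections import deque

# Hypothetical board encoding chosen for this sketch.
EMPTY, BLACK, WHITE = ".", "X", "O"

def tromp_taylor_score(board):
    """Return (black_points, white_points) for a board given as a list of
    equal-length strings. Komi is not applied here."""
    rows, cols = len(board), len(board[0])
    score = {BLACK: 0, WHITE: 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:
                # Every stone on the board counts as one point for its color.
                score[cell] += 1
            elif (r, c) not in seen:
                # Flood-fill this empty region and record which colors it touches.
                region_size, borders = 0, set()
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    region_size += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < rows and 0 <= nx < cols:
                            neighbor = board[ny][nx]
                            if neighbor == EMPTY:
                                if (ny, nx) not in seen:
                                    seen.add((ny, nx))
                                    queue.append((ny, nx))
                            else:
                                borders.add(neighbor)
                # An empty region scores only if it reaches exactly one color.
                if len(borders) == 1:
                    score[borders.pop()] += region_size
    return score[BLACK], score[WHITE]
```

Because no judgment about dead stones is needed, a function like this can serve as the terminal reward for self-play without human adjudication.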


Main takeaways

  1. AlphaGo’s key insight is not brute-force search alone, but search guided by learned value and policy networks.
  2. The training loop is best understood as supervised learning on improved labels produced by MCTS, not as pure sparse-reward RL.
  3. Go is especially suited to this approach because its rules are deterministic, its state is fully observable, and value estimation is meaningful.
  4. ResNets still appear to be a strong baseline for this kind of board-game learning on smaller compute budgets, though global context helps.
  5. Modern compute and coding agents make once-huge research projects much more accessible, but the first successful implementation is still hard.
  6. Off-policy relabeling and replay buffers can help, but only when the sampled states remain close enough to the agent’s actual trajectory.
  7. The broader lesson is that many apparently hard problems may yield to a compressed forward pass plus some search, but the exact transfer to LLM reasoning is still unsettled.
  8. Automated research tools are already useful for experiment execution and hyperparameter search, but not yet reliably good at choosing the next right question.
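
Takeaway 2 can be sketched as a loss function: MCTS visit counts become the policy label, and the final game outcome becomes the value label. The sketch below follows the commonly described AlphaGo Zero form (squared value error plus policy cross-entropy); the function names and the temperature parameter are assumptions for illustration, and Eric's actual implementation may differ.

```python
import math

def visits_to_policy_target(visit_counts, temperature=1.0):
    """Turn MCTS visit counts into a probability distribution over moves.
    Assumes at least one move was visited."""
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]

def distillation_loss(policy_probs, value_pred, visit_counts, outcome):
    """Supervised loss on search-improved labels.

    policy_probs: the network's move probabilities for this position
    value_pred:   the network's predicted outcome, in [-1, 1]
    visit_counts: MCTS visit counts for the same moves (the improved label)
    outcome:      the final game result from this player's view (+1 or -1)
    """
    target = visits_to_policy_target(visit_counts)
    # Cross-entropy between the search distribution and the raw policy.
    policy_loss = -sum(
        t * math.log(p) for t, p in zip(target, policy_probs) if t > 0
    )
    # Squared error between the predicted and the actual outcome.
    value_loss = (outcome - value_pred) ** 2
    return policy_loss + value_loss
```

Nothing in this loss refers to a reward signal beyond the final outcome; the policy term is ordinary classification against a teacher distribution, which is why the loop reads more like distillation than sparse-reward RL.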

Market read by horizon

Short term

Near term, the actionable read is that search-plus-value pipelines can still produce large gains quickly if the problem has a strong verifier and a good initialization. The immediate risk is overgeneralizing AlphaGo-style heuristics to open-ended LLM reasoning before the target environment is sufficiently structured.

  • Eric says his current implementation is only partially validated; he has not yet fully completed the tabula-rasa first step and is still testing against strong KataGo-style baselines.
  • He reports that a strong starting point matters: pretraining on expert data or best-response training against KataGo is more practical than starting from scratch.
  • For immediate implementation, he recommends verifying Go rules, value head quality, and fast simulation before adding full MCTS complexity.

Mid term

Over the next several months, the base case is continued progress from distillation, better priors, and more compute-efficient search loops rather than a brand-new algorithmic breakthrough. Validation will come from whether improved labels and value estimates keep compounding without collapsing off-distribution.

  • Over weeks or months, the likely path is iterative improvement through better initialization, better value estimates, and then search-distillation loops that steadily strengthen the policy.
  • He expects MCTS to remain a strong teacher so long as the value function stays grounded and the training distribution remains close to reachable states.
  • If the model becomes good enough, the policy should absorb much of the search burden, reducing the number of simulations needed for similar strength.

Long term

Structurally, the transcript argues that many hard problems may be better viewed as search problems that can be compressed into learned forward passes plus lightweight planning. If that holds broadly, the durable regime shift is toward systems that learn to imitate better search, not just raw end-to-end predictors.

  • AlphaGo demonstrates that a comparatively small neural network can amortize a very deep search process and approximate a hard combinatorial problem remarkably well.
  • The broader structural implication is that some problems thought to require huge explicit computation may instead be tractable through learned compression plus search.
  • He links this to a possible future where forward passes encode a great deal of reasoning or simulation, potentially changing how we think about complexity and AI capability.

Key claims (8)

BULLISH · AI scaling and search · AlphaGo

AlphaGo was profound because deep learning solved a search problem that was long believed intractable for brute-force methods.

Eric frames AlphaGo as solving a historically intractable search class using deep learning.

BULLISH · compute efficiency · KataGo

KataGo achieved a roughly 40x reduction in compute needed to train a strong Go bot tabula rasa compared with earlier systems.

He explicitly states the 40x figure and says KataGo is very strong.

BULLISH · search and planning · AlphaGo

Monte Carlo tree search improves Go strength by combining policy priors with value estimates and a visit-count-driven exploration rule.

This is the technical core of his explanation of PUCT and MCTS.
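
As a rough illustration of that selection rule, the sketch below scores each move as its mean value Q plus an exploration bonus that grows with the policy prior and shrinks as the move's visit count rises. It uses the PUCT form usually attributed to AlphaGo Zero; the function name, argument layout, and the default `c_puct` are assumptions for illustration, not values taken from the interview.

```python
import math

def puct_select(priors, visit_counts, value_sums, c_puct=1.5):
    """Return the index of the child move with the highest PUCT score.

    priors:       policy-network probability for each move
    visit_counts: number of times each move has been explored
    value_sums:   sum of backed-up values for each move
    c_puct:       exploration constant (hypothetical default)
    """
    total_visits = sum(visit_counts)
    best_move, best_score = 0, -math.inf
    for move, (prior, visits, value_sum) in enumerate(
        zip(priors, visit_counts, value_sums)
    ):
        # Mean value so far; unvisited moves start at 0.
        q = value_sum / visits if visits > 0 else 0.0
        # Exploration bonus: high prior and few visits give a large bonus.
        u = c_puct * prior * math.sqrt(total_visits) / (1 + visits)
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move
```

With no visits anywhere, every score is zero and the first move wins the tie; some implementations use `sqrt(total_visits + 1)` so that the priors break ties at the root instead.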


Assets discussed (10)

AlphaGo
NEUTRAL other

Core subject of the conversation; not a tradable asset.

KataGo
NEUTRAL other

Referenced as the strong open-source Go bot and baseline for experiments.


Interview (22 Q&A)

AlphaGo interest

Why is AlphaGo interesting, and why did you choose it for your sabbatical project?

He says AlphaGo captivated him because it showed how far deep learning could go on a problem long considered intractable for search. He also wanted to understand how a relatively small network could amortize such deep game-tree simulation, especially after seeing the early breakthroughs in 2014-2016.

Game end

When does a Go game end?

He says the game ends either when a player resigns or when both players pass consecutively.
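
In a simulator, that termination rule reduces to a check on the move history. This is a minimal sketch; the move encoding (`"pass"` and `"resign"` strings) and the function name are assumptions for illustration.

```python
# Hypothetical move encoding for this sketch.
PASS, RESIGN = "pass", "resign"

def game_over(moves):
    """Return True if the move history ends the game: the last move is a
    resignation, or the last two moves are consecutive passes."""
    if not moves:
        return False
    if moves[-1] == RESIGN:
        return True
    return len(moves) >= 2 and moves[-1] == PASS and moves[-2] == PASS
```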

AlphaGo method

How do you crack Go with AI, and how does AlphaGo work?

He says the approach is to first build intuition around the search process used to choose moves, then add deep learning to make that search efficient and tractable. He frames the rest of the explanation as an implementation-minded walkthrough of AlphaGo's move selection.


Where this transcript pushes against consensus

  • Eric’s own evaluation of whether transformers can beat ResNets here remains tentative; he frames it as his experience rather than a settled result.
  • He repeatedly says some claims are not peer-reviewed and should be treated as provisional, especially around scaling laws, compute multipliers, and his own implementation choices.
  • Eric explicitly rejects the suggestion that AlphaGo becomes less impressive once its mechanics are understood; he argues the accomplishment remains profound.
  • He speculates about links to P=NP, chaos, and broader computational hardness, but these are philosophical analogies rather than demonstrated claims.
  • He acknowledges uncertainty about the exact test-time scaling behavior of MCTS and says he does not know the precise curve shape.
  • He is cautious about generalizing MCTS-style search to LLM reasoning, saying the jury is still out and that PUCT-like heuristics may not transfer cleanly.

Topics

  • AlphaGo reconstruction
  • Go rules and Tromp-Taylor scoring
  • Monte Carlo tree search
  • PUCT/UCB exploration
  • Policy and value networks
  • ResNet vs transformer architectures
  • Self-play and distillation
  • RL, DAgger, and Q-learning
  • Off-policy training and replay buffers
  • Automated AI research
