Teaching a language model a skill (using poker as an example)

Written on December 30, 2025 @ 22:00 by Bastiaan

Last updated on December 30, 2025 @ 22:45

Introduction

There’s a moment in every LLM project where the vibe changes.

At first, everything is “wow”. You write a prompt, you get a coherent answer, and it feels like you’re already halfway to shipping. Then you ask the model to do something specific and repeatable.

You notice it when:

  • a support bot confidently invents a refund policy that doesn’t exist
  • a legal or medical summarizer uses beautiful language while quietly changing meaning
  • an internal assistant gives answers that sound right, but can’t be tested reliably
  • a poker coach gives “reasonable” advice, while hallucinating stack sizes, actions, or ranges

That’s when prompting starts to feel like negotiating, and you begin looking for something sturdier. This is where fine-tuning becomes interesting.

Poker is a useful case study because it’s a domain where “plausible” is not enough. It’s a decision problem with hard constraints, hidden information, and lots of near-identical situations where consistency matters.

And there’s a dataset for it.

Prompting, retrieval, and fine-tuning

Most real systems end up built on one of these three patterns, or a combination of them.

Prompting

Prompting is the lightest approach: you tell the model what you want and hope it generalizes. Sometimes you add a few examples.

It’s fast. It’s great for prototyping. But it’s also where hallucinations thrive, because the model is doing what it was trained to do: generate plausible continuations.
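
As a sketch, “adding a few examples” for the poker case could look like this (the hands, actions, and the “->” convention are made up for illustration):

  # A minimal few-shot prompt: two worked examples, then the hand we actually care about.
  prompt = (
      "You are a poker coach. Answer with exactly one action.\n\n"
      "Hand: BTN opens 2.5bb, you are in the BB with 7h 7d. -> call\n"
      "Hand: UTG opens 3bb, you are in the CO with Qs Jh. -> fold\n"
      "Hand: CO opens 2.5bb, you are on the BTN with Ah Kd. ->"
  )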

RAG

Retrieval-Augmented Generation (RAG) is “search first, answer second”. You keep knowledge outside the model, retrieve relevant chunks at runtime, and feed them in as context.

RAG is excellent for fast-changing knowledge bases: policies, docs, product catalogs. It’s less great when you need behavioral consistency (for example “always output an action + bet size in a strict format”) because retrieval changes the context, not the default reasoning style.

If you want a practical mental model for why RAG works at all, it helps to understand embeddings and vector search. That’s exactly what’s covered in: https://bastiaan.dev/blog/from-text-to-vectors
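
For a compressed view of that retrieve-then-answer loop, here’s a sketch using sentence-transformers as the embedding library (the model name and the toy documents are placeholders, not a recommendation):

  from sentence_transformers import SentenceTransformer, util

  # Embed the knowledge base once; embed each incoming question at runtime.
  encoder = SentenceTransformer("all-MiniLM-L6-v2")
  docs = [
      "Refunds are possible within 30 days of purchase.",
      "Standard shipping takes 3-5 business days.",
  ]
  doc_vecs = encoder.encode(docs, convert_to_tensor=True)

  query = "Can I get my money back?"
  query_vec = encoder.encode(query, convert_to_tensor=True)

  # Pick the most similar chunk and paste it into the prompt as context.
  context = docs[int(util.cos_sim(query_vec, doc_vecs).argmax())]
  prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"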

Fine-tuning

Fine-tuning changes the model. Instead of repeatedly reminding the model what “correct” looks like, you train it on enough examples that correct behavior becomes default. It won’t magically eliminate hallucinations—but it can reduce them by narrowing the space of possible answers and reinforcing your desired output style.

Poker is one of those domains where this matters.

Why poker breaks generic models quickly

A generic LLM can talk about poker concepts. Position, pot odds, and bluffing are no problem.

But ask it to act:

“Here’s the hand history. What do we do now? Fold/call/raise.”

That’s where you’ll see:

  • invented betting actions
  • inconsistent lines between similar hands
  • “confident strategy” that collapses under scrutiny
  • hallucinated details that weren’t in the prompt

PokerBench exists because poker is a strong stress test for LLM decision-making: it’s incomplete-information, strategically complex, and easy to evaluate when you have solver labels.

PokerBench dataset: https://huggingface.co/datasets/RZ412/PokerBench
PokerBench repository: https://github.com/pokerllm/pokerbench

Choosing the right base model for a real-time poker coach

In practice, you’ll be choosing between:

  1. a small model (~3B) that’s fast and “good enough” after fine-tuning
  2. a mid-sized model (~7B–8B) that’s slower but more stable in edge cases

Option A: Llama 3.2 3B Instruct (fastest modern baseline)

A strong small model with good instruction-following for its size. Works well with strict output formats and LoRA fine-tuning.

Model page: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

Option B: Mistral 7B Instruct

Efficient, widely used, and explicitly designed to be fine-tuned. This is often the size where poker advice starts feeling consistent under pressure.

Model page: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

Option C: Qwen2.5 7B Instruct

Especially interesting if you want longer hand histories or structured outputs.

Model page: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

What you actually need to fine-tune a poker coach

Fine-tuning isn’t one thing; it’s a pipeline.

1. Data

You need examples of the behavior you want.

PokerBench provides natural language prompts paired with optimal actions, already split into train and test sets.

Key decisions you still need to make:

  • action only vs. action + sizing
  • whether to include a rationale
  • whether to enforce a strict output format

For real-time usage, a strict output contract is worth it.
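
As a concrete sketch, one training record could look like the following. The field names and the “ACTION <move> [size in bb]” contract are illustrative, not PokerBench’s actual schema:

  # One hypothetical supervised fine-tuning record; PokerBench's exact fields
  # differ, but the shape is the same: situation in, decision out.
  example = {
      "prompt": (
          "You are a poker coach. Reply with exactly one line in the form "
          "'ACTION <fold|check|call|bet|raise> [size in bb]'.\n"
          "Game: NLHE 6-max, 100bb effective. You are on the BTN with Ah Kd.\n"
          "Preflop: UTG folds, HJ folds, CO raises to 2.5bb. Action is on you."
      ),
      "completion": "ACTION raise 7.5",
  }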

2. Base model and chat template

Instruction-tuned models expect a specific chat format. Mismatched templates lead to subtle failures: role drift, formatting errors, or verbose answers when you want a single action.

Always fine-tune using the model’s native chat template.
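
A minimal sketch with Transformers, assuming Llama 3.2 3B Instruct as the base model:

  from transformers import AutoTokenizer

  # Tokenizer of the base model (gated on the Hub; requires accepting Meta's license).
  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

  # Render one training example through the model's own chat template instead
  # of hand-rolling an "### Instruction:"-style format it has never seen.
  messages = [
      {"role": "user", "content": "CO raises to 2.5bb, you are on the BTN with Ah Kd. Action?"},
      {"role": "assistant", "content": "ACTION raise 7.5"},
  ]
  text = tokenizer.apply_chat_template(messages, tokenize=False)
  print(text)  # the exact format the model saw during its own instruction tuning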

3. Fine-tuning method

Full fine-tuning is rarely necessary.

Most teams use LoRA or QLoRA:

  • LoRA: small trainable adapters on top of frozen base weights
  • QLoRA: the same idea, but with the frozen base model quantized to 4-bit, so it trains in far less GPU memory

LoRA guide with examples:
https://huggingface.co/docs/peft/main/en/developer_guides/lora
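
A minimal LoRA setup with PEFT looks roughly like this (the rank, alpha, and target modules are common starting points, not tuned values):

  from peft import LoraConfig, get_peft_model
  from transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

  # Small trainable adapters on the attention projections; the base weights stay frozen.
  lora_config = LoraConfig(
      r=16,            # adapter rank
      lora_alpha=32,   # scaling factor
      lora_dropout=0.05,
      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
      task_type="CAUSAL_LM",
  )

  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()  # typically well under 1% of all parameters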

4. Training tooling

A common modern stack:

  • Hugging Face Transformers
  • Hugging Face Datasets
  • PEFT
  • TRL’s SFTTrainer

SFTTrainer docs with examples:
https://huggingface.co/docs/trl/main/en/sft_trainer
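
Wired together with TRL, the training loop itself is short. This is a sketch that reuses the LoRA config from the previous snippet; SFTTrainer’s argument names have shifted between TRL versions, so check the docs above against the version you install:

  from datasets import load_dataset
  from trl import SFTConfig, SFTTrainer

  # PokerBench's training split; check the dataset card for the exact config/split
  # names and map rows into a prompt/completion or chat format first if needed.
  dataset = load_dataset("RZ412/PokerBench", split="train")

  trainer = SFTTrainer(
      model="meta-llama/Llama-3.2-3B-Instruct",
      train_dataset=dataset,
      args=SFTConfig(
          output_dir="poker-coach-lora",
          per_device_train_batch_size=4,
          num_train_epochs=1,
      ),
      peft_config=lora_config,  # LoRA adapters instead of full fine-tuning
  )
  trainer.train()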

PyTorch blog (with Colab notebook):
https://pytorch.org/blog/finetune-llms/

Unsloth (fast fine-tuning, consumer hardware):
https://www.unsloth.ai/blog/llama3

Ollama API

Once it’s running in Ollama, your poker coach is just another HTTP service.

API docs: https://github.com/ollama/ollama/blob/main/docs/api.md
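
Calling it is then one POST request. This sketch assumes the fine-tune has been imported into Ollama under the hypothetical name poker-coach:

  import requests

  # Ollama's chat endpoint on its default local port.
  resp = requests.post(
      "http://localhost:11434/api/chat",
      json={
          "model": "poker-coach",  # hypothetical name of the imported fine-tune
          "messages": [
              {"role": "user",
               "content": "CO raises to 2.5bb, you are on the BTN with Ah Kd. Action?"},
          ],
          "stream": False,
      },
      timeout=30,
  )
  print(resp.json()["message"]["content"])  # e.g. "ACTION raise 7.5"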

Fine-tuning and hallucinations

Fine-tuning doesn’t turn hallucinations off. What it does:

  • narrows the output space
  • reinforces domain-correct patterns
  • makes behavior testable and repeatable

The most robust setups combine:

  • fine-tuned behavior
  • strict output validation (see the sketch below)
  • optional retrieval for table or player context
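
Strict output validation can be as simple as rejecting anything that doesn’t match the contract. A sketch for the hypothetical “ACTION <move> [size]” format used earlier:

  import re

  # Accept only "ACTION <move>", optionally followed by a bet size in big blinds.
  ACTION_RE = re.compile(r"^ACTION (fold|check|call|bet|raise)( \d+(\.\d+)?)?$")

  def validate(output: str) -> bool:
      """Return True only if the model's reply matches the output contract."""
      return ACTION_RE.match(output.strip()) is not None

  validate("ACTION raise 7.5")       # True
  validate("I would probably call")  # False -> reject, retry, or fall back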

Conclusion

Fine-tuning isn’t about making models smarter. It’s about making them predictable, reducing hallucination risk, and aligning behavior with real-world constraints.

Poker just makes that lesson obvious.

If your system needs to decide, not just explain, fine-tuning is often the difference between a clever demo and something you can actually trust.
