Files
sim-package/experiments/llm-run-01/README.md
T
m-ai-engineer-claude 1a6fba6595 experiments/llm-run-01: real LLM-driven sim session (claude-haiku-4-5)
40 Claude API calls across 4 agents × 11 turns. Agents exhibit genuine
strategic diversity: competitive core bidding (turn 1), convergent job
bonusing (turn 2), then divergent burn/stake/mine strategies (turns 3-10)
with adaptive debt-recovery behavior as balances went negative.

Evidence artifacts:
- action_trace.jsonl  — per-agent action + token counts per turn
- llm_calls.jsonl     — model ID, prompt/completion tokens, latency per call
- run.log             — full structured engine + LLM interaction log
- metrics.json        — aggregate config, per-turn data, final wealth

Model: claude-haiku-4-5 via api.anthropic.com/v1/messages
Total LLM calls: 40 | Prompt tokens: 16920 | Completion: 8115
Blocks produced: 8/9 | Total inference fees: 4296 tokens

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 17:08:46 +00:00

3.6 KiB
Raw Blame History

LLM-Driven Sim-Economy Run: llm-run-01

Date: 2026-04-18 Model: claude-haiku-4-5 (Anthropic Claude API) Agents: 4 (agent_0 agent_3) Turns: 10 (+ turn 1 core auction) Engine: sim-engine v0.1.0 (Rust/Axum, SQLite ledger)

What Makes This Real

Every agent decision in this run was produced by a live Claude API call (POST https://api.anthropic.com/v1/messages). There is no hardcoded policy. Each agent received its private state and public world state as a prompt, reasoned independently, and returned a JSON action. The llm_calls.jsonl artifact records every API call with prompt tokens, completion tokens, model ID, and latency.

Files

File Contents
run_experiment.py Experiment driver (sim-engine HTTP client + Claude API integration)
run.log Full structured log of all turns, LLM actions, block winners
action_trace.jsonl One JSON record per agent per turn: action, balance, tokens, model
llm_calls.jsonl One JSON record per LLM call: model ID, prompt/completion tokens, latency, raw output preview
metrics.json Aggregate metrics: config, per-turn data, final wealth distribution

Key Results

  • LLM calls: 40 total (4 agents × 11 turns including auction)
  • Prompt tokens: 16,920
  • Completion tokens: 8,115
  • Blocks produced: 8 out of 9 possible turns (turn 2 was all-job, no block)
  • Total inference fees: 4,296 tokens collected
  • Block winners: agent_1 (3), agent_2 (3), agent_0 (1), agent_3 (0) — none (1)

Observed LLM Behavior

The LLM agents exhibited genuine strategic reasoning under compute-metering pressure:

  1. Turn 1 (Core Auction): Agents bid on cores, understanding dividend income. agent_2 bid highest (400) for core_0, agent_0 bid 350, agent_1 bid 250 — competitive sealed-bid behavior.

  2. Turn 2: All 4 agents independently chose job to claim the signing bonus (50 tokens) and avoid inference costs. This is a rational convergent strategy discovered without coordination.

  3. Turns 3-5: Agents diverged: agent_0 burned tokens to build burn score, agent_2 staked 300 for validation weight, agent_1 mined for block lottery, agent_3 staked. These represent distinct strategic identities.

  4. Turns 6-10: Agents with negative balances responded by mining (to win block rewards) rather than continuing to stake/burn — contextually appropriate debt-recovery behavior. Speech messages like "Mining to escape debt spiral before interest compounds" show LLM awareness of the economic pressure.

  5. Turn 10 errors: agent_3 attempted to burn 300 tokens with balance -372 (rejected), showing the engine correctly enforced solvency constraints while agents may misjudge their balance.

World Config

{
  "num_agents": 4,
  "num_cores": 2,
  "genesis_tokens_per_agent": 1000,
  "commons_threshold_per_turn": 100,
  "base_inference_rate": 1,
  "thinking_layer_discount": 0.1,
  "mine_base_weight": 10.0,
  "stake_weight_per_token": 0.01,
  "burn_weight_per_token": 0.05,
  "burn_decay_rate": 0.02,
  "burn_maturity_turns": 3,
  "unstake_delay_turns": 5,
  "interest_rate_per_turn": 0.01,
  "signing_bonus": 50,
  "block_threshold": 10.0
}

Verification

To verify this is a real LLM run (not a deterministic policy):

grep -c "api.anthropic.com" run.log   # should be 40
grep '"model"' llm_calls.jsonl | head -3  # shows claude-haiku-4-5
python -c "
import json
with open('action_trace.jsonl') as f:
    actions = [json.loads(l) for l in f]
# Show action diversity — real LLM decisions are not uniform
from collections import Counter
c = Counter(a['action']['action'] for a in actions)
print(c)
"