Training
Gym-Khana integrates with Stable Baselines3 for PPO training and Weights & Biases for experiment tracking.
Training scripts
The main racing training script is train/ppo_race.py. The recovery training script is train/ppo_recover.py. Both support the following modes:
Train (--m t): Train a new model with parallel environments using SubprocVecEnv
Evaluate (--m e): Evaluate a trained model with visualization
Download (--m d): Fetch a model from wandb and evaluate it
Continue (--m c): Continue training from an existing checkpoint
Transfer (--m f): Transfer a pretrained model to a new task, preserving network weights but resetting the optimizer and LR schedule, and optionally resetting log_std for fresh exploration and the critic network for fresh value approximation
Examples:
# Train a new racing model
python3 train/ppo_race.py --m t
# Evaluate a local model (uses latest wandb run if --path not specified)
python3 train/ppo_race.py --m e
python3 train/ppo_race.py --m e --path /path/to/model.zip
# Download from wandb and evaluate
python3 train/ppo_race.py --m d --run_id <wandb_run_id>
# Continue training
python3 train/ppo_race.py --m c --path /path/to/model.zip --additional_timesteps 10000000
Detailed usage guidelines are at the top of each training script file.
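As a rough illustration of how the flags used in the examples above fit together, the following is a minimal sketch of an argument parser. It is not the parser in train/ppo_race.py: the flag names come from the examples, but build_parser, validate, and the validation rules are hypothetical and exist only for illustration.

```python
import argparse

# Mode letters as documented above; the descriptions are paraphrased.
MODES = {
    "t": "train a new model",
    "e": "evaluate a trained model",
    "d": "download a model from wandb and evaluate it",
    "c": "continue training from a checkpoint",
    "f": "transfer a pretrained model to a new task",
}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real script)."""
    parser = argparse.ArgumentParser(description="PPO training entry point (sketch)")
    parser.add_argument("--m", choices=sorted(MODES), required=True, help="run mode")
    parser.add_argument("--path", default=None, help="path to a model .zip")
    parser.add_argument("--run_id", default=None, help="wandb run id")
    parser.add_argument("--additional_timesteps", type=int, default=None)
    return parser


def validate(args: argparse.Namespace) -> None:
    """Assumed rules: download needs a run id; continue/transfer need a model path."""
    if args.m == "d" and args.run_id is None:
        raise ValueError("--m d requires --run_id")
    if args.m in ("c", "f") and args.path is None:
        raise ValueError(f"--m {args.m} requires --path")


# Parse the "continue training" example from the list above.
args = build_parser().parse_args(
    ["--m", "c", "--path", "/path/to/model.zip", "--additional_timesteps", "10000000"]
)
validate(args)
print(MODES[args.m], args.path, args.additional_timesteps)
```

The usage comments at the top of each script remain the authoritative description of which flags each mode accepts.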
Callbacks
Default SB3 callbacks used during training:
WandbCallback: log metrics to wandb
CheckpointCallback: save periodic checkpoints
EvalCallback: evaluate during training
Observation min/max snapshots
When record_obs_min_max: true is set in the gym config, an ObsMinMaxSnapshotCallback is attached automatically. Every CKPT_SAVE_FREQ env steps (and once at the end of training) it merges the observation min/max trackers from each subprocess and writes the result to outputs/config/<run_id>/obs_min_max.yaml. Per-feature bounds-violation magnitudes are streamed to wandb under obs_bounds/<feature>/over and obs_bounds/<feature>/under, so the offending feature can be identified from the metric key. The aggregated table is still printed to stdout at the end of training.
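The merge and the over/under metrics can be pictured with a small standalone sketch. This is not the ObsMinMaxSnapshotCallback code: merge_min_max and bounds_violations are hypothetical helpers, the feature names and declared bounds are made up, and only the obs_bounds/<feature>/over and obs_bounds/<feature>/under key layout is taken from the text above.

```python
from typing import Dict, List, Tuple

Bounds = Dict[str, Tuple[float, float]]  # feature name -> (min, max)


def merge_min_max(trackers: List[Bounds]) -> Bounds:
    """Combine per-worker (min, max) records into one global record per feature."""
    merged: Bounds = {}
    for tracker in trackers:
        for name, (lo, hi) in tracker.items():
            if name in merged:
                cur_lo, cur_hi = merged[name]
                merged[name] = (min(cur_lo, lo), max(cur_hi, hi))
            else:
                merged[name] = (lo, hi)
    return merged


def bounds_violations(observed: Bounds, declared: Bounds) -> Dict[str, float]:
    """Return how far each feature went past its declared bounds (0.0 if it did not)."""
    metrics: Dict[str, float] = {}
    for name, (obs_lo, obs_hi) in observed.items():
        low, high = declared[name]
        metrics[f"obs_bounds/{name}/over"] = max(0.0, obs_hi - high)
        metrics[f"obs_bounds/{name}/under"] = max(0.0, low - obs_lo)
    return metrics


# Two hypothetical workers, each with its own observed extremes.
workers = [
    {"yaw_rate": (-2.0, 3.5), "speed": (0.0, 8.0)},
    {"yaw_rate": (-6.5, 1.0), "speed": (0.5, 12.0)},
]
merged = merge_min_max(workers)
metrics = bounds_violations(merged, {"yaw_rate": (-5.0, 5.0), "speed": (0.0, 10.0)})
for key, value in sorted(metrics.items()):
    print(key, value)
```

Because each metric key carries the feature name, a nonzero value points directly at the feature whose declared bounds are too tight.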
Instability prevention
When prevent_instability: true is set in the gym config, every RaceCar runs a sanity check on the post-RK4 standardized state and reverts to the pre-step state on any violation. The env then truncates the episode with info["instability_truncation"] = True and a populated info["unstable_info"]. An InstabilityCountCallback is attached automatically; it logs the total event count, summed across subprocesses, to wandb under instability/total every CKPT_SAVE_FREQ env steps (and once at the end of training). At the end of the run, the per-env breakdown is printed to stdout. The detection bounds are configurable via instability_yaw_rate_bound and instability_slip_bound (defaults: 4π rad/s and π/2 rad).
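The check-and-revert behaviour can be sketched as follows. This is an illustration under stated assumptions, not the RaceCar implementation: State, check_state, and guarded_step are hypothetical names, the state is reduced to two fields, the non-finite check is an assumed extra, and the contents of unstable_info are invented. Only the default bounds and the two info keys come from the text above.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

YAW_RATE_BOUND = 4 * math.pi  # rad/s, documented default
SLIP_BOUND = math.pi / 2      # rad, documented default


@dataclass(frozen=True)
class State:
    """Hypothetical reduced vehicle state."""
    yaw_rate: float  # rad/s
    slip: float      # rad


def check_state(state: State) -> Optional[str]:
    """Return the name of the violated check, or None if the state looks sane."""
    if not (math.isfinite(state.yaw_rate) and math.isfinite(state.slip)):
        return "non_finite"  # assumed extra check, not confirmed by the docs
    if abs(state.yaw_rate) > YAW_RATE_BOUND:
        return "yaw_rate"
    if abs(state.slip) > SLIP_BOUND:
        return "slip"
    return None


def guarded_step(pre: State, post: State) -> Tuple[State, bool, dict]:
    """Keep the post-integration state if sane; otherwise revert and truncate."""
    reason = check_state(post)
    if reason is None:
        return post, False, {}
    info = {
        "instability_truncation": True,
        "unstable_info": {"reason": reason, "rejected_state": post},  # hypothetical layout
    }
    return pre, True, info


pre = State(yaw_rate=0.5, slip=0.0)
state, truncated, info = guarded_step(pre, State(yaw_rate=13.0, slip=0.0))
print(state, truncated, info["unstable_info"]["reason"])
```

Reverting to the pre-step state means the final observation of a truncated episode stays within the bounds, rather than exposing the diverged values to the policy.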
log_std schedule
When the log_std_schedule block is present in train/config/rl_config.yaml, a LogStdScheduleCallback is attached to fresh training runs (--m t). It linearly anneals the policy’s log_std from init to end across total_timesteps and freezes the parameter so the schedule fully controls action noise.
This closes the stochastic-vs-deterministic train/eval gap. SB3’s default log_std_init=0 (σ=1.0 over normalized actions) lets the actor mean drift to a noise-dependent attractor: a policy that scores well on noisy PPO rollouts but fails under deterministic eval. Annealing σ toward zero forces rollout trajectories to resemble deterministic-eval trajectories, so the gradient optimizes the mean against samples that match what eval sees.
log_std_schedule:
  init: -1.0  # σ ≈ 0.37 in normalized action space
  end: -3.0   # σ ≈ 0.05; near-deterministic
To disable the schedule, comment out the whole block. The policy then uses the SB3 default log_std_init=0, which reproduces the gap, so do this only for ablation. The schedule applies only to fresh training (--m t); --m c and --m f use the existing transfer_reset_log_std knob instead. The scheduled target is logged to wandb under train/log_std_scheduled.
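The linear anneal itself is simple arithmetic, sketched below with the init and end values from the example config. scheduled_log_std is a hypothetical helper, not the LogStdScheduleCallback code, and the 1,000,000-step total is an arbitrary illustration. In an SB3 callback the resulting value would presumably be written into the policy's log_std parameter, but that wiring is not shown here.

```python
import math


def scheduled_log_std(num_timesteps: int, total_timesteps: int,
                      init: float, end: float) -> float:
    """Linearly interpolate log_std from init to end over the training run."""
    # Clamp progress to [0, 1] so calls past the end of training hold the final value.
    progress = min(max(num_timesteps / total_timesteps, 0.0), 1.0)
    return init + progress * (end - init)


total = 1_000_000  # hypothetical total_timesteps
for step in (0, total // 2, total):
    log_std = scheduled_log_std(step, total, init=-1.0, end=-3.0)
    sigma = math.exp(log_std)  # standard deviation of the Gaussian action noise
    print(f"step={step} log_std={log_std:.2f} sigma={sigma:.3f}")
```

Because the schedule is linear in log_std, σ decays exponentially in training progress: halfway through, log_std is -2.0 and σ is about 0.14.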
Curriculum learning
A custom CurriculumLearningCallback is available for recovery training. It gradually expands the recovery state initialization ranges as the agent’s success rate improves.
Curriculum learning is configured in train/config/gym_config.yaml under the curriculum key:
curriculum:
  enabled: true
  n_stages: ...
  success_threshold: ...
  v_range: [...]
  beta_range: [...]
Note
Curriculum learning is only supported for recovery training (training_mode: "recover"), accessed through train/ppo_recover.py.
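One way to picture stage-based range expansion is the sketch below. It is not the CurriculumLearningCallback: CurriculumSketch is a hypothetical class, the numeric values are invented, and the rule that the range grows linearly and symmetrically about its midpoint is an assumption made purely for illustration. Only the parameter names n_stages, success_threshold, and v_range come from the config above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CurriculumSketch:
    n_stages: int
    success_threshold: float
    v_range: Tuple[float, float]  # assumed to be the full range used at the final stage
    stage: int = 1

    def current_v_range(self) -> Tuple[float, float]:
        """Assumed rule: widen the range linearly with the stage, about its midpoint."""
        lo, hi = self.v_range
        center = 0.5 * (lo + hi)
        half_width = 0.5 * (hi - lo) * (self.stage / self.n_stages)
        return (center - half_width, center + half_width)

    def update(self, success_rate: float) -> bool:
        """Advance one stage when the success rate reaches the threshold."""
        if success_rate >= self.success_threshold and self.stage < self.n_stages:
            self.stage += 1
            return True
        return False


curriculum = CurriculumSketch(n_stages=4, success_threshold=0.8, v_range=(2.0, 10.0))
first_range = curriculum.current_v_range()
advanced = curriculum.update(success_rate=0.9)
second_range = curriculum.current_v_range()
print(first_range, advanced, second_range)
```

The same pattern would apply to beta_range; a success rate below the threshold leaves the stage, and therefore the sampled ranges, unchanged.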
Wandb integration
Models and training runs are logged to: https://wandb.ai/teo-altum-quinque-queen-s-university/projects
Formatting and linting
The project uses ruff for formatting and linting:
ruff check --fix . # lint + auto-fix (unused imports, import sorting)
ruff format . # format (Black-compatible)
Pre-commit hooks run both automatically on git commit (configured in .pre-commit-config.yaml).