Operational Runbooks
This guide covers common operational tasks for running Hydra in production.
Startup Procedures
Pre-flight Checklist
Before starting the bot:
- Verify configuration

      cat config.yaml | grep mode   # Should match intended mode

- Check credentials (live mode only)

      # Ensure secrets are set
      echo $POLYMARKET_PRIVATE_KEY | wc -c   # Should be > 60 chars

- Verify network connectivity

      curl -s https://gamma-api.polymarket.com/health
      curl -s https://api.binance.com/api/v3/ping

- Check network latency (critical for latency arbitrage)

      hydra latency     # Using binary
      bun run latency   # From source

  Expected results for production deployment:
  - Binance WebSocket: <50ms (ideally <10ms from Tokyo)
  - Polymarket REST: <200ms

  If Binance latency >100ms, consider deploying to AWS ap-northeast-1 (Tokyo).

- Check disk space

      df -h ./runs   # Ensure adequate space for logs
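The credential and latency checks above can be scripted so they fail loudly instead of relying on a visual check. The following is a minimal sketch, not part of Hydra: the function names `key_length_ok` and `classify_binance_latency` are hypothetical, and the latency value is assumed to be a whole number of milliseconds that you read from the latency command's output yourself.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; names and return values are illustrative only.

# Succeeds if the key is longer than 60 characters (threshold from the checklist).
# ${#key} counts characters only, whereas `echo | wc -c` also counts the newline.
key_length_ok() {
    key=$1
    [ "${#key}" -gt 60 ]
}

# Classifies a Binance latency (integer milliseconds) against the checklist targets:
# below 50 is "ok", 50 to 100 is "slow", above 100 is "relocate".
classify_binance_latency() {
    ms=$1
    if [ "$ms" -lt 50 ]; then
        echo ok
    elif [ "$ms" -le 100 ]; then
        echo slow
    else
        echo relocate
    fi
}
```

For example, `key_length_ok "$POLYMARKET_PRIVATE_KEY" || echo "key missing or too short"` could gate a live start, and a `relocate` result corresponds to the advice to consider ap-northeast-1.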
Starting the Bot
    # Paper trading
    hydra run               # Using binary
    bun run paper           # From source

    # Live trading
    hydra run --mode live   # Using binary
    bun run bot             # From source

    # With Docker
    docker compose up -d hydra

Verifying Startup
- Check IPC server is listening:

      curl http://localhost:8787/health

- Connect TUI to verify data flow:

      hydra tui     # Using binary
      bun run tui   # From source

- Check logs for market discovery:

      tail -f runs/*/events.jsonl | grep MarketSelected
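Right after startup the IPC server may not be listening yet, so a single health request can fail spuriously. A small retry helper is one way to handle that. This is a sketch under assumptions: `wait_until` is a hypothetical helper, and the attempt count and delay are arbitrary choices rather than values Hydra documents.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, up to ATTEMPTS times, sleeping between tries.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Only sleep when another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

A possible invocation is `wait_until 30 1 curl -sf http://localhost:8787/health`, where `-f` makes curl exit non-zero on an HTTP error status.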
Monitoring
Key Metrics to Watch
| Metric | Normal Range | Alert Threshold |
|---|---|---|
| Orders/minute | 0-30 | >30 (rate limited) |
| Drawdown | 0-5% | >10% |
| Data staleness | <500ms | >3000ms |
| Fill rate | >80% | <50% |
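The alert thresholds in the table can be encoded as a simple check. The sketch below assumes you already have the four values as numbers; how they are collected is not covered here, and `check_metrics` is a hypothetical helper name.

```shell
#!/bin/sh
# check_metrics ORDERS_PER_MIN DRAWDOWN_PCT STALENESS_MS FILL_RATE_PCT
# Prints one line per breached alert threshold from the table above.
# Prints nothing when every value is within its limit.
check_metrics() {
    awk -v orders="$1" -v drawdown="$2" -v stale="$3" -v fill="$4" 'BEGIN {
        if (orders + 0 > 30)   print "ALERT orders/minute " orders " > 30"
        if (drawdown + 0 > 10) print "ALERT drawdown " drawdown "% > 10%"
        if (stale + 0 > 3000)  print "ALERT data staleness " stale "ms > 3000ms"
        if (fill + 0 < 50)     print "ALERT fill rate " fill "% < 50%"
    }'
}
```

Values between the normal range and the alert threshold (for example a 7% drawdown) produce no output here; they are worth watching but do not trip an alert.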
Health Indicators
Healthy system:
- TUI shows real-time price updates
- Reference prices updating every ~100ms
- No staleness warnings in logs
- Positions reconcile with exchange
Unhealthy indicators:
- Stale data warnings
- Kill switch triggered
- No orders for extended periods
- Position mismatch warnings
Log Monitoring
    # Watch for errors
    tail -f runs/*/events.jsonl | grep -E '"type":"(Error|KillSwitch|RiskTrip)"'

    # Watch order flow
    tail -f runs/*/events.jsonl | grep -E '"type":"Order(Placed|Filled)"'

    # Show the last logged reference price
    # (tail -1 never prints on a followed stream, so read the files without -f)
    grep -h ReferencePriceEvent runs/*/events.jsonl | tail -1

Shutdown Procedures
Graceful Shutdown
- Stop new orders. The bot handles SIGTERM gracefully.

      docker compose stop hydra
      # or
      kill -TERM $(pgrep -f "bun.*main.ts")

- Verify open orders are cancelled. Check the logs for order cancellation confirmations.

- Archive session data

      tar -czf session-$(date +%Y%m%d-%H%M%S).tar.gz runs/
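When the bot runs outside Docker, it helps to wait for the process to exit after SIGTERM before deciding whether a forced stop is needed. The sketch below shows one way to do that. `stop_gracefully` is a hypothetical helper, and the timeout is an assumption: this guide does not state how long the bot needs to cancel its orders.

```shell
#!/bin/sh
# stop_gracefully PID TIMEOUT_SECONDS
# Sends SIGTERM, then polls once per second until the process exits.
# Returns 0 if the process is gone, 1 if it is still running after the timeout.
stop_gracefully() {
    pid=$1
    timeout=$2
    # If the signal cannot be delivered, treat the process as already stopped.
    # (This also covers "permission denied", so run it as the bot's own user.)
    kill -TERM "$pid" 2>/dev/null || return 0
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A return value of 1 means the process ignored or outlived SIGTERM, which is the point at which a forced stop and a manual check of open orders become relevant. Note that `pgrep -f` can return several PIDs; pass them one at a time.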
Emergency Shutdown
If graceful shutdown fails:

    # Force stop
    docker compose kill hydra
    # or
    kill -9 $(pgrep -f "bun.*main.ts")

    # IMPORTANT: Manually cancel any open orders via Polymarket UI

Incident Response
Kill Switch Triggered
Symptoms: Bot stops trading, “KILL SWITCH TRIGGERED” in logs
What happens automatically:
- Risk mode is set to killed, blocking all new orders
- All open orders are cancelled via tradingService.cancelAllOrders()
- All positions are neutralized (UP/DOWN tokens sold to close positions)
- Detailed warnings are logged with the trigger reason
Actions:
- Check trigger reason in logs (look for “KILL SWITCH” entries)
- If data staleness: Check network/API connectivity
- If drawdown: Review recent trades, check position sizing
- If position/exposure limit: Review limit configuration vs trading strategy
- If manual: Intentional, verify before restarting
Recovery:
- Fix underlying issue
- Verify all positions were properly neutralized via Polymarket UI
- Restart bot with fresh state
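The trigger-to-action mapping above can be captured in a small lookup for on-call use. This is only a sketch: `suggest_action` is a hypothetical helper, and the keywords it matches are guesses, since the exact wording of Hydra's kill switch log entries is not shown in this guide. Adjust the patterns to the real messages.

```shell
#!/bin/sh
# suggest_action LOG_LINE
# Maps keywords in a kill switch log line to the matching runbook action.
# The keywords are assumptions, not Hydra's documented log format.
suggest_action() {
    line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $line in
        *stale*)               echo "Check network/API connectivity" ;;
        *drawdown*)            echo "Review recent trades and position sizing" ;;
        *position*|*exposure*) echo "Review limit configuration vs trading strategy" ;;
        *manual*)              echo "Intentional stop: verify before restarting" ;;
        *)                     echo "Unknown trigger: inspect the full log entry" ;;
    esac
}
```

The fallback branch matters: an unrecognized reason should send the operator back to the full log entry rather than suggest a restart.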
Position Reconciliation Mismatch
Symptoms: “Position mismatch” warnings, PnL looks wrong
Actions:
- Check Polymarket UI for actual positions
- Compare with bot’s reported positions in TUI
- If mismatch persists, restart bot to force resync
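If you copy the positions from the Polymarket UI and from the TUI into two text files, the comparison can be done mechanically. The sketch below assumes a made-up format of one `market quantity` pair per line; neither Hydra nor Polymarket is known to export this format, and `diff_positions` is a hypothetical helper.

```shell
#!/bin/sh
# diff_positions BOT_FILE EXCHANGE_FILE
# Each file holds "market quantity" pairs, one per line (hypothetical format).
# Prints "market bot=<qty> exchange=<qty>" for every market whose quantities
# differ. A market missing from one file counts as quantity 0.
diff_positions() {
    awk '
        FILENAME == ARGV[1] { bot[$1] = $2; next }   # first file: bot positions
        { exch[$1] = $2 }                            # second file: exchange positions
        END {
            for (m in bot)  all[m] = 1
            for (m in exch) all[m] = 1
            for (m in all) {
                b = (m in bot)  ? bot[m]  : 0
                e = (m in exch) ? exch[m] : 0
                if (b + 0 != e + 0) print m " bot=" b " exchange=" e
            }
        }
    ' "$1" "$2" | LC_ALL=C sort
}
```

Empty output means the two views agree. Any printed line is a market to inspect before restarting the bot to force a resync.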
Data Feed Disconnection
Symptoms: Stale data warnings, no price updates
Actions:
- Check Binance/Polymarket API status
- Verify network connectivity
- Check for IP rate limiting
Recovery: Bot automatically reconnects. If persistent, restart.
High Slippage / Bad Fills
Symptoms: Fills at worse prices than expected
Actions:
- Check market liquidity in TUI
- Review slippage protection settings
- Consider reducing position sizes
Prevention:
- Increase minLiquidityUSDC
- Decrease maxSlippagePercent
- Use a more conservative edgeThreshold
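To judge whether a fill was actually bad, it helps to put a number on the slippage. The sketch below computes it as the percentage deviation of the fill price from the expected price. That definition is an assumption; check how Hydra itself interprets maxSlippagePercent before comparing against it. Both function names are hypothetical.

```shell
#!/bin/sh
# slippage_pct EXPECTED_PRICE FILL_PRICE
# Prints the deviation of the fill from the expected price as a percentage.
# Positive values mean a buy filled at a higher price than expected.
slippage_pct() {
    awk -v expected="$1" -v fill="$2" 'BEGIN {
        printf "%.2f\n", (fill - expected) / expected * 100
    }'
}

# exceeds_slippage EXPECTED_PRICE FILL_PRICE LIMIT_PERCENT
# Succeeds when the absolute slippage is above LIMIT_PERCENT.
exceeds_slippage() {
    awk -v expected="$1" -v fill="$2" -v limit="$3" 'BEGIN {
        s = (fill - expected) / expected * 100
        if (s < 0) s = -s
        exit !(s > limit)
    }'
}
```

For an expected price of 0.50 and a fill at 0.51, `slippage_pct 0.50 0.51` prints `2.00`, so a 1% limit would be exceeded while a 3% limit would not.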
Maintenance Tasks
Daily
- Review overnight PnL
- Check for warning/error logs
- Verify positions match exchange
- Monitor fill rates
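For the daily log review, a per-type event count gives a quick overview before reading individual entries. The sketch below relies on the same compact `"type":"..."` field that the grep patterns in Log Monitoring match; `count_event_types` is a hypothetical helper, and it avoids `jq` so it runs on a bare host.

```shell
#!/bin/sh
# count_event_types
# Reads JSON Lines on stdin and prints "Type=count" for each event type,
# sorted by type name. Lines without a "type" field are ignored.
count_event_types() {
    grep -o '"type":"[^"]*"' \
        | sed 's/^"type":"//; s/"$//' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ print $2 "=" $1 }'
}
```

A possible invocation is `cat runs/*/events.jsonl | count_event_types`. A sudden rise in `Error` or any `KillSwitch` entry is a cue to open the corresponding incident procedure.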
Weekly
- Archive old session logs
- Review market discovery scores
- Check for Polymarket API changes
- Update dependencies if needed
Monthly
- Full strategy performance review
- Adjust risk parameters based on data
- Test disaster recovery procedures
- Rotate API credentials if policy requires
Disaster Recovery
Complete Data Loss
- Restore config from backup/version control
- Regenerate API keys if needed
- Start in paper mode to verify
- Switch to live after validation
Corrupted State
    # Stop bot
    docker compose stop hydra

    # Clear state (positions will resync from exchange)
    # Archive runs/ first if the session data is still needed
    rm -rf runs/*

    # Restart
    docker compose up -d hydra

Network Outage Recovery
The bot handles reconnection automatically. After an extended outage:
- Check all positions reconciled correctly
- Review any orders that may have filled during outage
- Verify reference prices are fresh before trading resumes
Scaling Considerations
Single Instance Limits
- ~30 orders/minute (Polymarket rate limit)
- ~15 markets simultaneously (recommended max)
- ~1GB memory typical usage
Multi-Instance
Not currently supported. Running multiple instances will cause:
- Order conflicts
- Position tracking issues
- Rate limit exhaustion
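Because a second instance causes these problems, a start script can refuse to launch while another one holds a lock. The sketch below uses a lock directory, since `mkdir` is atomic. It is not a Hydra feature: the helper names and the lock path are hypothetical, and the guard only protects a single host.

```shell
#!/bin/sh
# acquire_lock LOCK_DIR
# mkdir is atomic: it succeeds for exactly one caller and fails if the
# directory already exists, so it can serve as a simple mutex.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

# release_lock LOCK_DIR
release_lock() {
    rmdir "$1" 2>/dev/null
}

# Example wrapper (commented out so that sourcing this file has no side effects):
#   acquire_lock /tmp/hydra.lock || { echo "hydra already running" >&2; exit 1; }
#   trap 'release_lock /tmp/hydra.lock' EXIT
#   hydra run
```

After a forced stop with `kill -9` the EXIT trap does not run, so the lock directory stays behind and must be removed by hand once you have confirmed that no instance is running.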
Support Escalation
If issues persist after following runbooks:
- Collect logs: tar -czf debug.tar.gz runs/
- Note exact error messages
- Document steps taken
- Open GitHub issue with details
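Log collection can be wrapped so that the archive gets a timestamped name and is verified before you attach it to an issue. This is a sketch with a hypothetical helper name, `collect_debug_bundle`. Review the archive contents for secrets before sharing it.

```shell
#!/bin/sh
# collect_debug_bundle RUNS_DIR OUT_DIR
# Writes a timestamped archive of RUNS_DIR into OUT_DIR and prints its path.
# Fails if the archive cannot be created or cannot be read back.
collect_debug_bundle() {
    runs_dir=$1
    out_dir=$2
    archive="$out_dir/debug-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps paths inside the archive relative (runs/...), not absolute.
    tar -czf "$archive" -C "$(dirname "$runs_dir")" "$(basename "$runs_dir")" || return 1
    # Listing the archive confirms it is a readable gzip-compressed tarball.
    tar -tzf "$archive" >/dev/null || return 1
    echo "$archive"
}
```

Keeping OUT_DIR outside RUNS_DIR avoids the archive trying to include itself.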