# Monitoring & Telemetry
Hydra is designed for high-frequency trading (HFT), where visibility into sub-millisecond latencies and system health is critical. The monitoring system uses a push-based architecture to minimize performance impact and ensure deterministic execution.
## Overview

Unlike traditional “pull-based” systems like Prometheus, Hydra uses a push-based telemetry model.
### Why Push-Based?

In HFT scenarios, the act of a monitoring system scraping metrics from the bot can cause CPU jitter. These small, unpredictable spikes in CPU usage can delay order execution exactly when market volatility is highest. By pushing metrics over non-blocking UDP (or HTTP), Hydra ensures:
- Zero Scrape Overhead: The trading engine is never interrupted for metric collection.
- Sub-millisecond Precision: High-resolution timestamps are captured at the point of origin.
- Near-Real-Time Visibility: Metrics are flushed at high frequencies (default 100ms) for live dashboards.
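The push path can be sketched in a few lines of Node.js. The snippet below is illustrative only (the `toLineProtocol` helper and socket setup are assumptions, not Hydra's actual `TelemetryService` API): format a point as InfluxDB Line Protocol and fire it over UDP without blocking the hot path.

```typescript
import * as dgram from "node:dgram";

// Hypothetical helper: serialize one metric point as InfluxDB Line Protocol
// (measurement,tag=value field=value timestamp).
function toLineProtocol(
  measurement: string,
  tags: Record<string, string>,
  fields: Record<string, number>,
  tsNs: bigint,
): string {
  const tagStr = Object.entries(tags).map(([k, v]) => `,${k}=${v}`).join("");
  const fieldStr = Object.entries(fields).map(([k, v]) => `${k}=${v}`).join(",");
  return `${measurement}${tagStr} ${fieldStr} ${tsNs}`;
}

const socket = dgram.createSocket("udp4");
const line = toLineProtocol(
  "hydra_event_loop", {}, { lag_ms: 0.42 },
  process.hrtime.bigint(),
);
// Fire-and-forget: errors surface via the callback, never as a blocking throw.
socket.send(line, 8094, "127.0.0.1", () => socket.close());
```

Because UDP is connectionless, a slow or absent collector never back-pressures the trading engine; the trade-off is that dropped packets are silently lost.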
## Architecture

The telemetry pipeline follows the flow below:

Hydra → Telegraf → InfluxDB → Grafana

- Hydra: Emits metrics in InfluxDB Line Protocol format via `TelemetryService`.
- Telegraf: Acts as a lightweight collector/aggregator. It listens for UDP or HTTP packets from one or more Hydra instances.
- InfluxDB: A time-series database that stores the metrics.
- Grafana: Provides a visual dashboard for real-time monitoring and historical analysis.
## Configuration

Telemetry is configured in the `telemetry` section of `config.yaml`.
```yaml
telemetry:
  # Enable or disable telemetry streaming
  enabled: true

  # Host and port for Telegraf or InfluxDB
  host: "127.0.0.1"
  port: 8094

  # Protocol to use: "udp" (recommended for low latency) or "http"
  protocol: "udp"

  # InfluxDB database/bucket name
  database: "hydra"

  # Prefix added to all measurement names
  prefix: "hydra"

  # Number of metrics to batch before sending
  batchSize: 100

  # How often to flush the buffer (ms)
  flushIntervalMs: 100

  # Tags added to every metric emitted by this instance
  globalTags:
    env: "production"
    instance: "hydra-01"
```

## Metrics Reference

All measurement names are prefixed with the `telemetry.prefix` (default: `hydra_`).
### Market & Trading Metrics

| Measurement | Tags | Fields | Description |
|---|---|---|---|
| `orderbook` | `market` | `up_bid`, `up_ask`, `up_spread`, `down_bid`, `down_ask`, `down_spread` | Current state of Polymarket orderbooks. |
| `reference_price` | `symbol`, `source` | `price`, `network_latency_ms` | Reference prices (e.g., Binance) and network arrival latency. |
| `signal` | `market`, `strategy` | `expected_edge`, `expected_profit_rate` | Strategy-generated trading signals and their projected edge. |
| `order` | `market`, `side`, `token` | `price`, `size` | Details of orders placed by the bot. |
| `fill` | `market` | `price`, `size`, `fee` | Confirmed trade fills and associated costs. |
| `position` | `market` | `up_size`, `down_size`, `net` | Current exposure in specific markets. |
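As a concrete example, an `orderbook` point from the table above looks like this on the wire (tag values, field values, and the timestamp are illustrative; the `hydra_` prefix comes from `telemetry.prefix`):

```
hydra_orderbook,market=btc-updown-1h up_bid=0.52,up_ask=0.54,up_spread=0.02,down_bid=0.46,down_ask=0.48,down_spread=0.02 1717171717000000000
```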
### System Health Metrics

| Measurement | Tags | Fields | Description |
|---|---|---|---|
| `event_loop` | - | `lag_ms`, `elapsed_ms` | Node.js event loop drift/lag. |
| `risk_trip` | `breaker` | `count` | Occurrences of safety breaker trips (e.g., `STALENESS`). |
| `kill_switch` | - | `triggered` | Status of the global kill switch. |
| `session` | `mode`, `run_id` | `started`, `stopped` | Lifecycle events for the bot session. |
## Event Loop Monitoring

Hydra’s EventLoopMonitor detects when the JavaScript event loop is blocked. This is critical for HFT because any delay in the event loop directly translates to order execution latency.

- How it works: It schedules a timer every 10ms. If the timer fires significantly later than scheduled, the difference is recorded as `lag_ms`.
- Why it matters: High lag usually indicates heavy garbage collection (GC) pauses or accidental execution of synchronous, blocking code.
- Alerting: The bot logs a warning if lag exceeds 50ms.
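The technique above can be sketched as follows. Names and thresholds mirror the description, but this is an illustrative sketch, not Hydra's actual EventLoopMonitor implementation:

```typescript
// Schedule a timer every 10ms; any extra delay beyond the interval is
// reported as event-loop lag.
const INTERVAL_MS = 10;
const WARN_THRESHOLD_MS = 50;

function startEventLoopMonitor(
  onLag: (lagMs: number) => void,
): ReturnType<typeof setInterval> {
  let last = performance.now();
  return setInterval(() => {
    const now = performance.now();
    // If the loop was blocked, the tick fires late; the overshoot is the lag.
    const lagMs = Math.max(0, now - last - INTERVAL_MS);
    last = now;
    onLag(lagMs);
    if (lagMs > WARN_THRESHOLD_MS) {
      console.warn(`event loop lag ${lagMs.toFixed(1)}ms exceeds ${WARN_THRESHOLD_MS}ms`);
    }
  }, INTERVAL_MS);
}

// Demo: push each sample to telemetry, stop after 100ms.
const handle = startEventLoopMonitor((lagMs) => { /* emit lag_ms metric here */ });
setTimeout(() => clearInterval(handle), 100);
```

Note that `setInterval` itself runs on the event loop, so the monitor can only observe lag after the blockage ends; it cannot preempt blocking code.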
## Latency Tracking

For sub-millisecond precision, Hydra uses high-resolution timestamps (`hrTsMs`) captured via `performance.now()`.

### LatencyTimingInfo

Critical events (like price updates) carry a `LatencyTimingInfo` object:

- `exchangeTsMs`: The timestamp when the event was generated by the exchange.
- `arrivalTsMs`: The high-resolution timestamp when Hydra received the packet.
- `sendTsMs`: The high-resolution timestamp when Hydra sent a resulting order.
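The three timestamps decompose end-to-end latency into a network leg and a processing leg. The shape below follows the field names in the docs, but the `latencyBreakdown` helper is an assumption for illustration, not Hydra's API:

```typescript
interface LatencyTimingInfo {
  exchangeTsMs: number; // set by the exchange
  arrivalTsMs: number;  // captured at packet receipt
  sendTsMs: number;     // captured at order send
}

// Assumes all three timestamps have been normalized to a common epoch
// (e.g. performance.timeOrigin + performance.now() for the local ones).
function latencyBreakdown(t: LatencyTimingInfo) {
  return {
    networkMs: t.arrivalTsMs - t.exchangeTsMs,  // wire time (plus any clock skew)
    processingMs: t.sendTsMs - t.arrivalTsMs,   // Hydra's own decision time
  };
}

const sample: LatencyTimingInfo = { exchangeTsMs: 1000.0, arrivalTsMs: 1004.5, sendTsMs: 1005.2 };
console.log(latencyBreakdown(sample));
```

Only `processingMs` is fully under the bot's control; `networkMs` also absorbs clock skew between the exchange and the local machine, which is why it is tracked as a trend rather than an absolute truth.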
### Percentile Calculation

The LatencyTracker maintains a circular buffer of samples to provide real-time statistics:
- P50: Median latency.
- P90 / P99 / P99.9: Tail latencies, crucial for identifying outliers that could lead to “getting picked off” by faster competitors.
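A circular buffer with percentile readout can be sketched as below (illustrative only; Hydra's actual LatencyTracker may differ in buffer size, overflow policy, and percentile method):

```typescript
class LatencyTracker {
  private samples: number[];
  private idx = 0;
  private filled = 0;

  constructor(private capacity: number = 1024) {
    this.samples = new Array<number>(capacity);
  }

  // O(1) insert: overwrite the oldest sample once the buffer is full.
  record(latencyMs: number): void {
    this.samples[this.idx] = latencyMs;
    this.idx = (this.idx + 1) % this.capacity;
    this.filled = Math.min(this.filled + 1, this.capacity);
  }

  // Nearest-rank percentile over the current window (p in [0, 100]).
  percentile(p: number): number {
    const sorted = this.samples.slice(0, this.filled).sort((a, b) => a - b);
    if (sorted.length === 0) return NaN;
    const rank = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[Math.max(0, rank)];
  }
}

const tracker = new LatencyTracker(8);
[1, 2, 3, 4, 5, 6, 7, 100].forEach((s) => tracker.record(s));
console.log(tracker.percentile(50), tracker.percentile(99)); // → 4 100
```

The single 100ms outlier barely moves the median but dominates P99, which is exactly why tail percentiles, not averages, are watched in HFT.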
## Setup Guide (Docker Compose)

The following `docker-compose.yaml` provides a complete monitoring stack.

```yaml
version: '3.8'

services:
  influxdb:
    image: influxdb:1.8
    container_name: influxdb
    ports:
      - "8086:8086"
    environment:
      - INFLUXDB_DB=hydra
    volumes:
      - influxdb_data:/var/lib/influxdb

  telegraf:
    image: telegraf:latest
    container_name: telegraf
    ports:
      - "8094:8094/udp"
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
    depends_on:
      - influxdb

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    depends_on:
      - influxdb
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  influxdb_data:
  grafana_data:
```

### Basic telegraf.conf
Section titled “Basic telegraf.conf”[[outputs.influxdb]] urls = ["http://influxdb:8086"] database = "hydra"
[[inputs.udp_listener]] service_address = ":8094" data_format = "influx"Grafana Dashboards
Section titled “Grafana Dashboards”Example Queries (Flux)
Section titled “Example Queries (Flux)”P99 Network Latency (ms)
from(bucket: "hydra") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r["_measurement"] == "hydra_reference_price") |> filter(fn: (r) => r["_field"] == "network_latency_ms") |> aggregateWindow(every: v.windowPeriod, column: "_value", fn: (column, tables=<-) => tables |> quantile(q: 0.99))Average Event Loop Lag
from(bucket: "hydra") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r["_measurement"] == "hydra_event_loop") |> filter(fn: (r) => r["_field"] == "lag_ms") |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)Cumulative Profit (from Fills)
from(bucket: "hydra") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r["_measurement"] == "hydra_fill") |> filter(fn: (r) => r["_field"] == "size" or r["_field"] == "price") // Note: Simplified logic, actual PnL requires market-aware calculation |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value") |> map(fn: (r) => ({ r with _value: r.size * r.price })) |> cumulativeSum()