Monitoring & Telemetry

Hydra is designed for high-frequency trading (HFT), where visibility into sub-millisecond latencies and system health is critical. The monitoring system uses a push-based architecture to minimize performance impact and ensure deterministic execution.

Unlike traditional “pull-based” systems like Prometheus, Hydra uses a push-based telemetry model.

In HFT scenarios, the act of a monitoring system scraping metrics from the bot can cause CPU jitter. These small, unpredictable spikes in CPU usage can delay order execution exactly when market volatility is highest. By pushing metrics over non-blocking UDP (or HTTP), Hydra ensures:

  • Zero Scrape Overhead: The trading engine is never interrupted for metric collection.
  • Sub-millisecond Precision: High-resolution timestamps are captured at the point of origin.
  • Near-Real-Time Visibility: Metrics are flushed at high frequencies (default 100ms) for live dashboards.
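The fire-and-forget UDP push described above can be sketched in a few lines of Node.js. The class name and API here are illustrative, not Hydra's actual TelemetryService:

```typescript
import dgram from "node:dgram";

// Minimal sketch of a push-based UDP metrics emitter. A UDP send never
// blocks the event loop; a dropped packet costs one data point, not a
// stalled order.
class UdpMetricsEmitter {
  private socket = dgram.createSocket("udp4");

  constructor(private host: string, private port: number) {}

  // Fire-and-forget: errors are deliberately swallowed on the hot path.
  push(line: string): void {
    this.socket.send(Buffer.from(line + "\n"), this.port, this.host, () => {
      /* intentionally ignore send errors */
    });
  }

  close(): void {
    this.socket.close();
  }
}
```

Because the send callback discards errors, a misconfigured host degrades silently to "no metrics" rather than impacting trading latency.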

The telemetry pipeline follows the flow below:

Hydra → Telegraf → InfluxDB → Grafana

  1. Hydra: Emits metrics in InfluxDB Line Protocol format via TelemetryService.
  2. Telegraf: Acts as a lightweight collector/aggregator. It listens for UDP or HTTP packets from one or more Hydra instances.
  3. InfluxDB: A time-series database that stores the metrics.
  4. Grafana: Provides a visual dashboard for real-time monitoring and historical analysis.
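As a concrete illustration of step 1, the sketch below serializes a metric into InfluxDB Line Protocol. The format is standard; the helper names are assumptions, and Hydra's own serializer may differ:

```typescript
// InfluxDB Line Protocol: measurement,tag=value field=value [timestamp]
type Tags = Record<string, string>;
type Fields = Record<string, number | boolean | string>;

// Commas, equals signs, and spaces must be backslash-escaped in
// measurement names, tag keys/values, and field keys.
const escapeKey = (s: string) => s.replace(/([,= ])/g, "\\$1");

function toLineProtocol(measurement: string, tags: Tags, fields: Fields): string {
  const tagPart = Object.entries(tags)
    .map(([k, v]) => `,${escapeKey(k)}=${escapeKey(v)}`)
    .join("");
  const fieldPart = Object.entries(fields)
    .map(([k, v]) => {
      // String field values are double-quoted; numbers and booleans are bare.
      const val = typeof v === "string" ? `"${v.replace(/"/g, '\\"')}"` : String(v);
      return `${escapeKey(k)}=${val}`;
    })
    .join(",");
  return `${measurement}${tagPart} ${fieldPart}`;
}
```

For example, `toLineProtocol("hydra_orderbook", { market: "m1" }, { up_bid: 0.52, up_ask: 0.54 })` yields `hydra_orderbook,market=m1 up_bid=0.52,up_ask=0.54`.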

Telemetry is configured in the telemetry section of config.yaml.

telemetry:
  # Enable or disable telemetry streaming
  enabled: true
  # Host and port for Telegraf or InfluxDB
  host: "127.0.0.1"
  port: 8094
  # Protocol to use: "udp" (recommended for low latency) or "http"
  protocol: "udp"
  # InfluxDB database/bucket name
  database: "hydra"
  # Prefix added to all measurement names
  prefix: "hydra"
  # Number of metrics to batch before sending
  batchSize: 100
  # How often to flush the buffer (ms)
  flushIntervalMs: 100
  # Tags added to every metric emitted by this instance
  globalTags:
    env: "production"
    instance: "hydra-01"

All measurement names carry the telemetry.prefix followed by an underscore (default: hydra_).

| Measurement | Tags | Fields | Description |
| --- | --- | --- | --- |
| orderbook | market | up_bid, up_ask, up_spread, down_bid, down_ask, down_spread | Current state of Polymarket orderbooks. |
| reference_price | symbol, source | price, network_latency_ms | Reference prices (e.g., Binance) and network arrival latency. |
| signal | market, strategy | expected_edge, expected_profit_rate | Strategy-generated trading signals and their projected edge. |
| order | market, side, token | price, size | Details of orders placed by the bot. |
| fill | market | price, size, fee | Confirmed trade fills and associated costs. |
| position | market | up_size, down_size, net | Current exposure in specific markets. |
| Measurement | Tags | Fields | Description |
| --- | --- | --- | --- |
| event_loop | - | lag_ms, elapsed_ms | Node.js event loop drift/lag. |
| risk_trip | breaker | count | Occurrences of safety breaker trips (e.g., STALENESS). |
| kill_switch | - | triggered | Status of the global kill switch. |
| session | mode, run_id | started, stopped | Lifecycle events for the bot session. |

Hydra’s EventLoopMonitor detects when the JavaScript event loop is blocked. This is critical for HFT because any delay in the event loop directly translates to order execution latency.

  • How it works: It schedules a timer every 10ms. If the timer executes significantly later than 10ms, the difference is recorded as lag_ms.
  • Why it matters: High lag usually indicates heavy garbage collection (GC) pauses or accidental execution of synchronous, blocking code.
  • Alerting: The bot logs a warning if lag exceeds 50ms.
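The timer-drift technique above can be sketched as follows. Only the 10ms interval and 50ms warning threshold come from this document; the function names and structure are assumptions, not EventLoopMonitor's actual internals:

```typescript
import { performance } from "node:perf_hooks";

// Lag is how much later than the requested interval the timer actually fired.
function computeLagMs(prevTickMs: number, nowMs: number, intervalMs: number): number {
  return Math.max(0, nowMs - prevTickMs - intervalMs);
}

function startLagMonitor(intervalMs = 10, warnThresholdMs = 50) {
  let last = performance.now();
  const timer = setInterval(() => {
    const now = performance.now();
    const lag = computeLagMs(last, now, intervalMs);
    // High lag typically means a GC pause or synchronous blocking code.
    if (lag > warnThresholdMs) {
      console.warn(`event loop lag ${lag.toFixed(1)}ms`);
    }
    last = now;
  }, intervalMs);
  timer.unref(); // monitoring must not keep the process alive
  return timer;
}
```

Node.js also ships a built-in `perf_hooks.monitorEventLoopDelay()` histogram, which measures the same phenomenon with lower overhead.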

For sub-millisecond precision, Hydra uses high-resolution timestamps (hrTsMs) captured via performance.now().

Critical events (like price updates) carry a LatencyTimingInfo object:

  • exchangeTsMs: The timestamp when the event was generated by the exchange.
  • arrivalTsMs: The high-resolution timestamp when Hydra received the packet.
  • sendTsMs: The high-resolution timestamp when Hydra sent a resulting order.
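Given those three timestamps, the two latencies of interest fall out as simple differences. The interface fields follow the text above; the helper functions are illustrative, and the network figure assumes both clocks are expressed on a comparable epoch-millisecond basis (e.g. `performance.timeOrigin + performance.now()`), since clock skew between exchange and bot shows up directly in it:

```typescript
interface LatencyTimingInfo {
  exchangeTsMs: number; // exchange-side event timestamp
  arrivalTsMs: number;  // local high-resolution arrival time
  sendTsMs?: number;    // set once a resulting order was sent
}

// Wire + queuing time from exchange to Hydra. Sensitive to clock skew.
function networkLatencyMs(t: LatencyTimingInfo): number {
  return t.arrivalTsMs - t.exchangeTsMs;
}

// Hydra's own decision + dispatch time. Immune to clock skew, because
// both timestamps come from the same local high-resolution clock.
function internalLatencyMs(t: LatencyTimingInfo): number | undefined {
  return t.sendTsMs === undefined ? undefined : t.sendTsMs - t.arrivalTsMs;
}
```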

The LatencyTracker maintains a circular buffer of samples to provide real-time statistics:

  • P50: Median latency.
  • P90 / P99 / P99.9: Tail latencies, crucial for identifying outliers that could lead to “getting picked off” by faster competitors.
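A minimal sketch of such a circular buffer with nearest-rank percentile readout (illustrative; the real LatencyTracker may differ):

```typescript
// Fixed-capacity ring of latency samples: O(1) per record, bounded memory.
class PercentileBuffer {
  private buf: number[];
  private idx = 0;
  private count = 0;

  constructor(private capacity: number) {
    this.buf = new Array(capacity);
  }

  record(sampleMs: number): void {
    // Once full, the oldest sample is overwritten.
    this.buf[this.idx] = sampleMs;
    this.idx = (this.idx + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Nearest-rank percentile; q in [0, 1], e.g. 0.5 for P50, 0.999 for P99.9.
  // Sorting a copy keeps the ring itself untouched.
  percentile(q: number): number | undefined {
    if (this.count === 0) return undefined;
    const sorted = this.buf.slice(0, this.count).sort((a, b) => a - b);
    const rank = Math.min(sorted.length - 1, Math.ceil(q * sorted.length) - 1);
    return sorted[Math.max(0, rank)];
  }
}
```

The sort-on-read cost only matters if percentiles are queried far more often than samples arrive; for a dashboard refreshed every few hundred milliseconds it is negligible.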

The following docker-compose.yaml provides a complete monitoring stack.

version: '3.8'
services:
  influxdb:
    image: influxdb:1.8
    container_name: influxdb
    ports:
      - "8086:8086"
    environment:
      - INFLUXDB_DB=hydra
    volumes:
      - influxdb_data:/var/lib/influxdb

  telegraf:
    image: telegraf:latest
    container_name: telegraf
    ports:
      - "8094:8094/udp"
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
    depends_on:
      - influxdb

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    depends_on:
      - influxdb
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  influxdb_data:
  grafana_data:
The accompanying telegraf.conf routes incoming UDP line-protocol packets into InfluxDB:

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "hydra"

# socket_listener replaces the deprecated udp_listener plugin in current Telegraf
[[inputs.socket_listener]]
  service_address = "udp://:8094"
  data_format = "influx"

P99 Network Latency (ms)

from(bucket: "hydra")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "hydra_reference_price")
|> filter(fn: (r) => r["_field"] == "network_latency_ms")
|> aggregateWindow(every: v.windowPeriod, column: "_value", fn: (column, tables=<-) => tables |> quantile(q: 0.99))

Average Event Loop Lag

from(bucket: "hydra")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "hydra_event_loop")
|> filter(fn: (r) => r["_field"] == "lag_ms")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)

Cumulative Profit (from Fills)

from(bucket: "hydra")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "hydra_fill")
|> filter(fn: (r) => r["_field"] == "size" or r["_field"] == "price")
// Note: Simplified logic, actual PnL requires market-aware calculation
|> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")
|> map(fn: (r) => ({ r with _value: r.size * r.price }))
|> cumulativeSum()