# Monitoring & Telemetry
Hydra is designed for high-frequency trading (HFT), where visibility into sub-millisecond latencies and system health is critical. The monitoring system uses a push-based architecture to minimize performance impact and ensure deterministic execution.
## Overview

Unlike traditional “pull-based” systems like Prometheus, Hydra uses a push-based telemetry model.
### Why Push-Based?

In HFT scenarios, the act of a monitoring system scraping metrics from the bot can cause CPU jitter. These small, unpredictable spikes in CPU usage can delay order execution exactly when market volatility is highest. By pushing metrics over non-blocking UDP (or HTTP), Hydra ensures:
- Zero Scrape Overhead: The trading engine is never interrupted for metric collection.
- Sub-millisecond Precision: High-resolution timestamps are captured at the point of origin.
- Near-Real-Time Visibility: Metrics are flushed at high frequencies (default 100ms) for live dashboards.
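The push path can be sketched in a few lines of Node.js. The snippet below is illustrative only (the `toLineProtocol` helper and socket setup are assumptions, not Hydra's actual `TelemetryService` API): format a point as InfluxDB Line Protocol and fire it over UDP without blocking the hot path.

```typescript
import * as dgram from "node:dgram";

// Hypothetical helper: serialize one metric point as InfluxDB Line Protocol
// (measurement,tag=value field=value timestamp).
function toLineProtocol(
  measurement: string,
  tags: Record<string, string>,
  fields: Record<string, number>,
  tsNs: bigint,
): string {
  const tagStr = Object.entries(tags).map(([k, v]) => `,${k}=${v}`).join("");
  const fieldStr = Object.entries(fields).map(([k, v]) => `${k}=${v}`).join(",");
  return `${measurement}${tagStr} ${fieldStr} ${tsNs}`;
}

const socket = dgram.createSocket("udp4");
const line = toLineProtocol(
  "hydra_event_loop", {}, { lag_ms: 0.42 },
  process.hrtime.bigint(),
);
// Fire-and-forget: errors surface via the callback, never as a blocking throw.
socket.send(line, 8094, "127.0.0.1", () => socket.close());
```

Because UDP is connectionless, a slow or absent collector never back-pressures the trading engine; the trade-off is that dropped packets are silently lost.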
## Architecture

The telemetry pipeline follows the flow below:

Hydra → Telegraf → InfluxDB → Grafana

- Hydra: Emits metrics in InfluxDB Line Protocol format via `TelemetryService`.
- Telegraf: Acts as a lightweight collector/aggregator. It listens for UDP or HTTP packets from one or more Hydra instances.
- InfluxDB: A time-series database that stores the metrics.
- Grafana: Provides a visual dashboard for real-time monitoring and historical analysis.
## Configuration

Telemetry is configured in the `telemetry` section of `config.yaml`.
```yaml
telemetry:
  # Enable or disable telemetry streaming
  enabled: true

  # Host and port for Telegraf or InfluxDB
  host: "127.0.0.1"
  port: 8094

  # Protocol to use: "udp" (recommended for low latency) or "http"
  protocol: "udp"

  # InfluxDB database/bucket name
  database: "hydra"

  # Prefix added to all measurement names
  prefix: "hydra"

  # Number of metrics to batch before sending
  batchSize: 100

  # How often to flush the buffer (ms)
  flushIntervalMs: 100

  # Tags added to every metric emitted by this instance
  globalTags:
    env: "production"
    instance: "hydra-01"
```

## Metrics Reference

All measurement names are prefixed with the `telemetry.prefix` (default: `hydra_`).
### Market & Trading Metrics

| Measurement | Tags | Fields | Description |
|---|---|---|---|
| `orderbook` | `market` | `up_bid`, `up_ask`, `up_spread`, `down_bid`, `down_ask`, `down_spread` | Current state of Polymarket orderbooks. |
| `reference_price` | `symbol`, `source` | `price`, `network_latency_ms` | Reference prices (e.g., Binance) and network arrival latency. |
| `signal` | `market`, `strategy` | `expected_edge`, `expected_profit_rate` | Strategy-generated trading signals and their projected edge. |
| `order` | `market`, `side`, `token` | `price`, `size` | Details of orders placed by the bot. |
| `fill` | `market` | `price`, `size`, `fee` | Confirmed trade fills and associated costs. |
| `position` | `market` | `up_size`, `down_size`, `net` | Current exposure in specific markets. |
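As a concrete example, an `orderbook` point from the table above looks like this on the wire (tag values, field values, and the timestamp are illustrative; the `hydra_` prefix comes from `telemetry.prefix`):

```
hydra_orderbook,market=btc-updown-1h up_bid=0.52,up_ask=0.54,up_spread=0.02,down_bid=0.46,down_ask=0.48,down_spread=0.02 1717171717000000000
```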
### System Health Metrics

| Measurement | Tags | Fields | Description |
|---|---|---|---|
| `event_loop` | - | `lag_ms`, `elapsed_ms` | Node.js event loop drift/lag. |
| `risk_trip` | `breaker` | `count` | Occurrences of safety breaker trips (e.g., `STALENESS`). |
| `kill_switch` | - | `triggered` | Status of the global kill switch. |
| `session` | `mode`, `run_id` | `started`, `stopped` | Lifecycle events for the bot session. |
## Event Loop Monitoring

Hydra’s EventLoopMonitor detects when the JavaScript event loop is blocked. This is critical for HFT because any delay in the event loop directly translates to order execution latency.

- How it works: It schedules a timer every 10ms. If the timer fires significantly later than scheduled, the difference is recorded as `lag_ms`.
- Why it matters: High lag usually indicates heavy garbage collection (GC) pauses or accidental execution of synchronous, blocking code.
- Alerting: The bot logs a warning if lag exceeds 50ms.
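The technique above can be sketched as follows. Names and thresholds mirror the description, but this is an illustrative sketch, not Hydra's actual EventLoopMonitor implementation:

```typescript
// Schedule a timer every 10ms; any extra delay beyond the interval is
// reported as event-loop lag.
const INTERVAL_MS = 10;
const WARN_THRESHOLD_MS = 50;

function startEventLoopMonitor(
  onLag: (lagMs: number) => void,
): ReturnType<typeof setInterval> {
  let last = performance.now();
  return setInterval(() => {
    const now = performance.now();
    // If the loop was blocked, the tick fires late; the overshoot is the lag.
    const lagMs = Math.max(0, now - last - INTERVAL_MS);
    last = now;
    onLag(lagMs);
    if (lagMs > WARN_THRESHOLD_MS) {
      console.warn(`event loop lag ${lagMs.toFixed(1)}ms exceeds ${WARN_THRESHOLD_MS}ms`);
    }
  }, INTERVAL_MS);
}

// Demo: push each sample to telemetry, stop after 100ms.
const handle = startEventLoopMonitor((lagMs) => { /* emit lag_ms metric here */ });
setTimeout(() => clearInterval(handle), 100);
```

Note that `setInterval` itself runs on the event loop, so the monitor can only observe lag after the blockage ends; it cannot preempt blocking code.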
## Latency Tracking

For sub-millisecond precision, Hydra uses high-resolution timestamps (`hrTsMs`) captured via `performance.now()`.

### LatencyTimingInfo

Critical events (like price updates) carry a `LatencyTimingInfo` object:

- `exchangeTsMs`: The timestamp when the event was generated by the exchange.
- `arrivalTsMs`: The high-resolution timestamp when Hydra received the packet.
- `sendTsMs`: The high-resolution timestamp when Hydra sent a resulting order.
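The three timestamps decompose end-to-end latency into a network leg and a processing leg. The shape below follows the field names in the docs, but the `latencyBreakdown` helper is an assumption for illustration, not Hydra's API:

```typescript
interface LatencyTimingInfo {
  exchangeTsMs: number; // set by the exchange
  arrivalTsMs: number;  // captured at packet receipt
  sendTsMs: number;     // captured at order send
}

// Assumes all three timestamps have been normalized to a common epoch
// (e.g. performance.timeOrigin + performance.now() for the local ones).
function latencyBreakdown(t: LatencyTimingInfo) {
  return {
    networkMs: t.arrivalTsMs - t.exchangeTsMs,  // wire time (plus any clock skew)
    processingMs: t.sendTsMs - t.arrivalTsMs,   // Hydra's own decision time
  };
}

const sample: LatencyTimingInfo = { exchangeTsMs: 1000.0, arrivalTsMs: 1004.5, sendTsMs: 1005.2 };
console.log(latencyBreakdown(sample));
```

Only `processingMs` is fully under the bot's control; `networkMs` also absorbs clock skew between the exchange and the local machine, which is why it is tracked as a trend rather than an absolute truth.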
### Percentile Calculation

The LatencyTracker maintains a circular buffer of samples to provide real-time statistics:
- P50: Median latency.
- P90 / P99 / P99.9: Tail latencies, crucial for identifying outliers that could lead to “getting picked off” by faster competitors.
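A circular buffer with percentile readout can be sketched as below (illustrative only; Hydra's actual LatencyTracker may differ in buffer size, overflow policy, and percentile method):

```typescript
class LatencyTracker {
  private samples: number[];
  private idx = 0;
  private filled = 0;

  constructor(private capacity: number = 1024) {
    this.samples = new Array<number>(capacity);
  }

  // O(1) insert: overwrite the oldest sample once the buffer is full.
  record(latencyMs: number): void {
    this.samples[this.idx] = latencyMs;
    this.idx = (this.idx + 1) % this.capacity;
    this.filled = Math.min(this.filled + 1, this.capacity);
  }

  // Nearest-rank percentile over the current window (p in [0, 100]).
  percentile(p: number): number {
    const sorted = this.samples.slice(0, this.filled).sort((a, b) => a - b);
    if (sorted.length === 0) return NaN;
    const rank = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[Math.max(0, rank)];
  }
}

const tracker = new LatencyTracker(8);
[1, 2, 3, 4, 5, 6, 7, 100].forEach((s) => tracker.record(s));
console.log(tracker.percentile(50), tracker.percentile(99)); // → 4 100
```

The single 100ms outlier barely moves the median but dominates P99, which is exactly why tail percentiles, not averages, are watched in HFT.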
## Setup Guide (Docker Compose)

The following `docker-compose.yaml` provides a complete monitoring stack.

```yaml
version: '3.8'

services:
  influxdb:
    image: influxdb:1.8
    container_name: influxdb
    ports:
      - "8086:8086"
    environment:
      - INFLUXDB_DB=hydra
    volumes:
      - influxdb_data:/var/lib/influxdb

  telegraf:
    image: telegraf:latest
    container_name: telegraf
    ports:
      - "8094:8094/udp"
    volumes:
      - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
    depends_on:
      - influxdb

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    depends_on:
      - influxdb
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  influxdb_data:
  grafana_data:
```

### Basic telegraf.conf
Section titled “Basic telegraf.conf”[[outputs.influxdb]] urls = ["http://influxdb:8086"] database = "hydra"
[[inputs.udp_listener]] service_address = ":8094" data_format = "influx"Grafana Dashboards
Section titled “Grafana Dashboards”Example Queries (Flux)
Section titled “Example Queries (Flux)”P99 Network Latency (ms)
from(bucket: "hydra") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r["_measurement"] == "hydra_reference_price") |> filter(fn: (r) => r["_field"] == "network_latency_ms") |> aggregateWindow(every: v.windowPeriod, column: "_value", fn: (column, tables=<-) => tables |> quantile(q: 0.99))Average Event Loop Lag
from(bucket: "hydra") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r["_measurement"] == "hydra_event_loop") |> filter(fn: (r) => r["_field"] == "lag_ms") |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)Cumulative Profit (from Fills)
from(bucket: "hydra") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r["_measurement"] == "hydra_fill") |> filter(fn: (r) => r["_field"] == "size" or r["_field"] == "price") // Note: Simplified logic, actual PnL requires market-aware calculation |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value") |> map(fn: (r) => ({ r with _value: r.size * r.price })) |> cumulativeSum()