For ops · Observability

You can't fix what you can't see.
Here's what you can see.

If you run an ops function — monitoring deploys, watching for anomalies, paging on incidents — this page is for you. TheoCloud ships structured logs, RED+USE+agent-specific metrics, configurable alerts, and integration with the SIEM/APM you already use.

What you observe today

Structured JSON logs · 30-day retention

Shipped

Every container deploy emits JSON-structured logs with level, timestamp, request-id, and span-id fields. Logs are retained for 30 days by default. Pro+ exports logs to your own sink (S3, Datadog, Splunk HEC, Elastic).
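As a rough illustration of why the request-id and span-id fields matter, here is a minimal sketch that groups error-level lines by request. The sample lines are invented, the exact key spelling is assumed from the field list above, and the `message` key is a further assumption; check real output before relying on any of it.

```python
import json

# Hypothetical sample lines. Key names mirror the fields listed above, but
# their exact spelling (and the "message" key) are assumptions.
SAMPLE_LINES = [
    '{"level": "info", "timestamp": "2025-01-01T12:00:00Z", '
    '"request-id": "req-1", "span-id": "span-a", "message": "request started"}',
    '{"level": "error", "timestamp": "2025-01-01T12:00:01Z", '
    '"request-id": "req-1", "span-id": "span-b", "message": "upstream timeout"}',
    '{"level": "info", "timestamp": "2025-01-01T12:00:02Z", '
    '"request-id": "req-2", "span-id": "span-c", "message": "request started"}',
]


def parse_lines(lines):
    """Parse JSON log lines, skipping anything that is not a JSON object."""
    records = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            records.append(record)
    return records


def errors_by_request(records):
    """Group error-level records by their request id."""
    grouped = {}
    for record in records:
        if record.get("level") == "error":
            grouped.setdefault(record.get("request-id"), []).append(record)
    return grouped


if __name__ == "__main__":
    grouped = errors_by_request(parse_lines(SAMPLE_LINES))
    for request_id, errors in grouped.items():
        print(request_id, [e["span-id"] for e in errors])
```

The same grouping works on logs exported to your own sink, provided the exported records keep these fields.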

Runtime metrics — RED + USE + agent-specific

Shipped

Rate, Errors, Duration (RED) + Utilization, Saturation, Errors (USE) for compute, plus agent-specific spans (LLM call latency, tokens consumed, sub-agent fan-out). Available per environment and per deploy.
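For readers new to RED, the three numbers can be derived from a window of request records. The sketch below is a generic illustration, not TheoCloud's implementation: the `Request` type, the `red_summary` function, and the nearest-rank percentile method are all assumptions chosen for clarity.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    is_error: bool


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int)."""
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one observation window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0,
                "p50_ms": None, "p95_ms": None, "p99_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.is_error)
    return {
        "rate_per_s": len(requests) / window_seconds,   # Rate
        "error_ratio": errors / len(requests),          # Errors
        "p50_ms": percentile(durations, 50),            # Duration
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


if __name__ == "__main__":
    # 100 synthetic requests over a 10-second window, 5 of them errors.
    window = [Request(duration_ms=float(i), is_error=(i <= 5))
              for i in range(1, 101)]
    print(red_summary(window, window_seconds=10))
```

The p50/p95/p99 values here correspond to the latency percentiles shown on the pre-built dashboards, although the platform may compute them differently.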

Health probes · readiness + liveness

Shipped

Every deploy is gated by readiness + liveness probes before traffic shifts. Probes are declared in theo.yaml (open format). Failed probes hold the deploy and surface diagnostic logs.

Alerts — error rate · LLM cost · deploy failures

Team / Enterprise tier

Configurable alerts on error-rate spikes, LLM token budget thresholds, deploy failures, and health probe failures. Delivery by email and webhook (Slack, Discord, custom). PagerDuty integration is available in Team tier.
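To make the token budget alert concrete, here is a minimal sketch of a threshold check plus a webhook body. How TheoCloud evaluates budgets and what payload it sends are not described on this page, so `budget_alert`, the 80% default, and `webhook_body` are hypothetical. The only external facts relied on are that Slack incoming webhooks accept a JSON `text` field and Discord webhooks accept a JSON `content` field.

```python
import json
import urllib.request


def budget_alert(tokens_used, token_budget, threshold=0.8):
    """Return an alert message once usage reaches the threshold, else None.

    The 0.8 default is an illustrative assumption, not a platform default.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    used_fraction = tokens_used / token_budget
    if used_fraction < threshold:
        return None
    return (f"LLM token usage at {used_fraction:.0%} of budget "
            f"({tokens_used}/{token_budget})")


def webhook_body(message, target="slack"):
    """Build a JSON body for a Slack or Discord incoming webhook."""
    key = {"slack": "text", "discord": "content"}[target]
    return json.dumps({key: message}).encode("utf-8")


def post_webhook(url, body):
    """POST the body to a webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    message = budget_alert(tokens_used=850, token_budget=1000)
    if message is not None:
        # Print instead of sending; pass a real webhook URL to post_webhook.
        print(webhook_body(message).decode("utf-8"))
```

A custom webhook receiver would do the reverse: accept the POST, parse the JSON, and route the alert wherever your team expects it.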

Custom dashboards · per-team views

Team / Enterprise tier

Pre-built dashboards (Deploys, Errors, LLM spend, Latency p50/p95/p99) are available on every tier. The custom dashboard builder ships in Team tier: drag-and-drop, shareable per environment, and embeddable in Confluence/Notion.

Incident workflow · automated rollback

Enterprise

Auto-rollback on Sev-1 (error rate above 5%, sustained for 3 minutes). Manual rollback is always available via `theo rollback`. An incident timeline is generated automatically, and each incident gets a post-mortem template.
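The "sustained" part of the Sev-1 condition is what separates a brief spike from a real incident. The sketch below shows one way such a rule could be evaluated, assuming periodic error-rate samples. `SustainedBreachDetector` is a hypothetical name, and the platform's actual evaluation logic may differ.

```python
class SustainedBreachDetector:
    """Fires once the error rate stays above a threshold for long enough."""

    def __init__(self, threshold=0.05, sustain_seconds=180):
        # Defaults mirror the Sev-1 rule above: >5% for 3 minutes.
        self.threshold = threshold
        self.sustain_seconds = sustain_seconds
        self._breach_started_at = None

    def observe(self, timestamp_s, error_rate):
        """Record one sample; return True if the breach has been sustained."""
        if error_rate > self.threshold:
            if self._breach_started_at is None:
                self._breach_started_at = timestamp_s
            elapsed = timestamp_s - self._breach_started_at
            return elapsed >= self.sustain_seconds
        # Any sample at or below the threshold resets the clock.
        self._breach_started_at = None
        return False


if __name__ == "__main__":
    detector = SustainedBreachDetector()
    # One sample every 30 seconds, all at an 8% error rate.
    for t in range(0, 211, 30):
        if detector.observe(t, 0.08):
            print(f"sustained breach at t={t}s: a rollback would trigger here")
            break
```

Resetting on the first healthy sample keeps short spikes from accumulating toward a rollback, at the cost of restarting the clock when the error rate hovers around the threshold.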

Bring your own SIEM / APM

Datadog

Log + metric export via API key

Splunk HEC

Log forwarding to your indexer

Elastic Cloud

Log + metric forwarding

AWS S3

Long-term log archival

PagerDuty

Incident routing (Team tier)

Slack / Discord

Alert webhook delivery

See the runtime architecture

A 5-step deploy flow, including the observability stage (logs, metrics, agent-specific spans, `theo rollback`).

Operational status

Public status page with per-surface state + GitHub incident log.

Custom dashboard builder ships in Team tier; pre-built dashboards (Deploys, Errors, LLM spend, Latency p50/p95/p99) are available on every tier. Incident workflow with auto-rollback ships on Enterprise contracts — request via platform@usetheo.dev.