# OKF Bundle Examples

Three production-ready bundles you can steal, adapt, and ship today. No toy demos — these have real frontmatter, working cross-links, and folder structures you’d actually deploy.

Hot take: A good bundle passes the “new hire” test. Someone opens index.md and understands the domain in 30 seconds. If they need a guide to navigate it, the structure failed.


## 1. E-commerce Analytics

A data team documenting their BigQuery tables, metrics, and dashboards. The scenario: a new analyst joins Monday morning and needs to find things without pinging Slack every five minutes.

### Folder tree

ecommerce-analytics/
├── index.md
├── tables/
│   ├── index.md
│   ├── orders.md
│   └── customers.md
├── metrics/
│   ├── index.md
│   └── gross-revenue.md
└── log.md

Notice how the structure mirrors the mental model — tables and metrics are separate concerns. No one folder with 40 files.

### `index.md`

# E-commerce Analytics

Knowledge bundle for the Acme e-commerce analytics domain.

# Tables

* [Orders](tables/orders.md) - one row per completed order
* [Customers](tables/customers.md) - customer registry with RFM segmentation

# Metrics

* [Gross Revenue](metrics/gross-revenue.md) - GMV before returns and discounts

This index does one job: orientation. A sentence per item, a link, done. No preamble about “the purpose of this document.”
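One way to keep an index this disciplined is to lint it. The sketch below is an illustration, not part of OKF: `lint_index`, the item pattern, and the 12-word limit are all assumptions made up for this example. It flags any bullet that is not a single link followed by a short description.

```python
import re

# Assumed item shape: "* [Title](relative/path.md) - short description"
INDEX_ITEM = re.compile(
    r"^\* \[(?P<title>[^\]]+)\]\((?P<path>[^)]+\.md)\) - (?P<desc>.+)$"
)

def lint_index(text: str, max_desc_words: int = 12) -> list[str]:
    """Return a list of problems found in the bullet items of an index.md."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.startswith("* "):
            continue  # headings, blank lines and prose are not checked
        match = INDEX_ITEM.match(line)
        if match is None:
            problems.append(f"line {lineno}: not a 'link - description' item")
        elif len(match["desc"].split()) > max_desc_words:
            problems.append(
                f"line {lineno}: description longer than {max_desc_words} words"
            )
    return problems

sample = """# Tables

* [Orders](tables/orders.md) - one row per completed order
* Customers table, ask Bob
"""
print(lint_index(sample))
```

The second bullet fails because it has no link, which is exactly the kind of entry that sends a new hire back to Slack.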

### `tables/orders.md`

---
type: BigQuery Table
title: Orders
description: One row per finalized order across all channels (app, web, marketplace).
resource: https://console.cloud.google.com/bigquery?p=acme-prod&d=ecommerce&t=orders
tags: [sales, core, revenue]
timestamp: 2026-06-10T14:00:00Z
owner: data-team
sla_freshness: 30min
---

The primary billing table. Every revenue metric starts here.

# Schema

| Column | Type | Description |
|----------------|-----------|--------------------------------------------------|
| `order_id` | STRING | Order UUID. Primary key. |
| `customer_id` | STRING | FK → [customers](/tables/customers.md). |
| `total_usd` | NUMERIC | Total in USD (taxes included, shipping excluded). |
| `status` | STRING | `paid`, `refunded`, `cancelled`. |
| `channel` | STRING | `app`, `web`, `marketplace`. |
| `created_at` | TIMESTAMP | When the order was submitted. |
| `updated_at` | TIMESTAMP | Last status change. |

# Joins

- [customers](/tables/customers.md) via `customer_id`
- Used by [Gross Revenue](/metrics/gross-revenue.md) metric

# Notes

- Orders with `status = cancelled` or `status = refunded` are **excluded** from Gross Revenue, which counts only `paid` orders.
- Partitioned by `created_at` (DAY). Queries without a date filter will blow up your bill.

# Citations

[1] [Ingestion pipeline docs](https://wiki.acme.internal/data/pipeline-orders)

Notice that `sla_freshness: 30min` in the frontmatter isn't part of the OKF spec; it's a custom field. The spec explicitly allows this. Your agents can filter on it, your dashboards can read it. Use it.
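To make "agents can filter on it" concrete, here is a minimal sketch of reading a custom field. It assumes flat `key: value` frontmatter and unit suffixes of `min` or `h`; `parse_frontmatter` and `freshness_minutes` are hypothetical helpers, and a real consumer would more likely use a YAML library.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` frontmatter between the first two '---' lines.

    Deliberately minimal: no nested YAML, and lists stay as raw strings.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def freshness_minutes(value: str) -> int:
    """Convert values like '30min' or '2h' to minutes (assumed suffixes)."""
    if value.endswith("min"):
        return int(value[:-3])
    if value.endswith("h"):
        return int(value[:-1]) * 60
    raise ValueError(f"unrecognized freshness value: {value!r}")

orders_md = """---
type: BigQuery Table
title: Orders
sla_freshness: 30min
owner: data-team
---

The primary billing table.
"""

meta = parse_frontmatter(orders_md)
is_fresh_enough = freshness_minutes(meta["sla_freshness"]) <= 60
print(meta["title"], is_fresh_enough)
```

An agent could apply the same check across every file in `tables/` to pick only sources that meet a freshness requirement.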

### `tables/customers.md`

---
type: BigQuery Table
title: Customers
description: Unified customer registry with RFM segmentation and lifecycle stage.
resource: https://console.cloud.google.com/bigquery?p=acme-prod&d=ecommerce&t=customers
tags: [customers, segmentation, core]
timestamp: 2026-06-08T10:00:00Z
owner: data-team
---

Canonical customer base. Every cohort analysis or segmentation query starts here.

# Schema

| Column | Type | Description |
|-------------------|-----------|-----------------------------------------------|
| `customer_id` | STRING | UUID. Primary key. |
| `email` | STRING | Primary email (PII — restricted access). |
| `rfm_segment` | STRING | `champions`, `at_risk`, `hibernating`, etc. |
| `first_order_at` | TIMESTAMP | Date of first order. |
| `ltv_usd` | NUMERIC | Accumulated lifetime value. |
| `created_at` | TIMESTAMP | Registration date. |

# Joins

- [orders](/tables/orders.md) via `customer_id`

# Notes

- Updated daily by the RFM segmentation job (6am UTC).
- `email` requires the `pii-reader` role — never expose in public dashboards.

### `metrics/gross-revenue.md`

---
type: Metric
title: Gross Revenue (GMV)
description: Sum of total_usd from paid orders, before returns.
tags: [revenue, kpi, finance]
timestamp: 2026-06-10T14:00:00Z
owner: finance-team
granularity: daily
---

The primary business KPI. Reported in the monthly board deck.

# Definition

```sql
SELECT
  DATE(created_at) AS day,
  SUM(total_usd) AS gross_revenue
FROM `acme-prod.ecommerce.orders`
WHERE status = 'paid'
GROUP BY 1
```

# Data source

Computed from the `orders` table, filtering `status = 'paid'`.

# Gotchas

- Excludes refunded and cancelled orders.
- Does not deduct partial returns (see net-revenue metric when it exists).
- Shipping is not included in `total_usd`.

# Citations

[1] CFO-approved metric definition — Confluence


What makes this example work: the SQL definition is the single source of truth. No ambiguity about "what counts as revenue" — it's right there in the query. An agent or analyst can read this and reproduce the number exactly.
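As a sanity check on "reproduce the number exactly", the same definition can be mirrored outside the warehouse. This is a sketch only: the rows are made up, `gross_revenue_by_day` is a hypothetical helper, and it assumes `created_at` values are UTC ISO 8601 strings.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal

def gross_revenue_by_day(orders):
    """Mirror the metric's SQL: sum total_usd of 'paid' orders per order date."""
    totals = defaultdict(Decimal)
    for order in orders:
        if order["status"] != "paid":
            continue  # refunded and cancelled orders are excluded
        day = datetime.fromisoformat(order["created_at"]).date()
        totals[day.isoformat()] += Decimal(order["total_usd"])
    return dict(totals)

# Made-up sample rows, for illustration only.
sample_orders = [
    {"order_id": "a", "status": "paid", "total_usd": "99.80",
     "created_at": "2026-06-10T10:30:00+00:00"},
    {"order_id": "b", "status": "paid", "total_usd": "20.20",
     "created_at": "2026-06-10T18:00:00+00:00"},
    {"order_id": "c", "status": "refunded", "total_usd": "50.00",
     "created_at": "2026-06-10T19:00:00+00:00"},
    {"order_id": "d", "status": "paid", "total_usd": "10.00",
     "created_at": "2026-06-11T08:00:00+00:00"},
]

revenue = gross_revenue_by_day(sample_orders)
print(revenue)
```

`Decimal` is used instead of `float` so the totals stay exact, matching the `NUMERIC` column type in the table doc.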

---

## 2. SaaS Incident Playbooks

An SRE/platform team documenting runbooks, alerts, and escalation procedures. The scenario: it's 3am, the pager went off, and the on-call engineer needs to know what to do without waking anyone up (yet).

### Folder tree

incident-playbooks/
├── index.md
├── alerts/
│   ├── index.md
│   ├── api-latency-p99.md
│   └── db-connections-exhausted.md
├── runbooks/
│   ├── index.md
│   └── escalate-incident.md
└── log.md


### `index.md`

```markdown
# Incident Playbooks

Operational knowledge for the Platform team. If the pager fired, start here.

# Alerts

* [API Latency P99](alerts/api-latency-p99.md) - latency above SLO at the gateway
* [DB Connections Exhausted](alerts/db-connections-exhausted.md) - connection pool depleted

# Runbooks

* [Escalate Incident](runbooks/escalate-incident.md) - when and how to escalate to leadership
```

### `alerts/api-latency-p99.md`

---
type: Alert
title: API Latency P99 > 2s
description: Fires when API Gateway P99 latency exceeds 2 seconds for 5 minutes.
tags: [oncall, api, latency, sev2]
timestamp: 2026-05-20T09:00:00Z
owner: platform-team
severity: SEV2
slo: 99.5% requests < 2s
---

# Trigger

Prometheus alert rule:

```yaml
- alert: APILatencyP99High
  expr: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) > 2
  for: 5m
  labels:
    severity: sev2
```

# Diagnosis

1. Open the Grafana API Latency Dashboard.
2. Check whether it's a specific endpoint or system-wide.
3. Check the connection pool. If exhausted, see [DB Connections Exhausted](/alerts/db-connections-exhausted.md).
4. Check recent deploys: `kubectl rollout history deploy/api-gateway -n production`.

# Mitigation

- Single endpoint: enable circuit breaker via feature flag `cb_{endpoint}`.
- System-wide: scale pods with `kubectl scale deploy/api-gateway --replicas=10 -n production`.
- Still broken after 15min: follow [Escalate Incident](/runbooks/escalate-incident.md).

# Known false positives

- The financial reconciliation batch (daily at 2am UTC) causes a 3-4min spike. Ignore if it self-resolves.

# Citations

[1] SLO definitions — internal wiki


Notice the structure here: Trigger → Diagnosis → Mitigation → False positives. That's not random. It mirrors what the on-call engineer's brain does: "What fired? → What's wrong? → How do I fix it? → Wait, is this even real?" Every alert doc should follow this sequence.
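If you want to enforce that sequence rather than hope for it, a small check can run in CI. The sketch below is an assumption about how you might do it: `section_order_ok` is a hypothetical helper, the heading names are taken from the latency alert above, and it naively treats any line starting with `# ` as a heading, so comment lines inside code blocks would confuse it.

```python
EXPECTED_ORDER = ["Trigger", "Diagnosis", "Mitigation", "Known false positives"]

def section_order_ok(markdown_body: str, expected=EXPECTED_ORDER) -> bool:
    """True if every expected heading appears, in the expected relative order."""
    headings = [
        line[2:].strip()
        for line in markdown_body.splitlines()
        if line.startswith("# ")
    ]
    positions = []
    for name in expected:
        if name not in headings:
            return False  # a required section is missing
        positions.append(headings.index(name))
    return positions == sorted(positions)

good_doc = """# Trigger
# Diagnosis
# Mitigation
# Known false positives
# Citations
"""
bad_doc = """# Mitigation
# Trigger
# Diagnosis
# Known false positives
"""
print(section_order_ok(good_doc), section_order_ok(bad_doc))
```

Extra sections such as `# Citations` are allowed; only the relative order of the four expected headings is checked.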

### `alerts/db-connections-exhausted.md`

````markdown
---
type: Alert
title: DB Connections Exhausted
description: PostgreSQL connection pool reached 95% capacity.
tags: [oncall, database, postgres, sev1]
timestamp: 2026-06-01T11:00:00Z
owner: platform-team
severity: SEV1
---

# Trigger

```yaml
- alert: DBConnectionPoolExhausted
  expr: pg_stat_activity_count / pg_settings_max_connections > 0.95
  for: 2m
  labels:
    severity: sev1
```

# Diagnosis

1. Find long-running queries: `SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;`
2. Check for connection leaks (app not closing connections).
3. Check if [API Latency P99](/alerts/api-latency-p99.md) also fired. Cascading failure is common here.

# Immediate mitigation

1. Kill queries running > 5min: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state != 'idle' AND pid <> pg_backend_pid() AND now() - query_start > interval '5 minutes';`
2. If that doesn't help, restart PgBouncer: `systemctl restart pgbouncer`.
3. **Do NOT restart Postgres** without escalating first. See [Escalate Incident](/runbooks/escalate-incident.md).

# Common root causes

- Deploy with heavy migration missing `statement_timeout`.
- Report cronjob without connection pooling.
- N+1 query in a newly deployed endpoint.
````

The bold "Do NOT restart Postgres" is doing real work. At 3am, a panicking engineer might reach for the nuclear option. This doc prevents that.

### `runbooks/escalate-incident.md`

````markdown
---
type: Runbook
title: Escalate Incident
description: Escalation process. When to stop solo debugging and call for backup.
tags: [oncall, process, escalation]
timestamp: 2026-05-15T08:00:00Z
owner: platform-team
---

# When to escalate

Escalate **immediately** if:

- SEV1 with no mitigation after 10 minutes.
- SEV2 with no mitigation after 30 minutes.
- Any incident with visible financial impact (orders failing).
- You don't understand what's happening. This is valid. No shame.

# How to escalate

| Step | Action | Contact |
|------|--------|---------|
| 1 | Declare incident in Slack `#incidents` | `@oncall-platform` |
| 2 | Page engineering manager | PagerDuty escalation policy `platform-em` |
| 3 | If customer impact > 5min, page customer support for comms | `@oncall-cs` |
| 4 | If revenue impacted, page finance | `@oncall-finance` |

# Declaration template

```
🚨 INCIDENT DECLARED
Severity: SEV{1|2}
Alert: {link to the alert that fired}
Impact: {what the user is experiencing}
Status: Investigating / Mitigating / Resolved
IC: @{your-name}
```

# Post-incident

- Postmortem required for SEV1, optional for SEV2.
- Deadline: 48h after resolution.
- Template: [Confluence: Postmortem Template](https://wiki.acme.internal/templates/postmortem).

# Related

- [API Latency P99](/alerts/api-latency-p99.md) - most common alert that triggers escalation.
- [DB Connections Exhausted](/alerts/db-connections-exhausted.md) - second most common.
````
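The "when to escalate" rules in this runbook are precise enough to encode, for example in a chat bot that nags the on-call engineer. A minimal sketch, assuming the thresholds are inclusive; `should_escalate` and its parameter names are hypothetical:

```python
# Minutes without mitigation before escalating, copied from the runbook.
ESCALATE_AFTER_MIN = {"SEV1": 10, "SEV2": 30}

def should_escalate(severity: str, minutes_without_mitigation: int,
                    financial_impact: bool = False,
                    cause_understood: bool = True) -> bool:
    """Apply the runbook's four escalation rules."""
    if financial_impact or not cause_understood:
        return True  # both are immediate-escalation conditions
    limit = ESCALATE_AFTER_MIN.get(severity)
    return limit is not None and minutes_without_mitigation >= limit

print(should_escalate("SEV1", 10))
print(should_escalate("SEV2", 15))
print(should_escalate("SEV2", 5, financial_impact=True))
```

Keeping the thresholds in one dictionary means the code and the runbook can be compared line by line when either changes.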

## 3. API Documentation

A bundle for documenting a REST API. Different from OpenAPI (which is a contract spec) — this focuses on contextual knowledge: why the endpoint exists, edge cases, rate limits, real-world examples.

### Folder tree

api-docs/
├── index.md
├── auth/
│   ├── index.md
│   └── oauth2-flow.md
├── endpoints/
│   ├── index.md
│   ├── create-order.md
│   └── list-customers.md
├── policies/
│   ├── index.md
│   └── rate-limits.md
└── log.md

### `index.md`

# API Docs — Acme Commerce API

Contextual documentation for the public API v2. For the raw OpenAPI spec, see the [Swagger UI](https://api.acme.com/docs).

# Authentication

* [OAuth2 Flow](auth/oauth2-flow.md) - how to obtain and refresh tokens

# Endpoints

* [Create Order](endpoints/create-order.md) - POST /v2/orders
* [List Customers](endpoints/list-customers.md) - GET /v2/customers

# Policies

* [Rate Limits](policies/rate-limits.md) - per-plan limits and control headers

The distinction between this bundle and OpenAPI matters. OpenAPI tells machines “what parameters does this endpoint accept.” This bundle tells humans and agents “what happens when you use it wrong, and why it was built this way.”

### `auth/oauth2-flow.md`

---
type: Auth Flow
title: OAuth2 Client Credentials
description: Machine-to-machine authentication flow for the Acme API.
resource: https://api.acme.com/oauth/token
tags: [auth, oauth2, security]
timestamp: 2026-06-05T16:00:00Z
owner: api-team
---

# Flow

```bash
curl -X POST https://api.acme.com/oauth/token \
  -d grant_type=client_credentials \
  -d "client_id=$CLIENT_ID" \
  -d "client_secret=$CLIENT_SECRET" \
  -d scope="orders:write customers:read"
```

Response:

```json
{
  "access_token": "eyJhbGci...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "scope": "orders:write customers:read"
}
```

# Usage

Include the token in every request header:

```
Authorization: Bearer {access_token}
```

# Available scopes

| Scope | Grants |
|-------------------|-----------------------------|
| `orders:read` | Read orders |
| `orders:write` | Create and update orders |
| `customers:read` | List and search customers |
| `customers:write` | Create and update customers |

# Watch out

- Token expires in 1 hour (`expires_in: 3600`). Fetch a new one before expiration, not after.
- Rate limit on `/oauth/token`: 10 req/min per `client_id`. Do not request a fresh token on every API call.
- Respect the general [rate limits](/policies/rate-limits.md) once authenticated.

# Citations

[1] RFC 6749 — OAuth 2.0 Client Credentials
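Stepping outside the bundle file for a moment: the two "watch out" items about expiry and the token rate limit both point to caching the token on the client. A minimal sketch under stated assumptions: `TokenCache`, the 60-second margin, and the injected `fetch_token` callable are hypothetical, and the fake fetcher stands in for the real HTTP call so the example runs offline.

```python
import time

class TokenCache:
    """Reuse an access token and fetch a new one shortly before it expires."""

    def __init__(self, fetch_token, refresh_margin_s=60, clock=time.monotonic):
        self._fetch_token = fetch_token  # returns the token endpoint's JSON as a dict
        self._refresh_margin_s = refresh_margin_s
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._refresh_margin_s:
            payload = self._fetch_token()
            self._token = payload["access_token"]
            self._expires_at = now + payload["expires_in"]
        return self._token

# Stand-in for the real request to /oauth/token.
calls = []
def fake_fetch():
    calls.append(1)
    return {"access_token": f"token-{len(calls)}", "expires_in": 3600}

fake_now = [0.0]
cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()
second = cache.get()    # served from the cache, no second fetch
fake_now[0] = 3550.0    # inside the 60-second refresh margin
third = cache.get()     # fetches a new token before the old one expires
print(first, second, third, len(calls))
```

Injecting the clock and the fetcher keeps the logic testable without waiting an hour or touching the network.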


### `endpoints/create-order.md`

````markdown
---
type: API Endpoint
title: Create Order
description: Creates a new order. Requires orders:write scope.
resource: https://api.acme.com/v2/orders
tags: [orders, write, core]
timestamp: 2026-06-10T10:00:00Z
owner: api-team
method: POST
path: /v2/orders
auth_scope: orders:write
---

# Request

```bash
curl -X POST https://api.acme.com/v2/orders \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "cust_abc123",
    "items": [
      {"sku": "WIDGET-01", "quantity": 2, "unit_price_usd": 49.90}
    ],
    "shipping_address_id": "addr_xyz789"
  }'
```

# Response (201 Created)

```json
{
  "order_id": "ord_def456",
  "status": "pending_payment",
  "total_usd": 99.80,
  "created_at": "2026-06-10T10:30:00Z"
}
```

# Common errors

| Status | Code | Cause | Fix |
|--------|----------------------|-----------------------------|---------------------------------------|
| 400 | `invalid_sku` | SKU doesn't exist in catalog | Validate first via `GET /v2/products` |
| 401 | `token_expired` | Token expired | Refresh via OAuth2 flow |
| 429 | `rate_limited` | Rate limit exceeded | See [Rate Limits](/policies/rate-limits.md) |
| 422 | `insufficient_stock` | Out of stock | Reduce quantity or wait for restock |

# Idempotency

Send `Idempotency-Key: {uuid}` in the header to ensure retries don't duplicate orders.

# Notes

- `total_usd` is computed server-side. Don't trust your client-side calculation.
- Orders stay `pending_payment` for 30 minutes. After that, they are cancelled automatically.
- This endpoint feeds the same entity documented in the analytics bundle: same order, different audience.
````
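The idempotency note only helps if the client reuses the same key on every retry of the same logical order. A sketch of that, assuming a client built on Python's standard `urllib`; `build_create_order_request` is a hypothetical helper, and the request is only constructed here, never sent.

```python
import json
import uuid
import urllib.request

def build_create_order_request(token: str, order: dict,
                               idempotency_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v2/orders request."""
    return urllib.request.Request(
        "https://api.acme.com/v2/orders",
        data=json.dumps(order).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
    )

order = {
    "customer_id": "cust_abc123",
    "items": [{"sku": "WIDGET-01", "quantity": 2, "unit_price_usd": 49.90}],
    "shipping_address_id": "addr_xyz789",
}

key = str(uuid.uuid4())  # generate once per logical order, reuse on every retry
attempt_1 = build_create_order_request("TOKEN", order, key)
attempt_2 = build_create_order_request("TOKEN", order, key)

# urllib stores header names via str.capitalize(), hence the lowercase "k".
same_key = attempt_1.get_header("Idempotency-key") == attempt_2.get_header("Idempotency-key")
print(same_key)
```

Generating the key inside the retry loop would defeat the purpose, since each attempt would look like a new order to the server.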

### `policies/rate-limits.md`

````markdown
---
type: Policy
title: Rate Limits
description: Per-plan request limits, control headers, and throttle behavior.
tags: [api, rate-limit, policy, infra]
timestamp: 2026-06-01T09:00:00Z
owner: api-team
---

# Limits by plan

| Plan | Requests/min | Requests/day | Burst |
|------------|--------------|--------------|-------|
| Free | 60 | 10,000 | 10 |
| Pro | 600 | 100,000 | 50 |
| Enterprise | 6,000 | 1,000,000 | 200 |

# Response headers

Every response includes:

```
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 584
X-RateLimit-Reset: 1718020800
```

- `X-RateLimit-Reset` is a Unix timestamp for when the bucket resets.

# Throttle behavior

Response `429 Too Many Requests`:

```json
{
  "error": "rate_limited",
  "message": "Rate limit exceeded. Retry after 12 seconds.",
  "retry_after": 12
}
```

Implement exponential backoff. Don't hammer the API in a tight loop: clients that do this get blocked for 24 hours.

# Endpoints with special limits

| Endpoint | Limit | Reason |
|----------------------|--------|------------------------|
| `POST /oauth/token` | 10/min | Brute force prevention |
| `POST /v2/orders` | 30/min | Order spam prevention |
| `GET /v2/reports/*` | 5/min | Heavy queries |

# Best practices

1. Cache aggressively on GETs that don't change often (customers, products).
2. Batch when possible: `POST /v2/orders/batch` accepts up to 50 orders per request.
3. Monitor the `X-RateLimit-Remaining` header and throttle yourself before hitting the wall.
4. Need more? Talk to sales or look at the Enterprise plan.

# Related

# Citations

[1] IETF RFC 6585: 429 Too Many Requests
[2] API Design Guidelines (internal)
````
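"Implement exponential backoff" is easy to say and easy to get subtly wrong. The sketch below shows one possible shape, assuming the `retry_after` field from the 429 body above is in seconds; `call_with_backoff` and the `send` callable are hypothetical, and the fake responses replace real HTTP calls so the example runs instantly.

```python
import random
import time

def call_with_backoff(send, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send() is a hypothetical callable returning (status_code, body_dict).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts, give the last 429 back to the caller
        # Exponential delay plus jitter so clients do not retry in lockstep.
        backoff = base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s)
        # Never retry sooner than the server asked for.
        sleep(max(backoff, body.get("retry_after", 0)))
    return status, body

# Fake responses: two throttles, then success.
responses = [
    (429, {"error": "rate_limited", "retry_after": 12}),
    (429, {"error": "rate_limited", "retry_after": 0}),
    (201, {"order_id": "ord_def456"}),
]
slept = []
result = call_with_backoff(lambda: responses.pop(0), sleep=slept.append)
print(result, len(slept))
```

The first wait honors the server's 12 seconds; the second falls back to the computed backoff because the server's hint is shorter.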


---

## Patterns across all three examples

1. **`type` is domain-specific.** There's no fixed list. Use `BigQuery Table`, `Alert`, `Runbook`, `API Endpoint`, `Policy`, `Metric` — whatever your team actually calls these things.

2. **Cross-links are generous.** If one concept references another, link it. It's cheap and makes the bundle navigable like a wiki. Notice how the DB connections alert links to the latency alert (cascade), and both link to the escalation runbook. That's a graph, not a folder.
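A graph is only useful if its edges resolve. Here is a rough sketch of a broken-link check over an in-memory bundle. It rests on an assumption about link semantics that you should verify against the spec: links starting with `/` are resolved from the bundle root, everything else relative to the linking file. `broken_links` is a hypothetical helper.

```python
import posixpath
import re

# Matches the target of markdown links that point at .md files.
LINK = re.compile(r"\[[^\]]+\]\(([^)#\s]+\.md)\)")

def broken_links(bundle: dict[str, str]) -> list[tuple[str, str]]:
    """bundle maps bundle-relative paths to file text.

    Returns (source, target) pairs whose target is not in the bundle.
    """
    missing = []
    for source, text in bundle.items():
        for target in LINK.findall(text):
            if "://" in target:
                continue  # external URL, not checked here
            if target.startswith("/"):
                resolved = target.lstrip("/")
            else:
                resolved = posixpath.normpath(
                    posixpath.join(posixpath.dirname(source), target)
                )
            if resolved not in bundle:
                missing.append((source, target))
    return missing

bundle = {
    "index.md": "* [API Latency P99](alerts/api-latency-p99.md) - latency above SLO",
    "alerts/api-latency-p99.md": "see [DB Connections Exhausted](/alerts/db-connections-exhausted.md)",
    "alerts/db-connections-exhausted.md": "see [Escalate Incident](/runbooks/escalate-incident.md)",
}
print(broken_links(bundle))
```

In this sample the runbook file is deliberately left out, so the only reported edge is the one pointing at it.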

3. **`index.md` is a map, not a junk drawer.** One line per item. Direct link. No throat-clearing.

4. **Extra frontmatter fields are encouraged.** The spec allows any additional key. `owner`, `severity`, `sla_freshness`, `method`, `auth_scope` — add whatever the consumer (human or agent) needs to filter and route.

5. **`# Citations` at the end.** External links that validate content. Agents use these to verify claims. Humans use them to dig deeper.

6. **Body is structured.** Headings, tables, code blocks. More structure means better retrieval by agents. Long prose paragraphs are noise — they bury the signal.
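To see why headings help retrieval, consider how a consumer might chunk a document. This is a sketch of one naive approach, not a description of how any particular agent works; `split_sections` is a hypothetical helper that splits on `# ` headings and skips lines inside fenced code blocks.

```python
def split_sections(body: str) -> dict[str, str]:
    """Split a document body into {heading: text} chunks on '# ' headings."""
    sections = {}
    current = "(intro)"
    buffer = []
    in_fence = False
    for line in body.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence  # toggle on every fence line
        if not in_fence and line.startswith("# "):
            sections[current] = "\n".join(buffer).strip()
            current, buffer = line[2:].strip(), []
        else:
            buffer.append(line)
    sections[current] = "\n".join(buffer).strip()
    return sections

body = """The primary billing table.

# Notes

- Partitioned by `created_at` (DAY).

# Citations

[1] pipeline docs
"""
chunks = split_sections(body)
print(list(chunks))
```

With chunks keyed by heading, a question about partitioning can be answered from the `Notes` chunk alone instead of the whole file.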