# OKF Bundle Examples

Three production-ready bundles you can steal, adapt, and ship today. No toy demos — these have real frontmatter, working cross-links, and folder structures you’d actually deploy.

Hot take: A good bundle passes the “new hire” test. Someone opens index.md and understands the domain in 30 seconds. If they need a guide to navigate it, the structure failed.

## 1. E-commerce Analytics

A data team documenting their BigQuery tables, metrics, and dashboards. The scenario: a new analyst joins Monday morning and needs to find things without pinging Slack every five minutes.

### Folder tree

ecommerce-analytics/
├── index.md
├── tables/
│   ├── index.md
│   ├── orders.md
│   └── customers.md
├── metrics/
│   ├── index.md
│   └── gross-revenue.md
└── log.md

Notice how the structure mirrors the mental model: tables and metrics are separate concerns, each in its own folder, rather than one folder with 40 files.

### `index.md`

# E-commerce Analytics

Knowledge bundle for the Acme e-commerce analytics domain.

# Tables

* [Orders](tables/orders.md) - one row per completed order
* [Customers](tables/customers.md) - customer registry with RFM segmentation

# Metrics

* [Gross Revenue](metrics/gross-revenue.md) - GMV before returns and discounts

This index does one job: orientation. A sentence per item, a link, done. No preamble about “the purpose of this document.”

### `tables/orders.md`

---
type: BigQuery Table
title: Orders
description: One row per finalized order across all channels (app, web, marketplace).
resource: https://console.cloud.google.com/bigquery?p=acme-prod&d=ecommerce&t=orders
tags: [sales, core, revenue]
timestamp: 2026-06-10T14:00:00Z
owner: data-team
sla_freshness: 30min
---

The primary billing table. Every revenue metric starts here.

# Schema

| Column | Type | Description |
|----------------|-----------|--------------------------------------------------|
| `order_id` | STRING | Order UUID. Primary key. |
| `customer_id` | STRING | FK → [customers](/tables/customers.md). |
| `total_usd` | NUMERIC | Total in USD (taxes included, shipping excluded). |
| `status` | STRING | `paid`, `refunded`, `cancelled`. |
| `channel` | STRING | `app`, `web`, `marketplace`. |
| `created_at` | TIMESTAMP | When the order was submitted. |
| `updated_at` | TIMESTAMP | Last status change. |

# Joins

- [customers](/tables/customers.md) via `customer_id`
- Used by [Gross Revenue](/metrics/gross-revenue.md) metric

# Notes

- Orders with `status = cancelled` are **excluded** from revenue calculations.
- Partitioned by `created_at` (DAY). Queries without a date filter will blow up your bill.

# Citations

[1] [Ingestion pipeline docs](https://wiki.acme.internal/data/pipeline-orders)

Notice how `sla_freshness: 30min` in the frontmatter isn't part of the OKF spec. It's a custom field, and the spec explicitly allows this. Your agents can filter on it, your dashboards can read it. Use it.
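To make "agents can filter on it" concrete, here is a minimal sketch of reading a custom field out of the frontmatter. It assumes flat `key: value` frontmatter like the example above and uses a hand-rolled parser rather than a YAML library; the helper names and the `h` unit are illustrative assumptions, not part of the OKF spec.

```python
import re


def parse_frontmatter(text):
    """Parse flat `key: value` pairs between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def freshness_minutes(value):
    """Convert '30min' or '2h' to minutes; return None if unrecognized.

    The 'h' unit is an assumption for illustration; the example only uses 'min'.
    """
    match = re.fullmatch(r"(\d+)\s*(min|h)", value)
    if not match:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 60 if unit == "h" else amount


ORDERS_MD = """---
type: BigQuery Table
title: Orders
resource: https://console.cloud.google.com/bigquery?p=acme-prod&d=ecommerce&t=orders
owner: data-team
sla_freshness: 30min
---

The primary billing table. Every revenue metric starts here.
"""

meta = parse_frontmatter(ORDERS_MD)
minutes = freshness_minutes(meta.get("sla_freshness", ""))
# Route tables with a tight freshness SLA to a stricter monitoring tier.
needs_hourly_check = minutes is not None and minutes <= 60
```

A real consumer would likely use a proper YAML parser, since list values such as `tags: [sales, core, revenue]` stay as raw strings in this sketch.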

### `tables/customers.md`

---
type: BigQuery Table
title: Customers
description: Unified customer registry with RFM segmentation and lifecycle stage.
resource: https://console.cloud.google.com/bigquery?p=acme-prod&d=ecommerce&t=customers
tags: [customers, segmentation, core]
timestamp: 2026-06-08T10:00:00Z
owner: data-team
---

Canonical customer base. Every cohort analysis or segmentation query starts here.

# Schema

| Column | Type | Description |
|-------------------|-----------|-----------------------------------------------|
| `customer_id` | STRING | UUID. Primary key. |
| `email` | STRING | Primary email (PII — restricted access). |
| `rfm_segment` | STRING | `champions`, `at_risk`, `hibernating`, etc. |
| `first_order_at` | TIMESTAMP | Date of first order. |
| `ltv_usd` | NUMERIC | Accumulated lifetime value. |
| `created_at` | TIMESTAMP | Registration date. |

# Joins

- [orders](/tables/orders.md) via `customer_id`

# Notes

- Updated daily by the RFM segmentation job (6am UTC).
- `email` requires the `pii-reader` role — never expose in public dashboards.

### `metrics/gross-revenue.md`

---
type: Metric
title: Gross Revenue (GMV)
description: Sum of total_usd from paid orders, before returns.
tags: [revenue, kpi, finance]
timestamp: 2026-06-10T14:00:00Z
owner: finance-team
granularity: daily
---

The primary business KPI. Reported in the monthly board deck.

# Definition

```sql
SELECT
  DATE(created_at) AS day,
  SUM(total_usd) AS gross_revenue
FROM `acme-prod.ecommerce.orders`
WHERE status = 'paid'
GROUP BY 1
```

# Data source

Computed from the [orders](/tables/orders.md) table, filtering `status = 'paid'`.

# Gotchas

- Excludes refunded and cancelled orders.
- Does not deduct partial returns (see net-revenue metric when it exists).
- Shipping is not included in `total_usd`.

# Citations

[1] CFO-approved metric definition (Confluence)


What makes this example work: the SQL definition is the single source of truth. No ambiguity about "what counts as revenue" — it's right there in the query. An agent or analyst can read this and reproduce the number exactly.
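As a rough illustration of "reproduce the number exactly", the sketch below runs the same query shape against a handful of made-up rows. It uses an in-memory SQLite table as a stand-in for BigQuery, so the table name is shortened and an `ORDER BY` is added for stable output; the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT, total_usd REAL, status TEXT, created_at TEXT)"
)
# Hypothetical sample data: three paid orders, one refunded, one cancelled.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("o1", 100.0, "paid", "2026-06-09 10:00:00"),
        ("o2", 50.0, "paid", "2026-06-09 18:30:00"),
        ("o3", 75.0, "refunded", "2026-06-09 19:00:00"),
        ("o4", 20.0, "cancelled", "2026-06-10 08:00:00"),
        ("o5", 30.0, "paid", "2026-06-10 09:15:00"),
    ],
)

# Same logic as the metric definition: only paid orders, grouped by day.
rows = conn.execute(
    """
    SELECT DATE(created_at) AS day, SUM(total_usd) AS gross_revenue
    FROM orders
    WHERE status = 'paid'
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()

for day, gross_revenue in rows:
    print(day, gross_revenue)
```

The refunded and cancelled rows never reach the sum, which matches the gotchas listed in the metric file.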

---

## 2. SaaS Incident Playbooks

An SRE/platform team documenting runbooks, alerts, and escalation procedures. The scenario: it's 3am, the pager went off, and the on-call engineer needs to know what to do without waking anyone up (yet).

### Folder tree

incident-playbooks/
├── index.md
├── alerts/
│   ├── index.md
│   ├── api-latency-p99.md
│   └── db-connections-exhausted.md
├── runbooks/
│   ├── index.md
│   └── escalate-incident.md
└── log.md


### `index.md`

```markdown
# Incident Playbooks

Operational knowledge for the Platform team. If the pager fired, start here.

# Alerts

* [API Latency P99](alerts/api-latency-p99.md) - latency above SLO at the gateway
* [DB Connections Exhausted](alerts/db-connections-exhausted.md) - connection pool depleted

# Runbooks

* [Escalate Incident](runbooks/escalate-incident.md) - when and how to escalate to leadership
```

### `alerts/api-latency-p99.md`

---
type: Alert
title: API Latency P99 > 2s
description: Fires when API Gateway P99 latency exceeds 2 seconds for 5 minutes.
tags: [oncall, api, latency, sev2]
timestamp: 2026-05-20T09:00:00Z
owner: platform-team
severity: SEV2
slo: 99.5% requests < 2s
---

# Trigger

Prometheus alert rule:

```yaml
- alert: APILatencyP99High
  expr: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) > 2
  for: 5m
  labels:
    severity: sev2
```

# Diagnosis

1. Open the Grafana API Latency dashboard.
2. Check whether it's a specific endpoint or system-wide.
3. Check the connection pool. If exhausted, see [DB Connections Exhausted](/alerts/db-connections-exhausted.md).
4. Check recent deploys: `kubectl rollout history deploy/api-gateway -n production`.

# Mitigation

1. Single endpoint: enable circuit breaker via feature flag `cb_{endpoint}`.
2. System-wide: scale pods with `kubectl scale deploy/api-gateway --replicas=10 -n production`.
3. Still broken after 15min: follow [Escalate Incident](/runbooks/escalate-incident.md).

# Known false positives

- The financial reconciliation batch (daily at 2am UTC) causes a 3-4min spike. Ignore if it self-resolves.

# Citations

[1] SLO definitions (internal wiki)


Notice the structure here: Trigger → Diagnosis → Mitigation → False positives. That's not random. It mirrors what the on-call engineer's brain does: "What fired? → What's wrong? → How do I fix it? → Wait, is this even real?" Every alert doc should follow this sequence.
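If you want to enforce that sequence rather than rely on convention, a small lint can check heading order. The sketch below is one possible approach, assuming alert files use top-level `#` headings with exactly these names; the function names and the required list are illustrative, not something OKF prescribes.

```python
REQUIRED_ORDER = ["Trigger", "Diagnosis", "Mitigation", "Known false positives"]
FENCE = "`" * 3  # built programmatically so this example nests cleanly in markdown


def top_level_headings(markdown_text):
    """Return `# ` headings in document order, skipping fenced code blocks."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
            continue
        # YAML or shell comments inside fences also start with "# ", so skip them.
        if not in_fence and line.startswith("# "):
            headings.append(line[2:].strip())
    return headings


def follows_alert_sequence(markdown_text, required=REQUIRED_ORDER):
    """True if every required heading exists and they appear in the given order."""
    headings = top_level_headings(markdown_text)
    positions = []
    for name in required:
        if name not in headings:
            return False
        positions.append(headings.index(name))
    return positions == sorted(positions)


good_doc = "\n".join([
    "# Trigger", "", FENCE + "yaml", "# not a heading", FENCE, "",
    "# Diagnosis", "", "# Mitigation", "", "# Known false positives", "",
    "# Citations",
])
shuffled_doc = "\n".join(
    ["# Trigger", "# Mitigation", "# Diagnosis", "# Known false positives"]
)

good_ok = follows_alert_sequence(good_doc)
shuffled_ok = follows_alert_sequence(shuffled_doc)
```

Extra sections such as `# Citations` are allowed; only the relative order of the required headings is checked.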

### `alerts/db-connections-exhausted.md`

````markdown
---
type: Alert
title: DB Connections Exhausted
description: PostgreSQL connection pool reached 95% capacity.
tags: [oncall, database, postgres, sev1]
timestamp: 2026-06-01T11:00:00Z
owner: platform-team
severity: SEV1
---

# Trigger

```yaml
- alert: DBConnectionPoolExhausted
  expr: pg_stat_activity_count / pg_settings_max_connections > 0.95
  for: 2m
  labels:
    severity: sev1
```

# Diagnosis

1. Find long-running queries: `SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;`
2. Check for connection leaks (app not closing connections).
3. Check if [API Latency P99](/alerts/api-latency-p99.md) also fired. Cascading failure is common here.

# Immediate mitigation

1. Kill queries running > 5min: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state != 'idle' AND pid <> pg_backend_pid() AND now() - query_start > interval '5 minutes';`
2. If that doesn't help: restart PgBouncer with `systemctl restart pgbouncer`.
3. **Do NOT restart Postgres** without escalating first. See [Escalate Incident](/runbooks/escalate-incident.md).

# Common root causes

- Deploy with heavy migration missing `statement_timeout`.
- Report cronjob without connection pooling.
- N+1 query in a newly deployed endpoint.
````

The bold "Do NOT restart Postgres" is doing real work. At 3am, a panicking engineer might reach for the nuclear option. This doc prevents that.

### `runbooks/escalate-incident.md`

````markdown
---
type: Runbook
title: Escalate Incident
description: Escalation process. When to stop solo debugging and call for backup.
tags: [oncall, process, escalation]
timestamp: 2026-05-15T08:00:00Z
owner: platform-team
---

# When to escalate

Escalate **immediately** if:

- SEV1 with no mitigation after 10 minutes.
- SEV2 with no mitigation after 30 minutes.
- Any incident with visible financial impact (orders failing).
- You don't understand what's happening. This is valid. No shame.

# How to escalate

| Step | Action | Contact |
|------|--------|---------|
| 1 | Declare incident in Slack `#incidents` | `@oncall-platform` |
| 2 | Page engineering manager | PagerDuty escalation policy `platform-em` |
| 3 | If customer impact > 5min | Page `@oncall-cs` for comms |
| 4 | If revenue impacted | Page `@oncall-finance` |

# Declaration template

```
🚨 INCIDENT DECLARED
Severity: SEV{1|2}
Alert: {link to the alert that fired}
Impact: {what the user is experiencing}
Status: Investigating / Mitigating / Resolved
IC: @{your-name}
```

# Post-incident

- Postmortem required for SEV1, optional for SEV2.
- Deadline: 48h after resolution.
- Template: [Postmortem Template (Confluence)](https://wiki.acme.internal/templates/postmortem).

# Related

- [API Latency P99](/alerts/api-latency-p99.md): most common alert that triggers escalation.
- [DB Connections Exhausted](/alerts/db-connections-exhausted.md): second most common.
````

## 3. API Documentation

A bundle for documenting a REST API. Different from OpenAPI (which is a contract spec) — this focuses on contextual knowledge: why the endpoint exists, edge cases, rate limits, real-world examples.

### Folder tree

api-docs/
├── index.md
├── auth/
│   ├── index.md
│   └── oauth2-flow.md
├── endpoints/
│   ├── index.md
│   ├── create-order.md
│   └── list-customers.md
├── policies/
│   ├── index.md
│   └── rate-limits.md
└── log.md

### `index.md`

# API Docs — Acme Commerce API

Contextual documentation for the public API v2. For the raw OpenAPI spec, see the [Swagger UI](https://api.acme.com/docs).

# Authentication

* [OAuth2 Flow](auth/oauth2-flow.md) - how to obtain and refresh tokens

# Endpoints

* [Create Order](endpoints/create-order.md) - POST /v2/orders
* [List Customers](endpoints/list-customers.md) - GET /v2/customers

# Policies

* [Rate Limits](policies/rate-limits.md) - per-plan limits and control headers

The distinction between this bundle and OpenAPI matters. OpenAPI tells machines “what parameters does this endpoint accept.” This bundle tells humans and agents “what happens when you use it wrong, and why it was built this way.”

### `auth/oauth2-flow.md`

---
type: Auth Flow
title: OAuth2 Client Credentials
description: Machine-to-machine authentication flow for the Acme API.
resource: https://api.acme.com/oauth/token
tags: [auth, oauth2, security]
timestamp: 2026-06-05T16:00:00Z
owner: api-team
---

# Flow

```bash
curl -X POST https://api.acme.com/oauth/token \
  -d grant_type=client_credentials \
  -d client_id=$CLIENT_ID \
  -d client_secret=$CLIENT_SECRET \
  -d scope="orders:write customers:read"
```

Response:

```json
{
  "access_token": "eyJhbGci...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "scope": "orders:write customers:read"
}
```

# Usage

Include the token in every request header:

```
Authorization: Bearer {access_token}
```

# Available scopes

| Scope | Grants |
|-------|--------|
| `orders:read` | Read orders |
| `orders:write` | Create and update orders |
| `customers:read` | List and search customers |
| `customers:write` | Create and update customers |

# Watch out

- Token expires in 1 hour. Implement refresh before expiration, not after.
- Rate limit on `/oauth/token`: 10 req/min per `client_id`. Do not request a fresh token on every API call.
- Respect the general [rate limits](/policies/rate-limits.md) once authenticated.

# Citations

[1] RFC 6749: OAuth 2.0 Client Credentials
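The "refresh before expiration" and "don't request a token on every call" advice can be sketched as a small cache. This is a minimal illustration under stated assumptions: `TokenCache`, `refresh_margin_s`, and the injected `fetch_token` callable are hypothetical names, and the fake fetcher stands in for the real call to `/oauth/token`.

```python
import time


class TokenCache:
    """Reuse one token and renew it shortly before it expires."""

    def __init__(self, fetch_token, refresh_margin_s=60, clock=time.monotonic):
        self._fetch_token = fetch_token  # callable returning the token response dict
        self._margin = refresh_margin_s  # renew this many seconds before expiry
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            payload = self._fetch_token()
            self._token = payload["access_token"]
            self._expires_at = now + payload["expires_in"]
        return self._token


# Demo with a fake clock and a fake token endpoint (no network involved).
calls = []
fake_now = [0.0]


def fake_fetch():
    calls.append(1)
    return {"access_token": f"token-{len(calls)}", "expires_in": 3600}


cache = TokenCache(fake_fetch, refresh_margin_s=60, clock=lambda: fake_now[0])
first = cache.get()    # fetches a token
fake_now[0] = 1000.0
second = cache.get()   # well within lifetime, reuses the cached token
fake_now[0] = 3550.0
third = cache.get()    # inside the 60s margin before expiry, fetches again
```

Injecting the clock and the fetcher keeps the cache testable and keeps the token endpoint's 10 req/min limit out of the hot path.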


### `endpoints/create-order.md`

````markdown
---
type: API Endpoint
title: Create Order
description: Creates a new order. Requires orders:write scope.
resource: https://api.acme.com/v2/orders
tags: [orders, write, core]
timestamp: 2026-06-10T10:00:00Z
owner: api-team
method: POST
path: /v2/orders
auth_scope: orders:write
---

# Request

```bash
curl -X POST https://api.acme.com/v2/orders \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "cust_abc123",
    "items": [
      {"sku": "WIDGET-01", "quantity": 2, "unit_price_usd": 49.90}
    ],
    "shipping_address_id": "addr_xyz789"
  }'
```

# Response (201 Created)

```json
{
  "order_id": "ord_def456",
  "status": "pending_payment",
  "total_usd": 99.80,
  "created_at": "2026-06-10T10:30:00Z"
}
```

# Common errors

| Status | Code | Cause | Fix |
|--------|------|-------|-----|
| 400 | `invalid_sku` | SKU doesn't exist in catalog | Validate first via `GET /v2/products` |
| 401 | `token_expired` | Token expired | Refresh via [OAuth2 flow](/auth/oauth2-flow.md) |
| 429 | `rate_limited` | Rate limit exceeded | See [Rate Limits](/policies/rate-limits.md) |
| 422 | `insufficient_stock` | Out of stock | Reduce quantity or wait for restock |

# Idempotency

Send `Idempotency-Key: {uuid}` in the header to ensure retries don't duplicate orders.

# Notes

- `total_usd` is computed server-side. Don't trust your client-side calculation.
- Orders stay `pending_payment` for 30 minutes. After that, automatic cancellation.
- This endpoint feeds the same entity documented in the analytics bundle: same order, different audience.
````
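The idempotency note only helps if the client reuses the same key across retries of one logical order. Here is a minimal sketch of that pattern, with the HTTP call abstracted behind an injected `send` callable; `create_order_with_retries` and `flaky_send` are hypothetical names, and retrying only on `ConnectionError` is an assumption made for illustration.

```python
import uuid


def create_order_with_retries(send, payload, max_attempts=3):
    """Call send(payload, headers) until it succeeds, reusing one idempotency key."""
    # Generate the key once per logical order, not once per attempt.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(payload, headers)
        except ConnectionError as error:
            last_error = error
    raise last_error


# Demo: a fake transport that times out once, then succeeds.
seen_keys = []


def flaky_send(payload, headers):
    seen_keys.append(headers["Idempotency-Key"])
    if len(seen_keys) == 1:
        raise ConnectionError("simulated timeout")
    return {"order_id": "ord_def456", "status": "pending_payment"}


result = create_order_with_retries(flaky_send, {"customer_id": "cust_abc123"})
```

Because both attempts carry the same key, a server that honors `Idempotency-Key` can treat the second request as a replay instead of creating a duplicate order.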


### `policies/rate-limits.md`

````markdown
---
type: Policy
title: Rate Limits
description: Per-plan request limits, control headers, and throttle behavior.
tags: [api, rate-limit, policy, infra]
timestamp: 2026-06-01T09:00:00Z
owner: api-team
---

# Limits by plan

| Plan | Requests/min | Requests/day | Burst |
|------------|--------------|--------------|-------|
| Free | 60 | 10,000 | 10 |
| Pro | 600 | 100,000 | 50 |
| Enterprise | 6,000 | 1,000,000 | 200 |

# Response headers

Every response includes:

```
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 584
X-RateLimit-Reset: 1718020800
```

- `X-RateLimit-Reset` is a Unix timestamp for when the bucket resets.

# Throttle behavior

Response `429 Too Many Requests`:

```json
{
  "error": "rate_limited",
  "message": "Rate limit exceeded. Retry after 12 seconds.",
  "retry_after": 12
}
```

Implement exponential backoff. Don't hammer the API in a tight loop. Clients that do this get blocked for 24 hours.

# Endpoints with special limits

| Endpoint | Limit | Reason |
|----------|-------|--------|
| `POST /oauth/token` | 10/min | Brute force prevention |
| `POST /v2/orders` | 30/min | Order spam prevention |
| `GET /v2/reports/*` | 5/min | Heavy queries |

# Best practices

- Cache aggressively on GETs that don't change often (customers, products).
- Batch when possible. `POST /v2/orders/batch` accepts up to 50 orders per request.
- Monitor the `X-RateLimit-Remaining` header and throttle yourself before hitting the wall.
- Need more? Talk to sales or look at the Enterprise plan.

# Related

- [OAuth2 Flow](/auth/oauth2-flow.md): token endpoint has its own 10/min limit.
- [Create Order](/endpoints/create-order.md): endpoint with the special 30/min cap.

# Citations

[1] IETF RFC 6585: 429 Too Many Requests

[2] API Design Guidelines (internal)
````
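The throttle advice in this file can be sketched as a retry loop that honors `retry_after` and otherwise backs off exponentially. This is a minimal illustration, not the API's official client: `next_delay`, `call_with_backoff`, and the `(status, body)` return shape of `request` are assumptions, and jitter is left out to keep the example deterministic.

```python
import time


def next_delay(attempt, retry_after=None, base_s=1.0, cap_s=60.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    exponential = min(cap_s, base_s * 2 ** attempt)
    if retry_after is not None:
        # Never retry sooner than the server asked for.
        return max(exponential, float(retry_after))
    return exponential


def call_with_backoff(request, max_attempts=5, sleep=time.sleep):
    """Call request() until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, body = request()
        if status != 429:
            return status, body
        sleep(next_delay(attempt, body.get("retry_after")))
    raise RuntimeError("still rate limited after retries")


# Demo with canned responses and a recorded sleep instead of real waiting.
responses = [
    (429, {"error": "rate_limited", "retry_after": 12}),
    (429, {"error": "rate_limited", "retry_after": 12}),
    (200, {"ok": True}),
]
waits = []
status, body = call_with_backoff(lambda: responses.pop(0), sleep=waits.append)
```

In production you would usually add random jitter to each delay so that many clients throttled at the same moment do not all retry together.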


---

## Patterns across all three examples

1. **`type` is domain-specific.** There's no fixed list. Use `BigQuery Table`, `Alert`, `Runbook`, `API Endpoint`, `Policy`, `Metric` — whatever your team actually calls these things.

2. **Cross-links are generous.** If one concept references another, link it. It's cheap and makes the bundle navigable like a wiki. Notice how the DB connections alert links to the latency alert (cascade), and both link to the escalation runbook. That's a graph, not a folder.

3. **`index.md` is a map, not a junk drawer.** One line per item. Direct link. No throat-clearing.

4. **Extra frontmatter fields are encouraged.** The spec allows any additional key. `owner`, `severity`, `sla_freshness`, `method`, `auth_scope` — add whatever the consumer (human or agent) needs to filter and route.

5. **`# Citations` at the end.** External links that validate content. Agents use these to verify claims. Humans use them to dig deeper.

6. **Body is structured.** Headings, tables, code blocks. More structure means better retrieval by agents. Long prose paragraphs are noise — they bury the signal.
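Pattern 2 treats the bundle as a graph, which also means it can be checked like one. The sketch below builds a link graph from file contents and reports links that point at missing files. It assumes, based on the examples above, that links starting with `/` resolve from the bundle root and other links resolve relative to the linking file; the function names and the in-memory `bundle` dict are illustrative.

```python
import posixpath
import re

# Matches markdown links and captures the target: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def resolve(source_path, target):
    """Resolve a link target to a bundle-relative path."""
    if target.startswith("/"):
        return target.lstrip("/")
    return posixpath.normpath(
        posixpath.join(posixpath.dirname(source_path), target)
    )


def link_graph(files):
    """Map each file path to the set of bundle files it links to."""
    graph = {}
    for path, text in files.items():
        targets = set()
        for target in LINK_RE.findall(text):
            if "://" in target:
                continue  # external URL, not part of the bundle graph
            targets.add(resolve(path, target))
        graph[path] = targets
    return graph


def dangling_links(files):
    """Return sorted (source, target) pairs whose target file is missing."""
    graph = link_graph(files)
    return sorted(
        (src, dst)
        for src, targets in graph.items()
        for dst in targets
        if dst not in files
    )


bundle = {
    "index.md": "* [API Latency P99](alerts/api-latency-p99.md)",
    "alerts/api-latency-p99.md": (
        "See [DB Connections Exhausted](/alerts/db-connections-exhausted.md) "
        "and [Escalate Incident](/runbooks/escalate-incident.md). "
        "[Wiki](https://wiki.acme.internal/slo)"
    ),
    "alerts/db-connections-exhausted.md": "See [API Latency P99](/alerts/api-latency-p99.md).",
}

graph = link_graph(bundle)
broken = dangling_links(bundle)
```

Here the runbook file is deliberately left out of `bundle`, so the check flags the one link that points at it.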