# OKF Bundle Examples
Three production-ready bundles you can steal, adapt, and ship today. No toy demos — these have real frontmatter, working cross-links, and folder structures you’d actually deploy.
Hot take: A good bundle passes the “new hire” test. Someone opens `index.md` and understands the domain in 30 seconds. If they need a guide to navigate it, the structure failed.
## 1. E-commerce Analytics
A data team documenting their BigQuery tables, metrics, and dashboards. The scenario: a new analyst joins Monday morning and needs to find things without pinging Slack every five minutes.
### Folder tree

```
ecommerce-analytics/
├── index.md
├── tables/
│   ├── index.md
│   ├── orders.md
│   └── customers.md
├── metrics/
│   ├── index.md
│   └── gross-revenue.md
└── log.md
```

Notice how the structure mirrors the mental model: tables and metrics are separate concerns. There is no single folder with 40 files.
### `index.md`

```markdown
# E-commerce Analytics
Knowledge bundle for the Acme e-commerce analytics domain.
# Tables
* [Orders](tables/orders.md) - one row per completed order
* [Customers](tables/customers.md) - customer registry with RFM segmentation
# Metrics
* [Gross Revenue](metrics/gross-revenue.md) - GMV before returns and discounts
```

This index does one job: orientation. A sentence per item, a link, done. No preamble about “the purpose of this document.”
### `tables/orders.md`

```markdown
---
type: BigQuery Table
title: Orders
description: One row per finalized order across all channels (app, web, marketplace).
resource: https://console.cloud.google.com/bigquery?p=acme-prod&d=ecommerce&t=orders
tags: [sales, core, revenue]
timestamp: 2026-06-10T14:00:00Z
owner: data-team
sla_freshness: 30min
---
The primary billing table. Every revenue metric starts here.
# Schema
| Column | Type | Description |
|---|---|---|
| `order_id` | STRING | Order UUID. Primary key. |
| `customer_id` | STRING | FK → [customers](/tables/customers.md). |
| `total_usd` | NUMERIC | Total in USD (taxes included, shipping excluded). |
| `status` | STRING | `paid`, `refunded`, `cancelled`. |
| `channel` | STRING | `app`, `web`, `marketplace`. |
| `created_at` | TIMESTAMP | When the order was submitted. |
| `updated_at` | TIMESTAMP | Last status change. |
# Joins
- [customers](/tables/customers.md) via `customer_id`
- Used by [Gross Revenue](/metrics/gross-revenue.md) metric
# Notes
- Orders with `status = cancelled` are **excluded** from revenue calculations.
- Partitioned by `created_at` (DAY). Queries without a date filter will blow up your bill.
# Citations
[1] [Ingestion pipeline docs](https://wiki.acme.internal/data/pipeline-orders)
```

Notice that `sla_freshness: 30min` in the frontmatter isn’t part of the OKF spec; it’s a custom field. The spec explicitly allows this. Your agents can filter on it, your dashboards can read it. Use it.
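To make "agents can filter on it" concrete, here is a minimal sketch of reading a custom field such as `sla_freshness` out of the frontmatter. It assumes flat `key: value` frontmatter only (a real consumer would likely use a YAML parser), and the helper names `parse_frontmatter` and `freshness_minutes`, plus the `min`/`h` unit handling, are hypothetical and not part of OKF.

```python
import re
from typing import Dict, Optional


def parse_frontmatter(text: str) -> Dict[str, str]:
    """Parse a flat `key: value` frontmatter block delimited by `---` lines.

    Values are kept as raw strings (lists like `[a, b]` are not expanded).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def freshness_minutes(value: str) -> Optional[int]:
    """Convert strings like '30min' or '2h' to minutes (assumed unit format)."""
    match = re.fullmatch(r"(\d+)\s*(min|h)", value.strip())
    if not match:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 60 if unit == "h" else amount


ORDERS_DOC = """---
type: BigQuery Table
title: Orders
resource: https://console.cloud.google.com/bigquery?p=acme-prod&d=ecommerce&t=orders
owner: data-team
sla_freshness: 30min
---
The primary billing table. Every revenue metric starts here.
"""

meta = parse_frontmatter(ORDERS_DOC)
minutes = freshness_minutes(meta.get("sla_freshness", ""))
# Example routing rule: only use tables refreshed at least hourly.
is_fresh_enough = minutes is not None and minutes <= 60
print(meta["title"], minutes, is_fresh_enough)
```

Documents without the custom field simply yield `None` for the freshness, so a consumer can decide whether a missing value means "skip" or "unknown".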
### `tables/customers.md`

```markdown
---
type: BigQuery Table
title: Customers
description: Unified customer registry with RFM segmentation and lifecycle stage.
resource: https://console.cloud.google.com/bigquery?p=acme-prod&d=ecommerce&t=customers
tags: [customers, segmentation, core]
timestamp: 2026-06-08T10:00:00Z
owner: data-team
---
Canonical customer base. Every cohort analysis or segmentation query starts here.
# Schema
| Column | Type | Description |
|---|---|---|
| `customer_id` | STRING | UUID. Primary key. |
| `email` | STRING | Primary email (PII, restricted access). |
| `rfm_segment` | STRING | `champions`, `at_risk`, `hibernating`, etc. |
| `first_order_at` | TIMESTAMP | Date of first order. |
| `ltv_usd` | NUMERIC | Accumulated lifetime value. |
| `created_at` | TIMESTAMP | Registration date. |
# Joins
- [orders](/tables/orders.md) via `customer_id`
# Notes
- Updated daily by the RFM segmentation job (6am UTC).
- `email` requires the `pii-reader` role. Never expose it in public dashboards.
```

### `metrics/gross-revenue.md`
````markdown
---
type: Metric
title: Gross Revenue (GMV)
description: Sum of total_usd from paid orders, before returns.
tags: [revenue, kpi, finance]
timestamp: 2026-06-10T14:00:00Z
owner: finance-team
granularity: daily
---
The primary business KPI. Reported in the monthly board deck.
# Definition
```sql
SELECT
  DATE(created_at) AS day,
  SUM(total_usd) AS gross_revenue
FROM `acme-prod.ecommerce.orders`
WHERE status = 'paid'
GROUP BY 1
```
# Data source
Computed from the [orders](/tables/orders.md) table, filtering `status = 'paid'`.
# Gotchas
- Excludes refunded and cancelled orders.
- Does not deduct partial returns (see the `net-revenue` metric when it exists).
- Shipping is not included in `total_usd`.
# Citations
[1] CFO-approved metric definition (Confluence)
````

What makes this example work: the SQL definition is the single source of truth. There is no ambiguity about "what counts as revenue" because it's right there in the query. An agent or analyst can read this and reproduce the number exactly.
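As a rough illustration of "reproduce the number exactly", the sketch below runs the same query shape against a few made-up rows in an in-memory SQLite database. SQLite stands in for BigQuery here, the table name is shortened to `orders`, the sample rows are invented, and an `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT, total_usd REAL, status TEXT, created_at TEXT)"
)
# Hypothetical sample rows: three paid orders, one refunded, one cancelled.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("o1", 100.0, "paid", "2026-06-09 10:00:00"),
        ("o2", 50.0, "paid", "2026-06-09 18:30:00"),
        ("o3", 75.0, "refunded", "2026-06-09 19:00:00"),
        ("o4", 20.0, "cancelled", "2026-06-10 08:00:00"),
        ("o5", 30.0, "paid", "2026-06-10 09:15:00"),
    ],
)

# Same definition as the metric doc, adapted to the local table name.
GROSS_REVENUE_SQL = """
SELECT DATE(created_at) AS day, SUM(total_usd) AS gross_revenue
FROM orders
WHERE status = 'paid'
GROUP BY 1
ORDER BY 1
"""

rows = conn.execute(GROSS_REVENUE_SQL).fetchall()
for day, gross_revenue in rows:
    print(day, gross_revenue)
```

Only the `paid` rows contribute, which matches the first gotcha in the metric doc: refunded and cancelled orders never reach the sum.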
---
## 2. SaaS Incident Playbooks
An SRE/platform team documenting runbooks, alerts, and escalation procedures. The scenario: it's 3am, the pager went off, and the on-call engineer needs to know what to do without waking anyone up (yet).
### Folder tree
```
incident-playbooks/
├── index.md
├── alerts/
│   ├── index.md
│   ├── api-latency-p99.md
│   └── db-connections-exhausted.md
├── runbooks/
│   ├── index.md
│   └── escalate-incident.md
└── log.md
```
### `index.md`
```markdown
# Incident Playbooks
Operational knowledge for the Platform team. If the pager fired, start here.
# Alerts
* [API Latency P99](alerts/api-latency-p99.md) - latency above SLO at the gateway
* [DB Connections Exhausted](alerts/db-connections-exhausted.md) - connection pool depleted
# Runbooks
* [Escalate Incident](runbooks/escalate-incident.md) - when and how to escalate to leadership
```

### `alerts/api-latency-p99.md`
````markdown
---
type: Alert
title: API Latency P99 > 2s
description: Fires when API Gateway P99 latency exceeds 2 seconds for 5 minutes.
tags: [oncall, api, latency, sev2]
timestamp: 2026-05-20T09:00:00Z
owner: platform-team
severity: SEV2
slo: 99.5% requests < 2s
---
# Trigger
Prometheus alert rule:
```yaml
- alert: APILatencyP99High
  expr: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) > 2
  for: 5m
  labels:
    severity: sev2
```
# Diagnosis
- Open the Grafana API Latency dashboard.
- Check whether it’s a specific endpoint or system-wide.
- Check the connection pool. If exhausted, see [DB Connections Exhausted](/alerts/db-connections-exhausted.md).
- Check recent deploys: `kubectl rollout history deploy/api-gateway -n production`.
# Mitigation
- Single endpoint: enable circuit breaker via feature flag `cb_{endpoint}`.
- System-wide: scale pods with `kubectl scale deploy/api-gateway --replicas=10 -n production`.
- Still broken after 15min: follow [Escalate Incident](/runbooks/escalate-incident.md).
# Known false positives
- The financial reconciliation batch (daily at 2am UTC) causes a 3-4min spike. Ignore if it self-resolves.
# Citations
[1] SLO definitions (internal wiki)
````
Notice the structure here: Trigger → Diagnosis → Mitigation → False positives. That's not random. It mirrors what the on-call engineer's brain does: "What fired? → What's wrong? → How do I fix it? → Wait, is this even real?" Every alert doc should follow this sequence.
### `alerts/db-connections-exhausted.md`
````markdown
---
type: Alert
title: DB Connections Exhausted
description: PostgreSQL connection pool reached 95% capacity.
tags: [oncall, database, postgres, sev1]
timestamp: 2026-06-01T11:00:00Z
owner: platform-team
severity: SEV1
---
# Trigger
```yaml
- alert: DBConnectionPoolExhausted
  expr: pg_stat_activity_count / pg_settings_max_connections > 0.95
  for: 2m
  labels:
    severity: sev1
```
# Diagnosis
- Find long-running queries: `SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;`
- Check for connection leaks (app not closing connections).
- Check if [API Latency P99](/alerts/api-latency-p99.md) also fired. Cascading failure is common here.
# Immediate mitigation
- Kill queries running > 5min: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state != 'idle' AND pid <> pg_backend_pid() AND now() - query_start > interval '5 minutes';`
- If that doesn’t help: restart PgBouncer with `systemctl restart pgbouncer`.
- **Do NOT restart Postgres** without escalating first. See [Escalate Incident](/runbooks/escalate-incident.md).
# Common root causes
- Deploy with heavy migration missing `statement_timeout`.
- Report cronjob without connection pooling.
- N+1 query in a newly deployed endpoint.
````
The bold "Do NOT restart Postgres" is doing real work. At 3am, a panicking engineer might reach for the nuclear option. This doc prevents that.
### `runbooks/escalate-incident.md`
````markdown
---
type: Runbook
title: Escalate Incident
description: Escalation process for when to stop solo debugging and call for backup.
tags: [oncall, process, escalation]
timestamp: 2026-05-15T08:00:00Z
owner: platform-team
---
# When to escalate
Escalate **immediately** if:
- SEV1 with no mitigation after 10 minutes.
- SEV2 with no mitigation after 30 minutes.
- Any incident with visible financial impact (orders failing).
- You don't understand what's happening. This is valid. No shame.
# How to escalate
| Step | Action | Contact |
|------|--------|---------|
| 1 | Declare incident in Slack `#incidents` | `@oncall-platform` |
| 2 | Page engineering manager | PagerDuty escalation policy `platform-em` |
| 3 | If customer impact > 5min | Page `@oncall-cs` for comms |
| 4 | If revenue impacted | Page `@oncall-finance` |
# Declaration template
```
🚨 INCIDENT DECLARED
Severity: SEV{1|2}
Alert: {link to the alert that fired}
Impact: {what the user is experiencing}
Status: Investigating / Mitigating / Resolved
IC: @{your-name}
```
# Post-incident
- Postmortem required for SEV1, optional for SEV2.
- Deadline: 48h after resolution.
- Template: [Confluence Postmortem Template](https://wiki.acme.internal/templates/postmortem).
# Related
- [API Latency P99](/alerts/api-latency-p99.md): most common alert that triggers escalation.
- [DB Connections Exhausted](/alerts/db-connections-exhausted.md): second most common.
````

## 3. API Documentation
A bundle for documenting a REST API. Unlike OpenAPI, which is a contract spec, this bundle focuses on contextual knowledge: why the endpoint exists, edge cases, rate limits, real-world examples.
### Folder tree

```
api-docs/
├── index.md
├── auth/
│   ├── index.md
│   └── oauth2-flow.md
├── endpoints/
│   ├── index.md
│   ├── create-order.md
│   └── list-customers.md
├── policies/
│   ├── index.md
│   └── rate-limits.md
└── log.md
```

### `index.md`
```markdown
# API Docs: Acme Commerce API
Contextual documentation for the public API v2. For the raw OpenAPI spec, see the [Swagger UI](https://api.acme.com/docs).
# Authentication
* [OAuth2 Flow](auth/oauth2-flow.md) - how to obtain and refresh tokens
# Endpoints
* [Create Order](endpoints/create-order.md) - POST /v2/orders
* [List Customers](endpoints/list-customers.md) - GET /v2/customers
# Policies
* [Rate Limits](policies/rate-limits.md) - per-plan limits and control headers
```

The distinction between this bundle and OpenAPI matters. OpenAPI tells machines “what parameters does this endpoint accept.” This bundle tells humans and agents “what happens when you use it wrong, and why it was built this way.”
### `auth/oauth2-flow.md`

````markdown
---
type: Auth Flow
title: OAuth2 Client Credentials
description: Machine-to-machine authentication flow for the Acme API.
resource: https://api.acme.com/oauth/token
tags: [auth, oauth2, security]
timestamp: 2026-06-05T16:00:00Z
owner: api-team
---
# Flow
```bash
curl -X POST https://api.acme.com/oauth/token \
  -d grant_type=client_credentials \
  -d client_id="$CLIENT_ID" \
  -d client_secret="$CLIENT_SECRET" \
  -d scope="orders:write customers:read"
```
Response:
```json
{
  "access_token": "eyJhbGci...",
  "token_type": "Bearer",
  "expires_in": 3600,
  "scope": "orders:write customers:read"
}
```
# Usage
Include the token in every request header:
```
Authorization: Bearer {access_token}
```
# Available scopes
| Scope | Grants |
|---|---|
| `orders:read` | Read orders |
| `orders:write` | Create and update orders |
| `customers:read` | List and search customers |
| `customers:write` | Create and update customers |
# Watch out
- Token expires in 1 hour. Implement refresh before expiration, not after.
- Rate limit on `/oauth/token`: 10 req/min per `client_id`. Do not request a fresh token on every API call.
- Respect the general [rate limits](/policies/rate-limits.md) once authenticated.
# Citations
[1] RFC 6749, OAuth 2.0 Client Credentials
````
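The two token warnings (renew before expiry, and don't fetch a token per call) point to a small client-side cache. The following is a minimal sketch under assumptions: `TokenCache`, its 60-second safety margin, and the injected `fetch_token` callable are all hypothetical, and the fetch function here is a fake that stands in for the real `POST /oauth/token` call.

```python
import time
from typing import Callable, List, Optional, Tuple


class TokenCache:
    """Reuse one access token and renew it shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, int]],  # returns (access_token, expires_in)
        margin_seconds: int = 60,                    # assumed safety margin
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Renew when there is no token yet or we are inside the safety margin.
        if self._token is None or now >= self._expires_at - self._margin:
            token, expires_in = self._fetch_token()
            self._token = token
            self._expires_at = now + expires_in
        return self._token


# Demo with a fake token endpoint and a controllable clock.
fetch_calls: List[int] = []


def fake_fetch() -> Tuple[str, int]:
    fetch_calls.append(1)
    return f"token-{len(fetch_calls)}", 3600


fake_now = [0.0]
cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])

first = cache.get()      # first call fetches a token
fake_now[0] = 1800.0
second = cache.get()     # halfway through the lifetime: cached token is reused
fake_now[0] = 3550.0
third = cache.get()      # inside the 60 s margin: a new token is fetched
print(first, second, third, len(fetch_calls))
```

Over the simulated hour the fake endpoint is hit twice, which stays far below the documented 10 req/min limit on `/oauth/token`.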
### `endpoints/create-order.md`
````markdown
---
type: API Endpoint
title: Create Order
description: Creates a new order. Requires orders:write scope.
resource: https://api.acme.com/v2/orders
tags: [orders, write, core]
timestamp: 2026-06-10T10:00:00Z
owner: api-team
method: POST
path: /v2/orders
auth_scope: orders:write
---
# Request
```bash
curl -X POST https://api.acme.com/v2/orders \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "cust_abc123",
    "items": [
      {"sku": "WIDGET-01", "quantity": 2, "unit_price_usd": 49.90}
    ],
    "shipping_address_id": "addr_xyz789"
  }'
```
# Response (201 Created)
```json
{
  "order_id": "ord_def456",
  "status": "pending_payment",
  "total_usd": 99.80,
  "created_at": "2026-06-10T10:30:00Z"
}
```
# Common errors
| Status | Code | Cause | Fix |
|---|---|---|---|
| 400 | `invalid_sku` | SKU doesn’t exist in catalog | Validate first via `GET /v2/products` |
| 401 | `token_expired` | Token expired | Refresh via [OAuth2 flow](/auth/oauth2-flow.md) |
| 429 | `rate_limited` | Rate limit exceeded | See [Rate Limits](/policies/rate-limits.md) |
| 422 | `insufficient_stock` | Out of stock | Reduce quantity or wait for restock |
# Idempotency
Send `Idempotency-Key: {uuid}` in the header to ensure retries don’t duplicate orders.
# Notes
- `total_usd` is computed server-side. Don’t trust your client-side calculation.
- Orders stay `pending_payment` for 30 minutes. After that, they are cancelled automatically.
- This endpoint feeds the same entity documented in the analytics bundle: same order, different audience.
````
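The idempotency note only helps if the client reuses one key across all retries of the same order. A minimal sketch of that pattern follows; `create_order_with_retries` and the `send` callable are hypothetical stand-ins for whatever HTTP client performs the `POST /v2/orders` call, and retrying only on `ConnectionError` is an assumption.

```python
import uuid
from typing import Any, Callable, Dict, List


def create_order_with_retries(
    send: Callable[[Dict[str, Any], Dict[str, str]], Dict[str, Any]],
    payload: Dict[str, Any],
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Send the order, reusing ONE Idempotency-Key for every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Generated once per logical order, not once per HTTP attempt.
    headers = {
        "Idempotency-Key": str(uuid.uuid4()),
        "Content-Type": "application/json",
    }
    last_error: Exception = RuntimeError("no attempt made")
    for _ in range(max_attempts):
        try:
            return send(payload, headers)
        except ConnectionError as exc:  # assumed retryable failure
            last_error = exc
    raise last_error


# Demo transport: fails twice, then succeeds, recording the keys it saw.
seen_keys: List[str] = []


def flaky_send(payload: Dict[str, Any], headers: Dict[str, str]) -> Dict[str, Any]:
    seen_keys.append(headers["Idempotency-Key"])
    if len(seen_keys) < 3:
        raise ConnectionError("simulated timeout")
    return {"order_id": "ord_test", "status": "pending_payment"}


result = create_order_with_retries(flaky_send, {"customer_id": "cust_abc123"})
print(result["order_id"], len(seen_keys), len(set(seen_keys)))
```

In a real client the retry loop would also wait between attempts; that part is left out here to keep the key handling visible.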
### `policies/rate-limits.md`
````markdown
---
type: Policy
title: Rate Limits
description: Per-plan request limits, control headers, and throttle behavior.
tags: [api, rate-limit, policy, infra]
timestamp: 2026-06-01T09:00:00Z
owner: api-team
---
# Limits by plan
| Plan | Requests/min | Requests/day | Burst |
|------------|--------------|--------------|-------|
| Free | 60 | 10,000 | 10 |
| Pro | 600 | 100,000 | 50 |
| Enterprise | 6,000 | 1,000,000 | 200 |
# Response headers
Every response includes:
```
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 584
X-RateLimit-Reset: 1718020800
```
- `X-RateLimit-Reset` is a Unix timestamp for when the bucket resets.
# Throttle behavior
Response `429 Too Many Requests`:
```json
{
  "error": "rate_limited",
  "message": "Rate limit exceeded. Retry after 12 seconds.",
  "retry_after": 12
}
```
Implement exponential backoff. Don’t hammer the API in a tight loop; clients that do this get blocked for 24 hours.
# Endpoints with special limits
| Endpoint | Limit | Reason |
|---|---|---|
| `POST /oauth/token` | 10/min | Brute force prevention |
| `POST /v2/orders` | 30/min | Order spam prevention |
| `GET /v2/reports/*` | 5/min | Heavy queries |
# Best practices
- Cache aggressively on GETs that don’t change often (customers, products).
- Batch when possible: `POST /v2/orders/batch` accepts up to 50 orders per request.
- Monitor the `X-RateLimit-Remaining` header and throttle yourself before hitting the wall.
- Need more? Talk to sales or look at the Enterprise plan.
# Related
- [OAuth2 Flow](/auth/oauth2-flow.md): token endpoint has its own 10/min limit.
- [Create Order](/endpoints/create-order.md): endpoint with the special 30/min cap.
# Citations
[1] IETF RFC 6585, 429 Too Many Requests
[2] API Design Guidelines (internal)
````
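The "implement exponential backoff" advice can be sketched as a small delay calculator that also respects the `retry_after` value from the 429 body. The base delay, the 60-second cap, and the 50-100% jitter are assumptions for illustration; the policy doc does not specify them, and `backoff_delay` is a hypothetical helper name.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles each attempt up to `cap`, with jitter so that many
    clients do not retry in lockstep. If the server sent `retry_after`,
    never wait less than that.
    """
    exponential = min(cap, base * (2 ** attempt))
    jittered = exponential * (0.5 + 0.5 * rng())  # between 50% and 100% of the step
    if retry_after is not None:
        return max(float(retry_after), jittered)
    return jittered


# Using the 429 body from the policy doc: retry_after = 12 dominates early attempts.
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt, retry_after=12), 2))
```

Passing a fixed `rng` makes the function deterministic, which is convenient when unit-testing the retry policy.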
---
## Patterns across all three examples
1. **`type` is domain-specific.** There's no fixed list. Use `BigQuery Table`, `Alert`, `Runbook`, `API Endpoint`, `Policy`, `Metric` — whatever your team actually calls these things.
2. **Cross-links are generous.** If one concept references another, link it. It's cheap and makes the bundle navigable like a wiki. Notice how the DB connections alert links to the latency alert (cascade), and both link to the escalation runbook. That's a graph, not a folder.
3. **`index.md` is a map, not a junk drawer.** One line per item. Direct link. No throat-clearing.
4. **Extra frontmatter fields are encouraged.** The spec allows any additional key. `owner`, `severity`, `sla_freshness`, `method`, `auth_scope` — add whatever the consumer (human or agent) needs to filter and route.
5. **`# Citations` at the end.** External links that validate content. Agents use these to verify claims. Humans use them to dig deeper.
6. **Body is structured.** Headings, tables, code blocks. More structure means better retrieval by agents. Long prose paragraphs are noise — they bury the signal.
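Pattern 2 describes the bundle as a graph, and that graph can be extracted with very little code. The sketch below assumes documents are already loaded into a dict keyed by bundle-relative path and that internal links use bundle-root paths such as `/alerts/api-latency-p99.md`, as in the file bodies above; `build_link_graph` and `broken_links` are hypothetical helpers, not part of any OKF tooling.

```python
import re
from typing import Dict, List, Tuple

# Matches markdown links whose target ends in .md, e.g. [Title](/alerts/x.md).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+\.md)\)")


def build_link_graph(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each document path to the sorted internal .md paths it links to."""
    graph: Dict[str, List[str]] = {}
    for path, text in docs.items():
        targets = {
            target.lstrip("/")
            for target in LINK_PATTERN.findall(text)
            if "://" not in target  # skip external URLs
        }
        graph[path] = sorted(targets)
    return graph


def broken_links(graph: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (source, target) pairs whose target is not a known document."""
    return sorted(
        (src, dst) for src, targets in graph.items() for dst in targets if dst not in graph
    )


docs = {
    "alerts/api-latency-p99.md": (
        "See [DB Connections Exhausted](/alerts/db-connections-exhausted.md) "
        "or follow [Escalate Incident](/runbooks/escalate-incident.md)."
    ),
    "alerts/db-connections-exhausted.md": (
        "Check [API Latency P99](/alerts/api-latency-p99.md). "
        "See [Escalate Incident](/runbooks/escalate-incident.md)."
    ),
    "runbooks/escalate-incident.md": (
        "Related: [API Latency P99](/alerts/api-latency-p99.md). "
        "Template: [Postmortem](https://wiki.acme.internal/templates/postmortem)."
    ),
}

graph = build_link_graph(docs)
print(graph["alerts/api-latency-p99.md"])
print(broken_links(graph))
```

Running a check like `broken_links` in CI is one way to keep generous cross-linking from decaying into dead links as files are renamed.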