Open Knowledge Format (OKF) — An Annotated Guide

Version 0.1 — Draft

A developer’s walkthrough. Opinions included.

OKF is an open format for representing knowledge — metadata, context, and curated insight about your data and systems. People write it, agents generate it, organizations exchange it. Both humans and LLMs can consume it without special tools.

The format is a directory of markdown files with YAML frontmatter. That’s it. No schema registry, no central authority, no mandatory tooling. cat a file to read it. git clone a repo to distribute it.

What’s this for, concretely?
Think of OKF as the “README.md for your data catalog” — but with enough structure for an LLM agent to navigate on its own. Your data team documents tables, metrics, playbooks. An agent reads that without a proprietary SDK. Humans read it without opening any tool. That’s the pitch.

1. Motivation

The knowledge representation space for AI agents is moving fast. Incompatible conventions keep appearing. OKF starts from the premise that knowledge is best represented in accessible, established formats that are:

Readable by humans without special tools.
Parseable by agents without bespoke SDKs.
Diffable in version control.
Portable across tools, organizations, and time.

The format is minimally opinionated. It nails down the structural conventions needed for a knowledge corpus to be self-describing, and stops there. Everything else is up to you.

I think this is the right call. Knowledge management standards that try to do too much get adopted by nobody. OKF bets on the other extreme: almost no rules, maximum adoption surface.

Goals

Define a universal format that enrichment agents can write to.
Inform how consumer agents should read and navigate the content.
Enable exchange of knowledge between systems and organizations.
Standardize the small number of required fields that must exist for meaningful consumption.

Non-goals

Defining a fixed taxonomy of concept types.
Prescribing storage, serving, or query infrastructure.
Replacing domain-specific schemas (Avro, Protobuf, OpenAPI, etc.) — OKF references them; it doesn’t try to swallow them.

If you already have a .proto describing your schema — great. OKF doesn’t replace it. OKF adds business context, examples, and cross-links between assets that a .proto file will never have.

2. Terminology

Knowledge Bundle — A hierarchical, self-contained collection of knowledge documents. The unit of distribution. Think of it as a “package” you clone or download.
Concept — A single unit of knowledge within a bundle. Represented as one markdown file. Could describe a tangible asset (a table, an API) or an abstract idea (a metric, a business process).
Concept ID — The file path within the bundle, minus the .md suffix. Example: tables/users.md has concept ID tables/users.
Frontmatter — YAML metadata block delimited by --- at the top of a markdown file.
Body — Everything in the file after the frontmatter.
Link — A standard markdown link from one concept to another, used to express relationships beyond the implicit parent/child hierarchy.
Citation — A link from a concept to an external source that backs a claim made in the body.

If you use Obsidian, most of this is familiar. Bundle ≈ vault. Concept ≈ note. Frontmatter ≈ that YAML block at the top. Link ≈ wikilink. The difference is that OKF formalizes minimal rules for interoperability between tools.
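
The Concept ID rule is mechanical enough to sketch in a few lines. This is a minimal illustration, assuming paths are bundle-relative and use forward slashes; the function name concept_id is my own, not something the spec defines.

```python
from pathlib import PurePosixPath


def concept_id(bundle_relative_path: str) -> str:
    """Return the concept ID for a bundle-relative markdown path."""
    path = PurePosixPath(bundle_relative_path)
    if path.suffix != ".md":
        raise ValueError(f"not a markdown file: {bundle_relative_path}")
    # Drop only the final .md suffix; the directories stay part of the ID.
    return str(path.with_suffix(""))


print(concept_id("tables/users.md"))  # tables/users
```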

3. Bundle Structure

A bundle is a directory tree of markdown files. The directory structure is domain-independent — producers organize concepts however makes sense for the knowledge being captured.

path/to/bundle/
├── index.md                      # Optional. Directory listing (progressive disclosure).
├── log.md                        # Optional. Chronological update history.
├── <concept>.md                  # A concept at the bundle root.
└── <subdirectory>/               # Subdirectories group concepts.
    ├── index.md
    ├── <concept>.md
    └── <subdirectory>/
        └── …

A bundle MAY be distributed as:

A git repository (recommended — gives you history, attribution, diffs).
A tarball or zip of the directory.
A subdirectory within a larger repository.

In practice, the most common setup will be a knowledge/ or docs/catalog/ directory inside your monorepo. Nothing prevents a standalone repo, but the convenience of keeping it near the code is significant. The agent can look at dbt schemas AND OKF documentation in the same git clone.

3.1 Reserved Filenames

The following filenames have defined meaning at any level of the hierarchy and MUST NOT be used for concept documents:

| File | Purpose |
|------|---------|
| `index.md` | Directory listing. See §6. |
| `log.md` | Update history. See §7. |

All other .md files are concept documents.
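
A consumer walking a bundle has to make this distinction for every file it meets. Here is a rough sketch of that classification, assuming bundle-relative paths; the "other" bucket for non-markdown files is my assumption, since the text above only covers .md files.

```python
from pathlib import PurePosixPath

# Reserved filenames and the role they play at any level of the hierarchy.
RESERVED = {"index.md": "index", "log.md": "log"}


def classify(path: str) -> str:
    """Classify a bundle-relative path as 'index', 'log', 'concept', or 'other'."""
    p = PurePosixPath(path)
    if p.suffix != ".md":
        return "other"  # assumption: non-markdown files are ignored
    return RESERVED.get(p.name, "concept")


for p in ["index.md", "tables/index.md", "tables/orders.md", "log.md"]:
    print(p, "->", classify(p))
```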

Tags remain first-class citizens — see the tags field in frontmatter (§4.1). OKF does not specify a separate format for aggregating documents by tag; anyone wanting a tag view can synthesize one at consumption time by scanning frontmatters.
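
Synthesizing a tag view is just an inversion of the tags field. The sketch below assumes the frontmatter of each concept has already been parsed into a dict by whatever YAML parser you use; tag_view and the sample bundle are illustrative names and data, not part of the spec.

```python
from collections import defaultdict


def tag_view(frontmatters: dict[str, dict]) -> dict[str, list[str]]:
    """Invert {concept_id: frontmatter} into {tag: [concept_id, ...]}."""
    view = defaultdict(list)
    for concept_id in sorted(frontmatters):
        # tags is optional, so a missing or empty field contributes nothing.
        for tag in frontmatters[concept_id].get("tags") or []:
            view[tag].append(concept_id)
    return dict(view)


bundle = {
    "tables/orders": {"type": "BigQuery Table", "tags": ["sales", "orders"]},
    "datasets/sales": {"type": "BigQuery Dataset", "tags": ["sales"]},
    "playbooks/freshness": {"type": "Playbook"},  # untagged concept
}
print(tag_view(bundle))
```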

4. Concept Documents

Every concept is a UTF-8 markdown file with two parts:

A YAML frontmatter block, delimited by --- on its own line at the start of the file and a closing --- on its own line.
A markdown body, with free-form content.
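
Splitting a file into those two parts takes only a few lines. This is a minimal sketch that assumes the opening --- is the very first line; it returns the raw YAML text and leaves actual YAML parsing to a library of your choice. The name split_document is mine.

```python
def split_document(text: str) -> tuple[str, str]:
    """Split a concept file into (frontmatter_yaml, body)."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip("\r\n") != "---":
        raise ValueError("file does not start with a --- line")
    # The first later line that is exactly --- closes the frontmatter;
    # any --- further down belongs to the body (e.g. a horizontal rule).
    for i in range(1, len(lines)):
        if lines[i].rstrip("\r\n") == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    raise ValueError("frontmatter is missing its closing --- line")


frontmatter, body = split_document("---\ntype: Metric\n---\n\n# Definition\n")
print(repr(frontmatter))
print(repr(body))
```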

4.1 Frontmatter

---
type: <Type name>                  # REQUIRED
title: <Optional display name>
description: <One-line summary, optional>
resource: <Canonical URI of the underlying asset, optional>
tags: [<tag>, <tag>, …]            # Optional
timestamp: <ISO 8601 datetime>     # Optional, last significant update
# … other key/value pairs defined by the producer
---

Required:

type — A short string identifying the concept’s type. Consumers use this for routing, filtering, and presentation. Example values: BigQuery Table, BigQuery Dataset, API Endpoint, Metric, Playbook, Reference.
Type values are not registered centrally. Producers MUST choose descriptive, self-explanatory values; consumers MUST tolerate unknown types gracefully (typically treating them as generic concepts).

This is both liberating and dangerous.
You can create type: dbt Model or type: Kafka Topic without asking permission. But without governance, two teams at the same company might use type: Table and type: BigQuery Table for the same thing. Agree on conventions in your design doc before things drift.
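
On the consumer side, "tolerate unknown types" usually means a lookup with a generic fallback. A rough sketch, assuming frontmatter is already a dict: the handler functions and the ALIASES table are hypothetical, the latter standing in for whatever convention your team agrees on to stop Table and BigQuery Table from drifting apart.

```python
from typing import Callable


def render_table(fm: dict) -> str:
    return f"[table] {fm.get('title', '(untitled)')}"


def render_generic(fm: dict) -> str:
    # Fallback for any type this consumer has never heard of.
    return f"[concept:{fm['type']}] {fm.get('title', '(untitled)')}"


HANDLERS: dict[str, Callable[[dict], str]] = {"BigQuery Table": render_table}
ALIASES = {"Table": "BigQuery Table"}  # hypothetical team convention


def render(fm: dict) -> str:
    concept_type = ALIASES.get(fm["type"], fm["type"])
    return HANDLERS.get(concept_type, render_generic)(fm)


print(render({"type": "Table", "title": "Orders"}))
print(render({"type": "Kafka Topic", "title": "clicks"}))
```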

Recommended (in priority order):

title — Human-readable name. If omitted, consumers MAY derive a title from the filename.
description — A single sentence summarizing the concept. Used by index.md generators, search snippets, and previews.
resource — A URI that uniquely identifies the asset the concept describes. Absent for concepts describing abstract ideas.
tags — YAML list of short strings for cross-cutting categorization.
timestamp — ISO 8601 datetime of the last significant change.
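
The spec allows deriving a title from the filename but does not say how. One plausible approach, offered as an assumption rather than a rule: take the last segment of the concept ID and turn separators into spaces. The function name display_title is mine.

```python
def display_title(frontmatter: dict, concept_id: str) -> str:
    """Prefer the explicit title; otherwise derive one from the concept ID."""
    title = frontmatter.get("title")
    if title:
        return str(title)
    # Fallback: last path segment, with - and _ turned into spaces.
    stem = concept_id.rsplit("/", 1)[-1]
    return stem.replace("-", " ").replace("_", " ").title()


print(display_title({"type": "BigQuery Table"}, "tables/customer-metrics"))
```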

Extensions: Producers MAY include additional keys. Consumers MUST preserve unknown keys on round-trip and MUST NOT reject documents with unrecognized fields.

Useful extension example: Add owner: data-team@company.com or freshness_sla: 30m to the frontmatter. The spec doesn’t prohibit it, and a smart agent can use those extra metadata fields for decision-making.
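
Preserving unknown keys on round-trip mostly means not filtering through a fixed schema when you write a file back. The sketch below is deliberately simplified: it assumes flat key: value frontmatter, keeps values as raw strings, and drops YAML comments. A real tool would use a proper YAML library; the helper names are mine.

```python
def parse_flat_frontmatter(yaml_text: str) -> dict[str, str]:
    """Parse flat `key: value` lines, keeping insertion order and raw values."""
    fields = {}
    for line in yaml_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # simplification: blank lines and comments are dropped
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"unsupported frontmatter line: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


def dump_flat_frontmatter(fields: dict[str, str]) -> str:
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


def touch(yaml_text: str, timestamp: str) -> str:
    """Update timestamp while carrying every other key through untouched."""
    fields = parse_flat_frontmatter(yaml_text)
    fields["timestamp"] = timestamp
    return dump_flat_frontmatter(fields)


original = "type: BigQuery Table\nowner: data-team@company.com\nfreshness_sla: 30m\n"
print(touch(original, "2026-06-01T00:00:00Z"))
```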

4.2 Body

The body is standard markdown. Producers SHOULD prefer structural markdown — headings, lists, tables, fenced code blocks — over free prose. Structure helps humans scan and helps agents retrieve specific sections.

There are no mandatory body sections. The following headings have conventional meaning and SHOULD be used when applicable:

| Heading | Purpose |
|---------|---------|
| `# Schema` | Structured description of an asset’s columns/fields. |
| `# Examples` | Concrete usage examples, usually in code blocks. |
| `# Citations` | External sources backing claims in the body. See §8. |

Why structure matters for agents
An LLM with RAG performs noticeably better when documents have clear headings. Ask “show me the schema for the orders table” and the agent jumps straight to # Schema instead of parsing a wall of text. I’ve tested this: unstructured docs lead to hallucinated schemas. Headings are cheap insurance.
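
To make "jumps straight to # Schema" concrete, here is a rough sketch of heading-based retrieval. It assumes ATX-style headings (lines starting with #) and skips anything inside fenced code blocks; get_section is an illustrative name, not an OKF API.

```python
import re
from typing import Optional

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def get_section(body: str, title: str) -> Optional[str]:
    """Return the text under the first heading named `title`, or None."""
    lines = body.splitlines()
    start = level = None
    in_fence = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # a '# comment' inside code is not a heading
            continue
        match = None if in_fence else HEADING.match(line)
        if not match:
            continue
        if start is None:
            if match.group(2).lower() == title.lower():
                start, level = i + 1, len(match.group(1))
        elif len(match.group(1)) <= level:
            # A heading at the same or a higher level ends the section.
            return "\n".join(lines[start:i]).strip()
    return None if start is None else "\n".join(lines[start:]).strip()


body = "# Schema\n\n| Column | Type |\n\n# Joins\n\nJoin via `customer_id`.\n"
print(get_section(body, "Schema"))
```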

4.3 Example: Concept Linked to a Resource

---
type: BigQuery Table
title: Customer Orders
description: One row per completed customer order across all channels.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders
tags: [sales, orders, revenue]
timestamp: 2026-05-28T14:30:00Z
---

# Schema

| Column        | Type      | Description                             |
|---------------|-----------|-----------------------------------------|
| `order_id`    | STRING    | Globally unique order identifier.       |
| `customer_id` | STRING    | FK to [customers](/tables/customers.md).|
| `total_usd`   | NUMERIC   | Order total in USD.                     |
| `placed_at`   | TIMESTAMP | When the customer submitted the order.  |

# Joins

Join with [customers](/tables/customers.md) via `customer_id`.

# Citations

[1] [BigQuery table schema](https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders)

4.4 Example: Concept Without a Resource

---
type: Playbook
title: Incident Response — Freshness Alert
description: Steps to triage a freshness alert on the orders pipeline.
tags: [oncall, incident]
timestamp: 2026-04-12T09:00:00Z
---

# Trigger

The freshness alert fires when `orders` falls more than 30 minutes
behind expected SLA. See the [orders table](/tables/orders.md).

# Steps

1. Check the [ingestion job dashboard](https://example.com/dash).
2. Verify the source system is responding.
3. If it's a source failure, escalate to the partner team via Slack #data-incidents.
4. If it's an internal failure, escalate via PagerDuty.

4.5 Example: A Metric Concept

The spec doesn’t show this case explicitly, but it works well:

---
type: Metric
title: Revenue per Customer (LTV 90d)
description: Cumulative revenue per customer over the last 90 days.
tags: [finance, growth, kpi]
timestamp: 2026-06-01T10:00:00Z
---

# Definition

Sum of `total_usd` from the [orders](/tables/orders.md) table grouped
by `customer_id`, filtered to `placed_at >= CURRENT_DATE - 90`.

# Reference SQL

```sql
SELECT customer_id, SUM(total_usd) AS ltv_90d
FROM `acme.sales.orders`
-- placed_at is a TIMESTAMP, so convert it before comparing against a DATE.
WHERE DATE(placed_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY customer_id
```

# Limitations

- Does not include refunds (refunds table not yet ingested).
- Excludes orders with status cancelled.

5. Cross-linking

Concepts MAY link to other concepts using standard markdown links. Two forms are supported:

5.1 Absolute Links (Bundle-relative)

Links that start with / are interpreted relative to the bundle root.

See the [customers table](/tables/customers.md) for the join key.

This is the recommended form because the target does not depend on where the linking document lives: the link stays valid when that document is moved to another directory.

5.2 Relative Links

Standard markdown relative paths.

See the [neighboring concept](./other.md).
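
Both link forms resolve to a concept ID with a little path arithmetic. A minimal sketch, assuming the target is already known to be an in-bundle link (not an external URL) and carries no #fragment; resolve_link is a name I made up.

```python
import posixpath


def resolve_link(source_concept_id: str, target: str) -> str:
    """Resolve a link target to the concept ID it points at."""
    if target.startswith("/"):
        # Bundle-relative: the source document's location is irrelevant.
        path = target.lstrip("/")
    else:
        # Relative: resolve against the directory of the linking concept.
        path = posixpath.join(posixpath.dirname(source_concept_id), target)
    path = posixpath.normpath(path)
    if path == ".." or path.startswith("../"):
        raise ValueError(f"link escapes the bundle: {target}")
    return path[:-3] if path.endswith(".md") else path


print(resolve_link("tables/orders", "/tables/customers.md"))  # tables/customers
print(resolve_link("tables/orders", "./other.md"))            # tables/other
```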

5.3 Link Semantics

A link from concept A to concept B asserts a relationship. What kind of relationship — parent/child, join, dependency — is conveyed by the surrounding prose, not the link syntax. Consumers building a graph view typically treat all links as untyped directed edges.

Consumers MUST tolerate broken links — a link whose target doesn’t exist in the bundle is not malformed; it may simply represent knowledge not yet written.

Broken links are a feature, not a bug.
You can write [refunds table](/tables/refunds.md) before that file exists. Reference first, fill in later. I’ve seen too many systems where you can’t reference something until it’s fully defined. That kills momentum for teams documenting incrementally.
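
A graph builder that follows these semantics records every link as an untyped directed edge and reports dangling targets instead of failing on them. This sketch assumes bodies are keyed by concept ID and only looks at bundle-relative links ending in .md; link_graph and the sample bodies are illustrative.

```python
import re

# Matches [text](/path/to/concept.md); external URLs do not start with "/".
LINK = re.compile(r"\[[^\]]*\]\((/[^)\s]+\.md)\)")


def link_graph(bodies: dict[str, str]) -> tuple[set, set]:
    """Return (edges, broken) as sets of (source_id, target_id) pairs."""
    edges, broken = set(), set()
    for source, body in bodies.items():
        for target in LINK.findall(body):
            target_id = target[1:-3]  # strip leading "/" and trailing ".md"
            edges.add((source, target_id))
            if target_id not in bodies:
                broken.add((source, target_id))  # reported, never fatal
    return edges, broken


bodies = {
    "tables/orders": "FK to [customers](/tables/customers.md). "
                     "See the [refunds table](/tables/refunds.md).",
    "tables/customers": "Referenced by [orders](/tables/orders.md).",
}
edges, broken = link_graph(bodies)
print(sorted(broken))
```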

6. Index Files

An index.md MAY appear in any directory, including the bundle root. It lists the directory’s contents for progressive disclosure — you see what’s available without opening each document.

Index files contain no frontmatter, with one exception: the root index.md may declare okf_version (see §11). The body uses one or more sections, each grouping concepts under a heading:

# Sales Tables

* [Orders](orders.md) - One row per completed order
* [Customers](customers.md) - Customer master data

# Support Tables

* [Tickets](tickets.md) - Support tickets

Entries SHOULD include the description from the linked concept’s frontmatter. Producers MAY generate index.md automatically; consumers MAY synthesize one on the fly when none is present.

Automating index.md
In practice, you’ll want a script that walks the bundle and generates index.md files from each concept’s title and description fields. Quick and dirty:
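for f in tables/*.md; do
  name=$(basename "$f")
  # Reserved filenames are not concepts, so they get no index entry.
  case "$name" in index.md|log.md) continue ;; esac
  # -m1 takes the first match only, so a body line starting with "title:" is ignored.
  title=$(grep -m1 '^title:' "$f" | sed 's/^title: *//')
  desc=$(grep -m1 '^description:' "$f" | sed 's/^description: *//')
  # Fall back to the filename when a concept has no title.
  echo "* [${title:-$name}]($name) - $desc"
done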
Is this crude? Yes. Does it work for 90% of cases? Also yes.

7. Log Files (Optional)

A log.md MAY appear at any level of the hierarchy to record the change history at that scope. The format is a flat list of entries grouped by date, most recent first:

# Update Log

## 2026-05-22
* **Update**: Added new BigQuery table reference for [Customer Metrics](/tables/customer-metrics.md).
* **Create**: Established the [Dataplex Playbook](/playbooks/dataplex.md).

## 2026-05-15
* **Init**: Created foundational directory structure.
* **Update**: Added progressive-disclosure guidelines to root [index](/index.md).

Date headings MUST use ISO 8601 YYYY-MM-DD format. Log entries are prose; the initial bold word (**Update**, **Create**, **Deprecation**, etc.) is a convention, not a requirement.
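
Because the only hard rule is the date heading, a log parser can stay small. A rough sketch, assuming entries are single-line * bullets as in the example above; the (date, kind, prose) tuple shape is my own choice.

```python
import re
from datetime import date

DATE_HEADING = re.compile(r"^##\s+(.+?)\s*$")
ENTRY = re.compile(r"^\*\s+(?:\*\*(\w+)\*\*:\s*)?(.*)$")


def parse_log(text: str) -> list:
    """Return (date, kind, prose) tuples; kind is None without a bold prefix."""
    entries, current = [], None
    for line in text.splitlines():
        heading = DATE_HEADING.match(line)
        if heading:
            raw = heading.group(1)
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw):
                raise ValueError(f"date heading is not YYYY-MM-DD: {raw!r}")
            current = date.fromisoformat(raw)  # also rejects impossible dates
            continue
        entry = ENTRY.match(line)
        if entry and current is not None:
            kind, prose = entry.groups()
            entries.append((current, kind, prose))
    return entries


log = (
    "# Update Log\n\n"
    "## 2026-05-22\n* **Update**: Added X.\n* **Create**: Made Y.\n\n"
    "## 2026-05-15\n* Plain entry.\n"
)
for item in parse_log(log):
    print(item)
```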

log.md vs git log
log.md doesn’t replace git log. Different audience. A human or agent skimming the log wants high-level changes: “they added the metrics table in May.” Git log tells the granular commit story. Think of log.md as a hand-written CHANGELOG for a knowledge base.

8. Citations

When a concept’s body makes claims based on external material, those sources MUST be listed under a # Citations heading at the end of the document, numbered:

# Citations

[1] [BigQuery public dataset announcement](https://cloud.google.com/blog/products/data-analytics/...)
[2] [Internal data quality runbook](https://wiki.acme.internal/data/quality)

Citation links MAY be absolute URLs, bundle-relative paths, or paths within a references/ subdirectory that mirrors external material as first-class OKF concepts.
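
If an agent is going to "show its work", it needs the citations as data. The sketch below assumes one citation per line in the [n] [title](target) shape shown above; the /references/quality-runbook.md path is a made-up example of a mirrored source, and the function names are mine.

```python
import re

CITATION = re.compile(r"^\[(\d+)\]\s+\[([^\]]+)\]\(([^)\s]+)\)\s*$")


def parse_citations(section: str) -> dict:
    """Map citation number -> (title, target) for each well-formed line."""
    found = {}
    for line in section.splitlines():
        match = CITATION.match(line.strip())
        if match:
            found[int(match.group(1))] = (match.group(2), match.group(3))
    return found


def target_kind(target: str) -> str:
    if "://" in target:
        return "absolute URL"
    if target.startswith("/"):
        return "bundle-relative path"
    return "relative path"


section = (
    "[1] [Internal data quality runbook](https://wiki.acme.internal/data/quality)\n"
    "[2] [Mirrored runbook](/references/quality-runbook.md)\n"  # hypothetical path
)
for number, (title, target) in parse_citations(section).items():
    print(number, title, "->", target_kind(target))
```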

Why citations matter for agents
When an LLM answers “table X has a freshness SLA of 30 minutes,” a citation lets you trace where that came from. Without citations, the agent hallucinates with confidence. With them, you can validate the source. “Show your work” for AI. I think this is the most underrated part of the spec — most teams skip it and pay for it later when debugging agent answers.

9. Conformance

A bundle is conformant with OKF v0.1 if:

1. Every non-reserved .md file in the tree contains a parseable YAML frontmatter block.
2. Every frontmatter block contains a non-empty type field.
3. Every reserved filename (index.md, log.md) follows the structure described in §6 and §7, respectively, when present.

Consumers MUST treat all other constraints as soft guidance. In particular, consumers MUST NOT reject a bundle because of:

Missing optional frontmatter fields.
Unknown type values.
Unknown additional frontmatter keys.
Broken cross-links.
Missing index.md files.

This permissive model is intentional. Bundles grow, get refactored, get partially generated by agents. Strict validation would break constantly.
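
Here is roughly what a permissive checker looks like. It is a sketch under stated assumptions: the bundle is given as a {path: text} dict, "parseable YAML" is approximated by finding the --- delimiters and a top-level type: line, and the structure of index.md and log.md is not checked at all. Note what it never complains about: unknown keys, unknown types, broken links.

```python
from typing import Optional

RESERVED = {"index.md", "log.md"}


def frontmatter_lines(text: str) -> Optional[list]:
    """Return the lines between the opening and closing ---, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i]
    return None  # never closed


def conformance_errors(files: dict) -> list:
    errors = []
    for path in sorted(files):
        name = path.rsplit("/", 1)[-1]
        if not name.endswith(".md") or name in RESERVED:
            continue
        fm = frontmatter_lines(files[path])
        if fm is None:
            errors.append(f"{path}: no frontmatter block")
            continue
        types = [l.split(":", 1)[1].strip() for l in fm if l.startswith("type:")]
        if not types or not types[0].strip("'\""):
            errors.append(f"{path}: missing or empty type")
    return errors


files = {
    "index.md": "# Root\n",
    "tables/orders.md": "---\ntype: BigQuery Table\nodd_key: 1\n---\nSee [x](/missing.md)\n",
    "tables/bad.md": "# no frontmatter\n",
    "tables/empty.md": "---\ntype:\ntitle: x\n---\n",
}
print(conformance_errors(files))
```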

The golden rule of conformance
If it has frontmatter with type, it’s valid OKF. Full stop. Everything else is optional. Your team can start with minimal concepts and add richness over time. A linter checking rules 1 and 2 is about 10 lines of bash.

10. Relationship to Other Formats

OKF is intentionally close to several established patterns:

LLM “wiki” repositories that use markdown + frontmatter as agent-readable knowledge bases.
Personal knowledge tools like Obsidian and Notion, which use hierarchical markdown with cross-links.
“Metadata as code” approaches that store catalog metadata alongside source code rather than in a separate registry.

Where OKF differs: it’s specified. A small set of rules that enable interoperability, without dictating tools.

Here’s an honest comparison:

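| Format/Tool | Human-readable? | Agent-readable? | Specified? | Portable? |
|-------------|-----------------|-----------------|------------|-----------|
| Obsidian vault | ✅ | Sort of | ❌ | ❌ (plugins) |
| Notion export | ✅ | ❌ (messy JSON) | ❌ | ❌ |
| DataHub/OpenMetadata | ❌ (needs UI) | ✅ (API) | ✅ | Partial |
| OKF | ✅ | ✅ | ✅ | ✅ |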

OKF occupies a specific niche: maximum portability, minimum ceremony. Will this niche sustain a formal standard? I don’t know. But the adoption bar is so low that even if OKF stays informal, the pattern is worth copying.

11. Versioning

This document specifies OKF version 0.1. Future revisions will follow the <major>.<minor> format:

A minor version bump introduces backward-compatible additions (new optional fields, new conventional section headings).
A major version bump may introduce breaking changes (renaming required fields, changing reserved filenames).

Bundles MAY declare the OKF version they target by including okf_version: "0.1" in the frontmatter of the root index.md (the only place where frontmatter is permitted in an index.md). Consumers that don’t understand the declared version MUST attempt best-effort consumption rather than refusing the bundle.
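
The "never refuse" rule translates into a version check that can only downgrade how much the consumer trusts its own understanding. A minimal sketch: the mode names native, best-effort, and undeclared are my labels, and treating a missing declaration as its own case is an assumption the spec does not spell out.

```python
from typing import Optional

SUPPORTED = (0, 1)  # the OKF version this hypothetical consumer implements


def parse_version(declared: str) -> Optional[tuple]:
    major, sep, minor = declared.partition(".")
    try:
        return (int(major), int(minor)) if sep else None
    except ValueError:
        return None


def consumption_mode(declared: Optional[str]) -> str:
    """Pick a reading strategy; there is deliberately no 'reject' outcome."""
    if declared is None:
        return "undeclared"
    version = parse_version(declared)
    if version is None:
        return "best-effort"  # unparseable declaration: still read the bundle
    same_major = version[0] == SUPPORTED[0]
    return "native" if same_major and version[1] <= SUPPORTED[1] else "best-effort"


for declared in ["0.1", "0.2", "1.0", "banana", None]:
    print(declared, "->", consumption_mode(declared))
```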

Appendix A — Minimal Example Bundle

my_bundle/
├── index.md
├── datasets/
│   ├── index.md
│   └── sales.md
└── tables/
    ├── index.md
    ├── orders.md
    └── customers.md

datasets/sales.md:

---
type: BigQuery Dataset
title: Sales
description: All sales tables for the retail business.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales
tags: [sales]
timestamp: 2026-05-28T00:00:00Z
---

The sales dataset contains transactional tables, including
[orders](/tables/orders.md) and [customers](/tables/customers.md).

tables/orders.md:

---
type: BigQuery Table
title: Orders
description: One row per completed customer order.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders
tags: [sales, orders]
timestamp: 2026-05-28T00:00:00Z
---

# Schema

| Column        | Type      | Description                 |
|---------------|-----------|-----------------------------|
| `order_id`    | STRING    | Unique identifier.          |
| `customer_id` | STRING    | FK to [customers](/tables/customers.md). |
| `total_usd`   | NUMERIC   | Order total in USD.         |

Part of the [sales dataset](/datasets/sales.md).

Appendix B — Adoption Guide

This isn’t part of the official spec. It’s my suggested path for teams adopting OKF.

Step 1: Pick your scope

Don’t try to document everything at once. Start with the 10 most-queried tables, or the 5 on-call playbooks currently lost in Confluence.

Step 2: Create the minimal structure

mkdir -p knowledge/{tables,metrics,playbooks}

Step 3: Write your first concept

---
type: BigQuery Table
title: users
description: Platform user master table.
resource: bigquery://project.dataset.users
tags: [core, identity]
timestamp: 2026-06-01T00:00:00Z
---

# Schema

| Column      | Type      | Description              |
|-------------|-----------|--------------------------|
| `user_id`   | STRING    | User UUID.               |
| `email`     | STRING    | Primary email.           |
| `created_at`| TIMESTAMP | Account creation date.   |

# Notes

- PII data: `email` is masked in non-prod environments.
- Partitioned by `created_at` (daily).

Step 4: Automate

CI that validates every .md has frontmatter with type.
Script that regenerates index.md on every PR.
(Optional) Agent that enriches new concepts with real schema via BigQuery/Snowflake API.

A minimal CI check:

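#!/bin/bash
# validate-okf.sh: fails if any concept is missing a non-empty type
set -euo pipefail

status=0
# Process substitution keeps the loop in this shell, so $status survives it.
while IFS= read -r f; do
  # Plain grep (not -q) reads all its input, so head never hits SIGPIPE under pipefail.
  if ! head -n 20 "$f" | grep '^type:[[:space:]]*[^[:space:]]' >/dev/null; then
    echo "FAIL: $f missing 'type' in frontmatter"
    status=1
  fi
done < <(find knowledge/ -name "*.md" ! -name "index.md" ! -name "log.md")

# Report every failing file before exiting, not just the first one.
[ "$status" -eq 0 ] && echo "All concepts valid."
exit "$status"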

Step 5: Distribute

Git push. Done. Your bundle is a repo (or subdirectory). Other teams clone it and have access. No vendor accounts required.

Appendix C — Design Opinions

Observations on the spec’s design choices, with some criticism:

The untyped-link decision is pragmatic but limiting. You can’t query “show me all tables that join to customers” without parsing prose around links. A typed-link syntax (like [customers](/tables/customers.md "joins-via:customer_id")) would enable richer graph queries. I suspect a future version will add optional link attributes.
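
To show what that would buy you, here is a sketch that reads the markdown link title as a relation label. To be clear, this typed syntax is my speculation, not anything OKF v0.1 defines; the regex and the typed_edges helper are purely illustrative.

```python
import re

# [text](target "optional title"); the title is where a relation could live.
TYPED_LINK = re.compile(r'\[([^\]]*)\]\((\S+?)(?:\s+"([^"]*)")?\)')


def typed_edges(source_id: str, body: str) -> list:
    """Return (source, target, relation) triples; 'untyped' when no title is given."""
    edges = []
    for _text, target, title in TYPED_LINK.findall(body):
        relation = title.partition(":")[0] if title else "untyped"
        edges.append((source_id, target, relation))
    return edges


body = (
    'FK to [customers](/tables/customers.md "joins-via:customer_id"). '
    "Part of the [sales dataset](/datasets/sales.md)."
)
for edge in typed_edges("tables/orders", body):
    print(edge)
```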

No schema for the body is both a strength and weakness. It makes adoption trivial. But it means two teams documenting BigQuery tables might structure their bodies completely differently. If you’re adopting OKF across an org, define your own body templates alongside it.

The resource field as a URI is underspecified. Can it be a BigQuery resource path? A custom URI scheme? An ARN? The spec says “URI” without constraining the scheme, which means consumers can’t reliably resolve resources. Workable for now; will need tightening.
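
In practice a consumer ends up dispatching on the URI scheme and treating everything else as an opaque identifier. A rough sketch of that pattern; the KNOWN_SCHEMES table and its action strings are placeholders I invented, not behavior the spec prescribes.

```python
from urllib.parse import urlparse

# Hypothetical registry: which schemes this consumer knows how to resolve.
KNOWN_SCHEMES = {
    "https": "open in browser",
    "bigquery": "look up via a BigQuery client",
}


def plan_for_resource(resource: str) -> str:
    """Decide what to do with a resource value based on its URI scheme."""
    scheme = urlparse(resource).scheme
    # Unknown or missing scheme: keep the value, just do not try to resolve it.
    return KNOWN_SCHEMES.get(scheme, "keep as an opaque identifier")


print(plan_for_resource("bigquery://project.dataset.users"))
print(plan_for_resource("https://console.cloud.google.com/bigquery?p=acme&d=sales"))
```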

Frontmatter-only validation is genius. Reducing conformance to “has frontmatter + has type” means validation is a one-liner and adoption costs almost nothing. Most specs fail because conformance demands too much upfront work.

Source: GoogleCloudPlatform/knowledge-catalog — okf/SPEC.md

Guide written: June 2026. OKF v0.1 Draft.