PDF-to-MD output is unpredictable
Converter tools produce markdown with inconsistent headings, broken tables, and missing structure. Without validation, bad output silently enters your pipeline.
Latest Release: mdshape Core v0.1
Define a schema, parse any markdown, get strongly-typed JSON back. Built for RAG pipelines, PDF-to-MD validation, AI Skills, and structured content ingestion.
document() section() match() block()
The Problem
You get markdown from converters, authors, and imports. But turning it into usable, structured data always ends in fragile custom code.
Chunking raw markdown for vector databases loses context. Without typed extraction, your retrieval quality degrades and you can't trust what's stored.
Remark plugins, regex extraction, Zod schemas stitched together — each project reinvents markdown parsing with a fragile, untested custom layer.
Built and Documented
Core Capabilities
mdshape handles parsing, validation, and typed extraction in a single runtime — so you stop writing custom glue for every project.
| Name | Alex Turner |
| Email | alex@zayra.com |
| Level | P1 |
| Escalation | immediate |
How It Works
Three steps: your markdown, your schema, your structured data. Works with any source — PDF converters, authored docs, imported files.
Your markdown can come from a PDF converter, a content author, an import tool, or any other source.
# RUNBOOK: Payment Risk Incident
## 1. OWNER
- Name: Alex Turner
- Email: alex@zayra.com
## 2. SEVERITY
- Level: P1
- Escalation: immediate

Define the structure you expect. mdshape validates and extracts in one pass.
const schema = md.document({
  title: md.heading(1),
  owner: md.section('1. OWNER').fields({
    Name: md.string(),
    Email: md.email(),
  }),
  severity: md.section('2. SEVERITY').fields({
    Level: md.string(),
    Escalation: md.string(),
  }),
})

Get structured, strongly-typed output ready for your database, RAG pipeline, or API.
{
  "success": true,
  "data": {
    "title": "RUNBOOK: Payment Risk Incident",
    "owner": {
      "Name": "Alex Turner",
      "Email": "alex@zayra.com"
    },
    "severity": {
      "Level": "P1",
      "Escalation": "immediate"
    }
  }
}

Comparison
You can assemble your own stack, but you'll write the glue yourself. Here's what you get out of the box with mdshape versus building the same thing from other tools.
| Capability | mdshape | Zod + remark | Markdoc | Contentlayer | Valibot + custom |
|---|---|---|---|---|---|
| Markdown → typed JSON in one call | Native | Custom required | Custom required | Partial | Custom required |
| Structure validation (heading order, section sequence, field presence) | Native | Custom required | Custom required | Custom required | Custom required |
| Rich block extraction (tables, mermaid, math, footnotes) | Native | Custom required | Partial | Partial | Custom required |
| Typed diagnostics with code, path, and line number | Native | Partial | Partial | Partial | Partial |
| Ready for production without custom integration layer | Native | Custom required | Custom required | Partial | Custom required |
Based on documented default capabilities as of each tool's latest stable release.
Common Questions
mdshape can validate PDF-to-markdown converter output. Define a schema with the structure you expect, run safeParse on the converter output, and get typed diagnostics for every deviation: missing headings, wrong field order, broken tables.
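A gate in front of the pipeline could look roughly like the sketch below. The `ParseResult` union mirrors the `success` / `data` output shown on this page, but the `issues` field, the `Issue` type, and `gateConverterOutput` are hypothetical names. The parse function is passed in as a parameter instead of being imported from mdshape, so nothing here should be read as the library's actual API.

```typescript
// Hypothetical diagnostic shape; field names are assumptions, not mdshape's API.
interface Issue {
  code: string;
  path: string;
  line: number;
}

// Result union modelled on the `success` / `data` output shown on this page.
// The `issues` field on the failure branch is an assumption.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; issues: Issue[] };

// Any safeParse-style function: markdown in, typed result out.
type SafeParse<T> = (markdown: string) => ParseResult<T>;

interface GateReport<T> {
  accepted: { file: string; data: T }[];
  rejected: { file: string; issues: Issue[] }[];
}

// Run every converted file through the parser and keep failures out of the pipeline.
function gateConverterOutput<T>(
  files: Record<string, string>,
  safeParse: SafeParse<T>,
): GateReport<T> {
  const report: GateReport<T> = { accepted: [], rejected: [] };
  for (const [file, markdown] of Object.entries(files)) {
    const result = safeParse(markdown);
    if (result.success) {
      report.accepted.push({ file, data: result.data });
    } else {
      report.rejected.push({ file, issues: result.issues });
    }
  }
  return report;
}
```

Only the `accepted` entries would move on to storage or indexing; the `rejected` entries carry their issues so the conversion can be fixed or retried.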
Instead of chunking raw markdown and losing context, mdshape extracts structured, typed JSON from your documents. Each field, section, and block becomes a typed entry you can store in your vector database with full context preserved.
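One way to keep that context attached is to store each extracted value together with its field path. The sketch below assumes the typed output is plain JSON like the example on this page. `VectorRecord` and `toRecords` are hypothetical names that are not part of mdshape, and the embedding and database calls are left out.

```typescript
// A JSON-like value, as produced by a schema-driven extraction step.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical record shape for a vector store; not part of mdshape.
interface VectorRecord {
  id: string;    // document id plus field path
  path: string;  // where the value lives in the typed output, e.g. "owner.Email"
  text: string;  // text to embed, prefixed with its path for context
  value: string; // the extracted value as a string
}

// Walk the typed output and emit one record per leaf value,
// carrying the full field path so no entry is stored without context.
function toRecords(docId: string, data: Json, path = ""): VectorRecord[] {
  if (data === null) return [];
  if (typeof data !== "object") {
    const value = String(data);
    return [{ id: `${docId}#${path}`, path, text: `${path}: ${value}`, value }];
  }
  const records: VectorRecord[] = [];
  if (Array.isArray(data)) {
    data.forEach((item, i) => records.push(...toRecords(docId, item, `${path}[${i}]`)));
  } else {
    for (const [key, child] of Object.entries(data)) {
      records.push(...toRecords(docId, child, path ? `${path}.${key}` : key));
    }
  }
  return records;
}

const runbookRecords = toRecords("runbook-payment-risk", {
  title: "RUNBOOK: Payment Risk Incident",
  owner: { Name: "Alex Turner", Email: "alex@zayra.com" },
  severity: { Level: "P1", Escalation: "immediate" },
});
```

Each record's `text` reads like `owner.Email: alex@zayra.com`, so a retrieved entry still says which section and field it came from.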
Validating AI Skill files is exactly the use case mdshape is built for. Define a schema for your skill format (required sections, field order, metadata) and validate every .md file before it enters your agent pipeline. Catch formatting issues at authoring time, not at runtime.
You don't have to drop Zod. mdshape replaces the glue layer — the remark plugins, AST walkers, and custom extraction — with a single call. Zod validates values; mdshape validates and extracts markdown structure natively.
A basic schema is under 10 lines. The Playground lets you iterate without installing anything. Most teams go from zero to first validation in under 15 minutes.
A parser turns markdown into an AST. That's it — you still need to walk the tree, extract fields, validate structure, and shape the output yourself. mdshape does all of that in one call: you define a schema, it returns typed JSON or typed errors. No AST manipulation.
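For contrast, the hand-written layer described above tends to look something like the following sketch. It uses plain string handling in place of a real parser, handles only `- Key: Value` items under a heading, and `extractSectionFields` is a hypothetical helper written for this illustration. It is not mdshape code.

```typescript
// Hand-rolled extraction of "- Key: Value" items under a heading with the given text.
// Illustrative only: no nesting, no tables, no diagnostics.
function extractSectionFields(markdown: string, heading: string): Record<string, string> {
  const fields: Record<string, string> = {};
  let inSection = false;
  for (const line of markdown.split("\n")) {
    const headingMatch = /^#{1,6}\s+(.*)$/.exec(line);
    if (headingMatch) {
      // Enter the section on an exact heading match; any other heading ends it.
      inSection = (headingMatch[1] ?? "").trim() === heading;
      continue;
    }
    const fieldMatch = /^-\s+([^:]+):\s*(.*)$/.exec(line);
    if (inSection && fieldMatch) {
      fields[(fieldMatch[1] ?? "").trim()] = (fieldMatch[2] ?? "").trim();
    }
  }
  return fields;
}

const runbook = [
  "# RUNBOOK: Payment Risk Incident",
  "## 1. OWNER",
  "- Name: Alex Turner",
  "- Email: alex@zayra.com",
  "## 2. SEVERITY",
  "- Level: P1",
  "- Escalation: immediate",
].join("\n");

const ownerFields = extractSectionFields(runbook, "1. OWNER");
```

Even this small helper leaves out everything that makes the layer fragile in practice: it does not check that the email is well formed, that required fields are present, that sections appear in order, or where in the file a problem occurred.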
When validation fails, you get a typed error object listing every issue: which field is missing, which section is out of order, and the exact line and column number. There is no generic "parse failed"; every failure is actionable.
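To make that concrete, here is a sketch of how such diagnostics might be rendered for an author or a CI log. The `Diagnostic` shape is an assumption pieced together from the description on this page (a code, a path, a line and column). The field names, the example codes, and `formatDiagnostics` are all hypothetical and not taken from mdshape's types.

```typescript
// Hypothetical diagnostic shape; not mdshape's actual error type.
interface Diagnostic {
  code: string;    // kind of issue, e.g. "missing_field" (assumed code name)
  path: string;    // location in the schema, e.g. "owner.Email"
  line: number;
  column: number;
  message: string;
}

// Render diagnostics as "file:line:column [code] path: message", sorted by position,
// so each failure points at a place an author can fix.
function formatDiagnostics(file: string, diagnostics: Diagnostic[]): string[] {
  return [...diagnostics]
    .sort((a, b) => a.line - b.line || a.column - b.column)
    .map((d) => `${file}:${d.line}:${d.column} [${d.code}] ${d.path}: ${d.message}`);
}

const rendered = formatDiagnostics("runbook.md", [
  {
    code: "section_out_of_order",
    path: "severity",
    line: 6,
    column: 1,
    message: "expected after section '1. OWNER'",
  },
  {
    code: "missing_field",
    path: "owner.Email",
    line: 3,
    column: 1,
    message: "required field not found",
  },
]);
```

The `file:line:column` prefix follows the convention many editors and CI systems recognize, which is one reason to keep line and column information on every issue.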
mdshape never modifies your files. It is read-only: it parses and extracts, and it never changes the source markdown. Your files stay portable and untouched.
The documentation is available to LLMs and AI agents. We serve an llms.txt file at docs.markschema.com/llms.txt with the full documentation index (pages, API types, guides, and examples) so LLMs and AI agents can discover and reference our docs natively.