Files

Catch-all parser for folders or archives of plain documents — Markdown, text, PDF, JSON, CSV.

The files source is the everything-else parser. Point it at a folder or archive of documents and it ingests one item per file, with the body extracted from the most common formats.

It's the right source for personal notes you don't keep in Notion, miscellaneous PDFs, downloaded reference docs, and any one-off corpora that don't fit a more specific source.

Get the export

There's nothing to download. Either:

  • Pass the path to a folder containing the files you want ingested. The folder is walked recursively.
  • Or pass an archive (.zip, .tar.gz, .tgz), which is extracted into staging/ first.
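
If the documents need to travel as a single file, a folder can be bundled into one of the accepted archive formats with standard tools. The following is a minimal sketch using tar. The helper name make_archive and the example paths are illustrative and not part of vault, and this page does not say whether the archive's top-level folder name ends up in the tags.

```shell
# Hypothetical helper, not a vault command: bundle a folder into a
# gzip-compressed tarball (.tgz), keeping only the folder's own name
# as the top-level path inside the archive.
make_archive() {
  src_dir=$1   # folder to bundle, e.g. "$HOME/Documents/notes"
  out_file=$2  # archive to create, e.g. "$HOME/Downloads/notes.tgz"
  tar -czf "$out_file" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}
```

Calling make_archive ~/Documents/notes ~/Downloads/notes.tgz would produce an archive that can then be passed to --path.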

Register the export

Folder:

bash
vault add-export \
  --source files \
  --path ~/Documents/notes \
  --account personal

Archive:

bash
vault add-export \
  --source files \
  --path ~/Downloads/research-papers.zip \
  --account research

What gets ingested

| File type | How it's parsed |
| --- | --- |
| .md, .markdown | Body is the Markdown text. Front-matter (if YAML) lands in metadata. |
| .txt | Body is the raw text. |
| .pdf | Body is text extracted page-by-page. Page count recorded in metadata. |
| .json, .jsonc | Body is the pretty-printed JSON; structure not normalized further. |
| .csv | Body is the CSV (preserved verbatim); columns inferred from the header. |
| Other extensions | Skipped silently. Filenames still recorded as item_media. |
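
Because unsupported extensions are skipped silently, it can help to check a folder before registering it. The sketch below mirrors the extension list above in plain shell. The function names classify_file and list_skipped are hypothetical, and the sketch assumes extensions are matched in lowercase, which this page does not confirm.

```shell
# Hypothetical pre-flight check, not a vault command. Assumes extension
# matching is case-sensitive and lowercase.
classify_file() {
  case "$1" in
    *.md|*.markdown|*.txt|*.pdf|*.json|*.jsonc|*.csv) echo parsed ;;
    *) echo skipped ;;
  esac
}

# Print every regular file under a folder whose extension is not in the list.
list_skipped() {
  find "$1" -type f | while IFS= read -r file; do
    if [ "$(classify_file "$file")" = skipped ]; then
      printf '%s\n' "$file"
    fi
  done
}
```

Running list_skipped ~/Documents/notes would list the files that are likely to be recorded by filename only.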

For each ingested file:

  • kind=document
  • ts falls back to the file's mtime when the body contains no better timestamp.
  • The directory hierarchy is preserved as tags (notes/journal/2025 becomes three tags).
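
The tag rule amounts to splitting a file's directory path on "/". A minimal shell sketch of that idea follows, assuming the tags are exactly the directory components relative to where the walk starts. The function name path_to_tags is hypothetical and not part of vault.

```shell
# Hypothetical helper, not part of vault: print one tag per directory
# component of a relative directory path.
path_to_tags() {
  printf '%s\n' "$1" | tr '/' '\n'
}

path_to_tags notes/journal/2025
```

This prints notes, journal, and 2025 on separate lines, matching the three tags in the example above.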

Ingest it

bash
vault ingest --source files
$ vault ingest --source files
Walking ~/Documents/notes ...
found 412 .md, 18 .pdf, 47 .txt, 6 .csv
documents: 483
Run 18 completed in 4.2s

Caveats

  • PDF extraction is approximate. Scanned PDFs without an OCR layer come through as empty bodies. Run them through OCR first if you want them searchable.
  • Large files are ingested whole. A multi-MB JSON file becomes a multi-MB item body, so consider chunking very large files yourself before pointing the parser at them.
  • No deduplication across sources. A document ingested via notion won't be detected as a duplicate of the same document ingested via files: the two have different source slugs and therefore different dedup keys. Pick one source per document.
  • Hidden files and .git/ directories are skipped. Symlinks are not followed.
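
The large-file caveat lends itself to a quick check before ingesting. The sketch below lists oversized files and splits a large CSV into smaller files that each keep the header row, so columns can still be inferred per chunk. Both helper names are hypothetical and not part of vault, and the CSV splitter assumes no quoted field contains a newline.

```shell
# Hypothetical pre-flight helpers, not vault commands.

# List files under a folder larger than a threshold given in KiB.
# The "k" size suffix is supported by GNU and BSD find.
list_large_files() {
  find "$1" -type f -size +"$2"k
}

# Split a CSV into chunks of at most N data rows, repeating the header
# line in every chunk. Assumes no quoted field contains a newline.
split_csv() {
  src=$1; rows=$2; out_dir=$3
  mkdir -p "$out_dir"
  work=$(mktemp -d)
  tail -n +2 "$src" | split -l "$rows" - "$work/part_"
  for part in "$work"/part_*; do
    [ -e "$part" ] || continue   # no data rows: nothing to write
    { head -n 1 "$src"; cat "$part"; } > "$out_dir/$(basename "$part").csv"
  done
  rm -rf "$work"
}
```

For instance, list_large_files ~/Documents/notes 5120 would list files over roughly 5 MiB, and split_csv big.csv 10000 chunks/ would write chunks/part_aa.csv, chunks/part_ab.csv, and so on.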