PGINX — Logic Flow Contract
Overview
This project is a fork and refactor of https://github.com/VectifyAI/PageIndex. It was forked because the original was heavily vibe-coded, with little regard for maintainability, testability, readability, or usability. A preliminary refactoring was done to fix the issues found while using the original.
PGINX converts PDFs into hierarchical tree structures with page-level indexing, enabling structured retrieval without vector search.
The codebase is organised into two Bounded Contexts:
| BC | Package | Responsibility |
|---|---|---|
| Ingestion | pginx/ingestion/ | Parse PDF → extract structure via LLM → enrich → persist |
| Querying | pginx/querying/ | Load persisted documents → answer metadata/structure/page queries |
A Shared Kernel (pginx/shared/) holds cross-cutting concerns used by both BCs.
CLI
Installation
pip install pginx[cli]
This installs the core library plus the CLI dependencies (typer, rich, openai-agents, openai).
Configuration
Create a .env file in the project root with your API keys:
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="sk-ant-..."
USE_OTEL_LITELLM_REQUEST_SPAN=true # optional: enables OpenTelemetry tracing
Only the key for your chosen provider is required. The default model is anthropic/claude-sonnet-4-6 (set in pginx/config.yaml).
Commands
Index a PDF
python cli.py index paper.pdf
python cli.py index paper.pdf --workspace ./my-workspace --model anthropic/claude-sonnet-4-6
Parses the PDF, builds a hierarchical structure via LLM, and persists it to the workspace. Prints the assigned doc_id on success.
Chat with a document
python cli.py chat <doc_id>
python cli.py chat --pdf paper.pdf # index then chat in one step
python cli.py chat <doc_id> --model gpt-4o
Starts an interactive Q&A session. The agent retrieves page content on demand — type exit or quit to stop.
Inspect document structure
python cli.py structure <doc_id>
python cli.py structure <doc_id> --depth 2
Prints the chapter/section tree for an already-indexed document. --depth limits how many levels are shown (default: 3).
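To illustrate what --depth does, the sketch below renders a node tree down to a given number of levels. The function name render_tree and the output format are assumptions made for illustration and are not taken from cli.py; only the node keys (title, start_index, end_index, nodes) come from the Structure Node shape described under Data Shapes.

```python
def render_tree(nodes, max_depth=3, _level=1):
    """Return indented lines for a node tree, stopping after max_depth levels."""
    if _level > max_depth:
        return []
    lines = []
    for node in nodes:
        indent = "  " * (_level - 1)
        pages = f"(pp. {node['start_index']}-{node['end_index']})"
        lines.append(f"{indent}{node['title']} {pages}")
        # Recurse into children; deeper levels are cut off by the guard above.
        lines.extend(render_tree(node.get("nodes", []), max_depth, _level + 1))
    return lines


# Sample tree (made-up data) with three levels of nesting.
tree = [
    {"title": "Chapter 1", "start_index": 1, "end_index": 10, "nodes": [
        {"title": "1.1 Background", "start_index": 1, "end_index": 4, "nodes": [
            {"title": "1.1.1 History", "start_index": 2, "end_index": 3, "nodes": []},
        ]},
    ]},
]
print("\n".join(render_tree(tree, max_depth=2)))
```

With max_depth=2 the third level (1.1.1 History) is omitted; with the default of 3 all three lines are returned.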
Module Map
Public API
| Module | Role |
|---|---|
| pginx/client.py | PageIndexClient: public facade; wires both BCs |
| pginx/__init__.py | Re-exports PageIndexClient + three retrieval functions |
Shared Kernel
| Module | Role |
|---|---|
| pginx/shared/config.py | ConfigLoader: merges config.yaml defaults with user overrides |
| pginx/shared/tracing.py | @traced decorator + setup_tracing() (OpenTelemetry) |
| pginx/shared/document_schema.py | Frozen dataclasses: NodeRecord, DocumentRecord, PageEntry |
Ingestion BC
| Module | Role |
|---|---|
| pginx/ingestion/domain/model.py | PageSlice, TocEntry, IndexingDocument (aggregate root) |
| pginx/ingestion/domain/ports.py | LlmPort, PdfReaderPort (ABCs; no infrastructure imports) |
| pginx/ingestion/domain/services/ | Package: TocProcessor, StructureBuilder, Enricher + tree helpers |
| pginx/ingestion/domain/services/_helpers.py | Pure helper functions (tree manipulation, JSON parsing, page chunking) |
| pginx/ingestion/domain/services/toc_processor.py | TocProcessor: TOC detection, extraction, verification, fixing |
| pginx/ingestion/domain/services/structure_builder.py | StructureBuilder: flat TOC list → tree, large-node subdivision |
| pginx/ingestion/domain/services/enricher.py | Enricher: node summaries and document description |
| pginx/ingestion/application/index_document.py | IndexDocumentService: thin use-case orchestrator |
| pginx/ingestion/infrastructure/litellm_adapter.py | LiteLlmAdapter(LlmPort): wraps LiteLLM |
| pginx/ingestion/infrastructure/pypdf_adapter.py | PyPdfAdapter(PdfReaderPort): wraps PyPDF2 / PyMuPDF |
Querying BC
| Module | Role |
|---|---|
| pginx/querying/domain/model.py | PageRange, IndexedDocument (read-only aggregate) |
| pginx/querying/domain/ports.py | DocumentStorePort (ABC) |
| pginx/querying/application/query_service.py | QueryService: get_document / get_structure / get_pages |
| pginx/querying/infrastructure/json_document_store.py | JsonDocumentStore(DocumentStorePort): workspace file I/O, or in-memory store when workspace=None; load_workspace_into() for backward compatibility |
Backward-compat shims (do not import in new code)
pginx/page_index.py, pginx/retrieve.py, pginx/utils.py, pginx/tracing.py
Entry Points
PageIndexClient(api_key, model, retrieve_model, workspace)
# stateful facade; wires both BCs
# retrieve_model: if blank, defaults to model; routed via _normalize_retrieve_model()
# → strips openai/ prefix; passes litellm/ through; otherwise prepends litellm/
# all query methods resolve doc_id via _resolve_doc_id():
# → matches doc_name or filename from _meta.json before treating as bare id
page_index(doc, **opts) # legacy shim → IndexDocumentService (no persistence)
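The retrieve_model routing described in the comments above can be sketched as a small pure function. This is a minimal sketch written from those comments only: the standalone name normalize_retrieve_model and the choice to fold the blank-value fallback into the same function are assumptions, and the real _normalize_retrieve_model() may differ.

```python
def normalize_retrieve_model(retrieve_model: str, model: str) -> str:
    """Apply the three routing rules to a retrieve_model string."""
    # Blank retrieve_model falls back to the indexing model
    # (assumption: the fallback is handled here rather than in the caller).
    name = (retrieve_model or "").strip() or model
    if name.startswith("openai/"):
        # Rule 1: strip the openai/ prefix.
        return name[len("openai/"):]
    if name.startswith("litellm/"):
        # Rule 2: litellm/ names pass through unchanged.
        return name
    # Rule 3: everything else gets a litellm/ prefix.
    return f"litellm/{name}"


print(normalize_retrieve_model("", "anthropic/claude-sonnet-4-6"))
print(normalize_retrieve_model("openai/gpt-4o", "anthropic/claude-sonnet-4-6"))
```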
Ingestion Pipeline
PageIndexClient.index(file_path)
↓ IndexDocumentService.index(file_path, opt)
↓ PyPdfAdapter.read_pages(file_path) # → list[PageSlice]
↓ IndexingDocument.record_pages(pages) # status: PARSING → STRUCTURING
↓ asyncio.run(_build_structure())
↓ StructureBuilder.build(page_list, opt)
↓ TocProcessor.check_toc() # LLM: detect TOC presence
↓ TocProcessor.process(page_list, mode='process_no_toc')
↓ _process_no_toc()
↓ _build_labelled_page_texts()
↓ _page_list_to_group_text() # chunk by token limit
↓ _generate_toc_init() # LLM: first chunk → flat TOC list
↓ _generate_toc_continue() # LLM: extend for remaining chunks
↓ validate_and_truncate_physical_indices()
↓ _verify_toc() # async: check_title_appearance per item
↓ [if accuracy < 1.0] _fix_incorrect_toc_with_retries()
↓ add_preface_if_needed()
↓ check_title_appearance_in_start_concurrent()
↓ post_processing() # flat list → tree (start/end indices)
↓ _process_large_node() # recurse for nodes > max_page/token limit
↓ write_node_id(), add_node_text(), generate_summaries(), format_structure()
↓ IndexingDocument.record_structure(structure) # status: STRUCTURING → ENRICHING
↓ [if_add_doc_description] Enricher.generate_doc_description()
↓ IndexingDocument.enrich(doc_description) # status: ENRICHING → COMPLETE
↓ JsonDocumentStore.save(doc_id, doc)
↓ QueryService.invalidate(doc_id)
↓ return doc_id
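The "chunk by token limit" step in the pipeline above can be pictured as a greedy grouping of page texts. The sketch below is an illustration under stated assumptions, not the code of _page_list_to_group_text(): the function name, the whitespace-based token counter, and the absence of any overlap between chunks are all hypothetical simplifications.

```python
def group_pages_by_token_limit(pages, max_tokens, count_tokens=None):
    """Greedily pack consecutive page texts into groups of at most max_tokens."""
    if count_tokens is None:
        # Placeholder tokenizer (assumption): counts whitespace-separated words.
        count_tokens = lambda text: len(text.split())
    groups, current, used = [], [], 0
    for text in pages:
        cost = count_tokens(text)
        # Start a new group when adding this page would exceed the budget.
        if current and used + cost > max_tokens:
            groups.append("\n".join(current))
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        groups.append("\n".join(current))
    return groups


pages = ["a b c", "d e", "f g h i", "j"]
print(group_pages_by_token_limit(pages, max_tokens=5))
```

A page that is larger than the budget on its own still forms a single group in this sketch, so no text is dropped.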
Key invariant: every node in the final structure has start_index and end_index as 1-based physical PDF page numbers.
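A check for this invariant might look like the sketch below. The function name check_page_invariant is hypothetical, and the extra conditions it enforces (start_index ≤ end_index, and both within the document's page count) are assumptions that go slightly beyond the invariant as stated.

```python
def check_page_invariant(nodes, page_count):
    """Return a list of violation messages; an empty list means the tree is valid."""
    problems = []
    for node in nodes:
        start, end = node.get("start_index"), node.get("end_index")
        title = node.get("title")
        if not (isinstance(start, int) and isinstance(end, int)):
            problems.append(f"{title!r}: missing start_index/end_index")
        elif not (1 <= start <= end <= page_count):
            # 1-based physical pages: 0 or anything past the last page is invalid.
            problems.append(f"{title!r}: bad range {start}-{end}")
        # Child nodes must satisfy the same invariant.
        problems.extend(check_page_invariant(node.get("nodes", []), page_count))
    return problems


good = [{"title": "Chapter 3", "start_index": 42, "end_index": 58, "nodes": []}]
bad = [{"title": "Intro", "start_index": 0, "end_index": 2, "nodes": [{"title": "Orphan"}]}]
print(check_page_invariant(good, page_count=60))
print(check_page_invariant(bad, page_count=60))
```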
Querying
QueryService.get_document(doc_id)
↓ _resolve_doc_id(doc_id) # name/filename → bare doc_id
↓ JsonDocumentStore.load(doc_id) # from disk or memory
↓ IndexedDocument.get_metadata() # → JSON: id, name, description, type, page_count
QueryService.get_document_structure(doc_id)
↓ _resolve_doc_id(doc_id)
↓ IndexedDocument.get_structure() # → JSON tree with 'text' fields stripped
QueryService.get_page_content(doc_id, pages)
↓ _resolve_doc_id(doc_id)
↓ PageRange.from_string("5-7"|"3,8"|"12")
↓ IndexedDocument.get_pages(page_range)
↓ prefer cached doc['pages']
↓ fallback: open PDF with PyPDF2
↓ → JSON: [{page, content}, ...]
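The three page-spec forms accepted above ("5-7", "3,8", "12") can be parsed along these lines. This is a standalone sketch, not the actual PageRange.from_string(): the function name parse_page_range, the sorted de-duplicated list it returns, and the error handling are assumptions.

```python
def parse_page_range(spec: str) -> list[int]:
    """Parse specs like "5-7", "3,8" or "12" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(low, high + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)


print(parse_page_range("5-7"))
print(parse_page_range("3,8"))
print(parse_page_range("12"))
```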
Ports
LlmPort (ABC)
.complete(prompt, ...) → str
.acomplete(prompt) → str
.count_tokens(text) → int
PdfReaderPort (ABC)
.read_pages(source) → list[PageSlice]
.get_name(source) → str
Implementations:
PyPdfAdapter(PdfReaderPort)
LiteLlmAdapter(LlmPort)
# constructor strips litellm/ prefix before passing model to LiteLLM
# static helpers (not on port):
# parse_json(content) → dict — extracts + cleans JSON from LLM response
# get_json_content(response) → str — strips ```json … ``` fences
DocumentStorePort (ABC)
.save(doc_id, doc)
.load(doc_id) → dict | None
.load_all_metadata() → dict
Implementation:
JsonDocumentStore(DocumentStorePort)
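To show how a port and its adapter relate, here is a minimal sketch of DocumentStorePort as a Python ABC together with a dictionary-backed implementation. The three method names and return types follow the listing above; the type hints, the InMemoryDocumentStore class, and the choice of metadata keys (borrowed from the _meta.json layout) are assumptions and are not the code of JsonDocumentStore.

```python
from abc import ABC, abstractmethod
from typing import Optional


class DocumentStorePort(ABC):
    """Port: the only storage surface the domain layer depends on."""

    @abstractmethod
    def save(self, doc_id: str, doc: dict) -> None: ...

    @abstractmethod
    def load(self, doc_id: str) -> Optional[dict]: ...

    @abstractmethod
    def load_all_metadata(self) -> dict: ...


class InMemoryDocumentStore(DocumentStorePort):
    """Hypothetical adapter that keeps documents in a dict (handy for tests)."""

    META_KEYS = ("type", "doc_name", "doc_description", "path", "page_count")

    def __init__(self) -> None:
        self._docs: dict = {}

    def save(self, doc_id: str, doc: dict) -> None:
        self._docs[doc_id] = doc

    def load(self, doc_id: str) -> Optional[dict]:
        return self._docs.get(doc_id)

    def load_all_metadata(self) -> dict:
        # Lightweight view: metadata keys only, no structure or pages.
        return {
            doc_id: {key: doc.get(key) for key in self.META_KEYS}
            for doc_id, doc in self._docs.items()
        }


store = InMemoryDocumentStore()
store.save("d1", {"doc_name": "paper", "page_count": 12, "structure": []})
print(store.load_all_metadata())
```

Because callers depend only on the abstract port, a file-backed store and an in-memory store can be swapped without touching the querying logic.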
Config (shared/config.py — ConfigLoader)
ConfigLoader(default_path=config.yaml)
.load(user_opt: dict | config | None) → config (SimpleNamespace)
↓ load YAML defaults (default model: anthropic/claude-sonnet-4-6)
↓ validate user keys against known keys
↓ merge: defaults ← user overrides
↓ return config(model, retrieve_model, toc_check_page_num,
max_page_num_each_node, max_token_num_each_node,
if_add_node_id, if_add_node_summary,
if_add_doc_description, if_add_node_text)
# retrieve_model: "" in YAML → client falls back to model
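The load, validate, and merge steps above can be sketched with a plain dict of defaults. This is an illustration rather than ConfigLoader itself: the function name load_config is hypothetical, only three of the documented keys are shown, the value for if_add_node_id is a placeholder (the real defaults live in config.yaml), and raising ValueError on unknown keys is an assumption about how validation fails.

```python
from types import SimpleNamespace

# Stand-in for the YAML defaults. Only model and retrieve_model values are
# taken from this document; if_add_node_id is a placeholder value.
DEFAULTS = {
    "model": "anthropic/claude-sonnet-4-6",
    "retrieve_model": "",
    "if_add_node_id": True,
}


def load_config(user_opt=None, defaults=DEFAULTS):
    """Merge user overrides onto defaults after validating the keys."""
    if user_opt is None:
        overrides = {}
    elif isinstance(user_opt, SimpleNamespace):
        overrides = vars(user_opt)
    elif isinstance(user_opt, dict):
        overrides = user_opt
    else:
        raise TypeError("user_opt must be a dict, a config namespace, or None")
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    # Later entries win, so user overrides replace defaults.
    return SimpleNamespace(**{**defaults, **overrides})


config = load_config({"model": "gpt-4o"})
print(config.model, repr(config.retrieve_model))
```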
Tracing (shared/tracing.py)
setup_tracing()
↓ create per-module TracerProviders (keyed by _MODULE_COMPONENTS)
↓ attach BatchSpanProcessor → OTLPSpanExporter (localhost:4318)
↓ register litellm.success/failure_callback = ["otel"]
@traced (decorator, wraps sync and async)
↓ start span named after function
↓ record args, result (or exception) as span attributes
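A decorator that wraps both sync and async functions, as @traced is described above, needs two wrappers chosen by inspecting the target. The sketch below shows that pattern using only the standard library: it appends span-like dicts to a list instead of creating OpenTelemetry spans, so the SPANS list and the recorded field names are stand-ins, not the behaviour of pginx/shared/tracing.py.

```python
import asyncio
import functools
import inspect

SPANS = []  # stand-in for a span exporter (assumption; the real module uses OpenTelemetry)


def traced(func):
    """Record one span-like dict per call, for both sync and async functions."""

    def record(args, kwargs, result=None, error=None):
        SPANS.append({
            "name": func.__name__,  # span is named after the function
            "args": repr(args),
            "kwargs": repr(kwargs),
            "result": repr(result),
            "error": repr(error) if error is not None else None,
        })

    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            try:
                result = await func(*args, **kwargs)
            except Exception as exc:
                record(args, kwargs, error=exc)
                raise
            record(args, kwargs, result=result)
            return result
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            record(args, kwargs, error=exc)
            raise
        record(args, kwargs, result=result)
        return result
    return sync_wrapper


@traced
def add(a, b):
    return a + b


@traced
async def double(x):
    await asyncio.sleep(0)
    return 2 * x


add(1, 2)
asyncio.run(double(4))
print([span["name"] for span in SPANS])
```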
Data Shapes
Structure Node (in-memory / on-disk)
{
"title": "Chapter 3",
"node_id": "0003",
"start_index": 42,
"end_index": 58,
"summary": "...",
"text": "...",
"nodes": [ ... ]
}
The text field is stripped when a document is saved to the workspace, because it duplicates the pages array.
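Removing text from every node before saving can be done with a short recursive copy. The helper name strip_text is hypothetical, and returning a new tree instead of mutating in place is a design choice made for this sketch.

```python
def strip_text(nodes):
    """Return a copy of the tree with the 'text' field removed from every node."""
    cleaned = []
    for node in nodes:
        # Copy every field except 'text'; children are rebuilt recursively.
        copy = {key: value for key, value in node.items() if key not in ("text", "nodes")}
        copy["nodes"] = strip_text(node.get("nodes", []))
        cleaned.append(copy)
    return cleaned


tree = [{
    "title": "Chapter 3", "node_id": "0003", "start_index": 42, "end_index": 58,
    "summary": "...", "text": "full chapter text",
    "nodes": [{"title": "3.1", "node_id": "0004", "start_index": 42,
               "end_index": 45, "text": "section text", "nodes": []}],
}]
slim = strip_text(tree)
print(slim[0].keys())
```

The input tree is left untouched, so the in-memory structure can keep its text while the persisted copy omits it.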
Workspace Layout
{workspace}/
_meta.json # lightweight index: {doc_id: {type, doc_name, doc_description, path, page_count}}
{doc_id}.json # full document: structure (no text) + pages array
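Given this layout, resolving a document name or filename to a doc_id only requires reading _meta.json. The sketch below is an assumption-laden illustration of that lookup, not the real _resolve_doc_id(): it treats the filename as the last component of the path field, and the sample doc_id, type value, and path are made up.

```python
import json
import tempfile
from pathlib import Path


def resolve_doc_id(workspace, ref):
    """Map a doc_name or filename to its doc_id; otherwise treat ref as a bare id."""
    meta_file = Path(workspace) / "_meta.json"
    if not meta_file.exists():
        return ref
    meta = json.loads(meta_file.read_text(encoding="utf-8"))
    for doc_id, info in meta.items():
        # Assumption: "filename" means the basename of the stored path.
        candidates = {info.get("doc_name"), Path(info.get("path") or "").name}
        candidates.discard(None)
        candidates.discard("")
        if ref in candidates:
            return doc_id
    return ref


with tempfile.TemporaryDirectory() as workspace:
    sample = {"a1b2": {"type": "pdf", "doc_name": "paper", "doc_description": "",
                       "path": "docs/paper.pdf", "page_count": 12}}
    (Path(workspace) / "_meta.json").write_text(json.dumps(sample), encoding="utf-8")
    by_name = resolve_doc_id(workspace, "paper")
    by_file = resolve_doc_id(workspace, "paper.pdf")
    bare = resolve_doc_id(workspace, "a1b2")

print(by_name, by_file, bare)
```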
Typical Call Flow (agentic RAG)
client.index("paper.pdf") → doc_id
client.get_document(doc_id) → doc metadata
client.get_document_structure(doc_id) → tree (LLM picks relevant nodes)
client.get_page_content(doc_id, "12-15") → raw page text for answer generation