
PGINX — Logic Flow Contract

Overview

This project is a fork and refactor of https://github.com/VectifyAI/PageIndex. It was forked because the original was heavily vibecoded, with little concern for maintainability, testability, readability, or usability. A preliminary refactoring was done to fix the issues found while using the original.

PGINX converts PDFs into hierarchical tree structures with page-level indexing, enabling structured retrieval without vector search.

The codebase is organised into two Bounded Contexts:

| BC | Package | Responsibility |
| --- | --- | --- |
| Ingestion | pginx/ingestion/ | Parse PDF → extract structure via LLM → enrich → persist |
| Querying | pginx/querying/ | Load persisted documents → answer metadata/structure/page queries |

A Shared Kernel (pginx/shared/) holds cross-cutting concerns used by both BCs.


CLI

Installation

pip install "pginx[cli]"   # quoted so shells such as zsh do not expand the brackets

This installs the core library plus the CLI dependencies (typer, rich, openai-agents, openai).

Configuration

Create a .env file in the project root with your API keys:

OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="sk-ant-..."
USE_OTEL_LITELLM_REQUEST_SPAN=true   # optional: enables OpenTelemetry tracing

Only the key for your chosen provider is required. The default model is anthropic/claude-sonnet-4-6 (set in pginx/config.yaml).

Commands

Index a PDF

python cli.py index paper.pdf
python cli.py index paper.pdf --workspace ./my-workspace --model anthropic/claude-sonnet-4-6

Parses the PDF, builds a hierarchical structure via LLM, and persists it to the workspace. Prints the assigned doc_id on success.

Chat with a document

python cli.py chat <doc_id>
python cli.py chat --pdf paper.pdf        # index then chat in one step
python cli.py chat <doc_id> --model gpt-4o

Starts an interactive Q&A session. The agent retrieves page content on demand — type exit or quit to stop.

Inspect document structure

python cli.py structure <doc_id>
python cli.py structure <doc_id> --depth 2

Prints the chapter/section tree for an already-indexed document. --depth limits how many levels are shown (default: 3).
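
The effect of --depth can be pictured as a recursive walk that stops descending once the limit is reached. The following is a minimal sketch, assuming nodes follow the Structure Node shape (title, node_id, nodes); render_tree is a hypothetical helper and not the CLI's actual implementation.

```python
def render_tree(nodes, depth=3, level=1):
    """Return indented lines for a structure tree, at most `depth` levels deep.

    Hypothetical helper: node keys (title, node_id, nodes) are assumed to
    match the Structure Node shape described in this README.
    """
    if level > depth:
        return []
    lines = []
    for node in nodes:
        indent = "  " * (level - 1)
        lines.append(f"{indent}[{node.get('node_id', '?')}] {node['title']}")
        # Children are rendered one level deeper; the guard above cuts them off.
        lines.extend(render_tree(node.get("nodes", []), depth, level + 1))
    return lines


tree = [
    {"title": "Chapter 1", "node_id": "0001", "nodes": [
        {"title": "Section 1.1", "node_id": "0002", "nodes": [
            {"title": "Subsection 1.1.1", "node_id": "0003", "nodes": []},
        ]},
    ]},
]

print("\n".join(render_tree(tree, depth=2)))
```

With depth=2 the subsection is omitted; with the default of 3 all three levels are listed.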

Module Map

Public API

| Module | Role |
| --- | --- |
| pginx/client.py | PageIndexClient: public facade; wires both BCs |
| pginx/__init__.py | Re-exports PageIndexClient + three retrieval functions |

Shared Kernel

| Module | Role |
| --- | --- |
| pginx/shared/config.py | ConfigLoader: merges config.yaml defaults with user overrides |
| pginx/shared/tracing.py | @traced decorator + setup_tracing() (OpenTelemetry) |
| pginx/shared/document_schema.py | Frozen dataclasses: NodeRecord, DocumentRecord, PageEntry |

Ingestion BC

| Module | Role |
| --- | --- |
| pginx/ingestion/domain/model.py | PageSlice, TocEntry, IndexingDocument (aggregate root) |
| pginx/ingestion/domain/ports.py | LlmPort, PdfReaderPort (ABCs; no infrastructure imports) |
| pginx/ingestion/domain/services/ | Package: TocProcessor, StructureBuilder, Enricher + tree helpers |
| pginx/ingestion/domain/services/_helpers.py | Pure helper functions (tree manipulation, JSON parsing, page chunking) |
| pginx/ingestion/domain/services/toc_processor.py | TocProcessor: TOC detection, extraction, verification, fixing |
| pginx/ingestion/domain/services/structure_builder.py | StructureBuilder: flat TOC list → tree, large-node subdivision |
| pginx/ingestion/domain/services/enricher.py | Enricher: node summaries and document description |
| pginx/ingestion/application/index_document.py | IndexDocumentService: thin use-case orchestrator |
| pginx/ingestion/infrastructure/litellm_adapter.py | LiteLlmAdapter(LlmPort): wraps LiteLLM |
| pginx/ingestion/infrastructure/pypdf_adapter.py | PyPdfAdapter(PdfReaderPort): wraps PyPDF2 / PyMuPDF |

Querying BC

| Module | Role |
| --- | --- |
| pginx/querying/domain/model.py | PageRange, IndexedDocument (read-only aggregate) |
| pginx/querying/domain/ports.py | DocumentStorePort (ABC) |
| pginx/querying/application/query_service.py | QueryService: get_document / get_structure / get_pages |
| pginx/querying/infrastructure/json_document_store.py | JsonDocumentStore(DocumentStorePort): workspace file I/O or in-memory store (workspace=None); load_workspace_into() for backward-compat |

Backward-compat shims (do not import in new code)

pginx/page_index.py, pginx/retrieve.py, pginx/utils.py, pginx/tracing.py


Entry Points

PageIndexClient(api_key, model, retrieve_model, workspace)
  # stateful facade; wires both BCs
  # retrieve_model: if blank, defaults to model; routed via _normalize_retrieve_model()
  #   → strips openai/ prefix; passes litellm/ through; otherwise prepends litellm/
  # all query methods resolve doc_id via _resolve_doc_id():
  #   → matches doc_name or filename from _meta.json before treating as bare id

page_index(doc, **opts)          # legacy shim → IndexDocumentService (no persistence)
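
The retrieve_model routing rule described above can be sketched as a small pure function. This is only an illustration of the stated rule, not the body of _normalize_retrieve_model(); the function name and the order of the checks (fallback first, then prefix handling) are assumptions.

```python
def normalize_retrieve_model(retrieve_model: str, model: str) -> str:
    """Sketch of the routing rule; the name and check order are assumptions."""
    # A blank retrieve_model falls back to the indexing model.
    name = retrieve_model or model
    if name.startswith("openai/"):
        # OpenAI models are passed without the provider prefix.
        return name[len("openai/"):]
    if name.startswith("litellm/"):
        # Already routed through LiteLLM: pass through unchanged.
        return name
    # Everything else is routed through LiteLLM.
    return "litellm/" + name


print(normalize_retrieve_model("openai/gpt-4o", "anthropic/claude-sonnet-4-6"))
print(normalize_retrieve_model("", "anthropic/claude-sonnet-4-6"))
```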

Ingestion Pipeline

PageIndexClient.index(file_path)
  ↓ IndexDocumentService.index(file_path, opt)
      ↓ PyPdfAdapter.read_pages(file_path)     # → list[PageSlice]
      ↓ IndexingDocument.record_pages(pages)   # status: PARSING → STRUCTURING
      ↓ asyncio.run(_build_structure())
          ↓ StructureBuilder.build(page_list, opt)
              ↓ TocProcessor.check_toc()        # LLM: detect TOC presence
              ↓ TocProcessor.process(page_list, mode='process_no_toc')
                  ↓ _process_no_toc()
                      ↓ _build_labelled_page_texts()
                      ↓ _page_list_to_group_text()   # chunk by token limit
                      ↓ _generate_toc_init()          # LLM: first chunk → flat TOC list
                      ↓ _generate_toc_continue()      # LLM: extend for remaining chunks
                  ↓ validate_and_truncate_physical_indices()
                  ↓ _verify_toc()                     # async: check_title_appearance per item
                  ↓ [if accuracy < 1.0] _fix_incorrect_toc_with_retries()
              ↓ add_preface_if_needed()
              ↓ check_title_appearance_in_start_concurrent()
              ↓ post_processing()              # flat list → tree (start/end indices)
              ↓ _process_large_node()          # recurse for nodes > max_page/token limit
          ↓ write_node_id(), add_node_text(), generate_summaries(), format_structure()
      ↓ IndexingDocument.record_structure(structure)  # status: STRUCTURING → ENRICHING
      ↓ [if_add_doc_description] Enricher.generate_doc_description()
      ↓ IndexingDocument.enrich(doc_description)      # status: ENRICHING → COMPLETE
      ↓ JsonDocumentStore.save(doc_id, doc)
  ↓ QueryService.invalidate(doc_id)
  ↓ return doc_id
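
The status transitions noted in the pipeline (PARSING → STRUCTURING → ENRICHING → COMPLETE) behave like a small state machine on the aggregate root. The sketch below is a hypothetical stand-in for IndexingDocument: the method names come from the pipeline above, while the Status enum, the attribute names, and the choice to raise RuntimeError on out-of-order calls are assumptions.

```python
from enum import Enum


class Status(Enum):
    PARSING = "parsing"
    STRUCTURING = "structuring"
    ENRICHING = "enriching"
    COMPLETE = "complete"


class IndexingDocumentSketch:
    """Hypothetical stand-in for the IndexingDocument aggregate root."""

    def __init__(self):
        self.status = Status.PARSING
        self.pages = []
        self.structure = None
        self.description = None

    def _advance(self, expected, new):
        # Reject out-of-order calls so the pipeline order is enforced.
        if self.status is not expected:
            raise RuntimeError(f"expected {expected.name}, got {self.status.name}")
        self.status = new

    def record_pages(self, pages):
        self._advance(Status.PARSING, Status.STRUCTURING)
        self.pages = list(pages)

    def record_structure(self, structure):
        self._advance(Status.STRUCTURING, Status.ENRICHING)
        self.structure = structure

    def enrich(self, description=None):
        self._advance(Status.ENRICHING, Status.COMPLETE)
        self.description = description
```

Keeping the transitions inside the aggregate means the application service stays a thin orchestrator: it calls the methods in order and cannot skip a stage by accident.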

Key invariant: every node in the final structure has start_index and end_index as 1-based physical PDF page numbers.
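
One way to test this invariant is a recursive check over the finished tree. This is a sketch under assumptions: check_page_invariant is a hypothetical helper, and the requirements that start_index <= end_index and that both stay within the page count are reasonable readings of the invariant rather than rules stated by the codebase.

```python
def check_page_invariant(nodes, page_count):
    """Return a list of problems; an empty list means the invariant holds.

    Hypothetical helper. Assumes start_index <= end_index <= page_count,
    which goes slightly beyond what the README states.
    """
    problems = []
    for node in nodes:
        start, end = node.get("start_index"), node.get("end_index")
        if not (isinstance(start, int) and isinstance(end, int)):
            problems.append(f"{node.get('title')!r}: missing start_index/end_index")
        elif not (1 <= start <= end <= page_count):
            problems.append(f"{node.get('title')!r}: bad range {start}-{end}")
        # Child nodes must satisfy the same invariant.
        problems.extend(check_page_invariant(node.get("nodes", []), page_count))
    return problems


structure = [
    {"title": "Chapter 3", "start_index": 42, "end_index": 58, "nodes": [
        {"title": "Section 3.1", "start_index": 42, "end_index": 45, "nodes": []},
    ]},
]
print(check_page_invariant(structure, page_count=60))
```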


Querying

QueryService.get_document(doc_id)
  ↓ _resolve_doc_id(doc_id)            # name/filename → bare doc_id
  ↓ JsonDocumentStore.load(doc_id)     # from disk or memory
  ↓ IndexedDocument.get_metadata()     # → JSON: id, name, description, type, page_count

QueryService.get_document_structure(doc_id)
  ↓ _resolve_doc_id(doc_id)
  ↓ IndexedDocument.get_structure()    # → JSON tree with 'text' fields stripped

QueryService.get_page_content(doc_id, pages)
  ↓ _resolve_doc_id(doc_id)
  ↓ PageRange.from_string("5-7"|"3,8"|"12")
  ↓ IndexedDocument.get_pages(page_range)
      ↓ prefer cached doc['pages']
      ↓ fallback: open PDF with PyPDF2
  ↓ → JSON: [{page, content}, ...]
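
The three page-spec forms accepted by PageRange.from_string ("5-7", "3,8", "12") can be parsed along the following lines. This is a minimal sketch and not the actual PageRange implementation; parse_pages is a hypothetical name, and accepting mixed specs such as "1-2,5" as well as rejecting pages below 1 are assumptions.

```python
def parse_pages(spec: str) -> list[int]:
    """Parse '5-7', '3,8' or '12' into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Inclusive range such as "5-7".
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)


print(parse_pages("5-7"))
print(parse_pages("3,8"))
print(parse_pages("12"))
```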

Ports

LlmPort (ABC)
  .complete(prompt, ...) → str
  .acomplete(prompt) → str
  .count_tokens(text) → int

PdfReaderPort (ABC)
  .read_pages(source) → list[PageSlice]
  .get_name(source) → str

DocumentStorePort (ABC)
  .save(doc_id, doc)
  .load(doc_id) → dict | None
  .load_all_metadata() → dict

Implementations:
  LiteLlmAdapter(LlmPort)
    # constructor strips litellm/ prefix before passing model to LiteLLM
    # static helpers (not on port):
    #   parse_json(content) → dict: extracts + cleans JSON from LLM response
    #   get_json_content(response) → str: strips ```json … ``` fences
  PyPdfAdapter(PdfReaderPort)
  JsonDocumentStore(DocumentStorePort)
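
Because the domain depends only on the ports, an adapter can be swapped for a test double. The sketch below restates DocumentStorePort with the three methods listed above and adds a hypothetical InMemoryDocumentStore; the metadata fields it returns are assumptions chosen for illustration and may differ from what JsonDocumentStore produces.

```python
from abc import ABC, abstractmethod


class DocumentStorePort(ABC):
    """Port as listed above; signatures beyond the names are assumptions."""

    @abstractmethod
    def save(self, doc_id, doc): ...

    @abstractmethod
    def load(self, doc_id): ...

    @abstractmethod
    def load_all_metadata(self): ...


class InMemoryDocumentStore(DocumentStorePort):
    """Hypothetical test double that keeps documents in a dict."""

    def __init__(self):
        self._docs = {}

    def save(self, doc_id, doc):
        self._docs[doc_id] = doc

    def load(self, doc_id):
        # Returns None for unknown ids, matching the `dict | None` contract.
        return self._docs.get(doc_id)

    def load_all_metadata(self):
        return {
            doc_id: {
                "doc_name": doc.get("doc_name"),
                "page_count": len(doc.get("pages", [])),
            }
            for doc_id, doc in self._docs.items()
        }


store = InMemoryDocumentStore()
store.save("abc123", {"doc_name": "paper.pdf", "pages": [{"page": 1, "content": "..."}]})
print(store.load_all_metadata())
```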

Config (shared/config.py — ConfigLoader)

ConfigLoader(default_path=config.yaml)
  .load(user_opt: dict | config | None) → config (SimpleNamespace)
    ↓ load YAML defaults  (default model: anthropic/claude-sonnet-4-6)
    ↓ validate user keys against known keys
    ↓ merge: defaults ← user overrides
    ↓ return config(model, retrieve_model, toc_check_page_num,
                    max_page_num_each_node, max_token_num_each_node,
                    if_add_node_id, if_add_node_summary,
                    if_add_doc_description, if_add_node_text)
                    # retrieve_model: "" in YAML → client falls back to model
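
The validate-then-merge steps can be sketched without the YAML file. In this illustration the defaults live in a dict and only the two keys whose default values this README states are included; load_config is a hypothetical function, and raising ValueError for unknown keys is an assumption about how validation fails.

```python
from types import SimpleNamespace

# Only the defaults stated in this README; the remaining keys are omitted
# because their default values are not given here.
DEFAULTS = {
    "model": "anthropic/claude-sonnet-4-6",
    "retrieve_model": "",
}


def load_config(user_opt=None, defaults=DEFAULTS):
    """Merge user overrides onto defaults (hypothetical sketch)."""
    if user_opt is None:
        user_opt = {}
    elif isinstance(user_opt, SimpleNamespace):
        user_opt = vars(user_opt)
    # Validate user keys against the known keys before merging.
    unknown = set(user_opt) - set(defaults)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    # Later entries win, so user overrides replace defaults.
    return SimpleNamespace(**{**defaults, **user_opt})


cfg = load_config({"model": "gpt-4o"})
print(cfg.model, repr(cfg.retrieve_model))
```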

Tracing (shared/tracing.py)

setup_tracing()
  ↓ create per-module TracerProviders (keyed by _MODULE_COMPONENTS)
  ↓ attach BatchSpanProcessor → OTLPSpanExporter (localhost:4318)
  ↓ register litellm.success/failure_callback = ["otel"]

@traced  (decorator, wraps sync and async)
  ↓ start span named after function
  ↓ record args, result (or exception) as span attributes
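
A decorator that wraps both sync and async functions usually branches on whether the target is a coroutine function. The sketch below shows that pattern with a plain list (SPANS) standing in for OpenTelemetry spans, so it runs without any tracing backend; it is not the code in shared/tracing.py, and the recorded field names are assumptions.

```python
import asyncio
import functools
import inspect

SPANS = []  # stand-in for exported spans; the real decorator uses OpenTelemetry


def traced(func):
    def _record(args, kwargs, result=None, error=None):
        SPANS.append({
            "name": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "result": repr(result),
            "error": repr(error) if error is not None else None,
        })

    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            try:
                result = await func(*args, **kwargs)
            except Exception as exc:
                _record(args, kwargs, error=exc)
                raise
            _record(args, kwargs, result=result)
            return result
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            _record(args, kwargs, error=exc)
            raise
        _record(args, kwargs, result=result)
        return result
    return sync_wrapper


@traced
def add(a, b):
    return a + b


@traced
async def double(x):
    await asyncio.sleep(0)
    return x * 2
```

Calling add(1, 2) or asyncio.run(double(4)) appends one entry per call to SPANS, with the exception recorded instead of the result when the wrapped function raises.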

Data Shapes

Structure Node (in-memory / on-disk)

{
  "title": "Chapter 3",
  "node_id": "0003",
  "start_index": 42,
  "end_index": 58,
  "summary": "...",
  "text": "...",
  "nodes": [ ... ]
}

The text field is present only in memory; it is stripped when the document is saved to workspace JSON, since it is redundant with the pages array.
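
Stripping the field before a save amounts to a recursive copy that drops one key. A minimal sketch, assuming the node shape shown above; strip_text is a hypothetical helper name.

```python
def strip_text(nodes):
    """Return a copy of the tree with every 'text' field removed.

    Hypothetical helper; the input tree is left unmodified.
    """
    stripped = []
    for node in nodes:
        copy = {k: v for k, v in node.items() if k != "text"}
        if "nodes" in copy:
            # Recurse so nested nodes lose their text as well.
            copy["nodes"] = strip_text(copy["nodes"])
        stripped.append(copy)
    return stripped


in_memory = [
    {"title": "Chapter 3", "node_id": "0003", "text": "full chapter text", "nodes": [
        {"title": "Section 3.1", "node_id": "0004", "text": "section text", "nodes": []},
    ]},
]
on_disk = strip_text(in_memory)
print(on_disk)
```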

Workspace Layout

{workspace}/
  _meta.json          # lightweight index: {doc_id: {type, doc_name, doc_description, path, page_count}}
  {doc_id}.json       # full document: structure (no text) + pages array

Typical Call Flow (agentic RAG)

client.index("paper.pdf")              → doc_id
client.get_document(doc_id)            → doc metadata
client.get_document_structure(doc_id)  → tree (LLM picks relevant nodes)
client.get_page_content(doc_id, "12-15") → raw page text for answer generation