nocyphr ebb1437c4d update: attributions		2026-05-03 17:36:46 +03:00
pginx	update: docstrings numpy style	2026-05-03 17:32:29 +03:00
.gitignore	ignore results and pdfs	2026-05-03 16:51:30 +03:00
cli.py	fix cli errors and add memory	2026-05-03 16:51:57 +03:00
pyproject.toml	add cli deps	2026-05-03 17:02:43 +03:00
README.md	update: attributions	2026-05-03 17:36:46 +03:00

PGINX — Logic Flow Contract

Overview

This project is a fork and refactor of https://github.com/VectifyAI/PageIndex. It was forked because the original was heavily vibecoded, with no concern for maintainability, testability, readability, or usability. A preliminary refactoring was done to fix the issues found while using the original.

PGINX converts PDFs into hierarchical tree structures with page-level indexing, enabling structured retrieval without vector search.

The codebase is organised into two Bounded Contexts (BCs):

BC	Package	Responsibility
Ingestion	`pginx/ingestion/`	Parse PDF → extract structure via LLM → enrich → persist
Querying	`pginx/querying/`	Load persisted documents → answer metadata/structure/page queries

A Shared Kernel (pginx/shared/) holds cross-cutting concerns used by both BCs.

CLI

Installation

pip install "pginx[cli]"   # quoted so shells such as zsh do not expand the brackets

This installs the core library plus the CLI dependencies (typer, rich, openai-agents, openai).

Configuration

Create a .env file in the project root with your API keys:

OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="sk-ant-..."
USE_OTEL_LITELLM_REQUEST_SPAN=true   # optional: enables OpenTelemetry tracing

Only the key for your chosen provider is required. The default model is anthropic/claude-sonnet-4-6 (set in pginx/config.yaml).
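
A quick way to check a .env file before running the CLI is sketched below. It uses only the standard library. The helper names `parse_env` and `required_key`, the prefix-to-variable mapping, and the choice to treat unprefixed model names as OpenAI models are assumptions made for illustration; PGINX may load and validate the file differently.

```python
# Hypothetical mapping from model prefix to the variable it needs (an assumption).
PROVIDER_KEYS = {"anthropic": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, ignoring blanks, comments and inline '# ...' notes."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value[:1] in ('"', "'"):
            # Quoted value: keep everything up to the closing quote.
            quote = value[0]
            end = value.find(quote, 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: drop a trailing inline comment.
            value = value.split(" #", 1)[0].strip()
        env[key.strip()] = value
    return env


def required_key(model: str) -> str:
    """Return the variable assumed to be needed for a 'provider/model' name."""
    # Assumption: names without a provider prefix (e.g. gpt-4o) are OpenAI models.
    provider = model.split("/", 1)[0] if "/" in model else "openai"
    return PROVIDER_KEYS.get(provider, "OPENAI_API_KEY")
```

With the example file above, `required_key("anthropic/claude-sonnet-4-6") in parse_env(text)` tells you whether the default model can run.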

Commands

Index a PDF

python cli.py index paper.pdf
python cli.py index paper.pdf --workspace ./my-workspace --model anthropic/claude-sonnet-4-6

Parses the PDF, builds a hierarchical structure via LLM, and persists it to the workspace. Prints the assigned doc_id on success.

Chat with a document

python cli.py chat <doc_id>
python cli.py chat --pdf paper.pdf        # index then chat in one step
python cli.py chat <doc_id> --model gpt-4o

Starts an interactive Q&A session in which the agent retrieves page content on demand. Type exit or quit to stop.

Inspect document structure

python cli.py structure <doc_id>
python cli.py structure <doc_id> --depth 2

Prints the chapter/section tree for an already-indexed document. `--depth` limits how many levels are shown (default: 3).
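
The effect of `--depth` can be pictured as pruning the structure tree before it is printed. The function below is a minimal sketch of that idea, assuming nodes are dicts with an optional `nodes` list as shown under Data Shapes; `limit_depth` is a hypothetical name, not the CLI's actual code.

```python
from typing import Any


def limit_depth(nodes: list[dict[str, Any]], depth: int) -> list[dict[str, Any]]:
    """Return a copy of the tree that keeps at most `depth` levels of nodes."""
    if depth <= 0:
        return []
    pruned = []
    for node in nodes:
        # Copy every field except the children, which are pruned recursively.
        copy = {k: v for k, v in node.items() if k != "nodes"}
        children = limit_depth(node.get("nodes", []), depth - 1)
        if children:
            copy["nodes"] = children
        pruned.append(copy)
    return pruned
```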

Module Map

Public API

Module	Role
`pginx/client.py`	`PageIndexClient` — public facade; wires both BCs
`pginx/__init__.py`	Re-exports `PageIndexClient` + three retrieval functions

Shared Kernel

Module	Role
`pginx/shared/config.py`	`ConfigLoader` — merges `config.yaml` defaults with user overrides
`pginx/shared/tracing.py`	`@traced` decorator + `setup_tracing()` (OpenTelemetry)
`pginx/shared/document_schema.py`	Frozen dataclasses: `NodeRecord`, `DocumentRecord`, `PageEntry`

Ingestion BC

Module	Role
`pginx/ingestion/domain/model.py`	`PageSlice`, `TocEntry`, `IndexingDocument` (aggregate root)
`pginx/ingestion/domain/ports.py`	`LlmPort`, `PdfReaderPort` (ABCs — no infrastructure imports)
`pginx/ingestion/domain/services/`	Package: `TocProcessor`, `StructureBuilder`, `Enricher` + tree helpers
`pginx/ingestion/domain/services/_helpers.py`	Pure helper functions (tree manipulation, JSON parsing, page chunking)
`pginx/ingestion/domain/services/toc_processor.py`	`TocProcessor` — TOC detection, extraction, verification, fixing
`pginx/ingestion/domain/services/structure_builder.py`	`StructureBuilder` — flat TOC list → tree, large-node subdivision
`pginx/ingestion/domain/services/enricher.py`	`Enricher` — node summaries and document description
`pginx/ingestion/application/index_document.py`	`IndexDocumentService` — thin use-case orchestrator
`pginx/ingestion/infrastructure/litellm_adapter.py`	`LiteLlmAdapter(LlmPort)` — wraps LiteLLM
`pginx/ingestion/infrastructure/pypdf_adapter.py`	`PyPdfAdapter(PdfReaderPort)` — wraps PyPDF2 / PyMuPDF

Querying BC

Module	Role
`pginx/querying/domain/model.py`	`PageRange`, `IndexedDocument` (read-only aggregate)
`pginx/querying/domain/ports.py`	`DocumentStorePort` (ABC)
`pginx/querying/application/query_service.py`	`QueryService` — get_document / get_document_structure / get_page_content
`pginx/querying/infrastructure/json_document_store.py`	`JsonDocumentStore(DocumentStorePort)` — workspace file I/O or in-memory store (`workspace=None`); `load_workspace_into()` for backward-compat

Backward-compat shims (do not import in new code)

pginx/page_index.py, pginx/retrieve.py, pginx/utils.py, pginx/tracing.py

Entry Points

PageIndexClient(api_key, model, retrieve_model, workspace)
  # stateful facade; wires both BCs
  # retrieve_model: if blank, defaults to model; routed via _normalize_retrieve_model()
  #   → strips openai/ prefix; passes litellm/ through; otherwise prepends litellm/
  # all query methods resolve doc_id via _resolve_doc_id():
  #   → matches doc_name or filename from _meta.json before treating as bare id

page_index(doc, **opts)          # legacy shim → IndexDocumentService (no persistence)
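
The routing rules listed for retrieve_model can be sketched as a small pure function. This is an illustration of the rules as written above, not the body of `_normalize_retrieve_model()`; in particular, applying the fallback to model before the prefix handling is an assumption.

```python
def normalize_retrieve_model(retrieve_model: str, model: str) -> str:
    """Apply the prefix rules described for retrieve_model."""
    # Assumption: a blank retrieve_model falls back to model before normalising.
    name = retrieve_model or model
    if name.startswith("openai/"):
        return name[len("openai/"):]   # strip the openai/ prefix
    if name.startswith("litellm/"):
        return name                    # already routed through LiteLLM
    return "litellm/" + name           # everything else goes through LiteLLM
```

Under these rules the default configuration (blank retrieve_model, model anthropic/claude-sonnet-4-6) resolves to litellm/anthropic/claude-sonnet-4-6.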

Ingestion Pipeline

PageIndexClient.index(file_path)
  ↓ IndexDocumentService.index(file_path, opt)
      ↓ PyPdfAdapter.read_pages(file_path)     # → list[PageSlice]
      ↓ IndexingDocument.record_pages(pages)   # status: PARSING → STRUCTURING
      ↓ asyncio.run(_build_structure())
          ↓ StructureBuilder.build(page_list, opt)
              ↓ TocProcessor.check_toc()        # LLM: detect TOC presence
              ↓ TocProcessor.process(page_list, mode='process_no_toc')
                  ↓ _process_no_toc()
                      ↓ _build_labelled_page_texts()
                      ↓ _page_list_to_group_text()   # chunk by token limit
                      ↓ _generate_toc_init()          # LLM: first chunk → flat TOC list
                      ↓ _generate_toc_continue()      # LLM: extend for remaining chunks
                  ↓ validate_and_truncate_physical_indices()
                  ↓ _verify_toc()                     # async: check_title_appearance per item
                  ↓ [if accuracy < 1.0] _fix_incorrect_toc_with_retries()
              ↓ add_preface_if_needed()
              ↓ check_title_appearance_in_start_concurrent()
              ↓ post_processing()              # flat list → tree (start/end indices)
              ↓ _process_large_node()          # recurse for nodes > max_page/token limit
          ↓ write_node_id(), add_node_text(), generate_summaries(), format_structure()
      ↓ IndexingDocument.record_structure(structure)  # status: STRUCTURING → ENRICHING
      ↓ [if_add_doc_description] Enricher.generate_doc_description()
      ↓ IndexingDocument.enrich(doc_description)      # status: ENRICHING → COMPLETE
      ↓ JsonDocumentStore.save(doc_id, doc)
  ↓ QueryService.invalidate(doc_id)
  ↓ return doc_id
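
The status transitions annotated in the pipeline (PARSING → STRUCTURING → ENRICHING → COMPLETE) form a small state machine. The class below is a hypothetical stand-in for IndexingDocument that models only those transitions. Rejecting out-of-order calls with a RuntimeError is an assumption about how the aggregate might guard its state, not a description of the real class.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    PARSING = "parsing"
    STRUCTURING = "structuring"
    ENRICHING = "enriching"
    COMPLETE = "complete"


class IndexingDocumentSketch:
    """Hypothetical stand-in for IndexingDocument; only the status flow is modelled."""

    def __init__(self) -> None:
        self.status = Status.PARSING
        self.pages: list = []
        self.structure: list = []
        self.description: Optional[str] = None

    def _advance(self, expected: Status, new: Status) -> None:
        # Reject calls made out of order so a stage cannot be skipped.
        if self.status is not expected:
            raise RuntimeError(f"expected {expected.name}, got {self.status.name}")
        self.status = new

    def record_pages(self, pages: list) -> None:
        self._advance(Status.PARSING, Status.STRUCTURING)
        self.pages = pages

    def record_structure(self, structure: list) -> None:
        self._advance(Status.STRUCTURING, Status.ENRICHING)
        self.structure = structure

    def enrich(self, description: Optional[str]) -> None:
        self._advance(Status.ENRICHING, Status.COMPLETE)
        self.description = description
```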

Key invariant: every node in the final structure has start_index and end_index as 1-based physical PDF page numbers.
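
A check for this invariant could look like the sketch below. It walks a structure tree and reports nodes whose indices are missing or out of range. The function name `check_page_bounds` is hypothetical, and the extra conditions (start_index ≤ end_index, both within the document's page count) are assumptions that go slightly beyond the invariant as stated.

```python
def check_page_bounds(nodes, page_count, path="root"):
    """Return a list of human-readable violations of the page-index invariant."""
    problems = []
    for i, node in enumerate(nodes):
        where = f"{path}/{node.get('title', i)}"
        start, end = node.get("start_index"), node.get("end_index")
        if not isinstance(start, int) or not isinstance(end, int):
            problems.append(f"{where}: missing start_index or end_index")
        elif not (1 <= start <= end <= page_count):
            problems.append(f"{where}: range {start}-{end} outside 1..{page_count}")
        # Child nodes are checked against the same document-wide bounds.
        problems.extend(check_page_bounds(node.get("nodes", []), page_count, where))
    return problems
```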

Querying

QueryService.get_document(doc_id)
  ↓ _resolve_doc_id(doc_id)            # name/filename → bare doc_id
  ↓ JsonDocumentStore.load(doc_id)     # from disk or memory
  ↓ IndexedDocument.get_metadata()     # → JSON: id, name, description, type, page_count

QueryService.get_document_structure(doc_id)
  ↓ _resolve_doc_id(doc_id)
  ↓ IndexedDocument.get_structure()    # → JSON tree with 'text' fields stripped

QueryService.get_page_content(doc_id, pages)
  ↓ _resolve_doc_id(doc_id)
  ↓ PageRange.from_string("5-7"|"3,8"|"12")
  ↓ IndexedDocument.get_pages(page_range)
      ↓ prefer cached doc['pages']
      ↓ fallback: open PDF with PyPDF2
  ↓ → JSON: [{page, content}, ...]
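
The three page specifications accepted above ("5-7", "3,8", "12") can be parsed along the following lines. This is a sketch rather than the code of PageRange.from_string: the function name `parse_pages`, the support for mixing ranges and single pages in one string, and the sorted, de-duplicated result are all assumptions.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec such as "5-7", "3,8" or "12" into sorted unique page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
            if lo > hi:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(lo, hi + 1))   # ranges are inclusive
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)
```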

Ports

LlmPort (ABC)                          PdfReaderPort (ABC)
  .complete(prompt, ...) → str           .read_pages(source) → list[PageSlice]
  .acomplete(prompt) → str               .get_name(source) → str
  .count_tokens(text) → int

Implementations:
  LiteLlmAdapter(LlmPort)              PyPdfAdapter(PdfReaderPort)
    # constructor strips litellm/ prefix before passing model to LiteLLM
    # static helpers (not on port):
    #   parse_json(content) → dict   — extracts + cleans JSON from LLM response
    #   get_json_content(response) → str — strips ```json … ``` fences

DocumentStorePort (ABC)
  .save(doc_id, doc)
  .load(doc_id) → dict | None
  .load_all_metadata() → dict

Implementation:
  JsonDocumentStore(DocumentStorePort)
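
To make the port/adapter split concrete, the sketch below declares DocumentStorePort with the three methods listed above and adds a toy in-memory adapter. `InMemoryStore` is hypothetical and only loosely mirrors what JsonDocumentStore does with workspace=None. The metadata keys are taken from the `_meta.json` description under Workspace Layout, and the type hints are assumptions.

```python
from abc import ABC, abstractmethod
from typing import Optional


class DocumentStorePort(ABC):
    """Port as listed above; the domain depends on this, never on an adapter."""

    @abstractmethod
    def save(self, doc_id: str, doc: dict) -> None: ...

    @abstractmethod
    def load(self, doc_id: str) -> Optional[dict]: ...

    @abstractmethod
    def load_all_metadata(self) -> dict: ...


class InMemoryStore(DocumentStorePort):
    """Hypothetical adapter that keeps documents in a dict instead of on disk."""

    META_KEYS = ("type", "doc_name", "doc_description", "path", "page_count")

    def __init__(self) -> None:
        self._docs: dict = {}

    def save(self, doc_id: str, doc: dict) -> None:
        self._docs[doc_id] = dict(doc)

    def load(self, doc_id: str) -> Optional[dict]:
        return self._docs.get(doc_id)

    def load_all_metadata(self) -> dict:
        # Expose only the lightweight fields, never the structure or pages.
        return {
            doc_id: {key: doc.get(key) for key in self.META_KEYS}
            for doc_id, doc in self._docs.items()
        }
```

Because callers only see the port, a test can substitute an adapter like this one without touching the filesystem.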

Config (`shared/config.py — ConfigLoader`)

ConfigLoader(default_path=config.yaml)
  .load(user_opt: dict | config | None) → config (SimpleNamespace)
    ↓ load YAML defaults  (default model: anthropic/claude-sonnet-4-6)
    ↓ validate user keys against known keys
    ↓ merge: defaults ← user overrides
    ↓ return config(model, retrieve_model, toc_check_page_num,
                    max_page_num_each_node, max_token_num_each_node,
                    if_add_node_id, if_add_node_summary,
                    if_add_doc_description, if_add_node_text)
                    # retrieve_model: "" in YAML → client falls back to model
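
The load steps above (defaults, key validation, merge) can be sketched without the YAML file. In the snippet below, `DEFAULTS` stands in for config.yaml: only the model value and the blank retrieve_model come from this README, the other entry is an illustrative placeholder, and `load_config` is a hypothetical name rather than the ConfigLoader API.

```python
from types import SimpleNamespace

# Stand-in for the YAML defaults; if_add_node_id's value is a placeholder.
DEFAULTS = {
    "model": "anthropic/claude-sonnet-4-6",
    "retrieve_model": "",
    "if_add_node_id": True,
}


def load_config(user_opt=None, defaults=None) -> SimpleNamespace:
    """Merge user overrides into the defaults, rejecting unknown keys."""
    merged = dict(DEFAULTS if defaults is None else defaults)
    if user_opt is None:
        overrides = {}
    elif isinstance(user_opt, SimpleNamespace):
        overrides = vars(user_opt)      # accept an existing config object
    elif isinstance(user_opt, dict):
        overrides = user_opt
    else:
        raise TypeError("user_opt must be a dict, a config namespace, or None")
    unknown = set(overrides) - set(merged)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    merged.update(overrides)            # defaults ← user overrides
    return SimpleNamespace(**merged)
```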

Tracing (`shared/tracing.py`)

setup_tracing()
  ↓ create per-module TracerProviders (keyed by _MODULE_COMPONENTS)
  ↓ attach BatchSpanProcessor → OTLPSpanExporter (localhost:4318)
  ↓ register litellm.success/failure_callback = ["otel"]

@traced  (decorator, wraps sync and async)
  ↓ start span named after function
  ↓ record args, result (or exception) as span attributes
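
A decorator that wraps both sync and async callables usually branches on whether the target is a coroutine function. The sketch below shows that pattern with the standard library only. It appends plain dicts to a module-level list, `SPANS`, where the real decorator would open an OpenTelemetry span, so treat it as an illustration of the wrapping logic and not of the tracing backend.

```python
import functools
import inspect

SPANS = []  # stand-in for an exporter; the real decorator records to a span


def traced(func):
    """Record name, arguments and result (or exception) of each call."""

    def record(args, kwargs, result=None, error=None):
        SPANS.append({
            "name": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "result": repr(result) if error is None else None,
            "error": repr(error) if error is not None else None,
        })

    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            try:
                result = await func(*args, **kwargs)
            except Exception as exc:
                record(args, kwargs, error=exc)
                raise
            record(args, kwargs, result=result)
            return result
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            record(args, kwargs, error=exc)
            raise
        record(args, kwargs, result=result)
        return result
    return sync_wrapper
```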

Data Shapes

Structure Node (in-memory / on-disk)

{
  "title": "Chapter 3",
  "node_id": "0003",
  "start_index": 42,
  "end_index": 58,
  "summary": "...",
  "text": "...",
  "nodes": [ ... ]
}

The text field is stripped when a document is saved to the workspace, because it is redundant with the pages array.
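
Removing text from a tree is a short recursive walk. The helper below is a sketch under the assumption that nodes are plain dicts shaped like the example above; `strip_text` is a hypothetical name, not necessarily what the store calls internally.

```python
def strip_text(nodes):
    """Return a new tree with every 'text' field removed; the input is untouched."""
    stripped = []
    for node in nodes:
        copy = {k: v for k, v in node.items() if k not in ("text", "nodes")}
        if "nodes" in node:
            copy["nodes"] = strip_text(node["nodes"])
        stripped.append(copy)
    return stripped
```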

Workspace Layout

{workspace}/
  _meta.json          # lightweight index: {doc_id: {type, doc_name, doc_description, path, page_count}}
  {doc_id}.json       # full document: structure (no text) + pages array
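
Given this layout, resolving a human-friendly name to a doc_id only requires `_meta.json`. The function below sketches the lookup described for `_resolve_doc_id()` under Entry Points: match doc_name or filename first, otherwise treat the input as a bare id. Deriving the filename from the basename of the `path` field is an assumption, and `resolve_doc_id` is an illustrative stand-alone function, not the client method.

```python
import json
from pathlib import Path


def resolve_doc_id(workspace, ref: str) -> str:
    """Map a doc_name or filename to its doc_id; fall back to ref itself."""
    meta_path = Path(workspace) / "_meta.json"
    if not meta_path.exists():
        return ref
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    for doc_id, entry in meta.items():
        # Assumption: the filename is the basename of the stored path.
        names = {entry.get("doc_name"), Path(entry.get("path") or "").name}
        names.discard(None)
        names.discard("")
        if ref in names:
            return doc_id
    return ref  # not a known name, so treat it as a bare doc_id
```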

Typical Call Flow (agentic RAG)

client.index("paper.pdf")              → doc_id
client.get_document(doc_id)            → doc metadata
client.get_document_structure(doc_id)  → tree (LLM picks relevant nodes)
client.get_page_content(doc_id, "12-15") → raw page text for answer generation