Competitive-exam-pipeline

░██████╗░██╗░░░██╗███████╗░██████╗████████╗██╗░█████╗░███╗░░██╗
██╔═══██╗██║░░░██║██╔════╝██╔════╝╚══██╔══╝██║██╔══██╗████╗░██║
██║██╗██║██║░░░██║█████╗░░╚█████╗░░░░██║░░░██║██║░░██║██╔██╗██║
╚██████╔╝██║░░░██║██╔══╝░░░╚═══██╗░░░██║░░░██║██║░░██║██║╚████║
░╚═██╔═╝░╚██████╔╝███████╗██████╔╝░░░██║░░░██║╚█████╔╝██║░╚███║
░░░╚═╝░░░░╚═════╝░╚══════╝╚═════╝░░░░╚═╝░░░╚═╝░╚════╝░╚═╝░░╚══╝

███████╗██╗░░██╗████████╗██████╗░░█████╗░░█████╗░████████╗██╗░█████╗░███╗░░██╗
██╔════╝╚██╗██╔╝╚══██╔══╝██╔══██╗██╔══██╗██╔══██╗╚══██╔══╝██║██╔══██╗████╗░██║
█████╗░░░╚███╔╝░░░░██║░░░██████╔╝███████║██║░░╚═╝░░░██║░░░██║██║░░██║██╔██╗██║
██╔══╝░░░██╔██╗░░░░██║░░░██╔══██╗██╔══██║██║░░██╗░░░██║░░░██║██║░░██║██║╚████║
███████╗██╔╝╚██╗░░░██║░░░██║░░██║██║░░██║╚█████╔╝░░░██║░░░██║╚█████╔╝██║░╚███║
╚══════╝╚═╝░░╚═╝░░░╚═╝░░░╚═╝░░╚═╝╚═╝░░╚═╝░╚════╝░░░░╚═╝░░░╚═╝░╚════╝░╚═╝░░╚══╝

██████╗░██╗██████╗░███████╗██╗░░░░░██╗███╗░░██╗███████╗
██╔══██╗██║██╔══██╗██╔════╝██║░░░░░██║████╗░██║██╔════╝
██████╔╝██║██████╔╝█████╗░░██║░░░░░██║██╔██╗██║█████╗░░
██╔═══╝░██║██╔═══╝░██╔══╝░░██║░░░░░██║██║╚████║██╔══╝░░
██║░░░░░██║██║░░░░░███████╗███████╗██║██║░╚███║███████╗
╚═╝░░░░░╚═╝╚═╝░░░░░╚══════╝╚══════╝╚═╝╚═╝░░╚══╝╚══════╝

Batch pipeline for exam question extraction and JSON dataset generation

Automatically downloads PDFs, performs OCR, extracts questions via multi-AI consensus, validates, and exports structured datasets.

What is Question-Pipeline?
Key Features
Architecture Overview
Supported Exams
Tech Stack
Directory Structure
Pipeline Flow — Detailed
JSON Schema
AI Provider System
Installation & Setup
Usage — CLI Commands
Web Dashboard
API Reference
Human Review Workflow
Cross-Validation System
Anti-Hallucination Design
ID Scheme
Topic Vocabulary
Validation — 32 Automated Checks
Error Handling & Self-Healing
Model Limits & Rate Limiting
Testing
Project Status
Environment Variables
Design Decisions

What is Question-Pipeline?

Question-Pipeline is an automated Node.js/TypeScript system, built on Node.js built-ins rather than runtime npm packages, that transforms exam PDFs into high-quality structured JSON datasets.

The pipeline handles the entire lifecycle:

Raw PDF  →  OCR  →  AI Extraction (6 providers)  →  Multi-Provider Consensus  →
Validation (32 checks)  →  Normalization  →  Export  →  Human Review  →  Sign-off

It is designed as both an autonomous AI-driven system (any AI agent can run it end-to-end) and a developer-managed pipeline (CLI + REST API + web dashboard).

The project was born from a simple observation: high-quality structured exam data is essential for EdTech platforms, tutoring apps, and AI training, yet extracting it from PDFs reliably at scale remains hard. Question-Pipeline addresses this with a multi-provider consensus architecture, anti-hallucination safeguards, and a human-in-the-loop review system.

Why This Exists

Exam PDFs are messy:

No standardized format across years or exam bodies
Answer keys are sometimes embedded, sometimes separate, sometimes missing entirely
Multi-language papers with mixed layouts
Diagrams embedded as images with no alt text
Handwritten annotations in scanned PDFs

Manual extraction doesn’t scale. Pure AI extraction hallucinates. Question-Pipeline bridges the gap with a defense-in-depth approach.

Key Features

Feature	Description
Multi-Provider AI Extraction	6 AI providers ranked by reliability; runs top 3 in parallel for consensus
Zero Runtime Dependencies	Pure Node.js built-ins (`fetch`, `http`, `crypto`, `fs`) — no Express, no MongoDB, no Docker
Mistral AI OCR	High-accuracy PDF-to-markdown with embedded image extraction
MinerU OCR	Dual-mode OCR (Agent + Precision APIs) — LaTeX formulas, HTML tables, image extraction, ZIP result parsing. Excels on complex NEET/JEE layouts
Multi-Provider Consensus	Majority-vote per field across 3 parallel AI providers with confidence scoring
32 Automated Validations	Per-type field checks, structural integrity, topic normalization, checksums
Auto-Repair	Detects missing answers, merged options, count mismatches and re-extracts intelligently
Human Review CLI + Dashboard	Interactive terminal review (vim-like keys) + web dashboard with SSE live updates
Cross-Validation	Compare two providers, generate HTML diff reports, review only disagreements
REST API + SSE	Native HTTP server with real-time event streaming, file browsing, and pipeline control
Checkpoint Resume	Every stage is checkpointed — resume from any failure
Anti-Hallucination by Design	Never fabricates data; fails loudly with honest errors
Topic Normalization	250+ aliases → controlled vocabulary, with Levenshtein + cosine similarity fallback
LaTeX Normalization	60+ LaTeX-to-Unicode mappings, OCR ligature fixes
Bilingual Support	`textHi` field for Hindi/English NEET papers
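
The Topic Normalization row can be illustrated with a minimal sketch. It assumes a lookup order of exact match, then alias, then Levenshtein fallback; the topic list, the alias entries, the `maxDistance` threshold, and the `null` return for unmatched input are all hypothetical, and the cosine-similarity fallback mentioned above is left out.

```typescript
// Hypothetical vocabulary; the real lists live in src/vocabulary.ts and are far larger.
const CONTROLLED_TOPICS: string[] = ["kinematics", "thermodynamics", "organic-chemistry"];
const ALIASES = new Map<string, string>([
  ["motion in a straight line", "kinematics"],
  ["heat and thermodynamics", "thermodynamics"],
  ["organic chem", "organic-chemistry"],
]);

// Levenshtein edit distance, two-row dynamic programming variant.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Returns a controlled topic, or null when nothing is close enough (assumed behavior).
function normalizeTopic(raw: string, maxDistance = 2): string | null {
  const key = raw.trim().toLowerCase();
  if (CONTROLLED_TOPICS.includes(key)) return key;
  const alias = ALIASES.get(key);
  if (alias !== undefined) return alias;
  let best: string | null = null;
  let bestDist = maxDistance + 1;
  for (const topic of CONTROLLED_TOPICS) {
    const d = levenshtein(key, topic);
    if (d < bestDist) {
      best = topic;
      bestDist = d;
    }
  }
  return best;
}
```

Returning `null` instead of the nearest topic at any distance keeps a typo-tolerant match from turning into a guess, which is in the spirit of the anti-hallucination rule.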

Architecture Overview

┌──────────────────────────────────────────────────────────────────────────┐
│                        QUESTION-PIPELINE                                 │
│                                                                          │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │ SCRAPERS │    │  INPUT   │    │   API    │    │  WEB DASHBOARD   │   │
│  │          │    │  FOLDER  │    │ SERVER   │    │                  │   │
│  │ NTA      │    │  input/  │    │ port 3456│    │ dashboard.html   │   │
│  │ GateOver │───▶│          │───▶│          │───▶│ pipeline-canvas  │   │
│  │ NCERT    │    │ *.pdf    │    │ REST+SSE │    │ flow-v2.html     │   │
│  │ Kaggle   │    │          │    │          │    │                  │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────────────┘   │
│                                                                          │
│  ┌──────────────────────────────────────────────────────────────────┐    │
│  │                    CORE PIPELINE                                  │    │
│  │                                                                   │    │
│  │  ┌──────┐    ┌──────┐    ┌──────────┐    ┌──────────┐           │    │
│  │  │ OCR  │───▶│CHUNK │───▶│ EXTRACT  │───▶│  MERGE   │           │    │
│  │  │Mistral│   │15pg+5│    │6 Providers│   │Dedup+Pick│           │    │
│  │  │  API  │    │overlap│   │Distributed│   │  Best    │           │    │
│  │  └──────┘    └──────┘    └──────────┘    └──────────┘           │    │
│  │       │                                              │          │    │
│  │       ▼                                              ▼          │    │
│  │  ┌──────────┐                                    ┌──────────┐   │    │
│  │  │ DIAGRAM  │                                    │ CONSENSUS│   │    │
│  │  │ CACHER   │                                    │ 3 Par-   │   │    │
│  │  │ PNG base64│                                   │ allel AI │   │    │
│  │  └──────────┘                                    │ + Vote   │   │    │
│  │                                                   │          │   │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌─────┴────┐    │    │
│  │  │VALIDATE  │──▶│ AUTO-   │──▶│ FINALIZE │──▶│  EXPORT  │    │    │
│  │  │32 Checks │  │ REPAIR   │  │ Normalize │   │ paper.json│   │    │
│  │  │          │  │ Re-extract│ │ Assign IDs │   │ subjects/ │   │    │
│  │  │          │  │ Fix opts  │ │ Checksums  │   │ index.json│   │    │
│  │  └──────────┘  └──────────┘  └──────────┘  └───────────┘    │    │
│  └──────────────────────────────────────────────────────────────────┘    │
│                                                                          │
│  ┌─────────────────────────────────────┐  ┌─────────────────────────┐   │
│  │         HUMAN REVIEW                 │  │   CROSS-VALIDATION      │   │
│  │                                      │  │                         │   │
│  │  ┌──────────┐  ┌───────────────┐    │  │  ┌────────┐  ┌────────┐ │   │
│  │  │REVIEW CLI│  │ BATCH SIGNOFF │    │  │  │PROVIDER│  │PROVIDER│ │   │
│  │  │j/k/a/e/s│  │ verified flag  │    │  │  │   A    │  │   B    │ │   │
│  │  │   f/q   │  │               │    │  │  └───┬────┘  └───┬────┘ │   │
│  │  └──────────┘  └───────────────┘    │  │      └────┬──────┘      │   │
│  │                                      │  │           ▼             │   │
│  │                                      │  │    ┌────────────┐      │   │
│  │                                      │  │    │ HTML DIFF  │      │   │
│  │                                      │  │    │ REPORT     │      │   │
│  │                                      │  │    └────────────┘      │   │
│  └─────────────────────────────────────┘  └─────────────────────────┘   │
│                                                                          │
│  ┌──────────────────────────────────────────────────────────────────┐    │
│  │                        DATA LAYER                                 │    │
│  │                                                                   │    │
│  │  data/                                                             │    │
│  │  ├── index.json          ← Master registry of all datasets        │    │
│  │  ├── jeemain/2025/22jan-s1/                                       │    │
│  │  │   ├── paper.json      ← Merged (all subjects)                  │    │
│  │  │   ├── physics.json    ← Subject file (numbers reset to 1)      │    │
│  │  │   ├── chemistry.json                                           │    │
│  │  │   ├── mathematics.json                                         │    │
│  │  │   └── diagrams/       ← Extracted PNG images                   │    │
│  │  ├── .checkpoints.json   ← Stage-level progress tracking          │    │
│  │  └── .tombstones.json    ← Removed IDs (never reused)             │    │
│  └──────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘

Pipeline Stages in Detail

STAGE 1: SCRAPE ──────────────────────────────────────────────────────
  Input:  Exam name, year, shift number(s)
  Action: Download PDF from NTA official site / Gateoverflow / NCERT
  Output: PDF file path in data/{exam}/raw/
  Retry:  3x on failure
  Fallback: Gateoverflow mirrors when NTA 404s

STAGE 2: OCR ─────────────────────────────────────────────────────────
  Input:  PDF file path
  Action: POST to Mistral AI OCR API (mistral-ocr-latest)
          Auto-detects bilingual pages (Hindi/English ratio check)
          For PDFs >3.5MB: split and re-OCR
  Output: Per-page markdown + base64 embedded images
  Rate limit: 60 req/min, exponential backoff on 429
  Two modes:
    - Standard: plain OCR → markdown
    - Enhanced: +structured annotations + bounding boxes (--use-enhanced-ocr)

STAGE 3: CHUNK ───────────────────────────────────────────────────────
  Input:  OCR pages array
  Action: Split into overlapping 15-page chunks with 5-page overlap
          Guarantees: any question no longer than the 5-page overlap appears whole in at least one chunk
  Output: Array of chunk markdown strings
  Used:   For PDFs >12 pages (distributed extraction)

STAGE 4: EXTRACT (AI) ────────────────────────────────────────────────
  Input:  Chunk markdown + exam metadata
  Action: 
    - For ≤12 pages: single-provider extraction (NVIDIA Qwen3 Coder 480B)
    - For >12 pages: distributed across providers round-robin
    - For consensus mode: 3 providers in parallel
  Providers: Poolside (7) > LongCat Lite (6) > NVIDIA Qwen (5) > 
             LongCat Chat (3) > Gemini (2) > Cerebras (1) > Vanchin (0)
  Output: PartialQuestion[] structured by AI
  
STAGE 5: MERGE ───────────────────────────────────────────────────────
  Input:  Multiple ChunkResult[] from overlapping chunks or providers
  Action: Deduplicate by question number
          Selection priority:
            1. Completeness score (more filled fields)
            2. Non-empty answer > empty answer
            3. Longer options > shorter
            4. Provider reliability ranking
            5. Earlier chunk index
          Uses Mistral embeddings API for semantic similarity
          Falls back to Jaccard word-set similarity
  Output: Merged PartialQuestion[] with passages

STAGE 5b: CONSENSUS (Parallel) ───────────────────────────────────────
  Input:  OCR pages (when --use-consensus flag)
  Action: Run Poolside + LongCat Lite + NVIDIA Qwen3 Coder in parallel
          Each provider independently extracts questions from same pages
          Majority-vote per field across 3 results
          Conflict detection when agreement < 2/3
  Confidence: high (≥0.8), medium (≥0.5), low (<0.5)
  Output: ConsensusResult with confidence scores + conflict list

STAGE 6: DIAGRAM CACHE ───────────────────────────────────────────────
  Input:  OCR result with base64 images
  Action: Decode base64 → save as PNG files
  Output: data/{exam}/{year}/{shift}/diagrams/{subject}/q{num}-fig{n}.png

STAGE 7: VALIDATE ────────────────────────────────────────────────────
  Input:  Question[] array
  Action: 32 automated checks (see validation section)
  Per-type rules:
    - MCQ: 3-5 options, answer is 0-based index
    - MSQ: 4-6 options, answers sorted, ≥1 correct
    - NAT: options=null, negativeMarks=0, answer numeric
    - Assertion-Reason: options=null, answer in "0"|"1"|"2"|"3"
  Output: ValidationResult with passed/failed/warnings count

STAGE 7b: AUTO-REPAIR ───────────────────────────────────────────────
  Triggers: question count mismatch, missing answers, merged options
  Actions: Re-extract answer key pages, split merged options,
           re-extract with strict prompt on count mismatch

STAGE 8: FINALIZE ────────────────────────────────────────────────────
  Input:  PartialQuestion[] + metadata
  Actions:
    1. normalizeText() — LaTeX→Unicode, OCR ligature fixes
    2. normalizeTopic() — free-form → controlled vocabulary
    3. assignIds() — globally unique IDs with tombstone tracking
    4. build QuestionFile wrapper
    5. compute SHA-256 checksum
    6. write subject JSON files FIRST (physics.json etc.)
    7. write paper.json SECONDARY (merged from subjects)
    8. update data/index.json master registry
  Output: Structured JSON files on disk

STAGE 9: REVIEW ──────────────────────────────────────────────────────
  Input:  Exported QuestionFile
  Action: Human reviews each question
  Keys: a=accept, e=edit, s=skip, f=flag, j/k=navigate, q=quit
  Supports: Resume from saved progress (.review-progress.json)
  Sign-off: Mark shift as verified / needs-review

STAGE 10: CROSS-VALIDATE ─────────────────────────────────────────────
  Input:  Two QuestionFile from different providers
  Action: Compare field-by-field, find agreements and disagreements
  Output: HTML diff report
  Benefit: Review only 5-15 disagreements instead of 90 questions
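
The Stage 5 selection priority can be sketched as a single comparator. This is an illustration under assumptions, not the code in merger.ts: the `Candidate` shape, the definition of the completeness score, the numeric `providerRank` scale, and the name `mergeByNumber` are hypothetical, and the embeddings and Jaccard similarity steps are omitted.

```typescript
interface Candidate {
  number: number;
  text: string;
  options: string[] | null;
  answer: string;
  topic: string;
  providerRank: number; // higher means more reliable (hypothetical scale)
  chunkIndex: number;
}

// Completeness score: number of filled fields (assumed definition).
function completeness(c: Candidate): number {
  const filled = [c.text, c.answer, c.topic].filter((v) => v.trim() !== "").length;
  return filled + (c.options !== null && c.options.length > 0 ? 1 : 0);
}

function optionsLength(c: Candidate): number {
  return (c.options ?? []).join("").length;
}

// Negative result means `a` is preferred over `b`. Each criterion only
// breaks ties left by the one before it, matching the priority list.
function compareCandidates(a: Candidate, b: Candidate): number {
  return (
    completeness(b) - completeness(a) ||
    Number(b.answer !== "") - Number(a.answer !== "") ||
    optionsLength(b) - optionsLength(a) ||
    b.providerRank - a.providerRank ||
    a.chunkIndex - b.chunkIndex
  );
}

// Keep the best candidate per question number, sorted by number.
function mergeByNumber(candidates: Candidate[]): Candidate[] {
  const best = new Map<number, Candidate>();
  for (const c of candidates) {
    const current = best.get(c.number);
    if (current === undefined || compareCandidates(c, current) < 0) best.set(c.number, c);
  }
  return Array.from(best.values()).sort((x, y) => x.number - y.number);
}
```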

Supported Exams (Configuration)

You define your exam in src/types.ts. Example configurations:

Exam	Code	Subjects	Duration	Marking
JEE Main	`jeemain`	physics, chemistry, mathematics	180 min	+4 / -1 / 0
NEET UG	`neet`	physics, chemistry, biology	200 min	+4 / -1 / 0
Your Exam	`your-exam`	your subjects	your duration	your marking

Add your own exam config in src/types.ts → add to the ExamCode type and define sections.
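
A minimal sketch of what such a configuration might look like follows. Only `ExamCode` and `SectionConfig` are names from this README; `ExamConfig`, `EXAMS`, and the section values for `your-exam` are hypothetical placeholders. Subjects, durations, and marking for JEE Main and NEET UG are taken from the table above, and their sections are left empty rather than guessed.

```typescript
interface SectionConfig {
  label: string;
  total: number;
  required: number;
  mandatory: boolean;
}

type ExamCode = "jeemain" | "neet" | "your-exam";

// Hypothetical wrapper type; the real shape in src/types.ts may differ.
interface ExamConfig {
  subjects: string[];
  duration: number; // minutes
  marksCorrect: number;
  marksIncorrect: number;
  marksUnanswered: number;
  sections: Record<string, SectionConfig>;
}

const EXAMS: Record<ExamCode, ExamConfig> = {
  jeemain: {
    subjects: ["physics", "chemistry", "mathematics"],
    duration: 180,
    marksCorrect: 4,
    marksIncorrect: -1,
    marksUnanswered: 0,
    sections: {}, // not specified in this README
  },
  neet: {
    subjects: ["physics", "chemistry", "biology"],
    duration: 200,
    marksCorrect: 4,
    marksIncorrect: -1,
    marksUnanswered: 0,
    sections: {}, // not specified in this README
  },
  "your-exam": {
    subjects: ["subject-one"], // placeholder
    duration: 120, // placeholder
    marksCorrect: 1, // placeholder
    marksIncorrect: 0, // placeholder
    marksUnanswered: 0,
    sections: {
      a: { label: "Section A", total: 20, required: 20, mandatory: true }, // placeholder
    },
  },
};
```

Typing the table as `Record<ExamCode, ExamConfig>` means that adding a code to `ExamCode` without a matching config entry fails compilation.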

Tech Stack

Layer	Technology
Runtime	Node.js 18+ (native `fetch`, `http`, `crypto`)
Language	TypeScript 5.7 — strict mode, ES2022 target
Module System	ES Modules (`"type": "module"`)
Compiler	`tsc` + `tsx` for direct execution
Testing	Vitest 4.1.7
OCR	Mistral AI OCR API (`mistral-ocr-latest`) or MinerU OCR (Agent + Precision APIs)
AI Providers	NVIDIA Qwen3 Coder 480B, LongCat Flash Lite, Poolside Laguna M.1, Vanchin KAT-Coder, Gemini 3.1 Flash Lite, Cerebras GPT-OSS-120B
Database	JSON files on disk (no Docker, no SQL, no MongoDB)
Server	Native `http` module (no Express/Fastify)
Human Review	Terminal CLI (`readline-sync`) + Web Dashboard
Platform	Windows (PowerShell 5.1) / Linux / macOS

Zero Runtime Dependencies

The project intentionally has zero runtime npm dependencies. Every API call uses Node.js built-in fetch(). The HTTP server uses the native http module. File operations use fs/promises. The devDependencies are TypeScript tooling, plus readline-sync for the review CLI:

typescript    — compilation
vitest        — test runner
@types/node   — type definitions
readline-sync — CLI review input (the one devDependency also loaded at runtime, by the review CLI)

Directory Structure

C:\QUESTION-PIPELINE\
│
├── package.json                # Project manifest, 11 npm scripts
├── tsconfig.json               # TypeScript: strict, ES2022, NodeNext
├── vitest.config.ts            # Vitest test runner config
├── .env.example                # API key template (7 providers)
├── .env                        # Actual API keys (gitignored)
├── .gitignore
├── .checkpoints.json           # Auto-tracked processing state
│
├── AGENT.md                    # AI agent operational manual (root-level for agent discovery)
├── prompts/                    # Supporting AI prompts
│   ├── one-shot-prompt.md      # One-shot prompt for AI chat apps (PDF only, no diagrams)
│   └── AI-START-COMMAND.md     # AI session start instructions
│
├── docs/                       # User & developer documentation
│   ├── getting-started.md      # Quick-start for humans
│   ├── human-intervention.md   # When to pause and ask for help
│   ├── model-limits.md         # Provider rate limits reference
│   └── previous-plans/         # Historical design documents
│       ├── PLAN.md
│       ├── PLAN_V2.md
│       ├── PLAN_V3.md
│       └── PLAN_V4.md
│
├── src/                        # SOURCE CODE
│   ├── index.ts                # CLI entry point — command router
│   ├── types.ts                # ALL TypeScript interfaces (canonical schema)
│   ├── vocabulary.ts           # 250+ topic aliases, controlled tags, normalization
│   │
│   ├── scrapers/               # PDF downloaders
│   │   ├── nta-scraper.ts      # NTA official site (JEE Main / NEET)
│   │   ├── gateoverflow-scraper.ts  # Gateoverflow mirrors (JEE Adv, JEE Main)
│   │   ├── ncert-scraper.ts    # NCERT Exemplar Class 11/12
│   │   └── kaggle-importer.ts  # Import existing Kaggle datasets
│   │
│   ├── extractors/             # Core extraction pipeline
│   │   ├── ocr-stage.ts        # OCR router — Mistral + MinerU support
│   │   ├── chunker.ts          # Split large PDFs into overlapping chunks
│   │   ├── structurer.ts       # Single-provider AI extraction
│   │   ├── consensus-extractor.ts  # Multi-provider consensus (3 in parallel)
│   │   ├── merger.ts           # Merge overlapping chunk results
│   │   ├── diagram-cacher.ts   # Decode/save diagram images from OCR
│   │   ├── mineru-ocr.ts       # MinerU OCR (Agent + Precision APIs, ZIP extraction)
│   │   ├── auto-repair.ts      # Auto-detect and fix extraction issues
│   │   └── progressive-review.ts   # Chunk-by-chunk human-in-loop
│   │
│   ├── validators/
│   │   ├── field-checker.ts    # Per-type field validation
│   │   └── auto-validator.ts   # 32 automated validation checks
│   │
│   ├── finalizers/
│   │   ├── id-assigner.ts      # Global unique ID generation
│   │   ├── normalizer.ts       # LaTeX→Unicode, OCR ligature fixes
│   │   ├── topic-normalizer.ts # Map free-form topics to controlled vocabulary
│   │   └── exporter.ts         # Write JSON files, compute checksums, update index
│   │
│   ├── cross-validate/
│   │   ├── cross-validator.ts  # Compare two model outputs, generate diff report
│   │   └── diff-viewer.ts      # HTML diff viewer
│   │
│   ├── review/
│   │   ├── pdf-renderer.ts     # Terminal-based question renderer
│   │   ├── review-cli.ts       # Interactive terminal review
│   │   └── batch-signoff.ts    # Mark shifts as verified/needs-review
│   │
│   ├── api/
│   │   └── server.ts           # Native http server (port 3456), SSE, dashboard
│   │
│   └── utils/
│       ├── logger.ts           # Structured logging (debug/info/warn/error)
│       ├── rate-limiter.ts     # Queue + window-based API throttling
│       ├── checkpoints.ts      # Stage-level progress tracking for resume
│       ├── embeddings.ts       # Mistral embeddings API + semantic similarity
│       ├── hash-utils.ts       # SHA-256 checksum computation
│       ├── integrity.ts        # Walk data/ and verify checksums
│       ├── metrics.ts          # Accuracy tracking against golden datasets
│       └── pdf-downloader.ts   # Download PDF with retry + validation
│
├── scripts/                    # Executable scripts (via npm run)
│   ├── interactive.ts          # Interactive TUI wizard (OCR select, env check, progress)
│   ├── process-pdf.ts          # Main entry for manual PDF processing
│   ├── batch-process.ts        # Process all PDFs in input/
│   ├── test-models.ts          # Health check for all AI providers
│   ├── test-mistral-structured.ts  # Test Mistral structured annotations
│   ├── test-full-pipeline.ts   # End-to-end pipeline test
│   ├── verify-all.ts           # Verify all processed datasets
│   ├── rebuild-index.ts        # Regenerate index.json from disk
│   ├── export-for-opensource.ts  # Export with license for sharing
│   └── stats.ts                # Print dataset statistics
│
├── tests/                      # Test suites
│   ├── unit/
│   │   ├── merger.test.ts
│   │   ├── consensus.test.ts
│   │   ├── auto-repair.test.ts
│   │   ├── chunker.test.ts
│   │   └── topic-normalizer.test.ts
│   ├── integration/
│   │   └── golden-dataset.test.ts
│   └── fixtures/
│       └── golden-jeemain-sample.json
│
├── dashboard.html              # Web dashboard UI
├── pipeline-canvas.html        # Pipeline visualization
└── pipeline-flow-v2.html       # Pipeline flow diagram

File Role Summary

Path	Role
`src/types.ts`	The Schema — all interfaces, enums, literals. Schema version `"v4"`.
`src/vocabulary.ts`	Controlled Vocabulary — 250+ aliases, 4 subject tag lists, multi-strategy normalization
`src/scrapers/*`	PDF Downloaders — NTA official, Gateoverflow mirrors, NCERT, Kaggle import
`src/extractors/ocr-stage.ts`	OCR Engine — Mistral AI OCR with rate limiting, bilingual detection, enhanced mode
`src/extractors/chunker.ts`	PDF Chunker — 15-page overlapping chunks; the overlap keeps questions from being cut at chunk boundaries
`src/extractors/structurer.ts`	AI Structurer — Single-provider + distributed extraction with provider priority chain
`src/extractors/consensus-extractor.ts`	Consensus Engine — 3 parallel providers, majority-vote, confidence scoring
`src/extractors/merger.ts`	Chunk Merger — Dedup with multi-criteria scoring, semantic similarity
`src/extractors/auto-repair.ts`	Self-Healing — Detects/fixes missing answers, merged options, count mismatches
`src/validators/auto-validator.ts`	32-Check Validator — IDs, types, options, diagrams, passages, topics, Unicode
`src/finalizers/exporter.ts`	Output Generator — Normalize, assign IDs, write files, compute checksums
`src/review/review-cli.ts`	Human Review — Interactive terminal with vim-like keys, progress save/resume
`src/api/server.ts`	API Server — Native http, REST endpoints, SSE streaming, file serving
`src/utils/rate-limiter.ts`	Rate Limiter — Queue + sliding window, per-provider configs
`src/utils/embeddings.ts`	Embeddings — Mistral embeddings API with LRU cache, cosine similarity

JSON Schema

`QuestionFile` — Top-Level Wrapper

interface QuestionFile {
  schema: string;           // "v4"
  exam: string;             // "jeemain" | "neet" | "jeeadv" | "ncert-exemplar"
  year: number | null;
  shift: string | null;     // "22jan-s1", "04may", "p1"
  paper: string | null;
  subjects: string[];       // ["physics", "chemistry", "mathematics"]
  total: number;            // Total question count
  duration: number;         // Exam duration in minutes
  marksCorrect: number;     // e.g. 4
  marksIncorrect: number;   // e.g. -1
  marksUnanswered: number;  // e.g. 0
  sections: Record<string, SectionConfig>;
  scrapedAt: string;        // ISO 8601
  answerKeyFound: boolean;
  checksum: string;         // SHA-256 (computed before this field is added)
  questions: Question[];
  passages: Passage[];
}
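
Because the checksum is computed before the `checksum` field is added, verification has to strip that field again before hashing. The sketch below shows one way this could work. The key-sorted serialization and the helper names `stableStringify`, `computeChecksum`, and `verifyChecksum` are assumptions; the README does not say how the file is serialized before hashing.

```typescript
import { createHash } from "node:crypto";

// Serialize with keys sorted recursively so the hash does not depend on
// property insertion order (an assumption, not the documented format).
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .filter((k) => obj[k] !== undefined)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// SHA-256 over the file content with the checksum field excluded.
function computeChecksum(file: Record<string, unknown>): string {
  const { checksum: _excluded, ...rest } = file;
  return createHash("sha256").update(stableStringify(rest), "utf8").digest("hex");
}

// Recompute and compare; false means the file changed after export.
function verifyChecksum(file: Record<string, unknown>): boolean {
  const stored = file["checksum"];
  return typeof stored === "string" && stored === computeChecksum(file);
}
```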

`Question` — Individual Question

interface Question {
  id: string;               // "jeemain-2025-22jan-s1-ph-001"
  number: number;           // 1-N within subject file
  numberLabel: string | null; // "1(a)", "1(b)" for JEE Advanced sub-questions
  subject: string;          // "physics" | "chemistry" | "mathematics" | "biology"
  topic: string;            // Controlled vocabulary
  section: string | null;   // "a", "b", "section-1", etc.
  type: "mcq" | "msq" | "nat" | "assertion-reason";
  text: string;             // Question text (English)
  textHi: string | null;    // Hindi text (NEET bilingual)
  options: string[] | null; // 3-5 for MCQ, 4-6 for MSQ, null for NAT/AR
  answer: string;           // MCQ: 0-based option index ("0"-"3" for four options). NAT: numeric. AR: "0"-"3"
  answers: string[] | null; // MSQ: sorted indices ["1","3"]
  answerPrecision: { type: "exact" | "integer-range" | "decimal-range"; min?: number; max?: number; unit?: string } | null;
  marks: number;
  negativeMarks: number;    // 0 for NAT
  passageId: string | null;
  solution: string | null;
  solutionFormat: "plain" | "html" | "markdown" | "latex" | null;
  hasDiagram: boolean;
  diagrams: Diagram[] | null;
  difficulty: "easy" | "medium" | "hard" | null;
  tags: string[];           // Controlled vocabulary tags
  revision: number;         // Starts at 1, increments on edit
  source: "official-pdf" | "reconstructed" | "imported-kaggle" | "imported-dataset";
  confidence: "high" | "medium" | "low" | null;
}
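
The per-type rules from Stage 7 map onto these fields roughly as follows. This is a sketch of four rules only, not the 32-check validator; `QuestionLike` and `checkQuestion` are hypothetical names, and treating an empty `answer` as valid for MCQ, NAT, and Assertion-Reason is an assumption based on the rule that undetected answers are stored as empty strings.

```typescript
type QuestionType = "mcq" | "msq" | "nat" | "assertion-reason";

interface QuestionLike {
  type: QuestionType;
  options: string[] | null;
  answer: string;
  answers: string[] | null;
  negativeMarks: number;
}

// True when s is a non-negative integer string smaller than n.
function isIndexBelow(s: string, n: number): boolean {
  return /^\d+$/.test(s) && Number(s) < n;
}

// Returns a list of rule violations; an empty list means the question passes.
function checkQuestion(q: QuestionLike): string[] {
  const errors: string[] = [];
  const n = q.options === null ? 0 : q.options.length;
  switch (q.type) {
    case "mcq":
      if (n < 3 || n > 5) errors.push("mcq: expected 3-5 options");
      if (q.answer !== "" && !isIndexBelow(q.answer, n)) errors.push("mcq: answer must be a 0-based option index");
      break;
    case "msq": {
      const picked = q.answers ?? [];
      if (n < 4 || n > 6) errors.push("msq: expected 4-6 options");
      if (picked.length < 1) errors.push("msq: at least one correct answer required");
      if (!picked.every((x) => isIndexBelow(x, n))) errors.push("msq: answers must be option indices");
      const sorted = picked.slice().sort((x, y) => Number(x) - Number(y));
      if (picked.join(",") !== sorted.join(",")) errors.push("msq: answers must be sorted");
      break;
    }
    case "nat":
      if (q.options !== null) errors.push("nat: options must be null");
      if (q.negativeMarks !== 0) errors.push("nat: negativeMarks must be 0");
      if (q.answer !== "" && !/^-?\d+(\.\d+)?$/.test(q.answer)) errors.push("nat: answer must be numeric");
      break;
    case "assertion-reason":
      if (q.options !== null) errors.push("assertion-reason: options must be null");
      if (q.answer !== "" && ["0", "1", "2", "3"].indexOf(q.answer) === -1) errors.push("assertion-reason: answer must be 0-3");
      break;
  }
  return errors;
}
```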

`Passage` — Passage-Based Questions

interface Passage {
  id: string;
  text: string;
  textHi: string | null;
  diagrams: Diagram[] | null;
  questions: string[];       // Question IDs that reference this passage
}

`SectionConfig` — Exam Section Structure

interface SectionConfig {
  label: string;            // "Section A"
  total: number;            // Total questions in section
  required: number;         // Questions to attempt
  mandatory: boolean;       // Must attempt?
}

Output Structure on Disk

data/{exam}/{year}/{shift}/
├── physics.json              ← PRIMARY: questions[1-N] for physics
├── chemistry.json            ← PRIMARY: questions[1-N] for chemistry
├── mathematics.json          ← PRIMARY: questions[1-N] for mathematics
├── biology.json              ← PRIMARY: NEET only
├── paper.json                ← SECONDARY: merged from all subject files
└── diagrams/
    ├── physics/
    │   ├── q001-fig1.png
    │   └── ...
    ├── chemistry/
    └── biology/

Subject files are written FIRST with question numbers reset to 1-N within each subject. paper.json is built SECONDARY by merging subject files. IDs remain globally unique across all files.

AI Provider System

⚠️ Living document — free-tier models change frequently. This pipeline is built on free-tier AI providers. Model names, rate limits, and availability change over time. The ranking, priorities, and strategies below reflect the current landscape. When a provider drops or changes its free tier, use AI agentic coding tools to update provider configs in src/extractors/structurer.ts, src/utils/rate-limiter.ts, and tests/ — the architecture stays the same, only the endpoints and limits shift.

Provider Ranking (by extraction reliability)

Rank  Provider            Model                  RPM   Context   Daily Free
────────────────────────────────────────────────────────────────────────────
1     Poolside            Laguna M.1            100    131K      Unlimited(preview)
2     LongCat (Lite)      Flash-Lite             30    256K      50M tokens
3     NVIDIA (Qwen)       Qwen3 Coder 480B       40    262K      2,400 RPD
4     NVIDIA (Mistral)    Mistral-Large-3        40    262K      2,400 RPD
5     LongCat (Chat)      Flash-Chat             30    256K      500K tokens
6     Gemini              3.1 Flash Lite         15      1M      500 RPD
7     Cerebras            GPT-OSS-120B            5     65K      2,400 RPD
8     Vanchin             KAT-Coder-Air-V1       20      2M      28,800 RPD

How Extraction Works

Single-Provider Mode (structurer.ts):

Tries providers in ranking order
If primary fails → next provider (auto-failover)
For PDFs >12 pages: splits into 15-page overlapping chunks, distributes across providers round-robin in parallel, merges results

Consensus Mode (consensus-extractor.ts, --use-consensus):

Runs Poolside + LongCat Lite + NVIDIA Qwen3 Coder in parallel on the same pages
Each provider independently extracts questions
Per-question, per-field majority vote (2/3 agreement)
Confidence scoring: high (≥0.8), medium (≥0.5), low (<0.5)
Conflict detection with detailed report for human resolution
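
The per-field vote can be sketched as follows. Two points are assumptions rather than documented behavior: the confidence score is taken to be the share of providers that agree on the winning value, and a field in conflict yields `null` instead of a guessed value. `FieldVote` and `voteOnField` are hypothetical names.

```typescript
type Confidence = "high" | "medium" | "low";

interface FieldVote {
  value: string | null; // winning value, or null when agreement is below 2/3
  agreement: number;    // share of providers that returned the winning value
  confidence: Confidence;
  conflict: boolean;
}

// Thresholds as listed above: high >= 0.8, medium >= 0.5, low otherwise.
function confidenceFor(score: number): Confidence {
  return score >= 0.8 ? "high" : score >= 0.5 ? "medium" : "low";
}

// Majority vote over the values each provider extracted for one field.
function voteOnField(values: string[]): FieldVote {
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);

  let winner: string | null = null;
  let top = 0;
  for (const [v, n] of counts) {
    if (n > top) {
      winner = v;
      top = n;
    }
  }

  const agreement = values.length === 0 ? 0 : top / values.length;
  const conflict = agreement < 2 / 3;
  return {
    value: conflict ? null : winner,
    agreement,
    confidence: confidenceFor(agreement),
    conflict,
  };
}
```

Under these assumptions, three providers give only three outcomes: unanimous (high), two of three (medium), and no majority (low, flagged as a conflict for human resolution).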

Distributed Extraction (PDFs >12 pages):

splitIntoChunks(pages, chunkSize=15, overlap=5) → overlapping groups
Each chunk assigned to different providers
Results merged via mergeChunks() with dedup and quality scoring
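
A minimal sketch of the chunking step, assuming the stride between chunk starts is `chunkSize - overlap` (10 pages with the defaults). The function name and parameters follow the call shown above; the body is an illustration, not the code in chunker.ts.

```typescript
// Split pages into chunks of `chunkSize` pages where consecutive chunks
// share `overlap` pages.
function splitIntoChunks<T>(pages: T[], chunkSize = 15, overlap = 5): T[][] {
  if (overlap < 0 || chunkSize <= overlap) {
    throw new RangeError("chunkSize must be greater than overlap, and overlap must be >= 0");
  }
  const stride = chunkSize - overlap;
  const chunks: T[][] = [];
  for (let start = 0; start < pages.length; start += stride) {
    chunks.push(pages.slice(start, start + chunkSize));
    // Stop once a chunk reaches the last page, so no trailing chunk
    // made only of already-covered pages is emitted.
    if (start + chunkSize >= pages.length) break;
  }
  return chunks;
}
```

With the defaults, a 30-page document yields three chunks covering pages 1-15, 11-25, and 21-30.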

Answer Key Detection

The system automatically detects answers using two strategies:

Strategy 1: Answer Key at End (separate pages)

Scans last 10 pages for answer key patterns using regex:
- answer key / answer : / ans: headings
- Numbered answer tables (e.g., | Q | Ans |)
- NTA-style patterns: 1(2), 2(4), 3(1)
- Sequential number + answer letter patterns
If 20+ answer patterns found on a page → classified as answer key page
Detected answer key pages are appended to ALL chunks during distributed extraction for 99% accuracy
User is prompted to confirm auto-detected answer key pages (can be skipped with --skip-answer-key-prompt)
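
The page classification in Strategy 1 can be sketched as pattern counting against a threshold. The regular expressions below are illustrative guesses at the pattern families listed above, not the ones the pipeline uses, and `findAnswerKeyPages` is a hypothetical name.

```typescript
// Illustrative patterns only; the real expressions may differ.
const ANSWER_PATTERNS: RegExp[] = [
  /answer\s*key|answers?\s*:|\bans\s*:/gi,                     // headings
  /^\|\s*\d{1,3}\s*\|\s*[1-4A-Da-d]\s*\|/gm,                   // table row: | 12 | 3 |
  /\b\d{1,3}\s*\(\s*[1-4]\s*\)/g,                              // NTA style: 1(2), 2(4)
  /\b\d{1,3}\s*[.):-]\s*[A-D](?=\s*(?:[,;|]|\d|$))/gm,         // number + letter: 12. B
];

// Total number of answer-like matches on one page of OCR markdown.
function countAnswerPatterns(page: string): number {
  let total = 0;
  for (const re of ANSWER_PATTERNS) {
    const hits = page.match(re);
    total += hits === null ? 0 : hits.length;
  }
  return total;
}

// Indices of pages, among the last `lastN`, that look like answer key pages.
function findAnswerKeyPages(pages: string[], lastN = 10, threshold = 20): number[] {
  const start = Math.max(0, pages.length - lastN);
  const found: number[] = [];
  for (let i = start; i < pages.length; i++) {
    if (countAnswerPatterns(pages[i]) >= threshold) found.push(i);
  }
  return found;
}
```

The lookahead on the last pattern keeps a question that begins "1. A particle..." from being counted as the answer "1. A".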

Strategy 2: Inline Answers (throughout document)

If no answer key pages are found in the last 10 pages, the system scans the full document for inline answer markers
Detects patterns like [Ans: 2], (Ans: 3), {Ans: 1} appearing next to question options
When inline answers are detected, each chunk extracts answers independently from the markers found within it
No separate answer key page needed — answers are embedded in the question text itself

When detected: answers are extracted from the key or inline markers and matched by question number. When NOT detected: ALL answers set to empty string (anti-hallucination rule).
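
A small sketch of inline marker extraction combined with the empty-string rule. The regular expression and the name `extractInlineAnswer` are hypothetical. The function returns the marker value as printed; mapping it to the schema's 0-based index is a separate step not shown here.

```typescript
// Matches [Ans: 2], (Ans: 3) and {Ans: 1}; illustrative pattern only.
const INLINE_ANSWER = /[\[({]\s*Ans\s*:\s*(\d+)\s*[\])}]/i;

// Returns the marker value as printed, or "" when no marker is present.
// Nothing is inferred when the marker is missing (anti-hallucination rule).
function extractInlineAnswer(questionText: string): string {
  const match = INLINE_ANSWER.exec(questionText);
  return match === null ? "" : match[1];
}
```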

Installation & Setup

Prerequisites

Node.js 18+ (for native fetch)
npm
At minimum: Mistral AI API key (for Mistral OCR) or MinerU API key (for MinerU OCR)

1. Clone & Install

cd C:\QUESTION-PIPELINE
npm install
npx tsc --noEmit      # Verify compilation — must pass with 0 errors

2. Set API Keys

Copy .env.example to .env and add your keys:

MISTRAL_API_KEY=sk-...           # Required for Mistral OCR
MINERU_API_KEY=eyJ...            # Token for MinerU Precision API (optional; Agent API fallback)
NVIDIA_API_KEY=nvapi-...         # Optional — primary extraction
LONGCAT_API_KEY=sk-...           # Optional — 50M tokens/day free
POOLSIDE_API_KEY=...             # Optional — unlimited(preview) free
VC_API_KEY=...                   # Optional — Vanchin KAT-Coder
GEMINI_API_KEY=AIzaSy...         # Optional — 500 RPD
CEREBRAS_API_KEY=...             # Optional — fallback

Or set them as environment variables:

$env:MISTRAL_API_KEY = "sk-..."
$env:NVIDIA_API_KEY = "nvapi-..."

3. Verify Connectivity

npm run test-models
# Tests all configured providers and reports which are reachable

Usage — CLI Commands

Quick Start: Process a PDF

# Drop a PDF in input/ folder, then:
npm run process-pdf -- --input "input/neet-2025-04may-s1.pdf"

# With MinerU OCR (better for complex layouts, formulas, tables):
npm run process-pdf -- --input "input/neet-2025-04may-s1.pdf" --ocr mineru

# With multi-provider consensus (3 providers in parallel):
npm run process-pdf -- --input "input/neet-2025-04may-s1.pdf" --use-consensus

# With enhanced OCR (structured annotations):
npm run process-pdf -- --input "input/neet-2025-04may-s1.pdf" -e

# Full power: consensus + enhanced OCR:
npm run process-pdf -- --input "input/neet-2025-04may-s1.pdf" -c -e

# PDF + separate answer key PDF (official NTA):
npm run process-pdf -- --input "question.pdf" --answer-key "answer-key.pdf"

# Interactive TUI wizard (no flags needed):
npm run interactive

All Commands

# ─── SETUP ──────────────────────────────────────────────────────────
npm install                        # Install dev dependencies
npx tsc --noEmit                   # Verify TypeScript compilation

# ─── SCRAPING ───────────────────────────────────────────────────────
npm run scrape -- --exam jeemain --year 2025 --shifts 2
npm run scrape -- --exam neet --year 2024
npm run scrape -- --exam jeeadv --year 2024

# ─── FULL PIPELINE ──────────────────────────────────────────────────
npm run batch -- --exam jeemain --year 2025 --shift 22jan-s1

# ─── MANUAL PDF PROCESSING ──────────────────────────────────────────
npm run process-pdf -- --input "input/paper.pdf"
npm run process-pdf -- --input "input/paper.pdf" --ocr mineru
npm run process-pdf -- --input "input/paper.pdf" --use-consensus
npm run process-pdf -- --input "input/paper.pdf" -c -e
npm run interactive              # Launch interactive TUI wizard

# ─── INDIVIDUAL STAGES ──────────────────────────────────────────────
npx tsx src/extractors/ocr-stage.ts --input data/jeemain/raw/file.pdf --output data/jeemain/ocr/
npx tsx src/extractors/structurer.ts --input data/jeemain/ocr/file.json --output data/jeemain/extracted/
npx tsx src/extractors/diagram-cacher.ts --input data/jeemain/ocr/file.json
npx tsx src/validators/auto-validator.ts --path data/jeemain/2025/22jan-s1/
npx tsx src/finalizers/exporter.ts --exam jeemain --year 2025 --shift 22jan-s1

# ─── REVIEW ─────────────────────────────────────────────────────────
npm run review -- --exam jeemain --year 2025 --shift 22jan-s1
npm run signoff -- --exam jeemain --year 2025 --shift 22jan-s1 --status verified

# ─── VERIFICATION ───────────────────────────────────────────────────
npm run verify                      # Verify all dataset checksums
npm run rebuild-index               # Regenerate data/index.json
npm run stats                       # Print dataset statistics
npm run status                      # Show checkpoint table

# ─── API SERVER ─────────────────────────────────────────────────────
npm run api                         # Start on http://localhost:3456

# ─── CROSS-VALIDATION ──────────────────────────────────────────────
npx tsx src/extractors/structurer.ts --model cerebras --input ... --output cerebras.json
npx tsx src/extractors/structurer.ts --model gemini --input ... --output gemini.json
npx tsx src/cross-validate/cross-validator.ts --a cerebras.json --b gemini.json

# ─── EXPORT ─────────────────────────────────────────────────────────
npm run export -- --license cc-by-4.0 --output ./export

# ─── TESTING ────────────────────────────────────────────────────────
npm run test                        # Run all Vitest tests
npm run test-mistral -- "input/paper.pdf"  # Test Mistral structured annotations
npm run test-full-pipeline -- "input/paper.pdf"  # End-to-end test

Filename Pattern Recognition

The process-pdf script automatically parses exam/year/shift from filenames:

Filename Pattern	Parsed As
`JEE-Main-2025-22-Jan-Shift-1.pdf`	jeemain, 2025, `22jan-s1`
`neet-2024-04-may.pdf`	neet, 2024, `04may-s1`
`jee-advanced-2024-paper-1.pdf`	jeeadv, 2024, `p1`
`NCERT-Exemplar-11-Physics.pdf`	ncert-exemplar, class 11

If parsing fails, the script asks for the --exam, --year, and --shift flags.
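
To illustrate the kind of parsing the table describes, here is a rough sketch that handles the first three rows. It is not the code in process-pdf: `parseExamFilename`, the regular expressions, and the default of shift 1 for dated files without a shift are assumptions drawn from the examples, and the NCERT pattern is left out.

```typescript
interface ParsedName {
  exam: string;
  year: number;
  shift: string;
}

// Exam prefixes as they appear in filenames, mapped to the exam codes in the table.
const EXAM_CODES: Record<string, string> = {
  "jee-main": "jeemain",
  "jee-advanced": "jeeadv",
  neet: "neet",
};

const MONTHS = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"];

function parseExamFilename(filename: string): ParsedName | null {
  const name = filename.toLowerCase().replace(/\.pdf$/, "");
  const m = /^(jee-main|jee-advanced|neet)-(\d{4})-(.+)$/.exec(name);
  if (!m) return null; // caller falls back to explicit flags
  const [, examKey = "", yearStr = "", rest = ""] = m;
  const exam = EXAM_CODES[examKey];
  if (!exam) return null;
  const year = Number(yearStr);

  // "paper-1" becomes "p1" (JEE Advanced style).
  const paper = /^paper-(\d)$/.exec(rest);
  if (paper) return { exam, year, shift: `p${paper[1]}` };

  // "22-jan-shift-1" or "04-may" becomes "22jan-s1" or "04may-s1".
  const dated = /^(\d{1,2})-([a-z]{3})[a-z]*(?:-shift-(\d))?$/.exec(rest);
  if (!dated) return null;
  const [, day = "", month = "", shiftNo = "1"] = dated; // assumption: missing shift means 1
  if (!MONTHS.includes(month)) return null;
  return { exam, year, shift: `${day.padStart(2, "0")}${month}-s${shiftNo}` };
}
```

Returning `null` rather than guessing keeps the fallback explicit: the caller can then prompt for the flags instead of processing a paper under the wrong shift.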

Checkpoint System

After processing, a checkpoint is recorded in .checkpoints.json. Running the same shift again will skip it unless --force is passed:

npm run process-pdf -- --input "input/paper.pdf" --force
npm run status   # Shows what's been processed

Interactive TUI Wizard

The interactive TUI (npm run interactive) provides a guided, stylized terminal experience for the pipeline:

Features

Step	Feature	Description
1	OCR Engine Selection	Choose Mistral OCR or MinerU OCR via interactive menu with feature comparison
2	Environment Health Check	Scans all 9 API keys, shows green/red status per provider, reports OCR/AI readiness
3	AI Provider Test	Pings each configured provider with a real API call — spinner animation while testing, shows HTTP status result
4	Help Menu	Type `?` at any prompt to open the help overlay — shows all CLI commands, OCR engine comparison, environment variables reference, PDF type legend, and about info
5	PDF File Selection	Lists all PDFs in `input/` directory with file sizes, or enter a custom path. Each file shows a status column (`✓` processed, `○` not processed) and a type column (Question Paper, Answer Key, Combined). Processed indicator parsed from `.checkpoints.json` via filename pattern matching
6	Already-Processed Detection	When a PDF is selected, checks `.checkpoints.json`. If processed: shows exam/year/shift, stage progress bar (`OCR → EXT → DIA → VAL → EXP`), and offers Force (re-process with `--force`), Skip, or Cancel
7	PDF Type Detection	Inline filename analysis: detects `answer`/`key`/`sol` keywords → labels PDF as Question Paper, Answer Key, or Combined. Shown in file listing and config summary
8	Live Pipeline Execution	Spawns `process-pdf.ts` as child process with real-time colorized log output — cyan INFO, yellow WARN, red ERROR, bold step headers
9	Completion Summary	Shows exit code, output location, next steps (review, signoff, stats)

Visual Design

ANSI box-drawing header (╔═╗) in cyan
Typewriter animation for welcome text
Unicode braille spinner (⠋⠙⠹...) during AI provider tests
Color-coded stage progress bar: OCR → EXT → DIA → VAL → EXP
Dimmed timestamps on log output
Hide/show cursor during animations

Usage

npm run interactive

The wizard loads .env automatically and guides you through each step with prompts. All flags from process-pdf.ts (like --force, --ocr) are applied based on your choices.

Web Dashboard

Start the API server and open the dashboard:

npm run api
# Open http://localhost:3456/dashboard

Dashboard Features

Feature	Description
Live Pipeline Control	Run stages (scrape, OCR, extract, validate, finalize, verify, stats) with one click
Custom Commands	Run any shell command with live SSE streaming output
Real-Time Logs	See logs stream via Server-Sent Events as pipeline runs
File Browser	Browse `data/` directory tree interactively
Question Browser	Query questions by exam, year, shift, subject, type
Review Interface	Accept/edit/skip/flag questions with keyboard shortcuts
Pipeline Status	Current stage, running time, last 100 log entries
Stats Dashboard	Total questions, by exam/subject/type, diagram count, verification rate

Dashboard UI Files

File	Description
`dashboard.html`	Main web dashboard (pipeline control, review, file browser, stats)
`pipeline-canvas.html`	Pipeline visualization (canvas-based flow diagram)
`pipeline-flow-v2.html`	Pipeline flow diagram (v2, detailed stage view)

API Reference

The API server runs on port 3456 (configurable via PORT env). CORS is fully open.

Endpoints

Method	Endpoint	Description
`GET`	`/api/v1/events`	SSE real-time stream (status, logs, files, review)
`GET`	`/api/v1/pipeline/status`	Current pipeline state + last 100 logs
`GET`	`/api/v1/pipeline/stages`	List all available pipeline stages
`POST`	`/api/v1/pipeline/run`	Trigger a pipeline stage
`POST`	`/api/v1/pipeline/custom`	Run any shell command (output streams live via SSE)
`POST`	`/api/v1/pipeline/stop`	Kill running process
`POST`	`/api/v1/review/start`	Start human review session
`GET`	`/api/v1/review/current`	Get current review question
`POST`	`/api/v1/review/action`	Accept / edit / skip / flag a question
`POST`	`/api/v1/review/cancel`	Cancel review session
`GET`	`/api/v1/files/list`	Real file listing from `data/` with metadata
`GET`	`/api/v1/files/tree`	Full directory tree of `data/`
`GET`	`/api/v1/questions`	Query questions (filter by exam, year, subject, type)
`GET`	`/api/v1/questions/count`	Question count summary
`GET`	`/api/v1/exams`	List all exams in the dataset
`GET`	`/api/v1/stats`	Dataset statistics
`GET`	`/api/v1/diagrams/:path`	Serve diagram images
`GET`	`/dashboard`	Web dashboard UI
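
As a small illustration of querying the questions endpoint, the sketch below builds a request URL. The query parameter names (`exam`, `year`, `subject`, `type`) are assumptions inferred from the endpoint description; check server.ts for the names the server actually reads. `buildQuestionsUrl` is a hypothetical helper.

```typescript
interface QuestionFilter {
  exam?: string;
  year?: number;
  subject?: string;
  type?: string;
}

// Builds a URL for GET /api/v1/questions. Parameter names are assumed, not confirmed.
function buildQuestionsUrl(filter: QuestionFilter, base = "http://localhost:3456"): string {
  const url = new URL("/api/v1/questions", base);
  for (const [key, value] of Object.entries(filter)) {
    // Skip unset filters so the server sees only what the caller asked for.
    if (value !== undefined) url.searchParams.set(key, String(value));
  }
  return url.toString();
}
```

The resulting string can be passed straight to `fetch`, for example `fetch(buildQuestionsUrl({ exam: "jeemain", year: 2025 }))`.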

SSE Events

The server pushes real-time events via Server-Sent Events:

Event	Data	When
`status`	`{ stage, status, startedAt, finishedAt }`	Pipeline state changes
`log`	`{ ts, type, msg }`	New log entry (max 500 in history)
`review`	`{ action, questionId }`	Review action performed
`file-change`	`{ path, action }`	File created/deleted in `data/`
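
For clients that read the event stream without a browser, the SSE wire format is simple enough to parse by hand. The sketch below is an illustration only: `parseSseChunk` is a hypothetical helper, and it assumes each `data:` payload is JSON, as the table above suggests.

```typescript
interface SseEvent {
  event: string;
  data: unknown;
}

// Parses a block of SSE text in which frames are separated by a blank line.
function parseSseChunk(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const frame of chunk.split(/\r?\n\r?\n/)) {
    let event = "message"; // SSE default when no "event:" line is present
    const dataLines: string[] = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) dataLines.push(line.slice(5).trim());
    }
    if (dataLines.length === 0) continue; // comment, keep-alive, or empty frame
    events.push({ event, data: JSON.parse(dataLines.join("\n")) });
  }
  return events;
}
```

In a browser, the built-in `EventSource` does this parsing for you; `new EventSource("/api/v1/events")` with `addEventListener("log", ...)` is the usual route.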

Human Review Workflow

CLI Review (`npm run review`)

Interactive terminal with vim-like keyboard shortcuts:

Keys:
  a  = Accept question
  e  = Edit question (opens $EDITOR)
  s  = Skip question (come back later)
  f  = Flag question (with note)
  j  = Next question
  k  = Previous question
  q  = Quit (save progress)

Question display format:

┌──────────────────────────────────────────────────────────────────────┐
│  Question 12 of 90                    ID: jeemain-2025-22jan-s1-ph-012│
│  Subject: physics   Type: mcq         Topic: electrostatics          │
│  Section: a         Marks: 4          Negative: -1                   │
├──────────────────────────────────────────────────────────────────────┤
│  Two point charges +q and -q are placed at a distance d apart.      │
│  The electric field at the midpoint is:                              │
│                                                                      │
│  Options:                                                            │
│    0: Zero                                                           │
│    1: 2kq/d² directed towards +q                                     │
│    2: 2kq/d² directed towards -q                                     │
│    3: 4kq/d² directed towards -q                                     │
│                                                                      │
│  AI Answer: 2 (index 2)                                              │
│                                                                      │
│  [a]ccept  [e]dit  [s]kip  [f]lag  [j]next  [k]prev  [q]uit        │
└──────────────────────────────────────────────────────────────────────┘

Progress persistence: Review progress is saved to .review-progress.json automatically. You can quit and resume later.

Batch Sign-Off

After all questions have been reviewed:

npm run signoff -- --exam jeemain --year 2025 --shift 22jan-s1 --status verified

Web Dashboard Review

The web dashboard provides a GUI version of the review interface with the same actions (accept/edit/skip/flag) plus mouse navigation.

Cross-Validation System

Phase 9 of the project — designed to reduce human review time by 80%.

How It Works

Run TWO different AI providers on the same OCR output
Cross-validator compares each field of each question
Matched questions (80-95%) → auto-accepted
Disagreements only → presented to human for resolution
buildConsensus() merges results with human resolutions

# Run two providers
npx tsx src/extractors/structurer.ts --model cerebras --input ocr.json --output cerebras.json
npx tsx src/extractors/structurer.ts --model gemini --input ocr.json --output gemini.json

# Cross-validate
npx tsx src/cross-validate/cross-validator.ts --a cerebras.json --b gemini.json

# Opens HTML report with diff viewer
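
The field-by-field comparison in the steps above can be sketched as follows. This is a simplified illustration, not the code in cross-validator.ts: `crossValidate`, the `Extracted` type, and matching questions by `number` are assumptions.

```typescript
type Extracted = { number: number } & Record<string, unknown>;

interface Disagreement {
  number: number;
  field: string;
  a: unknown;
  b: unknown;
}

// Compares provider A against provider B on the listed fields.
function crossValidate(a: Extracted[], b: Extracted[], fields: string[]) {
  const byNumber = new Map<number, Extracted>(
    b.map((q): [number, Extracted] => [q.number, q]),
  );
  const disagreements: Disagreement[] = [];
  let matched = 0;
  for (const qa of a) {
    const qb = byNumber.get(qa.number);
    if (!qb) {
      // Provider B did not return this question at all.
      disagreements.push({ number: qa.number, field: "*", a: qa, b: undefined });
      continue;
    }
    // JSON.stringify gives a cheap deep comparison for arrays such as options.
    const diffs = fields.filter((f) => JSON.stringify(qa[f]) !== JSON.stringify(qb[f]));
    if (diffs.length === 0) matched++;
    for (const f of diffs) disagreements.push({ number: qa.number, field: f, a: qa[f], b: qb[f] });
  }
  return { matched, agreement: a.length ? matched / a.length : 0, disagreements };
}
```

Only the entries in `disagreements` would need human attention; fully matched questions count toward the auto-accepted share.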

Diff Report (HTML)

The generated HTML report shows:

Side-by-side comparison of each question
Color-coded fields (green = match, red = disagreement)
Agreement percentage per field and overall
Confidence scores for each provider
Quick navigation to disagreements only

This reduces review workload from 90 questions to typically 5-15 disagreements.

Anti-Hallucination Design

The system has a zero-tolerance policy toward fabricated data, embedded at every level:

Core Rule

NEVER fabricate pipeline output. If the pipeline fails at any stage, tell the user the exact failure. Do NOT generate fake paper.json from your training data, do NOT fill gaps with “common questions everyone knows.” A loud honest error is worth 1000× more than silent corrupted data.

Implementation

Safeguard	Where	Description
No answer key → empty answers	`consensus-extractor.ts`	If answer key not detected, ALL answers forced to empty string
Answer key detection	`consensus-extractor.ts`	12 regex patterns verify key presence before extraction
Source tracking	`types.ts`	Every question tagged with source (`official-pdf`, `reconstructed`, etc.)
Confidence scoring	`consensus-extractor.ts`	Multi-provider agreement → confidence field
Field-level validation	`auto-validator.ts`	32 checks catch impossible/invalid data
Validation: FATAL checks	`auto-validator.ts`	If no answer key found but answers exist → all cleared
Checkpoint system	`checkpoints.ts`	Never re-extract — always resume from last real state
Revision tracking	`types.ts`	Every question has `revision` field, incremented on edit
Tombstone IDs	`id-assigner.ts`	Removed IDs never reused — no stale data
Integrity verification	`integrity.ts`	SHA-256 checksums on all output files
Provider diversity	`structurer.ts`	Auto-failover between 6+ providers — no single point of hallucination
Human review gate	`review-cli.ts`	Every question requires human acceptance before sign-off
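
The first safeguard in the table (no answer key means empty answers) can be sketched as below. The two regular expressions are hypothetical stand-ins; the 12 patterns used by consensus-extractor.ts are not reproduced here, and `enforceAnswerKeyRule` is an illustrative name.

```typescript
interface ExtractedQuestion {
  id: string;
  answer: string;
}

// Hypothetical detection patterns, for illustration only.
const ANSWER_KEY_PATTERNS = [/answer\s*key/i, /^\s*answers?\s*:/im];

function hasAnswerKey(ocrText: string): boolean {
  return ANSWER_KEY_PATTERNS.some((pattern) => pattern.test(ocrText));
}

// If no answer key is detected, every answer is forced to the empty string,
// regardless of what the model returned.
function enforceAnswerKeyRule<T extends ExtractedQuestion>(questions: T[], ocrText: string): T[] {
  if (hasAnswerKey(ocrText)) return questions;
  return questions.map((q) => ({ ...q, answer: "" }));
}
```

Clearing answers unconditionally is deliberately blunt: a model-supplied answer with no key in the source is exactly the kind of plausible fabrication this section aims to rule out.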

Agent Protocol

The AGENT.md (821 lines) and one-shot-prompt.md (575 lines) both begin with the HARD RULE against fabrication. Any AI agent running the pipeline is explicitly instructed to:

Actually run the pipeline (not simulate it)
Never generate questions from training data
Never fall back to “mock mode” when APIs fail
Never skip pipeline stages
Fail loudly with exact error messages

ID Scheme

Format

{exam}-{year}-{shift-shorthand}-{subject-code}-{3-digit-number}

Examples

ID	Meaning
`jeemain-2025-22jan-s1-ph-001`	JEE Main 2025, 22 Jan Shift 1, Physics, Q1
`neet-2025-04may-bi-045`	NEET 2025, 04 May, Biology, Q45
`jeeadv-2025-p1-ch-012`	JEE Advanced 2025, Paper 1, Chemistry, Q12
`ncert-exemplar-11-ph-023`	NCERT Exemplar Class 11, Physics, Q23

Subject Codes

Subject	Code
physics	`ph`
chemistry	`ch`
mathematics	`ma`
biology	`bi`

Shift Shorthand

Original	Shorthand
“22 january shift 1”	`22jan-s1`
“4 may”	`04may`
“paper 1” (JEE Adv)	`p1`
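
Putting the format, subject codes, and shift shorthand together, an ID could be assembled roughly like this. `buildQuestionId` and `ID_PATTERN` are hypothetical names for illustration; the actual logic lives in id-assigner.ts and may differ.

```typescript
// Subject codes from the table above.
const SUBJECT_CODES: Record<string, string> = {
  physics: "ph",
  chemistry: "ch",
  mathematics: "ma",
  biology: "bi",
};

// Lowercase alphanumeric segments joined by hyphens, ending in a 3-digit number.
const ID_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*-\d{3}$/;

function buildQuestionId(exam: string, year: number, shift: string, subject: string, n: number): string {
  const code = SUBJECT_CODES[subject];
  if (!code) throw new Error(`Unknown subject: ${subject}`);
  if (!Number.isInteger(n) || n < 1 || n > 999) throw new Error(`Number out of range: ${n}`);
  const id = `${exam}-${year}-${shift}-${code}-${String(n).padStart(3, "0")}`.toLowerCase();
  if (!ID_PATTERN.test(id)) throw new Error(`Malformed ID: ${id}`);
  return id;
}
```

For example, `buildQuestionId("jeemain", 2025, "22jan-s1", "physics", 1)` yields the first ID in the examples table.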

Tombstone Tracking

Removed IDs are tracked in data/.tombstones.json and never reused. This ensures that any external references to a specific ID remain stable.

Numbering

Within each subject file: numbers reset to 1-N (first physics question = number 1)
Within paper.json: numbers are the original question numbers from the paper
Within aggregate files (across shifts): numbers reset again
IDs are always globally unique regardless of context

Topic Vocabulary

Controlled Vocabulary by Subject

Physics (32 tags): kinematics, newtons-laws, friction, work-energy-power, rotational-motion, gravitation, fluid-mechanics, properties-of-matter, thermal-physics, thermodynamics, kinetic-theory, oscillations, waves, electrostatics, capacitance, current-electricity, magnetic-effects, electromagnetic-induction, alternating-current, electromagnetic-waves, ray-optics, wave-optics, modern-physics, atoms, nuclei, semiconductors, communication-systems, experimental-physics, units-and-dimensions, vectors, error-analysis, measurement

Chemistry (33 tags): mole-concept, atomic-structure, periodic-classification, chemical-bonding, states-of-matter, chemical-thermodynamics, equilibrium, redox-reactions, hydrogen, s-block, p-block, organic-chemistry, hydrocarbons, environmental-chemistry, solid-state, solutions, electrochemistry, chemical-kinetics, surface-chemistry, metallurgy, d-block, coordination-compounds, halogen-derivatives, alcohols-phenols-ethers, aldehydes-ketones, carboxylic-acids, amines, biomolecules, polymers, chemistry-in-everyday-life, analytical-chemistry, nuclear-chemistry, green-chemistry

Mathematics (29 tags): sets, relations-and-functions, trigonometry, inverse-trigonometry, matrices, determinants, continuity-and-differentiability, application-of-derivatives, integrals, application-of-integrals, differential-equations, vector-algebra, three-d-geometry, linear-programming, probability, binomial-theorem, sequences-and-series, complex-numbers, quadratic-equations, permutations-and-combinations, statistics, mathematical-reasoning, limits, number-theory, graph-theory, inequalities, logarithms, modulus-function, greatest-integer-function

Biology (18 tags): diversity-in-living-world, plant-kingdom, animal-kingdom, morphology-of-flowering-plants, anatomy-of-flowering-plants, cell-biology, biomolecules-biology, plant-physiology, human-physiology, reproduction, genetics, evolution, human-health-and-disease, biotechnology, ecology, microbiology, immunology, bioinformatics

Topic Normalization Strategy

When AI extracts a free-form topic string, the system normalizes it through 4 layers:

Exact match (fast path) — check against 250+ aliases
Fuzzy match — Levenshtein distance ≤ 2
Semantic similarity — word-vector cosine similarity (threshold 0.3)
Fallback — "general-{subject}" if no match

Example alias mappings:

"nlm"           → "newtons-laws"
"aod"           → "application-of-derivatives"
"emi"           → "electromagnetic-induction"
"thermo"        → "thermodynamics"
"rotation"      → "rotational-motion"
"biotech"       → "biotechnology"
"motion in 1d"  → "kinematics"
"redox"         → "redox-reactions"

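The layered lookup described above can be sketched as follows. This is an illustration under assumptions: `normalizeTopic` is a hypothetical name, the alias table holds only a few of the mappings shown, and the semantic-similarity layer is omitted, so the sketch goes from fuzzy match straight to the fallback.

```typescript
// A small subset of the alias table, taken from the examples above.
const ALIASES: Record<string, string> = {
  nlm: "newtons-laws",
  aod: "application-of-derivatives",
  emi: "electromagnetic-induction",
  thermo: "thermodynamics",
};

// Classic edit distance using a single rolling row.
function levenshtein(a: string, b: string): number {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // holds d[i-1][j-1]
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // d[i-1][j]
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

function normalizeTopic(raw: string, subject: string, vocabulary: string[]): string {
  const key = raw.trim().toLowerCase();
  // Layer 1: exact match against the vocabulary or the alias table.
  if (vocabulary.includes(key)) return key;
  const alias = ALIASES[key];
  if (alias) return alias;
  // Layer 2: fuzzy match, Levenshtein distance of at most 2.
  const candidate = key.replace(/\s+/g, "-");
  let best: { tag: string; distance: number } | null = null;
  for (const tag of vocabulary) {
    const distance = levenshtein(candidate, tag);
    if (distance <= 2 && (best === null || distance < best.distance)) best = { tag, distance };
  }
  if (best) return best.tag;
  // Layer 3 (semantic similarity) is not implemented in this sketch.
  // Layer 4: fallback.
  return `general-${subject}`;
}
```

Ordering the cheap exact lookup first means most strings never reach the distance computation, and the fallback guarantees every question ends up with some tag.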
Tag Validation

isValidTag(subject, tag) checks if a tag belongs to the controlled vocabulary for that subject. Max 5 tags per question.

Validation — 32 Automated Checks

The auto-validator (src/validators/auto-validator.ts) runs 32 checks on every question:

#	Check	Severity
1	ID present	error
2	ID format (lowercase alphanumeric + hyphens)	error
3	Number is positive integer	error
4	Subject is valid enum value	error
5	Type is valid enum value	error
6	Text is not empty	error
7	Text has no placeholders (`[image]`, `[figure]`, etc.)	warning
8	MCQ: options count 3-5	error
9	MCQ: no duplicate options	error
10	MCQ: answer is valid index string	error
11	MSQ: options count 4-6	error
12	MSQ: answers array sorted, ≥1	error
13	NAT: options is null	error
14	NAT: negativeMarks is 0	error
15	NAT: answer is numeric	error
16	Assertion-Reason: options is null	error
17	Assertion-Reason: answer in "0" | "1" | "2" | "3"	error
18	No HTML/script tags in text	warning
19	Text Unicode is valid (no garbled characters)	warning
20	Diagram file exists on disk when hasDiagram=true	error
21	Diagram references follow naming convention	warning
22	passageId references valid entry in passages array	error
23	tags follow controlled vocabulary	warning
24	ID format matches `{exam}-{year}-{shift}-{subject}-{3digit}`	error
25	No duplicate IDs across dataset	error
26	source is valid enum value	error
27	confidence is valid when set	warning
28	difficulty is valid when set	warning
29	solutionFormat is valid when set	warning
30	marks is positive number	error
31	revision is positive integer	warning
32	Maximum 5 tags per question	warning

Per-Type Validation Rules

MCQ (Multiple Choice — Single Correct):

options: string[], length 3-5
answer: 0-based index string (e.g. “0”, “1”, “2”, “3”)
answers: null
negativeMarks: usually -1 (both JEE Main and NEET deduct 1 mark for a wrong MCQ answer); 0 only where the paper has no negative marking

MSQ (Multiple Select Questions):

options: string[], length 4-6
answers: sorted array of correct indices (e.g. ["1", "3"])
answer: concatenated string like "1,3" or first answer
No null options

NAT (Numerical Answer Type):

options: null
answer: numeric string (e.g. "4", "2.5")
negativeMarks: MUST be 0
answerPrecision: set if decimal places or range specified

Assertion-Reason:

options: null (auto-generated by display layer)
answer: one of "0", "1", "2", "3"
- "0" = Both A and R are true, R is correct explanation of A
- "1" = Both A and R are true, R is NOT correct explanation of A
- "2" = A is true but R is false
- "3" = A is false but R is true
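
To make the per-type rules concrete, here is a minimal sketch covering MCQ and NAT only. `validateByType` and the `TypedQuestion` shape are hypothetical; the real checks live in auto-validator.ts and also cover MSQ and Assertion-Reason. The sketch does not special-case answers that were cleared to an empty string.

```typescript
interface TypedQuestion {
  type: "mcq" | "nat";
  options: string[] | null;
  answer: string;
  negativeMarks: number;
}

// Returns a list of rule violations; an empty list means the question passes.
function validateByType(q: TypedQuestion): string[] {
  const errors: string[] = [];
  if (q.type === "mcq") {
    const count = q.options?.length ?? 0;
    if (count < 3 || count > 5) errors.push("mcq: options count must be 3-5");
    if (q.options && new Set(q.options).size !== q.options.length) {
      errors.push("mcq: duplicate options");
    }
    // The answer is a 0-based index stored as a string, e.g. "2".
    if (!/^\d+$/.test(q.answer) || Number(q.answer) >= count) {
      errors.push("mcq: answer must be a valid index string");
    }
  } else {
    if (q.options !== null) errors.push("nat: options must be null");
    if (q.negativeMarks !== 0) errors.push("nat: negativeMarks must be 0");
    if (q.answer.trim() === "" || !Number.isFinite(Number(q.answer))) {
      errors.push("nat: answer must be numeric");
    }
  }
  return errors;
}
```

Returning messages instead of throwing lets a caller collect every violation for a question in one pass, which suits a report-style validator.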

Error Handling & Self-Healing

Common Scenarios and Automatic Recovery

Symptom	Likely Cause	Automatic Fix
404 on scrape	URL pattern changed or shift not released	Try gateoverflow mirror, or report to user
PDF >3.5MB	Large file	Auto-split and re-OCR in parallel
Mistral OCR timeout	API rate limit	Exponential backoff (rate-limiter handles)
Mistral returns empty	Scanned/image-only PDF	Proceed with empty pages; still extract what’s possible
AI JSON parse fails	Markdown-wrapped JSON	Strip ```json fences, retry parse
AI returns 0 questions	Context window exceeded	Split into smaller chunks, retry
AI returns garbage	Unclear instructions	Re-prompt with stricter schema
Missing answers	Answer key page missed	Auto-repair re-extracts answer key pages
Merged options	AI concatenated options	`repairOptions()` splits them intelligently
Count mismatch	Extracted ≠ expected	Re-extract with strict prompt about count
Invalid topic	Unknown topic string	Run topic-normalizer → fallback to `general-{subject}`
Missing required field	AI omitted field	Auto-validator detects, exporter fills defaults
Checksum mismatch	File modified after export	Re-run exporter
Port 3456 busy	Previous server instance	Kill process: `Stop-Process -Id (Get-NetTCPConnection -LocalPort 3456).OwningProcess`
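
The "AI JSON parse fails" recovery in the table amounts to removing a Markdown code fence before parsing. A minimal sketch, with `parseModelJson` as a hypothetical name and a single fenced block assumed:

```typescript
// The fence marker is built indirectly so this example can sit inside a fenced block.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(`^${FENCE}(?:json)?\\s*\\n([\\s\\S]*?)\\n?${FENCE}$`, "i");

// Strips one surrounding Markdown fence, if present, then parses the JSON.
function parseModelJson(raw: string): unknown {
  const trimmed = raw.trim();
  const fenced = FENCED_BLOCK.exec(trimmed);
  return JSON.parse(fenced ? fenced[1] : trimmed);
}
```

If parsing still throws after the fence is stripped, the error should propagate so the caller can retry or fail loudly rather than continue with partial data.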

Checkpoint Resume

Every pipeline stage writes checkpoints to .checkpoints.json. On failure:

Read .checkpoints.json to find last successful stage
Resume from next stage without redoing completed work
Use --force to override and reprocess
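
The resume logic above can be sketched as a pure function. The stage names follow the progress bar shown in the TUI section (OCR, EXT, DIA, VAL, EXP); `stagesToRun` and the idea of passing completed stages as an array are assumptions, since the layout of .checkpoints.json is not documented here.

```typescript
const STAGES = ["ocr", "ext", "dia", "val", "exp"] as const;
type Stage = (typeof STAGES)[number];

// Returns the stages that still need to run, in order.
function stagesToRun(completed: Stage[], force = false): Stage[] {
  if (force) return [...STAGES]; // --force reprocesses everything
  // Resume after the longest unbroken prefix of completed stages, so a
  // later stage is never trusted if an earlier one is missing.
  let next = 0;
  while (next < STAGES.length && completed.includes(STAGES[next])) next++;
  return STAGES.slice(next);
}
```

Requiring an unbroken prefix is a conservative choice: a checkpoint for validation without one for extraction would be treated as stale and rerun.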

Fatal Errors (No Automatic Recovery)

These situations trigger a halt and explicit user notification:

All 6 AI providers fail (no extraction possible)
Compilation errors (tsc --noEmit fails)
PDF is password-protected
Mistral OCR returns no pages at all
Answer key PDF doesn’t match question paper

Model Limits & Rate Limiting

⚠️ Rate limits are set by providers and change without notice. The limits below were current at the time of writing. If you hit 429 errors, check each provider’s free-tier page and update src/utils/rate-limiter.ts with the new values; the queuing and backoff logic itself does not need to change.

Per-Provider Rate Limits

Provider	Model	RPM	Context	Daily Free	Notes
Mistral OCR	mistral-ocr-latest	60 req/min	—	50K TPM	OCR only, 1 RPS enforcement
NVIDIA	Qwen3 Coder 480B	40	262K	2,400 RPD	Primary extraction, 35B active params
NVIDIA	Mistral-Large-3	40	262K	2,400 RPD	Multimodal, 675B MoE
NVIDIA	Llama-4 Maverick	40	1M	2,400 RPD	Multimodal, 400B MoE
LongCat Lite	Flash-Lite	30	256K	50M tokens/day	Best for bulk processing
LongCat Chat	Flash-Chat	30	256K	500K tokens/day	General purpose
Poolside	Laguna M.1	30/100	131K	Unlimited(preview)	Free preview
Vanchin	KAT-Coder-Air-V1	20	2M	28,800 RPD	Code validation
Gemini	3.1 Flash Lite	15	1M	500 RPD	Validation, 250K TPM
Cerebras	GPT-OSS-120B	5	65K	2,400 RPD	Fallback, 30K TPM

Rate Limiter Architecture

The RateLimiter class (src/utils/rate-limiter.ts) uses a queue + sliding window approach:

All API calls are queued
Timestamps of recent requests are maintained in a sliding window
When window is full, new requests wait until oldest timestamp expires
Exponential backoff on HTTP 429 responses
Per-provider instances with independent windows

const nvidiaLimiter = new RateLimiter({ maxRequests: 40, windowMs: 60_000 });
const longcatLimiter = new RateLimiter({ maxRequests: 30, windowMs: 60_000 });
const poolsideLimiter = new RateLimiter({ maxRequests: 100, windowMs: 60_000 });
const geminiLimiter = new RateLimiter({ maxRequests: 15, windowMs: 60_000 });
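
For readers unfamiliar with the sliding-window idea, the following sketch shows the core bookkeeping. It is not the RateLimiter class from src/utils/rate-limiter.ts: `SlidingWindowSketch`, `reserve`, and `backoffDelayMs` are hypothetical names, queuing is left out, and the backoff constants are assumptions.

```typescript
class SlidingWindowSketch {
  private timestamps: number[] = [];
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns 0 and records the request if a slot is free,
  // otherwise the number of milliseconds until the oldest slot expires.
  reserve(now: number = Date.now()): number {
    while (this.timestamps.length > 0 && now - this.timestamps[0] >= this.windowMs) {
      this.timestamps.shift(); // drop requests that have left the window
    }
    if (this.timestamps.length < this.maxRequests) {
      this.timestamps.push(now);
      return 0;
    }
    return this.timestamps[0] + this.windowMs - now;
  }
}

// Exponential backoff after an HTTP 429: 1s, 2s, 4s, ... capped at 60s.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

Passing `now` explicitly keeps the class deterministic in tests; a caller that receives a non-zero value would sleep for that long and call `reserve` again.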

Testing

Unit Tests

Test File	What It Tests
`tests/unit/merger.test.ts`	Chunk dedup, pickBetter logic, semantic similarity
`tests/unit/consensus.test.ts`	Majority vote, confidence scoring, conflict detection
`tests/unit/auto-repair.test.ts`	Missing answer detection, option splitting, count fix
`tests/unit/chunker.test.ts`	Overlapping chunk boundaries, no question-spanning
`tests/unit/topic-normalizer.test.ts`	Exact match, fuzzy match, semantic fallback

Integration Tests

Test File	What It Tests
`tests/integration/golden-dataset.test.ts`	End-to-end against known golden dataset

Run Tests

npm run test              # Run all tests
npm run test:watch        # Watch mode
npm run test-mistral      # Test Mistral structured annotations
npm run test-full-pipeline  # Full end-to-end pipeline test
npm run test-models       # Test all AI provider connectivity

Test Fixtures

tests/fixtures/golden-jeemain-sample.json contains a known-correct dataset used for integration testing.

Project Status

All 9 phases are complete with zero TypeScript compilation errors across 32 source files.

Phase	Module	Files	Status
P1	Foundation	`types.ts`, `vocabulary.ts`, `utils/*`, `index.ts`	✅ Complete
P2	Scrapers	`nta-scraper.ts`, `gateoverflow-scraper.ts`, `ncert-scraper.ts`, `kaggle-importer.ts`	✅ Complete
P3	Extraction	`ocr-stage.ts`, `structurer.ts`, `consensus-extractor.ts`, `chunker.ts`, `merger.ts`, `diagram-cacher.ts`, `auto-repair.ts`, `progressive-review.ts`	✅ Complete
P4	Validation	`field-checker.ts`, `auto-validator.ts`	✅ Complete
P5	Finalization	`id-assigner.ts`, `normalizer.ts`, `topic-normalizer.ts`, `exporter.ts`	✅ Complete
P6	Review	`pdf-renderer.ts`, `review-cli.ts`, `batch-signoff.ts`	✅ Complete
P7	Scripts	`batch-process.ts`, `verify-all.ts`, `rebuild-index.ts`, `export-for-opensource.ts`, `stats.ts`	✅ Complete
P8	API + Adapter	`server.ts`	✅ Complete
P9	Cross-Validate	`cross-validator.ts`, `diff-viewer.ts`	✅ Complete

Environment Variables

Variable	Required	Default	Purpose
`MISTRAL_API_KEY`	✅ Yes*	—	Mistral AI OCR and embeddings (*or MINERU_API_KEY)
`MINERU_API_KEY`	❌ No	—	MinerU Precision API token (falls back to Agent API)
`NVIDIA_API_KEY`	❌ No	—	NVIDIA NIM (Qwen3 Coder 480B, primary)
`LONGCAT_API_KEY`	❌ No	—	LongCat Flash Lite (50M tokens/day free)
`POOLSIDE_API_KEY`	❌ No	—	Poolside Laguna M.1 (unlimited(preview) free)
`VC_API_KEY`	❌ No	—	Vanchin KAT-Coder-Air-V1
`GEMINI_API_KEY`	❌ No	—	Gemini 3.1 Flash Lite (500 RPD)
`CEREBRAS_API_KEY`	❌ No	—	Cerebras GPT-OSS-120B (fallback)
`KAGGLE_USERNAME`	❌ No	—	Kaggle API username
`KAGGLE_KEY`	❌ No	—	Kaggle API key
`EDITOR`	❌ No	`notepad`	Editor for review edit mode
`LOG_LEVEL`	❌ No	`info`	`debug` | `info` | `warn` | `error`
`PORT`	❌ No	`3456`	API server port

Design Decisions

All lowercase — no casing bugs anywhere in the system
No license in JSON — added only via --export flag
AR options auto-generated — never stored, generated by display layer
Passage = any type + passageId — passage is a relationship, not a separate question type
Match-columns = MCQ with 4 pairing options — not a separate type
Difficulty = null from AI — human assigns via rubric, never guessed by AI
Checksum = SHA-256 before adding checksum field — self-verifying files
Human review = accuracy guarantee — AI achieves 80-95%, validation adds 5%, human catches the rest
Free tier only — all providers have free tiers, no paid API required. The entire strategy (model selection, priority ranking, rate limits) is built around the free-tier landscape. When providers change their offerings, update the configs — the pipeline architecture is designed to absorb these changes with minimal effort.
No Docker, no database — JSON files ARE the database. Portable, inspectable, git-able
Subject files written FIRST — paper.json is secondary merge, not primary
Tombstone IDs — removed IDs never reused; external references stay valid
Anti-hallucination by architecture — not just documentation, enforced in code
Checkpoints at every stage — never redo completed work on failure
SSE for real-time updates — no polling, no WebSocket dependency
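
The "checksum before adding the checksum field" decision can be illustrated with Node's built-in crypto module. This is a sketch under assumptions: `withChecksum` and `verifyChecksum` are hypothetical names, the field is assumed to be called `checksum`, and the hash is taken over `JSON.stringify` output, so it depends on key order being preserved between export and verification.

```typescript
import { createHash } from "node:crypto";

type Dataset = Record<string, unknown>;

function sha256Of(value: unknown): string {
  return createHash("sha256").update(JSON.stringify(value)).digest("hex");
}

// Hashes the data without any checksum field, then attaches the result.
function withChecksum(data: Dataset): Dataset {
  const { checksum: _previous, ...rest } = data;
  return { ...rest, checksum: sha256Of(rest) };
}

// Recomputes the hash over everything except the checksum field and compares.
function verifyChecksum(data: Dataset): boolean {
  const { checksum, ...rest } = data;
  return typeof checksum === "string" && checksum === sha256Of(rest);
}
```

Because the hash excludes its own field, a file can be verified in isolation, with no separate manifest needed.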

Glossary

Term	Definition
MCQ	Multiple Choice Question — single correct answer from 3-5 options
MSQ	Multiple Select Question — ≥1 correct answers from 4-6 options
NAT	Numerical Answer Type — numeric answer, no options
AR	Assertion-Reason — two statements, choose relationship between them
SSE	Server-Sent Events — HTTP-based real-time streaming
RPD	Requests Per Day
RPM	Requests Per Minute
TPM	Tokens Per Minute
Shift	A specific exam session (e.g., “22 Jan Shift 1”)
Consensus	Multi-provider agreement with majority voting
Golden Dataset	A human-verified correct dataset used as ground truth
Tombstone	Record of a deleted ID to prevent reuse
NTA	National Testing Agency — conducts JEE Main and NEET
Gateoverflow	Community mirror site for exam PDFs
Your Exam	Configure your own exam in `src/types.ts` — exam code, subjects, marking scheme

License — Dual-Licensing

This software is available under two options (your choice):

Option	License	When to Use
Open Source	GPLv3 — full source must remain open	You’re building open-source software
Commercial	Private license (contact author)	You’re a proprietary company that can’t open-source

GPLv3: Anyone can use, modify, and distribute for free, including commercially, but any derived work must also be GPLv3 (copyleft).

Commercial: Proprietary companies that don’t want to open-source their code buy a private license. Contact the author to purchase.

Question-Pipeline: From PDF to structured dataset. GPLv3 / Commercial dual-license.

█▄▄ █▄█   █▄░█ ▄▀█ █▀▄▀█ ▄▀█ █▄░█
█▄█ ░█░   █░▀█ █▀█ █░▀░█ █▀█ █░▀█

GitHub: namandhakad712
