data-doc & data-sim Project
Agentic data cleansing and normalization system
Project maintained by realSergiy
data-doc Application
Overview
data-doc is an agentic application that diagnoses and repairs data quality issues in manufacturing databases. It uses LangGraph-orchestrated agents powered by the Claude API to scan, analyze, and fix data problems with minimal human intervention.
Technology Stack
- Backend: Python with FastAPI
- Agent Framework: LangGraph with Claude API
- Frontend: React web interface built with Vite
- Database Connection: psycopg2 for Supabase (PostgreSQL)
- File Storage: Google Drive API integration
Agentic Framework Design (LangGraph)
A multi-agent graph where each agent has a specialized role, managed by LangGraph for state transitions and complex logic.
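The hand-off between the three agents described below can be pictured with a small plain-Python stand-in. This is not the LangGraph API: it only illustrates the idea of one shared state object passing through a fixed sequence of nodes. The field names (`findings`, `health_report`, `cleansing_plan`, `approved`, `execution_log`) and the node bodies are hypothetical placeholders.

```python
from typing import Callable, TypedDict


class PipelineState(TypedDict, total=False):
    """Shared state handed from one agent node to the next (hypothetical fields)."""
    findings: list[str]
    health_report: str
    cleansing_plan: list[dict]
    approved: bool
    execution_log: list[str]


def scanner_node(state: PipelineState) -> PipelineState:
    # Placeholder: a real ScannerAgent would query the database and Drive here.
    return {**state, "findings": ["duplicate part P-1001"]}


def diagnosis_node(state: PipelineState) -> PipelineState:
    # Turn raw findings into a report and a list of proposed steps.
    findings = state.get("findings", [])
    report = "\n".join(f"- {finding}" for finding in findings)
    plan = [{"description": f"Fix: {finding}"} for finding in findings]
    return {**state, "health_report": report, "cleansing_plan": plan}


def execution_node(state: PipelineState) -> PipelineState:
    # Nothing runs unless the user has approved the plan.
    if not state.get("approved"):
        return {**state, "execution_log": ["skipped: plan not approved"]}
    log = [f"executed: {step['description']}" for step in state.get("cleansing_plan", [])]
    return {**state, "execution_log": log}


NODES: list[Callable[[PipelineState], PipelineState]] = [
    scanner_node,
    diagnosis_node,
    execution_node,
]


def run_pipeline(state: PipelineState) -> PipelineState:
    """Pass the state through each node in order."""
    for node in NODES:
        state = node(state)
    return state
```

In a LangGraph build, each function would presumably be registered as a graph node and the approval check would become a conditional transition, but that wiring is outside this sketch.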
Agent 1: ScannerAgent
- Role: Data Exploration & Profiling
- Tools: SQL Executor (read-only), Google Drive File Lister/Reader
- Function: Connects to the data sources and performs a systematic inventory. It profiles tables (column types, null counts, value distributions), checks for broken foreign key relationships, and cross-references database records with files in Google Drive. It compiles the raw findings into a structured state object.
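As a rough illustration of the profiling step, the sketch below collects row count, declared column types, null counts, and distinct counts into a dictionary. It uses the standard-library `sqlite3` module as a stand-in for the psycopg2/PostgreSQL connection, and `PRAGMA table_info` is SQLite-specific (on PostgreSQL the column list would more likely come from `information_schema.columns`). The function name `profile_table`, the output layout, and the sample columns are assumptions.

```python
import sqlite3


def profile_table(conn: sqlite3.Connection, table: str) -> dict:
    """Return row count plus per-column declared type, null count, and distinct count."""
    # Table and column names are read from the database catalog, not from user
    # input, so plain identifier quoting is enough for this sketch.
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profile = {"table": table, "row_count": row_count, "columns": {}}
    for _cid, name, col_type, *_rest in columns:
        nulls, distinct = conn.execute(
            f'SELECT SUM("{name}" IS NULL), COUNT(DISTINCT "{name}") FROM "{table}"'
        ).fetchone()
        profile["columns"][name] = {
            "type": col_type,
            "null_count": nulls or 0,  # SUM returns NULL on an empty table
            "distinct_count": distinct,
        }
    return profile


# Hypothetical sample data: one duplicated part number and one missing description.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_number TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?)",
    [("P-1001", "Bracket"), ("P-1001", "bracket"), ("P-1002", None)],
)
parts_profile = profile_table(conn, "parts")
```

A distinct count lower than the row count on a column that should be unique, as with `part_number` here, is the kind of raw signal the next agent can interpret as a duplicate.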
Agent 2: DiagnosisAgent
- Role: Issue Identification & Plan Formulation
- Tools: Claude LLM (for reasoning), Python/SQL Code Generator
- Function: Receives the structured findings from the ScannerAgent. It uses predefined rules and LLM-driven reasoning to interpret the raw data profile and identify specific business problems. It generates two key artifacts:
- A human-readable Health Report
- A machine-executable Cleansing Plan, consisting of a sequence of SQL and Python scripts with explanations
Agent 3: ExecutionAgent
- Role: Applying Fixes
- Tools: SQL Executor (read-write), Database Schema Cloner
- Function: Takes the user-approved Cleansing Plan. Its first action is to create a backup clone of every table it will modify. It then executes each step of the plan sequentially, logging all actions, successes, and failures, and provides a final execution summary.
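The backup-then-execute behavior could look roughly like the sketch below, again with `sqlite3` standing in for PostgreSQL. The step layout (`table`, `description`, `sql`), the `_backup` table suffix, and the choice to continue after a failed step are all assumptions made for illustration; a stricter variant might stop at the first failure.

```python
import sqlite3


def execute_plan(conn: sqlite3.Connection, steps: list[dict]) -> list[str]:
    """Back up every affected table first, then run the steps in order and log results."""
    log: list[str] = []
    # First action: clone each table that any step will modify.
    for table in sorted({step["table"] for step in steps}):
        conn.execute(f'CREATE TABLE "{table}_backup" AS SELECT * FROM "{table}"')
        log.append(f"Backed up {table} to {table}_backup")
    conn.commit()
    for number, step in enumerate(steps, start=1):
        label = f"Step {number}/{len(steps)}: {step['description']}"
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(step["sql"])
            log.append(f"{label}: success")
        except sqlite3.Error as exc:
            log.append(f"{label}: failed ({exc})")
    return log


# Hypothetical data: part P-1001 appears twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (id INTEGER PRIMARY KEY, part_number TEXT)")
conn.executemany(
    "INSERT INTO parts (part_number) VALUES (?)",
    [("P-1001",), ("P-1001",), ("P-1002",)],
)
conn.commit()

steps = [
    {
        "table": "parts",
        "description": "Remove duplicate part rows",
        "sql": "DELETE FROM parts WHERE id NOT IN "
               "(SELECT MIN(id) FROM parts GROUP BY part_number)",
    },
    {
        "table": "parts",
        "description": "Deliberately broken step",
        "sql": "UPDATE parts SET no_such_column = 1",
    },
]
execution_log = execute_plan(conn, steps)
```

Running each step in its own transaction means a failing step leaves the table as the previous step left it, and the backup table remains available if the whole plan needs to be undone.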
Core Functionality
1. Connection & Discovery
The system securely connects to Supabase and Google Drive using API keys. The ScannerAgent dynamically discovers all tables and files in the target environment.
2. Intelligent Scanning
The ScannerAgent executes a predefined sequence of checks:
- Schema validation
- Referential integrity checks
- Outlier detection
- Cross-system reconciliation (DB vs. files)
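Two of these checks are easy to sketch. The example below finds orphaned `bill_of_materials` rows with a `LEFT JOIN` and compares a list of Drive file names against the `engineering_changes` table. It uses `sqlite3` as a stand-in for PostgreSQL; the column names (`part_number`, `bom_id`, `eco_number`), the file-naming convention, and the sample data are hypothetical.

```python
import sqlite3
from pathlib import Path

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parts (part_number TEXT);
    CREATE TABLE bill_of_materials (bom_id INTEGER, part_number TEXT);
    CREATE TABLE engineering_changes (eco_number TEXT);
    INSERT INTO parts VALUES ('P-1001'), ('P-1002');
    INSERT INTO bill_of_materials VALUES (1, 'P-1001'), (2, 'P-9999');
    INSERT INTO engineering_changes VALUES ('ECO-001');
""")

# Referential integrity: BOM rows whose part_number has no match in parts.
orphaned_bom_ids = [
    row[0]
    for row in conn.execute("""
        SELECT b.bom_id
        FROM bill_of_materials AS b
        LEFT JOIN parts AS p ON p.part_number = b.part_number
        WHERE p.part_number IS NULL
        ORDER BY b.bom_id
    """)
]

# Cross-system reconciliation: ECO files in Drive without a database record.
# The file list is hard-coded here; in the application it would come from
# the Google Drive file lister.
drive_files = ["ECO-001.pdf", "ECO-002.pdf"]
known_ecos = {
    row[0] for row in conn.execute("SELECT eco_number FROM engineering_changes")
}
unrecorded_files = [
    name for name in drive_files if Path(name).stem not in known_ecos
]
```

With this sample data, the join flags the BOM row that points at `P-9999`, and the set comparison flags `ECO-002.pdf` as a file with no database entry.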
3. Reporting & Planning
The DiagnosisAgent produces:
- Health Report: A markdown report detailing each issue found, its severity, and the specific records affected
- Example: “Found 15 parts with inconsistent descriptions between Oracle Cloud and Oracle Legacy sources”
- Cleansing Plan: A JSON object containing a list of steps. Each step includes:
- Natural language description
- Generated code (SQL/Python) to fix it
- Expected outcome
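A Cleansing Plan with these three fields might be serialized as in the sketch below. The key names (`description`, `language`, `code`, `expected_outcome`), the extra `language` field, and the top-level `steps` key are assumptions, not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CleansingStep:
    description: str       # natural language description of the fix
    language: str          # "sql" or "python" (hypothetical extra field)
    code: str              # generated code that applies the fix
    expected_outcome: str  # what should be true after the step runs


plan = {
    "steps": [
        asdict(CleansingStep(
            description="Delete orphaned BOM records",
            language="sql",
            code=(
                "DELETE FROM bill_of_materials "
                "WHERE part_number NOT IN (SELECT part_number FROM parts);"
            ),
            expected_outcome="No BOM row references a missing part",
        )),
    ]
}

# The JSON form is what the web UI would render as a checklist.
plan_json = json.dumps(plan, indent=2)
```

Keeping the description, code, and expected outcome together in one record lets the UI show all three side by side when a user reviews a step.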
4. Interactive Cleansing
The web UI displays the Cleansing Plan as a checklist. Users can:
- Review the code for each step
- Approve/reject individual fixes
- Trigger the ExecutionAgent with a single click
- View real-time logging streamed to the UI
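The approval and streaming behavior can be sketched with two small helpers. The source does not say how logs reach the browser; Server-Sent Events is assumed here as one option, and the step `id` field and helper names are hypothetical.

```python
from typing import Iterable, Iterator


def approved_steps(plan: list[dict], approvals: dict[str, bool]) -> list[dict]:
    """Keep only the steps the user ticked; unlisted steps count as rejected."""
    return [step for step in plan if approvals.get(step["id"], False)]


def sse_events(log_lines: Iterable[str]) -> Iterator[str]:
    """Format each log line as a Server-Sent Events message."""
    for line in log_lines:
        yield f"data: {line}\n\n"


plan = [
    {"id": "step-1", "description": "Merge duplicate part P-1001"},
    {"id": "step-2", "description": "Delete orphaned BOM record"},
]
to_run = approved_steps(plan, {"step-1": True, "step-2": False})
events = list(sse_events(f"Executing: {step['description']}" for step in to_run))
```

On the backend, a generator like `sse_events` could be returned through FastAPI's `StreamingResponse` with the `text/event-stream` media type, so the UI receives each log line as soon as it is produced.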
Demo Walkthrough
Demo Scenario (15 minutes)
(0:00-2:00) The Problem
Start on the Supabase dashboard. Show the messy tables (parts, bill_of_materials). Point out a clear duplicate part and an orphaned BOM record. Briefly show the Google Drive folder, highlighting an ECO PDF that has no corresponding database entry. This establishes the “pain.”
(2:00-5:00) Diagnosis
Switch to the data-doc web app. Kick off a “New Scan.” The UI shows the ScannerAgent and DiagnosisAgent running in sequence. A “Health Report” is generated.
(5:00-9:00) The Plan
Walk through the Health Report, showing how the agent correctly identified the exact issues we saw manually. Click “Generate Cleansing Plan.” Review the proposed plan, showing the generated SQL for merging duplicate parts and deleting orphaned records. Highlight the step that proposes creating a new entry in the engineering_changes table based on the PDF content.
(9:00-12:00) Execution
The user clicks “Approve & Execute Plan.” A modal appears showing the ExecutionAgent logging its actions in real time: “Backing up parts table… Executing step 1/5: Merging duplicates… Success.”
(12:00-15:00) The Result
The UI shows a “Cleansing Complete” summary. Switch back to the Supabase dashboard and refresh the parts table. Show that the duplicates are gone and the data is clean. Show the new record in the engineering_changes table. Conclude by reiterating how the agentic system automated a complex, error-prone manual task in minutes.
Web Interface Design
The React frontend, built with Vite, is designed for clarity and user control: every proposed fix is visible to the user and runs only after approval.
Main Dashboard
- List of previous scans and their outcomes (e.g., “Scan - Oct 4, 2025: 12 issues found”)
- Prominent “Start New Scan” button
- High-level health score/KPI for the connected data source (e.g., “Data Integrity Score: 65%”)
Scan Results View
Health Report Tab
Displays the generated markdown report with collapsible sections for each issue category (e.g., “Inconsistent Parts,” “Divergent ECOs”)
Cleansing Plan Tab
An interactive checklist of proposed actions, with a global “Approve & Execute” button at the top. Each item shows:
- [Checkbox] A natural language description (e.g., “Merge duplicate part P-1001”)
- A “View Code” button that opens a modal with the generated SQL/Python script
- An estimated risk level (Low, Medium)
Execution View
- Live-scrolling console log showing the ExecutionAgent’s progress
- Progress bar for the overall plan
- Final summary report detailing what was changed