data-doc & data-sim Project

Agentic data cleansing and normalization system

Project maintained by realSergiy

data-doc Application

Overview

data-doc is an agentic application that diagnoses and repairs data quality issues in manufacturing databases. It uses LangGraph-orchestrated agents, powered by the Claude API, to scan, analyze, and fix data problems with minimal human intervention.

Technology Stack

Backend: Python with FastAPI
Agent Framework: LangGraph with Claude API
Frontend: React web interface built with Vite
Database Connection: psycopg2 for Supabase (PostgreSQL)
File Storage: Google Drive API integration

Agentic Framework Design (LangGraph)

A multi-agent graph where each agent has a specialized role, managed by LangGraph for state transitions and complex logic.
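
As a rough illustration of the hand-off between the three agents described below, the following sketch chains plain Python functions over a shared state dictionary. It deliberately does not use the LangGraph API; in LangGraph, each function would be registered as a node and the ordering expressed as edges. The state keys (findings, report, plan, approved, log) and the function bodies are hypothetical placeholders, not the project's actual schema.

```python
from typing import Callable, TypedDict


class PipelineState(TypedDict, total=False):
    findings: list[str]   # raw findings from the scanner
    report: str           # human-readable Health Report
    plan: list[str]       # ordered cleansing steps
    approved: bool        # set by the user in the web UI
    log: list[str]        # execution log


def scanner_agent(state: PipelineState) -> PipelineState:
    # Placeholder: a real scanner would profile tables and files.
    return {**state, "findings": ["orphaned BOM record", "duplicate part"]}


def diagnosis_agent(state: PipelineState) -> PipelineState:
    # Turn raw findings into a report and an ordered plan.
    findings = state["findings"]
    report = "\n".join(f"- {f}" for f in findings)
    plan = [f"fix: {f}" for f in findings]
    return {**state, "report": report, "plan": plan}


def execution_agent(state: PipelineState) -> PipelineState:
    # Only act when the user has approved the plan.
    if not state.get("approved"):
        return {**state, "log": ["skipped: plan not approved"]}
    return {**state, "log": [f"executed {step}" for step in state["plan"]]}


def run_pipeline(state: PipelineState, nodes: list[Callable]) -> PipelineState:
    # Each node receives the current state and returns an updated copy.
    for node in nodes:
        state = node(state)
    return state


result = run_pipeline(
    {"approved": True},
    [scanner_agent, diagnosis_agent, execution_agent],
)
```

Returning a new state from every node, rather than mutating a shared object, keeps each transition easy to log and replay.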

Agent 1: ScannerAgent

Role: Data Exploration & Profiling
Tools: SQL Executor (read-only), Google Drive File Lister/Reader
Function: Connects to the data sources and performs a systematic inventory. It profiles tables (column types, null counts, value distributions), checks for broken foreign key relationships, and cross-references database records with files in Google Drive. It then compiles the raw findings into a structured state object.
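
The profiling and foreign key checks could look roughly like the sketch below. It uses an in-memory SQLite database from the standard library as a stand-in for Supabase (PostgreSQL), so the PRAGMA call would need a PostgreSQL equivalent such as a query against information_schema. The function names, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3


def profile_table(conn: sqlite3.Connection, table: str) -> dict:
    """Return the row count plus per-column null and distinct counts.

    Table and column names are interpolated directly, so they must come
    from a trusted source (here, the database's own catalog).
    """
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    profile = {"table": table, "rows": total, "columns": {}}
    for col in columns:
        nulls, distinct = conn.execute(
            f"SELECT SUM({col} IS NULL), COUNT(DISTINCT {col}) FROM {table}"
        ).fetchone()
        # SUM returns NULL on an empty table, hence the fallback to 0.
        profile["columns"][col] = {"nulls": nulls or 0, "distinct": distinct}
    return profile


def find_orphans(conn, child: str, fk: str, parent: str, pk: str) -> list:
    """Return child foreign key values that have no matching parent row."""
    rows = conn.execute(
        f"SELECT c.{fk} FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{fk} = p.{pk} "
        f"WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL"
    )
    return [r[0] for r in rows]


# Hypothetical sample data: one missing description, one orphaned BOM row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parts (part_id TEXT, description TEXT);
    CREATE TABLE bill_of_materials (bom_id INTEGER, part_id TEXT);
    INSERT INTO parts VALUES ('P-1001', 'Bracket'), ('P-1002', NULL);
    INSERT INTO bill_of_materials VALUES (1, 'P-1001'), (2, 'P-9999');
""")
parts_profile = profile_table(conn, "parts")
orphans = find_orphans(conn, "bill_of_materials", "part_id", "parts", "part_id")
```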

Agent 2: DiagnosisAgent

Role: Issue Identification & Plan Formulation
Tools: Claude LLM (for reasoning), Python/SQL Code Generator
Function: Receives the structured findings from the ScannerAgent. Uses predefined rules and LLM-driven reasoning to interpret the raw data profile and identify specific business problems. Generates two key artifacts:
1. A human-readable Health Report
2. A machine-executable Cleansing Plan, consisting of a sequence of SQL and Python scripts with explanations

Agent 3: ExecutionAgent

Role: Applying Fixes
Tools: SQL Executor (read-write), Database Schema Cloner
Function: Takes the user-approved Cleansing Plan. Its first action is to create a backup clone of every table it will modify. It then executes each step of the plan sequentially, logging all actions, successes, and failures, and provides a final execution summary.
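
A minimal sketch of the backup-then-execute behavior, again using SQLite as a stand-in for PostgreSQL. The step dictionary keys (description, table, sql), the _backup naming suffix, and the choice to continue after a failed step are all assumptions; the source only states that tables are cloned first and that every action is logged.

```python
import sqlite3


def execute_plan(conn: sqlite3.Connection, steps: list[dict]) -> list[str]:
    """Back up each affected table once, then run the steps in order."""
    log = []
    backed_up = set()
    # First action: clone every table the plan will touch.
    for step in steps:
        table = step["table"]
        if table not in backed_up:
            conn.execute(f"CREATE TABLE {table}_backup AS SELECT * FROM {table}")
            backed_up.add(table)
            log.append(f"backed up {table}")
    # Then run each step in its own transaction and record the outcome.
    for i, step in enumerate(steps, start=1):
        label = f"step {i}/{len(steps)}: {step['description']}"
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(step["sql"])
            log.append(f"{label}: success")
        except sqlite3.Error as exc:
            log.append(f"{label}: failed ({exc})")
    return log


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parts (part_id TEXT);
    CREATE TABLE bill_of_materials (bom_id INTEGER, part_id TEXT);
    INSERT INTO parts VALUES ('P-1001');
    INSERT INTO bill_of_materials VALUES (1, 'P-1001'), (2, 'P-9999');
""")
steps = [
    {
        "description": "Delete orphaned BOM records",
        "table": "bill_of_materials",
        "sql": "DELETE FROM bill_of_materials "
               "WHERE part_id NOT IN (SELECT part_id FROM parts)",
    },
    {
        # Deliberately broken step to show failure logging.
        "description": "Update a missing table",
        "table": "bill_of_materials",
        "sql": "UPDATE no_such_table SET x = 1",
    },
]
execution_log = execute_plan(conn, steps)
```

Running each step in its own transaction means a failed step leaves no partial changes behind, while the backup tables still allow a full restore if an earlier step turns out to be wrong.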

Core Functionality

1. Connection & Discovery

The system securely connects to Supabase and Google Drive using API keys. The ScannerAgent dynamically discovers all tables and files in the target environment.

2. Intelligent Scanning

The ScannerAgent executes a pre-defined sequence of checks:

Schema validation
Referential integrity checks
Outlier detection
Cross-system reconciliation (DB vs. files)
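
Two of these checks can be sketched with the standard library alone. The source does not say which outlier method is used, so the interquartile-range rule below is an assumed choice, and the ECO identifiers and unit costs are made-up sample values.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def reconcile(db_keys: set[str], file_keys: set[str]) -> dict[str, set[str]]:
    """Compare record keys found in the database with keys found in files."""
    return {
        "missing_in_db": file_keys - db_keys,
        "missing_in_files": db_keys - file_keys,
    }


unit_costs = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 250.0]
outliers = iqr_outliers(unit_costs)  # flags 250.0
gaps = reconcile({"ECO-001", "ECO-002"}, {"ECO-002", "ECO-003"})
```

Here gaps["missing_in_db"] corresponds to the demo case of an ECO PDF in Google Drive with no matching database entry.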

3. Reporting & Planning

The DiagnosisAgent produces:

Health Report: A markdown report detailing each issue found, its severity, and the specific records affected
- Example: “Found 15 parts with inconsistent descriptions between Oracle Cloud and Oracle Legacy sources”
Cleansing Plan: A JSON object containing a list of steps. Each step includes:
- Natural language description
- Generated code (SQL/Python) to fix it
- Expected outcome
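
One possible shape for the Cleansing Plan JSON is sketched below. The field names (description, language, code, expected_outcome) are assumptions derived from the three items listed above; the actual schema may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CleansingStep:
    description: str        # natural language description
    language: str           # "sql" or "python"
    code: str               # generated code that applies the fix
    expected_outcome: str   # what should be true afterwards


@dataclass
class CleansingPlan:
    steps: list[CleansingStep]

    def to_json(self) -> str:
        # asdict recurses into the nested step dataclasses.
        return json.dumps(asdict(self), indent=2)


plan = CleansingPlan(steps=[
    CleansingStep(
        description="Delete orphaned BOM records",
        language="sql",
        code="DELETE FROM bill_of_materials "
             "WHERE part_id NOT IN (SELECT part_id FROM parts);",
        expected_outcome="Every BOM row references an existing part",
    ),
])
plan_json = plan.to_json()
```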

4. Interactive Cleansing

The web UI displays the Cleansing Plan as a checklist. Users can:

Review the code for each step
Approve/reject individual fixes
Trigger the ExecutionAgent with a single click
View real-time logging streamed to the UI
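
The approve/reject flow can be reduced to filtering the plan before execution. The sketch below assumes each step carries a hypothetical integer id, and it only formats progress lines; it does not run any fixes. In the real application the lines would be pushed to the browser rather than collected in a list.

```python
from typing import Iterator


def select_approved(steps: list[dict], approved_ids: set[int]) -> list[dict]:
    """Keep only the steps the user ticked, preserving plan order."""
    return [s for s in steps if s["id"] in approved_ids]


def stream_log(steps: list[dict]) -> Iterator[str]:
    """Yield one progress line per step so a UI can render it as it arrives."""
    total = len(steps)
    for i, step in enumerate(steps, start=1):
        yield f"Executing step {i}/{total}: {step['description']}"


plan_steps = [
    {"id": 1, "description": "Merge duplicate part P-1001"},
    {"id": 2, "description": "Delete orphaned BOM records"},
    {"id": 3, "description": "Create engineering_changes entry"},
]
approved = select_approved(plan_steps, approved_ids={1, 3})
progress_lines = list(stream_log(approved))
```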

Demo Walkthrough

Demo Scenario (15 minutes)

(0:00-2:00) The Problem

Start on the Supabase dashboard. Show the messy tables (parts, bill_of_materials). Point out a clear duplicate part and an orphaned BOM record. Briefly show the Google Drive folder, highlighting an ECO PDF that has no corresponding database entry. This establishes the “pain.”

(2:00-5:00) Diagnosis

Switch to the data-doc web app. Kick off a “New Scan.” The UI shows the ScannerAgent and DiagnosisAgent running in sequence. A “Health Report” is generated.

(5:00-9:00) The Plan

Walk through the Health Report, showing how the agent correctly identified the exact issues we saw manually. Click “Generate Cleansing Plan.” Review the proposed plan, showing the generated SQL for merging duplicate parts and deleting orphaned records. Highlight the step that proposes creating a new entry in the engineering_changes table based on the PDF content.

(9:00-12:00) Execution

The user clicks “Approve & Execute Plan.” A modal appears showing the ExecutionAgent logging its actions in real time: “Backing up parts table… Executing step 1/5: Merging duplicates… Success.”

(12:00-15:00) The Result

The UI shows a “Cleansing Complete” summary. Switch back to the Supabase dashboard and refresh the parts table. Show that the duplicates are gone and the data is clean. Show the new record in the engineering_changes table. Conclude by reiterating how the agentic system automated a complex, error-prone manual task in minutes.

Web Interface Design

The React frontend, built with Vite, is designed around clarity and user control: every proposed change is visible and nothing runs without approval.

Main Dashboard

List of previous scans and their outcomes (e.g., “Scan - Oct 4, 2025: 12 issues found”)
Prominent “Start New Scan” button
High-level health score/KPI for the connected data source (e.g., “Data Integrity Score: 65%”)

Scan Results View

Health Report Tab

Displays the generated markdown report with collapsible sections for each issue category (e.g., “Inconsistent Parts,” “Divergent ECOs”)

Cleansing Plan Tab

An interactive checklist of proposed actions. Each item shows:

[Checkbox] A natural language description (e.g., “Merge duplicate part P-1001”)
A “View Code” button that opens a modal with the generated SQL/Python script
An estimated risk level (Low, Medium)
A global “Approve & Execute” button sits at the top of the checklist and applies to all checked items.

Execution View

Live-scrolling console log showing the ExecutionAgent’s progress
Progress bar for the overall plan
Final summary report detailing what was changed