data-doc & data-sim Project
Agentic data cleansing and normalization system
Project Plan: Agentic “data-doc” Internal Demo
This document outlines the project plan for developing an internal technology demonstration of the “data-doc,” an agentic data cleansing and normalization system.
Project Components
The project consists of two independent applications:
- data-sim - A standalone application for generating and populating demo environments with faulty data
- data-doc - An agentic application for diagnosing and repairing data quality issues
High-Level System Architecture
The end-to-end architecture is composed of two independent applications that interact via a shared data layer hosted on Supabase (PostgreSQL) and Google Drive.
System Components
data-sim
A standalone application responsible for generating and populating the demo environment:
- Synthesizes tabular data using libraries like pandas and Faker
- Connects to Supabase via the psycopg2 library to create schemas and insert faulty data (see the sketch after this list)
- Uses the Google Drive API to upload unstructured documents (e.g., PDFs)
- Technology Stack: Python backend with LangGraph agents, React + Vite web interface
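As a rough illustration of the seeding step, the sketch below synthesizes a small parts table with Faker, injects deliberate faults (duplicate part numbers, inconsistent supplier casing), and writes it to Supabase with psycopg2. The table name, columns, and the SUPABASE_DSN environment variable are placeholders, not the final data-sim schema.

```python
# Illustrative sketch of data-sim's seeding step (table and column names are
# hypothetical; the real schema lives in the data-sim documentation).
import os
import random

import psycopg2
from faker import Faker

fake = Faker()

def build_faulty_parts(n: int = 100) -> list[tuple[str, str, str]]:
    """Generate part records, deliberately injecting duplicates and casing drift."""
    rows = []
    for i in range(n):
        part_no = f"PN-{i:05d}"
        rows.append((part_no, fake.word().title(), fake.company()))
    # Inject faults: duplicate part numbers with inconsistent supplier casing
    for part_no, name, supplier in random.sample(rows, k=10):
        rows.append((part_no, name, supplier.upper()))
    return rows

def seed(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS parts ("
            "id SERIAL PRIMARY KEY, part_no TEXT, name TEXT, supplier TEXT)"
        )
        cur.executemany(
            "INSERT INTO parts (part_no, name, supplier) VALUES (%s, %s, %s)",
            build_faulty_parts(),
        )

if __name__ == "__main__":
    # SUPABASE_DSN is assumed to hold the Supabase Postgres connection string
    seed(os.environ["SUPABASE_DSN"])
```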
See data-sim Documentation for detailed specifications.
data-doc
The core agentic application with three main components:
- Backend (Python/FastAPI): Hosts the core business logic and exposes RESTful APIs. It orchestrates the agentic workflows built with LangGraph (see the sketch after this list)
- Agent Core (LangGraph/Claude API): The “brain” of the system. It connects to Supabase and Google Drive to read data, analyze it, and execute cleansing tasks
- Frontend (React + Vite): A web-based user interface for monitoring the system, reviewing reports, and interactively approving cleansing plans
- Technology Stack: Python backend with LangGraph agents, React + Vite web interface
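A minimal sketch of what the backend surface could look like is shown below: one endpoint starts a scan in the background, another returns the latest results for the frontend. The endpoint paths, the ScanRequest model, and the run_diagnosis() helper are illustrative assumptions, not the final API.

```python
# Minimal sketch of the data-doc backend surface (endpoint names and the
# run_diagnosis() helper are hypothetical, not the final API).
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI(title="data-doc")

class ScanRequest(BaseModel):
    schema_name: str = "public"

def run_diagnosis(schema_name: str) -> None:
    """Placeholder for the LangGraph workflow that discovers and diagnoses data issues."""
    ...

@app.post("/scans")
def start_scan(req: ScanRequest, background: BackgroundTasks) -> dict:
    # The long-running agent workflow runs outside the request/response cycle
    background.add_task(run_diagnosis, req.schema_name)
    return {"status": "started", "schema": req.schema_name}

@app.get("/reports/latest")
def latest_report() -> dict:
    # In the real system this would return the Health Report and Cleansing Plan
    return {"health_report": None, "cleansing_plan": None}
```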
See data-doc Documentation for detailed specifications.
Architectural Flow
1. The data-sim is executed to set up the demo scenario, populating the Supabase DB and Google Drive with a faulty dataset
2. The data-doc frontend connects to its backend
3. The user initiates a “scan” from the web UI
4. The data-doc backend triggers the LangGraph agent workflow (sketched after this list)
5. The agents connect to the Supabase DB and Google Drive (read-only) to perform discovery and diagnosis
6. The results (Health Report, Cleansing Plan) are sent back to the frontend for user review
7. The user approves the plan
8. The data-doc backend triggers the execution agent, which connects to the Supabase DB (read-write) to apply the approved fixes
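The approval gate in steps 6–8 can be modeled directly in the agent graph. The sketch below wires a discover → plan → execute flow with LangGraph, where execution only runs if the plan has been approved; the node names, state fields, and stub functions are assumptions for illustration.

```python
# Sketch of the discover -> plan -> execute flow as a LangGraph state graph
# (node functions are stubs; node names and state fields are assumptions).
from typing import TypedDict

from langgraph.graph import END, StateGraph

class DocState(TypedDict, total=False):
    findings: list[str]        # issues found during discovery/diagnosis
    cleansing_plan: list[str]  # proposed fixes for user review
    approved: bool             # set after the user reviews the plan in the UI

def discover(state: DocState) -> DocState:
    # Read-only pass over the Supabase DB and Google Drive
    return {"findings": []}

def plan(state: DocState) -> DocState:
    # Turn findings into a Health Report and Cleansing Plan
    return {"cleansing_plan": []}

def execute(state: DocState) -> DocState:
    # Read-write pass: apply only the approved fixes
    return {}

graph = StateGraph(DocState)
graph.add_node("discover", discover)
graph.add_node("plan", plan)
graph.add_node("execute", execute)
graph.set_entry_point("discover")
graph.add_edge("discover", "plan")
# Only continue to execution if the user approved the plan
graph.add_conditional_edges(
    "plan", lambda s: "execute" if s.get("approved") else END
)
graph.add_edge("execute", END)
workflow = graph.compile()
```

In practice the graph would more likely pause after the planning step (for example via a LangGraph interrupt or a second invocation) while the user reviews the Cleansing Plan in the frontend, rather than carry an approved flag from the start.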
Mismatch Analysis
This demo plan specifies a Supabase (PostgreSQL) stack rather than the Oracle ecosystem. This technical difference does not invalidate the demo’s purpose as a proof-of-concept, for the following reasons:
- Problem Equivalence: The core data problems (disparate data sources, inconsistent master data, lack of synchronization between systems such as PLM and ERP, and poor traceability) are logical patterns, not technology-specific issues. A duplicate part number is a logical error whether it resides in Oracle or PostgreSQL. An inconsistency between a PDF in a file store and a record in a database is a cross-system reconciliation challenge regardless of the specific vendors.
- Focus on Agentic Logic: The primary goal of this demo is to showcase the agentic framework’s capability to connect, reason, plan, and execute. The data-doc’s intelligence lies in the LangGraph orchestration and the LLM’s ability to generate corrective code, not in the specific database connector it uses.
- Proof-of-Concept Validity: By successfully diagnosing and repairing these canonical manufacturing data issues in a controlled PostgreSQL environment, we prove that the core agentic engine works. Adapting it to an Oracle environment would primarily involve swapping the database connector (e.g., from psycopg2 to cx_Oracle) and adjusting the SQL dialect in the agent’s prompts, as sketched below; this is a configuration change, not a fundamental architectural redesign. The demo therefore provides a strong validation of the core technology before investing in enterprise-specific integration.
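To make the scope of that swap concrete, the sketch below shows one way the agent’s database access could sit behind a single factory selected by configuration; the DB_BACKEND setting, the environment variable names, and the get_connection() helper are hypothetical.

```python
# Sketch of isolating database access behind one factory so the
# Postgres -> Oracle swap stays a configuration change
# (DB_BACKEND and the environment variable names are hypothetical).
import os

def get_connection():
    backend = os.environ.get("DB_BACKEND", "postgres")
    if backend == "postgres":
        import psycopg2
        return psycopg2.connect(os.environ["SUPABASE_DSN"])
    if backend == "oracle":
        import cx_Oracle
        return cx_Oracle.connect(
            os.environ["ORACLE_USER"],
            os.environ["ORACLE_PASSWORD"],
            os.environ["ORACLE_DSN"],
        )
    raise ValueError(f"Unknown DB_BACKEND: {backend}")
```

The remaining Oracle-specific work, adjusting the SQL dialect used by the agents, would live in the agent’s prompts and query templates rather than in a factory like this.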