data-doc & data-sim Project
Agentic data cleansing and normalization system
Project Plan: Agentic “data-doc” Internal Demo
This document outlines the project plan for developing an internal technology demonstration of the “data-doc,” an agentic data cleansing and normalization system.
Project Components
The project consists of two independent applications:
- data-sim - A standalone application for generating and populating demo environments with faulty data
- data-doc - An agentic application for diagnosing and repairing data quality issues
High-Level System Architecture
The end-to-end architecture is composed of two independent applications that interact via a shared data layer hosted on Supabase (PostgreSQL) and Google Drive.
System Components
data-sim
A standalone application responsible for generating and populating the demo environment:
- Synthesizes tabular data using libraries like pandas and Faker
- Connects to Supabase via the psycopg2 library to create schemas and insert faulty data (see the sketch after this list)
- Uses the Google Drive API to upload unstructured documents (e.g., PDFs)
- Technology Stack: Python backend with LangGraph agents, React + Vite web interface
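As a rough illustration of the seeding step, the sketch below synthesizes a small parts table with Faker, injects deliberate faults (duplicate part numbers, inconsistent supplier casing), and writes it to Supabase with psycopg2. The table name, columns, and the SUPABASE_DSN environment variable are placeholders, not the final data-sim schema.

```python
# Illustrative sketch of data-sim's seeding step (table and column names are
# hypothetical; the real schema lives in the data-sim documentation).
import os
import random

import psycopg2
from faker import Faker

fake = Faker()

def build_faulty_parts(n: int = 100) -> list[tuple[str, str, str]]:
    """Generate part records, deliberately injecting duplicates and casing drift."""
    rows = []
    for i in range(n):
        part_no = f"PN-{i:05d}"
        rows.append((part_no, fake.word().title(), fake.company()))
    # Inject faults: duplicate part numbers with inconsistent supplier casing
    for part_no, name, supplier in random.sample(rows, k=10):
        rows.append((part_no, name, supplier.upper()))
    return rows

def seed(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS parts ("
            "id SERIAL PRIMARY KEY, part_no TEXT, name TEXT, supplier TEXT)"
        )
        cur.executemany(
            "INSERT INTO parts (part_no, name, supplier) VALUES (%s, %s, %s)",
            build_faulty_parts(),
        )

if __name__ == "__main__":
    # SUPABASE_DSN is assumed to hold the Supabase Postgres connection string
    seed(os.environ["SUPABASE_DSN"])
```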
See data-sim Documentation for detailed specifications.
data-doc
The core agentic application with three main components:
- Backend (Python/FastAPI): Hosts the core business logic and exposes RESTful APIs. It orchestrates the agentic workflows built with LangGraph (see the sketch after this list)
- Agent Core (LangGraph/Claude API): The “brain” of the system. It connects to Supabase and Google Drive to read data, analyze it, and execute cleansing tasks
- Frontend (React + Vite): A web-based user interface for monitoring the system, reviewing reports, and interactively approving cleansing plans
- Technology Stack: Python backend with LangGraph agents, React + Vite web interface
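A minimal sketch of what the backend surface could look like is shown below: one endpoint starts a scan in the background, another returns the latest results for the frontend. The endpoint paths, the ScanRequest model, and the run_diagnosis() helper are illustrative assumptions, not the final API.

```python
# Minimal sketch of the data-doc backend surface (endpoint names and the
# run_diagnosis() helper are hypothetical, not the final API).
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI(title="data-doc")

class ScanRequest(BaseModel):
    schema_name: str = "public"

def run_diagnosis(schema_name: str) -> None:
    """Placeholder for the LangGraph workflow that discovers and diagnoses data issues."""
    ...

@app.post("/scans")
def start_scan(req: ScanRequest, background: BackgroundTasks) -> dict:
    # The long-running agent workflow runs outside the request/response cycle
    background.add_task(run_diagnosis, req.schema_name)
    return {"status": "started", "schema": req.schema_name}

@app.get("/reports/latest")
def latest_report() -> dict:
    # In the real system this would return the Health Report and Cleansing Plan
    return {"health_report": None, "cleansing_plan": None}
```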
See data-doc Documentation for detailed specifications.
Architectural Flow
1. The data-sim is executed to set up the demo scenario, populating the Supabase DB and Google Drive with a faulty dataset
2. The data-doc frontend connects to its backend
3. The user initiates a “scan” from the web UI
4. The data-doc backend triggers the LangGraph agent workflow (sketched after this list)
5. The agents connect to the Supabase DB and Google Drive (read-only) to perform discovery and diagnosis
6. The results (Health Report, Cleansing Plan) are sent back to the frontend for user review
7. The user approves the plan
8. The data-doc backend triggers the execution agent, which connects to the Supabase DB (read-write) to apply the approved fixes
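The approval gate in steps 6–8 can be modeled directly in the agent graph. The sketch below wires a discover → plan → execute flow with LangGraph, where execution only runs if the plan has been approved; the node names, state fields, and stub functions are assumptions for illustration.

```python
# Sketch of the discover -> plan -> execute flow as a LangGraph state graph
# (node functions are stubs; node names and state fields are assumptions).
from typing import TypedDict

from langgraph.graph import END, StateGraph

class DocState(TypedDict, total=False):
    findings: list[str]        # issues found during discovery/diagnosis
    cleansing_plan: list[str]  # proposed fixes for user review
    approved: bool             # set after the user reviews the plan in the UI

def discover(state: DocState) -> DocState:
    # Read-only pass over the Supabase DB and Google Drive
    return {"findings": []}

def plan(state: DocState) -> DocState:
    # Turn findings into a Health Report and Cleansing Plan
    return {"cleansing_plan": []}

def execute(state: DocState) -> DocState:
    # Read-write pass: apply only the approved fixes
    return {}

graph = StateGraph(DocState)
graph.add_node("discover", discover)
graph.add_node("plan", plan)
graph.add_node("execute", execute)
graph.set_entry_point("discover")
graph.add_edge("discover", "plan")
# Only continue to execution if the user approved the plan
graph.add_conditional_edges(
    "plan", lambda s: "execute" if s.get("approved") else END
)
graph.add_edge("execute", END)
workflow = graph.compile()
```

In practice the graph would more likely pause after the planning step (for example via a LangGraph interrupt or a second invocation) while the user reviews the Cleansing Plan in the frontend, rather than carry an approved flag from the start.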
Mismatch Analysis
This demo plan specifies a Supabase (PostgreSQL) stack rather than the Oracle ecosystem. This technical difference does not invalidate the demo’s purpose as a proof-of-concept, for the following reasons:
- Problem Equivalence: The core data problems (disparate data sources, inconsistent master data, lack of synchronization between systems such as PLM and ERP, and poor traceability) are logical patterns, not technology-specific issues. A duplicate part number is a logical error whether it resides in Oracle or PostgreSQL. An inconsistency between a PDF in a file store and a record in a database is a cross-system reconciliation challenge regardless of the specific vendors.
- Focus on Agentic Logic: The primary goal of this demo is to showcase the agentic framework’s capability to connect, reason, plan, and execute. The data-doc’s intelligence lies in the LangGraph orchestration and the LLM’s ability to generate corrective code, not in the specific database connector it uses.
- Proof-of-Concept Validity: By successfully diagnosing and repairing these canonical manufacturing data issues in a controlled PostgreSQL environment, we prove that the core agentic engine works. Adapting it to an Oracle environment would primarily involve swapping the database connector (e.g., from psycopg2 to cx_Oracle) and adjusting the SQL dialect in the agent’s prompts, as sketched below; this is a configuration change, not a fundamental architectural redesign. The demo therefore provides a strong validation of the core technology before investing in enterprise-specific integration.
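To make the scope of that swap concrete, the sketch below shows one way the agent’s database access could sit behind a single factory selected by configuration; the DB_BACKEND setting, the environment variable names, and the get_connection() helper are hypothetical.

```python
# Sketch of isolating database access behind one factory so the
# Postgres -> Oracle swap stays a configuration change
# (DB_BACKEND and the environment variable names are hypothetical).
import os

def get_connection():
    backend = os.environ.get("DB_BACKEND", "postgres")
    if backend == "postgres":
        import psycopg2
        return psycopg2.connect(os.environ["SUPABASE_DSN"])
    if backend == "oracle":
        import cx_Oracle
        return cx_Oracle.connect(
            os.environ["ORACLE_USER"],
            os.environ["ORACLE_PASSWORD"],
            os.environ["ORACLE_DSN"],
        )
    raise ValueError(f"Unknown DB_BACKEND: {backend}")
```

The remaining Oracle-specific work, adjusting the SQL dialect used by the agents, would live in the agent’s prompts and query templates rather than in a factory like this.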