Data Quality Validator

Pro v1.0.0

Validate CSV and Excel files before analysis. Detects missing values, duplicates, outliers, and format errors. Generates quality scores, severity-classified issues, and actionable remediation recommendations.

What You Get

Catch data quality issues before analysis with automated validation and severity-classified remediation guidance.

The Problem

You need to ensure your CSV or Excel data is clean before analysis, migration, or reporting, but manually checking for missing values, duplicates, outliers, and format errors across thousands of rows is time-consuming and error-prone. Many organizations discover critical data problems only after investing significant time in analysis or after a migration has failed.

The Solution

This skill validates your dataset in seven steps to identify and prioritize data quality issues before they affect your work. It loads the dataset, profiles its structure, then runs checks for missing values, duplicate rows and IDs, statistical outliers, format errors, and range violations.

Each finding is classified by severity based on business impact: Critical issues break calculations, High issues affect analysis reliability, Medium issues are recommended fixes, and Low issues are informational only. The skill also calculates an overall quality score from 0 to 100 for an at-a-glance assessment.

The resulting quality report includes detailed statistics by column, data completeness percentages with visual progress bars, specific row numbers and values for each issue, and prioritized action recommendations. For audit scenarios, it can generate audit-ready reports that demonstrate compliance with data quality standards.

The skill supports re-validation after corrections to confirm improvements, handles files up to 500MB and 1M rows, works with CSV and Excel formats, and can apply custom business rules such as requiring positive revenue values or dates within specific ranges.
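To illustrate how severity-classified findings could roll up into a 0-100 score, here is a minimal Python sketch. The skill's actual scoring formula is not documented here, so the `SEVERITY_WEIGHTS` values, the `Issue` class, and the `quality_score` function are all hypothetical: each issue subtracts its severity weight scaled by the share of rows it affects.

```python
from dataclasses import dataclass

# Hypothetical weights: the maximum points an issue of each severity can
# remove from the score when it affects every row. Not the skill's real values.
SEVERITY_WEIGHTS = {"critical": 40.0, "high": 20.0, "medium": 10.0, "low": 2.0}


@dataclass
class Issue:
    column: str
    description: str
    severity: str       # one of the keys in SEVERITY_WEIGHTS
    affected_rows: int  # number of rows the issue touches


def quality_score(issues, total_rows):
    """Start at 100 and subtract a row-share-weighted penalty per issue."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    penalty = 0.0
    for issue in issues:
        share = min(issue.affected_rows / total_rows, 1.0)
        penalty += SEVERITY_WEIGHTS[issue.severity] * share
    # Clamp at 0 so many severe issues cannot produce a negative score.
    return max(0.0, round(100.0 - penalty, 1))


example_issues = [
    Issue("revenue", "non-numeric values", "critical", affected_rows=50),
    Issue("region", "inconsistent casing", "medium", affected_rows=100),
]
example_score = quality_score(example_issues, total_rows=1000)
print(example_score)
```

Scaling each penalty by the share of affected rows keeps a single bad cell in a large file from dominating the score. A different weighting would be equally defensible.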

How It Works

  1. Request data access via file path, pasted sample, or dataset description with context about intended use
  2. Load and profile the dataset using pandas to understand structure, data types, and basic statistics
  3. Run comprehensive validation checks detecting missing values, duplicates, outliers, format errors, and range violations
  4. Classify all findings by severity level (Critical/High/Medium/Low) and calculate overall quality score from 0-100
  5. Generate text-based distribution summaries with quartile analysis and completeness bars for each column
  6. Compile comprehensive quality report with executive summary, detailed issues by severity, and column statistics
  7. Provide prioritized action recommendations ordered by urgency and offer re-validation after corrections
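The profiling, validation, and completeness-bar steps above can be sketched with pandas roughly as follows. This is an illustration under assumptions, not the skill's actual implementation: the function names `profile_and_check` and `completeness_bar`, the sample data, and the choice of the 1.5 × IQR rule for outliers are all hypothetical (a z-score test would be another common option).

```python
import io

import pandas as pd


def profile_and_check(df, id_columns=()):
    """Collect basic findings: missing values, duplicates, and IQR outliers."""
    findings = {
        "rows": len(df),
        "missing_by_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": {},
        "outliers_by_column": {},
    }
    # Columns declared as identifiers should not repeat.
    for col in id_columns:
        findings["duplicate_ids"][col] = int(df[col].dropna().duplicated().sum())
    # Flag numeric values outside 1.5 * IQR of the quartiles (an assumed rule).
    for col in df.select_dtypes(include="number").columns:
        if col in id_columns:
            continue
        values = df[col].dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
        findings["outliers_by_column"][col] = values.index[mask].tolist()
    return findings


def completeness_bar(series, width=20):
    """Render the share of non-missing values as a text progress bar."""
    pct = float(series.notna().mean()) * 100
    filled = int(round(pct / 100 * width))
    return f"[{'#' * filled}{'.' * (width - filled)}] {pct:.0f}%"


# Made-up sample data standing in for a loaded CSV file.
SAMPLE_CSV = """order_id,revenue,region
1,100,North
2,110,South
2,110,South
3,,East
4,105,West
5,5000,North
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
report = profile_and_check(df, id_columns=["order_id"])
revenue_bar = completeness_bar(df["revenue"])
print(report)
print(revenue_bar)
```

For a real file, `pd.read_csv(path)` or `pd.read_excel(path)` would replace the in-memory sample. The outlier lists hold row labels, so each finding can be traced back to specific rows in the report.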

What You'll Need

  • CSV or Excel file up to 500MB and 1 million rows
  • Python environment with pandas, numpy, and scipy libraries
  • Context about intended use of the data (analysis, migration, reporting, audit)
  • Optional: Custom business rules for domain-specific validation
  • Optional: Columns that should be unique identifiers (e.g., order_id, patient_mrn)
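Custom business rules such as positive revenue or dates within a range could be expressed as named predicates over columns. The sketch below is one possible shape, not the skill's actual rule format: `check_business_rules`, the rule names, and the sample `orders` data are hypothetical. Missing values are skipped here on the assumption that the missing-value check reports them separately.

```python
import pandas as pd


def check_business_rules(df, rules):
    """Return {rule_name: [row labels violating the rule]}.

    `rules` maps a rule name to (column, predicate), where the predicate
    takes a Series and returns a boolean Series.
    """
    violations = {}
    for name, (column, predicate) in rules.items():
        present = df[column].notna()   # skip missing values
        ok = predicate(df[column])
        violations[name] = df.index[present & ~ok].tolist()
    return violations


# Made-up sample data for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "revenue": [120.0, -15.0, None, 80.0],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2023-12-31", "2024-03-01"]
    ),
})

rules = {
    "revenue_positive": ("revenue", lambda s: s > 0),
    "order_date_in_2024": (
        "order_date",
        lambda s: s.between(pd.Timestamp("2024-01-01"), pd.Timestamp("2024-12-31")),
    ),
}

rule_violations = check_business_rules(orders, rules)
print(rule_violations)
```

Keeping each rule as a (column, predicate) pair makes it straightforward to attach a severity level per rule later.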