All examples for Data Quality Checker

Customer Database Quality Check

Validates a customer database CSV with common issues: missing email addresses, duplicate customer IDs, invalid email formats, and impossible dates. Demonstrates completeness, validity, and uniqueness dimension scoring with specific row-level issue identification.

Input

What you provide to the skill

Check quality of this customer data CSV:
customer_id,email,signup_date
1,john@test.com,2024-01-15
2,,2024-02-20
1,jane@test.com,2024-01-15
3,invalid-email,2025-13-45
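
The same sample can be checked by hand with pandas to see which rows each issue lands on. This is a minimal sketch, not the skill's implementation: the names `RAW_CSV`, `is_valid_email`, and `issues` are illustrative, and the email pattern is the one the report suggests. Row numbers are 0-based positions in the data, excluding the header.

```python
import io
import re

import pandas as pd

# The CSV from the prompt, inlined so the sketch is self-contained.
RAW_CSV = """customer_id,email,signup_date
1,john@test.com,2024-01-15
2,,2024-02-20
1,jane@test.com,2024-01-15
3,invalid-email,2025-13-45
"""

# Read every column as text so nothing is coerced before it is checked;
# the empty email field is still parsed as a missing value (NaN).
df = pd.read_csv(io.StringIO(RAW_CSV), dtype=str)

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def is_valid_email(value) -> bool:
    """True only for strings that match the email pattern (NaN is not valid)."""
    return isinstance(value, str) and EMAIL_RE.match(value) is not None


# Dates that fail strict YYYY-MM-DD parsing become NaT.
parsed_dates = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")

masks = {
    "missing_email": df["email"].isna(),
    "invalid_email": df["email"].notna() & ~df["email"].map(is_valid_email).astype(bool),
    "invalid_date": df["signup_date"].notna() & parsed_dates.isna(),
    # keep=False flags every row that shares a customer_id, not just the repeats.
    "duplicate_id": df["customer_id"].duplicated(keep=False),
}

# Convert each boolean mask into a list of row positions.
issues = {name: df.index[mask].tolist() for name, mask in masks.items()}

for name, rows in issues.items():
    print(f"{name}: rows {rows}")
# missing_email: rows [1]
# invalid_email: rows [3]
# invalid_date: rows [3]
# duplicate_id: rows [0, 2]
```

Missing and invalid emails are kept as separate checks so that the empty field in row 1 counts against completeness only, not against validity.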

Output

What the skill generates for you

DATA QUALITY HEALTH CHECK REPORT
================================
Dataset: customer_data.csv
Records: 4
Columns: 3
Analysis Date: 2025-12-03

OVERALL QUALITY SCORE: 82/100 - GOOD

DIMENSION SCORES:
Completeness: 91/100 (Excellent)
Validity: 71/100 (Moderate)
Consistency: 100/100 (Excellent)
Uniqueness: 75/100 (Moderate)
Accuracy: 71/100 (Moderate)

TOP CRITICAL ISSUES (by impact)
----------------------------------
1. CRITICAL - Invalid Email Format
Impact: HIGH | Affected: 1 row (25.0% of rows; 33.3% of non-null emails)
Issue: Email addresses don't match valid format
Location: Column 'email', rows [3]
Example: Row 3 has "invalid-email"
Fix: Apply email validation: re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', email)
2. CRITICAL - Missing Email Values
Impact: HIGH | Affected: 1 row (25.0%)
Issue: Column 'email' has NULL values
Location: Column 'email', 1 missing values
Fix: Required field - investigate data source and collect missing values
3. CRITICAL - Invalid Date Format
Impact: HIGH | Affected: 1 row (25.0%)
Issue: signup_date contains invalid date values
Location: Column 'signup_date', rows [3]
Example: Row 3 has "2025-13-45" (invalid date)
Fix: pd.to_datetime(df['signup_date'], errors='coerce') - will convert invalid dates to NaT
4. CRITICAL - Duplicate Customer IDs
Impact: HIGH | Affected: 1 row (25.0%)
Issue: customer_id column contains duplicate values
Location: Rows [0, 2]
Example: customer_id 1 appears 2 times
Fix: df.drop_duplicates(subset=['customer_id'], keep='first') OR investigate data entry errors

ACTIONABLE RECOMMENDATIONS
--------------------------
IMMEDIATE ACTIONS (High Priority):
1. Deduplicate Customer IDs
Command: df.drop_duplicates(subset=['customer_id'], keep='first')
Impact: Improves Uniqueness from 75 -> 100
2. Fix Invalid Email Formats
Command: Validate and correct email formats to match pattern
Impact: Critical for data quality - 1 record needs attention
3. Fix Invalid Dates
Command: pd.to_datetime(df['signup_date'], errors='coerce')
Impact: Improves Validity - 1 record needs correction
4. Collect Missing Emails
Command: Required field - contact customers or source system
Impact: Improves Completeness - 1 record needs data

ESTIMATED IMPROVEMENT IMPACT
----------------------------
Current Score: 82/100
Projected Score: 99/100 (after implementing fixes)
Quality Gate Status: PASS (with fixes)
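
The example does not state how the skill computes its scores. As a rough illustration, the sketch below reproduces three of the reported dimension scores under assumed formulas: completeness as the share of non-null cells, uniqueness as distinct customer_id values over rows, and validity as valid values over non-null values in the email and signup_date columns. The Consistency and Accuracy values are copied from the report rather than computed, and the overall score is assumed to be an unweighted mean. All variable names are illustrative.

```python
import io
import re

import pandas as pd

RAW_CSV = """customer_id,email,signup_date
1,john@test.com,2024-01-15
2,,2024-02-20
1,jane@test.com,2024-01-15
3,invalid-email,2025-13-45
"""
df = pd.read_csv(io.StringIO(RAW_CSV), dtype=str)

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# Assumed formula: non-null cells / all cells (11 of 12 here).
completeness = 100 * df.notna().to_numpy().mean()

# Assumed formula: distinct customer_id values / rows (3 of 4 here).
uniqueness = 100 * df["customer_id"].nunique() / len(df)

# Assumed formula: valid values / non-null values over the two checked columns.
emails = df["email"].dropna()
dates = df["signup_date"].dropna()
valid_emails = int(emails.map(lambda v: EMAIL_RE.match(v) is not None).sum())
valid_dates = int(
    pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce").notna().sum()
)
validity = 100 * (valid_emails + valid_dates) / (len(emails) + len(dates))

# Copied from the report, not derived: this sketch has no rule for them.
consistency = 100.0
accuracy = 71.0

# Assumed aggregation: unweighted mean of the five dimensions.
overall = (completeness + validity + consistency + uniqueness + accuracy) / 5

print(f"Completeness: {completeness:.1f}")  # 91.7
print(f"Validity:     {validity:.1f}")      # 71.4
print(f"Uniqueness:   {uniqueness:.1f}")    # 75.0
print(f"Overall:      {overall:.1f}")       # 81.8
```

Under these assumptions the overall value rounds to the reported 82, and truncating completeness gives the reported 91. Whether the skill truncates, rounds, or weights the dimensions differently is not stated in this example.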