Customer Database Quality Check

Validates a customer database CSV that contains four common issues: missing email addresses, duplicate customer IDs, invalid email formats, and impossible dates. Demonstrates scoring on the completeness, validity, and uniqueness dimensions, with each issue traced to specific rows.

Input

What you provide to the skill

Check quality of this customer data CSV:
customer_id,email,signup_date
1,john@test.com,2024-01-15
2,,2024-02-20
1,jane@test.com,2024-01-15
3,invalid-email,2025-13-45
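
For reference, the four issues in these rows can be confirmed independently of the skill. The following is a minimal sketch using only the Python standard library, not the skill's own implementation. The helper names `is_valid_date` and `find_issues` are hypothetical, the email pattern is the one quoted in the report's suggested fix, and row indices are 0-based to match the report's `rows [...]` notation.

```python
import csv
import io
import re
from collections import Counter
from datetime import datetime

# The sample input from this example, embedded so the sketch is self-contained.
CSV_TEXT = """customer_id,email,signup_date
1,john@test.com,2024-01-15
2,,2024-02-20
1,jane@test.com,2024-01-15
3,invalid-email,2025-13-45
"""

# Same pattern the report suggests; a pragmatic check, not full RFC validation.
EMAIL_RE = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')


def is_valid_date(text):
    """Return True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def find_issues(csv_text):
    """Return 0-based row indices for each issue type (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    id_counts = Counter(row["customer_id"] for row in rows)
    issues = {
        "missing_email": [],
        "invalid_email": [],
        "invalid_date": [],
        "duplicate_id": [],
    }
    for index, row in enumerate(rows):
        email = row["email"].strip()
        if not email:
            issues["missing_email"].append(index)
        elif not EMAIL_RE.match(email):
            issues["invalid_email"].append(index)
        if not is_valid_date(row["signup_date"]):
            issues["invalid_date"].append(index)
        # Every row that shares its customer_id with another row is flagged.
        if id_counts[row["customer_id"]] > 1:
            issues["duplicate_id"].append(index)
    return issues


issues = find_issues(CSV_TEXT)
for name, indices in issues.items():
    print(f"{name}: rows {indices}")
```

An empty email is counted as missing rather than invalid, so the two checks never flag the same cell twice.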

Output

What the skill generates for you

DATA QUALITY HEALTH CHECK REPORT
================================

Dataset: customer_data.csv
Records: 4
Columns: 3
Analysis Date: 2025-12-03

OVERALL QUALITY SCORE: 82/100 - GOOD

DIMENSION SCORES:
  Completeness:     91/100  (Excellent)
  Validity:         71/100  (Moderate)
  Consistency:      100/100  (Excellent)
  Uniqueness:       75/100  (Moderate)
  Accuracy:         71/100  (Moderate)

TOP CRITICAL ISSUES (by impact)
----------------------------------

1. CRITICAL - Invalid Email Format
   Impact: HIGH | Affected: 1 row (33.3% of non-empty emails)
   Issue: Email addresses don't match valid format
   Location: Column 'email', rows [3]
   Example: Row 3 has "invalid-email"
   Fix: Apply email validation: re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', email)

2. CRITICAL - Missing Email Values
   Impact: HIGH | Affected: 1 row (25.0%)
   Issue: Column 'email' has NULL values
   Location: Column 'email', 1 missing values
   Fix: Required field - investigate data source and collect missing values

3. CRITICAL - Invalid Date Format
   Impact: HIGH | Affected: 1 row (25.0%)
   Issue: signup_date contains invalid date values
   Location: Column 'signup_date', rows [3]
   Example: Row 3 has "2025-13-45" (invalid date)
   Fix: pd.to_datetime(df['signup_date'], errors='coerce') - will convert invalid dates to NaT

4. CRITICAL - Duplicate Customer IDs
   Impact: HIGH | Affected: 1 row (25.0%)
   Issue: customer_id column contains duplicate values
   Location: Rows [0, 2]
   Example: customer_id 1 appears 2 times
   Fix: df.drop_duplicates(subset=['customer_id'], keep='first') OR investigate data entry errors

ACTIONABLE RECOMMENDATIONS
--------------------------

IMMEDIATE ACTIONS (High Priority):

1. Deduplicate Customer IDs
   Command: df.drop_duplicates(subset=['customer_id'], keep='first')
   Impact: Improves Uniqueness from 75 -> 100

2. Fix Invalid Email Formats
   Command: Validate and correct email formats to match pattern
   Impact: Critical for data quality - 1 record needs attention

3. Fix Invalid Dates
   Command: pd.to_datetime(df['signup_date'], errors='coerce')
   Impact: Improves Validity - 1 record needs correction

4. Collect Missing Emails
   Command: Required field - contact customers or source system
   Impact: Improves Completeness - 1 record needs data

ESTIMATED IMPROVEMENT IMPACT
----------------------------

Current Score:     82/100
Projected Score:   99/100 (after implementing fixes)

Quality Gate Status: PASS (with fixes)
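
The report does not state how its scores are calculated. As a sanity check, the sketch below shows simple arithmetic that happens to reproduce three of the figures: the overall score as an unweighted mean of the five dimensions, completeness as the share of non-empty cells, and uniqueness as the share of distinct customer IDs. These formulas are assumptions that fit the numbers, not the skill's documented method, and the Validity and Accuracy scores are not derived here.

```python
# Dimension scores copied from the report above.
dimension_scores = {
    "Completeness": 91,
    "Validity": 71,
    "Consistency": 100,
    "Uniqueness": 75,
    "Accuracy": 71,
}

# Assumption: overall score is the unweighted mean, rounded to the nearest integer.
overall = round(sum(dimension_scores.values()) / len(dimension_scores))

# Assumption: completeness is the share of non-empty cells,
# truncated to an integer (4 rows x 3 columns, one empty email).
total_cells = 4 * 3
non_empty_cells = total_cells - 1
completeness = int(100 * non_empty_cells / total_cells)

# Assumption: uniqueness is distinct customer_id values over total rows.
customer_ids = ["1", "2", "1", "3"]
uniqueness = int(100 * len(set(customer_ids)) / len(customer_ids))

print(f"Overall: {overall}/100")
print(f"Completeness: {completeness}/100")
print(f"Uniqueness: {uniqueness}/100")
```

Under these assumptions the mean of the five scores is 81.6, which rounds to the reported 82, and the completeness and uniqueness figures match the reported 91 and 75.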
