Customer Database Quality Check
Validates a customer database CSV with common issues: missing email addresses, duplicate customer IDs, invalid email formats, and impossible dates. Demonstrates completeness, validity, and uniqueness dimension scoring with specific row-level issue identification.
Input
What you provide to the skill
Check quality of this customer data CSV:
customer_id,email,signup_date
1,john@test.com,2024-01-15
2,,2024-02-20
1,jane@test.com,2024-01-15
3,invalid-email,2025-13-45
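To make the four issue types concrete, here is a minimal sketch of how the rows above could be checked using only the Python standard library. It is not the skill's implementation, which the page does not show. The names `find_issues` and `is_valid_date` are hypothetical, the email pattern is the one quoted in the report's suggested fix, and row numbers are zero-based data rows (header excluded), matching the report's numbering.

```python
import csv
import io
import re
from collections import Counter
from datetime import datetime

CSV_TEXT = """customer_id,email,signup_date
1,john@test.com,2024-01-15
2,,2024-02-20
1,jane@test.com,2024-01-15
3,invalid-email,2025-13-45
"""

# Same pattern the report quotes in its email fix suggestion.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def is_valid_date(text):
    """Return True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def find_issues(csv_text):
    """Map each issue type to the zero-based data rows that exhibit it."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    id_counts = Counter(row["customer_id"] for row in rows)
    issues = {
        "missing_email": [],
        "invalid_email": [],
        "invalid_date": [],
        "duplicate_id": [],
    }
    for i, row in enumerate(rows):
        if not row["email"]:
            issues["missing_email"].append(i)
        elif not EMAIL_RE.match(row["email"]):
            # Only non-empty emails are tested against the pattern.
            issues["invalid_email"].append(i)
        if not is_valid_date(row["signup_date"]):
            issues["invalid_date"].append(i)
        if id_counts[row["customer_id"]] > 1:
            # Every row sharing a repeated ID is listed, not just the repeats.
            issues["duplicate_id"].append(i)
    return issues


print(find_issues(CSV_TEXT))
```

Listing every row that shares a repeated ID, rather than only the later copies, leaves the decision about which record to keep to a reviewer.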
Output
What the skill generates for you
DATA QUALITY HEALTH CHECK REPORT
================================
Dataset: customer_data.csv
Records: 4
Columns: 3
Analysis Date: 2025-12-03
OVERALL QUALITY SCORE: 82/100 - GOOD
DIMENSION SCORES:
Completeness: 91/100 (Excellent)
Validity: 71/100 (Moderate)
Consistency: 100/100 (Excellent)
Uniqueness: 75/100 (Moderate)
Accuracy: 71/100 (Moderate)
TOP CRITICAL ISSUES (by impact)
----------------------------------
1. CRITICAL - Invalid Email Format
Impact: HIGH | Affected: 1 row (33.3% of the 3 non-missing emails)
Issue: Email addresses don't match valid format
Location: Column 'email', rows [3]
Example: Row 3 has "invalid-email"
Fix: Apply email validation: re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', email)
2. CRITICAL - Missing Email Values
Impact: HIGH | Affected: 1 row (25.0%)
Issue: Column 'email' has NULL values
Location: Column 'email', 1 missing values
Fix: Required field - investigate data source and collect missing values
3. CRITICAL - Invalid Date Format
Impact: HIGH | Affected: 1 row (25.0%)
Issue: signup_date contains invalid date values
Location: Column 'signup_date', rows [3]
Example: Row 3 has "2025-13-45" (invalid date)
Fix: pd.to_datetime(df['signup_date'], errors='coerce') - will convert invalid dates to NaT
4. CRITICAL - Duplicate Customer IDs
Impact: HIGH | Affected: 1 row (25.0%)
Issue: customer_id column contains duplicate values
Location: Rows [0, 2]
Example: customer_id 1 appears 2 times
Fix: df.drop_duplicates(subset=['customer_id'], keep='first') OR investigate data entry errors
ACTIONABLE RECOMMENDATIONS
--------------------------
IMMEDIATE ACTIONS (High Priority):
1. Deduplicate Customer IDs
Command: df.drop_duplicates(subset=['customer_id'], keep='first')
Impact: Improves Uniqueness from 75 -> 100
2. Fix Invalid Email Formats
Command: Validate and correct email formats to match pattern
Impact: Critical for data quality - 1 record needs attention
3. Fix Invalid Dates
Command: pd.to_datetime(df['signup_date'], errors='coerce')
Impact: Improves Validity - 1 record needs correction
4. Collect Missing Emails
Command: Required field - contact customers or source system
Impact: Improves Completeness - 1 record needs data
ESTIMATED IMPROVEMENT IMPACT
----------------------------
Current Score: 82/100
Projected Score: 99/100 (after implementing fixes)
Quality Gate Status: PASS (with fixes)
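The commands recommended in the report can be chained into one small pandas pipeline. The sketch below is one possible arrangement, not the skill's own code: `apply_fixes` is a hypothetical helper, and splitting the result into a clean frame and a needs-review frame is an assumption about how the remaining rows might be handled.

```python
import io
import re

import pandas as pd

CSV_TEXT = """customer_id,email,signup_date
1,john@test.com,2024-01-15
2,,2024-02-20
1,jane@test.com,2024-01-15
3,invalid-email,2025-13-45
"""

# Same pattern the report quotes in its email fix suggestion.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def apply_fixes(df):
    """Apply the report's suggested commands; return (clean, needs_review)."""
    # Keep the first row for each customer_id, as the report suggests.
    out = df.drop_duplicates(subset=["customer_id"], keep="first").copy()
    # Unparseable dates become NaT instead of raising an error.
    out["signup_date"] = pd.to_datetime(
        out["signup_date"], format="%Y-%m-%d", errors="coerce"
    )
    # Missing emails arrive as NaN (not str), so they count as not valid.
    email_ok = out["email"].map(
        lambda v: isinstance(v, str) and EMAIL_RE.match(v) is not None
    ).astype(bool)
    date_ok = out["signup_date"].notna()
    return out[email_ok & date_ok], out[~(email_ok & date_ok)]


# Read every column as text so the raw values are preserved for checking.
df = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str)
clean, needs_review = apply_fixes(df)
print(clean)
print(needs_review)
```

Note that `keep='first'` discards the second row for customer_id 1 even though it carries a different email address, which is why the report offers investigating data entry errors as the alternative to dropping duplicates.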
About This Skill
Automated data quality assessment across 5 dimensions with actionable fix recommendations
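The page does not document how the dimension scores are calculated. As a rough illustration, the sketch below assumes completeness is the share of non-empty cells, uniqueness is the share of distinct customer IDs, and the overall score is the unweighted mean of the five dimension scores. These formulas are assumptions; they happen to agree with the sample report (11 of 12 cells filled is about 91.7, 3 distinct IDs over 4 rows is 75, and the five reported scores average 81.6), but the skill may compute them differently.

```python
def completeness(rows):
    """Percentage of cells that are non-empty (assumed formula)."""
    cells = [value for row in rows for value in row]
    filled = sum(1 for value in cells if value != "")
    return 100 * filled / len(cells)


def uniqueness(keys):
    """Percentage of rows carrying a distinct key (assumed formula)."""
    return 100 * len(set(keys)) / len(keys)


# The four data rows from the example input.
rows = [
    ("1", "john@test.com", "2024-01-15"),
    ("2", "", "2024-02-20"),
    ("1", "jane@test.com", "2024-01-15"),
    ("3", "invalid-email", "2025-13-45"),
]

print(completeness(rows))                    # 11 of 12 cells are filled
print(uniqueness([row[0] for row in rows]))  # 3 distinct IDs over 4 rows

# Dimension scores exactly as printed in the sample report.
reported = {
    "Completeness": 91,
    "Validity": 71,
    "Consistency": 100,
    "Uniqueness": 75,
    "Accuracy": 71,
}
overall = sum(reported.values()) / len(reported)
print(round(overall))
```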
More Examples
Employee Directory Referential Integrity Check
Validates an employee directory for referential integrity issues including circular manager references (employee manages themselves), orphaned manager IDs pointing to non-existent employees, and duplicate records. Critical for HR system migrations.
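The three checks named above can be sketched in a few lines. The sample rows and the `(employee_id, manager_id)` layout are hypothetical, invented only to exercise each case; they are not taken from the linked example.

```python
from collections import Counter

# Hypothetical rows: (employee_id, manager_id); None means no manager.
EMPLOYEES = [
    (1, None),
    (2, 1),
    (3, 3),   # manages themselves
    (4, 99),  # manager 99 does not exist
    (2, 1),   # exact duplicate of an earlier record
]


def check_referential_integrity(employees):
    """Return employee IDs grouped by the integrity rule they violate."""
    known_ids = {emp_id for emp_id, _ in employees}
    record_counts = Counter(employees)
    return {
        "self_managed": sorted(
            {emp for emp, mgr in employees if mgr == emp}
        ),
        "orphaned_manager": sorted(
            {emp for emp, mgr in employees
             if mgr is not None and mgr not in known_ids}
        ),
        "duplicate_record": sorted(
            {rec[0] for rec, n in record_counts.items() if n > 1}
        ),
    }


print(check_referential_integrity(EMPLOYEES))
```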
Sales Transaction Calculation Validation
Analyzes a sales orders dataset to detect calculation mismatches where total doesn't equal quantity times price. Demonstrates the consistency dimension by catching arithmetic errors and missing values in transactional data.
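A calculation check of this kind might look like the sketch below. The order rows, the field names, and the half-cent tolerance are all hypothetical choices made for illustration, not details of the linked example.

```python
import math

# Hypothetical orders; None marks a missing value.
ORDERS = [
    {"order_id": "A1", "quantity": 2, "price": 9.99, "total": 19.98},
    {"order_id": "A2", "quantity": 3, "price": 5.00, "total": 14.00},
    {"order_id": "A3", "quantity": None, "price": 4.50, "total": 9.00},
]


def check_totals(orders, tolerance=0.005):
    """Return (mismatched, incomplete) lists of order IDs."""
    mismatched, incomplete = [], []
    for order in orders:
        fields = (order["quantity"], order["price"], order["total"])
        if any(value is None for value in fields):
            # The total cannot be verified when any input is missing.
            incomplete.append(order["order_id"])
            continue
        quantity, price, total = fields
        # Compare with a tolerance to absorb floating-point rounding.
        if not math.isclose(quantity * price, total, abs_tol=tolerance):
            mismatched.append(order["order_id"])
    return mismatched, incomplete


print(check_totals(ORDERS))
```

Comparing with a small absolute tolerance avoids flagging totals that differ only by floating-point rounding; for currency stored as exact decimals, `decimal.Decimal` with strict equality would be a reasonable alternative.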