
Layoff Data Cleaning Project: Preparing Workforce Data for Analysis

  • Writer: JP Giraldo
  • Dec 9, 2024
  • 3 min read

Updated: Jan 12, 2025

This SQL data cleaning project focuses on preparing layoff data for accurate and actionable analysis. Using techniques such as duplicate removal, value standardization, null handling, and removal of unnecessary columns, the project ensures data quality and readiness for exploratory analysis. The structured approach demonstrates proficiency in SQL for cleaning raw datasets, a crucial step in delivering reliable insights for workforce management and decision-making.


Link to Code:


BACKGROUND:

Received global layoff data from a client. Before conducting any analysis, the data required cleaning to ensure accuracy and reliability.


Process:

Utilized MySQL to clean and standardize the dataset, focusing on removing duplicates, standardizing values, addressing null entries, and eliminating unnecessary columns using the process below.


-- Step 1: Identify Duplicates

ROW_NUMBER() with PARTITION BY is used here to assign a unique row number to each record within a group defined by the specified columns.

This helps to identify duplicate records, as duplicates within the same group will have row numbers greater than 1.

The use of Common Table Expressions (CTEs) here enables a modular and reusable way to handle duplicate detection. By creating `duplicate_cte`, the process of isolating duplicates becomes more structured and readable, allowing for further operations like filtering or deletion to be applied efficiently.

WITH duplicate_cte AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
company, location, industry, total_laid_off, `date`, stage, country, funds_raised_millions) AS row_num
FROM layoffs_staging)
SELECT *
FROM duplicate_cte
WHERE row_num > 1;
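As a quick sanity check outside MySQL, the same ROW_NUMBER() pattern can be sketched with Python's standard-library sqlite3 module (SQLite also supports window functions). The table and sample rows below are simplified, hypothetical stand-ins for the real dataset:

```python
import sqlite3

# In-memory database with a simplified stand-in for the layoffs table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE layoffs_staging (company TEXT, location TEXT, total_laid_off INT)"
)
conn.executemany(
    "INSERT INTO layoffs_staging VALUES (?, ?, ?)",
    [("Acme", "SF", 100), ("Acme", "SF", 100), ("Beta", "NY", 50)],
)

# ROW_NUMBER() restarts at 1 within each group of identical column values,
# so any row numbered > 1 is a duplicate of an earlier row in its group.
dupes = conn.execute("""
    WITH duplicate_cte AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY company, location, total_laid_off
               ) AS row_num
        FROM layoffs_staging
    )
    SELECT company, location, total_laid_off
    FROM duplicate_cte
    WHERE row_num > 1
""").fetchall()
print(dupes)  # only the repeated Acme row survives the filter
```

Here only the second ("Acme", "SF", 100) row gets row_num = 2, so it is the lone duplicate flagged.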

-- Step 2: Create a staging table and delete duplicates

The staging table `layoffs_staging_2` is created to hold the cleaned data. The temporary `row_num` column makes duplicates easy to identify and delete; it is dropped later to leave a clean dataset.

CREATE TABLE `layoffs_staging_2` (
  `company` text,
  `location` text,
  `industry` text,
  `total_laid_off` int DEFAULT NULL,
  `percentage_laid_off` text,
  `date` text,
  `stage` text,
  `country` text,
  `funds_raised_millions` int DEFAULT NULL,
  `row_num` int
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
INSERT INTO layoffs_staging_2
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
company, location, industry, total_laid_off, `date`, stage, country, funds_raised_millions) AS row_num
FROM layoffs_staging;
SELECT * 
FROM layoffs_staging_2
WHERE row_num > 1;
DELETE 
FROM layoffs_staging_2
WHERE row_num > 1;

-- Step 3: Standardize Data

Data standardization improves consistency and ensures accurate analysis. For example, trimming removes unwanted spaces, and unifying terms like 'Crypto%' ensures categories are comparable.

Standardizing company names

SELECT company, TRIM(company)
FROM layoffs_staging_2;
UPDATE layoffs_staging_2
SET company = TRIM(company);

Standardizing industry

SELECT DISTINCT(industry)
FROM layoffs_staging_2
ORDER BY 1;
SELECT industry
FROM layoffs_staging_2
WHERE industry LIKE 'Crypto%';
UPDATE layoffs_staging_2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

Standardizing countries

SELECT DISTINCT(country)
FROM layoffs_staging_2
ORDER BY 1;
SELECT country, TRIM(TRAILING '.' FROM country)
FROM layoffs_staging_2
WHERE country LIKE 'United States%';
UPDATE layoffs_staging_2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';
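The effect of TRIM(TRAILING '.' FROM country) can be mimicked in plain Python for a quick check; the country strings below are made-up samples:

```python
# TRIM(TRAILING '.' FROM country) strips trailing periods only;
# str.rstrip('.') does the same for these hypothetical values.
countries = ["United States", "United States.", "Canada"]
cleaned = [c.rstrip(".") for c in countries]

# After cleanup, the two United States variants collapse into one category.
print(sorted(set(cleaned)))
```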

Standardizing dates

Converting the date field to a proper DATE format ensures consistency across the dataset, which is critical for time-series analysis. This enables more accurate filtering, grouping, and trend analysis based on time-related metrics.

SELECT `date`,
STR_TO_DATE(`date`, '%m/%d/%Y')
FROM layoffs_staging_2;
UPDATE layoffs_staging_2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');
ALTER TABLE layoffs_staging_2
MODIFY COLUMN `date` DATE;
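The conversion STR_TO_DATE performs can be sketched with Python's datetime module, which uses the same %m/%d/%Y format codes; the date strings below are hypothetical samples:

```python
from datetime import datetime

# STR_TO_DATE(`date`, '%m/%d/%Y') parses strings like '12/09/2024' into
# real dates; strptime applies the identical format codes.
raw_dates = ["12/09/2024", "1/5/2023"]  # made-up sample values
parsed = [datetime.strptime(d, "%m/%d/%Y").date() for d in raw_dates]
print(parsed)
```

Once parsed, the values sort and group chronologically rather than as text, which is what makes the DATE column type worthwhile for trend analysis.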

-- Step 4: Handle Null Values

Handling null values ensures dataset completeness and consistency. Logical updates fill gaps, while deletions maintain data integrity when information is irretrievable.

Industry

SELECT * 
FROM layoffs_staging_2
WHERE industry = '';
UPDATE layoffs_staging_2
SET industry = NULL
WHERE industry = '';

SELECT st1.company, st1.industry, st2.company, st2.industry
FROM layoffs_staging_2 st1
JOIN layoffs_staging_2 st2
	ON st1.company = st2.company
WHERE st1.industry IS NULL
AND st2.industry IS NOT NULL;
UPDATE layoffs_staging_2 st1
JOIN layoffs_staging_2 st2
	ON st1.company = st2.company
SET st1.industry = st2.industry
WHERE st1.industry IS NULL
AND st2.industry IS NOT NULL;
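The same self-join backfill can be sketched with sqlite3; since SQLite's UPDATE lacks MySQL's JOIN syntax, a correlated subquery does the equivalent work. The company and industry values are simplified, hypothetical stand-ins:

```python
import sqlite3

# Simplified stand-in: one company row is missing its industry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE layoffs_staging_2 (company TEXT, industry TEXT)")
conn.executemany(
    "INSERT INTO layoffs_staging_2 VALUES (?, ?)",
    [("Acme", "Travel"), ("Acme", None), ("Beta", "Retail")],
)

# Copy the industry from another row of the same company into NULL rows,
# mirroring the MySQL UPDATE ... JOIN above.
conn.execute("""
    UPDATE layoffs_staging_2
    SET industry = (
        SELECT st2.industry
        FROM layoffs_staging_2 st2
        WHERE st2.company = layoffs_staging_2.company
          AND st2.industry IS NOT NULL
    )
    WHERE industry IS NULL
""")
rows = conn.execute(
    "SELECT company, industry FROM layoffs_staging_2 ORDER BY company"
).fetchall()
print(rows)  # the NULL Acme row now reads 'Travel'
```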

Total laid off and percentage laid off

SELECT * 
FROM layoffs_staging_2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;

Rows missing both total_laid_off and percentage_laid_off carry no layoff information, so we delete them.

DELETE 
FROM layoffs_staging_2
WHERE total_laid_off IS NULL
AND percentage_laid_off IS NULL;

-- Step 5: Assess unnecessary columns

Removing unnecessary columns like `row_num` ensures the dataset remains focused on relevant information for further analysis.

ALTER TABLE layoffs_staging_2
DROP COLUMN row_num;

I’d love to hear your thoughts on this project! Feel free to connect on LinkedIn or explore more of my work on Dream to Compete.
