Exploratory Data Analysis (EDA) Using SQL
Introduction
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process. It helps us understand the dataset’s underlying structure, detect anomalies, test hypotheses, and check assumptions. This article will guide you through performing EDA using SQL, covering essential steps from understanding the data to deriving meaningful insights.
1. Understanding the Data
Before diving into the analysis, it’s important to understand the context of the dataset. Working on companies lay-offs data. Our objective is to gain insights into lay-off patterns, identify companies with the highest layoffs, and detect any anomalies.
Dataset Used: Layoffs
Layoffs Dataset
2. Data Loading and Schema Understanding
First, we need to load the data and examine its schema. This step involves creating duplicate tables so as not to affect the raw data and understanding the structure of the main table we’re interested in.
SELECT * FROM layoffs_staging; CREATE TABLE layoffs_staging LIKE layoffs; INSERT layoffs_staging SELECT * FROM layoffs;
3. Data Cleaning and Preparation
Data cleaning is essential to ensure the accuracy of our analysis. This step includes handling missing values, removing duplicates, and correcting data types.
/*Identifying Duplicate*/ SELECT *, ROW_NUMBER() OVER( PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised) AS row_num FROM layoffs_staging; WITH duplicate_cte AS( SELECT *, ROW_NUMBER() OVER( PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised) AS row_num FROM layoffs_staging) SELECT * FROM duplicate_cte WHERE row_num>1; /*Creating duplicate table*/ CREATE TABLE `layoffs_staging2` ( `company` text, `location` text, `industry` text, `total_laid_off` text, `percentage_laid_off` text, `date` text, `stage` text, `country` text, `funds_raised` text, `row_num` INT) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci; INSERT INTO layoffs_staging2 SELECT *, ROW_NUMBER() OVER( PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised) AS row_num FROM layoffs_staging; SELECT * FROM layoffs_staging2; SELECT * FROM layoffs_staging2 WHERE row_num>1; /*Deleting Duplicates*/ DELETEFROM layoffs_staging2 WHERE row_num>1; /*Standardizing Data*/ SELECT * FROM layoffs_staging2; SELECT company FROM layoffs_staging2; SELECT TRIM(company), company FROM layoffs_staging2; UPDATE layoffs_staging2 SET company = TRIM(company); /*Fault was only found in the company column except excess spacing*/ /*Updating - Changing the data type for date , total_laid_off, percentage_laid_off columns*/ SELECT * FROM layoffs_staging2; ALTER TABLE layoffs_staging2 MODIFY COLUMN `date` DATE; SELECT *FROM layoffs_staging2 WHERE total_laid_off IS NULLOR total_laid_off = ''; UPDATE layoffs_staging2 SET total_laid_off = null WHERE total_laid_off = ''; ALTER TABLE layoffs_staging2 MODIFY COLUMN total_laid_off INT; UPDATE layoffs_staging2 SET percentage_laid_off = NULL WHERE percentage_laid_off = ''; ALTER TABLE layoffs_staging2 MODIFY COLUMN percentage_laid_off INT; /*Null & Blank Values*/ SELECT * FROM layoffs_staging2; SELECT * FROM layoffs_staging2 WHERE total_laid_off IS NULLAND percentage_laid_off IS NULL; /*No Null values in the total_laid_off and percentage_laid_off columns*/ SELECT * FROM layoffs_staging2; SELECT * FROM layoffs_staging2 WHERE industry IS NULL OR industry = ''; SELECT * FROM layoffs_staging2 WHERE company = 'Appsmith'; UPDATE layoffs_staging2 SET industry = NULLWHERE industry = ''; SELECT * FROM layoffs_staging2 WHERE industry IS NULL; ALTER TABLE layoffs_staging2 DROP COLUMN row_num; /*Removing where there are both null values in total laid off and percentage laid off*/ DELETEFROM layoffs_staging2 WHERE total_laid_off IS NULLAND percentage_laid_off IS NULL;
4. Descriptive Statistics
Descriptive statistics provide a summary of the data, including maximum and total number of laid offs both by company, industry, and date as well.
SELECT * FROM layoffs_staging2; /* Maximum number of laid_off and the percentage*/ SELECT MAX(total_laid_off) AS MAX_Laid_off, MAX(percentage_laid_off) AS Percent_laid_off FROM layoffs_staging2;
The above query shows the maximum amount of total laid offs and percentage laid off.
/*Sum of laid_off by Company*/ SELECT company, SUM(total_laid_off) AS Total_laidoffs FROM layoffs_staging2 GROUP BY company ORDER BY Total_laidoffs DESC; /*Amazon had the highest sum of laid offs with a total of 27,840 SHOCKING*/
The above query shows total amount of total laid offs by company and showed that Amazon had the highest amounts of laid offs with a total of 27,840 people.
SELECT MIN(`date`), MAX(`date`) FROM layoffs_staging2;
The above query shows the oldest date and the recent/new date. That is, it shows the maximum date and the minimum date.
SELECT industry, SUM(total_laid_off) AS Total_laidoffs FROM layoffs_staging2 GROUP BY industry ORDER BY Total_laidoffs DESC; /*This shows that the retail industry had the highest amount of laid offs*/
The above query shows that the Retail industry was most affected and had the highest amount of lay offs.
SELECT country, SUM(total_laid_off) AS Total_laidoffs FROM layoffs_staging2 GROUP BY country ORDER BY Total_laidoffs DESC; /*This shows that United States had the highest amount of laid offs*/
This query showed that by Country, the United States had the highest amount of lay offs.
5. Conclusion
EDA is a powerful process that provides a deep understanding of the dataset. Through this SQL-based EDA, we’ve cleaned the data, calculated descriptive statistics, visualized key metrics, and identified potential correlations. These insights form the foundation for further analysis and decision-making.