ETL and Data Visualization

Python, SQL, and Power BI

Your Name

Introduction:

GitHub Repository

ETL-and-Data-Visualization-Python-SQL-and-Powerbi

Dataset Source

Kaggle Superstore-Sales

What is ETL?

ETL stands for Extract, Transform, and Load. It is a crucial process in data warehousing that involves three distinct steps (a short code sketch follows the list):

1. Extract: In this step, data is gathered from one or more sources. For my project, I have chosen to use a public dataset from Kaggle, provided in the form of a CSV file.

2. Transform: The extracted data then undergoes various manipulations. These manipulations can include cleaning, filtering, validating, and aggregating data.

3. Load: The final step involves loading the transformed data into a target system, which is often a data warehouse, database, or data lake.
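To make the three stages concrete, here is a minimal sketch in Python with pandas. It is illustrative only: the file name matches this project's dataset, but the column name, date format, and the use of SQLite as the target are assumptions made to keep the example self-contained (the actual project loads into MySQL).

```python
import sqlite3

import pandas as pd

# Extract: read the raw dataset into a DataFrame.
raw = pd.read_csv("SalesReport.csv")

# Transform: clean and reshape. The column name and day-first date
# format are assumptions about the source file.
clean = raw.drop_duplicates()
clean["Order Date"] = pd.to_datetime(clean["Order Date"], dayfirst=True)

# Load: write the result into a target database (SQLite here for
# brevity; the project itself loads into MySQL).
conn = sqlite3.connect("superstore.db")
clean.to_sql("orders", conn, if_exists="replace", index=False)
conn.close()
```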

Project Structure:

This project is organized into three main folders, each serving a specific purpose:

1. Extract & Transform: Contains a Python script that extracts the data from a single large CSV file and transforms it into several smaller CSV files, each representing a table of the SQL database to be created.

2. Database: Focuses on SQL database creation and loading the transformed data into the database.

3. Dashboard: Includes a copy of the dashboard created to analyze the cleaned and organized data.

Getting Started

To start, download the whole project, create a new folder for it, and move everything there. Before going further, delete every CSV file in the Database folder (they will be recreated by the extract and transform script in the first folder). Then open the project folder in VS Code or your preferred code editor and follow the instructions in each folder's README file.
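If you'd rather clear those CSV files with a script than by hand, a small sketch like the following works; the folder path is an assumption, so adjust it to your local layout.

```python
from pathlib import Path

# Path to the project's Database folder; adjust to where you placed it.
database_folder = Path("Database")

# Delete every CSV so the extract-and-transform script can recreate them.
for csv_file in database_folder.glob("*.csv"):
    csv_file.unlink()
    print(f"Deleted {csv_file}")
```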

Extract & Transform Script Overview

*Note: You can skip this step and instead use the database dump provided in folder 2 to create the database and insert all the data.

In this folder, you can find the SalesReport.csv, which is the Kaggle public dataset obtained from the link provided in the Project Overview README file. Additionally, you'll find the Extract and Transform Python Script, which is responsible for transforming the data from the original CSV file and preparing it for loading into the SQL database.

Main .csv downloaded from Kaggle:

[Image: Main CSV preview]

Python Script Preview:

[Image: Python script preview]

Python Script Overview

The script imports the following libraries (a short sketch of how they fit together follows the list):

- Pandas: Used for reading, cleaning, and writing CSV files.

- Datetime: Used to reformat date and time values from the main CSV into the date format that MySQL accepts.

- Random: Used to assign prices to items in the dataset. The script lets you set a seed for the randomness, so changing the seed produces different prices that are then reflected in the final dashboard.
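As a rough illustration of how these two libraries are used (the input date format and the price range here are assumptions, not values taken from the script):

```python
import random
from datetime import datetime

# Seed the generator so prices are reproducible; change the seed to
# regenerate different prices, which then show up in the dashboard.
random.seed(42)

# Reformat a date string from the CSV into MySQL's YYYY-MM-DD format.
# The input format ("%d/%m/%Y") is an assumption about the source file.
order_date = datetime.strptime("15/04/2017", "%d/%m/%Y").strftime("%Y-%m-%d")

# Assign a random price; the range is purely illustrative.
price = round(random.uniform(5.0, 500.0), 2)

print(order_date, price)
```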

CSVs Created by the Python Script

The script creates five CSV files from the original dataset, each following normalization principles. Each of these files will later be inserted into the SQL database created with the script in folder 2. Here's an explanation of each table (a sketch of the splitting pattern follows the list):

1. Ship Modes:

- ID: Primary key identifying each ship mode.

- Ship Mode: Describes how the order was shipped, including different modes like same-day delivery or standard delivery.

2. Customers:

- Customer ID: Primary key for identifying each customer.

- Customer Name, Segment: Details customer information, such as name and segment (e.g., Consumer, Home Office, Corporate).

3. Products:

- Product ID: Primary key for identifying each product.

- Product Name, Category, Sub-Category, Price: Includes details about each product, including its category, sub-category, and price.

4. Orders:

- Order ID: Primary key for identifying each order.

- Order Date, Ship Date, Address, State: Contains order-related information, such as order and ship dates, address, and state.

- Ship Mode ID: Foreign key linking to the Ship Modes table.

- Customer ID: Foreign key linking to the Customers table.

5. Order Details:

- Order ID, Product ID: Foreign keys connecting to the Orders and Products tables. This table has no primary key of its own because it is a junction table resolving the many-to-many relationship between orders and products.

- Quantity Ordered, Total Cost: Contains order details, such as quantity ordered and total cost.
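The pattern behind these files is the usual pandas normalization split: select the relevant columns, drop duplicates, and write each table out. The sketch below shows it for two of the five tables; the column names are taken from the descriptions above, and the output file names are assumptions.

```python
import pandas as pd

sales = pd.read_csv("SalesReport.csv")

# Ship Modes: one row per distinct mode, with a generated surrogate key.
ship_modes = sales[["Ship Mode"]].drop_duplicates().reset_index(drop=True)
ship_modes.insert(0, "ID", ship_modes.index + 1)
ship_modes.to_csv("ship_modes.csv", index=False)

# Customers: the dataset already carries a natural key, Customer ID.
customers = sales[["Customer ID", "Customer Name", "Segment"]].drop_duplicates()
customers.to_csv("customers.csv", index=False)
```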

SQL Database Setup and Table Creation

*Note: If you skipped the first step, download the database dump, run each SQL script in order in MySQL, and jump to step 3, the dashboard setup.

MySQL User Creation Script

The SQL script provided creates a new MySQL user with the following credentials:

- Username: ADMIN

- Password: Password

This user will only be able to interact with databases whose names begin with the ADMIN_ prefix, ensuring isolation from other projects; the sketch below shows equivalent statements.
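The provided SQL script is the authoritative version; this sketch simply runs equivalent commands through the MySQL Connector for consistency with the rest of the project. The root credentials are placeholders.

```python
import mysql.connector

# Connect as an account allowed to create users; credentials are placeholders.
conn = mysql.connector.connect(
    host="localhost", user="root", password="your_root_password"
)
cursor = conn.cursor()

# Create the project user and scope its privileges to databases whose
# names begin with ADMIN_ (the backslash escapes the SQL "_" wildcard).
cursor.execute("CREATE USER IF NOT EXISTS 'ADMIN'@'localhost' IDENTIFIED BY 'Password'")
cursor.execute("GRANT ALL PRIVILEGES ON `ADMIN\\_%`.* TO 'ADMIN'@'localhost'")
cursor.execute("FLUSH PRIVILEGES")

cursor.close()
conn.close()
```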

[Image: script preview]

Extracted Data

After running the Extract and Transform script in the previous folder, you should find five CSV files in this folder. Each CSV file contains cleaned, filtered, and organized data from the original dataset, ready to be used for creating database tables.

Table Creation Python Script

Additionally, you'll find a new Python script that uses the MySQL Connector library. It performs the following tasks (sketched in outline after the list):

1. Connects to the newly created MySQL user.

2. Executes SQL statements to create each of the tables in MySQL.

3. Inserts all the data into the created tables.
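In outline, the script does something like the following for each table; the Ship Modes table is shown, and the database name and column details are assumptions based on the descriptions above.

```python
import csv

import mysql.connector

# Connect with the project user created earlier.
conn = mysql.connector.connect(host="localhost", user="ADMIN", password="Password")
cursor = conn.cursor()

# The ADMIN_ prefix keeps the database within this user's privileges;
# the exact name is an assumption.
cursor.execute("CREATE DATABASE IF NOT EXISTS ADMIN_superstore")
cursor.execute("USE ADMIN_superstore")

# Create one of the five tables; the others follow the same pattern.
cursor.execute(
    """
    CREATE TABLE IF NOT EXISTS ship_modes (
        id INT PRIMARY KEY,
        ship_mode VARCHAR(50) NOT NULL
    )
    """
)

# Bulk-insert the rows produced by the extract-and-transform script.
with open("ship_modes.csv", newline="") as f:
    rows = [(int(r["ID"]), r["Ship Mode"]) for r in csv.DictReader(f)]
cursor.executemany("INSERT INTO ship_modes (id, ship_mode) VALUES (%s, %s)", rows)

conn.commit()
cursor.close()
conn.close()
```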

Created Database Flowchart:

[Image: database flowchart]

Dashboard: Power BI Analysis

In this folder, you'll find the Power BI dashboard created to analyze and understand the data extracted, transformed, and loaded in the previous steps. The Power BI dashboard provides visual insights into the cleaned and organized data from the Superstore Sales dataset. It offers various visualizations and interactive elements to explore the dataset effectively.

Connecting MySQL Server Database to Power BI

To connect the MySQL server database to Power BI and access the data for visualization, follow these steps:

*Note: Ensure that the MySQL server is running and accessible from your machine before attempting to connect. Also, I chose Import mode instead of DirectQuery because the data in this project will not change.

1. Open Power BI Desktop.

2. Click on "Get Data" from the Home tab.

3. Select "MySQL Database" from the list of available data sources.

4. Enter the server details, including server name, database name, and credentials (username and password).

5. Click "OK" to establish the connection.

6. Choose the tables you want to import into Power BI.

7. Click "Load" to import the selected tables into Power BI for analysis.

Dashboard:

[Image: Power BI dashboard]