← back

Database Design Project

Our task was to take data stored in an inefficient format and design a database to store it. I found it very useful to apply my learning to actual data with errors and unexpected anomalies. Moving this kind of inefficiently stored data into a structure database is a very common task for data scientists, so this experience will be very useful for future projects.

I used SQL to show the tables and their relationships, but actually an entity relationship diagram like this would have been a clearer format - I definitely learned from this for future projects.

Here's my code for the Database Design Project. My team took sales data in spreadsheet form and converted it into a SQL database.

This Python code is for cleaning the data.

First the Python code lists unique values in the columns "Ship Mode", "Segment", "Country", "Region", "Category" and "Sub-Category".

import pandas as pd
df = pd.read_csv("superstore.csv")

df["Ship Mode"].unique()
array(['Second Class', 'Standard Class', 'First Class', 'Same Day'], dtype=object)

df["Segment"].unique()
array(['Consumer', 'Corporate', 'Home Office'], dtype=object)

df["Country"].unique()
array(['United States'], dtype=object)

df["Region"].unique()
array(['South', 'West', 'Central', 'East'], dtype=object)

df["Category"].unique()
array(['Furniture', 'Office Supplies', 'Technology'], dtype=object)

df["Sub-Category"].unique()
array(['Bookcases', 'Chairs', 'Labels', 'Tables', 'Storage', 'Furnishings', 'Art', 'Phones', 'Binders', 'Appliances', 'Paper', 'Accessories', 'Envelopes', 'Fasteners', 'Supplies', 'Machines', 'Copiers'], dtype=object)

The column "Product Name" contained some very long strings - this Python code finds the length of the longest string.

len(max(df["Product Name"], key=len))
127

The columns "Profit", "Sales" and "Discount" were decimal values - this Python code finds the maximum number of decimal places in each column.

def decimal_places(d):
  s = str(d)
  return len(s) - s.find(".") - 1

decimal_places(max(df["Profit"], key=decimal_places))
4

decimal_places(max(df["Sales"], key=decimal_places))
4

decimal_places(max(df["Discount"], key=decimal_places))
2

This SQL code creates tables in the new database.

CREATE TABLE items (
  ID int UNSIGNED NOT NULL,
  order_ID varchar(30) NOT NULL,
  product_ID varchar(30) NOT NULL,
  sales decimal(20,4) NOT NULL,
  quantity int UNSIGNED NOT NULL,
  discount decimal(3,2) DEFAULT 0,
  profit decimal(20,4) NOT NULL,
  PRIMARY KEY (ID),
  FOREIGN KEY (order_ID) REFERENCES orders(ID),
  FOREIGN KEY (product_ID) REFERENCES products(ID)
)

CREATE TABLE orders (
  ID varchar(30) NOT NULL,
  order_date date NOT NULL,
  ship_date date DEFAULT NULL,
  ship_mode enum('Same Day','First Class','Second Class','Standard Class') NOT NULL,
  customer_ID varchar(30) NOT NULL,
  PRIMARY KEY (ID),
  FOREIGN KEY (customer_ID) REFERENCES customers(ID)
)

CREATE TABLE products (
  ID varchar(30) NOT NULL,
  product_name varchar(150) NOT NULL,
  category varchar(30) NOT NULL,
  sub_category varchar(30) NOT NULL,
  PRIMARY KEY (ID)
)

CREATE TABLE customers (
  ID varchar(30) NOT NULL,
  customer_name varchar(30) NOT NULL,
  segment enum('Consumer','Corporate','Home Office') NOT NULL,
  city_ID varchar(30) NOT NULL,
  PRIMARY KEY (ID),
  FOREIGN KEY (city_ID) REFERENCES cities(ID)
)

CREATE TABLE cities (
  ID varchar(30) NOT NULL,
  city varchar(30) NOT NULL,
  state varchar(30) NOT NULL,
  region enum('West','Central','East','South') NOT NULL,
  country varchar(30) NOT NULL default 'United States',
  PRIMARY KEY (ID)
)