📚 Lesson 01 — Databases & Data Foundations

From Relational Algebra to SQL Mastery

Understand the theory first. Then every SQL query will make perfect sense.

📋 What we cover

Part 1 — Relational Algebra: the theory behind SQL
- Selection (σ), Projection (π), Rename (ρ)
- Set operations: Union, Intersection, Difference, Cartesian Product
- Join (⋈) and the RA → SQL mapping
Part 2 — SQL, Step by Step
- Step 1 — SELECT & FROM: choosing columns
- Step 2 — WHERE: filtering rows
- Step 3 — ORDER BY & LIMIT: sorting and paging
- Step 4 — JOINs: combining tables
- Step 5 — Aggregation functions: COUNT, SUM, AVG, MIN, MAX
- Step 6 — GROUP BY & HAVING
- Step 7 — Nested Queries (Subqueries & CTEs)
Part 3 — Database Design & Normalization
🧪 Practice Exercises on Kaggle

Part 1 — Relational Algebra

Before SQL existed, mathematicians developed Relational Algebra — a formal language for querying tables. Every SQL statement is secretly translated into relational algebra operations by the database engine. Learning the theory gives you a deep, lasting intuition for why SQL works the way it does.

🔑 Core idea

A relation is a table: a set of rows (tuples) with named columns (attributes). Relational algebra defines operations that take one or two relations as input and produce a new relation as output. Every SQL query is a composition of these operations.

We use this Students table throughout Part 1:

id	name	age	city	grade
1	Alice	20	Paris	A
2	Bob	22	Lyon	B
3	Carol	20	Paris	A
4	Dave	24	Bordeaux	C
5	Eve	21	Lyon	B

1.1 — Selection σ — Filter Rows

Selection (σ, sigma) keeps only the rows that satisfy a condition. It is a horizontal filter — the columns stay the same, only rows are reduced.

σ — Selection operator σ_condition(Relation)
σ_{city = 'Paris'}(Students) → Keeps rows: Alice (id=1), Carol (id=3)
σ_{age < 22 AND grade = 'A'}(Students) → Keeps rows: Alice, Carol

💡 SQL equivalent Selection σ maps directly to the WHERE clause. Every condition in WHERE is a selection predicate.

1.2 — Projection π — Choose Columns

Projection (π, pi) keeps only certain columns. It is a vertical filter — all rows survive but only chosen attributes remain. Because a relation is a set, duplicate rows are automatically eliminated.

π — Projection operator π_{attr₁, attr₂, …}(Relation)
π_{name, city}(Students) → {Alice,Paris} {Bob,Lyon} {Carol,Paris} {Dave,Bordeaux} {Eve,Lyon}
π_city(Students) → {Paris} {Lyon} {Bordeaux} — duplicates removed automatically

💡 SQL equivalent Projection π maps to the column list after SELECT. SELECT DISTINCT enforces the set semantics that RA implies.

1.3 — Composing σ and π

Operations nest naturally: apply selection first to reduce rows, then project to reduce columns. This is the algebraic basis for the common SELECT cols FROM table WHERE condition pattern.

Composition — σ then π π_{name, grade}( σ_{city = 'Lyon'}(Students) )
Step 1 — σ: keep only Lyon rows → Bob, Eve
Step 2 — π: keep only name and grade → {Bob, B}, {Eve, B}

SQL: SELECT name, grade FROM students WHERE city = 'Lyon';

1.4 — Rename ρ — Give New Names

Rename (ρ, rho) gives a new name to a relation or its attributes. This is essential when joining a table to itself (self-join), where you need two distinguishable copies.

ρ — Rename operator ρ_{NewName(old_col → new_col)}(Relation)
ρ_{S(id → student_id, name → student_name)}(Students)
SQL equivalent: SELECT id AS student_id, name AS student_name FROM students;

1.5 — Set Operations

Both relations must be union-compatible: same number of columns, compatible types.

R ∪ S

Union

All tuples from R or S, duplicates removed. SQL: UNION

R ∩ S

Intersection

Only tuples in both R and S. SQL: INTERSECT

R − S

Difference

Tuples in R but not in S. SQL: EXCEPT

R × S

Cartesian Product

Every row of R paired with every row of S. SQL: CROSS JOIN

1.6 — Join ⋈ — The Core Operation

The natural join (⋈) combines relations by matching rows on columns sharing the same name. It equals a Cartesian product filtered by the equality condition, then duplicated columns projected away.

Natural Join — formal definition R ⋈ S ≡ π_{non-duplicate attrs}( σ_{R.key = S.key}(R × S) )

This decomposition shows why large Cartesian products are expensive.
The engine uses indexes and smarter algorithms (hash join, merge join) to avoid materialising the full product.

Theta-join (any condition) and outer join (keep unmatched rows, fill NULLs) extend pure RA — both are available in SQL.

📐 Complete RA → SQL mapping

σ (selection) → WHERE π (projection) → SELECT col1, col2 ρ (rename) → AS alias ⋈ (join) → JOIN … ON ∪ → UNION ∩ → INTERSECT − → EXCEPT × → CROSS JOIN

Part 2 — SQL, Step by Step

Now that you understand the theory, we build SQL knowledge one layer at a time — starting with the simplest possible query and adding one concept per step.

employee_id	name	department	salary	hire_date	manager_id
1	Alice	Engineering	95 000	2018-03-10	NULL
2	Bob	Engineering	82 000	2020-06-15	1
3	Carol	Marketing	67 000	2019-09-01	1
4	Dave	Marketing	71 000	2021-01-20	3
5	Eve	Engineering	88 000	2020-11-05	1
6	Frank	HR	58 000	2022-04-03	NULL
7	Grace	Finance	91 000	2017-07-22	NULL

Step 1 — SELECT & FROM: Choosing Columns (Projection π)

Beginner

The minimal SQL query needs only two clauses: FROM names the table (the relation), and SELECT lists which columns to return (the projection). SELECT * is the special case that projects all columns.

-- Return every column and every row (no filter yet)
SELECT *
FROM   employees;

-- Project only name and salary (π_{name, salary})
SELECT name, salary
FROM   employees;

-- Rename columns with aliases (ρ in RA)
SELECT name   AS employee_name,
       salary AS annual_salary
FROM   employees;

-- DISTINCT enforces set semantics (no duplicates)
SELECT DISTINCT department
FROM   employees;
-- Result: Engineering, Marketing, HR, Finance (4 rows, not 7)

Step 2 — WHERE: Filtering Rows (Selection σ)

Beginner

WHERE is selection σ expressed in SQL. It filters rows before any computation. You can combine conditions with AND, OR, NOT, and use a rich set of predicates.

-- Equality filter
SELECT name, salary
FROM   employees
WHERE  department = 'Engineering';

-- Combine conditions
SELECT name, salary, department
FROM   employees
WHERE  salary > 70000
  AND  department <> 'HR';

-- BETWEEN: inclusive range (equivalent to >= AND <=)
SELECT name, salary
FROM   employees
WHERE  salary BETWEEN 60000 AND 90000;

-- IN: match any value from a list
SELECT name, department
FROM   employees
WHERE  department IN ('Engineering', 'Finance');

-- LIKE: pattern matching (% = any chars, _ = one char)
SELECT name
FROM   employees
WHERE  name LIKE 'A%';       -- starts with A

-- NULL check: always IS NULL, never = NULL
SELECT name
FROM   employees
WHERE  manager_id IS NULL;   -- top-level managers only

⚠️ The NULL trap NULL = NULL evaluates to NULL, not TRUE. Any arithmetic or comparison involving NULL yields NULL. Always use IS NULL or IS NOT NULL. Use COALESCE(col, default) to substitute a fallback value.

Step 3 — ORDER BY & LIMIT: Sorting and Paging

Beginner → Intermediate

Relations in relational algebra are sets — they have no inherent order. ORDER BY imposes an order on the final result after all other clauses are evaluated. LIMIT and OFFSET then restrict how many rows are returned, enabling pagination.

-- Sort by salary, highest first
SELECT name, salary
FROM   employees
ORDER BY salary DESC;    -- ASC is the default

-- Sort by multiple columns: primary key is department, secondary is salary
SELECT name, department, salary
FROM   employees
ORDER BY department ASC, salary DESC;

-- LIMIT: top 3 earners only
SELECT name, salary
FROM   employees
ORDER BY salary DESC
LIMIT  3;

-- OFFSET: pagination — skip first 3 rows, get next 3
SELECT name, salary
FROM   employees
ORDER BY salary DESC
LIMIT  3  OFFSET 3;   -- "page 2" of results

💡 Rule ORDER BY is always the last clause evaluated (before LIMIT). You cannot rely on row order anywhere else in a query. Never assume a SELECT without ORDER BY returns rows in a predictable sequence.

Step 4 — JOINs: Combining Tables (⋈)

Intermediate

A JOIN merges rows from two tables on a matching condition — relational algebra's join ⋈ in SQL syntax. The type of join determines what happens to unmatched rows.

INNER JOIN

Only matched rows. Unmatched rows from both tables are dropped.

LEFT JOIN

All left rows; NULL columns for unmatched right rows.

RIGHT JOIN

All right rows; NULL columns for unmatched left rows.

FULL OUTER

All rows from both sides; NULL wherever there's no match.

CROSS JOIN

Every row × every row. Cartesian product ×. No ON clause.

-- departments(dept_id, dept_name, budget)

-- INNER JOIN: only employees whose department exists in departments
SELECT e.name, d.dept_name, d.budget
FROM   employees   e
INNER JOIN departments d ON e.department = d.dept_name;

-- LEFT JOIN: all employees, NULL budget if no department record
SELECT e.name, d.dept_name, d.budget
FROM   employees   e
LEFT JOIN departments d ON e.department = d.dept_name;

-- Self-join: employee name + their manager's name (uses ρ via alias)
SELECT e.name                         AS employee,
       COALESCE(m.name, 'No Manager') AS manager
FROM   employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;

Step 5 — Aggregation Functions

Intermediate

Aggregation functions collapse a set of rows into a single value. They are always applied after WHERE filtering (and after JOINs), and before ORDER BY. When used without GROUP BY, they summarise the entire table.

Function	Computes	NULL behaviour
`COUNT(*)`	Number of rows	Counts rows even if all columns are NULL
`COUNT(col)`	Non-NULL values in col	Ignores NULLs
`COUNT(DISTINCT col)`	Unique non-NULL values	Ignores NULLs and duplicates
`SUM(col)`	Total of all values	Ignores NULLs
`AVG(col)`	Arithmetic mean	Ignores NULLs — denominator is non-NULL count
`MIN(col)`	Smallest value	Ignores NULLs
`MAX(col)`	Largest value	Ignores NULLs

-- Total headcount
SELECT COUNT(*) AS total_employees
FROM   employees;

-- Complete salary summary for the whole company
SELECT SUM(salary)            AS total_payroll,
       ROUND(AVG(salary), 2) AS avg_salary,
       MIN(salary)            AS lowest,
       MAX(salary)            AS highest
FROM   employees;

-- Count distinct departments
SELECT COUNT(DISTINCT department) AS num_departments
FROM   employees;

⚠️ AVG ignores NULLs — careful! AVG(salary) over [80 000, NULL, 60 000] returns 70 000, not 46 667. The NULL row is excluded from both numerator and denominator. Use AVG(COALESCE(salary, 0)) if you want NULLs treated as zero.

Step 6 — GROUP BY & HAVING

Intermediate

GROUP BY partitions rows into groups and applies aggregation per group. HAVING then filters those groups — it is the WHERE of aggregated results.

🔑 Logical execution order — memorise this

FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT / OFFSET

Consequences: (1) You cannot use a SELECT alias in WHERE or HAVING — the alias is created after those steps. (2) You cannot use an aggregate function in WHERE — use HAVING instead. (3) Only columns in GROUP BY or inside an aggregate can appear in SELECT.

-- Headcount and average salary per department
SELECT   department,
         COUNT(*)                AS headcount,
         ROUND(AVG(salary), 2)  AS avg_salary
FROM     employees
GROUP BY department
ORDER BY avg_salary DESC;

-- HAVING: keep only departments with avg salary > 75 000
SELECT   department,
         ROUND(AVG(salary), 2) AS avg_salary
FROM     employees
GROUP BY department
HAVING   AVG(salary) > 75000  -- use the function, not the alias
ORDER BY avg_salary DESC;

-- WHERE (row filter) + GROUP BY + HAVING (group filter)
SELECT   department,
         COUNT(*) AS recent_hires
FROM     employees
WHERE    hire_date >= '2020-01-01'    -- filter rows first
GROUP BY department
HAVING   COUNT(*) >= 2               -- then filter groups
ORDER BY recent_hires DESC;

Step 7 — Nested Queries: Subqueries & CTEs

Advanced

A subquery is a complete SELECT statement embedded inside another query. It directly expresses relational algebra's ability to compose operations. Subqueries can appear in WHERE, FROM, or SELECT.

━━━ TYPE 1 — Scalar subquery in WHERE ━━━
-- Employees earning above the overall average salary
SELECT name, salary
FROM   employees
WHERE  salary > (
           SELECT AVG(salary)   -- returns one value
           FROM   employees
       );


━━━ TYPE 2 — IN subquery (returns a list) ━━━
-- Employees who manage at least one person
SELECT name, department
FROM   employees
WHERE  employee_id IN (
           SELECT DISTINCT manager_id
           FROM   employees
           WHERE  manager_id IS NOT NULL
       );


━━━ TYPE 3 — EXISTS (short-circuits on first match) ━━━
-- Same result as above, often faster on large tables
SELECT e.name
FROM   employees e
WHERE  EXISTS (
           SELECT 1
           FROM   employees sub
           WHERE  sub.manager_id = e.employee_id
       );


━━━ TYPE 4 — Subquery in FROM (derived table) ━━━
-- Rank departments by headcount using a nested SELECT
SELECT ds.department, ds.headcount
FROM  (
          SELECT   department,
                   COUNT(*) AS headcount
          FROM     employees
          GROUP BY department
      ) AS ds              -- ← derived table must be aliased
ORDER BY ds.headcount DESC;


━━━ TYPE 5 — Correlated subquery (references outer row) ━━━
-- Employees earning above THEIR OWN department average
-- The subquery re-runs for every row of the outer query
SELECT e.name, e.department, e.salary
FROM   employees e
WHERE  e.salary > (
           SELECT AVG(e2.salary)
           FROM   employees e2
           WHERE  e2.department = e.department  -- ← refers to outer alias
       )
ORDER BY e.department, e.salary DESC;

⚡ Subqueries vs JOINs vs EXISTS Scalar and IN subqueries can usually be rewritten as JOINs, which the engine often optimises better. EXISTS is ideal for "does at least one row exist?" checks — it stops as soon as the first match is found. Correlated subqueries run once per outer row, so use them carefully on large tables. For deep nesting, prefer CTEs below for readability.

━━━ CTE — Common Table Expression (WITH clause) ━━━
-- Cleaner way to write the "above dept average" query
WITH dept_avg AS (
    SELECT   department,
             AVG(salary) AS avg_sal
    FROM     employees
    GROUP BY department
)
SELECT e.name, e.department, e.salary
FROM   employees    e
JOIN   dept_avg     d  ON e.department = d.department
WHERE  e.salary > d.avg_sal
ORDER BY e.department, e.salary DESC;

-- CTEs can be chained — each builds on the previous
WITH
  dept_stats AS (
      SELECT department, AVG(salary) AS avg_sal, COUNT(*) AS cnt
      FROM   employees  GROUP BY department
  ),
  large_depts AS (
      SELECT * FROM dept_stats WHERE cnt >= 2
  )
SELECT e.name, e.salary, ld.avg_sal
FROM   employees e
JOIN   large_depts ld ON e.department = ld.department
WHERE  e.salary > ld.avg_sal;

Part 3 — Database Design & Normalization

Good schema design eliminates data anomalies — unintended side effects when inserting, updating, or deleting. Normalization is the systematic process for achieving this.

3.1 — Keys & Relationships

Concept	Definition	Example
Primary Key (PK)	Uniquely identifies each row. Never NULL.	`employee_id`
Foreign Key (FK)	References a PK in another table. Enforces referential integrity.	`manager_id → employee_id`
Candidate Key	Any column(s) that could serve as PK.	`email`
Composite Key	PK made of two or more columns.	`(student_id, course_id)`
Surrogate Key	Artificial PK (auto-increment or UUID).	`id SERIAL`

3.2 — The Three Normal Forms

1NF

Every column holds atomic values. No repeating groups or arrays in a cell. Rows are uniquely identified.

2NF

Meets 1NF. Every non-key column depends on the whole primary key. No partial dependencies.

3NF

Meets 2NF. No transitive dependencies — non-key columns depend directly on the PK, not through other non-key columns.

3.3 — Normalization Walkthrough

Start with this un-normalized orders table:

order_id	customer_name	customer_city	product	price	qty
1	Alice	Paris	Laptop	1200	1
1	Alice	Paris	Mouse	25	2
2	Bob	Lyon	Keyboard	75	1

Problems: Alice's city is duplicated (update anomaly). Deleting order 2 loses Bob's city (delete anomaly). Each fact should live in one place only.

-- 3NF: four clean, linked tables
CREATE TABLE customers   (customer_id SERIAL PRIMARY KEY, name VARCHAR(100), city VARCHAR(100));
CREATE TABLE products    (product_id  SERIAL PRIMARY KEY, name VARCHAR(100), price NUMERIC(10,2));
CREATE TABLE orders      (order_id    SERIAL PRIMARY KEY, customer_id INT REFERENCES customers, order_date DATE);
CREATE TABLE order_items (
  order_id   INT REFERENCES orders,
  product_id INT REFERENCES products,
  quantity   INT NOT NULL CHECK (quantity > 0),
  PRIMARY KEY (order_id, product_id)
);

✅ What we gained Update Alice's city once → reflected everywhere. Delete an order → customer data is safe. Every fact lives in exactly one place: the Single Source of Truth principle.

🧪 Practice on W3school

Self-paced exercises covering all three parts — RA, SQL step-by-step, and normalization.

→ Open W3school

SQL Foundations, Database Design & Relational Algebra

From Relational Algebra to SQL Mastery

📋 What we cover

Part 1 — Relational Algebra

1.1 — Selection σ — Filter Rows

1.2 — Projection π — Choose Columns

1.3 — Composing σ and π

1.4 — Rename ρ — Give New Names

1.5 — Set Operations

Union

Intersection

Difference

Cartesian Product

1.6 — Join ⋈ — The Core Operation

Part 2 — SQL, Step by Step

Step 1 — SELECT & FROM: Choosing Columns (Projection π)

Step 2 — WHERE: Filtering Rows (Selection σ)

Step 3 — ORDER BY & LIMIT: Sorting and Paging

Step 4 — JOINs: Combining Tables (⋈)

Step 5 — Aggregation Functions

Step 6 — GROUP BY & HAVING

Step 7 — Nested Queries: Subqueries & CTEs

Part 3 — Database Design & Normalization

3.1 — Keys & Relationships

3.2 — The Three Normal Forms

1NF

2NF

3NF

3.3 — Normalization Walkthrough

🧪 Practice on W3school

Comments

Post a Comment

Popular posts from this blog