SQL Foundations, Database Design & Relational Algebra

๐Ÿ“š Lesson 01 — Databases & Data Foundations

From Relational Algebra to SQL Mastery

Understand the theory first. Then every SQL query will make perfect sense.

๐Ÿ“‹ What we cover

  1. Part 1 — Relational Algebra: the theory behind SQL
    • Selection (ฯƒ), Projection (ฯ€), Rename (ฯ)
    • Set operations: Union, Intersection, Difference, Cartesian Product
    • Join (⋈) and the RA → SQL mapping
  2. Part 2 — SQL, Step by Step
    • Step 1 — SELECT & FROM: choosing columns
    • Step 2 — WHERE: filtering rows
    • Step 3 — ORDER BY & LIMIT: sorting and paging
    • Step 4 — JOINs: combining tables
    • Step 5 — Aggregation functions: COUNT, SUM, AVG, MIN, MAX
    • Step 6 — GROUP BY & HAVING
    • Step 7 — Nested Queries (Subqueries & CTEs)
  3. Part 3 — Database Design & Normalization
  4. ๐Ÿงช Practice Exercises on Kaggle

Part 1 — Relational Algebra

Before SQL existed, mathematicians developed Relational Algebra — a formal language for querying tables. Every SQL statement is secretly translated into relational algebra operations by the database engine. Learning the theory gives you a deep, lasting intuition for why SQL works the way it does.

๐Ÿ”‘ Core idea

A relation is a table: a set of rows (tuples) with named columns (attributes). Relational algebra defines operations that take one or two relations as input and produce a new relation as output. Every SQL query is a composition of these operations.

We use this Students table throughout Part 1:

idnameagecitygrade
1Alice20ParisA
2Bob22LyonB
3Carol20ParisA
4Dave24BordeauxC
5Eve21LyonB

1.1 — Selection ฯƒ — Filter Rows

Selection (ฯƒ, sigma) keeps only the rows that satisfy a condition. It is a horizontal filter — the columns stay the same, only rows are reduced.

ฯƒ — Selection operator ฯƒcondition(Relation)
ฯƒcity = 'Paris'(Students) → Keeps rows: Alice (id=1), Carol (id=3)
ฯƒage < 22 AND grade = 'A'(Students) → Keeps rows: Alice, Carol
๐Ÿ’ก SQL equivalent Selection ฯƒ maps directly to the WHERE clause. Every condition in WHERE is a selection predicate.

1.2 — Projection ฯ€ — Choose Columns

Projection (ฯ€, pi) keeps only certain columns. It is a vertical filter — all rows survive but only chosen attributes remain. Because a relation is a set, duplicate rows are automatically eliminated.

ฯ€ — Projection operator ฯ€attr₁, attr₂, …(Relation)
ฯ€name, city(Students) → {Alice,Paris} {Bob,Lyon} {Carol,Paris} {Dave,Bordeaux} {Eve,Lyon}
ฯ€city(Students) → {Paris} {Lyon} {Bordeaux} — duplicates removed automatically
๐Ÿ’ก SQL equivalent Projection ฯ€ maps to the column list after SELECT. SELECT DISTINCT enforces the set semantics that RA implies.

1.3 — Composing ฯƒ and ฯ€

Operations nest naturally: apply selection first to reduce rows, then project to reduce columns. This is the algebraic basis for the common SELECT cols FROM table WHERE condition pattern.

Composition — ฯƒ then ฯ€ ฯ€name, grade( ฯƒcity = 'Lyon'(Students) )
Step 1 — ฯƒ: keep only Lyon rows → Bob, Eve
Step 2 — ฯ€: keep only name and grade → {Bob, B}, {Eve, B}

SQL: SELECT name, grade FROM students WHERE city = 'Lyon';

1.4 — Rename ฯ — Give New Names

Rename (ฯ, rho) gives a new name to a relation or its attributes. This is essential when joining a table to itself (self-join), where you need two distinguishable copies.

ฯ — Rename operator ฯNewName(old_col → new_col)(Relation)
ฯS(id → student_id, name → student_name)(Students)
SQL equivalent: SELECT id AS student_id, name AS student_name FROM students;

1.5 — Set Operations

Both relations must be union-compatible: same number of columns, compatible types.

R ∪ S

Union

All tuples from R or S, duplicates removed. SQL: UNION

R ∩ S

Intersection

Only tuples in both R and S. SQL: INTERSECT

R − S

Difference

Tuples in R but not in S. SQL: EXCEPT

R × S

Cartesian Product

Every row of R paired with every row of S. SQL: CROSS JOIN

1.6 — Join ⋈ — The Core Operation

The natural join (⋈) combines relations by matching rows on columns sharing the same name. It equals a Cartesian product filtered by the equality condition, then duplicated columns projected away.

Natural Join — formal definition R ⋈ S ≡ ฯ€non-duplicate attrs( ฯƒR.key = S.key(R × S) )

This decomposition shows why large Cartesian products are expensive.
The engine uses indexes and smarter algorithms (hash join, merge join) to avoid materialising the full product.

Theta-join (any condition) and outer join (keep unmatched rows, fill NULLs) extend pure RA — both are available in SQL.
๐Ÿ“ Complete RA → SQL mapping

ฯƒ (selection)WHERE    ฯ€ (projection)SELECT col1, col2    ฯ (rename)AS alias    ⋈ (join)JOIN … ON    UNION    INTERSECT    EXCEPT    ×CROSS JOIN

Part 2 — SQL, Step by Step

Now that you understand the theory, we build SQL knowledge one layer at a time — starting with the simplest possible query and adding one concept per step.

employee_idnamedepartmentsalaryhire_datemanager_id
1AliceEngineering95 0002018-03-10NULL
2BobEngineering82 0002020-06-151
3CarolMarketing67 0002019-09-011
4DaveMarketing71 0002021-01-203
5EveEngineering88 0002020-11-051
6FrankHR58 0002022-04-03NULL
7GraceFinance91 0002017-07-22NULL

Step 1 — SELECT & FROM: Choosing Columns (Projection ฯ€)

Beginner

The minimal SQL query needs only two clauses: FROM names the table (the relation), and SELECT lists which columns to return (the projection). SELECT * is the special case that projects all columns.

-- Return every column and every row (no filter yet)
SELECT *
FROM   employees;

-- Project only name and salary (ฯ€_{name, salary})
SELECT name, salary
FROM   employees;

-- Rename columns with aliases (ฯ in RA)
SELECT name   AS employee_name,
       salary AS annual_salary
FROM   employees;

-- DISTINCT enforces set semantics (no duplicates)
SELECT DISTINCT department
FROM   employees;
-- Result: Engineering, Marketing, HR, Finance (4 rows, not 7)

Step 2 — WHERE: Filtering Rows (Selection ฯƒ)

Beginner

WHERE is selection ฯƒ expressed in SQL. It filters rows before any computation. You can combine conditions with AND, OR, NOT, and use a rich set of predicates.

-- Equality filter
SELECT name, salary
FROM   employees
WHERE  department = 'Engineering';

-- Combine conditions
SELECT name, salary, department
FROM   employees
WHERE  salary > 70000
  AND  department <> 'HR';

-- BETWEEN: inclusive range (equivalent to >= AND <=)
SELECT name, salary
FROM   employees
WHERE  salary BETWEEN 60000 AND 90000;

-- IN: match any value from a list
SELECT name, department
FROM   employees
WHERE  department IN ('Engineering', 'Finance');

-- LIKE: pattern matching (% = any chars, _ = one char)
SELECT name
FROM   employees
WHERE  name LIKE 'A%';       -- starts with A

-- NULL check: always IS NULL, never = NULL
SELECT name
FROM   employees
WHERE  manager_id IS NULL;   -- top-level managers only
⚠️ The NULL trap NULL = NULL evaluates to NULL, not TRUE. Any arithmetic or comparison involving NULL yields NULL. Always use IS NULL or IS NOT NULL. Use COALESCE(col, default) to substitute a fallback value.

Step 3 — ORDER BY & LIMIT: Sorting and Paging

Beginner → Intermediate

Relations in relational algebra are sets — they have no inherent order. ORDER BY imposes an order on the final result after all other clauses are evaluated. LIMIT and OFFSET then restrict how many rows are returned, enabling pagination.

-- Sort by salary, highest first
SELECT name, salary
FROM   employees
ORDER BY salary DESC;    -- ASC is the default

-- Sort by multiple columns: primary key is department, secondary is salary
SELECT name, department, salary
FROM   employees
ORDER BY department ASC, salary DESC;

-- LIMIT: top 3 earners only
SELECT name, salary
FROM   employees
ORDER BY salary DESC
LIMIT  3;

-- OFFSET: pagination — skip first 3 rows, get next 3
SELECT name, salary
FROM   employees
ORDER BY salary DESC
LIMIT  3  OFFSET 3;   -- "page 2" of results
๐Ÿ’ก Rule ORDER BY is always the last clause evaluated (before LIMIT). You cannot rely on row order anywhere else in a query. Never assume a SELECT without ORDER BY returns rows in a predictable sequence.

Step 4 — JOINs: Combining Tables (⋈)

Intermediate

A JOIN merges rows from two tables on a matching condition — relational algebra's join ⋈ in SQL syntax. The type of join determines what happens to unmatched rows.

INNER JOIN

Only matched rows. Unmatched rows from both tables are dropped.

LEFT JOIN

All left rows; NULL columns for unmatched right rows.

RIGHT JOIN

All right rows; NULL columns for unmatched left rows.

FULL OUTER

All rows from both sides; NULL wherever there's no match.

CROSS JOIN

Every row × every row. Cartesian product ×. No ON clause.

-- departments(dept_id, dept_name, budget)

-- INNER JOIN: only employees whose department exists in departments
SELECT e.name, d.dept_name, d.budget
FROM   employees   e
INNER JOIN departments d ON e.department = d.dept_name;

-- LEFT JOIN: all employees, NULL budget if no department record
SELECT e.name, d.dept_name, d.budget
FROM   employees   e
LEFT JOIN departments d ON e.department = d.dept_name;

-- Self-join: employee name + their manager's name (uses ฯ via alias)
SELECT e.name                         AS employee,
       COALESCE(m.name, 'No Manager') AS manager
FROM   employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;

Step 5 — Aggregation Functions

Intermediate

Aggregation functions collapse a set of rows into a single value. They are always applied after WHERE filtering (and after JOINs), and before ORDER BY. When used without GROUP BY, they summarise the entire table.

FunctionComputesNULL behaviour
COUNT(*)Number of rowsCounts rows even if all columns are NULL
COUNT(col)Non-NULL values in colIgnores NULLs
COUNT(DISTINCT col)Unique non-NULL valuesIgnores NULLs and duplicates
SUM(col)Total of all valuesIgnores NULLs
AVG(col)Arithmetic meanIgnores NULLs — denominator is non-NULL count
MIN(col)Smallest valueIgnores NULLs
MAX(col)Largest valueIgnores NULLs
-- Total headcount
SELECT COUNT(*) AS total_employees
FROM   employees;

-- Complete salary summary for the whole company
SELECT SUM(salary)            AS total_payroll,
       ROUND(AVG(salary), 2) AS avg_salary,
       MIN(salary)            AS lowest,
       MAX(salary)            AS highest
FROM   employees;

-- Count distinct departments
SELECT COUNT(DISTINCT department) AS num_departments
FROM   employees;
⚠️ AVG ignores NULLs — careful! AVG(salary) over [80 000, NULL, 60 000] returns 70 000, not 46 667. The NULL row is excluded from both numerator and denominator. Use AVG(COALESCE(salary, 0)) if you want NULLs treated as zero.

Step 6 — GROUP BY & HAVING

Intermediate

GROUP BY partitions rows into groups and applies aggregation per group. HAVING then filters those groups — it is the WHERE of aggregated results.

๐Ÿ”‘ Logical execution order — memorise this

FROMJOINWHEREGROUP BYHAVINGSELECTDISTINCTORDER BYLIMIT / OFFSET

Consequences: (1) You cannot use a SELECT alias in WHERE or HAVING — the alias is created after those steps. (2) You cannot use an aggregate function in WHERE — use HAVING instead. (3) Only columns in GROUP BY or inside an aggregate can appear in SELECT.

-- Headcount and average salary per department
SELECT   department,
         COUNT(*)                AS headcount,
         ROUND(AVG(salary), 2)  AS avg_salary
FROM     employees
GROUP BY department
ORDER BY avg_salary DESC;

-- HAVING: keep only departments with avg salary > 75 000
SELECT   department,
         ROUND(AVG(salary), 2) AS avg_salary
FROM     employees
GROUP BY department
HAVING   AVG(salary) > 75000  -- use the function, not the alias
ORDER BY avg_salary DESC;

-- WHERE (row filter) + GROUP BY + HAVING (group filter)
SELECT   department,
         COUNT(*) AS recent_hires
FROM     employees
WHERE    hire_date >= '2020-01-01'    -- filter rows first
GROUP BY department
HAVING   COUNT(*) >= 2               -- then filter groups
ORDER BY recent_hires DESC;

Step 7 — Nested Queries: Subqueries & CTEs

Advanced

A subquery is a complete SELECT statement embedded inside another query. It directly expresses relational algebra's ability to compose operations. Subqueries can appear in WHERE, FROM, or SELECT.

━━━ TYPE 1 — Scalar subquery in WHERE ━━━
-- Employees earning above the overall average salary
SELECT name, salary
FROM   employees
WHERE  salary > (
           SELECT AVG(salary)   -- returns one value
           FROM   employees
       );


━━━ TYPE 2 — IN subquery (returns a list) ━━━
-- Employees who manage at least one person
SELECT name, department
FROM   employees
WHERE  employee_id IN (
           SELECT DISTINCT manager_id
           FROM   employees
           WHERE  manager_id IS NOT NULL
       );


━━━ TYPE 3 — EXISTS (short-circuits on first match) ━━━
-- Same result as above, often faster on large tables
SELECT e.name
FROM   employees e
WHERE  EXISTS (
           SELECT 1
           FROM   employees sub
           WHERE  sub.manager_id = e.employee_id
       );


━━━ TYPE 4 — Subquery in FROM (derived table) ━━━
-- Rank departments by headcount using a nested SELECT
SELECT ds.department, ds.headcount
FROM  (
          SELECT   department,
                   COUNT(*) AS headcount
          FROM     employees
          GROUP BY department
      ) AS ds              -- ← derived table must be aliased
ORDER BY ds.headcount DESC;


━━━ TYPE 5 — Correlated subquery (references outer row) ━━━
-- Employees earning above THEIR OWN department average
-- The subquery re-runs for every row of the outer query
SELECT e.name, e.department, e.salary
FROM   employees e
WHERE  e.salary > (
           SELECT AVG(e2.salary)
           FROM   employees e2
           WHERE  e2.department = e.department  -- ← refers to outer alias
       )
ORDER BY e.department, e.salary DESC;
⚡ Subqueries vs JOINs vs EXISTS Scalar and IN subqueries can usually be rewritten as JOINs, which the engine often optimises better. EXISTS is ideal for "does at least one row exist?" checks — it stops as soon as the first match is found. Correlated subqueries run once per outer row, so use them carefully on large tables. For deep nesting, prefer CTEs below for readability.
━━━ CTE — Common Table Expression (WITH clause) ━━━
-- Cleaner way to write the "above dept average" query
WITH dept_avg AS (
    SELECT   department,
             AVG(salary) AS avg_sal
    FROM     employees
    GROUP BY department
)
SELECT e.name, e.department, e.salary
FROM   employees    e
JOIN   dept_avg     d  ON e.department = d.department
WHERE  e.salary > d.avg_sal
ORDER BY e.department, e.salary DESC;

-- CTEs can be chained — each builds on the previous
WITH
  dept_stats AS (
      SELECT department, AVG(salary) AS avg_sal, COUNT(*) AS cnt
      FROM   employees  GROUP BY department
  ),
  large_depts AS (
      SELECT * FROM dept_stats WHERE cnt >= 2
  )
SELECT e.name, e.salary, ld.avg_sal
FROM   employees e
JOIN   large_depts ld ON e.department = ld.department
WHERE  e.salary > ld.avg_sal;

Part 3 — Database Design & Normalization

Good schema design eliminates data anomalies — unintended side effects when inserting, updating, or deleting. Normalization is the systematic process for achieving this.

3.1 — Keys & Relationships

ConceptDefinitionExample
Primary Key (PK)Uniquely identifies each row. Never NULL.employee_id
Foreign Key (FK)References a PK in another table. Enforces referential integrity.manager_id → employee_id
Candidate KeyAny column(s) that could serve as PK.email
Composite KeyPK made of two or more columns.(student_id, course_id)
Surrogate KeyArtificial PK (auto-increment or UUID).id SERIAL

3.2 — The Three Normal Forms

1NF

Every column holds atomic values. No repeating groups or arrays in a cell. Rows are uniquely identified.

2NF

Meets 1NF. Every non-key column depends on the whole primary key. No partial dependencies.

3NF

Meets 2NF. No transitive dependencies — non-key columns depend directly on the PK, not through other non-key columns.

3.3 — Normalization Walkthrough

Start with this un-normalized orders table:

order_idcustomer_namecustomer_cityproductpriceqty
1AliceParisLaptop12001
1AliceParisMouse252
2BobLyonKeyboard751

Problems: Alice's city is duplicated (update anomaly). Deleting order 2 loses Bob's city (delete anomaly). Each fact should live in one place only.

-- 3NF: four clean, linked tables
CREATE TABLE customers   (customer_id SERIAL PRIMARY KEY, name VARCHAR(100), city VARCHAR(100));
CREATE TABLE products    (product_id  SERIAL PRIMARY KEY, name VARCHAR(100), price NUMERIC(10,2));
CREATE TABLE orders      (order_id    SERIAL PRIMARY KEY, customer_id INT REFERENCES customers, order_date DATE);
CREATE TABLE order_items (
  order_id   INT REFERENCES orders,
  product_id INT REFERENCES products,
  quantity   INT NOT NULL CHECK (quantity > 0),
  PRIMARY KEY (order_id, product_id)
);
✅ What we gained Update Alice's city once → reflected everywhere. Delete an order → customer data is safe. Every fact lives in exactly one place: the Single Source of Truth principle.

๐Ÿงช Practice on W3school

Self-paced exercises covering all three parts — RA, SQL step-by-step, and normalization.

→ Open W3school

Comments

Popular posts from this blog

Automate Blog Content Creation with n8n and Grok 3 API

DAX: The Complete Guide

Hello world !