SQL Foundations, Database Design & Relational Algebra
From Relational Algebra to SQL Mastery
Understand the theory first. Then every SQL query will make perfect sense.
๐ What we cover
- Part 1 — Relational Algebra: the theory behind SQL
- Selection (ฯ), Projection (ฯ), Rename (ฯ)
- Set operations: Union, Intersection, Difference, Cartesian Product
- Join (⋈) and the RA → SQL mapping
- Part 2 — SQL, Step by Step
- Step 1 — SELECT & FROM: choosing columns
- Step 2 — WHERE: filtering rows
- Step 3 — ORDER BY & LIMIT: sorting and paging
- Step 4 — JOINs: combining tables
- Step 5 — Aggregation functions: COUNT, SUM, AVG, MIN, MAX
- Step 6 — GROUP BY & HAVING
- Step 7 — Nested Queries (Subqueries & CTEs)
- Part 3 — Database Design & Normalization
- ๐งช Practice Exercises on Kaggle
Part 1 — Relational Algebra
Before SQL existed, mathematicians developed Relational Algebra — a formal language for querying tables. Every SQL statement is secretly translated into relational algebra operations by the database engine. Learning the theory gives you a deep, lasting intuition for why SQL works the way it does.
A relation is a table: a set of rows (tuples) with named columns (attributes). Relational algebra defines operations that take one or two relations as input and produce a new relation as output. Every SQL query is a composition of these operations.
We use this Students table throughout Part 1:
| id | name | age | city | grade |
|---|---|---|---|---|
| 1 | Alice | 20 | Paris | A |
| 2 | Bob | 22 | Lyon | B |
| 3 | Carol | 20 | Paris | A |
| 4 | Dave | 24 | Bordeaux | C |
| 5 | Eve | 21 | Lyon | B |
1.1 — Selection ฯ — Filter Rows
Selection (ฯ, sigma) keeps only the rows that satisfy a condition. It is a horizontal filter — the columns stay the same, only rows are reduced.
ฯcity = 'Paris'(Students) → Keeps rows: Alice (id=1), Carol (id=3)
ฯage < 22 AND grade = 'A'(Students) → Keeps rows: Alice, Carol
WHERE clause.
Every condition in WHERE is a selection predicate.
1.2 — Projection ฯ — Choose Columns
Projection (ฯ, pi) keeps only certain columns. It is a vertical filter — all rows survive but only chosen attributes remain. Because a relation is a set, duplicate rows are automatically eliminated.
ฯname, city(Students) → {Alice,Paris} {Bob,Lyon} {Carol,Paris} {Dave,Bordeaux} {Eve,Lyon}
ฯcity(Students) → {Paris} {Lyon} {Bordeaux} — duplicates removed automatically
SELECT.
SELECT DISTINCT enforces the set semantics that RA implies.
1.3 — Composing ฯ and ฯ
Operations nest naturally: apply selection first to reduce rows, then project to reduce columns.
This is the algebraic basis for the common SELECT cols FROM table WHERE condition pattern.
Step 1 — ฯ: keep only Lyon rows → Bob, Eve
Step 2 — ฯ: keep only name and grade → {Bob, B}, {Eve, B}
SQL: SELECT name, grade FROM students WHERE city = 'Lyon';
1.4 — Rename ฯ — Give New Names
Rename (ฯ, rho) gives a new name to a relation or its attributes. This is essential when joining a table to itself (self-join), where you need two distinguishable copies.
ฯS(id → student_id, name → student_name)(Students)
SQL equivalent: SELECT id AS student_id, name AS student_name FROM students;
1.5 — Set Operations
Both relations must be union-compatible: same number of columns, compatible types.
Union
All tuples from R or S, duplicates removed. SQL: UNION
Intersection
Only tuples in both R and S. SQL: INTERSECT
Difference
Tuples in R but not in S. SQL: EXCEPT
Cartesian Product
Every row of R paired with every row of S. SQL: CROSS JOIN
1.6 — Join ⋈ — The Core Operation
The natural join (⋈) combines relations by matching rows on columns sharing the same name. It equals a Cartesian product filtered by the equality condition, then duplicated columns projected away.
This decomposition shows why large Cartesian products are expensive.
The engine uses indexes and smarter algorithms (hash join, merge join) to avoid materialising the full product.
Theta-join (any condition) and outer join (keep unmatched rows, fill NULLs) extend pure RA — both are available in SQL.
ฯ (selection) → WHERE
ฯ (projection) → SELECT col1, col2
ฯ (rename) → AS alias
⋈ (join) → JOIN … ON
∪ → UNION
∩ → INTERSECT
− → EXCEPT
× → CROSS JOIN
Part 2 — SQL, Step by Step
Now that you understand the theory, we build SQL knowledge one layer at a time — starting with the simplest possible query and adding one concept per step.
| employee_id | name | department | salary | hire_date | manager_id |
|---|---|---|---|---|---|
| 1 | Alice | Engineering | 95 000 | 2018-03-10 | NULL |
| 2 | Bob | Engineering | 82 000 | 2020-06-15 | 1 |
| 3 | Carol | Marketing | 67 000 | 2019-09-01 | 1 |
| 4 | Dave | Marketing | 71 000 | 2021-01-20 | 3 |
| 5 | Eve | Engineering | 88 000 | 2020-11-05 | 1 |
| 6 | Frank | HR | 58 000 | 2022-04-03 | NULL |
| 7 | Grace | Finance | 91 000 | 2017-07-22 | NULL |
Step 1 — SELECT & FROM: Choosing Columns (Projection ฯ)
The minimal SQL query needs only two clauses: FROM names the table (the relation),
and SELECT lists which columns to return (the projection). SELECT * is
the special case that projects all columns.
-- Return every column and every row (no filter yet)
SELECT *
FROM employees;
-- Project only name and salary (ฯ_{name, salary})
SELECT name, salary
FROM employees;
-- Rename columns with aliases (ฯ in RA)
SELECT name AS employee_name,
salary AS annual_salary
FROM employees;
-- DISTINCT enforces set semantics (no duplicates)
SELECT DISTINCT department
FROM employees;
-- Result: Engineering, Marketing, HR, Finance (4 rows, not 7)
Step 2 — WHERE: Filtering Rows (Selection ฯ)
WHERE is selection ฯ expressed in SQL. It filters rows before any
computation. You can combine conditions with AND, OR, NOT,
and use a rich set of predicates.
-- Equality filter
SELECT name, salary
FROM employees
WHERE department = 'Engineering';
-- Combine conditions
SELECT name, salary, department
FROM employees
WHERE salary > 70000
AND department <> 'HR';
-- BETWEEN: inclusive range (equivalent to >= AND <=)
SELECT name, salary
FROM employees
WHERE salary BETWEEN 60000 AND 90000;
-- IN: match any value from a list
SELECT name, department
FROM employees
WHERE department IN ('Engineering', 'Finance');
-- LIKE: pattern matching (% = any chars, _ = one char)
SELECT name
FROM employees
WHERE name LIKE 'A%'; -- starts with A
-- NULL check: always IS NULL, never = NULL
SELECT name
FROM employees
WHERE manager_id IS NULL; -- top-level managers only
NULL = NULL evaluates to NULL, not TRUE.
Any arithmetic or comparison involving NULL yields NULL. Always use IS NULL
or IS NOT NULL. Use COALESCE(col, default) to substitute a fallback value.
Step 3 — ORDER BY & LIMIT: Sorting and Paging
Relations in relational algebra are sets — they have no inherent order.
ORDER BY imposes an order on the final result after all other clauses are evaluated.
LIMIT and OFFSET then restrict how many rows are returned, enabling pagination.
-- Sort by salary, highest first
SELECT name, salary
FROM employees
ORDER BY salary DESC; -- ASC is the default
-- Sort by multiple columns: primary key is department, secondary is salary
SELECT name, department, salary
FROM employees
ORDER BY department ASC, salary DESC;
-- LIMIT: top 3 earners only
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 3;
-- OFFSET: pagination — skip first 3 rows, get next 3
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 3 OFFSET 3; -- "page 2" of results
ORDER BY is always the last clause evaluated (before LIMIT). You cannot rely on
row order anywhere else in a query. Never assume a SELECT without ORDER BY returns rows in
a predictable sequence.
Step 4 — JOINs: Combining Tables (⋈)
A JOIN merges rows from two tables on a matching condition — relational algebra's join ⋈ in SQL syntax. The type of join determines what happens to unmatched rows.
Only matched rows. Unmatched rows from both tables are dropped.
All left rows; NULL columns for unmatched right rows.
All right rows; NULL columns for unmatched left rows.
All rows from both sides; NULL wherever there's no match.
Every row × every row. Cartesian product ×. No ON clause.
-- departments(dept_id, dept_name, budget)
-- INNER JOIN: only employees whose department exists in departments
SELECT e.name, d.dept_name, d.budget
FROM employees e
INNER JOIN departments d ON e.department = d.dept_name;
-- LEFT JOIN: all employees, NULL budget if no department record
SELECT e.name, d.dept_name, d.budget
FROM employees e
LEFT JOIN departments d ON e.department = d.dept_name;
-- Self-join: employee name + their manager's name (uses ฯ via alias)
SELECT e.name AS employee,
COALESCE(m.name, 'No Manager') AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;
Step 5 — Aggregation Functions
Aggregation functions collapse a set of rows into a single value. They are always applied after WHERE filtering (and after JOINs), and before ORDER BY. When used without GROUP BY, they summarise the entire table.
| Function | Computes | NULL behaviour |
|---|---|---|
COUNT(*) | Number of rows | Counts rows even if all columns are NULL |
COUNT(col) | Non-NULL values in col | Ignores NULLs |
COUNT(DISTINCT col) | Unique non-NULL values | Ignores NULLs and duplicates |
SUM(col) | Total of all values | Ignores NULLs |
AVG(col) | Arithmetic mean | Ignores NULLs — denominator is non-NULL count |
MIN(col) | Smallest value | Ignores NULLs |
MAX(col) | Largest value | Ignores NULLs |
-- Total headcount
SELECT COUNT(*) AS total_employees
FROM employees;
-- Complete salary summary for the whole company
SELECT SUM(salary) AS total_payroll,
ROUND(AVG(salary), 2) AS avg_salary,
MIN(salary) AS lowest,
MAX(salary) AS highest
FROM employees;
-- Count distinct departments
SELECT COUNT(DISTINCT department) AS num_departments
FROM employees;
AVG(salary) over [80 000, NULL, 60 000] returns 70 000, not 46 667.
The NULL row is excluded from both numerator and denominator.
Use AVG(COALESCE(salary, 0)) if you want NULLs treated as zero.
Step 6 — GROUP BY & HAVING
GROUP BY partitions rows into groups and applies aggregation per group.
HAVING then filters those groups — it is the WHERE of aggregated results.
FROM → JOIN → WHERE → GROUP BY
→ HAVING → SELECT → DISTINCT
→ ORDER BY → LIMIT / OFFSET
Consequences: (1) You cannot use a SELECT alias in WHERE or HAVING —
the alias is created after those steps. (2) You cannot use an aggregate
function in WHERE — use HAVING instead. (3) Only columns in GROUP BY or inside an aggregate
can appear in SELECT.
-- Headcount and average salary per department
SELECT department,
COUNT(*) AS headcount,
ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;
-- HAVING: keep only departments with avg salary > 75 000
SELECT department,
ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 75000 -- use the function, not the alias
ORDER BY avg_salary DESC;
-- WHERE (row filter) + GROUP BY + HAVING (group filter)
SELECT department,
COUNT(*) AS recent_hires
FROM employees
WHERE hire_date >= '2020-01-01' -- filter rows first
GROUP BY department
HAVING COUNT(*) >= 2 -- then filter groups
ORDER BY recent_hires DESC;
Step 7 — Nested Queries: Subqueries & CTEs
A subquery is a complete SELECT statement embedded inside another query.
It directly expresses relational algebra's ability to compose operations.
Subqueries can appear in WHERE, FROM, or SELECT.
━━━ TYPE 1 — Scalar subquery in WHERE ━━━
-- Employees earning above the overall average salary
SELECT name, salary
FROM employees
WHERE salary > (
SELECT AVG(salary) -- returns one value
FROM employees
);
━━━ TYPE 2 — IN subquery (returns a list) ━━━
-- Employees who manage at least one person
SELECT name, department
FROM employees
WHERE employee_id IN (
SELECT DISTINCT manager_id
FROM employees
WHERE manager_id IS NOT NULL
);
━━━ TYPE 3 — EXISTS (short-circuits on first match) ━━━
-- Same result as above, often faster on large tables
SELECT e.name
FROM employees e
WHERE EXISTS (
SELECT 1
FROM employees sub
WHERE sub.manager_id = e.employee_id
);
━━━ TYPE 4 — Subquery in FROM (derived table) ━━━
-- Rank departments by headcount using a nested SELECT
SELECT ds.department, ds.headcount
FROM (
SELECT department,
COUNT(*) AS headcount
FROM employees
GROUP BY department
) AS ds -- ← derived table must be aliased
ORDER BY ds.headcount DESC;
━━━ TYPE 5 — Correlated subquery (references outer row) ━━━
-- Employees earning above THEIR OWN department average
-- The subquery re-runs for every row of the outer query
SELECT e.name, e.department, e.salary
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department = e.department -- ← refers to outer alias
)
ORDER BY e.department, e.salary DESC;
━━━ CTE — Common Table Expression (WITH clause) ━━━
-- Cleaner way to write the "above dept average" query
WITH dept_avg AS (
SELECT department,
AVG(salary) AS avg_sal
FROM employees
GROUP BY department
)
SELECT e.name, e.department, e.salary
FROM employees e
JOIN dept_avg d ON e.department = d.department
WHERE e.salary > d.avg_sal
ORDER BY e.department, e.salary DESC;
-- CTEs can be chained — each builds on the previous
WITH
dept_stats AS (
SELECT department, AVG(salary) AS avg_sal, COUNT(*) AS cnt
FROM employees GROUP BY department
),
large_depts AS (
SELECT * FROM dept_stats WHERE cnt >= 2
)
SELECT e.name, e.salary, ld.avg_sal
FROM employees e
JOIN large_depts ld ON e.department = ld.department
WHERE e.salary > ld.avg_sal;
Part 3 — Database Design & Normalization
Good schema design eliminates data anomalies — unintended side effects when inserting, updating, or deleting. Normalization is the systematic process for achieving this.
3.1 — Keys & Relationships
| Concept | Definition | Example |
|---|---|---|
| Primary Key (PK) | Uniquely identifies each row. Never NULL. | employee_id |
| Foreign Key (FK) | References a PK in another table. Enforces referential integrity. | manager_id → employee_id |
| Candidate Key | Any column(s) that could serve as PK. | email |
| Composite Key | PK made of two or more columns. | (student_id, course_id) |
| Surrogate Key | Artificial PK (auto-increment or UUID). | id SERIAL |
3.2 — The Three Normal Forms
1NF
Every column holds atomic values. No repeating groups or arrays in a cell. Rows are uniquely identified.
2NF
Meets 1NF. Every non-key column depends on the whole primary key. No partial dependencies.
3NF
Meets 2NF. No transitive dependencies — non-key columns depend directly on the PK, not through other non-key columns.
3.3 — Normalization Walkthrough
Start with this un-normalized orders table:
| order_id | customer_name | customer_city | product | price | qty |
|---|---|---|---|---|---|
| 1 | Alice | Paris | Laptop | 1200 | 1 |
| 1 | Alice | Paris | Mouse | 25 | 2 |
| 2 | Bob | Lyon | Keyboard | 75 | 1 |
Problems: Alice's city is duplicated (update anomaly). Deleting order 2 loses Bob's city (delete anomaly). Each fact should live in one place only.
-- 3NF: four clean, linked tables
CREATE TABLE customers (customer_id SERIAL PRIMARY KEY, name VARCHAR(100), city VARCHAR(100));
CREATE TABLE products (product_id SERIAL PRIMARY KEY, name VARCHAR(100), price NUMERIC(10,2));
CREATE TABLE orders (order_id SERIAL PRIMARY KEY, customer_id INT REFERENCES customers, order_date DATE);
CREATE TABLE order_items (
order_id INT REFERENCES orders,
product_id INT REFERENCES products,
quantity INT NOT NULL CHECK (quantity > 0),
PRIMARY KEY (order_id, product_id)
);
๐งช Practice on W3school
Self-paced exercises covering all three parts — RA, SQL step-by-step, and normalization.
→ Open W3school
Comments
Post a Comment