Skip to content
← SQL · beginner · 13 min · 02 / 10

Filtering, Grouping & Aggregation

Collapse many rows into summary answers with COUNT, SUM, GROUP BY, and HAVING.

group byaggregatehaving

From Rows to Summaries

So far every query returned individual rows. Often you want a summary instead: how many orders did we ship? What’s the average order value per customer? Aggregation answers these by collapsing many rows into one.

We’ll use an orders table:

idcustomer_idstatusamountcreated_at
110shipped49.002026-05-01
210shipped12.502026-05-03
320cancelled80.002026-05-04
420shipped99.992026-05-06

Aggregate Functions

An aggregate function takes a set of values and returns a single value:

FunctionReturns
COUNT(*)Number of rows
COUNT(col)Number of rows where col is not null
SUM(col)Total of all values
AVG(col)Mean
MIN(col) / MAX(col)Smallest / largest value
SELECT
  COUNT(*)        AS order_count,
  SUM(amount)     AS revenue,
  AVG(amount)     AS avg_order,
  MAX(amount)     AS biggest_order
FROM orders;

This returns exactly one row summarizing the whole table.

COUNT(*) vs COUNT(col). COUNT(*) counts rows regardless of nulls. COUNT(col) counts only rows where col is non-null — handy for “how many orders have a discount code”. And COUNT(DISTINCT col) counts distinct non-null values, e.g. COUNT(DISTINCT customer_id) gives the number of unique customers.

GROUP BY: Aggregating Per Category

A single grand total is rarely enough — you usually want one summary per group. GROUP BY splits rows into buckets and runs the aggregate within each:

SELECT
  customer_id,
  COUNT(*)    AS order_count,
  SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;

Result:

customer_idorder_counttotal_spent
10261.50
202179.99

The rule that catches everyone: every column in the SELECT list must either be inside an aggregate function or named in the GROUP BY. Otherwise the database can’t decide which value to show, since a group has many rows. This fails:

SELECT customer_id, status, SUM(amount)
FROM orders
GROUP BY customer_id;   -- ERROR: status must appear in GROUP BY

You can group by multiple columns to make finer buckets:

SELECT customer_id, status, SUM(amount) AS total
FROM orders
GROUP BY customer_id, status;

HAVING vs WHERE

You can’t filter on an aggregate with WHERE, because WHERE runs before grouping happens — at that point the aggregate doesn’t exist yet. HAVING is the filter that applies after grouping:

SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE status = 'shipped'        -- filter rows BEFORE grouping
GROUP BY customer_id
HAVING SUM(amount) > 50;        -- filter groups AFTER aggregating

Read it as a pipeline:

  1. WHERE status = 'shipped' throws away cancelled orders, row by row.
  2. GROUP BY customer_id buckets the survivors.
  3. HAVING SUM(amount) > 50 discards whole groups whose total is too small.

Put the cheap filter in WHERE. Filtering rows early (in WHERE) means fewer rows to group and aggregate, which is faster. Reserve HAVING for conditions that genuinely depend on an aggregate value. Writing WHERE amount > 50 and HAVING amount > 50 mean very different things.

The Logical Order of Execution

SQL clauses are written in one order but evaluated in another. Understanding the logical order explains nearly every “why can’t I reference that alias here?” question. The engine conceptually processes a query like this:

1. FROM      — pick the source tables, resolve joins
2. WHERE     — filter individual rows
3. GROUP BY  — collapse rows into groups
4. HAVING    — filter groups
5. SELECT    — compute output columns / aggregates, assign aliases
6. DISTINCT  — remove duplicate result rows
7. ORDER BY  — sort the result
8. LIMIT     — keep only the first N rows

Two consequences fall out of this:

  • You can’t use a SELECT alias in WHERE or GROUP BY, because SELECT runs after them. The alias doesn’t exist yet.

    SELECT amount * 0.9 AS discounted FROM orders
    WHERE discounted > 40;   -- ERROR: "discounted" unknown here
  • You can use a SELECT alias in ORDER BY, since sorting happens last:

    SELECT amount * 0.9 AS discounted FROM orders
    ORDER BY discounted DESC;   -- works fine

(PostgreSQL is lenient and also allows aliases in GROUP BY as a convenience, but the logical model above is the portable mental picture.)

Putting It Together

A realistic aggregation query touches most of these clauses at once:

SELECT
  customer_id,
  COUNT(*)            AS shipped_orders,
  SUM(amount)         AS revenue,
  ROUND(AVG(amount), 2) AS avg_order
FROM orders
WHERE status = 'shipped'
GROUP BY customer_id
HAVING SUM(amount) > 50
ORDER BY revenue DESC
LIMIT 10;

This reads as: of the shipped orders, group by customer, keep customers who spent more than 50, and show the top 10 by revenue. That single statement replaces what would be dozens of lines of imperative code — the declarative payoff of SQL.

Recap

Aggregates collapse rows; GROUP BY does it per category; HAVING filters the resulting groups while WHERE filters the input rows. Internalize the logical execution order and most surprising errors stop being surprising. Next we connect multiple tables together with joins.

Finished reading?

Mark complete to earn 10 XP