SQL Subqueries
Goals
- Perform complex query logic using subqueries.
Concepts
- correlated subquery
- scalar subquery
- subquery
Language
SQL
ALL
ANY
EXISTS
FROM
IN
UPDATE
Lesson
One of the most powerful aspects of SQL is that it allows you to use subqueries, which are literally queries that appear within the main query, surrounded by parentheses. Subqueries can themselves contain subqueries, and so on. There are two locations in which you would typically use a subquery: in the WHERE
clause, where the subquery serves as data that plays a part in the restriction predicate; and in the FROM
clause, where the subquery serves as a source of data, just like a table.
Like queries in general, a subquery technically returns a table. Depending on what the query requested, the result table may contain a single column with a single row, a single column with multiple rows, or a general table with multiple columns and possibly multiple rows. Which type of result is called for determines where the subquery can be placed as an expression in the main query.
Scalar Subqueries
SELECT name, stock FROM Item
WHERE stock <
(SELECT AVG(stock) FROM Item);
name | stock |
---|---|
table | 20 |
game table | 12 |
chair | 15 |
kite | 5 |
doll | 10 |
SELECT AVG(stock) FROM Item;
avg |
---|
26.8333333333333333 |
If a subquery returns a table with a single column containing a single row, it is referred to as a scalar subquery. This sort of subquery can be used where a single value would be expected in an expression.
By now you know that you could list the items with stock greater than 10 with the statement SELECT name FROM Item WHERE stock < 20
. But if you wish to find the items that have a stock less than the average stock, you would have to somehow know the average stock ahead of time—unless you include a query to determine the average stock as a subquery within the statement itself! As you have seen, you can use SELECT AVG(stock) FROM Item
to determine the average stock, so SELECT name FROM Item WHERE stock < (SELECT AVG(stock) FROM Item)
would list only items that have less stock than the average. In this example the average is high because there are so many balloons in stock.
Multiple-Value Subqueries
IN
A subquery may be structured so as to potentially return multiple rows in a single column. The values in this column can be used in an expression where multiple values would be expected. You've studied the IN
keyword, which checks if a given value occurs in a set of values, such as WHERE color IN ('red', 'green', 'blue')
. A subquery that returns multiple rows of values in a single column can be used as the set of values, creating a dynamic expression.
Suppose you want to list the suppliers of items with stock under 20. Finding the supplier ID of items with stock less than 20 is not so different than the queries above: SELECT supplierId FROM Item WHERE stock < 20
. As this query returns a single column of supplier IDs, you can use the result as a set of values in the predicate for restricting the list of suppliers: SELECT Supplier.name FROM Supplier WHERE Supplier.id IN (SELECT supplierId FROM Item WHERE stock < 20)
. The columns Supplier.name
and Supplier.id
are qualified for clarity; simply name
and id
could have been used.
ANY
The IN
operator implies that an equality comparison is being performed with each of the items returned by a subquery. The ANY
operator is a more general form that allows any comparison to be compared against items, not just quality. You could for example find all items with stock less than the stock of any item (not necessarily all items) from Acme Furniture Company using SELECT name FROM Item WHERE Item.stock < ANY (SELECT stock FROM Item JOIN Supplier ON Item.supplierId = Supplier.id WHERE Supplier.name = 'Acme Furniture Company')
.
ALL
A similar operator, ALL
, allows you to create a predicate to ensure that a comparison holds against all the results of a subquery. As an example you could find all items with stock greater than the stock of all items from Acme Furniture Company using SELECT name FROM Item WHERE Item.stock > ALL (SELECT stock FROM Item JOIN Supplier ON Item.supplierId = Supplier.id WHERE Supplier.name = 'Acme Furniture Company')
.
Multiple-Table Subqueries
TODO
Subqueries as Data Sources
The previous statements have used subqueries as predicates in the WHERE
clause, providing information to use when restricting the returned rows. Because subqueries produce tables, those tables can themselves be used as a source of data in their own right. In the FROM
clause a subquery acts like a table that existed in the original schema. The main difference is that the subquery table does not have a name. If the columns returned by the subquery are ambiguous, you will need to create an table name for subquery, as you would normally rename columns using the AS
keyword. The Grouping section above already included an example of a subquery as a source of data.
If you wanted to find the total number of items in stock in each category, you would need to join the Item
table with the ItemCategory
table (because items can be in multiple categories) and group by the item category ID: SELECT categoryId, SUM(stock) FROM ItemCategory JOIN Item ON ItemCategory.itemCode = Item.code GROUP BY categoryId
. Listing the categories by category ID, however, is less than useful to users. You could do a three-table join with the Category
table to find the category name, but because you are grouping on categoryId
rather than Category.name
, SQL will not let you include the Category.name
column in the result, as it is neither a grouping column nor an aggregate function.
Instead you can consider the grouping result as a source of data, joining the entire result table with the Category
table. To use the result you will need to rename the subquery result table so that you can refer to it in the outer query. You also need to rename the SUM(stock)
column so that you can refer to it in the outer query as well. The request is shown in the future on the side.
Correlated Subqueries
The subqueries shown so far have been completely independent of the outer query in which it appears. These sort of subqueries can be performed once for the entire query; the result is then fed into the outer query. SQL also allows a subquery to reference to reference variables in the outer query. In such a correlated subquery the subquery is dependent on (“correlated with”) the outer query.
You already saw that you can list all items with under the average stock using SELECT name FROM Item WHERE stock < (SELECT AVG(stock) FROM Item)
. Because the subquery SELECT AVG(stock) FROM Item
requests the average stock of all items, the database only needs to execute the query once; the same resulting average can be used in the comparison for each row in the outer query.
SELECT name, stock
FROM Item CurrentItem
WHERE stock <
(SELECT AVG(stock)
FROM Item
WHERE Item.supplierId = CurrentItem.supplierId);
name | stock |
---|---|
game table | 12 |
kite | 5 |
doll | 10 |
chair | 15 |
Suppose however that you wanted to list all items each of which was under the average stock of all items from its supplier. The subquery would need to calculate the average stock of that item's supplier, and to do this it would have to reference the item from the outer query to be able to find the supplier. Because many times both the inner query and the outer query reference the same tables, when crafting a correlated subquery you will normally need to use an alias one one or more tables in the outer query for the inner table to refer to.
Assuming that the outer query continues to reference the Item
table but using the alias CurrentItem
, the inner query could calculate the average stock for that item's supplier: SELECT AVG(stock) FROM Item WHERE Item.supplierId = CurrentItem.supplierId
. The logic of the outer query would stay the same, as shown in the figure. Only the restriction predicate, the subquery, has changed to be dependent on the outer query rather than independent.
EXISTS
A common operator used with correlated subqueries is EXISTS
, which simply checks to see that a result table has at least one row (i.e. the result “set” is not empty). If you want to list all items in the furniture category, could use a query to determine the furniture category ID: SELECT id FROM Category WHERE Category.name = 'furniture'
. You could then use this query as a subquery inside a correlated subquery to only return items that match the same category: SELECT * FROM ItemCategory WHERE ItemCategory.categoryId = (SELECT id FROM Category WHERE Category.name = 'furniture')
. Lastly the outer query would list all items for which there exist an item category association that matches that item and the furniture category ID: SELECT name FROM Item CurrentItem WHERE EXISTS (SELECT * FROM ItemCategory WHERE ItemCategory.itemCode = CurrentItem.code AND ItemCategory.categoryId = (SELECT id FROM Category WHERE Category.name = 'furniture'))
.
Correlated Updates
Correlated queries are indispensable when updating columns with values that depend on other tables. Suppose that you wanted to increase the stock off all the items by the same amount, such as 50. As you've seen this is possible using a simple update query: UPDATE Item SET stock = stock + 50
.
But consider if each supplier were to indicate a maximum stock order, valid for for all the items provided by that supplier. You wish to increase the stock of each item by the maximum the supplier allows items. For each item you would need to know the item's supplier in order to find out the supplier's maximum stock order. You would use a correlated subquery to find the maxStockOrder
for in the Supplier
table, for the supplier of each item you update, as shown in the figure.
Review
Gotchas
- You can provide an alias for a table in the
FROM
clause, but SQL does not provide for usingAS
in this location. You must leave out theAS
keyword from the FROM clause for your query to work across database products.
In the Real World
- Correlated subqueries are very useful for moving data from one column or table to another within one
UPDATE
statement, when updating a database schema for example.
Think About It
- Many subqueries can be written equivalently as joins. Which approach is easier to read? Should one form be more efficient than the other? What are the benefits and downsides of each approach?
Self Evaluation
- In which SQL statement clauses are you likely to see a subquery? Which SQL clause cannot contain a correlated subquery?
- What relational algebra operation can be used to rewrite and simplify a correlated query that uses
EXISTS
?
Task
- Add and implement a method to
PublicationRepository
for retrieving all books with more stock than the average stock for that book's publication year.
See Also
- Learning SQL, Second Edition (Alan Beaulieu - O'Reilly, 2009)
- Subqueries in SQL (YouTube - Dave Sullivan)
- Correlated Sub Queries Quickly (YouTube - Joes2Pros SQL Trainings) This video uses
AS
in theFROM
clause, which works on SQL Server but is not standard SQL and will not work on all database products. Simply leave off the AS keyword in the FROM clause, and the examples will be standard SQL and work on most database products.
References
Acknowledgments
- Joe Celko's SQL for Smarties, Fifth Edition (Joe Celko - Morgan Kaufmann, 2014)
- SQL:1999 (Jim Melton, Alan Simon - Morgan Kaufmann, 2001)
- SQL Antipatterns (Bill Karwin, The Pragmatic Bookshelf, 2010)
- SQL in a Nutshell, Third Edition (Kevin Kline - O'Reilly, 2008)
- Some symbols are from Font Awesome by Dave Gandy.