SQL Subqueries

Goals

Perform complex query logic using subqueries.

Concepts

correlated subquery
scalar subquery
subquery

Language

SQL

ALL
ANY
EXISTS
FROM
IN
UPDATE

Lesson

One of the most powerful aspects of SQL is that it allows you to use subqueries, which are literally queries that appear within the main query, surrounded by parentheses. Subqueries can themselves contain subqueries, and so on. There are two locations in which you would typically use a subquery: in the WHERE clause, where the subquery serves as data that plays a part in the restriction predicate; and in the FROM clause, where the subquery serves as a source of data, just like a table.

Like queries in general, a subquery technically returns a table. Depending on what the query requested, the result table may contain a single column with a single row, a single column with multiple rows, or a general table with multiple columns and possibly multiple rows. Which type of result is called for determines where the subquery can be placed as an expression in the main query.

Scalar Subqueries

Listing all items with under-average stock.

SELECT name, stock FROM Item
  WHERE stock <
    (SELECT AVG(stock) FROM Item);

`name`	`stock`
table	`20`
game table	`12`
chair	`15`
kite	`5`
doll	10

Finding the average item stock.

SELECT AVG(stock) FROM Item;

`avg`
`26.8333333333333333`

If a subquery returns a table with a single column containing a single row, it is referred to as a scalar subquery. This sort of subquery can be used where a single value would be expected in an expression.

By now you know that you could list the items with stock greater than 10 with the statement SELECT name FROM Item WHERE stock < 20. But if you wish to find the items that have a stock less than the average stock, you would have to somehow know the average stock ahead of time—unless you include a query to determine the average stock as a subquery within the statement itself! As you have seen, you can use SELECT AVG(stock) FROM Item to determine the average stock, so SELECT name FROM Item WHERE stock < (SELECT AVG(stock) FROM Item) would list only items that have less stock than the average. In this example the average is high because there are so many balloons in stock.

Multiple-Value Subqueries

`IN`

A subquery may be structured so as to potentially return multiple rows in a single column. The values in this column can be used in an expression where multiple values would be expected. You've studied the IN keyword, which checks if a given value occurs in a set of values, such as WHERE color IN ('red', 'green', 'blue'). A subquery that returns multiple rows of values in a single column can be used as the set of values, creating a dynamic expression.

Suppose you want to list the suppliers of items with stock under 20. Finding the supplier ID of items with stock less than 20 is not so different than the queries above: SELECT supplierId FROM Item WHERE stock < 20. As this query returns a single column of supplier IDs, you can use the result as a set of values in the predicate for restricting the list of suppliers: SELECT Supplier.name FROM Supplier WHERE Supplier.id IN (SELECT supplierId FROM Item WHERE stock < 20). The columns Supplier.name and Supplier.id are qualified for clarity; simply name and id could have been used.

The subquery SELECT supplierId FROM Item WHERE stock < 20, like many queries in SQL, can return multiple values; here there may be several items with stock under 20 which come from the same supplier. If you wanted to eliminate the duplicated supplier IDs, you could add DISTINCT, as in SELECT DISTINCT supplierId FROM Item WHERE stock < 20. But there is no need to eliminate these duplicates, because WHERE Supplier.id IN (…) merely checks to ensure that the supplier ID matches at least one of the supplier IDs in the subquery—it makes no difference to the end result if some supplier IDs are listed twice.

The same result could be obtained using a join rather than a subquery. SELECT DISTINCT Supplier.name FROM Item JOIN Supplier ON Item.supplierId = Supplier.id WHERE Item.stock < 20 would join the Item and Supplier tables, restricting the result to those items with stock under 20. Here DISTINCT is required; otherwise the join would produce a supplier for each included item, which might result in duplicate suppliers being listed. This example assumes that supplier names are unique, as the Supplier schema from earlier lessons requires. Otherwise the subquery version would be technically more correct, as it eliminates duplicates based on supplier ID, while the join version eliminates duplicates based upon supplier name.

`ANY`

The expression = ANY is equivalent to IN.

The IN operator implies that an equality comparison is being performed with each of the items returned by a subquery. The ANY operator is a more general form that allows any comparison to be compared against items, not just quality. You could for example find all items with stock less than the stock of any item (not necessarily all items) from Acme Furniture Company using SELECT name FROM Item WHERE Item.stock < ANY (SELECT stock FROM Item JOIN Supplier ON Item.supplierId = Supplier.id WHERE Supplier.name = 'Acme Furniture Company').

An equivalent result could be achieved using MAX() to have the subquery return a single value, comparing each item's stock against the maximum stock of Acme Furniture Company using

SELECT name FROM Item WHERE Item.stock < (SELECT MAX(stock) FROM Item JOIN Supplier ON Item.supplierId = Supplier.id WHERE Supplier.name = 'Acme Furniture Company')

. This query might be slightly more efficient because all the item stocks of Acme Furniture are only examined once, providing a single value to compare against all items.

`ALL`

A similar operator, ALL, allows you to create a predicate to ensure that a comparison holds against all the results of a subquery. As an example you could find all items with stock greater than the stock of all items from Acme Furniture Company using SELECT name FROM Item WHERE Item.stock > ALL (SELECT stock FROM Item JOIN Supplier ON Item.supplierId = Supplier.id WHERE Supplier.name = 'Acme Furniture Company').

An equivalent result could be achieved here as well, using MAX() to have the subquery return a single value, comparing each item's stock against the maximum stock of Acme Furniture Company using

SELECT name FROM Item WHERE Item.stock > (SELECT MAX(stock) FROM Item JOIN Supplier ON Item.supplierId = Supplier.id WHERE Supplier.name = 'Acme Furniture Company')

. This query might be slightly more efficient because all the item stocks of Acme Furniture are only examined once, providing a single value to compare against all items.

Multiple-Table Subqueries

TODO

Subqueries as Data Sources

The previous statements have used subqueries as predicates in the WHERE clause, providing information to use when restricting the returned rows. Because subqueries produce tables, those tables can themselves be used as a source of data in their own right. In the FROM clause a subquery acts like a table that existed in the original schema. The main difference is that the subquery table does not have a name. If the columns returned by the subquery are ambiguous, you will need to create an table name for subquery, as you would normally rename columns using the AS keyword. The Grouping section above already included an example of a subquery as a source of data.

Using a subquery to provide names for categories grouped by ID.

SELECT Category.name, CategoryGroup.stockSum
  FROM (
    SELECT categoryId, SUM(stock) AS stockSum
      FROM ItemCategory
      JOIN Item ON ItemCategory.itemCode = Item.code
      GROUP BY categoryId
  ) CategoryGroup
  JOIN Category ON CategoryGroup.categoryId = Category.id;

If you wanted to find the total number of items in stock in each category, you would need to join the Item table with the ItemCategory table (because items can be in multiple categories) and group by the item category ID: SELECT categoryId, SUM(stock) FROM ItemCategory JOIN Item ON ItemCategory.itemCode = Item.code GROUP BY categoryId. Listing the categories by category ID, however, is less than useful to users. You could do a three-table join with the Category table to find the category name, but because you are grouping on categoryId rather than Category.name, SQL will not let you include the Category.name column in the result, as it is neither a grouping column nor an aggregate function.

Instead you can consider the grouping result as a source of data, joining the entire result table with the Category table. To use the result you will need to rename the subquery result table so that you can refer to it in the outer query. You also need to rename the SUM(stock) column so that you can refer to it in the outer query as well. The request is shown in the future on the side.

Using AS is optional but recommended when renaming a column. However you cannot use AS to rename a table in the FROM clause. To give a table an alias to use in your query, just place the alias name after the query, as done with CategoryGroup above. Although some database products allow AS in the FROM clause, do not use this form because it is not standards-compliant and prevents your query from being executed on other products. See How to use the 'as' keyword to alias a table in Oracle?

Correlated Subqueries

The subqueries shown so far have been completely independent of the outer query in which it appears. These sort of subqueries can be performed once for the entire query; the result is then fed into the outer query. SQL also allows a subquery to reference to reference variables in the outer query. In such a correlated subquery the subquery is dependent on (“correlated with”) the outer query.

You already saw that you can list all items with under the average stock using SELECT name FROM Item WHERE stock < (SELECT AVG(stock) FROM Item). Because the subquery SELECT AVG(stock) FROM Item requests the average stock of all items, the database only needs to execute the query once; the same resulting average can be used in the comparison for each row in the outer query.

Listing all items with stock under supplier average.

SELECT name, stock
  FROM Item CurrentItem
  WHERE stock <
    (SELECT AVG(stock)
      FROM Item
      WHERE Item.supplierId = CurrentItem.supplierId);

`name`	`stock`
game table	`12`
kite	`5`
doll	`10`
chair	15

Suppose however that you wanted to list all items each of which was under the average stock of all items from its supplier. The subquery would need to calculate the average stock of that item's supplier, and to do this it would have to reference the item from the outer query to be able to find the supplier. Because many times both the inner query and the outer query reference the same tables, when crafting a correlated subquery you will normally need to use an alias one one or more tables in the outer query for the inner table to refer to.

Assuming that the outer query continues to reference the Item table but using the alias CurrentItem, the inner query could calculate the average stock for that item's supplier: SELECT AVG(stock) FROM Item WHERE Item.supplierId = CurrentItem.supplierId. The logic of the outer query would stay the same, as shown in the figure. Only the restriction predicate, the subquery, has changed to be dependent on the outer query rather than independent.

Subqueries in the FROM clause cannot be correlated, because they are executed a single time to serve as an source of data, unchanging for the rest of the query.

`EXISTS`

A common operator used with correlated subqueries is EXISTS, which simply checks to see that a result table has at least one row (i.e. the result “set” is not empty). If you want to list all items in the furniture category, could use a query to determine the furniture category ID: SELECT id FROM Category WHERE Category.name = 'furniture'. You could then use this query as a subquery inside a correlated subquery to only return items that match the same category: SELECT * FROM ItemCategory WHERE ItemCategory.categoryId = (SELECT id FROM Category WHERE Category.name = 'furniture'). Lastly the outer query would list all items for which there exist an item category association that matches that item and the furniture category ID: SELECT name FROM Item CurrentItem WHERE EXISTS (SELECT * FROM ItemCategory WHERE ItemCategory.itemCode = CurrentItem.code AND ItemCategory.categoryId = (SELECT id FROM Category WHERE Category.name = 'furniture')).

The EXISTS operator is one of the few circumstances in which you would want to use SELECT * in a production query. The database should be smart enough to realize that you are simply wanting to check for the existence of a row, and don't actually care which columns are returned.

In most cases a query using the EXISTS operator can be written to use an INNER JOIN—which may even be more efficient, if the database is not clever enough to realize convert one to the other automatically. You might already have realized that asking for all items the furniture category can be written in a much simpler fashion:

SELECT name FROM Item INNER JOIN ItemCategory ON ItemCategory.itemCode = Item.code WHERE ItemCategory.categoryId = (SELECT id FROM Category WHERE Category.name = 'furniture')

Correlated Updates

Ordering the maximum stock an item's supplier allows.

UPDATE Item
  SET stock = stock +
    (SELECT maxStockOrder
      FROM Supplier
      WHERE Item.supplierId = Supplier.id);

Correlated queries are indispensable when updating columns with values that depend on other tables. Suppose that you wanted to increase the stock off all the items by the same amount, such as 50. As you've seen this is possible using a simple update query: UPDATE Item SET stock = stock + 50.

But consider if each supplier were to indicate a maximum stock order, valid for for all the items provided by that supplier. You wish to increase the stock of each item by the maximum the supplier allows items. For each item you would need to know the item's supplier in order to find out the supplier's maximum stock order. You would use a correlated subquery to find the maxStockOrder for in the Supplier table, for the supplier of each item you update, as shown in the figure.

Review

Gotchas

You can provide an alias for a table in the FROM clause, but SQL does not provide for using AS in this location. You must leave out the AS keyword from the FROM clause for your query to work across database products.

In the Real World

Correlated subqueries are very useful for moving data from one column or table to another within one UPDATE statement, when updating a database schema for example.

Think About It

Many subqueries can be written equivalently as joins. Which approach is easier to read? Should one form be more efficient than the other? What are the benefits and downsides of each approach?

Self Evaluation

In which SQL statement clauses are you likely to see a subquery? Which SQL clause cannot contain a correlated subquery?
What relational algebra operation can be used to rewrite and simplify a correlated query that uses EXISTS?

Task

Add and implement a method to PublicationRepository for retrieving all books with more stock than the average stock for that book's publication year.

References

Acknowledgments

Joe Celko's SQL for Smarties, Fifth Edition (Joe Celko - Morgan Kaufmann, 2014)
SQL:1999 (Jim Melton, Alan Simon - Morgan Kaufmann, 2001)
SQL Antipatterns (Bill Karwin, The Pragmatic Bookshelf, 2010)
SQL in a Nutshell, Third Edition (Kevin Kline - O'Reilly, 2008)
Some symbols are from Font Awesome by Dave Gandy.

SQL Subqueries

Goals

Concepts

Language

SQL

Lesson

Scalar Subqueries

Multiple-Value Subqueries

IN

ANY

ALL

Multiple-Table Subqueries

Subqueries as Data Sources

Correlated Subqueries

EXISTS

Correlated Updates

Review

Gotchas

In the Real World

Think About It

Self Evaluation

Task

See Also

References

Acknowledgments

`IN`

`ANY`

`ALL`

`EXISTS`