Joins and Other Query Types
Multi-Table queries with
Up to now, we've been working with a single table, but entity data in the real world is often broken down into pieces and stored across multiple orthogonal tables using a process known as normalization.
Tables that share information about a single entity need to have a primary key that identifies that entity uniquely across the database. One common primary key type is an auto-incrementing integer (because they are space efficient), but it can also be a string, hashed value, so long as it is unique.
JOIN clause in a query, we can combine row data across two separate tables using this unique key. The first of the joins that we will introduce is the
SELECT column, another_table_column, ... FROM mytable INNER JOIN another_table ON mytable.id = another_table.id WHERE condition(s) ORDER BY column, ... ASC/DESC LIMIT num_limit OFFSET num_offset;
INNER JOIN is a process that matches rows from the first table and the second table which have the same key (as defined by the
ON constraint) to create a result row with the combined columns from both tables. After the tables are joined, the other clauses we learned previously are then applied.
You may see queries where the
INNER JOIN is written simply as a
JOIN. These two are equivalent, but we will continue to refer to these joins as inner-joins because they make the query easier to read once you start using other types of joins.
Let's do some exercises where we look at multiple tables. We will use the Movies.csv file from previous examples and a new Boxoffice.csv file with additional movie information.
- Find the domestic and international sales for each movie:
SELECT title, domestic_sales, international_sales FROM movies JOIN boxoffice ON movies.id = boxoffice.movie_id;
- Show the sales numbers for each movie that did better internationally rather than domestically:
SELECT title, domestic_sales, international_sales FROM movies JOIN boxoffice ON movies.id = boxoffice.movie_id WHERE international_sales > domestic_sales;
- List all the movies by their ratings in descending order:
SELECT title, rating FROM movies JOIN boxoffice on movies.id = boxoffice.movie_id ORDER BY rating DESC;
Depending on how you want to analyze the data, the
INNER JOIN might not be sufficient because the resulting table only contains data that belongs in both of the tables.
If the two tables have asymmetric data, which can easily happen when data is entered in different stages, then we would have to use a
RIGHT JOIN, or
FULL JOIN instead to ensure that the data you need is not left out of the results.
SELECT column, another_column, ... FROM mytable INNER/LEFT/RIGHT/FULL JOIN another_table ON mytable.id = another_table.matching_id WHERE condition(s) ORDER BY column, ... ASC/DESC LIMIT num_limit OFFSET num_offset;
INNER JOINthese three new joins have to specify which column to join the data on.
When joining table A to table B, a
LEFT JOIN simply includes rows from A regardless of whether a matching row is found in B. the
RIGHT JOIN is the same, but reversed, keeping rows in B regardless of whether a match is found in A. Finally, a
FULL JOIN simply means that rows from both tables are kept, regardless of whether a matching row exists in the other table.
When using any of these new joins, you will likely have to write additional logic to deal with
NULLs in the result and constraints.
You might see queries with these joins written as
LEFT OUTER JOIN,
RIGHT OUTER JOIN, or
FULL OUTER JOIN, but the
OUTER keyword is really kept for SQL-92 compatibility and these queries are simply equivalent to
RIGHT JOIN, and
FULL JOIN respectively.
For the following exercises we will be using new tables. We will use a table which stores fictional data about Employees in the film studio and their assigned office Buildings. Some of the buildings are new, so they don't have any employees in them yet, but we need to find some information about them regardless.
Find the list of all buildings that have employees:
SELECT DISTINCT building FROM employees;
Find the list of all buildings and their capacity:
SELECT building_name, capacity FROM buildings;
List all buildings and the distinct employee roles in each building (including empty buildings):
SELECT DISTINCT building_name, role FROM buildings LEFT JOIN employees on building_name = building;
A short note on
It's always good to reduce the possibility of
NULL values in databases because they require special attention when constructing queries, constraints (certain functions behave differently with null values) and when processing the results.
An alternative to
NULL values in your database is to have data-type appropriate default values, like 0 for numerical data, empty strings for text data, etc. But if your database needs to store incomplete data, then
NULL values can be appropriate if the default values will skew later analysics (for example, when taking averages of numerical data).
Sometimes, it's also not possible to avoid
NULL values, as we saw in the last lesson when outer-joining two tables with asymmetric data. In these cases, you can test a column for
NULL values in a
WHERE clause by using either the
IS NULL or
IS NOT NULL constraint.
SELECT column, another_column, ... FROM mytable WHERE column IS/IS NOT NULL AND/OR another_condition AND/OR ...;
Find the name and role of all employees who have not been assigned to a building:
SELECT name, role FROM employees WHERE building IS NULL;
Find the names of the buildings that hold no employees:
SELECT DISTINCT building_name FROM buildings LEFT JOIN employees ON building_name = building WHERE role IS NULL;
Queries with Expressions
In addition to querying and referencing raw column data with SQL, you can also use expressions to write more complex logic on column calues in a query. These expressions can use mathematical and string functions along with basic arithmetic to transform values when the query is executed, as shown in this physics example:
SELECT particle_speed / 2.0 AS half_particle_speed FROM physics_data WHERE ABS(particle_position) * 10.0 > 500;
The use of expressions can save time and extra post-processing of the result data, but can also make the query harder to read, so we recommend that when expressions are used in the
SELECT part of the query, that they are also given a descriptive alias using the
SELECT col_expression AS expr_description, ... FROM mytable;
SELECT column AS better_column_name, ... FROM a_long_widgets_table_name AS mywidgets INNER JOIN widget_sales ON mywidgets.id = widgets_sales.widget_id;
BoxOfficedata into something easier to understand for the tasks below. We will use the Movies.csv and Boxoffice.csv files for the following exercise:
List all movies and their combines sales in millions of dollars:
SELECT title, (domestic_sales + international_sales) / 1000000 AS gross_sales_millions FROM movies JOIN boxoffice ON movies.id = boxoffice.movie_id;
List all movies and their ratings in percent:
SELECT title, rating * 10 AS rating_percent FROM movies JOIN boxoffice ON movies.id = boxoffice.movie_id;
List all movies that we released on even number years
SELECT title, year FROM movies WHERE year % 2 = 0;
Odd Vs. Even
Remember that to determine if a SQL value is even or odd, we use the module operator (
%), which returns the remainder after division of its operands. Since the remainder of an even number divided by 2 is always 0, and the remainder of an odd number divided by 2 is always 1 - this makes modulo an easy way to find even/odd numbers.
Queries with Aggregates
In addition to the simple expressions from above, SQL also supports the use of aggregate expressions (or functions) that allow you to summarize information about a group of rows of data. With the Pixar database we have been using, aggregate functions can be used to answer questions like, "How many movies has Pixar produced?", or "What is the highest grossing Pixar film each year?".
SELECT AGG_FUNC(column_or_expression) AS aggregate_description, ... FROM mytable WHERE constraint_expression;
Common Aggregate Functions
Below are some common aggregate functions that we will be using in fututre exercises:
| ||A common function used to count the number of rows in the group is no column name is specified. Otherwise, count the number of rows in the group with non-NULL values in the specified column.|
| ||Finds the smallest numerical value in the specified column for all rows in the group.|
| ||Finds the largest numerical value in the specified column for all rows in the group.|
| ||Finds the average numerical value in the specified column for all rows in the group.|
| ||Finds the sum of all numerical values in the specified column for the rows in the group.|
|Docs:||MySQL, Postgres, SQLite|