SQL (Structured Query Language) is a language designed to allow both technical and non-technical users query, manipulate, and transform data from a relational database. And due to its simplicity, SQL databases provide safe and scalable storage for millions of websites and mobile applications.
Often times, in order for us to build the most function website we can, we depend on a database to store information. Databases often are set up as tables with information in columns and rows (e.g. Excel Spreadsheets & Google Sheets).
For the following examples we will be using sqlite3, which is a SQL command line program, on the CS50 IDE. We will also be using the following .csv file which contains a database with data about some of Pixar's classic movies. Let's import this into the IDE. First drag and drop the file into the left side of the IDE, preferably a folder. Next, let's run the following commands:
.import "Movies.csv" movies
SELECT Queries 101
To retrieve data from a SQL database, we need to write
SELECT statements, which are often colloquially refered to as queries. A query itself is just a statement which declares what data we are looking for, where to find it in the database, and optionally, how to transform it before it is returned.
We can think of a table in SQL as a type of entity (i.e. dogs) and each row in that table as a specific instance of that type (i.e. pug, beagle, border collie, etc.). This means that the columns would then represent the common properties shared by all instances of that entity (i.e. color of fur, length of tail, etc.).
And given a table of data, the most basic query we could write would be one that selects for a couple columns (properties) of the table with all the rows (instances),
SELECT column, another_column, ... FROM mytable;
If we want to retrieve all the columns of data from a table, we can then use the asterisk (
*) shorthand in place of listing all the column names individually.
SELECT * from mytable;
Now, lets begin using the .csv file from above for some exercises:
- Find the
titleof each film:
SELECT title FROM movies;
- Find the
directorof each film:
SELECT director FROM movies;
- Find the
directorof each film:
SELECT title, director FROM movies;
- Find the
yearof each film:
SELECT title, year FROM movies;
- Find all information about each film:
SELECT * FROM movies;
Queries with Constraints
Now we know how to select for specific columns of data from a table, but if you had a table with a hundred million rows of data, reading through all the rows would be inefficient and perhaps even impossible.
In order to filter certain results from being returned, we need to use a
WHERE clause in the query. The clause is applied to each row of data by checking specific column values to determine whether it should be included in the results or not.
SELECT column, another_column, ... FROM mytable WHERE condition AND/OR another_condition AND/OR ...;
ORlogical keywords (i.e. num_wheels >= 4 AND doors <= 2). Below are some useful operators that you can use for numerical data (i.e. integer or floating point):
|=, !=, <, <=, >, >=||Standard numerical operators||col_name != 4|
|BETWEEN ... AND ...||Number is within a range of two values (inclusive)||col_name BETWEEN 1.5 AND 10.5|
|NOT BETWEEN ... AND ...||Number is not within range of two values (inclusive)||col_name NOT BETWEEN 1 AND 10|
|IN (...)||Number exists in a list||col_name IN (2, 4, 6)|
|NOT IN (...)||Number does not exist in a list||col_name NOT IN (1, 3, 5)|
In addition to making the results more manageable to understand, writing clauses to constrain the set of rows returned also allows the query to run faster due to the reduction in unnecessary data being returned.
As you may have noticed by now, SQL doesn't require you to write the keywords all capitalized, but as convention, it helps people distinguish SQL keywords from column and table names, and makes the query code easier to read.
Now let's do some exercises using the same .csv from above:
- Find the movie with a row
SELECT id, title FROM movies WHERE id = 6;
- Find the movies released in the
years between 2000 and 2010:
SELECT title, year FROM movies WHERE year BETWEEN 2000 AND 2010;
- Find the movies not released in the
years between 2000 and 2010:
SELECT title, year FROM movies WHERE year NOT BETWEEN 2000 AND 2010;
- Find the first 5 Pizar movies and their release
SELECT title, year FROM movies WHERE year <= 2003;
WHERE clauses with columns containing text data, SQL supports a number of useful operators to do things like case-insensitive string comparison and wildcard pattern matching. Below are a few common text-data specific operators:
|=||Case sensitive exact string comparison (notice the single equals)||col_name = "abc"|
|!= or <>||Case sensitive exact string inequality comparison||col_name != "abcd"|
|LIKE||Case insensitive exact string comparison||col_name LIKE "ABC"|
|NOT LIKE||Case insensitive exact string inequality comparison||col_name NOT LIKE "ABCD"|
|%||Used anywhere in a string to match a sequence of zero or more characters (only with LIKE or NOT LIKE)||col_name LIKE "%AT%" (matches "AT", "ATTIC", "CAT" or even "BATS")|
|_||Used anywhere in a string to match a single character (only with LIKE or NOT LIKE)||col_name LIKE "AN_" (matches "AND" but not "AN")|
|IN (...)||String exists in a list||col_name IN ("A", "B", "C")|
|NOT IN (...)||String does not exist in a list||col_name NOT IN ("D", "E", "F")|
All strings must be quoted so that the query parser can distinguish words in the string from SQL keywords.
We should not that while most database implementations are quite efficient when using these operators, full-text search is best left to dedicated libraries like Apache Lucene or Sphinx. These libraries are designed specifically to do full text search, and as a result are more efficient and can support a wider variety of search features including internationalization and advanced queries.
Now let's test these operators using the same .csv from above:
- Find all the Toy Story movies:
SELECT title, director FROM movies WHERE title LIKE "Toy Story%";
- Find all the movies directed by John Lasseter:
SELECT title, director FROM movies WHERE director LIKE "John Lasseter";
- Find all the movies (and director) not directed by John Lasseter:
SELECT title, director FROM movies WHERE director NOT LIKE "John Lasseter";
- Find all the WALL-* movies:
SELECT title FROM movies WHERE title LIKE "WALL-_";
Filtering and Sorting Query Results
Even though the data in a database may be unique, the results of any particular query may not be - take our Movies table for example, many different movies can be released the same year. In such cases, SQL provides a convenient way to discard rows that have a duplicate column value by using the
SELECT DISTINCT column, another_column, ... FROM mytable WHERE condition(s);
DISTINCTkeyword will blindly remove duplicate values, we will learn in a future lesson how to discard duplicated based on specific columns using grouping and the
Unlike our neatly ordered table in the last few lessons, most data in real databases are added in no particular column order. As a result, it can be difficult to read through and understand the results of a query as the size of a table increases to thousands or even millions of rows.
To help with this, SQL provides a way to sort your results by a given column in ascending or descending order using the
ORDER BY clause:
SELECT column, another_column, ... FROM mytable WHERE condition(s) ORDER BY column ASC/DESC;
ORDER BYclause is specified, each row is sorted alpha-numerically based on the specified column's value. In some databases, you can also specify a collation to better sort data containing international text.
Limiting Results to a Subset
Another clause which is commonly used with the
ORDER BY clause are the
OFFSET clauses, which are a useful optimization to indicate to the database the subset of the results you care about.
LIMIT will reduce the number of rows to return, and the optional
OFFSET will specify where to begin counting the number rows from:
SELECT column, another_column, ... FROM mytable WHERE condition(s) ORDER BY column ASC/DESC LIMIT num_limit OFFSET num_offset;
Order of Execution
OFFSET are applied relative to the other parts of a query, generally done last after other clauses have been aplied.
Now let's do some exercises using the previous .csv used above:
- List all directors of Pixar movies (alphabetically), without duplicates:
SELECT DISTINCT director FROM movies ORDER BY director ASC;
- List the last four Pixar movies released (ordered from most recent to last):
SELECT title FROM movies ORDER BY year DESC LIMIT 4;
- List the first 5 Pixar movies sorted alphabetically:
SELECT title FROM movies ORDER BY title ASC LIMIT 5;
- List the next 5 Pixar movies sorted alphabetically:
SELECT title FROM movies ORDER BY title ASC LIMIT 5 OFFSET 5;
Let's take a new .csv file containing a few of the most populous cities in North America and do some review exercises from what has been learned above:
- List all the Canadian cities and their populations:
SELECT city, population FROM north_american_cities WHERE country = "Canada";
- Order all the cities in the United States by their latitude from north to south:
SELECT city, latitude FROM north_american_cities WHERE country = "United States" ORDER BY latitude DESC;
- List all the cities west of Chicago, ordered from west to east:
SELECT city, longitude FROM north_american_cities WHERE longitude < -87.629798 ORDER BY longitude ASC;
- List the two largest cities in Mexico (by population):
SELECT city, population FROM north_american_cities WHERE country = "Mexico" ORDER BY population DESC LIMIT 2;
- List the third and fourth largest cities (by population) in the United States and their population:
SELECT city, population FROM north_american_cities WHERE country = "United States" ORDER BY population DESC LIMIT 2 OFFSET 2;