<p><em>Joseph Clark (joeclark.svbtle.com), 2017-11-02</em></p>
<h1>Approaching a dataset with visualization</h1>
<p>I’d like to give my students some simple guidelines for how to use data visualization to look at a new dataset. What to do first, second, and so on. Here’s what I’m going to suggest.</p>
<h1 id="examine-individual-variables_1">Examine individual variables <a class="head_anchor" href="#examine-individual-variables_1" rel="nofollow">#</a>
</h1>
<p>First, take one variable at a time. Which are the most important ones, considering the audience and the purpose of your work? What are the mean, median, and mode? Accordingly, your first <em>visualizations</em> may be histograms or box-and-whisker plots, maybe Pareto diagrams. These go beyond the statistics by showing us the overall “shape” of the distributions, revealing things like Normal distributions, skewness, and fat or thin tails.</p>
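<p>To make this concrete, the one-variable summary can be computed before any plotting. A minimal sketch using Python’s standard <code class="prettyprint">statistics</code> module (the <code class="prettyprint">ages</code> data is invented for illustration):</p>
<pre><code class="prettyprint">import statistics

# A single variable from a hypothetical dataset: respondent ages
ages = [23, 25, 25, 31, 34, 35, 35, 35, 41, 52]

# The three classic measures of central tendency
print("mean:  ", statistics.mean(ages))     # 33.6
print("median:", statistics.median(ages))   # 34.5
print("mode:  ", statistics.mode(ages))     # 35

# A crude text histogram hints at the "shape" a real histogram would show
for lo in range(20, 60, 10):
    count = sum(lo <= a < lo + 10 for a in ages)
    print(f"{lo}-{lo+9}: {'#' * count}")
</code></pre>
<p>Mean, median, and mode agree here only loosely; a histogram would immediately show the mild right skew that explains the gap.</p>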
<h1 id="compare-subsets-of-the-data-on-single-variabl_1">Compare subsets of the data on single variables <a class="head_anchor" href="#compare-subsets-of-the-data-on-single-variabl_1" rel="nofollow">#</a>
</h1>
<p>Once you have a sense of how the data is distributed overall, you can begin slicing and dicing it by some categorical dimension(s). This can be as simple as a bar chart comparing a single statistic across categories, or it can be a small-multiples diagram that repeats a histogram once for each subset of the data. <strong>Comparison</strong> is the name of the game here. How do men differ from women, or how does Canada differ from the UK, or how does 2016 differ from 2017, on your key variable?</p>
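<p>As a sketch of the idea, here is how you might compute the per-category statistic behind such a bar chart in plain Python (the country/satisfaction data is made up):</p>
<pre><code class="prettyprint">from collections import defaultdict

# Hypothetical rows: (country, satisfaction_score)
rows = [("Canada", 7.0), ("Canada", 9.0), ("UK", 6.0), ("UK", 8.0), ("UK", 7.0)]

# Group the single variable by a categorical dimension...
by_country = defaultdict(list)
for country, score in rows:
    by_country[country].append(score)

# ...then compare one statistic across categories, as a bar chart would
for country, scores in sorted(by_country.items()):
    mean = sum(scores) / len(scores)
    print(f"{country:6s} {'#' * round(mean)} {mean:.1f}")
</code></pre>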
<h1 id="changes-over-time_1">Changes over time <a class="head_anchor" href="#changes-over-time_1" rel="nofollow">#</a>
</h1>
<p>Line graphs, particularly with <em>time</em> as the X axis, are meaningful to us perhaps because storytelling is a natural mode of human thought. They are easy to generate in software like Excel or Tableau, and they transform a single point metric into a story of rise, fall, and seasonality. These are particularly enriching visualizations if your data has a time dimension.</p>
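<p>One common step in preparing such a graph is smoothing: a trailing moving average quiets the noise so rise, fall, and seasonality stand out. A small sketch with invented monthly figures:</p>
<pre><code class="prettyprint"># Hypothetical monthly sales with a seasonal bump in the final month
sales = [100, 102, 98, 105, 103, 107, 110, 108, 112, 115, 118, 160]

# A trailing 3-month moving average smooths noise so the trend is clearer
window = 3
moving_avg = [sum(sales[i - window + 1:i + 1]) / window
              for i in range(window - 1, len(sales))]
print(moving_avg)
</code></pre>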
<h1 id="relationships-between-continuous-variables_1">Relationships between continuous variables <a class="head_anchor" href="#relationships-between-continuous-variables_1" rel="nofollow">#</a>
</h1>
<p>As we begin to get a handle on the data one variable at a time, we may begin to form theories to explain or predict it, theories that can be represented as relationships between variables. Is television viewing related to socioeconomic status? Are Red Sox wins correlated with the weather on game day? Does student satisfaction depend on class size? The classic data visualization to compare two such variables is the XY scatterplot, with each data point plotted on both dimensions. Lines or curves that indicate significant relationships can be seen, if they exist, and so can outliers.</p>
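<p>The strength of such a relationship can be quantified alongside the scatterplot with Pearson’s correlation coefficient. A hand-rolled sketch (the paired data is hypothetical):</p>
<pre><code class="prettyprint">import math

# Hypothetical paired observations: class size vs. student satisfaction
x = [10, 20, 30, 40, 50]
y = [9.0, 8.5, 7.0, 6.5, 5.0]

# Pearson's r quantifies the linear relationship a scatterplot would show
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
print(round(r, 3))
</code></pre>
<p>A strongly negative r here would support the theory that satisfaction falls as class size grows, though the scatterplot itself is still worth drawing to catch curvature and outliers that a single coefficient hides.</p>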
<p>Incidentally, the list above could also serve as the outline to a data visualization term paper. I’m just sayin’…</p>
<h1>Five reasons to try API-first development</h1>
<p><em>2017-09-20</em></p>
<p>I confess that I’m a data geek. When I first discovered relational databases in <a href="https://web.archive.org/web/20000815065217/http://hotwired.lycos.com/webmonkey/backend/databases/tutorials/tutorial3.html" rel="nofollow">this tutorial by Jay Greenspan</a> over 15 years ago, I began to see the world in <a href="https://en.wikipedia.org/wiki/Third_normal_form" rel="nofollow">third normal form</a>. When I learned about RESTful web services from Michael Mahemoff’s <a href="http://amzn.to/1KuQSLU" rel="nofollow">Ajax Design Patterns</a>, I started to think about applications in terms of <a href="https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol" rel="nofollow">requests and responses</a>. I love discovering well-crafted JSON web services and showing them to my students, and so the idea of creating a public API to let other people obtain data from one of my databases or applications is not at all alien to me.</p>
<p>But it never occurred to me to build the API first.</p>
<p>I got the idea while developing a data engineering class. I wanted to emphasize to my students that analytics is about <em>building data products for others to use</em>, and that before we got into the nitty-gritty of using various types of databases, they should think about what operations their users would need to perform on the data, and what inputs and outputs the systems ought to have. As a hands-on lab, I had them develop mock APIs using Python and Flask: essentially, they were using a web development framework but serving up JSON instead of web pages. Offhand, I told them that if they were working with front-end developers, the latter could go ahead and build web or mobile apps to read and write data via the API, even though the APIs were entirely fake at this point, just serving up sample data. (We’ll connect them to real databases in a couple of weeks.)</p>
<p>Then I realized — I’m an app developer, and this would work for me, too!</p>
<h1 id="apifirst-development_1">API-first development <a class="head_anchor" href="#apifirst-development_1" rel="nofollow">#</a>
</h1>
<p>The core idea of API-first development is that rather than building an API for others to access your data as an afterthought to building a system, you build your application from the beginning as an API that describes all of your application’s operations on data. <em>Then</em> you begin to build the front-end while implementing real back-end functions in place of the initial mock-ups. It turns out I’m not the first to think of this; see <a href="https://lob.com/blog/api-first-development-makes-lob-more-productive" rel="nofollow">here</a> and <a href="http://www.codegent.com/blog/2013/8/the-value-of-api-driven-development" rel="nofollow">here</a>, for example.</p>
<p>Specifically, if we’re talking about a <a href="https://en.wikipedia.org/wiki/Representational_state_transfer" rel="nofollow">RESTful</a> web service API, this means setting up a bunch of URL endpoints and HTTP methods, one for each use case. Mock these up by putting fake data at each URL endpoint. Code like this, for example, would serve to fake an API endpoint in Python’s Flask framework and could be deployed right to Heroku:</p>
<pre><code class="prettyprint">import os
from flask import Flask, jsonify

app = Flask(__name__)

# listing all items
@app.route("/api/items", methods=['GET'])
def list_items():
    fake_data = { "items": [ {"name":"Widget","id":42,"price":49.95},
                             {"name":"Doodad","id":43,"price":19.95},
                             {"name":"Gizmo","id":44,"price":99.95} ] }
    return jsonify(fake_data)

if __name__ == "__main__":
    port = int(os.environ.get("PORT", 5000))
    app.run(host='0.0.0.0', port=port)
</code></pre>
<p>Once you had this running, you could then split into two groups. The front-end developers working on the web or mobile applications could build their pages or screens to display the data they consume from the API. They shouldn’t care that the data is fake, as long as there’s enough of it to effectively test their designs. The back-end developers could then take the API one endpoint at a time and replace the temporary code with real business logic that queries the real database.</p>
<h1 id="five-reasons_1">Five Reasons <a class="head_anchor" href="#five-reasons_1" rel="nofollow">#</a>
</h1>
<p>I have begun to encourage the students in my information systems capstone class, who are all engaged in developing mobile and web apps, to adopt API-first development, and I am trying it out in my own startup as well. Here are five benefits I expect to get from it:</p>
<h2 id="1-it-focuses-you-on-use-cases-and-data-proces_2">#1: It focuses you on use cases and data processing <a class="head_anchor" href="#1-it-focuses-you-on-use-cases-and-data-proces_2" rel="nofollow">#</a>
</h2>
<p>Notice the inversion of the usual way we develop data-driven apps. Since I started designing web sites in 1995, it has always been about the front-end first. We think of our architecture in terms of “pages” or “screens”. Data and logic are added later with PHP, JavaScript, or other code inserted somewhere, literally subordinate to the <code class="prettyprint"><HTML></code> tag of a page. As a result, we cannot implement or test the business logic until the pages are mostly complete. Moreover, it distorts our thinking. Pages and screens don’t correspond exactly to use cases or to data entities, so front-end-first design may prevent us from figuring out our true requirements until well into a project.</p>
<p>If we design the API first, however, we are forced to think of our application instead as a system that performs operations on data. We have to consider the entities and relationships in our data model, enumerate the use cases, and document the inputs and outputs to each type of operation. Developers should get a clear picture of how many pieces of code they will need to implement, and they get it at an early stage.</p>
<h2 id="2-it-becomes-effortless-documentation_2">#2 It becomes effortless documentation <a class="head_anchor" href="#2-it-becomes-effortless-documentation_2" rel="nofollow">#</a>
</h2>
<p>When you design the API first, the plan of implementation itself describes URL endpoints, HTTP methods, and specific data structures for inputs and outputs, which can later be recycled as an instruction manual or reference for future development. Much like Tom Preston-Werner’s <a href="http://tom.preston-werner.com/2010/08/23/readme-driven-development.html" rel="nofollow">README-driven development</a>, “you’re giving yourself a chance to think through the project without the overhead of having to change code every time you change your mind about how something should be organized”.</p>
<p>Drafting the API documentation right out of the gate gives you an artifact that other developers can comment on and coordinate their own work with. Front-end developers get sample JSON code to work with, and database developers know what they’re trying to implement. Writing the instruction manual first also helps to prevent “feature creep” in which developers add “nice to have” functionality to a product that didn’t really need to be there.</p>
<h2 id="3-it-guides-databasemodeling_2">#3 It guides database modeling <a class="head_anchor" href="#3-it-guides-databasemodeling_2" rel="nofollow">#</a>
</h2>
<p>A well-designed API is expressed as a set of URL endpoints that locate “resources” in the application’s domain. If they follow good REST principles, these URLs happen to look a lot like a folder structure. For example, a few endpoints for a Twitter-like application might include:</p>
<pre><code class="prettyprint">/api/users/<user_id>
/api/users/<user_id>/tweets
/api/users/<user_id>/followers
/api/tweets/<tweet_id>
/api/tweets/<tweet_id>/author
</code></pre>
<p>When you think about it, you realize that this “folder structure” could never exist on a real disk: it has tweets stored under the users, and users stored under their tweets. Even more oddly, users seem to be stored under the other users that they’re following. That’s not hard to implement, though, because modern web frameworks allow us to easily define these kinds of URL structures that don’t correspond to the actual locations of data on disk.</p>
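<p>To see how a framework can serve URL structures that never exist on disk, here is a toy sketch of pattern-based routing in plain Python. Real frameworks like Flask do this with far more machinery, and the handler names here are invented:</p>
<pre><code class="prettyprint">import re

# A toy router: each pattern maps a "virtual" URL path to a handler name.
routes = [
    (r"^/api/users/(?P<user_id>\d+)$",        "get_user"),
    (r"^/api/users/(?P<user_id>\d+)/tweets$", "list_user_tweets"),
    (r"^/api/tweets/(?P<tweet_id>\d+)$",      "get_tweet"),
]

def resolve(path):
    """Return (handler_name, captured_ids) for the first matching pattern."""
    for pattern, handler in routes:
        m = re.match(pattern, path)
        if m:
            return handler, m.groupdict()
    return None, {}

print(resolve("/api/users/42/tweets"))  # ('list_user_tweets', {'user_id': '42'})
</code></pre>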
<p>What’s happening here is actually a sort of <a href="https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model" rel="nofollow">entity-relationship modeling</a>! The simulated URL structure shows that each user can publish multiple tweets, but that each tweet has only one author: a good old-fashioned one-to-many relationship. And each user can be a follower of other users, and followed by any number of users: many-to-many.</p>
<p><a href="https://svbtleusercontent.com/cfpreskdedyi0g.png" rel="nofollow"><img src="https://svbtleusercontent.com/cfpreskdedyi0g_small.png" alt="ERD.png"></a></p>
<p>The exercise of dreaming up URL endpoints for all resources in the application can serve as a valuable precursor to database design, as these can be intuitively translated into an <em>implied</em> entity-relationship diagram. In particular, I think it would be a very valuable exercise for those using NoSQL back ends, since NoSQL doesn’t otherwise force developers to think this way about their data models.</p>
<h2 id="4-it-allows-you-topivot_2">#4 It allows you to pivot. <a class="head_anchor" href="#4-it-allows-you-topivot_2" rel="nofollow">#</a>
</h2>
<p>One of the key reasons I’m using API-first development in my startup is the Lean Startup notion of the “pivot”. We’ve got several hypotheses about our value proposition, and I anticipate evolving the product to try out several of them. We might even run several versions at once under different domain names and front-end designs. But I don’t want to re-write all of the SQL and business logic each time, especially not for very similar sites.</p>
<p>By developing the API for the back-end first, I can plug and play different kinds of front ends, even running them at the same time. Multiple businesses could in fact be built on the same set of data services I’m developing, and it’s not inconceivable that we could sell one of them and keep developing others, or rent out the back end to other startups.</p>
<p>My students, too, often switch technology in the middle of CIS 440. It’s a very common phenomenon that they set out to make a native mobile app and end up settling on a mobile-responsive web application instead, or set out to learn a new language (e.g. Java) and do their project with it, only to find it too challenging and switch later to a tool they know, such as PHP. With an API designed first, they could switch out <em>either</em> the back-end, the front-end, or both, without derailing their projects.</p>
<h2 id="5-its-more-compatible-withagile_2">#5 It’s more compatible with Agile <a class="head_anchor" href="#5-its-more-compatible-withagile_2" rel="nofollow">#</a>
</h2>
<p>Agile approaches vary, but they all have one thing in common: strict prioritization of requirements. Feature A needs to be done before Feature B and Feature B needs to be done before Feature C. But, what if the homepage needs twenty different buttons or other controls? It’s not always easy to figure out which one to do first.</p>
<p>Without the distraction of a front-end, developers and product owners will be working with a list of CRUD operations on data entities (resources). These must by necessity be atomic, and if listed in any kind of order, they amount to a strict prioritization of the API development backlog. Developers can begin coding them without having to adjust them due to later front-end changes, and they have no excuse to stand around waiting in between releases of the front end.</p>
<p>Researchers have shown that when businesses operate in chaotic environments, they succeed by embracing constrained flexibility, that is, allowing experimentation and adaptation but controlling it with simple rules. (cf. <a href="http://www.jstor.org/stable/2393807" rel="nofollow">Brown & Eisenhardt, 1997</a>). Light structure outperforms heavy structure, and also outperforms unstructured, ad hoc decision making. API-first design is a model for applying light structure to application development: the API functions as a “contract” between developers on the team, and their eventual users, but allows infinite flexibility in how to deliver on what is agreed.</p>
<h1>Querying the data</h1>
<p><em>2017-09-20</em></p>
<p>This is the third (draft) chapter of a new book on relational databases (using Postgres) that I’m working on as a side project. Stay tuned for additional chapters. The book under development can also be viewed <a href="https://leanpub.com/relating-to-the-database" rel="nofollow">at Leanpub</a>, which supports commenting, and also will allow me to bundle the book with video lectures. I appreciate your feedback!</p>
<p>Sorry the tables and LaTeX equations don’t work in Svbtle… check out the e-book to see how they’re supposed to look.</p>
<h1 id="a-declarative-query-language_1">A declarative query language <a class="head_anchor" href="#a-declarative-query-language_1" rel="nofollow">#</a>
</h1>
<p>You have already seen some of the <strong>Structured Query Language (SQL)</strong> which is used to express queries in Postgres (and every other relational database that I know of), and you’re going to see a lot more in this book’s chapters. You have “programmed” several queries, but here’s one thing you may not know: SQL is not a programming language. A computer program written in a language like Python, Java, or C++ is <em>imperative</em>: it gives a computer a sequence of instructions to carry out until it finishes. SQL, by contrast, is a <strong>declarative language</strong>. In SQL queries, you describe the result that you want, not <em>how</em> the computer should obtain it. That turned out to be a genius move by the creators of the first relational databases.</p>
<p>Inside a DBMS like Postgres is a special function called the <strong>query optimizer</strong> which processes a SQL query and generates an <strong>execution plan</strong> for how best to obtain the desired result. In a complex query that incorporates multiple tables, there may be several steps in the plan, some slow and some quicker. These operations may include <strong>full table scans</strong> (reading an entire table from disk; slow), <strong>index scans</strong> (much quicker), and different types of join operations (see Table 3-1). Doing them in a certain order may be faster than doing them in another order, and this can make a big difference in a database with millions or billions of rows. Because SQL is <em>declarative</em>, the query optimizer has the freedom to choose the most efficient sequence.</p>
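<p>In Postgres you can inspect the optimizer’s chosen plan with the <code class="prettyprint">EXPLAIN</code> command. The same idea can be sketched from Python’s standard library using SQLite, whose <code class="prettyprint">EXPLAIN QUERY PLAN</code> statement plays a similar role (the table, index, and data here are invented):</p>
<pre><code class="prettyprint">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, team TEXT)")
conn.execute("CREATE INDEX idx_team ON players (team)")
conn.executemany("INSERT INTO players (name, team) VALUES (?, ?)",
                 [("Brady", "Patriots"), ("Manning", "Broncos")])

# EXPLAIN QUERY PLAN reveals the optimizer's chosen strategy --
# here it can use the index rather than scanning the whole table.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM players WHERE team = 'Patriots'"):
    print(row)
</code></pre>
<p>The plan output names an index scan on <code class="prettyprint">idx_team</code>; drop the index and rerun it, and you will see a full table scan instead, with no change to the query itself. That is the declarative contract at work.</p>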
<p><em>Table 3-1: Sample of primitive operations in query execution</em></p>
<table>
<tr><th>Operation</th><th>Meaning</th></tr>
<tr><td>Full table scan</td><td>Read every row in the table and find the one(s) specified by the query.</td></tr>
<tr><td>Index scan (aka “seek”)</td><td>Search an index to quickly find the locations of the rows specified by the query. A database index is conceptually like the index in the back of a book; it makes finding the right “page” much quicker, more so when the book is longer.</td></tr>
<tr><td>Table access</td><td>Go directly to the location of the specified row(s) and read the data.</td></tr>
<tr><td>Hash join</td><td>A two-phase algorithm to quickly join two tables based on an equality condition.</td></tr>
<tr><td>Nested loop join</td><td>A slower join algorithm that accommodates inequality conditions and other unusual joins.</td></tr>
</table>
<p>Here is a key point that I’ll come back to repeatedly: you should take advantage of the work that the database developers have already done. Yes, you <em>could</em> write your own execution plan, or your own program for processing the data, but it would take a lot of time and you might not get it right. Database engines are designed by some of the smartest computer scientists in the world and honed by practical experience for years, and they have very likely anticipated queries like yours. Give the database the freedom to optimize, and it will generally do an excellent job.</p>
<p>ASIDE: My philosophy is at odds with received wisdom. When I was first learning databases, I was often told that you should keep your business logic (i.e., your program) out of the database. The SQL dialects of popular databases (Oracle, SQL Server, etc.) were very similar in their DML (<strong>data manipulation language</strong>) commands like <code class="prettyprint">SELECT</code> and <code class="prettyprint">INSERT</code>, but had very different implementations of views, triggers, stored procedures, and other programming features (all of which will be further discussed in Chapter 7). The theory therefore was that if you used the latter, you’d be “locked in” to one database platform, so in order to keep your options open you should only use the database for the simplest DML operations. These are known as the <strong>CRUD operations</strong>: create, read, update, and delete.</p>
<p>I disagree. First off, I don’t think lock-in is a very worrisome problem, especially when your chosen database has a long track record of success and is not too expensive. Second, the rewards of committing to PostgreSQL or any quality database platform are well worth the small risk that it might be more expensive to switch to another platform at some hypothetical point in the future. These days, databases typically support not just one application, but several—perhaps a web page, two mobile apps, and a public API. In that case, do you really want to write the code for simple tasks (such as authenticating a user’s login) over and over again in each of the programming languages used? Implementing parts of your program’s logic in the database reduces redundancy, may improve security, and allows you to take advantage of features like the query optimizer to make your program more efficient. </p>
<h1 id="relational-operations_1">Relational operations <a class="head_anchor" href="#relational-operations_1" rel="nofollow">#</a>
</h1>
<p>Relations (remember, this is the mathematical term for what we’re calling “tables”) are sets of tuples, as discussed in Chapter 2. There are a number of mathematical operations that can be performed on them, with the interesting property of <strong>closure</strong>: the result of each <strong>relational operation</strong> is itself a relation. The clauses of a SQL query can be interpreted as a specification of relational operations to be performed on the specified tables. Interestingly, just as you might simplify a complicated equation in high school algebra before solving it, the query optimizer might use <strong>relational algebra</strong> to build its execution plan—choosing which operations to perform first in order to reduce the amount of computation it will have to do to finish the job.</p>
<p>The key relational operations identified by E. F. Codd and derived from set theory are the <strong>projection</strong>, <strong>selection</strong>, and <strong>Cartesian product</strong> operations, but to this database developers have added several more very useful operations, particularly <strong>extended projection</strong>, aggregation, grouping, and sorting. See Table 3-2.</p>
<p><em>Table 3-2: Important relational operations in SQL queries</em></p>
<table>
<tr><th>Operation</th><th>SQL clause</th><th>Symbol</th><th>Meaning</th></tr>
<tr><td>Projection</td><td><code class="prettyprint">SELECT</code></td><td>{$$}\Pi{/$$}</td><td>Return only the specified columns</td></tr>
<tr><td>Selection</td><td><code class="prettyprint">WHERE</code></td><td>{$$}\sigma{/$$}</td><td>Return only the rows that match specified criteria</td></tr>
<tr><td>Cartesian product</td><td><code class="prettyprint">CROSS JOIN</code></td><td>{$$}\times{/$$}</td><td>Return every combination of a row from table 1 with a row from table 2</td></tr>
<tr><td>Natural join</td><td><code class="prettyprint">NATURAL JOIN</code></td><td>{$$}\Join{/$$}</td><td>Return all combinations of rows in the specified tables that are equal on their common column</td></tr>
<tr><td>Extended projection</td><td><code class="prettyprint">SELECT</code></td><td>{$$}\Pi{/$$}</td><td>Generate new columns in the resulting table, such as the results of calculations or logical tests</td></tr>
<tr><td>Aliasing</td><td>optional <code class="prettyprint">AS</code></td><td>{$$}\rho{/$$}</td><td>Assign a (new) name to a column in the resulting table</td></tr>
<tr><td>Aggregation</td><td><code class="prettyprint">SUM</code>, <code class="prettyprint">COUNT</code>, <code class="prettyprint">AVG</code>, etc.</td><td>{$$}G_{f(x)}{/$$}</td><td>Replace the original rows with a single row containing the computed result</td></tr>
<tr><td>Grouping</td><td><code class="prettyprint">GROUP BY</code></td><td>{$$}_xG{/$$}</td><td>In combination with aggregation, split the original data into subsets to yield subtotals, subaverages, and so on</td></tr>
<tr><td>Sorting</td><td><code class="prettyprint">ORDER BY</code></td><td>n/a</td><td>Rearrange the rows in a specified order</td></tr>
</table>
<h2 id="basic-relational-operations-from-set-theory_2">Basic relational operations from set theory <a class="head_anchor" href="#basic-relational-operations-from-set-theory_2" rel="nofollow">#</a>
</h2>
<p><strong>Projection</strong> is the operation of reducing a table to a subset of its columns, and in SQL it is expressed as a list of columns following the <code class="prettyprint">SELECT</code> keyword, for example:</p>
<pre><code class="prettyprint">SELECT name, age
FROM players;
</code></pre>
<p><strong>Selection</strong> is the operation of reducing a table to a subset of its rows, and in SQL it is expressed as a logical test (for equality or inequality) following the <code class="prettyprint">WHERE</code> keyword. Multiple conditions may be combined into one with the <code class="prettyprint">AND</code> and <code class="prettyprint">OR</code> keywords if needed. For example:</p>
<pre><code class="prettyprint">SELECT *
FROM players
WHERE team='Patriots' AND position='QB';
</code></pre>
<p>These are certainly the most common operations, and most queries will employ both. Consider the query</p>
<pre><code class="prettyprint">SELECT name, age
FROM players
WHERE team='Patriots' AND position='QB';
</code></pre>
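<p>If you would like to experiment with these clauses outside of Postgres, any SQL engine will do. A quick sketch using SQLite through Python’s standard library, with an invented roster:</p>
<pre><code class="prettyprint">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, age INTEGER, team TEXT, position TEXT)")
conn.executemany("INSERT INTO players VALUES (?, ?, ?, ?)", [
    ("Tom Brady",       40, "Patriots", "QB"),
    ("Rob Gronkowski",  28, "Patriots", "TE"),
    ("Jimmy Garoppolo", 25, "Patriots", "QB"),
])

# Projection (the SELECT list) combined with selection (the WHERE clause)
rows = conn.execute(
    "SELECT name, age FROM players WHERE team='Patriots' AND position='QB'"
).fetchall()
print(rows)
</code></pre>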
<p>This query may be expressed in relational algebra as {$$}\Pi_{name,age}(\sigma_{team=Patriots \land position=QB}(players)){/$$}. This formulation implies that the selection operation should be computed first, and then the projection. But because it is an algebra, and because the outcome of every operation is another relation, the expression can be rewritten into any equivalent form; for instance, the order of selection and projection can be swapped whenever the projection retains the columns the selection tests, i.e.: {$$}\sigma_{team=Patriots \land position=QB}(\Pi_{name,age,team,position}(players)){/$$}. This kind of flexibility gives the query optimizer room to make choices that speed up the query. </p>
<p>The third of the “original” relational operations is the <strong>Cartesian product</strong> operation which joins every row of one table with every row of a second table. The Cartesian product is expressed in PostgreSQL as <code class="prettyprint">CROSS JOIN</code> and one way it sometimes comes in handy is to generate a cross-tabulation of the rows of two tables. For example, if you want a report to yield some statistics about every football team in every year (perhaps to build a line graph?), the core of the query might be:</p>
<pre><code class="prettyprint">SELECT * FROM teams CROSS JOIN seasons;
</code></pre>
<p>Or in relational algebra, {$$}teams\times seasons{/$$}. The more common type of join, as discussed in Chapter 2, is a <strong>natural join</strong>, where each row of one table is joined with only the rows of the other table that have matching values of a specific column (i.e., a foreign key - primary key relationship). In Postgres there is actually a <code class="prettyprint">NATURAL JOIN</code> keyword that works when the columns literally have the same name. If they have different names (for example, if a “players” table has a FK called “team_id” but in the “teams” table it’s simply called “id”), you can use either a <code class="prettyprint">JOIN</code> clause or a <code class="prettyprint">WHERE</code> condition to effect the join. These are three ways you might perform a natural join on two tables in SQL:</p>
<pre><code class="prettyprint">SELECT * FROM players NATURAL JOIN teams;
SELECT * FROM players JOIN teams ON players.team_id=teams.id;
SELECT * FROM players, teams WHERE players.team_id=teams.id;
</code></pre>
<p>In relational algebra, the natural join is expressed as {$$}players\Join _{team\_id=id} teams{/$$}; the subscript expressing the join condition can be omitted if the FK-PK relationship is obvious. You <em>could</em> perform a natural join by first taking the Cartesian product and then <em>selecting</em> the rows where the FK matches the PK, à la {$$}\sigma _{team\_id=id} (players \times teams){/$$}, and in theory this is what the database engine is doing. In practice, the query optimizer will use an algorithm like a <em>hash join</em> to perform an equality join much more quickly.</p>
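<p>The equality join can be tried the same way; a sketch using SQLite via Python’s standard library, with invented tables mirroring the players/teams example:</p>
<pre><code class="prettyprint">import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE players (name TEXT, team_id INTEGER REFERENCES teams(id));
    INSERT INTO teams VALUES (1, 'Patriots'), (2, 'Broncos');
    INSERT INTO players VALUES ('Brady', 1), ('Manning', 2), ('Gronkowski', 1);
""")

# Join each player to his team via the FK-PK pair
rows = conn.execute("""
    SELECT players.name, teams.name
    FROM players JOIN teams ON players.team_id = teams.id
""").fetchall()
print(rows)
</code></pre>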
<p><strong>Inequality joins</strong> are also possible. If you want to join each player with teams he is <em>not</em> on, in order to perform some kind of comparison, you might do the following:</p>
<pre><code class="prettyprint">SELECT * FROM players JOIN teams ON players.team_id != teams.id;
</code></pre>
<p>In relational algebra notation this is {$$}players\Join _{team\_id \neq id} teams{/$$}. Such a join is generally going to be quite expensive in computational terms, because the database engine must perform a <em>nested loop</em>: for each row of the “players” table it must loop through the entire “teams” table to find relevant rows.</p>
<h2 id="extensions-to-the-relational-toolkit_2">Extensions to the relational toolkit <a class="head_anchor" href="#extensions-to-the-relational-toolkit_2" rel="nofollow">#</a>
</h2>
<p>Although relational modeling and relational algebra originate in set theory, database developers and users have made numerous pragmatic extensions to the original theory-derived set of methods we can apply. After all, a database isn’t an academic exercise, but a practical business tool.</p>
<p>The idea behind <strong>extended projection</strong> is that a query can give us not only a selection of columns <em>from</em> the original table(s), but can also produce new columns as a result of calculations or logical tests. For example:</p>
<pre><code class="prettyprint">SELECT running_yards + passing_yards FROM game_results;
</code></pre>
<p>A closely associated idea is that of <strong>aliasing</strong> (also known as the “rename” operation), an operation that changes the name of a column or assigns a name to a column that doesn’t have one (such as the calculated column above). In Postgres, you can use the optional <code class="prettyprint">AS</code> keyword, or simply provide an alias after specifying the column:</p>
<pre><code class="prettyprint">SELECT running_yards + passing_yards AS total_yards FROM game_results;
SELECT first_name || ' ' || last_name AS full_name FROM players;
SELECT age > 35 oldguy FROM players;
</code></pre>
<p>The last query above is an example that contains a logical test, “<code class="prettyprint">age > 35</code>”, and the result will be a column called “oldguy” that contains Boolean values: “true” and “false”. The <code class="prettyprint">AS</code> keyword is omitted; it is optional, but may make your queries easier to read. </p>
<p>Another powerful extension to relational algebra is <strong>aggregation</strong>; this allows us to generate new <em>rows</em> that do not come from the original tables, but instead are the result of calculations over some or all of the original rows. The aggregations you will use most frequently are <code class="prettyprint">SUM</code> and <code class="prettyprint">COUNT</code>, but a number of other <a href="https://www.postgresql.org/docs/current/static/functions-aggregate.html" rel="nofollow">aggregate functions in Postgres</a> are available, particularly for statistics such as <code class="prettyprint">MIN</code>, <code class="prettyprint">MAX</code>, <code class="prettyprint">AVG</code> and so on. </p>
<p>Aggregation works together with the operation of <strong>grouping</strong>, which identifies the set(s) of rows to be aggregated together. If no grouping condition is set by a <code class="prettyprint">GROUP BY</code> clause in the SQL, all rows are aggregated into one result. To count the number of teams, for example, we could simply do:</p>
<pre><code class="prettyprint">SELECT COUNT(*) FROM teams;
</code></pre>
<p>If we want to compute aggregates for subsets of the data, we use <code class="prettyprint">GROUP BY</code> to generate a group for each distinct value of a particular column, for example:</p>
<pre><code class="prettyprint">SELECT team, AVG(salary) FROM players GROUP BY team;
</code></pre>
<p>In relational algebra notation, the grouping and aggregation operations are denoted by a capital “G”; the grouping column in a preceding subscript and the aggregation function in the following subscript. The above example would be expressed {$$}_{team} G _{AVG(salary)} (players){/$$}.</p>
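<p>Several aggregate functions can also be computed in a single query; each is evaluated over the same groups. A sketch, reusing the hypothetical “players” table:</p>
<pre><code class="prettyprint">SELECT team, COUNT(*), MIN(salary), MAX(salary), AVG(salary)
FROM players
GROUP BY team;
</code></pre>
<p>This returns one row per team, with each aggregate computed over that team’s players.</p>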
<p>A final important operation is <strong>sorting</strong>. Although in theory the order of rows in a relation is meaningless (it’s a set), in practice we want to sort the rows into some kind of meaningful order. There is no standard notation for this operation in relational algebra, but in SQL it’s expressed in the <code class="prettyprint">ORDER BY</code> clause of a query:</p>
<pre><code class="prettyprint">SELECT * FROM players ORDER BY last_name, first_name;
</code></pre>
<p>The relational operations listed here are the key components from which you’ll build most of your queries, and they are common to virtually all relational databases. In Chapter 5, we’ll introduce you to some additional relational patterns that are useful in special cases, and in Chapter 7 we’ll explore some of the special features of PostgreSQL that distinguish it from other relational databases.</p>
<h1 id="queries-within-queries_1">Queries within queries <a class="head_anchor" href="#queries-within-queries_1" rel="nofollow">#</a>
</h1>
<p>SQL queries can be much more complex than you have seen so far, when the answer to one query depends on the answer to others. In such cases, a complete query called a <strong>subquery</strong> can be nested within another. Subqueries are most often found in the <code class="prettyprint">WHERE</code>, <code class="prettyprint">FROM</code>, and <code class="prettyprint">SELECT</code> clauses. For example, this query names all players who earn a salary greater than the average:</p>
<pre><code class="prettyprint">SELECT last_name, first_name, salary FROM players
WHERE salary >
(SELECT AVG(salary) FROM players);
</code></pre>
<p>The inner query, “<code class="prettyprint">SELECT AVG(salary) FROM players</code>”, is evaluated first and yields a single numerical result. Then the outer query is evaluated using that result. Just as in a mathematical expression, parentheses set the subquery apart from the clauses of the surrounding query. (I also indented the subquery, but this is not necessary; PostgreSQL, like most databases, is indifferent to whitespace such as spaces, tabs, and line breaks.)</p>
<p>Subqueries in the <code class="prettyprint">FROM</code> clause, when evaluated, produce result sets that act like tables in the outer query. To identify all the quarterbacks in the AFC East, we could join the “players” table with the result of a subquery that selects all teams in that division:</p>
<pre><code class="prettyprint">SELECT last_name
FROM players JOIN
(SELECT * FROM teams WHERE conference='AFC' AND division='East') AS afceast_teams
ON players.team=afceast_teams.team_id
WHERE position='QB';
</code></pre>
<p>In this example you saw that the <code class="prettyprint">AS</code> keyword can be used to assign a name to a table, just as previously we saw it used to assign a name to a column. When using a subquery as a table in the <code class="prettyprint">FROM</code> clause, it <em>must</em> be given a name. The <code class="prettyprint">AS</code> keyword is optional, though I think it makes the query easier to read.</p>
<p>Another place you will often see a subquery is the <code class="prettyprint">SELECT</code> clause. Such a subquery must yield a single value, because it supplies the value for one column of the query output. Here is a simple example that lists the total salary budget for each team. (Note that this could also be accomplished using a join and a <code class="prettyprint">GROUP BY</code>… there are more ways than one to solve many SQL problems.)</p>
<pre><code class="prettyprint">SELECT city, team_name,
(SELECT SUM(salary) FROM players WHERE players.team=teams.team_id) AS total_payroll
FROM teams;
</code></pre>
<p>This is a case of a <strong>correlated subquery</strong>, meaning that the subquery depends on some value from the outer query (namely <code class="prettyprint">teams.team_id</code>). Therefore, the subquery will be executed numerous times, once for each row of the “teams” table. The queries in the previous two examples are noncorrelated (or uncorrelated) subqueries, which need only be executed once. Correlated subqueries are potentially much slower, so it could be preferable to solve this problem with a join instead of a subquery, but that’s not possible in every case. My philosophy is to trust the query optimizer (and its developers) to find the fastest way to execute the query, and not to over-think the SQL: if it ain’t broke, don’t fix it.</p>
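<p>For comparison, here is one way the same payroll report could be written with a join and <code class="prettyprint">GROUP BY</code> (a sketch; note that a team with no players would drop out of this version, whereas the subquery version lists it with a null total):</p>
<pre><code class="prettyprint">SELECT city, team_name, SUM(salary) AS total_payroll
FROM teams JOIN players ON players.team = teams.team_id
GROUP BY city, team_name;
</code></pre>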
<p>Finally, subqueries can be nested within subqueries. Indeed, there may be several levels of such nesting, making for some pretty complicated queries. Just as you simplified complicated expressions in high-school algebra before solving them, the query optimizer may have many ways to simplify and sequence things behind the scenes so that the query’s result can be obtained as efficiently as possible.</p>
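<p>To illustrate, here is a hypothetical query with two levels of nesting, reusing the “players” and “teams” tables: it names the players who earn more than the average salary among AFC players.</p>
<pre><code class="prettyprint">SELECT last_name, first_name FROM players
WHERE salary >
  (SELECT AVG(salary) FROM players
   WHERE team IN
     (SELECT team_id FROM teams
      WHERE conference = 'AFC'));
</code></pre>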
<h1>The relational model</h1>
<p><em>2016-12-17</em></p>
<p>This is the second (draft) chapter of a new book on relational databases (using Postgres) that I’m working on as a side project. Stay tuned for additional chapters. The book under development can also be viewed <a href="https://leanpub.com/relating-to-the-database" rel="nofollow">at Leanpub</a>, which supports commenting, and also will allow me to bundle the book with video lectures. I appreciate your feedback!</p>
<h1 id="there-are-other-data-models_1">There are other data models <a class="head_anchor" href="#there-are-other-data-models_1" rel="nofollow">#</a>
</h1>
<p>For a couple of decades (roughly 1985-2005), the <strong>relational data model</strong> was the only game in town: you had to learn it, and there was no reason for a textbook to argue the point. Today, <a href="https://leanpub.com/data-engineers-manual" rel="nofollow"><strong>data engineers</strong></a> have a lot of other options. Document-oriented databases are booming in popularity with app developers for their ease of use; graph databases have captured the imagination of researchers and tinkerers because of their natural fit with social network applications; and the analytics world has found performance advantages to be gained with dimensional databases, column-family databases, and cluster-based Big Data platforms like Hadoop. These are meaningful advantages, so before we take it for granted that the relational model is the most important one to learn, it is necessary to remind ourselves what it is and what its unique advantages are.</p>
<p>A <strong>data model</strong> is a means of describing the data in a database without regard to the way it is actually physically stored. It provides abstractions that humans can work with—e.g. tables, documents, dimensions—instead of implementation artifacts like bytes, pointers, and disk sectors. This abstraction is quite important when you consider the growth and evolution of a database over time, and the number of applications that might come to depend on it. As databases are used, they grow larger, the types of data stored may grow or change, and it may be necessary for <strong>database administrators</strong> to optimize their performance by upgrading the technology in various ways. If the developers of applications that depend on these databases had written their code to interact with the data <em>as it was stored on disk</em>, their applications would break, and code would need to be rewritten, every time such a change was made. But because we have data models, this is not necessary. Application developers interact with the database using a “data language” that is independent of physical implementation; they query tables, rows, and columns for instance, instead of locations on disk.</p>
<p>The relational model’s particular strength is its ability to efficiently answer queries that were not foreseen at the time the database was developed. Before E.F. Codd’s landmark paper introducing the relational model, the leading approach to data modeling was a hierarchy or network in which you would follow links from one data point to another. Consider, for example, a hierarchical database of movies listed by director:</p>
<p><a href="https://svbtleusercontent.com/zrwdzm29g4rd1w.png" rel="nofollow"><img src="https://svbtleusercontent.com/zrwdzm29g4rd1w_small.png" alt="Screenshot 2017-01-18 20.24.37.png"></a></p>
<p>[Figure 2-1. hierarchical database of movies]</p>
<p>In this example, it would be very easy to query the database for a list of movies directed by Christopher Nolan; just start from his name and follow the pointers. By contrast, it would be quite difficult (computationally) to query for a list of movies in which Matthew McConaughey had appeared. The database engine would have to essentially read through the entire database, the vast majority of which is not relevant to the query, following all paths from left to right in the diagram to make sure it had found all of the paths that end with his name. In a really huge database, such a query could be prohibitively expensive.</p>
<p>The problem is, as databases grow, you will <em>always</em> find that you want to make queries you didn’t anticipate at the time you created the database. Codd, a researcher at IBM, had this problem in mind when in 1970 he published <a href="https://doi.org/10.1145/362384.362685" rel="nofollow">“A Relational Model of Data for Large Shared Data Banks”</a>.</p>
<p><a href="https://svbtleusercontent.com/sbe5et3w3hfk5a.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/sbe5et3w3hfk5a_small.jpg" alt="Figure 2-2. Edgar F. Codd"></a></p>
<p>[Figure 2-2. Edgar F. Codd]</p>
<p>In the relational model, the database consists of a set of tables, one for each <strong>entity</strong> (or noun) described by data. Each table can be queried on its own, or related tables can be combined into one query, but there is no “parent” table and no network or hierarchy that must be traversed. You have already seen a glimpse of this in Chapter 1; now, let’s define the terms a bit more precisely.</p>
<h1 id="theoretical-roots_1">Theoretical roots <a class="head_anchor" href="#theoretical-roots_1" rel="nofollow">#</a>
</h1>
<p>Codd’s conception of a relation comes from set theory:</p>
<blockquote>
<p>The term <em>relation</em> is used here in its accepted mathematical sense. Given sets <em>S1, S2, …, Sn</em> (not necessarily distinct), <em>R</em> is a relation on these <em>n</em> sets if it is a set of <em>n</em>-tuples each of which has its first element from <em>S1</em>, its second element from <em>S2</em>, and so on. We shall refer to <em>Sj</em> as the <em>j</em> th <em>domain</em> of <em>R</em>.</p>
</blockquote>
<p>It is possible to visualize this as a table, hence the commonly-used language of tables, rows, and columns. In Figure 2.3, we see a relation defined by four sets—a set of movie titles, a set of movie studios, the set of integer years, and the set of director names. These indicate the <em>possible</em> values that may appear in any given data record; it’s not necessarily the case that all of the years of history (for example) <em>will</em> be found in the data. This is an important point: the relation (table) is “defined” by specifying the <strong>domains</strong> (columns) and not by the rows. You’ll see this in SQL’s <code class="prettyprint">CREATE TABLE</code> command, which specifies column names and data types only.</p>
<p><a href="https://svbtleusercontent.com/pbpj6tfmfb1zw.png" rel="nofollow"><img src="https://svbtleusercontent.com/pbpj6tfmfb1zw_small.png" alt="Screenshot 2017-01-18 20.06.00.png"></a></p>
<p>[Figure 2-3. A relation with four domains, three tuples]</p>
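<p>For example, the relation of Figure 2-3 could be declared by naming its four domains and their data types, and nothing else. (A sketch; the column names here are my own.)</p>
<pre><code class="prettyprint">CREATE TABLE movies (
  title text,
  studio text,
  year integer,
  director text );
</code></pre>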
<p>The term <strong>tuple</strong> (derived from “double”, “triple”, “quadruple”, and so on, up to “<em>n</em>-tuple”) in mathematics refers to an ordered list of values. In this context, a tuple fits in to a relation if it contains one value for each domain, in the same ordering as the domains. So in the above example, a tuple must contain a movie title, a studio name, an integer, and a director’s name, in that order. Each such tuple is a “row” of the table, or a <strong>record</strong> of the data—in this case, it’s a movie that we’re interested in.</p>
<p>Because a relation is a set of tuples so defined, a number of constraints apply—some of which will be relaxed in practical implementations. These include:</p>
<ol>
<li>Each row must be unique.</li>
<li>The order of rows is immaterial.</li>
<li>The order of columns is significant.</li>
<li>Each row must include a value for each column.</li>
</ol>
<p>The first rule is easy to accidentally violate, for example in an order-taking system, where the same customer may purchase the same product on more than one occasion. It is common practice therefore to add a <strong>primary key (PK)</strong> column that contains a value guaranteed to be unique in the relation, such as an identification number.</p>
<p>The second rule implies that we must query the database without regard to the order in which data was entered, or any other order. One cannot query for the “next” value and reliably predict what result will be given. Attention must be paid to the <code class="prettyprint">WHERE</code> clause of a SQL query to specify exactly what we want.</p>
<p>In practice, the third rule is ignored. Codd informs us that the mathematical term for a relation with no specific domain ordering is a <strong>relationship</strong>. Instead of referencing an element of a row by its position in the sequence, in practice we use column names. Database developers should give sensible names to their columns, particularly when there may be one or more with the same domain of potential values. For example, a table of customer records may have two or more “address” columns, one for billing and one for shipping. It would be wise to give these columns names that indicate both the domain and the role, such as “billing_address” and “shipping_address”.</p>
<p>The fourth rule may be relaxed as well. Allowing missing values (called <strong>nulls</strong>) in certain columns gives database developers the flexibility to include optional attributes, or to add a data record in a step-by-step way instead of all at once. Obviously nulls cannot be allowed in every column, or the first rule may be violated. Therefore most databases don’t allow nulls in the primary key column.</p>
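<p>In SQL this flexibility is declared column by column. A hypothetical sketch: the primary key is automatically non-null, a <code class="prettyprint">NOT NULL</code> constraint makes a column required, and any other column may be left null:</p>
<pre><code class="prettyprint">CREATE TABLE customer (
  id integer PRIMARY KEY,   -- primary key: nulls never allowed
  name text NOT NULL,       -- required attribute
  phone text );             -- optional attribute; may be NULL
</code></pre>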
<p>In a query result, which otherwise resembles a relation, even the first two rules may be violated. This can be demonstrated by the following database interaction:</p>
<pre><code class="prettyprint">ch2=# select studio from movies;
   studio
-------------
 Paramount
 Paramount
 Warner Bros
(3 rows)
ch2=# select movie, year from movies order by year;
movie | year
---------------+------
Sahara | 2005
Inception | 2010
Interstellar | 2014
(3 rows)
</code></pre>
<p>The first result contains non-unique rows. There’s a mathematical term for a set of <em>n</em>-tuples that admits duplicates: it’s called a <strong>bag</strong>. The second example result orders the rows in a meaningful sequence. </p>
<h1 id="a-relational-database-is-a-database-of-relati_1">A relational database is a database of relations <a class="head_anchor" href="#a-relational-database-is-a-database-of-relati_1" rel="nofollow">#</a>
</h1>
<p>We can therefore think about a relational database as a collection of tables (technically <em>relations</em> (even more technically, <em>relationships</em>)) that, together, describe the important facts in whatever context we’re interested in: a business, an application, a research project, or whatever. In some cases, a table may be useful on its own, but in many tasks you will want to query two or more related tables together. Yes, tables are related: customers are assigned to salespeople, employees belong to departments, products have bills of materials that go into their manufacture.</p>
<p>Two aspects of the context must be represented in the database design: <strong>entities</strong> and <strong>relationships</strong>. Entities are the nouns that matter: people, things, places, events, and concepts. Relationships are connections between entities which can usually be described by verbs. The most useful tool in developing a data model is an <strong>entity-relationship diagram</strong> (aka E-R diagram or ERD). See Figure 2-4. The boxes are <strong>entity types</strong> (although we usually just say “entities”), which correspond to tables. The lines indicate the <strong>relationship types</strong> (or “relationships”), which indicate how the rows of each table relate to the rows of the other tables.</p>
<p><em>ASIDE: These are relationships, not relations. Note also that this use of “relationships” differs from the mathematical term introduced above.</em></p>
<p><a href="https://svbtleusercontent.com/ekiwo37864mh0a.png" rel="nofollow"><img src="https://svbtleusercontent.com/ekiwo37864mh0a_small.png" alt="Screenshot 2017-01-18 20.33.45.png"></a></p>
<p>[Figure 2-4. A sample ERD with 1:1, 1:M, and M:M relationships]</p>
<p>Three types (cardinalities) of relationships are depicted in Figure 2-4, a relational data model for a database of movies (like IMDB) diagrammed in “crow’s foot” notation. These are a “one-to-many” (or 1:M) relationship between Studio and Movie, a “one-to-one” (1:1) relationship between Script and Movie, and a “many-to-many” (M:M) relationship between Movie and Actor. Out of these three basic <strong>cardinality</strong> cases, complex data models can be designed.</p>
<p><em>ASIDE: You’ll see me use the term “relational model” in a couple of different ways. On the one hand, when we talk about “<strong>the</strong> relational model”, we mean the set of principles for designing relational databases, as contrasted with e.g. <strong>the</strong> dimensional model or <strong>the</strong> graph model. On the other hand, I may talk about “<strong>a</strong> relational model” for a database; this is a specific set of tables designed for a particular scenario, also known as a <strong>schema</strong>. Figure 2-4 depicts a schema for a movie database.</em></p>
<p><em>ASIDE: There are a few other notations for E-R diagrams, and which one you use is a matter of personal preference. (Unless you’re in my database class; in that case, use the crow’s foot notation.) I use an online program called <a href="https://www.gliffy.com" rel="nofollow">Gliffy</a> to draw the diagrams.</em></p>
<h2 id="onetomany_2">One-to-many <a class="head_anchor" href="#onetomany_2" rel="nofollow">#</a>
</h2>
<p>The most common type of relationship between tables is one-to-many (1:M), as seen in Figure 2-5. The rectangles in the diagram stand for entity <em>types</em>, which correspond to tables, and mean that any number of <strong>entity instances</strong> of each type may be in the database. We can assume that there are, or will eventually be, a large number of movies in the Movie table and a large number of studios in the Studio table. The line with a crow’s foot at one end and a single tick mark at the other end indicates the type of relationship that may exist between <em>rows</em> of the Movie table and <em>rows</em> of the Studio table. Simply put, it says that each movie is related to just one studio, but that any given studio may be related to more than one movie. It is also common practice to add text to the diagram to explain why, or how, the two entity types might be related—in this case, studios produce movies (and movies <em>are produced by</em> studios).</p>
<p><a href="https://svbtleusercontent.com/hzatqv0jhix7qg.png" rel="nofollow"><img src="https://svbtleusercontent.com/hzatqv0jhix7qg_small.png" alt="Screenshot 2017-01-18 20.55.52.png"></a></p>
<p>[Figure 2-5. A one-to-many relationship]</p>
<p><em>ASIDE: A very common mistake I see in my classes is that students draw boxes on diagrams for specific <strong>instances</strong> of entity types. It would be wrong to, for example, have a rectangle on this diagram for the movie <u>Interstellar</u> or the studio Paramount. Those instances will become rows, but in an E-R diagram we are concerned with the tables.</em></p>
<p>When it comes time to create the actual tables, a one-to-many relationship is implemented by a simple and intuitive mechanism called a <strong>foreign key (FK)</strong>. Remember that each table has a primary key, a column of data whose values are guaranteed to be unique—typically an ID number or code. To implement a 1:M relationship we add a special column to the table on the “many” side (in Figure 2-5, “Movie”) that holds references to primary key values in the other table. This could be achieved with the following SQL (note the REFERENCES clause):</p>
<pre><code class="prettyprint">CREATE TABLE studio (
id integer PRIMARY KEY,
name text );
CREATE TABLE movie (
id integer PRIMARY KEY,
title text,
studio_id integer REFERENCES studio(id) );
</code></pre>
<p>If we view the tables with a few sample rows of data (Figure 2-6), the purpose of the foreign key column <code class="prettyprint">studio_id</code> should be clear. At a glance you can see that <em>Sahara</em> and <em>Interstellar</em> were produced by Paramount (studio #1) and that <em>Inception</em> was produced by Warner Bros. (studio #2).</p>
<p><a href="https://svbtleusercontent.com/yawno6ahogkg.png" rel="nofollow"><img src="https://svbtleusercontent.com/yawno6ahogkg_small.png" alt="Screenshot 2017-01-18 21.52.08.png"></a></p>
<p>[Figure 2-6. Sample tables in a 1:M relationship (emphasis on foreign key column)]</p>
<p>Notice that there is no foreign key in the “Studio” table. If you pencil in a “movie_id” column on Figure 2-6, or simply imagine one, the reason should be obvious. If a studio were assigned a “movie_id”, then a studio could only be related to one movie, and that clearly doesn’t mesh with reality or with our E-R diagram.</p>
<h2 id="manytomany_2">Many-to-many <a class="head_anchor" href="#manytomany_2" rel="nofollow">#</a>
</h2>
<p>The next most common kind of relationship between two entity types is many-to-many (M:M), as in the relationship between Movies and Actors: each movie involves more than one actor, and each actor can be in more than one movie. See Figure 2-7.</p>
<p><a href="https://svbtleusercontent.com/m84t6uubotr3aw.png" rel="nofollow"><img src="https://svbtleusercontent.com/m84t6uubotr3aw_small.png" alt="Screenshot 2017-01-19 22.40.48.png"></a></p>
<p>[Figure 2-7. A many-to-many relationship]</p>
<p>Now, an M:M relationship is easy to draw on a diagram, but it’s a little more complicated to implement as real tables in the database. Take a minute to think about how you might do it, before reading about it below.</p>
<p>…</p>
<p>Did you come up with a solution?</p>
<p>Well, here’s the solution that I teach my students: to represent the many-to-many connection, you need a <em>third</em> table. This is the only kind of case where a table represents a relationship rather than an entity. This new table, called an <strong>associative relation</strong>, might have two columns only: each a foreign key to one of the tables in the relationship. Figure 2-8 illustrates some sample data for the simplest kind of associative table.</p>
<p><a href="https://svbtleusercontent.com/qpsdxmjduzdlsw.png" rel="nofollow"><img src="https://svbtleusercontent.com/qpsdxmjduzdlsw_small.png" alt="Screenshot 2017-01-23 22.20.51.png"></a> </p>
<p>[Figure 2-8. Movie_Actor creates the relationship between Movie and Actor]</p>
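<p>In SQL, the minimal two-column associative table of Figure 2-8 could be declared like this (a sketch, assuming “movie” and “actor” tables like the ones defined below; making the pair of foreign keys a composite primary key is one way to guarantee that each pairing appears only once):</p>
<pre><code class="prettyprint">CREATE TABLE movie_actor (
  movie_id integer REFERENCES movie(id),
  actor_id integer REFERENCES actor(id),
  PRIMARY KEY (movie_id, actor_id) );
</code></pre>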
<p>A question of style concerns whether you should depict the associative entity on your E-R diagram—essentially as a table with two 1:M relationships to the entities of interest—or simply use the double-crow’s-foot notation as in Figure 2-7.</p>
<p>To answer this, consider the opportunity afforded by the existence of an associative table. It need not be limited to two foreign key columns only, because a table can have any number of columns. You can therefore use columns of the associative table to give meaningful characteristics about the relationship. Such a relationship may in fact be a “thing” of its own, perhaps an intangible one. An M:M relationship between a Buyer and a Seller might be an entity called a Contract. An M:M relationship between a social media User and a channel he follows could be called a Subscription. And in the case of the Actor-Movie relationship, we might call it a Role.</p>
<p>The answer to the style question is that, if the associative table can be construed as an <strong>associative entity</strong>, a thing or concept (or noun) meaningful to the business, it makes sense to add it to the diagram. If the associative entity is meaningless to anyone except the guy writing database code, you would in most cases leave it off the diagram.</p>
<p>See Figure 2-9 for an example diagram including the associative entity. Note that instead of an M:M relationship you now see two 1:M relationships. In this example, the rectangles representing entities also show the names of some columns that would belong to each table; this is also a common style of E-R diagramming.</p>
<p><a href="https://svbtleusercontent.com/pio6is6sbe4jga.png" rel="nofollow"><img src="https://svbtleusercontent.com/pio6is6sbe4jga_small.png" alt="Screenshot 2017-01-19 23.04.51.png"></a></p>
<p>[Figure 2-9. Many-to-many relationship with meaningful associative entity.]</p>
<p>An implementation in SQL could look like this:</p>
<pre><code class="prettyprint">CREATE TABLE movie (
id integer PRIMARY KEY,
title text,
year integer );
CREATE TABLE actor (
id integer PRIMARY KEY,
name text );
CREATE TABLE role (
id integer PRIMARY KEY,
movie_id integer REFERENCES movie(id),
actor_id integer REFERENCES actor(id),
character_name text );
</code></pre>
<p>Notice that there are two foreign keys in the <code class="prettyprint">role</code> table.</p>
<h2 id="onetoone_2">One-to-one <a class="head_anchor" href="#onetoone_2" rel="nofollow">#</a>
</h2>
<p>One-to-one (1:1) relationships are much less common than 1:M and M:M relationships, simply because if two types of entities are related this way it’s often easier to put them together in one table. For example, a person has only one Social Security number, and a Social Security number has only one person, so you would typically have a table for People and include a Social Security number <em>column</em> rather than having two tables.</p>
<p>One reason you might store the data in separate tables is if some of the data needs to be stored in a different (physical) space, or stored with different security settings. Physical database optimization is beyond the scope of this chapter (although we’ll come back to it in Chapter 8), but I’ll give you an example. In many web applications, each user has a profile picture. Picture files (JPEG, PNG, or whatever) are quite large on disk, and to include them in the Users table of the database would make that table larger (perhaps by orders of magnitude), making queries slower. To give your website users a quicker response time, you might store the pictures in a separate ProfilePicture table, and relate them by foreign key to the corresponding rows of the Users table.</p>
<p>Another use case for a 1:1 relationship is when one of the tables contains “optional” data. For example, let’s say our movie database contains 10,000 rows, and we have scripts for 1,000 of those movies. Rather than have a script <em>column</em> in the Movie table, with 9,000 NULL entries making that table bigger and slower, we could have a separate Script table with only the rows it needs. In this case, each row of Script relates to exactly one row of Movie, but each row of Movie may relate to <em>zero or one</em> row of Script. It’s not exactly a partnership among equals: Movie is the main table in the relationship, and Script is a “dependent”.</p>
<p><a href="https://svbtleusercontent.com/9smwf2nk0osjzw.png" rel="nofollow"><img src="https://svbtleusercontent.com/9smwf2nk0osjzw_small.png" alt="Screenshot 2017-01-23 22.03.01.png"></a></p>
<p>[Figure 2-10. One-to-one relationship between Script (optional) and Movie (required)]</p>
<p>Figure 2-10 uses a version of the crow’s foot notation that indicates the difference. The double tick mark next to Movie means that, for any row of Script, there must be <em>at least one</em> and <em>no more than one</em> row of Movie. The circle (or zero) with a tick mark next to Script means that for any row of Movie, there may be <em>zero or one</em> rows of Script. (The simpler notation was seen earlier, in Figure 2-4.)</p>
<p>To implement this in SQL, even though you could get away with putting the foreign key on either side of this relationship, or on both, the best practice is to put a foreign key in the “dependent” table. In this case, that table is Script. Essentially a 1:1 relationship is a special case of 1:M, where the foreign key is simply constrained to be unique. Since primary keys are unique, the most practical way to do this is for the rows of the dependent table to have the same primary key values found in the “independent” table, and for its primary key to double as a foreign key, like so:</p>
<pre><code class="prettyprint">CREATE TABLE movie (
id integer PRIMARY KEY,
title text,
year integer );
CREATE TABLE script (
movie_id integer PRIMARY KEY REFERENCES movie(id),
author text,
body text );
</code></pre>
<p>A selection of the data, as in Figure 2-11, shows that the Script table borrows its primary key values from Movie but does not have a row for every value.</p>
<p><a href="https://svbtleusercontent.com/yswfshcziqgqa.png" rel="nofollow"><img src="https://svbtleusercontent.com/yswfshcziqgqa_small.png" alt="Screenshot 2017-01-23 22.24.27.png"></a></p>
<p>[Figure 2-11. Sample data from tables in a 1:1 relationship]</p>
<h2 id="unary-relationships_2">Unary relationships <a class="head_anchor" href="#unary-relationships_2" rel="nofollow">#</a>
</h2>
<p>All of the relationships in the E-R diagrams seen so far have been <strong>binary</strong> relationships, meaning they include two tables, and are said to have a <strong>degree</strong> of two. Other degrees are possible. Frequently we also see <strong>unary</strong> relationships, where rows of one table are related to other rows of the same table. Unary relationships can be of any cardinality we’ve seen so far: 1:M, M:M, or 1:1. Remember that it is not the <em>tables</em> that are related, per se, but the <em>rows</em>.</p>
<p>An example of a unary 1:M relationship appears in Figure 2-12, which depicts a movie studio ownership relationship. Studios are often owned by other studios, as for example Disney owns Lucasfilm and Sony owns Columbia. This can be expressed as a foreign key in the Studio table referencing an “owner”. (Note that it can’t be done with a foreign key referencing the “owned” studio; the FK belongs on the “dependent” side of a 1:M relationship.)</p>
<p><a href="https://svbtleusercontent.com/7xqt0q77yt1chq.png" rel="nofollow"><img src="https://svbtleusercontent.com/7xqt0q77yt1chq_small.png" alt="Screenshot 2017-01-24 14.17.29.png"></a></p>
<p>[Figure 2-12. A unary 1:M relationship]</p>
<p>Here it is in SQL. The <code class="prettyprint">NULL</code> keyword indicates that <code class="prettyprint">owner</code> is an optional field; it may contain NULL if a particular studio has no owner recorded in our database.</p>
<pre><code class="prettyprint">CREATE TABLE studio (
id integer PRIMARY KEY,
name text,
owner integer NULL REFERENCES studio(id) );
</code></pre>
<p>As in other cases, sample data may be helpful to illustrate how the implementation works.</p>
<p><a href="https://svbtleusercontent.com/hwccslvynkuug.png" rel="nofollow"><img src="https://svbtleusercontent.com/hwccslvynkuug_small.png" alt="Screenshot 2017-01-24 14.24.41.png"></a></p>
<p>[Figure 2-13. Sample data for a unary 1:M relationship]</p>
<p>A unary many-to-many relationship could easily represent ties of affiliation between individuals or organizations. Hollywood being the way it is, we could define a M:M relationship between Actors called “was married to”. Like binary M:M relationships, this would require an associative table (which may or may not be diagrammed). See Figure 2-14.</p>
<p><a href="https://svbtleusercontent.com/ufrlercbe4ahba.png" rel="nofollow"><img src="https://svbtleusercontent.com/ufrlercbe4ahba_small.png" alt="Screenshot 2017-01-24 14.34.35.png"></a></p>
<p>[Figure 2-14. Two ways to diagram Hollywood marriages as a unary M:M relationship.]</p>
<p>An implementation in SQL could look like this:</p>
<pre><code class="prettyprint">CREATE TABLE actor (
id integer PRIMARY KEY,
name text );
CREATE TABLE marriage (
id integer PRIMARY KEY,
husband integer REFERENCES actor(id),
wife integer REFERENCES actor(id) );
</code></pre>
<p>And some sample data illustrates how it works:</p>
<pre><code class="prettyprint">ch2=# select * from actor;
id | name
------+--------------------
9001 | Brad Pitt
9002 | Angelina Jolie
9003 | Jennifer Aniston
9004 | Billy Bob Thornton
(4 rows)
ch2=# select * from marriage;
id | husband | wife
----+---------+------
1 | 9001 | 9003
2 | 9004 | 9002
3 | 9001 | 9002
(3 rows)
</code></pre>
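<p>Querying a unary M:M relationship like this one means joining the “actor” table twice, once for each side of the marriage. A sketch (the table aliases “h” and “w” are my own):</p>
<pre><code class="prettyprint">SELECT h.name AS husband, w.name AS wife
FROM marriage
JOIN actor h ON marriage.husband = h.id
JOIN actor w ON marriage.wife = w.id;
</code></pre>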
<p>Finally, an example of a unary 1:1 relationship that might be found in our movie database is a reference from a sequel to its predecessor. This would be implemented as an optional column in the Movie table called “sequel_to”, referencing another row of the Movie table, like so:</p>
<pre><code class="prettyprint">ch2=# select * from movie;
id | title | sequel_to
----+-----------+-----------
1 | Rocky |
2 | Rocky II | 1
3 | Rocky III | 2
4 | Rocky IV | 3
5 | Rocky V | 4
(5 rows)
</code></pre>
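<p>The table definition behind this sample data might look like the following sketch (a real “movie” table, such as the one we’ll build in the lab, would have more columns):</p>
<pre><code class="prettyprint">CREATE TABLE movie (
id integer PRIMARY KEY,
title text,
sequel_to integer NULL REFERENCES movie(id) );
</code></pre>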
<h2 id="ternary-and-emnemary-relationships_2">Ternary and <em>n</em>-ary relationships <a class="head_anchor" href="#ternary-and-emnemary-relationships_2" rel="nofollow">#</a>
</h2>
<p>Virtually all of the relationships you’ll find in most E-R diagrams are binary (connecting two tables) or unary (connecting rows within one table). It is nevertheless possible, however uncommon, to imagine a relationship connecting three tables (“ternary”), four (“quaternary”), or any arbitrary number of tables (“<em>n</em>-ary”). This simply means that one of each entity type is required to create a case of the relationship.</p>
<p>In the context of movies, we might consider a distribution deal to be one such relationship. A movie distribution deal is an agreement between a producer (Studio) and a third party (Distributor) to distribute a movie (Movie) in a particular territory (Territory). Figure 2-15 illustrates this as a quaternary many-to-many-to-many-to-many (M:M:M:M) relationship.</p>
<p><a href="https://svbtleusercontent.com/4kwnhajrdqrdq.png" rel="nofollow"><img src="https://svbtleusercontent.com/4kwnhajrdqrdq_small.png" alt="Screenshot 2017-01-24 15.20.35.png"></a></p>
<p>[Figure 2-15. An example of a quaternary M:M:M:M relationship.]</p>
<p>It is rare to see this kind of relationship diagrammed, though. The only way to implement a relationship of degree > 2 is to use an associative table with foreign key columns for all the participants. Therefore, in most cases you will simply model a new associative entity—for example “Deal” or “Contract”—which contains details as well as 1:M relationships with each of the related tables.</p>
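<p>Such an associative table might be sketched in SQL like this; the “deal” table, its columns, and the “distributor” and “territory” tables it references are hypothetical:</p>
<pre><code class="prettyprint">CREATE TABLE deal (
id integer PRIMARY KEY,
movie_id integer REFERENCES movie(id),
studio_id integer REFERENCES studio(id),
distributor_id integer REFERENCES distributor(id),
territory_id integer REFERENCES territory(id) );
</code></pre>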
<h1 id="summary_1">Summary <a class="head_anchor" href="#summary_1" rel="nofollow">#</a>
</h1>
<p>The relational data model is one of several data modeling paradigms used by databases today, albeit the most popular one. At the heart of the relational model is a construct called a relation, which we typically call a “table”, although it has some strict constraints such that not just any table will qualify. The particular strength of the relational model is that it enables database users to efficiently execute queries that were not anticipated at the time the database was designed. This flexibility is obtained by breaking down the data into numerous tables, one for each “entity” or meaningful noun concept in the domain, and by relating these tables to each other by means of foreign key columns. The relationships that may be defined differ in cardinality and degree. Cardinality refers to how many rows of each table participate in a relationship, and the possibilities are one-to-many (1:M), many-to-many (M:M), and one-to-one (1:1). Degree refers to how many tables are related. Binary (two-table) and unary (one-table) relationships are the most common ones you will encounter. A relational database model can be visualized as an entity-relationship (E-R) diagram, and you are encouraged to become familiar with at least one E-R diagram notation, such as the “crow’s foot” notation summarized in Figure 2-16.</p>
<p><a href="https://svbtleusercontent.com/oqkigqpulncf2w.png" rel="nofollow"><img src="https://svbtleusercontent.com/oqkigqpulncf2w_small.png" alt="Screenshot 2017-01-24 15.49.34.png"></a></p>
<p>[Figure 2-16. Symbols in the crow’s foot notation for E-R diagrams]</p>
<h1 id="lab-2-creating-and-querying-a-relational-data_1">Lab 2: Creating and querying a relational database <a class="head_anchor" href="#lab-2-creating-and-querying-a-relational-data_1" rel="nofollow">#</a>
</h1>
<p>In Lab 1 you created a one-table database and were introduced to some of the basic things you can do with SQL queries. Now that you have learned the basics of how multiple tables can be related to one another within a relational database, you will want to see some richer and more realistic examples. Moreover, you’ll need to familiarize yourself with the main way that tables are queried <em>together</em>: the SQL <strong>join</strong>.</p>
<p>What we’ll do in this lab is:</p>
<ol>
<li>Create a new database called “lab2”.</li>
<li>Define several tables to learn the features of the CREATE TABLE command.</li>
<li>Learn how to code different types of JOIN queries.</li>
</ol>
<p>In order to get started, first create a new empty Postgres database called “lab2”, log in to your Postgres server using <code class="prettyprint">psql</code>, and switch your context to the new database—much as we did in Lab 1:</p>
<pre><code class="prettyprint">$ createdb lab2
$ psql
psql (9.6.1)
Type "help" for help.
joeclark=# \c lab2
You are now connected to database "lab2" as user "joeclark".
lab2=#
</code></pre>
<h2 id="the-create-table-command_2">The CREATE TABLE command <a class="head_anchor" href="#the-create-table-command_2" rel="nofollow">#</a>
</h2>
<p>SQL is divided into two main types of commands, called <strong>data definition language (DDL)</strong> and <strong>data manipulation language (DML)</strong> respectively. DDL commands are those used to design and structure the tables that constitute the database, and the chief among them is <code class="prettyprint">CREATE TABLE</code>. (Two others you may frequently encounter are <code class="prettyprint">ALTER TABLE</code> and <code class="prettyprint">DROP TABLE</code>.) The basic form of a <code class="prettyprint">CREATE TABLE</code> command in Postgres is as follows:</p>
<pre><code class="prettyprint">CREATE TABLE table_name (
column_name data_type [constraints/options],
column_name data_type [constraints/options],
...,
[constraints]
);
</code></pre>
<p>I’ve kept it simple here in order to make a clear introduction, and will reveal more options as we move on. The complete specification of <code class="prettyprint">CREATE TABLE</code> can be found in the <a href="https://www.postgresql.org/docs/current/static/sql-createtable.html" rel="nofollow">PostgreSQL online documentation</a>, which is excellent. You have already seen several examples of <code class="prettyprint">CREATE TABLE</code>, one of which I’ll reproduce here:</p>
<pre><code class="prettyprint">CREATE TABLE studio (
studio_id integer PRIMARY KEY,
name text
);
</code></pre>
<p><em>ASIDE: Postgres makes things easier if the names of your tables and columns are lowercase and contain no special symbols other than underscores and digits. If you want to use uppercase letters or spaces in these identifiers, it’s possible, but you must always remember to surround them with doublequotes, e.g., <code class="prettyprint">CREATE TABLE "Movie Studio"</code>. Different database systems have different naming conventions. See <a href="https://www.postgresql.org/docs/9.6/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS" rel="nofollow">the documentation</a> for details.</em></p>
<p>In this example you saw only one of PostgreSQL’s optional constraints: <code class="prettyprint">PRIMARY KEY</code>. Yes, a “PK” is a constraint on the data. By that we mean that it sets up a rule that will lead to errors if we try to insert bad data—here’s one of the advantages of databases as opposed to spreadsheets: they tell us when we make mistakes. In this case, the rule is the <strong>entity integrity rule</strong>, which dictates that the primary key column must contain a unique (not null) value for every row. Look what happens if I try to add two movie studios with the same “studio_id”:</p>
<pre><code class="prettyprint">lab2=# insert into studio (studio_id,name) values (1,'Disney');
INSERT 0 1
lab2=# insert into studio (studio_id,name) values (1,'Warner Bros');
ERROR: duplicate key value violates unique constraint "studio_pkey"
DETAIL: Key (studio_id)=(1) already exists.
</code></pre>
<p>Three other constraints you’ll see me use in this lab are <code class="prettyprint">NOT NULL</code>, which means a particular column can’t be empty; <code class="prettyprint">REFERENCES</code>, which sets up a foreign key relationship; and <code class="prettyprint">CHECK</code>, which allows for arbitrary validation of the data.</p>
<h2 id="developing-your-database-iteratively_2">Developing your database iteratively <a class="head_anchor" href="#developing-your-database-iteratively_2" rel="nofollow">#</a>
</h2>
<p>As you work through this and other labs, you’ll soon find that you’ve made mistakes or come up with better ideas, and want to start over. Typing those <code class="prettyprint">CREATE TABLE</code> commands into psql gets tedious and is error-prone. As in most other types of programming, the solution is to write your SQL code in a text file (you may call it a <strong>script</strong>) that you can save, modify, and re-use. You can tell Postgres to run the whole script at once with a one-line command. This allows you to iterate toward a design that works. </p>
<p>To continue with this lab, create a text file using any text editor you like, such as <a href="https://notepad-plus-plus.org/" rel="nofollow">Notepad++</a> on Windows, <a href="http://www.barebones.com/products/textwrangler/" rel="nofollow">TextWrangler</a> on the Mac, or <a href="http://www.vim.org/" rel="nofollow">Vim</a> for you Linux geeks. One trick that I find handy is to precede my <code class="prettyprint">CREATE TABLE</code> code with corresponding <code class="prettyprint">DROP TABLE</code> commands. That’s because I expect to run this script over and over again, tweaking it until I get it right, and you have to delete a table before you can (re)create one with the same name.</p>
<p>File <em>lab2.sql</em>:</p>
<pre><code class="prettyprint">DROP TABLE studio;
CREATE TABLE studio (
studio_id integer PRIMARY KEY,
name text
);
</code></pre>
<p>There are two ways to tell Postgres to run this script. From within the <code class="prettyprint">psql</code> environment, use the <code class="prettyprint">\i</code> command and the location of the script:</p>
<pre><code class="prettyprint">lab2=# \i c:/psql_scripts/lab2.sql
DROP TABLE
CREATE TABLE
</code></pre>
<p>Or if you’re not logged in to <code class="prettyprint">psql</code>, you can use your operating system’s command line, specifying the database name (after “<code class="prettyprint">-d</code>”) and the script file (after “<code class="prettyprint">-f</code>”):</p>
<pre><code class="prettyprint">$ psql -d lab2 -f lab2.sql
DROP TABLE
CREATE TABLE
</code></pre>
<p>If you made any mistakes, don’t worry about it! Correct your code and run the script again, as many times as you need until it works.</p>
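<p>One refinement you may find useful: the very first time you run the script, the <code class="prettyprint">DROP TABLE</code> command will itself produce an error, because there is no table to drop yet. Postgres supports an <code class="prettyprint">IF EXISTS</code> modifier that lets the command succeed quietly in that case:</p>
<pre><code class="prettyprint">DROP TABLE IF EXISTS studio;
</code></pre>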
<h2 id="data-types-in-postgresql_2">Data types in PostgreSQL <a class="head_anchor" href="#data-types-in-postgresql_2" rel="nofollow">#</a>
</h2>
<p>Recall that each column in a relation is defined by a domain. When defining a table, we constrain the domain of values that may be stored in a column by specifying a <strong>data type</strong>. In the “studio” table, you saw two data types used: the studio’s ID number is defined as an integer and its name as text. If you tried to insert the wrong type of data into either column (such as a decimal number, perhaps) you’d get an error. In order to design your database well, you should become familiar with the main data types available in Postgres.</p>
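<p>For example, an attempt to put text into the integer column fails immediately (the error message below is approximately what Postgres reports):</p>
<pre><code class="prettyprint">lab2=# insert into studio (studio_id, name) values ('one','Disney');
ERROR:  invalid input syntax for integer: "one"
</code></pre>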
<p><em>ASIDE: Perhaps because it is an open-source database and anyone can contribute to its development, Postgres offers a bewildering array of data types. The full list and details can be found in <a href="https://www.postgresql.org/docs/current/static/datatype.html" rel="nofollow">the online documentation</a>.</em></p>
<p>When it comes to numbers, there is a trade-off between precision, accuracy, and database performance. The <code class="prettyprint">integer</code> data type stores whole numbers in the range -2147483648 to +2147483647 with absolute precision but doesn’t allow fractions. The <code class="prettyprint">real</code> data type can hold a decimal number over a very wide range of magnitudes with about 6 significant digits of precision, but small inaccuracies can result from the necessity of rounding. Both require four bytes of storage on disk. A third type, <code class="prettyprint">numeric</code>, can be made precise and accurate up to an arbitrary number of digits before and after a decimal point, but it is larger on disk and hence slower to process. The <code class="prettyprint">numeric</code> type might be useful for storing currency amounts because inaccuracy cannot be allowed when counting money.</p>
<p><em>ASIDE: Alternatively, you might consider storing dollars as an <code class="prettyprint">integer</code> number—of cents! In U.S. currency there’s really no such thing as a fraction of a cent, so money values are actually whole numbers disguised as decimals, from a certain point of view. By the way, Postgres also offers a <code class="prettyprint">money</code> type which is essentially a dressed-up integer. Be careful with that type because it behaves differently depending on your “locale”: a US computer would display it as dollars, while a computer configured for another locale would show it in that locale’s currency, potentially confusing people.</em></p>
<p>A variation on <code class="prettyprint">integer</code> that will become very useful is the <code class="prettyprint">serial</code> type; this is just an integer type that can be automatically filled in, when a new database row is created, with the next whole number in sequence: 1, 2, 3, and so on. It can be very handy for a primary key column, where you don’t care about the actual value except that it must be unique.</p>
<p>Text in a computer is stored as a sequence or <strong>string</strong> of characters—mostly letters, numbers, spaces, and punctuation—and the data types for text differ in whether you want to constrain the length of the text. In standard SQL, the two main types are <code class="prettyprint">char(n)</code> and <code class="prettyprint">varchar(n)</code>. The <code class="prettyprint">char(n)</code> type specifies text that has exactly “n” characters. Like the bed of Procrustes, the database will cut off the end of a string that is too long, or stretch one out that is too short (by adding spaces to the end of it). You might use <code class="prettyprint">char(2)</code> to store a state abbreviation in an address. The <code class="prettyprint">varchar(n)</code> type holds text of any length up to “n”, so <code class="prettyprint">varchar(20)</code> is probably adequate to store last names, and <code class="prettyprint">varchar(64)</code> might be enough to accommodate titles of books. You might use <code class="prettyprint">varchar</code> if you need to limit the length of the data, for example to make sure it will look good on a computer screen or print on a shipping label.</p>
<p>In most databases, <code class="prettyprint">char</code> offers a performance advantage over <code class="prettyprint">varchar</code> when the data is predictably the same length: for each row of data, <code class="prettyprint">char(10)</code> would require 10 bytes while <code class="prettyprint">varchar(10)</code> would require about 2 bytes to say how long the text is plus 10 bytes to store the text. When the text is of widely varying length, <code class="prettyprint">varchar</code> would have an advantage because it doesn’t pad the shorter values with spaces. However, in Postgres, due to clever engineering there is actually no such performance trade-off. In fact, Postgres offers a third type, simply called <code class="prettyprint">text</code>, which allows character strings of unlimited length and is no slower than the others. So unless you have a reason to limit the length of a text value, or feel strongly about sticking to standard SQL, use Postgres’s <code class="prettyprint">text</code> type.</p>
<p>Some of the other data types you will often find useful are <code class="prettyprint">boolean</code> (a true/false value), <code class="prettyprint">date</code>, <code class="prettyprint">time</code>, and <code class="prettyprint">timestamp</code> (the latter combines a date and a time of day). Postgres is also known for offering some highly unorthodox (to the SQL world) data types, such as Arrays, XML documents, and JSON, but those are beyond the scope of this chapter.</p>
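<p>As a quick illustration, here is a hypothetical table (not part of our lab schema) combining a few of these types:</p>
<pre><code class="prettyprint">CREATE TABLE screening (
screening_id serial PRIMARY KEY,
shown_at timestamp,
is_premiere boolean );
</code></pre>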
<p>I’ve added the following to <em>lab2.sql</em>:</p>
<pre><code class="prettyprint">DROP TABLE person;
CREATE TABLE person (
person_id serial PRIMARY KEY,
first_name text,
last_name text NOT NULL,
sex char(1) CHECK (sex='M' or sex='F'),
birthdate date
);
</code></pre>
<p>You can see that I’ve used a few more features of the DDL here. I’ve added a <code class="prettyprint">NOT NULL</code> constraint to the “last_name” column, so it can’t be empty. (Any of the other fields will accept nulls, effectively making them “optional”.) The <code class="prettyprint">CHECK</code> constraint will validate data in its column according to any arbitrary logical test. In this case we’re checking that “sex” is either “M” or “F”; the table will throw an error message if you try to enter anything else. It <em>does</em> accept a null value, though. You also see examples in this code of the <code class="prettyprint">serial</code>, <code class="prettyprint">char</code>, and <code class="prettyprint">date</code> data types which we haven’t used before.</p>
<h2 id="referential-integrity_2">Referential integrity <a class="head_anchor" href="#referential-integrity_2" rel="nofollow">#</a>
</h2>
<p>Previously we saw the entity integrity rule in action—that every row must have a unique value for its PK. Another vitally important integrity constraint for relational databases is the <strong>referential integrity rule</strong>, which states that a foreign key value can only match a valid primary key value from the referenced table. (Nulls may be allowed by the database designer, but values that don’t match existing PKs cannot be.) It also happens that you cannot define a foreign key <em>column</em> if the referenced table doesn’t exist, so this rule has an effect on the sequence in which we must create the database and insert data.</p>
<p>Case in point: although we are building a database of movies, we could not create the “movie” table first, because we know it’s going to reference certain other tables such as “studio”. Now that “studio” exists, we can define “movie” like so:</p>
<pre><code class="prettyprint">DROP TABLE movie;
CREATE TABLE movie (
movie_id serial PRIMARY KEY,
title text,
year integer CHECK (year>1900 and year<2100),
rating text,
studio_id integer REFERENCES studio(studio_id),
director_id integer REFERENCES person(person_id)
);
</code></pre>
<p>If you try to create the “movie” table before the “studio” and “person” tables exist, PostgreSQL will refuse to do it, and give you an error message, because the foreign key constraints on the “studio_id” and “director_id” columns won’t make sense to it. What you might not have guessed is that if you try to <em>drop</em> the “studio” or “person” table after creating “movie”, you’ll also get an error message. When dropping tables, you must drop the referenc<em>ing</em> tables before you drop the referenc<em>ed</em> tables. In order for our script to work, we have to re-arrange it so that the command <code class="prettyprint">DROP TABLE movie;</code> comes before the other <code class="prettyprint">DROP</code> commands. My script now looks like this (with column definitions omitted):</p>
<pre><code class="prettyprint">DROP TABLE movie;
DROP TABLE studio;
DROP TABLE person;
CREATE TABLE studio ( ... );
CREATE TABLE person ( ... );
CREATE TABLE movie ( ... );
</code></pre>
<p>The last table created is the first table deleted.</p>
<p>Sequencing of <code class="prettyprint">INSERT</code> commands is also important. We cannot add a movie before its studio exists, because “studio_id” in the “movie” table must match a real “studio_id” in the “studio” table. Ditto for directors. (We <em>can</em> create a movie before its actors have been added, because there’s no direct FK relationship to actors.) </p>
<p>By the way, there are two common ways to write the <a href="https://www.postgresql.org/docs/current/static/sql-insert.html" rel="nofollow">PostgreSQL <code class="prettyprint">INSERT</code> command</a>: single-row and multi-row insertions. In either case, you first specify the columns to add data to, and then provide the values for the new row(s). Single-row insertions look like this:</p>
<pre><code class="prettyprint">INSERT INTO studio (studio_id, name) VALUES (1,'Disney');
INSERT INTO studio (studio_id, name) VALUES (2,'Paramount');
INSERT INTO studio (studio_id, name) VALUES (3,'Warner Bros');
INSERT INTO person (first_name, last_name, sex, birthdate)
VALUES ('Christopher','Nolan','M','1970-07-30');
INSERT INTO person (first_name, last_name, sex, birthdate)
VALUES ('Breck','Eisner','M','1970-12-24');
INSERT INTO person (first_name, last_name, sex, birthdate)
VALUES ('Brad','Bird','M','1957-09-24');
</code></pre>
<p><em>ASIDE: In Postgres you must use ‘singlequotes’ for text strings like these studio names, rather than “doublequotes”. The two types of quotation marks are not interchangeable. “Doublequotes” are used for identifiers of database objects, like table and column names. The <a href="https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html" rel="nofollow">PostgreSQL documentation on lexical structure</a> is a good read that will help you avoid some common errors like using the wrong punctuation.</em></p>
<p>Notice that when inserting to the “person” table, we didn’t specify a “person_id”. We could have if we’d wanted to, but because the PK is a <code class="prettyprint">serial</code> data type, it will automatically number the new rows for us. You can check the numbers with a simple <code class="prettyprint">SELECT</code> query in psql. Your numbers might be different from mine if you have created and deleted other data previously, so be sure to check:</p>
<pre><code class="prettyprint">lab2=# select * from person;
person_id | first_name | last_name | sex | birthdate
-----------+-------------+-----------+-----+------------
1 | Christopher | Nolan | M | 1970-07-30
2 | Breck | Eisner | M | 1970-12-24
3 | Brad | Bird | M | 1957-09-24
</code></pre>
<p>Multi-row insertions look like the following. Make sure you check the director’s “person_id” PKs and use the right ones in your code:</p>
<pre><code class="prettyprint">INSERT INTO movie (title,year,rating,studio_id,director_id) VALUES
('Sahara',2005,'PG-13',2,2),
('Interstellar',2014,'PG-13',2,1),
('Inception',2010,'PG-13',3,1),
('The Incredibles',2004,'PG',1,3),
('Ratatouille',2007,'G',1,3);
</code></pre>
<p>Commas separate each row’s values, and a semicolon ends the command.</p>
<p>To complete the database for our lab, let’s create the one-to-one and many-to-many relationships. The “script” table is in a 1:1 relationship with “movie”: there is zero or one script per movie. We will not include the full screenplay in this example database, just the screenwriter’s name.</p>
<pre><code class="prettyprint">CREATE TABLE script (
movie_id integer PRIMARY KEY REFERENCES movie(movie_id),
screenwriter text NOT NULL
);
INSERT INTO script (movie_id,screenwriter) VALUES (1,'Donnelly');
</code></pre>
<p>Actors are related to movies in this database via an M:M relationship: each actor may be in multiple movies and each movie may include multiple actors. As diagrammed in Figure 2-9, we implement this by creating an associative table called “role” which has foreign keys to both “person” and “movie”.</p>
<pre><code class="prettyprint">CREATE TABLE role (
role_id serial PRIMARY KEY,
movie_id integer REFERENCES movie(movie_id),
actor_id integer REFERENCES person(person_id),
character_name text
);
INSERT INTO person (first_name, last_name, sex, birthdate) VALUES
('Leonardo','DiCaprio','M','1974-11-11'),
('Joseph','Gordon-Levitt','M','1981-02-17'),
('Matthew','McConaughey','M','1969-11-04'),
('Anne','Hathaway','F','1982-11-12'),
('Penelope','Cruz','F','1974-04-28'),
('Lou','Romano','M','1972-04-15');
INSERT INTO role (movie_id, actor_id, character_name) VALUES
(1,6,'Dirk'), (1,8,'Eva'), (2,6,'Coop'), (2,7,'Brand'),
(3,4,'Cobb'), (3,5,'Arthur'), (4,9,'Bernie'), (5,9,'Linguini');
</code></pre>
<p>Feel free to add more movies and actors if you like. This example is just a tiny prototype of what you might find behind the scenes of IMDB.com. The complete code for my <em>lab2.sql</em> is available on <a href="https://github.com/joeclark-phd/databasebook-postgres" rel="nofollow">this book’s GitHub repo</a>. Next, we’ll start writing queries that join tables.</p>
<h2 id="queries-that-code-classprettyprintjoincode-ta_2">Queries that <code class="prettyprint">JOIN</code> tables <a class="head_anchor" href="#queries-that-code-classprettyprintjoincode-ta_2" rel="nofollow">#</a>
</h2>
<p>What makes a relational database more than just a collection of single-table databases is the capability to <strong>join</strong> tables and query them together. We can combine the “studio” and “movie” tables with a query like this one, which I’ll explain below:</p>
<pre><code class="prettyprint">SELECT title, year, rating, studio.name AS studio
FROM studio NATURAL JOIN movie;
</code></pre>
<p>Tables are joined using the <code class="prettyprint">FROM</code> clause. Instead of identifying one table, we can list two (or more) separated by commas. What the database does when you query multiple tables is first take the <strong>Cartesian product</strong> of the two. Essentially what this means is that it combines each row of the first table with each row of the second. If the “studio” table has three rows of data and “movie” has five, the Cartesian product has fifteen. Observe:</p>
<pre><code class="prettyprint">lab2=# SELECT name FROM studio;
name
-------------
Disney
Paramount
Warner Bros
(3 rows)
lab2=# SELECT title FROM movie;
title
-----------------
Sahara
Interstellar
Inception
The Incredibles
Ratatouille
(5 rows)
lab2=# SELECT name, title FROM studio, movie;
name | title
-------------+-----------------
Disney | Sahara
Disney | Interstellar
Disney | Inception
Disney | The Incredibles
Disney | Ratatouille
Paramount | Sahara
Paramount | Interstellar
Paramount | Inception
Paramount | The Incredibles
Paramount | Ratatouille
Warner Bros | Sahara
Warner Bros | Interstellar
Warner Bros | Inception
Warner Bros | The Incredibles
Warner Bros | Ratatouille
(15 rows)
</code></pre>
<p>Of course, most of these combinations don’t make any sense. <em>The Incredibles</em> is a Disney picture, so there’s no circumstance where you’d want a row matching it up with Paramount or Warner Bros. Recall from Chapter 1’s lab that the <code class="prettyprint">WHERE</code> clause allows us to filter the rows of a result. What you’d want to do now is to keep only those matchups where the PK “studio_id” of the “studio” table equals the FK “studio_id” of the “movie” table, like so:</p>
<pre><code class="prettyprint">lab2=# SELECT name, title FROM studio, movie
lab2-# WHERE studio.studio_id = movie.studio_id;
name | title
-------------+-----------------
Paramount | Sahara
Paramount | Interstellar
Warner Bros | Inception
Disney | The Incredibles
Disney | Ratatouille
(5 rows)
</code></pre>
<p>That’s more like it! In my <code class="prettyprint">WHERE</code> clause I identified the columns of interest by the combination of table name and column name, e.g., “<code class="prettyprint">studio.studio_id</code>”. This qualification is only necessary when the column name alone would be ambiguous, that is, when two or more tables have a column name in common; in other cases it’s simply an option available to you.</p>
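<p>When the qualified names get repetitive, SQL lets you give each table a short alias in the <code class="prettyprint">FROM</code> clause. The following is equivalent to the query above:</p>
<pre><code class="prettyprint">SELECT name, title FROM studio s, movie m
WHERE s.studio_id = m.studio_id;
</code></pre>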
<p>SQL also offers a <code class="prettyprint">JOIN</code> keyword that you can use to make it more explicit what you’re doing. The following two SQL statements do exactly the same thing:</p>
<pre><code class="prettyprint">SELECT * FROM studio, movie WHERE studio.studio_id = movie.movie_id;
SELECT * FROM studio JOIN movie ON studio.studio_id = movie.movie_id;
</code></pre>
<p>This situation—a join on a one-to-many relationship where the FK and PK columns have exactly the same names—is so common that it’s known to SQL users as a <strong>natural join</strong>, and in fact, PostgreSQL offers a keyword for it so you don’t have to do the tedious typing of the equality condition. (Be aware that <code class="prettyprint">NATURAL JOIN</code> matches on <em>every</em> pair of identically-named columns, so it can surprise you if two tables coincidentally share a column name.) The following command does the same thing as the two above:</p>
<pre><code class="prettyprint">SELECT * FROM studio NATURAL JOIN movie;
</code></pre>
<p>The result of this query keeps the same column names from the two joined tables, so the studio’s name appears in a column called simply “name”, which isn’t very descriptive. We can use the <code class="prettyprint">AS</code> keyword to rename a result column. Being more specific about what we want as the result brings us back to the first example above:</p>
<pre><code class="prettyprint">SELECT title, year, rating, studio.name AS studio
FROM studio NATURAL JOIN movie;
</code></pre>
<p>While we’re at it, why not pull in the director’s name, too? It’s also derived from a 1:M relationship, but the PK “person_id” and FK “director_id” aren’t the same, so we can’t use a natural join.</p>
<pre><code class="prettyprint">lab2=# SELECT title, year, rating, name AS studio, last_name AS director
lab2-# FROM studio NATURAL JOIN movie
lab2-# JOIN person ON person_id = director_id;
title | year | rating | studio | director
-----------------+------+--------+-------------+----------
Sahara | 2005 | PG-13 | Paramount | Eisner
Interstellar | 2014 | PG-13 | Paramount | Nolan
Inception | 2010 | PG-13 | Warner Bros | Nolan
The Incredibles | 2004 | PG | Disney | Bird
Ratatouille | 2007 | G | Disney | Bird
</code></pre>
<p>A many-to-many relationship, as we’ve seen, is effectively implemented as two one-to-many relationships between the principal tables and an associative table. To get all appearances of each actor in a movie’s cast, we use the “role” table as the associative one. Join “movie” to “role” and then to “person” (or do it the other way around).</p>
<pre><code class="prettyprint">SELECT first_name, last_name, title, character_name
FROM movie, role, person
WHERE movie.movie_id = role.movie_id
AND role.actor_id = person.person_id;
</code></pre>
<p>At the start of this chapter, I said that in an old-fashioned hierarchical database such as pictured in Figure 2-1, it would be very easy to query the database for a list of movies directed by Christopher Nolan, but quite difficult (computationally) to query for a list of movies in which Matthew McConaughey had appeared. Now we see that by breaking up the data so there’s one table for each entity, and utilizing the power of joins, either of those queries can be done in a brief SQL snippet:</p>
<pre><code class="prettyprint">lab2=# SELECT title
lab2-# FROM movie, role, person
lab2-# WHERE movie.movie_id = role.movie_id
lab2-# AND role.actor_id = person.person_id
lab2-# AND last_name = 'McConaughey';
title
--------------
Sahara
Interstellar
(2 rows)
lab2=# SELECT title
lab2-# FROM movie JOIN person
lab2-# ON movie.director_id = person.person_id
lab2-# WHERE last_name = 'Nolan';
title
--------------
Interstellar
Inception
(2 rows)
</code></pre>
<h2 id="inner-and-outer-joins_2">Inner and outer joins <a class="head_anchor" href="#inner-and-outer-joins_2" rel="nofollow">#</a>
</h2>
<p>You’ll see a lot more SQL tricks involving joins throughout this book. One last twist I’d like to add in this chapter is the concept of inner and outer joins. Natural joins, and indeed all of the joins so far, are <strong>inner joins</strong> because they only return those rows of the original tables that participate in the relationship. To illustrate this, look what happens when you join “movie” with “script”, keeping in mind that we only created one row of “script”:</p>
<pre><code class="prettyprint">lab2=# SELECT title, year, rating, screenwriter
lab2-# FROM movie NATURAL JOIN script;
title | year | rating | screenwriter
--------+------+--------+--------------
Sahara | 2005 | PG-13 | Donnelly
(1 row)
</code></pre>
<p>The four movies in our database that don’t have matching rows in “script” do not appear in the result. In any case where the relationship is optional, you can imagine a sort of Venn diagram: there may be several rows of Table A and several of Table B but only a few combinations where an “A” is related to a “B”. Sometimes, though, we want the complete set of rows of one of our tables. For example, we might want the list of all movies, showing the screenwriter’s name if we know it. That’s called an <strong>outer join</strong> and you can think of it as taking the entirety of one of the circles in the Venn diagram. An outer join is a <code class="prettyprint">LEFT OUTER JOIN</code> if the first table (the one mentioned before “<code class="prettyprint">JOIN</code>”) is the one you want to include all of. In this way, we can get the full list of movies with screenwriters’ names.</p>
<pre><code class="prettyprint">lab2=# SELECT title, year, rating, screenwriter
lab2-# FROM movie LEFT OUTER JOIN script
lab2-# ON script.movie_id = movie.movie_id;
title | year | rating | screenwriter
-----------------+------+--------+--------------
Sahara | 2005 | PG-13 | Donnelly
Interstellar | 2014 | PG-13 |
Ratatouille | 2007 | G |
The Incredibles | 2004 | PG |
Inception | 2010 | PG-13 |
(5 rows)
</code></pre>
<p>There are also <code class="prettyprint">RIGHT OUTER JOIN</code>s and <code class="prettyprint">FULL OUTER JOIN</code>s. For more on joins, see <a href="https://www.postgresql.org/docs/current/static/tutorial-join.html" rel="nofollow">the PostgreSQL documentation</a>. Can you imagine how you would code a <strong>self-join</strong>, which joins a table to itself?</p>
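<p>As a sketch of the full variant, using the same “movie” and “script” tables as above, a <code class="prettyprint">FULL OUTER JOIN</code> keeps the unmatched rows from <em>both</em> tables:</p>
<pre><code class="prettyprint">-- Keep every movie and every script, matched up where possible
SELECT title, year, rating, screenwriter
FROM movie FULL OUTER JOIN script
ON script.movie_id = movie.movie_id;
</code></pre>
<p>With our sample data this returns the same five rows as the left outer join, because the lone “script” row matches a movie; the difference would only show up if “script” contained rows that matched no movie.</p>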
<h1 id="challenges_1">Challenges <a class="head_anchor" href="#challenges_1" rel="nofollow">#</a>
</h1>
<ol>
<li>In the lab we constrained the “sex” column of the “person” table to be an uppercase “M” or “F”. If a lowercase value were provided, we would see an error. Can you rewrite the constraint such that it would accept either case?</li>
<li>Try to use <code class="prettyprint">ALTER TABLE</code> to add an ownership relationship to the “studio” table in our lab, as diagrammed in Figure 2-12. This is tricky because you already have some data in the table, so you need to think about how to avoid a referential integrity error from those pre-existing rows that don’t have FKs.</li>
</ol>
tag:joeclark.svbtle.com,2014:Post/the-purpose-of-excel2016-09-29T14:36:02-07:002016-09-29T14:36:02-07:00The Purpose of Excel<p>At lunch today, I was telling a colleague that in my “Introduction to MIS” course at UMaine, because the students will take a full-semester Excel course later, I have tried to demonstrate the business purpose for the software rather than the nuts-and-bolts of how to make it go.</p>
<p>His reply was, “that would be a great name for a course!” So today I’m thinking about how I might teach <strong>The Purpose of Excel</strong> as a university course. It may not be a course, but I’d bet I could come up with a few good chapters or essays on it.</p>
<h1 id="so-what-is-the-purpose-of-excel_1">So what is the purpose of Excel? <a class="head_anchor" href="#so-what-is-the-purpose-of-excel_1" rel="nofollow">#</a>
</h1>
<p>Excel is a great tool that can be used for everything from simple calculations (a substitute for a calculator) to graphic design (anything consisting of rectangles). Its business purpose, though, is to aid people in making decisions by creating simple models, which might better be called simulations, of business scenarios, and enabling <strong>what-if analysis</strong>.</p>
<p><a href="https://svbtleusercontent.com/6jtd28fxnppga.png" rel="nofollow"><img src="https://svbtleusercontent.com/6jtd28fxnppga_small.png" alt="ciderwhatif.png"></a></p>
<p>The way I like to set up a spreadsheet is to have all of the parameters that I might change, or experiment with, in a grouping by themselves, as at the top of the figure above. These numbers are used in calculations, which may form one or more complicated data tables leading up to some kind of success measures such as profit, cash flow, production, or ROI.</p>
<p>What is absolutely vital to this kind of decision analysis is that users can change the parameters—the inputs—and see how it affects the outcomes of their model or simulation. Without that, the spreadsheet is just a curiosity. <em>With</em> that capability, you can consider alternative assumptions and <em>instantly</em> see the results of all the calculations. You can quickly compare a best-case scenario, a worst-case scenario, and a most-likely scenario.</p>
<p>Now, there are a lot of neat things you can do in Excel such as Monte Carlo simulations, dynamic programming with macros, and all kinds of advanced math, but these are all in support of the basic business proposition that you can simulate different business “worlds” in order to think, or argue, mathematically about what decision you should make.</p>
<p>This, for me, is the purpose of Excel.</p>
tag:joeclark.svbtle.com,2014:Post/how-databases-fit-in2016-05-24T16:36:11-07:002016-05-24T16:36:11-07:00How databases fit in<p>This is the first (draft) chapter of a new book on relational databases (using Postgres) that I’m working on as a summer project. Stay tuned for additional chapters. The book under development can also be viewed <a href="https://leanpub.com/relating-to-the-database" rel="nofollow">at Leanpub</a>, which supports commenting, and also will allow me to bundle the book with video lectures. I appreciate your feedback!</p>
<h1 id="introduction_1">Introduction <a class="head_anchor" href="#introduction_1" rel="nofollow">#</a>
</h1>
<p>Imagine that you work in a small direct-response mail order company that takes orders from customers by phone. Each agent in the call center downstairs has a stack of paper order forms on his desk, and when he receives an order he writes down the product name(s), quantity ordered, and the customer’s address and payment information. He uses a calculator or computer to sum up the order total, and tells the customer how much they’ll be charged. </p>
<p>Periodically, a data entry worker visits the call center and picks up stacks of filled-in order forms. She enters each order’s details into a file on her desktop computer, perhaps a big Excel spreadsheet that she’s designed herself for this task. At the end of the day, when all the orders are entered, she sends the complete file to two other departments: fulfillment, which processes the customer’s credit card and packs and ships the orders, and accounting, which calculates each salesperson’s commission.</p>
<p>This is the kind of process that a small business might develop when it’s first getting started, and in fact, it’s exactly the process that I encountered when I was first learning about databases at a small company in Maine. Unfortunately, simple processes like this tend to get complicated as the company gets bigger, and can become impossible to maintain. Just a few of the challenges this company might face are:</p>
<ul>
<li><p>When the business grows to the point that multiple data entry workers are needed, they must coordinate their work somehow. Perhaps each worker creates her own file, and they must combine them at the end of the day. There are many opportunities for errors to enter the system.</p></li>
<li><p>If a payment is declined, or if a customer returns an order, one of the old spreadsheets must be updated, but which one? The data is kept in the order it was entered, not alphabetized by customer, and if there are now multiple spreadsheets for each day of business, searching for the old order is a big challenge.</p></li>
<li><p>As a result of the update made by fulfillment or customer service, multiple versions of the data exist. Accounting still has the old data, and they need to be notified of the change so they can correct the salesperson’s commission. Even if they’re sent the modified file, how do they know which row(s) are changed? And as more changes are made over time, who is responsible for keeping the official record?</p></li>
<li><p>As the business grows, the order entry data may change. For example, product numbers or names may change, discounts or coupon codes may be introduced, or some kind of membership number may be offered to frequent customers. As the paper order forms change, the spreadsheets must also be modified. Consequently, today’s data files look different from last month’s and even more different from last year’s. As people change jobs and leave the company, can their replacements even read the older data?</p></li>
</ul>
<p>As this small business becomes a medium-sized enterprise, the seeming convenience of using a spreadsheet can become a nightmare of data management. Before long, the data entry workers, accountants, and other departments may find they spend more time managing spreadsheets than doing their main jobs. And sooner or later, management may wish to use the historical order data for a new purpose, such as customer relationship management (CRM) or business intelligence (BI). What they’ll find is that the data is spread out over a huge number of files, with multiple versions of each file existing in different places, and older files having a different structure and meaning from newer files. What a mess!</p>
<blockquote>
<p><strong>Aside</strong>: There are lots of other problems we could imagine in this scenario, but I wanted to keep it brief. One that ought to be mentioned is the security disaster posed by this system. The spreadsheet in my story contains customers’ credit card numbers as well as personally identifying information, and it’s being passed around willy-nilly between departments. Moreover, any employee with a grievance could walk out the door with all of the data on a thumb drive, and who would know?</p>
</blockquote><h1 id="a-database_1">A database <a class="head_anchor" href="#a-database_1" rel="nofollow">#</a>
</h1>
<p>Imagine instead if there was a <strong>black box</strong> in the office into which all those order forms were fed. At any time, a person could ask the black box to retrieve any order record (or list of records) by time and date, by customer name, by product, or another attribute. Fulfillment could ask for the payments waiting to be processed, and it would get a printout of exactly that. Shipping could request invoices and mailing labels for orders ready to ship. Accounting could ask for the sum of order totals taken by each salesperson for a given time period. Moreover, changes could be filed in the black box and all subsequent requests would include the up-to-date, corrected information. The black box serves as the manager of, and the official system of record for, all the company’s order data.</p>
<p>That’s the big idea of a <strong>database</strong>. Instead of having every person or department or program keep its own copy of the data, a database serves as a system of record, a “single source of truth” that can always be accessed by everyone who needs it for their different purposes. A database stores some knowledge about the data’s structure and meaning, or <strong>metadata</strong>, so diverse users can know what they’re looking at. And most importantly, a database offers flexible but easy-to-use <strong>query</strong> methods so that users can request just the data they want, whether it’s a single record, a collection of data, or an aggregation into averages, counts, or sums.</p>
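<p>To give a taste of what such a query looks like, here is the accounting department’s request—the sum of order totals per salesperson—expressed in SQL, the query language covered throughout this book. (The “orders” table and its column names are hypothetical, invented just for this illustration.)</p>
<pre><code class="prettyprint">-- Aggregate order totals by salesperson, as accounting might request
SELECT salesperson, SUM(order_total) AS total_sales
FROM orders
GROUP BY salesperson;
</code></pre>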
<h1 id="a-variety-of-databases_1">A variety of databases <a class="head_anchor" href="#a-variety-of-databases_1" rel="nofollow">#</a>
</h1>
<p>Databases come in many shapes and sizes and they serve a lot of different purposes. Most often, a database acts as a <strong>server</strong>, that is, a software program that is always on, waiting for requests and responding to them. It’s necessary for databases to always be on and available if people and other software systems are going to depend on them to store data. (Otherwise, those people and programs will have to store their own data locally, which defeats the purpose of a database: independent and shared data management.) </p>
<p>A <strong>database administrator</strong> (DBA) is therefore responsible for a key component of a business’s IT infrastructure. If the database goes down, a lot of other programs go down, so the DBA ought to learn best practices for managing security, keeping backups and recovering from disasters. Maintaining and upgrading a database has been compared to maintaining a sailing ship when it is out to sea. The DBA can’t just take the database out of service to work on it however he wants; the business must stay afloat.</p>
<p>Server-based databases vary widely in scale and scope. Some databases support a single application, such as a dynamic website, and these may run on the same physical computer as the application’s code. A larger database might run on a dedicated machine that multiple users access over the network; for example, a company may keep a database of customer relationship information which can be accessed by sales, marketing, and customer service systems. </p>
<blockquote>
<p><strong>Aside:</strong> The term “server” means a software or hardware system that is constantly listening for requests or commands and reacting to them. When a software server runs on a computer dedicated to that purpose, the computer itself is also called a “server”.</p>
</blockquote>
<p>Larger still are <strong>enterprise-scale</strong> databases that integrate a wide variety of subject areas. This category includes enterprise resource planning (ERP) systems that integrate several areas of business operations, and enterprise data warehouses (EDW) used for analysis and reporting of business performance. Enterprise-scale databases may run on mainframes or may be distributed over large clusters of dozens or hundreds of computers.</p>
<blockquote>
<p><strong>Aside</strong>: It is worth noting that there are also what we may call “personal” or “desktop” databases that run on personal computers and do not remain active when not in use. These databases are created with programs such as SQLite and Microsoft Access (the two you’re most likely to encounter) and are saved as files. They have many of the features for structuring and querying data that you’ll learn from this book, such as the SQL query language, but for business use they would still pose most of the same problems as the spreadsheets in my opening vignette. You would still end up with lots of different versions of the same (database) files with inconsistent data kept by different people and departments. </p>
<p>If your goal is simply to learn SQL or relational data modeling, though, SQLite and Access are both fine choices.</p>
</blockquote><h1 id="what-you39ll-learn-from-this-book_1">What you’ll learn from this book <a class="head_anchor" href="#what-you39ll-learn-from-this-book_1" rel="nofollow">#</a>
</h1>
<p>This book will introduce you to relational databases, with data modeling and SQL first and foremost. It covers the scope of a typical first database course in an information systems or analytics program at the university level. It can be used as a textbook for an instructor-led course—instructors, please contact the author for an instructor’s guide, slides and materials—or used for self-guided study with or without the video lectures produced by the author (coming soon via Leanpub).</p>
<p>My main goal in creating this book is to make data modeling and SQL understandable to the reader, so it may serve as a good low-cost supplement for students who are struggling with a theory-heavy textbook and having a hard time getting the point. Instead of starting with loads of theory up-front, I’ll take a more pragmatic approach based on relational data modeling “patterns” and examples. </p>
<p>This book will also provide a lot of practice using SQL, the structured query language common to all relational databases, because the most frequent feedback I’ve heard from my students at Arizona State (and from companies that hire them) is that they need more practice with SQL.</p>
<p>The database of choice for this book is PostgreSQL (often nicknamed “Postgres”), an open-source database that has become quite popular with developers in recent years. Compared to some other popular databases like SQL Server and MySQL, there aren’t many good books about Postgres, so I hope this book will be valuable if only for its examples. Postgres makes a good choice for teaching because it is free software (both “free as in free speech” and “free as in free beer”), and because it runs on all the major platforms—Windows, Mac OS, and Linux—so you can follow this book no matter what kind of computer you have handy.</p>
<p>Rest assured that the lessons of this book are transferable to other relational databases. Each of the major brands has its own quirks and special features, but this book mainly covers the fundamentals that apply everywhere. As currently planned, one chapter will exhibit some of PostgreSQL’s special features.</p>
<blockquote>
<p><strong>Aside:</strong> If this book finds any significant success with readers, I fully intend to create additional versions of the text that highlight SQLite, MySQL, Access, or whatever other platforms people are interested in. Give me feedback!</p>
</blockquote><h1 id="lab-1-your-first-postgresql-database_1">Lab 1: your first PostgreSQL database <a class="head_anchor" href="#lab-1-your-first-postgresql-database_1" rel="nofollow">#</a>
</h1><h2 id="up-and-running-with-postgres_2">Up and running with Postgres <a class="head_anchor" href="#up-and-running-with-postgres_2" rel="nofollow">#</a>
</h2>
<p>PostgreSQL is available for free at <a href="https://www.postgresql.org" rel="nofollow">www.postgresql.org</a> and is extremely well documented there. Installation instructions will vary depending on your platform, and should be pretty straightforward. You can probably accept all the default configuration options. Be sure to remember the password you set during installation. You’re up and running when you can enter the command <code class="prettyprint">psql -V</code> at your system’s command line, and the system responds by telling you the version of PostgreSQL installed. At the time of this writing, it looks like this for me:</p>
<pre><code class="prettyprint">$ psql -V
psql (PostgreSQL) 9.5.3
</code></pre>
<p>If that doesn’t make any sense to you, see <a href="https://leanpub.com/relating-to-the-database" rel="nofollow">Appendix A</a> for my detailed guide to installing Postgres on Windows, Mac, and Linux, or refer to the online <a href="https://www.postgresql.org/docs/" rel="nofollow">documentation</a>.</p>
<h2 id="relations-are-tables_2">Relations are tables <a class="head_anchor" href="#relations-are-tables_2" rel="nofollow">#</a>
</h2>
<p>Databases can be classified according to the types of abstractions they allow you to model your data with. In a relational database, data is modeled as a set of tables with structured rows and columns. Other data models are possible. In a <strong>document-oriented database</strong> such as MongoDB, data is modeled as documents with tree-like structures. In a <strong>graph database</strong> like Neo4J, data is structured as a network diagram (a mathematical graph) with nodes and edges. Compared to those newer forms, the relational model is far more commonly seen and better understood, and is the most versatile. Relational databases have been tried and tested in business for nearly four decades, and are probably the best tool for the job in all but a few specialized cases.</p>
<p>The tables you find in a relational database are properly called relations. A <strong>relation</strong> is not just any table; it is a construct found in set theory and is defined by the following characteristics:</p>
<ul>
<li>Every row has the same columns.</li>
<li>Column names must all be different.</li>
<li>Each column is defined to contain just one specific type of data.</li>
<li>Each row must be unique; usually we enforce this by adding a machine-generated ID number to each row. This column is known as the “primary key” column.</li>
<li>No inherent ordering of rows or columns, or other information about how to display the data, is stored in the table.</li>
</ul>
<p>Consequently, a relation is a simpler and less flexible structure than a table you might create in a spreadsheet program like Microsoft Excel. Spreadsheets allow you to mix data types, to have rows with different numbers of columns, and to decorate your data with <strong>display logic</strong> like fonts, colors, and sizing. Figures 1-1 and 1-2 illustrate the comparison with an example of sales data that might be recorded by a small outdoor sports mail-order business.</p>
<p><a href="https://svbtleusercontent.com/mekp8cbyzsesg.png" rel="nofollow"><img src="https://svbtleusercontent.com/mekp8cbyzsesg_small.png" alt="Figure 1-1: A typical spreadsheet data table"></a></p>
<p>In a spreadsheet the user can be lax in data entry, for example omitting the state “TN” when we all know where Nashville is, or entering a quotation mark (meaning “ditto”) instead of spelling something out. Data types unanticipated at the time the table was designed could be inserted freely; for example, a three-letter Canadian province abbreviation could be inserted into a column meant for two-letter US states. Although these are convenient for data entry, they may lead to problems for computer systems that want to use the data (for example, to print mailing labels). The spreadsheet user can also decorate the data with fonts, styles, sizes and colors in order to make it more readable, and he can add extra information like a “grand total” row.</p>
<p>As seen in Figure 1-2, a database table (or relation) is much more strictly defined. Data types must be specified in advance for each column, guaranteeing uniformity. That means special cases must be anticipated before they occur. In this example, the database designer specified that state abbreviations must be exactly two characters, and that the price may be <code class="prettyprint">numeric</code> (allowing fractions) rather than <code class="prettyprint">integer</code>. In order to guarantee that each row is unique, and therefore can be looked up, the database designer has added a primary key column and populated it with an auto-generated ID number.</p>
<p><a href="https://svbtleusercontent.com/i21pt23wa0wpfq.png" rel="nofollow"><img src="https://svbtleusercontent.com/i21pt23wa0wpfq_small.png" alt="Figure 1-2: The same table as it would exist in a relational database"></a></p>
<p>No other information is found in the rows of a relation except the data itself: not fonts and styles, and not even the sort order. Totals, averages, and the like wouldn’t be stored in the table either, because rows correspond only to individual data “records”. Aggregated values like totals and averages could be calculated in a query or perhaps stored in additional tables created specifically for the purpose.</p>
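<p>For instance, a “grand total” like the one in the spreadsheet would be computed on demand with a query along these lines (a sketch, assuming the Figure 1-2 table is named “purchases” with a numeric “price” column, as in the lab later in this chapter):</p>
<pre><code class="prettyprint">-- Compute the grand total on demand instead of storing it in the table
SELECT SUM(price) AS grand_total
FROM purchases;
</code></pre>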
<h2 id="creating-your-first-relation_2">Creating your first relation <a class="head_anchor" href="#creating-your-first-relation_2" rel="nofollow">#</a>
</h2>
<p>Let’s fire up PostgreSQL and create a table. (I probably won’t use the word “relation” much after this, except for a bit of theory in Chapter 2. Where I write “table”, you should be able to figure out what I mean.)</p>
<p>First, a note about the term “database”. As I have described it above, a database is a system that organizes and stores data and, importantly, makes it available to people who need to search or retrieve it. Others more precise than I will distinguish between the <em>database</em>, which is the organized data store, and the <em>database management system</em> (DBMS) which is a program like PostgreSQL that creates the database and grants access to it. When we call PostgreSQL (or Oracle, or SQL Server, etc.) a database, we are using the term more generally to include both the database and the DBMS, since they go together.</p>
<p>To understand how we interact with PostgreSQL, though, you need a third definition of the term. In PostgreSQL, a database is a <em>logical</em> subdivision of the data store, which in some other systems might be called a <em>tablespace</em>. You may create any number of tables grouped into databases on the same server. (For the purposes of this book’s labs, your personal computer is acting as a PostgreSQL server.) Table names must be unique within a database, but not within a server. If several examples in this book include a table called “customers”, you can avoid a conflict by creating a new database for each lab.</p>
<p>What we’ll do in this first lab, then, is:</p>
<ol>
<li>Create a PostgreSQL database called “lab1”</li>
<li>Log in to that database with <code class="prettyprint">psql</code>
</li>
<li>Create a table of Purchases</li>
<li>Query the single-table database with SQL</li>
</ol>
<h3 id="database-creation_3">Database creation <a class="head_anchor" href="#database-creation_3" rel="nofollow">#</a>
</h3>
<p>You can create a database from your operating system’s command line (i.e., before logging in to PostgreSQL with <code class="prettyprint">psql</code> or another front-end tool), by using the command <code class="prettyprint">createdb</code>. The basic structure of this command is <code class="prettyprint">createdb [OPTIONS] [DBNAME]</code>, and you can learn more by typing <code class="prettyprint">createdb --help</code> at the command prompt. The only optional parameter you need to specify is the identity of the database “user” that was created when you installed PostgreSQL. The user “postgres” is the superuser who has power to make any and all changes to the server, including creating databases. Thus, the following command creates a database called “lab1”:</p>
<pre><code class="prettyprint">$ createdb -U postgres lab1
</code></pre>
<p>If you don’t specify a username with the <code class="prettyprint">-U</code> parameter, <code class="prettyprint">createdb</code> tries to log in to PostgreSQL with your computer account’s username (in my case, “Joseph”). If I had set up such an account, <code class="prettyprint">createdb lab1</code> would work. But since I haven’t, it fails. One of the Challenges offered at the end of this chapter is to find out how to create a user account to make these commands less verbose.</p>
<p>If necessary, you can also drop (i.e., delete) the new database from the command line, with:</p>
<pre><code class="prettyprint">$ dropdb -U postgres lab1
</code></pre>
<p>For future reference, you can also create and delete databases using SQL once you’re logged in to <code class="prettyprint">psql</code>: the <code class="prettyprint">CREATE DATABASE</code> and <code class="prettyprint">DROP DATABASE</code> commands, respectively. One way or another, create that database, which will be home to your first table.</p>
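<p>For example, the following SQL statements, entered at the <code class="prettyprint">psql</code> prompt, do the same work as <code class="prettyprint">createdb</code> and <code class="prettyprint">dropdb</code>:</p>
<pre><code class="prettyprint">-- Create, then delete, the lab database from within psql
CREATE DATABASE lab1;
DROP DATABASE lab1;
</code></pre>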
<h3 id="introducing-code-classprettyprintpsqlcode_3">Introducing <code class="prettyprint">psql</code> <a class="head_anchor" href="#introducing-code-classprettyprintpsqlcode_3" rel="nofollow">#</a>
</h3>
<p>The command-line client for PostgreSQL is <code class="prettyprint">psql</code>, and like <code class="prettyprint">createdb</code>, it needs to know the username you want to connect with. Connect with <code class="prettyprint">psql -U postgres</code>. This will not open a new window, but rather you will see a brief welcome and the command prompt will be different from the operating system’s default prompt.</p>
<pre><code class="prettyprint">$ psql -U postgres
psql (9.5.3)
Type "help" for help.
postgres=#
</code></pre>
<p>The change in the command prompt means you’re in a different environment. Here, you can enter SQL queries or some commands specific to <code class="prettyprint">psql</code>. The first thing I’d recommend you do is type <code class="prettyprint">help</code>, which introduces you to a few of the latter. Most <code class="prettyprint">psql</code> commands begin with a backslash (\) and you can get a full listing of them by entering the command <code class="prettyprint">\?</code>. If you need to quit, <code class="prettyprint">\q</code> is the command for that. If you want to, take some time now to explore the lists of available SQL queries and <code class="prettyprint">psql</code> commands.</p>
<pre><code class="prettyprint">postgres=# help
You are using psql, the command-line interface to PostgreSQL.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
</code></pre>
<p>By default, when you start <code class="prettyprint">psql</code> you’re connecting to the default database, which like the superuser is called “postgres”. Any SQL commands you enter at the prompt will be executed on that database’s tables, which isn’t what you want. To switch over to the new database you created, use the <code class="prettyprint">\c</code> command:</p>
<pre><code class="prettyprint">postgres=# \c lab1
You are now connected to database "lab1" as user "postgres".
lab1=#
</code></pre>
<p>Notice that the prompt changes to tell you which database you’re working in. </p>
<h3 id="creating-a-table_3">Creating a table <a class="head_anchor" href="#creating-a-table_3" rel="nofollow">#</a>
</h3>
<p>To create a table in the “lab1” database, we use the aptly-named SQL <a href="https://www.postgresql.org/docs/9.5/static/sql-createtable.html" rel="nofollow"><code class="prettyprint">CREATE TABLE</code></a> command. I will expand on its usage in Chapter 2, but the basic form of it is as follows:</p>
<pre><code class="prettyprint">CREATE TABLE table_name (
column_name data_type [OPTIONS],
...
);
</code></pre>
<p>As I mentioned, one of the special characteristics of a relation is that each column allows data only of a specified type. PostgreSQL offers a number of built-in data types, such as <code class="prettyprint">numeric</code>, <code class="prettyprint">text</code>, <code class="prettyprint">date</code>, and more. I will discuss the choice of data type more in Chapters 2 and 8, but it need not delay an introductory example. There are many optional clauses available in the <code class="prettyprint">CREATE TABLE</code> statement which can be discussed later or looked up in the documentation; the only one we need now is <code class="prettyprint">PRIMARY KEY</code>, a flag which indicates that a particular column is going to contain unique values that may be used to look up specific rows later.</p>
<p>The command to create our table of Purchases is as follows. You may type this in at the <code class="prettyprint">psql</code> prompt, even if it spans several lines. The code won’t execute until the semicolon (<code class="prettyprint">;</code>) is reached. Mind the cases: in PostgreSQL the SQL keywords (i.e. <code class="prettyprint">CREATE TABLE</code>, <code class="prettyprint">PRIMARY KEY</code>, and the data types) may be uppercase or lowercase, but you should only use lower case letters and underscores (<code class="prettyprint">_</code>) for the table and column names. PostgreSQL isn’t very sensitive to whitespace, so you can enter this code all on one line, or spread out over several lines, with indentation and tabs if you want them.</p>
<pre><code class="prettyprint">CREATE TABLE purchases (
    order_id integer PRIMARY KEY,
    city text,
    state char(2),
    product text,
    category text,
    price numeric );
</code></pre>
<p>If the command succeeded, you’ll see “<code class="prettyprint">CREATE TABLE</code>” in the output. If there’s an error message instead, don’t worry, just try again. The most likely causes of errors are typos in the data types, the wrong number of commas, and uppercase letters in the table or column names. If the command worked but you defined the table incorrectly, the easiest solution is to start over by issuing the command <code class="prettyprint">DROP TABLE purchases;</code> and creating the table anew.</p>
<p>You can confirm that the table exists with the <code class="prettyprint">psql</code> command <code class="prettyprint">\dt</code>, which displays a table of all the tables in the currently selected database:</p>
<pre><code class="prettyprint">lab1=# \dt
        List of relations
 Schema |   Name    | Type  |  Owner
--------+-----------+-------+----------
 public | purchases | table | postgres
(1 row)
</code></pre>
<p>That’s all there is to defining a table, at least an empty one. In order for us to demonstrate some SQL queries, though, we’ll need to store some data in the table with the SQL <a href="https://www.postgresql.org/docs/9.5/static/sql-insert.html" rel="nofollow"><code class="prettyprint">INSERT</code></a> command. We’ll use the simplest form of this command, adding only one row at a time to the table, for example:</p>
<pre><code class="prettyprint">INSERT INTO purchases VALUES
(1001, 'Nashville', 'TN', 'Sea Kayak', 'Boating', 449);
</code></pre>
<p>Take note that text data must be wrapped in quotation marks (<code class="prettyprint">'</code>), and numbers must not.</p>
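<p>Incidentally, an <code class="prettyprint">INSERT</code> can add more than one row at a time by listing several comma-separated value lists. The extra rows here are invented for illustration:</p>
<pre><code class="prettyprint">INSERT INTO purchases VALUES
    (1002, 'Portland', 'ME', 'Canoe', 'Boating', 299),
    (1003, 'Austin', 'TX', 'Tent', 'Camping', 129);
</code></pre>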
<p>Writing <code class="prettyprint">INSERT</code> commands by hand will quickly become tiresome, and is not the usual mode of entering data into a real database. Typically the database will support software (such as a web app, or an enterprise system) that generates data insertion and update commands automatically. Another way we might load a lot of data quickly is to read in a file containing (presumably machine-generated) <code class="prettyprint">INSERT</code> commands. In <code class="prettyprint">psql</code> you can execute SQL commands from a file using the <code class="prettyprint">\i</code> command.</p>
<p>I have provided a script file containing 100 lines of purchase data on the GitHub repository that supports this book. You may find the file <code class="prettyprint">purchases.sql</code> at <a href="https://github.com/joeclark-phd/databasebook-postgres" rel="nofollow">https://github.com/joeclark-phd/databasebook-postgres</a>, in the “psql_scripts” folder. I have downloaded this file to my computer, a Windows laptop, and saved it in the directory <code class="prettyprint">C:/psql_scripts</code>, so for me the command looks like this:</p>
<pre><code class="prettyprint">lab1=# \i c:/psql_scripts/purchases.sql
INSERT 0 1
...
</code></pre>
<p>Be sure to use the correct file path for your operating system and the location where you downloaded the script.</p>
<p>Regardless of how you insert data into the table, please add at least several records so that you can try some meaningful queries in the next section. If you are experiencing errors with <code class="prettyprint">INSERT</code> commands, check that the number of values you’re inserting matches the number of columns in the table, that they’re in the right order, and that the <code class="prettyprint">order_id</code> number is unique for each inserted row. If you run into problems, you can empty the table by entering the command <code class="prettyprint">DELETE FROM purchases;</code> and start again.</p>
<h3 id="querying-your-data-with-sql_3">Querying your data with SQL <a class="head_anchor" href="#querying-your-data-with-sql_3" rel="nofollow">#</a>
</h3>
<p>SQL is the <strong>structured query language</strong> more or less common to all relational databases, and it really shines for its ability to extract just the data you want from a table or group of tables. What kinds of queries might you want to make of this data? You might want specific subsets of the data, such as all the orders for a particular product or in a particular state. Or you might want to aggregate the data, that is, sum or count or average them, perhaps in groups. Even with one simple table, there are quite a few ways to query it.</p>
<p>Let’s start with the basics. Your first query is the simplest: it just requests <em>all</em> the data.</p>
<pre><code class="prettyprint">SELECT * FROM purchases;
</code></pre>
<p>That’s quite a lot of rows, so I’ll give you a trick to shorten the results. Affix “LIMIT 10” (or any other number) to the end of the query to get only the first ten rows:</p>
<pre><code class="prettyprint">SELECT * FROM purchases
LIMIT 10;
</code></pre>
<p>The meaning of the “<code class="prettyprint">*</code>” is “all columns”. It’s possible to request only certain columns. For example, say you only want to know the cities and states that your customers are ordering from; specify the desired columns in the “SELECT” clause:</p>
<pre><code class="prettyprint">SELECT city, state
FROM purchases
LIMIT 10;
</code></pre>
<p>Most of the time you don’t want every row, but want to select a subset of the data. This is accomplished with the “WHERE” clause of a query. You may request a single row by its primary key, for example:</p>
<pre><code class="prettyprint">SELECT *
FROM purchases
WHERE order_id = 1011;
</code></pre>
<p>Or you may give criteria that qualify more than one row, if you want to see a specific subset. For example:</p>
<pre><code class="prettyprint">SELECT *
FROM purchases
WHERE state='ME';
</code></pre>
<p>The criteria don’t have to be “equality” conditions, by the way. We can also use numerical inequalities; any row for which the inequality is “true” will be returned:</p>
<pre><code class="prettyprint">SELECT city, state, product
FROM purchases
WHERE price > 500;
</code></pre>
<p>Another condition you might use, for a primitive text search, is “LIKE”. The “<code class="prettyprint">%</code>” character is a wildcard that matches any text. Thus, the following query returns all rows where the product name <em>ends</em> in “Kayak”; it would not match products with additional text <em>after</em> that word.</p>
<pre><code class="prettyprint">SELECT city, state, product
FROM purchases
WHERE product LIKE '%Kayak';
</code></pre>
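<p>To match “Kayak” anywhere in the product name, put the wildcard on both sides. Conditions can also be combined with <code class="prettyprint">AND</code> or <code class="prettyprint">OR</code>; for example, to find expensive kayaks of any kind:</p>
<pre><code class="prettyprint">SELECT city, state, product
FROM purchases
WHERE product LIKE '%Kayak%'
  AND price > 400;
</code></pre>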
<h3 id="aggregate-queries_3">Aggregate queries <a class="head_anchor" href="#aggregate-queries_3" rel="nofollow">#</a>
</h3>
<p>The queries above allow you to carve out subsets of the data by requesting only certain columns, certain rows, or both. In every example, though, the rows you get in the result are rows from the original table. Aggregate queries are those that generate data by combining the original rows via an <strong>aggregation function</strong>, usually <code class="prettyprint">SUM</code>, <code class="prettyprint">COUNT</code>, or <code class="prettyprint">AVG</code>. Obviously the sum of two rows is one row, and is not identical to either of the original rows. The following query gives you the total dollar amount of all purchases in the table:</p>
<pre><code class="prettyprint">SELECT SUM(price)
FROM purchases;
</code></pre>
<p>No matter how many rows were in the original table, the query above returns just one row. Say… how many rows <em>are</em> in the original table?</p>
<pre><code class="prettyprint">SELECT COUNT(order_id)
FROM purchases;
</code></pre>
<p>The <code class="prettyprint">COUNT</code> function actually counts the number of rows (where its argument is not null), not the number of unique values. If you used <code class="prettyprint">COUNT(state)</code> instead of <code class="prettyprint">COUNT(order_id)</code> you’d get the same result. Even if you have five hundred purchases in 50 states, the <code class="prettyprint">COUNT</code> would be 500, not 50. Since the choice of column rarely matters, it’s often easier to simply use <code class="prettyprint">COUNT(*)</code>, which counts rows unconditionally.</p>
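<p>If what you actually want is the number of unique values, SQL provides <code class="prettyprint">COUNT(DISTINCT ...)</code> for that. In the hypothetical five-hundred-purchases example, this query would return 50:</p>
<pre><code class="prettyprint">SELECT COUNT(DISTINCT state)
FROM purchases;
</code></pre>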
<p>A grand total (or count, or average) is interesting, but a lot of the time what we want to do is compare subtotals (or counts, or averages) for various groupings of the data. To do this, we introduce a “GROUP BY” clause. If we want to know how many purchases were made in each of several categories, we can group by the “category” column and count up the rows in each group:</p>
<pre><code class="prettyprint">SELECT category, COUNT(*)
FROM purchases
GROUP BY category;
</code></pre>
<p>If you want to know which products account for the largest portions of your revenue, you might group by product and sum the order prices. There are a lot of products, though, so it’s helpful to sort the results with an “ORDER BY” clause. Since the sum is the 2nd column of the result data, that’ll be “ORDER BY 2”.</p>
<pre><code class="prettyprint">SELECT product, SUM(price)
FROM purchases
GROUP BY product
ORDER BY 2;
</code></pre>
<p>To get just the top ten, here’s a trick: sort the data in descending (“DESC”) order, and “LIMIT” the results to just the first ten rows.</p>
<pre><code class="prettyprint">SELECT product, SUM(price)
FROM purchases
GROUP BY product
ORDER BY 2 DESC
LIMIT 10;
</code></pre>
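<p>If you want to experiment with these clauses outside of PostgreSQL, Python’s built-in <code class="prettyprint">sqlite3</code> module understands the same basic SQL. The rows below are invented, and SQLite’s dialect differs from PostgreSQL’s in small ways, but the query shape is identical:</p>
<pre><code class="prettyprint lang-python">import sqlite3

# An in-memory SQLite database stands in for the "lab1" database
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE purchases (
    order_id integer PRIMARY KEY,
    city text, state text, product text, category text, price numeric )""")

# A few invented rows in the same shape as the book's data
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?, ?, ?)", [
    (1001, 'Nashville', 'TN', 'Sea Kayak', 'Boating', 449),
    (1002, 'Portland', 'ME', 'Sea Kayak', 'Boating', 449),
    (1003, 'Portland', 'ME', 'Canoe', 'Boating', 299),
    (1004, 'Austin', 'TX', 'Tent', 'Camping', 129),
])

# Top products by revenue, largest first
top = conn.execute("""SELECT product, SUM(price)
                      FROM purchases
                      GROUP BY product
                      ORDER BY 2 DESC
                      LIMIT 10""").fetchall()
print(top)  # [('Sea Kayak', 898), ('Canoe', 299), ('Tent', 129)]
</code></pre>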
<p>In Chapter 3 and beyond you’ll learn a lot more SQL, such as how to create queries that “join” multiple tables, and how to write queries that employ nested sub-queries. Even in this example, though, you’ve seen several of the main parts of a <code class="prettyprint">SELECT</code> query, including the “WHERE”, “GROUP BY”, and “ORDER BY” clauses, and aggregate queries. You have begun to see that even a simple one-table database may be queried in several different ways, and that doing this with short SQL queries may be much easier than trying to wrangle the data in Excel.</p>
<p>I recommend that you attempt the exercises and challenges at the end of this chapter to get more practice with the basics of relational databases, SQL, and PostgreSQL specifically.</p>
<p><strong>Read more <a href="https://leanpub.com/relating-to-the-database" rel="nofollow">at Leanpub</a>.</strong></p>
tag:joeclark.svbtle.com,2014:Post/understanding-the-m-in-mvc2015-11-19T23:33:50-08:002015-11-19T23:33:50-08:00Understanding the "M" in MVC: a database nerd tries to learn how SQLAlchemy ORM fits in with Flask<p>Having cut my teeth as a web developer in the bad old days of spaghetti code (when PHP was the innovative new thing!), I came back to it after several years away and have discovered with delight the new species of web framework they call MVC: model, view, controller.</p>
<p>My favorite so far is <a href="http://flask.pocoo.org/" rel="nofollow">Flask</a>, a Python-based microframework. Unlike the leading Python framework (<a href="https://www.djangoproject.com/" rel="nofollow">Django</a>), it’s a minimalistic framework that doesn’t make many decisions for you. I like to go slow and figure things out on my own, so that’s perfect for me. A simple Flask application is structured like so:</p>
<pre><code class="prettyprint lang-python">from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "<h1>Hello World!</h1>"

@app.route('/user/<name>')
def user(name):
    return "<h1>Hello, %s!</h1>" % name

if __name__ == "__main__":
    app.run()
</code></pre>
<p>What’s great about this is that instead of having PHP or some other kind of code threaded into your front-end HTML code, and spread across as many pages as you have in your application, all of your business logic is neatly organized in a Python code file separate from display logic. In the example, each <code class="prettyprint">@app.route()</code> decorator specifies a URL pattern and the function below it defines the logic. (Instead of the crude <code class="prettyprint">return</code> instructions in the example, you would generally call a template file full of HTML to format and style the output.)</p>
<h1 id="the-trouble-i-have-with-the-quotmquot_1">The trouble I have with the “M” <a class="head_anchor" href="#the-trouble-i-have-with-the-quotmquot_1" rel="nofollow">#</a>
</h1>
<p>I have a great handle on the <strong>controller</strong> (the code file that handles the business logic) and on the <strong>views</strong> (the HTML and CSS templates). The “M” for <strong>model</strong> is what I don’t yet understand. Hopefully, by the time I finish this blog post, I’ll have worked it out.</p>
<p>Basically, as a database nerd who really cares about good SQL and likes to build clever logic into the database with stored procedures, triggers, recursive queries and so on, I expected to set up “routes” in Flask like this (from a project done by my students):</p>
<pre><code class="prettyprint lang-python">@app.route('/users')
def users():
    # do query
    db.execute("SELECT * FROM users ORDER BY timestamp")
    rows = db.cur.fetchall()
    # pass result to template
    return render_template("users.html", rows=rows)
</code></pre>
<p>Simple, right? A URL maps to a function, the function does some operations on the database (in this case a SELECT), and the results are fed into an HTML template which gets rendered as a response to the user.</p>
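<p>For completeness, the <code class="prettyprint">users.html</code> template in that example would be ordinary HTML with Jinja2 (Flask’s default template engine) placeholders. A hypothetical sketch:</p>
<pre><code class="prettyprint"><table>
{% for row in rows %}  {# each row is a tuple from fetchall() #}
  <tr><td>{{ row[0] }}</td><td>{{ row[1] }}</td></tr>
{% endfor %}
</table>
</code></pre>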
<p>But that’s not how the pros do it. What you’re advised to do is to use an object-relational mapper (ORM) like <a href="http://pythonhosted.org/Flask-SQLAlchemy/" rel="nofollow">SQLAlchemy ORM</a>. This means declaring your “database” “tables” as Python classes, trusting SQLAlchemy to manage the database for you, and calling SQL-like methods on the objects in order to use data in your controller. A “table” declaration might look like this:</p>
<pre><code class="prettyprint lang-python">class User(db.Model):
    __tablename__ = 'users'
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(80), unique=True)
    email = db.Column(db.String(120), unique=True)

    def __init__(self, username, email):
        self.username = username
        self.email = email
</code></pre>
<p>You create the table in the database with something like this:</p>
<pre><code class="prettyprint lang-python">db.create_all()
</code></pre>
<p>And you might use it in a view like so:</p>
<pre><code class="prettyprint lang-python">@app.route('/user/<username>')
def show_user_profile(username):
    user = User.query.filter_by(username=username).first_or_404()
    return render_template('profile.html', user=user)
</code></pre>
<p>There’s something here I must be missing. I see how this might be a convenience for somebody who knows Python but doesn’t know SQL, but if you care about your data (and I do) you’re giving up lots of control over table creation, keys, indexes, and other optimizations. You’re losing the ability to use views, triggers, and stored procedures to put logic into the application. And it looks like the Python application is going to be making changes to the database <em>structure</em> without asking or informing you about how it does that. Other than the ability to write queries in something that looks like Python, what’s the benefit?</p>
<h1 id="figuring-it-out_1">Figuring it out <a class="head_anchor" href="#figuring-it-out_1" rel="nofollow">#</a>
</h1>
<p>A few days ago, when I wrote the above, I had to assume I was wrong about some of this, because “models” are deemed a core component of the MVC concept, and lots of Flask developers swear by the SQLAlchemy ORM. The purpose of this blog post was to try to understand what this is all about, and to decide whether someone who <em>cares</em> about his database would hand over control to an ORM. In order to begin the learning process, I dove into Miguel Grinberg’s Flask tutorial (<a href="http://amzn.to/1MmCdzY" rel="nofollow">in book form</a>) and, suspending my inhibitions, worked through all of the examples despite the pull of what must no doubt be a millennia-old caveman instinct to just write raw SQL. This turned up a few clues to the popularity of the ORM approach, as follows, in roughly the order I discovered them.</p>
<h2 id="first-clue-interchangeable-databases_2">First clue: Interchangeable databases <a class="head_anchor" href="#first-clue-interchangeable-databases_2" rel="nofollow">#</a>
</h2>
<p>In Grinberg’s tutorial, you use a SQLite database for development and testing, then switch to PostgreSQL when you deploy your app to Heroku. It’s pretty cool that you don’t have to change any of the code to do this—all SQLAlchemy needs is a database URL from the DATABASE_URL environment variable, and it looks at the prefix (e.g., <code class="prettyprint">postgres://</code>) to determine what sort of database it’ll be working with. Or you can give it a default (say, SQLite) database for the development environment and let it use Postgres or whatever in staging and production, with a line in your configuration file like this:</p>
<pre><code class="prettyprint lang-python">SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL') or \
    'sqlite:///' + os.path.join(basedir, 'data-test.sqlite')
</code></pre>
<p>I like this feature a lot, but in order to take advantage of it, obviously I would have to limit myself to features that are common to all the databases—and forego using Postgres’s unique features like “recursive” queries.</p>
<h2 id="second-clue-39flaskmigrate39-wows-me-with-ver_2">Second clue: ‘flask-migrate’ wows me with version control <a class="head_anchor" href="#second-clue-39flaskmigrate39-wows-me-with-ver_2" rel="nofollow">#</a>
</h2>
<p>Database version control is a tricky problem (worth a blog post of its own) because the database can’t be torn down and re-created every time its structure needs to change. Because the data must persist, <a href="http://amzn.to/1HG86S2" rel="nofollow">database refactoring</a> is something like re-designing a ship while it’s out to sea. What you’d probably do is create an initial script to create the first version of the database, then for each change create a new script to update the schema, version control these scripts, and remember to run the scripts in order upon deploying a new instance of the application. If you want the ability to go backwards, to regress to earlier versions, it’s even more complicated because you’d need “downgrade” scripts as well.</p>
<p>Automating this would be amazing but seems quite difficult. However, in Grinberg’s tutorial, he demonstrates a utility called <a href="https://bitbucket.org/zzzeek/alembic" rel="nofollow">Alembic</a> with SQLAlchemy that does exactly that—automatically generate “upgrade” and “downgrade” scripts as the all-Python description of his data model changes. It also features a command-line “upgrade” that seems to know where you are in the sequence of scripts (so it doesn’t re-run old ones) and automatically applies the upgrade(s) necessary to bring your database up to the current version of the code. </p>
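<p>For the record, with plain Alembic the whole cycle boils down to two commands: one to auto-generate a migration script by comparing the models to the live database, and one to apply whatever scripts haven’t run yet:</p>
<pre><code class="prettyprint">$ alembic revision --autogenerate -m "add posts table"
$ alembic upgrade head
</code></pre>
<p>(Grinberg’s tutorial drives Alembic through the Flask-Migrate extension rather than directly, but the underlying operations are these.)</p>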
<p>In a few lines of code, the database version control problem is solved! Also, because it is Python and not SQL, the upgrade and downgrade scripts can be used for any type of database. This could be a great convenience if you are developing initially for an open-source software stack and eventually want to port your application to, oh, Oracle or I don’t know what.</p>
<p>There are some caveats, though. First, it’s easy to mess it up. I found that I occasionally used the <code class="prettyprint">drop_all()</code> and <code class="prettyprint">create_all()</code> commands while developing the application and testing, and these caused some quirks. Alembic wouldn’t create an upgrade script if by running <code class="prettyprint">create_all()</code> I had already created the new tables. (It would say to itself, “the database matches the code, therefore nothing needs to be updated”.) Also, I understand there are some kinds of schema changes it can’t detect, and moreover, I still see that I would have to forego using any of PostgreSQL’s unique features to take advantage of this database-agnostic tool.</p>
<p>Which makes me wonder—are there any alternatives that are specific to Postgres, and can give me the benefit of “migrations” without the trade-off?</p>
<h2 id="third-clue-and-first-that-really-pertains-to_2">Third clue (and first that really pertains to database abstraction): the “backref” in 1:M relationships <a class="head_anchor" href="#third-clue-and-first-that-really-pertains-to_2" rel="nofollow">#</a>
</h2>
<p>I found a podcast in which Mike Bayer, creator of SQLAlchemy, was <a href="http://talkpython.fm/episodes/show/5/sqlalchemy-and-data-access-in-python" rel="nofollow">interviewed in April 2015</a> (cf. around 44:00). He points to one very useful mapping that relates to one-to-many relationships between tables. A 1:M relationship in the database is really only seen from the M:1 side: it’s the “child” entity that holds a foreign key reference to the “parent”. However, in object-oriented programming, we typically want to reference the child objects <em>from</em> the parent object. We want to see an author’s blog posts just as often as we want to see a blog post’s author.</p>
<p>By representing the Author and the Post “models” as Python classes, a <code class="prettyprint">.posts</code> attribute can be added to the Author class which looks like a database field but actually returns the results of a separate query. What’s more, SQLAlchemy allows us to do this without writing a lot of extra code. We simply assign the “backref” argument in the method that creates a foreign key relationship. When I saw this, I thought it was pretty slick.</p>
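<p>In Flask-SQLAlchemy terms the pattern is a single keyword argument on the “many” side. This untested sketch uses invented Author and Post models just to show the shape:</p>
<pre><code class="prettyprint lang-python">class Author(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.Text)

class Post(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    author_id = db.Column(db.Integer, db.ForeignKey("author.id"))
    # backref="posts" adds a .posts attribute to Author, even though
    # the foreign key lives on the Post side of the relationship
    author = db.relationship("Author", backref="posts")

# an_author.posts then runs the "child" query behind the scenes
</code></pre>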
<h2 id="more-clues-and-the-discovery-of-sqlalchemy-qu_2">More clues, and the discovery of SQLAlchemy “Core” <a class="head_anchor" href="#more-clues-and-the-discovery-of-sqlalchemy-qu_2" rel="nofollow">#</a>
</h2>
<p>A number of other convenient little functions and methods worked their way into the application that show the benefit of having an object “wrapper” around a database entity. Methods like <code class="prettyprint">get_or_404()</code> simplify error handling when a user requests a non-existent resource. Custom “set” methods like <code class="prettyprint">change_email()</code> can be created to perform validation tasks like sending a confirmation e-mail. Add-ons like <a href="https://github.com/maxcountryman/flask-login/" rel="nofollow">Flask-Login</a> expect certain methods like <code class="prettyprint">is_authenticated()</code> and <code class="prettyprint">get_id()</code> and make it easy to implement user accounts.</p>
<p>By the time I finished the tutorial, I saw that creating object models for my entities could be a very powerful tool for organizing code that operates on the data. The User entity in my tutorial app, for example, has numerous methods for everything from checking permissions to sending account confirmation e-mails and outputting data about the user as JSON. Putting all of that in a “models.py” package makes my controllers much cleaner.</p>
<p>However, as I researched SQLAlchemy more, I discovered that there are in fact two major parts of it: the ORM, and the underlying layer called SQLAlchemy Core. I could take advantage of SQLAlchemy Core, which neither maps objects to tables automatically nor auto-generates SQL, without having to use the ORM. Furthermore, I could skip SQLAlchemy entirely and simply use psycopg2, the Python-Postgres connector, while keeping the concept of a model object that wraps my database operations for a particular entity. So the question has changed:</p>
<h1 id="given-that-i39m-going-to-embrace-quotmodelquo_1">Given that I’m going to embrace “Model”, do I implement it as ORM, use an engine like SQLAlchemy Core, or write my own? <a class="head_anchor" href="#given-that-i39m-going-to-embrace-quotmodelquo_1" rel="nofollow">#</a>
</h1>
<p>Here’s what the ORM layer seems to give me: it allows me to define a class with a few attributes that SQLAlchemy interprets as columns, and which Alembic will allow me to automatically create as database tables. The first few lines of my Post model, for example, show the definition of a table with a foreign key to the Users table, a “backref” to the Comments table, and some other columns:</p>
<pre><code class="prettyprint">class Post(db.Model):
    __tablename__ = "posts"
    id = db.Column(db.Integer, primary_key=True)
    body = db.Column(db.Text)
    timestamp = db.Column(db.DateTime, index=True,
                          default=datetime.utcnow)
    author_id = db.Column(db.Integer, db.ForeignKey("users.id"))
    body_html = db.Column(db.Text)
    # comments
    comments = db.relationship("Comment", backref="post",
                               lazy="dynamic")
</code></pre>
<p>One of the nice things that I can do with this object is make a change to a row in the database by simply assigning a new value to an attribute of a Post object. That reduces the amount of SQL I would have to write for “get” and “set” methods. However, it takes away my control of data definition (DDL) and, presumably, prevents me from using a view in place of a table. It would be somewhat tedious, but not extremely so, to write my own getters and setters in SQL. Also, character for character, the ORM approach really doesn’t seem to make the code any shorter. Consider <a href="https://github.com/joeclark-phd/flaskula/blob/master/app/models.py" rel="nofollow">this model-definition code</a> from my tutorial app.</p>
<h2 id="the-sqlalchemy-core-as-an-option_2">The SQLAlchemy core as an option <a class="head_anchor" href="#the-sqlalchemy-core-as-an-option_2" rel="nofollow">#</a>
</h2>
<p>SQLAlchemy Core seems to provide this option. I could write my own plain-text queries (if I didn’t care about cross-database compatibility) <em>or</em> use the core’s SQL-like query syntax to maintain database-agnosticism. This could be a very attractive middle ground. SQLAlchemy Core still does things like simplifying database connection and managing a connection pool when multiple concurrent users are online. I believe that what I’d lose with this approach is the slick version-control functionality of Alembic. [<strong>Update: No, I wouldn’t. <a href="https://twitter.com/zzzeek/status/668149034431987712" rel="nofollow">Mike Bayer informs me via Twitter</a> that Alembic works on the core.</strong>]</p>
<p>Compared to “rolling my own”, this seems like the smart way to go. Not only does it automate things like opening and closing of connections, it also allows me to take advantage of a mix of higher-level and lower-level abstractions as I need them. Since it also allows me to bypass abstraction entirely and go directly to raw SQL when I need to, I can’t see any advantage of writing my own database layer with plain psycopg2.</p>
<h1 id="three-big-questions_1">Three big questions <a class="head_anchor" href="#three-big-questions_1" rel="nofollow">#</a>
</h1>
<p>In sum, I have seen the benefit of creating “model” objects to hold a variety of methods and helper-functions that correspond to database entities, and I have seen some benefits of the ORM: notably, that by giving up the freedom to use database-specific features, you can get automated database version control. But what should I use in my own app development? I’ve boiled it down to three questions for myself:</p>
<ol>
<li>Do I want to use Postgres-only features?
<ul>
<li>And am I willing to trade database agnosticism and version control for them?</li>
</ul>
</li>
<li>Do I want to put logic into the database (views, triggers, stored procedures)?</li>
<li>Can I find substitutes for the features I’d give up by using Core, such as automated version control?</li>
</ol>
<p>On balance, it seems like if I’m developing a simple CRUD application with no complex relationships between entities, and I want to get it up and running quickly with version control and continuous integration, ORM does make sense. I love database optimization, but it’s a craft that takes time, and may not be necessary for every project. For more complicated systems, I find that I want to put logic into the database such as views and triggers. The reasoning for that is to prevent code duplication and inconsistency with other, future applications that access the same data. A trigger, for example, can be thought of as a special type of DDL.</p>
<p>However, in the newer world of <a href="https://medium.com/@joeclark.phd/five-reasons-to-try-api-first-development-4b53f28f646d#.9ziuyen24" rel="nofollow">API-first development</a>, it may make more sense to have my web app itself expose an API which future applications would build on. If that’s the case, then the models I implement in Python would define in one place the logic that all database users are going to need. This is a weak argument, perhaps.</p>
<p>In my next project, I think, I’ll use SQLAlchemy Core without the highest-level abstractions of the ORM, in order to better gauge the difference. I’m curious to see exactly how tedious it gets writing pure SQL; frankly, I don’t think it could be worse than writing the Python equivalent. However, I can say that as a result of this experience I do appreciate the role of Models and of ORM, even though I’m not personally ready to give up the unique features of my favorite relational database. </p>
tag:joeclark.svbtle.com,2014:Post/is-science-design2015-11-04T13:49:23-08:002015-11-04T13:49:23-08:00Is science design? Prototyping in academic research.<p>I have argued that <a href="http://joeclark.svbtle.com/design-is-a-science" rel="nofollow">design is a science</a>, with particular reference to information systems development. There is a philosophy and a body of theory on how to design software, data, and socio-technical information systems in organizations. But what if the output we are working toward is scientific knowledge? I believe, but have not yet proven, that we can treat science itself as a design activity. In the spring of 2016, my <a href="http://joeclark.svbtle.com/a-capstone-course-in-information-systems" rel="nofollow">capstone class</a> will take part in an experiment to figure out just what this might mean.</p>
<h1 id="what-kind-of-problem-is-science_1">What kind of problem is science? <a class="head_anchor" href="#what-kind-of-problem-is-science_1" rel="nofollow">#</a>
</h1>
<p>Horst Rittel and Melvin Webber of UC Berkeley articulated the concept of <strong>“wicked problems”</strong> in a noteworthy 1973 <a href="http://www.jstor.org/stable/4531523" rel="nofollow">paper</a>. Their purpose in using this term was to argue that problems like public policy cannot be solved by simply hiring experts to apply professional knowledge (“science”). Problems of social policy were different, they argued, because these problems cannot be definitively described (or “tamed”). Wicked problems defy definition, because there is no agreement among stakeholders about the goals, there is no unequivocal measurement of outcomes, no way to know when the problem is “solved”, and no hard boundaries around the system where a problem seems to be found.</p>
<p><a href="https://svbtleusercontent.com/dqinq4ahq0ic6q.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/dqinq4ahq0ic6q_small.jpg" alt="looking_at_city.jpg"></a></p>
<p>They identified ten features of wicked problems (<a href="http://www.jstor.org/stable/4531523" rel="nofollow">Rittel & Webber, 1973</a>):</p>
<ol>
<li>There is no definitive formulation of a wicked problem.</li>
<li>Wicked problems have no stopping rule.</li>
<li>Solutions to wicked problems are not true-or-false, but good-or-bad.</li>
<li>There is no immediate or ultimate test of a solution to a wicked problem.</li>
<li>Every solution to a wicked problem is a “one-shot operation”; because there is no opportunity to learn by trial-and-error, every attempt counts significantly.</li>
<li>Wicked problems do not have an enumerable (or an exhaustively describable) set of potential solutions, nor is there a well-described set of permissible operations that may be incorporated into the plan.</li>
<li>Every wicked problem is essentially unique.</li>
<li>Every wicked problem can be considered to be a symptom of another problem.</li>
<li>The existence of a discrepancy representing a wicked problem can be explained in numerous ways. The choice of explanation determines the nature of the problem’s resolution.</li>
<li>The planner has no right to be wrong.</li>
</ol>
<p><a href="http://www.sciencedirect.com/science/article/pii/S0142694X13000471" rel="nofollow">Farrell and Hooker (2013)</a> simplify these ten criteria to three cognitive features of a problem situation that together determine its “wickedness”: the <em>finitude</em> of our cognitive capacities and resources, <em>complexity</em> or entangledness of consequences, and <em>normativity</em> constraints—the value conflicts among stakeholders.</p>
<blockquote>
<p>“Each of these features poses a challenging aspect of the fundamental methodological problem: how is it possible to act intelligently and responsibly in a world characterized by deep limits on our problem-solving capacities? We contend that it is the depth and extent of this methodological challenge that ultimately constitutes the wickedness of a problem.” (<a href="http://www.sciencedirect.com/science/article/pii/S0142694X13000471" rel="nofollow">Farrell & Hooker, 2013</a>).</p>
</blockquote>
<p>Are scientific research challenges “tame” problems, that is, are they well-defined, bounded, and do they have clear success criteria? Another way of asking the question is: does there exist a body of professional or expert knowledge about how to get from point A to point B in science?</p>
<p>Tame problems are those to which professional knowledge, engineering discipline, or well-honed art, can be applied in a straightforward manner. The professional’s experience, skill, time, and resources are the primary determinants of success, or so the thinking goes. Wicked problems are the domain of <em>designing</em>; solutions must be <em>found</em>, not simply identified and applied. Design is distinguished in this way from science:</p>
<blockquote>
<p>“[D]esign problems are ill-defined, ill-structured, or ‘wicked’. They are not the same as the ‘puzzles’ that scientists, mathematicians, and other scholars set themselves” (<a href="http://www.sciencedirect.com/science/article/pii/0142694X82900400" rel="nofollow">Cross, 1982</a>). </p>
</blockquote>
<p><a href="https://svbtleusercontent.com/gobghswqevrxw.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/gobghswqevrxw_small.jpg" alt="3001275-poster-942-experimentation-new-planning.jpg"></a></p>
<p>Perhaps these authors are attacking something of a straw man version of “science” in order to advocate for more interest in the theory of design, because at first glance it seems to me that scientific research matches many of the characteristics of wicked problems, except #5 and #10, since trial and error is certainly possible in science. (Those exceptions should also apply to other species of design problems, from architecture to zydeco.) Criteria #4 and #8 resonate with the cliché about science that every answer raises another question. Furthermore, criterion #9 reminds me of something Weick wrote about theorizing:</p>
<blockquote>
<p>…it was argued that a reaction such as ‘that’s interesting’ was sufficient to selectively retain a conjecture, independent of additional efforts to verify it. Eventual attempts at verification may occur sometime later but… the value of a theory does not ride on the outcome of those tests. The reason it does not is that validation is not the key task of social science. It might be if we could do it, but we can’t—and neither can economists…</p>
<p>If validation is not a criterion for retaining conjectures, this means at least two things. First, the criteria used in place of validation must be explored carefully since the theorist, not the environment, now controls the survival of conjectures. Second, the contribution of social science does not lie in validated knowledge, but rather in the suggestion of relationships and connections that had previously not been suspected, relationships that change actions and perspectives. (<a href="http://amr.aom.org/content/14/4/516.short" rel="nofollow">Weick, 1989</a>)</p>
</blockquote>
<p>In other words, at least in the social sciences, we can never really rule out a theory (or a design?) in the way that a physicist or chemist could disprove a hypothesis, <em>as long as there is one researcher who still believes in it</em>.</p>
<p>Re-reading <a href="http://amr.aom.org/content/14/4/516.short" rel="nofollow">Weick’s (1989) “Theory Construction as Disciplined Imagination”</a> just now, another excerpt jumps out at me:</p>
<blockquote>
<p>Natural scientists pick problems they can solve, work for colleague approbation rather than lay approbation, collaborate with people who share their interests and values, and seldom worry about what others think. The world of the social scientist, poet, theologian, and engineer is dramatically different. These people choose problems because they urgently need solution, whether they have the tools to solve them or not.</p>
</blockquote>
<p>So maybe the answer to my question is “it depends”. Some scientific problems are tame, merely requiring the patient application of skill and elbow grease, while others are wicked and remain problems as long as anyone cares to grapple with them. Perhaps science becomes more like design in areas where there are multiple potential theoretical explanations and not enough knowledge or resources to explore them all.</p>
<blockquote>
<p>“This commonality [between design and science] becomes still clearer whenever scientists venture into new unexplored territories, e.g. from Newtonian into relativistic or quantum domains, or where scientists are fundamentally re-evaluating previously explored territory… in short whenever scientists are engaged in deep or revolutionary research.” (<a href="http://www.sciencedirect.com/science/article/pii/S0142694X13000471" rel="nofollow">Farrell & Hooker, 2013</a>)</p>
</blockquote>
<p><a href="https://svbtleusercontent.com/5t1c9b8bhz1ya.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/5t1c9b8bhz1ya_small.jpg" alt="Wikimedia_Hackathon_2013,_Amsterdam_-_Flickr_-_Sebastiaan_ter_Burg_(28).jpg"></a></p>
<h1 id="how-does-emdesignem-apply-to-wicked-problems_1">How does <em>design</em> apply to wicked problems? <a class="head_anchor" href="#how-does-emdesignem-apply-to-wicked-problems_1" rel="nofollow">#</a>
</h1>
<p>If a scientific problem is wicked, the literature agrees that design thinking is the way to approach it. I’d like to review for a moment how the literature describes design thinking, beginning with Farrell and Hooker, who describe three cognitive stages at work in solving wicked problems:</p>
<blockquote>
<p>“(I) an initial phase of <strong>problem space formation</strong> where the context of the design situation (e.g. location, interested parties), the general possibilities of the design situation, the putatively desired outcome of the design process and the normative constraints applicable to it are each initially characterised, followed by (II) a <strong>development/exploration phase</strong> where a trial space of various potential partial solutions (e.g. in sketch or model form) is developed and explored and are allowed to mutually interact and are modified, often in interaction with negotiated modification of any and all of the elements of the initial phase (the problem re-definition aspect) so as to access new resolution pathways and realisation of value, and (III) a <strong>final production phase</strong> where the design outcome is produced and normative value realised.” (emphases mine)</p>
</blockquote>
<p>Similarly, Herbert Simon (mentioned in my previous post) describes design as a search process which can be characterized as a cycle of generating and testing alternatives. We can </p>
<blockquote>
<p>“think of the design process as involving, first, the generation of alternatives and, then, the testing of these alternatives against a whole array of requirements and constraints… The generators implicitly define the decomposition of the design problem, and the tests guarantee that important indirect consequences will be noticed and weighed” (<a href="http://amzn.to/1koRZ6r" rel="nofollow">Simon, 1996</a>, pp. 128-129). </p>
</blockquote>
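<p>Simon’s generate-and-test cycle can be sketched as a search loop that stops at the first satisficing (“good enough”) alternative, or settles for the best candidate found when resources run out. This is my own illustrative sketch in Python, not a formalization from Simon; all names here are hypothetical.</p>

```python
from itertools import count

def generate_and_test(generate, tests, satisfices, max_iterations=1000):
    # Simon's cycle: propose a design alternative, then test it
    # against the whole array of requirements and constraints.
    best = None
    for _ in range(max_iterations):
        candidate = generate()
        outcomes = {name: test(candidate) for name, test in tests.items()}
        if satisfices(outcomes):
            return candidate, outcomes  # "good enough": stop searching
        if best is None or sum(outcomes.values()) > sum(best[1].values()):
            best = (candidate, outcomes)
    return best  # resources exhausted: settle for the best so far

# Toy usage: search a deterministic stream of candidates for a
# value whose square is close to 2.
stream = count()
gen = lambda: next(stream) / 100.0  # 0.00, 0.01, 0.02, ...
tests = {"close": lambda x: abs(x * x - 2) < 0.05}
design, outcomes = generate_and_test(gen, tests, lambda r: all(r.values()))
# design == 1.4 (the first candidate whose square is within 0.05 of 2)
```

<p>Note that the loop does not seek an optimum; it returns the first alternative that clears the stated requirements, which is exactly the satisficing behavior Simon describes.</p>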
<p>IDEO, the leading firm in the “design thinking” revolution, also describes their process as having three parts: “inspiration”, “ideation”, and “implementation”:</p>
<blockquote>
<p>“Inspiration is the problem or opportunity that motivates the search for solutions. Ideation is the process of generating, developing, and testing ideas. Implementation is the path that leads from the project stage into people’s lives.” (<a href="https://www.ideo.com/about/" rel="nofollow">Source: IDEO</a>)</p>
</blockquote>
<p>The middle phase, ideation, is where we see a real break from the world of “tame problems”. Designers strive to generate and test out as many ideas as possible before running out of time and money. Collaborative brainstorming, rapid prototyping, and an abundance of feedback are the hallmarks of design thinking, and they replace the professional approach of hiring an expert to apply known techniques. <a href="https://youtu.be/taJOV-YCieI?t=11m30s" rel="nofollow">“Enlightened trial and error succeeds over the planning of the lone genius.”</a></p>
<p><a href="https://svbtleusercontent.com/ixiscmizgknijq.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/ixiscmizgknijq_small.jpg" alt="Military_laser_experiment.jpg"></a></p>
<h1 id="how-do-scientists-design_1">How do scientists design? <a class="head_anchor" href="#how-do-scientists-design_1" rel="nofollow">#</a>
</h1>
<p>I think it is pretty straightforward to imagine how “problem space formation” occurs in science; whether driven by a new observation or by a scientific answer that generates a new question, scientists are inspired to work on something new. In addition to identifying the problem, they come up with multiple possible avenues for solving it, and a new formulation of the research question that corresponds with each.</p>
<p>The final production phase for science-as-design is constrained by standard means of documenting and publishing research: the objective is a conference presentation, a journal article, a book, or maybe a patent. So we know that the search must stop at some point, with a decision made to publish results.</p>
<p>But what does the middle phase look like? Given a number of possible problem formulations, in science</p>
<blockquote>
<p>“…just as with design, the issue becomes which few of these possibilities is currently most worth pursuing and in which specific forms. Various options will be developed in more detail, their resource demands and risks analysed and their merits spelled out for consideration. During that process more specific versions of the initial general problem will be developed, some of them … perhaps requiring a significant reformulation of both what the problem is and what criteria a solution would need to meet. A critical debate will develop about these options, the upshot being that one or two of them will be selected to pursue, perhaps by individual laboratories, perhaps as cooperative ventures. After the results of that round are in, the whole process can be repeated again and again until an at-least-satisfactory explanation emerges within the investigatory resources available.” (<a href="http://www.sciencedirect.com/science/article/pii/S0142694X13000471" rel="nofollow">Farrell & Hooker, 2013</a>)</p>
</blockquote>
<p>This is, of course, making the point that a scientific field “designs” at the field level—each researcher or institution is playing a small part in a large search process. My intent, however, is to show how a single research paper by a student (not a whole program of research) can be seen as a design activity. What would design of an individual paper look like? What could be prototyped and done iteratively in a “one shot” study?</p>
<p><a href="http://www.sciencedirect.com/science/article/pii/0142694X82900400" rel="nofollow">Cross (1982)</a> made an interesting observation that designers take a solution-focused approach to problem solving while scientists take a problem-focused approach. Scientists analyze and try to discover the “rules” that would determine an optimal solution, whereas designers are more likely to begin proposing solutions and ruling them out. Though this sounds atheoretical, designers did tend to learn quite a bit about the “rules” as a result of their solution-focused strategy: “In other words, they learn about the nature of the problem largely as a result of trying out solutions, whereas the scientists set out specifically to study the problem” (<a href="https://books.google.com/books?id=AKtZAAAAYAAJ" rel="nofollow">Lawson, 1980</a>, quoted by <a href="http://www.sciencedirect.com/science/article/pii/0142694X82900400" rel="nofollow">Cross, 1982</a>).</p>
<p>Building on that thought: if the goal I want to give my students is to discover <strong>theory</strong> (or at least <em>generalizable knowledge</em> of some kind, learning that can be transferred from a sample or case study into other situations), perhaps I should guide them in a solution-based approach and trust that, by some process of empirical and <a href="https://en.wikipedia.org/wiki/Abductive_reasoning" rel="nofollow">abductive</a> reasoning, they will arrive at a theoretical understanding of the problem. These “solutions” might take the form of technology prototypes to be subjected to a process of elimination. There should be some process of reflection on each prototype, a reflection which generates new hypotheses. This should be an iterative process and some synthesis of hypotheses should be the final step after time runs out.</p>
<p>Having written that down, it doesn’t sound too different from the normal process of research—generate multiple hypotheses, eliminate some of them by an experiment, and reflect on what the findings mean. Perhaps the difference is in the order in which these activities are undertaken.</p>
<p><a href="https://svbtleusercontent.com/qubu45dzfsahpg.png" rel="nofollow"><img src="https://svbtleusercontent.com/qubu45dzfsahpg_small.png" alt="bml.png"></a></p>
<p><a href="http://www.amazon.com/Lean-Startup-Entrepreneurs-Continuous-Innovation/dp/0307887898/" rel="nofollow">Eric Ries</a> proposes a “build-measure-learn” loop for entrepreneurs, which suggests that they build prototypes in order to gather data in order to evaluate one business hypothesis at a time. I find that this loop tells us something about scientific search if we read it backwards, that is:</p>
<ol>
<li>First we <strong>decide what we want to learn</strong> (i.e. a theory we want to bet on).</li>
<li>Then we decide <strong>what data would be needed</strong> to prove or disprove it.</li>
<li>Finally we design <strong>the cheapest or quickest possible experiment</strong> that could give us such data.</li>
</ol>
<p>What if, as scientists, instead of planning the study first, we started by deciding on the theoretical contributions we think we can make? Then identify the bare minimum data needed to eliminate our ideas—for example, could one phone call with a practitioner rule out an idea before money and time are spent on other data collection? Then iterate, until we find an idea that genuinely needs a significant effort to learn about, and do that as our study. I don’t know if I’ve really hit upon something new here, but it gives me a new way to think about how to do research.</p>
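<p>Read backwards this way, the loop becomes a prioritization procedure: identify the cheapest check that could eliminate each candidate idea, run the cheap checks first, and commit serious effort only to the ideas that survive. A minimal sketch, where the names and data shapes are my own illustration, not Ries’s:</p>

```python
def plan_research(ideas):
    # `ideas` maps an idea name to (cost, check), where `check`
    # is the cheapest experiment that could rule the idea out
    # (it returns False if the idea is eliminated).
    surviving = []
    for name, (cost, check) in sorted(ideas.items(), key=lambda kv: kv[1][0]):
        if check():
            surviving.append((name, cost))  # survived: worth a serious study
        # else: eliminated before money and time were spent on it
    return surviving

# Toy usage: one idea dies in a single cheap phone call,
# the other survives and earns a full study.
ideas = {
    "A": (5, lambda: False),   # e.g. one practitioner call rules it out
    "B": (50, lambda: True),
}
plan = plan_research(ideas)
# plan == [("B", 50)]
```

<p>The design choice here is simply to sort by cost, so the experiments most likely to save effort run first; real prioritization would also weigh how informative each check is.</p>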
<p><a href="https://svbtleusercontent.com/zv7pbiwufhppa.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/zv7pbiwufhppa_small.jpg" alt="a65969e14f01f0eba60fad3bd9360ec8.jpg"></a></p>
<h1 id="designing-research-in-cis-440_1">Designing research in CIS 440 <a class="head_anchor" href="#designing-research-in-cis-440_1" rel="nofollow">#</a>
</h1>
<p>Given certain constraints in the course—for example that it must be graded in three milestones (initial proposal, literature review and analysis, final term paper)—I am not focusing at this time on redefining the project. At the end of the semester, students will need to produce something like a traditional research paper with theory development and the results of an empirical study. My thoughts therefore are on what I can do within the class time that would enable them to approach their research as designers.</p>
<p>The following are some ideas I’m developing for the spring semester:</p>
<ul>
<li><p><strong>Problem discovery</strong>: Students have three weeks to come up with their project proposals, but I don’t want them chewing on their pencils the whole time. Instead, I plan to force them to brainstorm ideas in a group workshop. After talking about the genres of information systems theories, I propose to prompt them with everything from xkcd and Dilbert cartoons to magazine articles and tweets and challenge them to describe the theories they imply. Then, in the next class, I’ll ask them to rapidly generate as many problems-for-research as they can. There’ll be an activity to categorize them as “wicked” or “tame” or into the quadrants of the <a href="https://en.wikipedia.org/wiki/Cynefin" rel="nofollow">Cynefin framework</a>. Thus, divergent thinking (brainstorming ideas) will be followed by convergent thinking (filtering out those that aren’t appropriate for their projects).</p></li>
<li><p><strong>Research design</strong>: After their project proposals have been turned in, the students will turn to research design. In a workshop intended to teach the students about research methods, I will have them draw hypotheses and research methods out of a hat and then, on the spot, propose a research design. Dot-voting or some other mechanism will be used to eliminate unfit designs without any student feeling his term paper plans have been criticized.</p></li>
<li><p><strong>Argument development</strong>: Students are expected to do a thorough literature review and form some expectations (if not hypotheses) about what they expect to find. In the past, they haven’t been very good at identifying opposing arguments and thereby defending against them. I’m working on the design of a workshop called a “straw man check” which would involve the group in brainstorming counter-arguments for each project—hopefully to sharpen their analysis and improve their literature reviews.</p></li>
<li><p><strong>Writing</strong>: Students will proofread and give feedback to one another, thus forcing them to iterate (at least once) on their writing.</p></li>
<li><p><strong>Overall quality</strong>: It will be tricky for me to predict the research quality of the final term papers before they come in. And once I’ve graded them, the students don’t have another chance to iterate. One solution is to have them present their work-in-progress to each other as a “conference” and elicit feedback. The weakness of this is that other students will tend not to be critical, and I doubt the participants learn very much. In order to give them hard, reliable feedback, I’m thinking about a prediction market. What if students could “bet” on which grades their classmates are likely to earn? Then each student could get an honest evaluation of what his classmates think <em>that I will think</em> of his work.</p></li>
</ul>
<h1 id="measuring-how-well-it-works_1">Measuring how well it works <a class="head_anchor" href="#measuring-how-well-it-works_1" rel="nofollow">#</a>
</h1>
<p>My expectation is that students will learn more about research (and about writing) by treating the design of a research term paper as a wicked problem and iterating with their peers on it than they would learn without iteration. Moreover, I believe they will develop better quality research in a shorter amount of time; I’m allowing 10 weeks for the term paper in the spring as compared to the full 15 weeks I allowed this fall. Thirdly, I believe the students will feel more satisfied with their own work and less inclined to regard it as a necessary evil on the way to graduation.</p>
<p><a href="https://svbtleusercontent.com/dlh2lam2avhj3w.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/dlh2lam2avhj3w_small.jpg" alt="meter-551288_1920.jpg"></a></p>
<p>My next task is to iterate on how to measure these outcomes. I’d like to get a baseline of measurement this fall, so I can see what has improved in the spring.</p>
Design is a science. (2015-10-21)<h1 id="philosophy-of-the-artificial_1">Philosophy of the artificial <a class="head_anchor" href="#philosophy-of-the-artificial_1" rel="nofollow">#</a>
</h1>
<p>For many centuries, there has been a distinction made between natural science on the one hand and engineering or applied science on the other. The <strong>natural sciences</strong>, physics, chemistry, biology, etc., are branches of philosophy—the pursuit of truth. Along with the humanities, the natural sciences were recognized as important parts of a liberal education and enjoyed great respectability among academics. More recently, the <strong>social sciences</strong> such as psychology and sociology have taken their place in academia.</p>
<p>By contrast, the various applied and professional disciplines have been seen as non-philosophical, concerned with usefulness rather than truth. Although engineering, medicine, business, law, education, architecture and similar problem-solving disciplines did convey useful bodies of knowledge, it was hard to see any “theory” in them. After all, techniques and technology were constantly changing. They emphasized <em>prescription</em> of solutions rather than <em>description</em> of the world as it is. As a result, many faculties in the professional schools preferred to focus on fundamental science and not on the domain knowledge their graduates would need at work. Business schools taught economics but not business planning, engineering schools taught physics but not design, and so on.</p>
<p>Universities were faced with an unhappy choice: teach what was useful today without the philosophical rigor that would enable graduates to adapt their knowledge or abstract it to other domains, or teach the fundamental sciences and leave its graduates unprepared to apply their knowledge in real-world problem solving. The way out of this dilemma was charted by one of the past century’s great geniuses, <a href="https://en.wikipedia.org/wiki/Herbert_A._Simon" rel="nofollow">Herbert A. Simon</a>.</p>
<p><a href="https://svbtleusercontent.com/aquljggtxllla.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/aquljggtxllla_small.jpg" alt="Herbert A. Simon"></a></p>
<p>In <a href="http://amzn.to/1koRZ6r" rel="nofollow"><em>The Sciences of the Artificial</em></a>, Simon laid the foundations for a philosophy of the artificial and a science of design. Artificial (i.e. “man-made”) things surround us. They are constrained by physical laws and, if we want, we can study them “objectively” using the natural sciences: for example, a physicist could take an airplane as a <em>given</em> and study the way it is affected by the flow of wind. But this doesn’t really capture the whole truth of the artifact. An artifact such as an airplane also embodies certain goals that we need to grasp if we are to understand it. </p>
<blockquote>
<p>“If science is to encompass these objects and phenomena in which human purpose as well as natural law are embodied, it must have means for relating these two disparate components.” (<a href="http://amzn.to/1koRZ6r" rel="nofollow">Simon</a>, p. 3)</p>
</blockquote>
<p>An artifact for Simon could be seen as a meeting point between an “inner environment”, the way the artifact itself worked, and an “outer environment”, the problem it addresses and the context in which it operates. The relationship between the inner and outer environments is where we find human purpose. A problem solver, engineer, or designer is someone who tries to make the inner environment appropriate to the outer environment. For example: for the inventor of an airplane, the outer environment is characterized by the known laws of gravity and air pressure. He makes choices about the inner environment (the materials to use, the type of engine, and so forth) in order to achieve a goal—flight.</p>
<p><a href="https://svbtleusercontent.com/htuynoylw2clw.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/htuynoylw2clw_small.jpg" alt="16188781109_08db7336d8_o.jpg"></a></p>
<h1 id="the-philosophy-of-science-of-design_1">The philosophy of science of design <a class="head_anchor" href="#the-philosophy-of-science-of-design_1" rel="nofollow">#</a>
</h1>
<p>In order for a science of man-made things to be intellectually rigorous, philosophically weighty, abstract, and teachable, therefore, it would take the form of a science of <em>design</em>. Unlike a trade school education, which teaches about existing artifacts (technologies, processes, methods), a science of design seeks theories about how to develop artifacts. It may be a science of design <em>processes</em> rather than of the things that result from design.</p>
<p>Simon suggested a number of elements or features that theories of design might incorporate. First, these theories would have to address the question of how designs would be evaluated. The naive assumption that is conveyed by economic theory (not a design science) is that problem-solvers seek the “optimal” solution to every problem. For most complex problems, though, it is impossible to find the optimum or to know if you have found it. Instead, criteria for evaluating designs are usually phrased in terms of minimum requirements. Designers <strong>satisfice</strong>, that is, they work until they find “good enough” solutions. Unlike a math problem, a design problem usually doesn’t have one uniquely correct solution.</p>
<p>Design theories must also address the question of <strong>search</strong>, since most design solutions cannot simply be optimized or <em>solved</em> mathematically, nor can they be <em>deduced</em> intuitively. With problems of any significant complexity, solutions also cannot simply be designed by adding up known cause-effect relationships. An architect may know the properties of various materials and structures, and a medical researcher may know the isolated effects of various treatments, but when assembled together they may have interactions or side effects that could not be foreseen. Thus, for complex problems designers must follow some process of developing solution alternatives (simulated or actual) and evaluating them as wholes.</p>
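<p>The point about evaluating alternatives as wholes can be made concrete with a toy search: score each complete assembly of components rather than summing per-component scores, since interactions and side effects only appear in the assembled whole. This is an illustrative sketch under my own assumptions; the component names and interaction scores are invented.</p>

```python
from itertools import combinations

def best_whole_design(components, score_whole, k=2):
    # Enumerate candidate assemblies of k components and score each
    # as a whole, rather than adding up isolated per-component scores.
    return max(combinations(components, k), key=score_whole)

# Toy usage: pairwise interactions that no per-component score predicts.
parts = ["steel", "glue", "rivets"]
interaction = {frozenset({"steel", "glue"}): 1,
               frozenset({"glue", "rivets"}): 2,
               frozenset({"steel", "rivets"}): 3}
best = best_whole_design(parts, lambda combo: interaction[frozenset(combo)])
# best == ("steel", "rivets")
```

<p>Exhaustive enumeration like this only works for small spaces, which is exactly why, for complex problems, designers need heuristic search processes rather than brute force.</p>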
<p>Different methods of logical <strong>reasoning</strong> may also be employed in design, though not necessarily. In the natural and social sciences, which strive for <em>description</em> of the world as it is, one observes a cycle of induction and deduction. <em>Inductive</em> reasoning observes natural or social phenomena and derives theories to explain them, with the hope that these theories can be generalized to apply elsewhere. <em>Deductive</em> reasoning derives hypotheses from these theories, which can then be tested to confirm or disprove the generalizability of the explanations. By contrast, a design theorist seeks <em>prescriptive</em> or <em>normative</em> knowledge, and in design we often see an <em>abductive</em> mode of reasoning. In abductive reasoning, a designer considers a variety of competing hypotheses at the same time, and through iterative designing filters some of them out and settles on one or a few that seem to have held up best in his experience with the real world.</p>
<p>There has been substantial thinking in many domains about the best ways to embark on the search for designs. In CIS 440, students will learn about a variety of approaches to developing information technology and information systems. Because there may be more than one satisfactory solution to any given problem, the design process may have a great effect on which solution, or which style of solution, is ultimately arrived at. Even if different design approaches all achieve satisfactory solutions according to the same evaluation criteria, the solutions they achieve may be radically different.</p>
<h1 id="design-science-in-information-systems_1">Design science in information systems <a class="head_anchor" href="#design-science-in-information-systems_1" rel="nofollow">#</a>
</h1>
<p>I argue that my field, information systems, is a <strong>science of the artificial</strong> and ought not to be counted (as it typically is) among the social sciences. Although much of the research in our field takes information systems artifacts as “given” and tries to theorize social phenomena like technology choices, as if the researcher was aloof to the goals of his subjects, this research is properly seen in support of a larger mission: to generate prescriptions that really <em>help</em> practitioners. To do this, we draw on descriptive knowledge of how the world <em>is</em> but strive for what <em>should</em> be.</p>
<blockquote class="short">
<p>“Everyone designs who devises courses of action aimed at changing existing situations into preferred ones.” (<a href="http://amzn.to/1koRZ6r" rel="nofollow">Simon</a>, p. 111)</p>
</blockquote>
<p>Scholars within our field have written a great deal about <strong>design research</strong> in the past several years (e.g. <a href="http://aisel.aisnet.org/misq/vol28/iss1/6/" rel="nofollow">Hevner et al, 2004</a>). Alan Hevner has been the most outspoken proponent of the idea that design is an inalienable dimension of a full understanding of information systems.</p>
<blockquote>
<p>“While natural science research methods are appropriate for the study of existing and emergent phenomena, they are inadequate for the study of ‘wicked problems’ which require innovative solutions. Such problems are more effectively addressed using design research methods.” (<a href="http://link.springer.com/article/10.1007%2Fs12599-008-0004-5" rel="nofollow">Hevner, 2009</a>).</p>
</blockquote>
<p>Figure 2 from the 2004 research paper by Hevner and colleagues shows that IS research must prove its relevance by addressing business needs, and prove its intellectual rigor by contributing generalizable knowledge to a scientific knowledge base of theories, frameworks, and models.</p>
<p><a href="https://svbtleusercontent.com/pcrqiso9kk8pa.png" rel="nofollow"><img src="https://svbtleusercontent.com/pcrqiso9kk8pa_small.png" alt="Hevner et al, 2004, figure 2"></a></p>
<p>This sets a standard for design research to obtain academic respectability. Merely “designing” is not design science research if it does not solve a new problem or uncover new general knowledge, connecting it to the extant knowledge base. </p>
<blockquote class="short">
<p>“Rigor in design research is what separates a research project from the practice of routine design.” (<a href="http://link.springer.com/article/10.1007%2Fs12599-008-0004-5" rel="nofollow">Hevner, 2009</a>). </p>
</blockquote>
<p>The outputs of design research are not just new artifacts but new <em>types</em> of artifacts, or new principles, methods, or models for creating them. To contribute to the knowledge base of our discipline, researchers document their findings as <strong>information systems design theory</strong> (e.g. <a href="http://pubsonline.informs.org/doi/abs/10.1287/isre.3.1.36" rel="nofollow">Walls et al, 1992</a>; <a href="http://www.jstor.org/stable/25148742" rel="nofollow">Gregor, 2006</a>). One way of documenting a design theory is outlined by <a href="http://aisel.aisnet.org/jais/vol8/iss5/19/" rel="nofollow">Gregor and Jones (2007)</a>:</p>
<ol>
<li>Purpose and scope (what the system is for)</li>
<li>Constructs (the <a href="https://en.wikipedia.org/wiki/Four_causes" rel="nofollow"><em>causa materialis</em></a>)</li>
<li>Principles of form and function (the “blueprint”)</li>
<li>Artifact mutability (the changes or adaptations possible)</li>
<li>Testable propositions (which could be tested to prove/disprove the theory)</li>
<li>Justificatory knowledge (background data or theory underpinning the design)</li>
<li>Principles of implementation (how to build it)</li>
<li>Expository instantiation (case study or prototype to demonstrate it)</li>
</ol>
<p>What I think is important for students to note is that, if you can understand your own designs in these kinds of terms, you begin to approach problem solving in information systems not only as a tradesman but also as a philosopher. You can see how a particular <em>type</em> of solution can be applicable to other problems, and identify the limits of any design principles or heuristics you’ve developed. Moreover, if we as faculty are only teaching you <em>what</em> to do but not <em>why</em> it works, we aren’t adequately training you as thinkers.</p>
<p><a href="https://svbtleusercontent.com/9gn0x4caz6fg.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/9gn0x4caz6fg_small.jpg" alt="17980337762_fc22932533_o.jpg"></a></p>
<h1 id="practical-design-thinking-in-information-syst_1">Practical design thinking in information systems <a class="head_anchor" href="#practical-design-thinking-in-information-syst_1" rel="nofollow">#</a>
</h1>
<p>The next question to answer is: what theories do we already have, in 2015, to guide information systems designers and problem solvers? There are several which you may find presented in textbooks as methods or processes for project management, product development, and engineering. Indeed, the practice of <strong>information systems development (ISD)</strong> has gone through several phases and is still in continuous flux. Traditional project management methods (<a href="http://www.pmi.org/" rel="nofollow">PMBOK</a>-based) have given way to a variety of <a href="http://www.agilemanifesto.org/" rel="nofollow">Agile</a> models, and these themselves are now being supplanted by new concepts: <a href="https://en.wikipedia.org/wiki/Design_thinking" rel="nofollow">design thinking</a>, <a href="http://amzn.to/1PEnNkp" rel="nofollow">DevOps</a>, and the Lean Startup. In CIS 440, I will explore with my students a sampling of the best current thinking about how to develop IT products. A few common threads unite all of these methodologies:</p>
<ol>
<li>
<p><strong>Pay attention to process.</strong></p>
<p>In classical economics, to be “rational” is to make the optimal choice. This is a substantive rationality—the choice itself is deemed to be rational if it is the best possible choice. Herbert Simon and his students developed a concept of <a href="https://en.wikipedia.org/wiki/Bounded_rationality" rel="nofollow"><em>procedural rationality</em></a> in which we evaluate the process of search rather than the ultimate choice. This is important in the real world of problem solving and design where there may be no optimal solution, or if there is we are unlikely to find it. A rational process for decision making might include developing criteria, identifying alternatives, and evaluating alternatives until a satisfactory one is found. Importantly, one satisficing decision maker may make a different choice than another even given the same set of alternatives, and this does not imply that they are irrational.</p>
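<p>The satisficing procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the <code>satisfice</code> function, the scoring rule, and the aspiration threshold are all invented for the example, not drawn from Simon.</p>

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

def satisfice(alternatives: Iterable[T],
              score: Callable[[T], float],
              threshold: float) -> Optional[T]:
    """Return the first alternative that clears the aspiration level.

    Unlike optimization, the search stops at the first "good enough"
    candidate, so two rational agents scanning the same alternatives
    in different orders may make different (equally rational) choices.
    """
    for candidate in alternatives:
        if score(candidate) >= threshold:
            return candidate
    return None  # no satisfactory alternative was found

# Invented example: accept the first vendor quote under a $1000 budget.
quotes = [("A", 1200), ("B", 950), ("C", 800)]
choice = satisfice(quotes, score=lambda q: -q[1], threshold=-1000)
# choice is ("B", 950): quote C is cheaper, but the search never reaches it.
```

<p>An optimizer would have picked C; the satisficer’s answer depends on search order, which is exactly the point of procedural rationality.</p>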
<p>In CIS 440, I will not attempt to lecture students on what is the “best” IT architecture or the “optimal” design for a piece of software. Instead, we will focus on best practices in the development <em>process</em>. Importantly, many of these practices have built-in mechanisms for reflection and continuous improvement on the process itself. <a href="https://en.wikipedia.org/wiki/Scrum_(software_development)" rel="nofollow">Scrum</a>, for example, has regular “review” meetings that evaluate the artifact being developed, as well as “retrospective” meetings to evaluate the team’s own processes.</p>
</li>
<li>
<p><strong>No facts in the building.</strong></p>
<p>The direction of evolution of ISD processes has been away from “big design up front” (BDUF) methods where all planning was done by the project team and project manager, toward increased contact with the business and the ultimate customers. The Agile movement made the client explicitly a part of the development team, in the Product Owner role. Instead of getting requirements from a document, they would be elicited by regular back-and-forth with this representative of the business’s interests.</p>
<p>Newer approaches to IT product development, such as the <a href="http://amzn.to/1GXJVOK" rel="nofollow">Lean Startup</a>, take advantage of analytics to go one step farther: instead of asking the client to interpret what the business needs, teams are to go directly to the end users (i.e. the client’s customers who will use the technology). And instead of asking them what they want, developers increasingly use hard data about what they <em>do</em> with the product. As Stanford’s <a href="http://steveblank.com/2009/10/08/get-out-of-my-building/" rel="nofollow">Steve Blank</a> tells his students:</p>
<blockquote class="short">
<p>“There are no facts inside the building so get the heck outside.”</p>
</blockquote>
<p>We will examine the different approaches to, and technologies for, eliciting feedback from clients and end users during design.</p>
</li>
<li>
<p><strong>Iteration.</strong></p>
<p>Modern IS development approaches are iterative. Traditional project management methods like those used in construction and aerospace turn out not to work very well in IT, because IT projects are more complex (as opposed to <em>complicated</em>) and planning up front is difficult. Therefore, most of the newer methods replace up-front planning with continuous re-planning or “just-in-time planning” of small chunks of a project. But how do you plan only a small part of a system, when other parts of the system will depend on it? And moreover, how do you explain an incomplete plan to a client, or get them to sign a contract on it? These are some of the challenges that new ISD approaches need to meet.</p>
</li>
<li>
<p><strong>Commitment to experimentation.</strong> </p>
<p>Complex problems cannot be solved in a planning meeting; they require lots of trial and error. Increasingly, ISD methodologies are moving toward a scientific mindset, a necessary development because a major part of trial and error is <strong>error</strong>. Designers and their managers must see designs as hypotheses and take a scientific attitude toward failure—as in science, an experiment that disproves a hypothesis teaches us as much as, and maybe more than, one that supports it. An oft-repeated mantra (usually attributed to Tom Kelley of <a href="https://en.wikipedia.org/wiki/IDEO" rel="nofollow">IDEO</a>) is</p>
<blockquote class="short">
<p>“Fail faster to succeed sooner”</p>
</blockquote>
<p>We will study how to treat designs as hypotheses and to use empirical data to learn from them. In addition, we’ll see the importance of certain engineering or DevOps practices, like continuous deployment, in enabling organizations to speed up their engines of learning, and how this can change organizational culture, strategy, and the bottom line of innovation.</p>
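<p>As a concrete (and entirely invented) illustration of treating a design as a hypothesis, here is a minimal two-proportion z-test comparing conversion rates between a control feature (A) and a redesign (B). The counts and the significance level are assumptions for the sketch, not data from any real experiment.</p>

```python
from math import sqrt, erf

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """One-sided z-test of H0: the redesign (B) converts no better than A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)      # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))    # upper-tail normal CDF
    return z, p_value

# Invented experiment: 120/1000 conversions on A, 150/1000 on B.
z, p = two_proportion_z(120, 1000, 150, 1000)
hypothesis_survives = p < 0.05  # True here: the redesign beat control
```

<p>A failed test is as informative as a passing one: it tells the team which design hypothesis to discard before investing further.</p>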
</li>
</ol>
<h1 id="conclusion_1">Conclusion <a class="head_anchor" href="#conclusion_1" rel="nofollow">#</a>
</h1>
<p>I have argued that there is a <em>philosophy</em> of designing—an important perspective on our field as a science of the artificial—which students should be exposed to in order to adequately prepare them for thinking intellectually about information systems. In addition, there are a number of methodologies or <em>theories</em> of how to design information systems products and solutions. Because these are parts of the theory of IS design, they are formalizable and teachable. Moreover, these methods themselves have aspects of the “scientific method” in that they increasingly involve explicit articulation of design hypotheses, iterative experimentation, and use of empirical data. In teaching my students to be IS designers, I hope also to teach them to be design <em>scientists</em>.</p>
<h1>A capstone course in information systems (2015-10-16)</h1>
<h1 id="an-education-for-problem-solvers_1">An education for problem solvers <a class="head_anchor" href="#an-education-for-problem-solvers_1" rel="nofollow">#</a>
</h1>
<p>In an undergraduate information systems degree program, we strive to prepare students to be valuable employees, effective managers, and capable entrepreneurs using information technology in business. This is not to say that they should be exclusively focused on commerce. Everyone who is trying to accomplish anything in the world—entrepreneurs, government, nonprofits, artists—has “business”. Being a part of the business school distinguishes information systems (MIS, CIS, whatever) from other programs like computer science, not because we have a different domain of applications, but because business students are trained to pay attention to problems that need solving, rather than “solutions looking for problems”. </p>
<p>We ought to teach students <em>how</em> to approach problem solving with IT, not <em>which</em> problems are worth solving. Jeff Hammerbacher, one of the first data scientists at Facebook, provides us with an observation about wasted talent that I find sobering:</p>
<blockquote class="short">
<p>“The best minds of my generation are thinking about how to make people click ads. That sucks.” —<a href="http://www.fastcompany.com/3008436/takeaway/why-data-god-jeffrey-hammerbacher-left-facebook-found-cloudera" rel="nofollow">Jeff Hammerbacher</a></p>
</blockquote>
<p><a href="https://svbtleusercontent.com/ru62dqanpm8kfa.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/ru62dqanpm8kfa_small.jpg" alt="people-coffee-notes-tea.jpg"></a></p>
<h1 id="designing-the-capstone-class_1">Designing the capstone class <a class="head_anchor" href="#designing-the-capstone-class_1" rel="nofollow">#</a>
</h1>
<p>My responsibility is the capstone class, the final course that information systems majors take before they graduate. By the time they reach me, these students have had courses in programming and databases, networking and web development, and—guided by the above—my objective is to teach them how to put all of this knowledge together to develop real-world solutions enabled by IT.</p>
<p>So how do I do this? Well, when I began teaching the course, I was assigned a project management textbook, and the focus of the course was the capstone projects. Each student team was developing a real IT system for a local business or organization. The learning experience was primarily about the clash between the predictability of textbook assignments and the unpredictability of real-world projects, or as my predecessor Tim Olsen described it, <a href="http://blogs.wpcarey.asu.edu/knowit/capstone-project-the-messiness-of-execution/" rel="nofollow">“the messiness of execution”</a>:</p>
<blockquote>
<p>“We want students to experience the messiness of execution,” said clinical assistant professor Timothy Olsen, who teaches the capstone class. “When we teach students concepts in classes, most of the homework and test assignments are pretty clear-cut and there are fairly good directions. But in the real world, there are political problems and integration problems and learning curves, and lots of reasons why execution is difficult.”</p>
<p>The capstone project gives the students experience in dealing with the messiness of execution, when working with a client who might change his mind often, or dealing with incomplete requirements. “Having that sort of experience is one of the more valuable lessons that comes out of this project,” Olsen said.</p>
</blockquote>
<p>Delivering this learning experience, however, challenged me as a professor in a number of ways. First, there was the question of what I could teach in the classroom that would be relevant to what the students needed, when most of them were fully focused on their projects. One of the perennial complaints students put on their course evaluations was “we don’t need the lecture or readings; just finish up and let us have our team meeting”. Second was the issue of visibility; I couldn’t be an active participant in twenty teams at the same time, so I typically didn’t know whether projects would succeed or fail until the “big release” at the end of the semester.</p>
<p>To solve the second problem, I began to change the project management approach of the course from a traditional <a href="https://en.wikipedia.org/wiki/Systems_development_life_cycle" rel="nofollow">SDLC</a> to an Agile method based on <a href="http://amzn.to/1LwrLcg" rel="nofollow">Scrum</a>. Instead of a big release at the end of the semester, we would have frequent “demo days” of work-in-progress throughout the semester. This has been incredibly successful; not only is Agile an important trend that our students need to know for the job market, it’s also fantastic for pedagogy. Teams learn not only from feedback on their own work, but from seeing the evolution and the trial-and-error that other teams are going through.</p>
<p>The other problem, though—that teams don’t quite know what to do with my readings or lectures—seems to be perennial. After 2 ½ years teaching the course, I’m contemplating a major redesign.</p>
<p><a href="https://svbtleusercontent.com/tttq5qalu9ifcq.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/tttq5qalu9ifcq_small.jpg" alt="1YE1LLNXGB.jpg"></a></p>
<h1 id="changing-context-changing-course_1">Changing context, changing course <a class="head_anchor" href="#changing-context-changing-course_1" rel="nofollow">#</a>
</h1>
<p>There are two drivers for changing the course design, one imposed on me from outside, and one that’s more of an evolutionary development from what I’ve learned through the past several iterations.</p>
<h2 id="an-outside-force_2">An outside force <a class="head_anchor" href="#an-outside-force_2" rel="nofollow">#</a>
</h2>
<p>In 2014, the capstone course’s designation as an “L” class (Literacy and Critical Inquiry) for general studies credit was due to expire, and if it was not renewed, all students in the major would have to take an extra class for this credit. College credits aren’t cheap, and no senior wants to be told he needs to stick around for one extra semester before his degree is final, so it was a no-brainer that we’d want to keep the designation. However, doing so required that 50% of class credit be based on writing or an individual research project, which would take the focus of the course away from the group capstone projects.</p>
<p>Adding an individual “thesis” to the capstone course isn’t a bad idea from a teaching standpoint, either. We use alumni surveys to find out how well we’ve done at preparing them in three areas: critical thinking, communication skills, and domain-specific knowledge; for our department, the one we always get bad marks on is “communication skills”.</p>
<h2 id="a-paradigm-shift_2">A paradigm shift <a class="head_anchor" href="#a-paradigm-shift_2" rel="nofollow">#</a>
</h2>
<p>Outside the college, too, information systems development is undergoing a major conceptual change. In the <strong>Agile</strong> paradigm, developers learn to work with the business client (represented by a “product owner”) to elicit requirements and feedback. It was a great feature of the capstone course that students could get this real-world experience and learn the struggles of communication and coordination that it entailed.</p>
<p>What is changing in the workplace today, though, is that developers can no longer ignore what happens before they start developing—how does the product owner come up with those requirements?—and what happens after they finish—i.e., how does the company know the project was a success? In the <strong>Lean</strong> paradigm, we can no longer find out what the business needs just by asking it. As <a href="http://steveblank.com/2009/10/08/get-out-of-my-building/" rel="nofollow">Steve Blank says</a>:</p>
<blockquote class="short">
<p>“There are no facts inside the building so get the heck outside.”</p>
</blockquote>
<p>Instead of asking client organizations what they want, I believe that students need to be talking to potential customers or users. They need to learn how to validate prototypes, not by whether a product owner signs off on them at the end of a sprint, but by the use of real analytics: did users like the features, and use them in the way developers expected or hoped? This is an entrepreneurial paradigm, but entrepreneurship is running through everything these days, and is a <a href="https://live-newamericanuniversity.ws.asu.edu/about/design-aspirations" rel="nofollow">core design principle of our university</a>.</p>
<p><a href="https://svbtleusercontent.com/xnqjzk1olinosa.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/xnqjzk1olinosa_small.jpg" alt="JLXDNN5BNE.jpg"></a></p>
<h1 id="a-new-design_1">A new design <a class="head_anchor" href="#a-new-design_1" rel="nofollow">#</a>
</h1>
<p>Over the past two semesters, I’ve increasingly added “validation” activity to the course requirements; for example, students are required to use rapid paper prototyping with their clients in the first milestone, and must conduct usability studies of their software projects with real users for a later milestone. Additionally, this fall I added an individual term paper that requires students to conduct empirical research on their own. </p>
<p>All this has been incremental and feels a little bit disconnected, so I’ve been doing a lot of blank-slate thinking lately. Here are my thoughts.</p>
<ul>
<li>
<p>Two semester-long projects running at the same time (a group project and a term paper) place a heavy cognitive burden on students. Maybe the projects should be done in sequence, with intense schedules for half a semester each.</p>
<ul>
<li>The next question is: which one first?</li>
</ul>
</li>
<li>
<p>Design is a science, for sure. Can we also say that science is a design activity? I wonder if the <a href="https://www.ideo.com/by-ideo/design-thinking-in-harvard-business-review" rel="nofollow">design thinking</a> or <a href="https://en.wikipedia.org/wiki/Design_science_research" rel="nofollow">design science</a> paradigm might be a good theme to unify both projects. Thus, the group project would involve explicit identification of hypotheses, and the term papers would require students to use design science techniques like prototyping.</p>
<ul>
<li>
<a href="http://www.sciencedirect.com/science/article/pii/0142694X82900400" rel="nofollow">Nigel Cross (1982)</a> has argued that design in education provides some of the educational value of “critical thinking”, developing innate abilities “to understand the nature of ill-defined problems, how to tackle them, and how they differ from other kinds of problems”. If he’s right, then design fits into my mandate to help students learn how to learn (as cliché as that sounds).</li>
</ul>
</li>
<li><p>A “flipped classroom” approach could be useful. I could grade students pass/fail on whether they had watched the required videos or read the readings, and then facilitate meaningful project work (through a variety of workshops) during class time. That might be a way to ensure they learn the principles of design thinking, Agile, DevOps, while using face-to-face time in a way that aligns with their most pressing concerns.</p></li>
<li><p>It might not be heretical to drop “client interaction” from the class, if I replace it with “customer interaction”. In other words, students would still grapple with the “messiness of execution” but the challenge wouldn’t be pleasing one customer but instead pleasing many, and using analytics to prove it. Students could identify their own projects, in this case.</p></li>
<li><p>Many outside organizations, particularly companies who hire our graduates, still want to be involved. I haven’t quite figured out how they’d fit in if I designed “client work” out of the course. Some kind of “mentorship” role would appeal to many of them.</p></li>
</ul>
<p>The upcoming spring semester is my next good opportunity to experiment with the design of the course, so I’ll be developing and validating some ideas over the next couple of months. I welcome feedback and ideas!</p>