The various NoSQL vendors tout the benefits of “schema later” in their promotional materials and elsewhere. In this blog post, we argue that schema later is a bad idea and should be avoided, especially for important data that you care about.
The argument in favor of “schema later” is that it makes the “out of the box” experience much better for an initial user. In other words, a user can simply start entering data, without first having to define a schema for it. Hence, the initial learning curve is gentler, and that is considered good.
Consider the following example, which illustrates the problems with schema later.
In the syntax of MongoDB, USER-1 can start entering data immediately. The next two commands create an Employee collection and add records for two Turing Award winners.
db.employee.insert(
    { name: "Stonebraker", age: 45, salary: -99 }
)
db.employee.insert(
    { name: "Codd", age: "old", salary: 40000 }
)
Sometime later, another Turing awardee is added by a second person (USER-2).
db.employee.insert(
    { name: "Bachmann", age: "unknown", salary: "20.65/hour" }
)
Finally, the last DBMS Turing Award winner is added by USER-3.
db.employee.insert(
    { name: "Gray", age: 120, salary: 80000 }
)
The above four inserts, by three different users, are all acceptable to the Employee collection, as they would be in many NoSQL DBMSs. In each of the four cases, MongoDB outputs:
WriteResult({ "nInserted" : 1 })
Now suppose USER-4 wants to query the collection. It is straightforward to find Stonebraker:
db.employee.findOne(
    { name: "Stonebraker" }
)
Output:
{
    "_id" : ObjectId("5d88fe325fa42577f1bfe9ea"),
    "name" : "Stonebraker",
    "age" : 45,
    "salary" : -99
}
The problem arises when USER-4 wishes to run an aggregate, say, to find the sum of employee salaries:
db.employee.aggregate([
    {
        $group: {
            _id: null,
            total: { $sum: "$salary" }
        }
    }
])
Output:
{ "_id" : null, "total" : 119901 }
119901 is computed as “-99 + 40000 + 80000”, so “20.65/hour” was treated as zero (the $sum operator silently ignores non-numeric values), when USER-2 clearly intended to store a non-zero value. Also, -99 was taken at face value, even though USER-1 probably meant it as a placeholder for “I don’t know”. In other words, schema later allows a user to get incorrect and very misleading output. Woe to a schema-later user who makes a business decision based on the above output!
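Before tracking down the other users, USER-4 could at least detect which documents polluted the aggregate. Here is a minimal sketch using MongoDB’s $type query operator (this query is our addition, not part of the original example):
db.employee.find(
    { salary: { $not: { $type: "number" } } }
)
This flags the Bachmann document, whose salary is a string. It cannot flag Stonebraker’s -99, however, which is a perfectly valid number; only a human knows it was meant as “I don’t know”.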
In order to get reasonable output, USER-4 must perform the following actions:
Ask USER-1 what a salary of -99 means. Does it mean null (“undefined”)?
Ask USER-2 whether 20.65/hour is an hourly rate. If so, how many hours did Bachmann work?
USER-4 now waits for replies from the other users. Since they (in all probability) do not report to him managerially, there is no incentive for them to answer promptly, if at all. If even one does not answer, USER-4 is stuck. If one of them answers “I don’t remember – that was months ago”, USER-4 is equally stuck. You can clearly see that this process of salary rationalization is going to be arduous! It should also be noted that the age column exhibits the same sorts of problems as the salary column; data integrity issues are not restricted to monetary columns.
The source code of the above MongoDB commands, along with instructions on how to run them, is available in this GitLab snippet.
It should be noted that most NoSQL vendors also offer “schema first” facilities that address many of the shortcomings we discuss here. However, using them nullifies all the claimed benefits of schema later.
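For instance, MongoDB lets you attach a $jsonSchema validator when creating a collection. Here is a minimal sketch; the required fields, types, and bounds are our assumptions, chosen to match the example data:
db.createCollection("employee", {
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: ["name", "age", "salary"],
            properties: {
                name: { bsonType: "string" },
                age: { bsonType: "number", minimum: 0 },
                salary: { bsonType: "number", minimum: 1 }
            }
        }
    }
})
With this validator in place, the Stonebraker, Codd, and Bachmann inserts above would all be rejected at entry time; but, of course, that is schema first, not schema later.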
In a relational DBMS, things are considerably better. Consider the corresponding operations in SQL:
create table employee (name varchar, age int4, salary int4 check (salary > 0));
insert into employee values ('Stonebraker', 45, -99);
ERROR: new row for relation "employee" violates check constraint "employee_salary_check"
DETAIL: Failing row contains (Stonebraker, 45, -99).
insert into employee values ('Codd', 'old', 40000);
ERROR: invalid input syntax for integer: "old"
LINE 1: insert into employee values ('Codd', 'old', 40000);
^
insert into employee values ('Bachmann', 'unknown', '20.65/hour');
ERROR: invalid input syntax for integer: "unknown"
LINE 1: insert into employee values ('Bachmann', 'unknown', '20.65/h...
insert into employee values ('Gray', 120, 80000);
INSERT 0 1
The source code of the above PostgreSQL commands, along with instructions on how to run them, is available in this GitLab snippet.
Notice that Postgres refuses the salary value for Stonebraker, because salaries must be positive. Postgres also refuses the Codd insert, because age is an int4. It refuses the Bachmann insert because of both the age value and the salary value. The user must fix these erroneous values at data entry time, which is a wildly better idea than having some other user correct the problem after the fact. In the same vein, an integrity constraint on the age field could be defined to catch Gray’s age, as sketched below.
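For instance, had the table been created with a check constraint on age as well (the exact bounds are our choice; reasonable bounds are application specific), the Gray insert would have been rejected too:
create table employee (name varchar, age int4 check (age > 0 and age < 100), salary int4 check (salary > 0));
insert into employee values ('Gray', 120, 80000);
ERROR: new row for relation "employee" violates check constraint "employee_age_check"
DETAIL: Failing row contains (Gray, 120, 80000).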
Hence, most if not all of the data entry issues can be found and fixed at data entry time, which is the easiest time to deal with them. However, this requires that one adopt a schema first mentality.
Put differently, schema later allows erroneous data to be added, which accumulates as “data debt”; i.e., errors that must be corrected later, at a presumably higher cost. We assert that data debt is bad, because the best person to fix data errors is the person doing the initial insert. Schema later allows data debt to grow; schema first does not.
Let’s classify columns/attributes into one of two types:
Schema optional. These are columns for which a schema is irrelevant. Such columns contain documents, tweets, and other “casual” data. It will never make sense to run aggregates over such columns, so it does not matter whether the data is checked or not.
Schema required (now or later). In this case, queries to the column/attribute will produce erroneous or misleading results unless a schema is enforced. In our opinion, it is cheaper to use schema first rather than schema later (see the sketch below).
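One reason “later” is costlier: enforcing a schema after the fact does not repair the bad data already present. In MongoDB, for example, a validator can be attached to an existing collection with collMod, as in this sketch (the validator body is our assumption); by default it applies only to subsequent inserts and updates, so the Stonebraker, Codd, and Bachmann documents would remain exactly as they are:
db.runCommand({
    collMod: "employee",
    validator: {
        $jsonSchema: {
            properties: { salary: { bsonType: "number", minimum: 1 } }
        }
    }
})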
It is sometimes claimed that JSON is a solution to the above schema issues. Obviously it is not: the problematic inserts above already go into JSON documents in MongoDB, and JSON by itself enforces no types or constraints. If you have data that will require a schema at some point, you are way better off doing the work up front to avoid data debt, because the cost of schema later is a lot higher.
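The same point can be made in PostgreSQL with a jsonb column; a minimal sketch (the table name is our invention):
create table employee_json (doc jsonb);
insert into employee_json values ('{"name": "Codd", "age": "old", "salary": 40000}');
INSERT 0 1
The insert succeeds, because the jsonb type checks only that the value is well-formed JSON, not that age is a number. The data debt is back.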
This blog was co-authored by Álvaro Hernández, founder of OnGres.