DB 101: Databases for Developers

Advertisements
database icon over mail boxes
Original photo by Bernard Hermant on Unsplash

Over the past few weeks, I’ve been doing some deep-diving into databases as part of my day-job effort to migrate our existing Postgres 9x databases to something more recent, like Postgres 13. As an SRE I don’t typically have much exposure to the database layer of the applications that I support beyond making them available and ensuring migration scripts and the like run successfully. But this deep-dive has allowed me time to get to know our databases and their structure and to see some of the inner workings of our applications. Out of this has sprung a few articles so far (and a few to come, stay tuned!) but today I want to talk about databases — specifically relational databases — from the perspective of a software developer and maybe help some new and aspiring developers understand their data stores a little better.

The first thing I want to talk about is probably the most controversial topic on this list, and that’s choosing the right database for your application. Now, I understand that not everyone will be in a position to choose the DB for their project, but I think it will be helpful to understand a few of the most popular DBs and their pros/cons.

First, there’s the matter of DB type: relational, non-relational (or NoSQL), and key-value (or KV).

Author’s note: One thing I want to point out is that the dichotomy of SQL vs. NoSQL is sort of a misnomer in that a relational DB doesn’t have to use SQL and non-relational DBs can use SQL. But SQL and NoSQL have become shorthand for relational and non-relational, so I’m going to use this colloquial meaning of these terms for the rest of the article.

Relational databases — sometimes abbreviated as RDBMS or SQL databases — store two kinds of data: the various bits of data that populate the database and the relationships between those bits. This allows SQL DBs to be flexible in their querying and make it easier to only get the data you’re looking for, as the data tends to be broken up across tables (or normalized) more than in a NoSQL DB. Normalization reduces data redundancy and improves data integrity, as you only need to store one bit of data a single time and simply reference it where it’s needed. Popular examples of SQL DBs are PostgreSQL, MariaDB, and Microsoft SQL Server.

NoSQL, or non-relational, databases come in many different flavors (too many to talk about here), but probably the most common are document databases. Instead of data being stored in pieces across multiple tables like a SQL DB, data is stored all together as a document. Documents are semi-structured in that they are typically stored in some kind of encoding, like JSON or XML, but the data that is encoded does not need to follow a consistent format. You can store documents of different types (or schemas) in a single document database. Document databases have the advantage of flexibility that SQL DBs don’t, and depending on the data you’re working with, this could be an advantage. Popular document databases are MongoDB, Elasticsearch, and Couchbase.

Key-value stores are similar to NoSQL databases in that, technically speaking, document databases are a subset of KV DBs. However, for our purposes, we’ll keep KV DBs separate, as they serve different purposes from an application’s perspective. Key-value stores are essentially hash tables and as such can be incredibly fast when doing lookups. This is why they’re commonly used as caches for applications; their read performance is very, very good. Typically, however, KV DBs are limited in their feature sets and, as such, working with complex data can be more time- and process-intensive than offloading some of that work to a SQL or NoSQL database. Some of the most popular key-value stores are Redis, Memcached, and Consul.

Now that we have an overview of the basic database types, how do we decide which one to use? The first thing you want to look at is your data and determine five things:

Will you be storing log files? User accounts? Product information? JSON documents? The data type will help determine whether a SQL or NoSQL database is the right choice.

Can my data be easily normalized? If your data can’t be normalized, then maybe a NoSQL database is the right way to go.

Do all of the units of data follow the same schema? Could it be made to fit into column types? If the data isn’t uniform, then a document database might be the right choice.

Will this application be read-heavy or write-heavy? A write-heavy application might require a database that performs well under write-heavy loads.

Do we have any contracts or licenses that would sway the decision toward a certain brand or platform? Exiting contracts or considerations might mean using something like Microsoft SQL Server or Oracle Database instead of an open source offering like MariaDB.

In the end, which database to choose isn’t something that I can tell you in an article. It’s a decision that needs to be made by the development team after considering all the factors above.

The next question to ask yourself is “how do I want to manage my database”? This may seem like a silly question, but the choice between writing raw SQL queries, using a SQL query constructor, or using a full-blown ORM isn’t an easy one. Wait, what’s an ORM?

An ORM, or object-relational-mapper, is a way to manage the objects in a database and represent database constructs (tables, columns, etc.) as objects in the programming language you’ve chosen. Typically, tables are represented as classes, class properties are columns, and so on. Using an ORM can make life a whole lot easier, especially when just getting started, but ORMs also have their drawbacks.

ORMs are generic by necessity. They have to be able to handle any kind of query for any kind of data, and as such, they tend to optimize for flexibility over performance. Querying data with an ORM, especially in a high-volume production environment, can cause latency issues, deadlocks, lockups, and more, causing your application to come grinding to a halt. I’ve seen some ORMs do some really wacky things, like multiple nested transactions and stuff. But they still have their place. Early development, prototyping, and low-volume database access are all good use cases for using an ORM.

Alternatively, you could simply write raw SQL. While this will most likely provide you with the most performance, it can also be a more painstaking way of accessing your data. Also, unlike with most ORMs, which automatically escape special characters and quote strings, writing raw SQL can open your application up to SQL injection attacks if you’re not very careful about how you’re handling external data (especially user-provided data).

Somewhere in the middle are SQL constructors. They provide some of the same tooling as ORMs (SQL-injection mitigation, a more functional way of querying) without some of the overhead that ORMs bring to the table. They’re essentially string builders designed for constructing SQL statements. They provide a middle ground between ORMs and DIY SQL.

One of the most overlooked and important aspects to understand about databases is indexes. Indexes provide a way for a database to quickly look up values in a table without having to iterate through all of the rows in the table. The trade-offs are increased storage space required to hold the index and increased write time when inserting new values into the table, as the database needs to also update the index lookups. Most tables in a SQL DB have a primary key that is automatically indexed by the database; in addition, you can specify other columns in the table to be indexed.

Let’s say that you’re working with a table of users and you want to retrieve a user’s email when they input their username. Our table might look something like this:

id*| username | email 
-------------------------------------------------
1 | user_ted | ted@company.us
2 | user_jill | jill@company.us
3 | user_annie | annie@company.us

In this case, id is our primary key and thus automatically has an index associated with it. However, your users aren’t going to be submitting their row ID, they’re going to be submitting their username. With only three rows in the table, an index isn’t going to do much in terms of performance, but what happens when the table grows to 100 rows? 1,000? 10,000? Each query for the user’s email will take longer and longer as new rows are added to the table. By adding an index to the username column, we can reduce that query time to an almost constant time (or O(1)). Again, this costs us some disk space and write performance, but as we’ll be looking up users more often than we’ll be adding them, the trade-off is a very good one as long as we don’t get carried away. Getting carried away would be to add indexes we don’t need, like an index on email since we’re never looking up by email address. All we would be doing by adding an email index would be to add more time to inserts and needlessly using disk space.

When you have an application that will have a long-running connection to the database, it can be good to periodically check that the database is still available. Depending on the database driver, language, and other factors, there might be ways to do this easily, like checking for an open connection, but sometimes the simplest is to just run a DB query. What kind of query you run, however, depends on your goal.

OK, I understand the desire to ensure your database is in a usable state. However, I don’t recommend this as a regular health check. I would do this sort of validation step on application start. For one, getting a list of tables is going to put a read load on the database. It might be small, but if you have multiple replicas of the application all running at the same time doing the same health check, then the performance impact will be greater. But if you’re really hellbent on doing it this way, then just be sure to keep in mind that you might see a performance hit.

I like this one. Getting statistics from the database, like read/write load, would be something useful the application can log for later consumption. There’s not much the application itself can do with this information beyond the state of the database’s availability, but it might help later on in troubleshooting.

This is probably the most basic and most logical health check, at least to me. Everything else, like DB performance or validation or whatever, can be handled separately and by other tools. Checking basic availability is really the only thing your application cares about, so keeping it simple is the best way to go in my opinion. So what kind of query should you use for this? Well, what you absolutely don’t want to do is something like this:

SELECT * FROM 'Customers';

No. No, no, no. Stop it.

Yes, this will check if the DB is alive. But it will also give you a list of all your customers, a list that will grow as your customer base grows, meaning that your health check will also grow in execution time. Don’t do this. This is a real thing I saw recently, so yes, it does happen. Instead, the best and simplest query is this:

SELECT 1;

This is essentially the same as

def my_function():
return

Short, simple, and does the job. You can’t really do better than this.

This has only been a skimming of everything that goes into working with databases, but I hope it provided some insight into some often overlooked areas. Remember: Databases are usually going to be the performance bottleneck in your application, so treat them well and always look for ways to make them faster.

Source

Leave a comment

Your email address will not be published. Required fields are marked *