Harry Li Consulting, LLC

Redshift: easy to start, annoying to maintain

Redshift was the first cloud data warehouse that I used. When I chose it to be the platform for my team, the qualities that drew me to it were:

  1. It was well-established. I knew I wouldn’t be getting a piece of technology from a start-up that might not be around next year.
  2. It was popular. It seemed to have plenty of market share, so if I ran into a problem, a quick Google search would often resolve it.
  3. My team was already using services in Amazon, so I didn’t have to worry about storing sensitive student data in a whole new service.
  4. There were a ton of third-party vendors/tools that were compatible with Redshift.
  5. Redshift’s SQL dialect was based on PostgreSQL, which my team was already using.
  6. I didn’t need to contact anyone in Sales or sign additional contracts. I could just click on a few buttons and spin up a small deployment to test it out.

Unfortunately, I ran into some obstacles along the way that I couldn’t design around and that eventually led me away from Redshift. These obstacles are the kind that won’t go away, no matter what team or project you’re on.

Before getting into these obstacles, let’s get some background on Redshift’s architecture. When setting up Redshift, you specify what type of nodes you want (lots of storage or lots of compute) and how many nodes. You click a button, wait a few minutes, and then…tada! You have a cluster of machines working together to give you a data warehouse. It’s pretty great. Those nodes have a leader. The leader listens for queries, figures out how to parcel out the work to other nodes, collects the results, and hands them to your analytics tools (like Tableau).

Problem: leader bottleneck

By default, that leader node maintains a single queue in which queries start execution in the order they are received. Also by default, at most 5 queries can run at the same time. This level of concurrency was fine when it was just me. As my team grew, though, we would hit the limit multiple times a day and people would complain that their queries were taking too long. In reality, as more analysts were using the warehouse, more poorly written queries were being submitted as well. Those queries were tying up the system, sometimes taking hours if not days to finish.
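
You can see this queueing for yourself. As a rough sketch (assuming the default workload management setup), the following shows how many queries are executing versus waiting at any given moment:

    -- Count queries currently executing vs. queued in workload management
    SELECT state, COUNT(*) AS num_queries
    FROM stv_wlm_query_state
    GROUP BY state;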

Queries that consume too many resources are always going to be a drain; no warehouse can make that problem go away. We ended up implementing multiple queues for different priorities as well as a query killer that would automatically terminate any query that took too long.
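
Our query killer was nothing fancy. A minimal sketch of the idea, run as a scheduled job (the one-hour cutoff and the pid below are hypothetical, not the exact values we used):

    -- Find queries that have been running for more than an hour
    -- (duration in stv_recents is reported in microseconds)
    SELECT pid, user_name, duration / 1000000 AS seconds_running, query
    FROM stv_recents
    WHERE status = 'Running'
      AND duration > 3600 * 1000000;

    -- ...then terminate each offender by its process id
    SELECT pg_terminate_backend(12345);  -- hypothetical pid from the query above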

In the end though, we knew the leader was eventually going to be a bottleneck as we grew and grew. There was little we could do to design around this problem other than have multiple identical Redshift clusters, whose maintenance sounds like one of the rings of Hell.

Problem: SORTKEY and DISTKEY

In any distributed data warehouse, your data is spread across nodes. Some ways of distributing your data are going to lead to faster queries than others. Data that is often JOINed together benefits from being co-located. Data that you would typically create a database index for would also benefit from being sorted on disk to make searches faster. Redshift addresses these issues with its SORTKEY and DISTKEY features. When you create a table, you can specify which column to use as the key for SORTing the rows and which column to use as the key for DISTributing the rows to different nodes.
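
For example, specifying both keys at creation time looks like this (the table and columns are made up for illustration):

    -- Hypothetical click-stream table: distribute on the column we JOIN on,
    -- sort on the column we filter by most often
    CREATE TABLE page_views (
        user_id   BIGINT NOT NULL,
        page_url  VARCHAR(2048),
        viewed_at TIMESTAMP
    )
    DISTKEY (user_id)
    SORTKEY (viewed_at);

With user_id as the DISTKEY, JOINs against other tables distributed on user_id don’t have to shuffle rows between nodes; with viewed_at as the SORTKEY, queries that filter on a time range can skip large chunks of the table.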

When I was creating tables I often didn’t know what the best choice was, so I’d leave it blank or guess. Unfortunately, Redshift doesn’t give you do-overs. If I changed my mind, I had to create a different table with the appropriate SORTKEY and DISTKEY set, copy all of the data into it, DROP the original table and then rename my new table to the name of the original. This was often made worse because any VIEWs I may have created based on the original table would prevent me from DROPping it in the first place.
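
Concretely, changing my mind about the keys on a table like the one above meant a dance like this (sketched for illustration):

    -- 1. Re-create the table with the keys I wish I had picked the first time
    CREATE TABLE page_views_new (
        user_id   BIGINT NOT NULL,
        page_url  VARCHAR(2048),
        viewed_at TIMESTAMP
    )
    DISTKEY (user_id)
    SORTKEY (viewed_at);

    -- 2. Copy everything over (slow and storage-hungry for a big table)
    INSERT INTO page_views_new SELECT * FROM page_views;

    -- 3. This fails if any views still depend on the old table;
    --    those have to be dropped and re-created as well
    DROP TABLE page_views;

    -- 4. Swap the new table into place
    ALTER TABLE page_views_new RENAME TO page_views;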

This was arguably one of the main reasons my team stopped using Redshift.

Problem: Resizing and costs

In the beginning, I wanted to start small, so I provisioned 1 Redshift node. That went well, and as the amount of data I was storing grew, I took the system down one day and resized to 3 nodes. I went to lunch and when I came back, everything was working and I had 3x more capacity than before.

At some point, as my team was using more and more data, I needed to resize again. I did it over the weekend and grew it to a 5-node cluster. I knew that one day I would need a 10-node cluster, but I didn’t want to waste the money until I got closer to that point. That little resizing effort took about 10 hours to finish. I was astounded.

In order to have enough capacity, I had to resize my cluster. In order to resize, I had to take down the cluster. The bigger my cluster, the more downtime I’d incur. The downtime from resizing is so large that it effectively encourages users to over-provision and spend more money just to avoid doing it as often.

Given that other cloud solutions let you pay only for what you use and do away with resizing altogether, Redshift’s model seemed unnecessarily archaic.

Wrap

I truly appreciate how easy it was for me to learn about data warehousing and performance tradeoffs through Redshift. It really is ridiculously easy to get running, and there are still a lot of tools that work seamlessly with the AWS ecosystem.

The drawbacks, though, for someone maintaining or scaling the system are painful and not something I want to deal with again.