
Auto-complete table and column names in Jupyter with SQLAlchemy

SQL is pretty fantastic. It’s easy to learn and with the right training it can do very powerful things. Even in the wrong hands, it can do powerful things! And yet, as a programmer-turned-data scientist, I’m annoyed every time that I have to write a query. Many times, my irritation comes from not having a good SQL editor that auto-completes table and column names. Other times, it’s from writing the same SQL clause over and over again, copying and pasting and setting up hazards for my future self.

Since I work mostly in Jupyter for my analysis contracts, I wanted to share with others how I use SQLAlchemy to DRY up my queries and get auto-complete for table and column names.

Getting started

First off, I take advantage of SQLAlchemy’s reflection functionality. I’m often working with tables that have dozens of columns, and I don’t want to go through the pain of specifying every single column and column type. Rather, I write something like this to load a SQLAlchemy table object based upon a table in the database:

import sqlalchemy
from sqlalchemy import create_engine, MetaData

# Reflect the existing 'names' table instead of declaring every column by hand
conn_string = "postgresql://postgres:@localhost:5432/postgres"
engine = create_engine(conn_string)
meta = MetaData(bind=engine)
t_names = sqlalchemy.Table('names', meta, autoload=True)
[Screenshot: the ‘names’ table with three records]

For the names table shown above, the t_names object gets loaded with this:

[Screenshot: the reflected SQLAlchemy Table object shown in Jupyter]
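
Since the screenshot isn’t reproduced here, a quick way to see what got reflected is to ask the Table object for its column names (the columns listed below are just an example):

print(t_names.columns.keys())  # e.g. ['id', 'first_name', 'last_name']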

Now that we have tables represented as Python objects, I can start doing some fun things like parameterizing queries in loops and then kicking them off as I walk away to get a pastry while data gets fetched. Below, I show a simple query that gets run in a for loop with results saved as pandas DataFrames.

The first loop uses string substitution to parameterize a query. The second loop runs the same exact query, but uses more SQLAlchemy features.
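The original notebook cells aren’t shown here, so what follows is a minimal sketch of both loops. It assumes the names table has a first_name column (as above) and uses an illustrative list of names to filter on.

import pandas as pd
from sqlalchemy import select

names_to_fetch = ['Alice', 'Bob', 'Carol']  # illustrative parameter values

# Loop 1: parameterize the query with plain string substitution.
raw_results = {}
for name in names_to_fetch:
    query = "SELECT * FROM names WHERE first_name = '{}'".format(name)
    raw_results[name] = pd.read_sql(query, engine)

# Loop 2: the exact same query, built with SQLAlchemy constructs instead.
sa_results = {}
for name in names_to_fetch:
    query = select([t_names]).where(t_names.c.first_name == name)
    sa_results[name] = pd.read_sql(query, engine)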

For really simple or one-off queries, I generally prefer just writing the raw SQL. For more complicated ones, and especially if I have many queries that are subtly different, I like using the more programmatic SQLAlchemy way.

Auto-complete column names

In the second loop above, the one containing the more SQLAlchemy-ish query, you’ll see that I was able to access the first_name column of the names table by writing t_names.c.first_name. I originally thought that Jupyter’s built-in auto-complete of defined variables and functions would also extend to whatever was held in this mysterious c collection.

I was wrong. When I type t_names.c into a Jupyter cell and then hit Tab, I get nothing. So instead, with a little meta-programming magic, I wrote a loop that goes through a SQLAlchemy table object’s columns and explicitly defines each one as an attribute back on the object, named c_<column name>.

[Screenshot: dynamically defining column attributes]
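
The screenshot above shows the loop; a sketch of the same idea (not necessarily the original code) looks like this:

# For every column on the reflected table, define a c_<column name>
# attribute directly on the Table object so Tab completion can find it.
for col in t_names.c:
    setattr(t_names, 'c_' + col.name, col)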

After running this loop, I can type t_names.c_ and then hit Tab to let Jupyter do its normal auto-complete thing, and we see a list of available column names. Tada!

Auto-complete table names

While the code I’ve shown so far will reflect a table and then tack on attributes for each column, you still need to reflect each table individually. One simple fix is to use SQLAlchemy’s Inspector object to grab a list of all the table names in a schema and then go through and reflect them individually. I tried that and found it incredibly annoying to have to wait for possibly hundreds of tables to finish reflecting before I could run anything.

Instead, I like doing it on-demand. Essentially, for a schema named ‘public’, I create an empty class. I then use an Inspector to loop through all the table names, writing a little lambda for each one and tacking it onto the Public class as a property.
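
Here is a sketch of that idea, under the same assumptions as before (the engine and meta objects defined earlier, and a local Postgres database). The helper name make_table_property is mine, and I use a small closure where the original uses a lambda; the effect is the same.

from sqlalchemy import inspect

class Public:
    pass  # empty class; one property per table gets tacked on below

def make_table_property(table_name):
    def getter(self):
        # Reflect on first access; the MetaData caches the Table afterwards.
        if table_name not in meta.tables:
            table = sqlalchemy.Table(table_name, meta, autoload=True)
            # Add c_<column name> attributes so columns auto-complete too.
            for col in table.c:
                setattr(table, 'c_' + col.name, col)
        return meta.tables[table_name]
    return property(getter)

inspector = inspect(engine)
for table_name in inspector.get_table_names(schema='public'):
    setattr(Public, table_name, make_table_property(table_name))

public = Public()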

I’ve included the full Jupyter notebook for doing this magic below. The punchline is that in my Jupyter notebooks I can type public. and hit Tab to get a list of table names. Accessing public.names will on-demand reflect the table from the database. And after said reflection, I then get auto-completion of column names, too.
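
As a quick illustration of how this feels in practice (the column name and filter value are illustrative):

t = public.names  # reflects the table on first access
df = pd.read_sql(select([t]).where(t.c_first_name == 'Alice'), engine)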

Well, I hope the above was useful. If none of it made sense, I apologize for wasting your time. If your interest is piqued, you can see the full Jupyter code here. It will connect to a local Postgres instance and assume that a table called names exists in the public schema.

Redshift: easy to start, annoying to maintain

Redshift was the first cloud data warehouse that I used. When I chose it to be the platform for my team, the qualities that drew me to it were:

  1. It was well-established. I knew I wouldn’t be getting a piece of technology from a start-up that might not be around next year.
  2. It was popular. It seemed like it had lots of market share, so if I ran into any problem, a quick Google search would often resolve it.
  3. My team started off using services in Amazon already, so I didn’t have to worry about storing sensitive student data in a whole new service.
  4. There were a ton of third-party vendors/tools that were compatible with Redshift.
  5. Redshift’s SQL dialect was based off of PostgreSQL, which my team was already using.
  6. I didn’t need to contact anyone in Sales or sign additional contracts. I could just click on a few buttons and spin up a small deployment to test it out.

Unfortunately, I ran into some obstacles along the way that I couldn’t design around and that eventually led me away from Redshift. These obstacles are the kind that won’t go away, no matter what team or project you’re on.

Before getting into these obstacles, let’s get some background on Redshift’s architecture. When setting up Redshift, you specify what type of nodes you want (lots of storage or lots of compute) and how many nodes. You click a button, wait a few minutes, and then…tada! You have a cluster of machines working together to give you a data warehouse. It’s pretty great. Those nodes have a leader. The leader listens for queries, figures out how to parcel out the work to other nodes, collects the results, and hands them to your analytics tools (like Tableau).

Problem: leader bottleneck

By default, that leader node maintains a single queue in which queries start execution in the order they are received. Also by default, at most 5 queries can run at the same time. This level of concurrency was fine when it was just me. As my team grew though, we would hit the limit multiple times a day and people would complain that their queries were taking too long. In reality, as more analysts were using the warehouse, more poorly written queries were getting written as well. Those queries were tying up the system, sometimes taking hours if not days to finish.

Queries that take up too many system resources are always going to be a drain on the system; no warehouse can make that problem go away. We ended up implementing multiple queues for different priorities as well as a query killer that would automatically terminate any query that took too long.

In the end though, we knew the leader was eventually going to be a bottleneck as we grew and grew. There was little we could do to design around this problem other than have multiple identical Redshift clusters, whose maintenance sounds like one of the rings of Hell.

Problem: SORTKEY and DISTKEY

In any data warehouse, your data is distributed across nodes. Some ways of distributing your data are going to lead to faster queries than other ways. Data that is often JOINed together benefits from being co-located. Data that you would typically create a database index for would also benefit from being sorted on disk to make searches faster. Redshift addresses these issues with its SORTKEY and DISTKEY features. When you create a table, you can specify which column to use as the key for SORTing the rows and which column to use as the key for DISTributing the rows to different nodes.

When I was creating tables I often didn’t know what the best choice was, so I’d leave it blank or guess. Unfortunately, Redshift doesn’t give you do-overs. If I changed my mind, I had to create a different table with the appropriate SORTKEY and DISTKEY set, copy all of the data into it, DROP the original table and then rename my new table to the name of the original. This was often made worse because any VIEWs I may have created based on the original table would prevent me from DROPping it in the first place.
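
To make that dance concrete, here is a minimal sketch of the rebuild, expressed as SQLAlchemy calls like the snippets earlier on this page. The connection string, table, and column definitions are illustrative, not from my actual warehouse.

from sqlalchemy import create_engine, text

# Hypothetical Redshift connection; Redshift speaks the Postgres protocol.
redshift = create_engine("postgresql://user:password@my-cluster:5439/dev")

with redshift.begin() as conn:
    # Recreate the table with the keys I wish I had picked the first time.
    conn.execute(text("""
        CREATE TABLE names_new (
            id         BIGINT,
            first_name VARCHAR(256),
            created_at TIMESTAMP
        )
        DISTKEY (id)
        SORTKEY (created_at)
    """))
    # Copy everything over, drop the original, and rename.
    conn.execute(text("INSERT INTO names_new SELECT * FROM names"))
    conn.execute(text("DROP TABLE names"))  # fails if any VIEW still depends on it
    conn.execute(text("ALTER TABLE names_new RENAME TO names"))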

This was arguably one of the main reasons my team stopped using Redshift.

Problem: Resizing and costs

In the beginning, I wanted to start small, so I provisioned 1 Redshift node. That went well, and eventually the amount of data I was storing got pretty large, so I took the system down one day and resized it to 3 nodes. I went to lunch, and when I came back, everything was working and I had 3x more capacity than before.

At some point when my team started using more and more data, I needed to resize again. I did it over the weekend and grew it to a 5 node cluster. I knew that one day I would need a 10 node cluster, but I didn’t want to waste the money until I got closer to that point. That little resizing effort took about 10 hours to finish. I was astounded.

In order to have enough capacity, I had to resize my cluster. In order to resize, I had to take down the cluster. The bigger my cluster, the more downtime I’d incur. The downtime from resizing was so large that it really encouraged us to way over-provision and spend more money just so we wouldn’t have to do it as frequently.

Given that other cloud solutions let you pay only for what you use and obviate resizing altogether, Redshift’s model seemed unnecessarily archaic.

Wrap

I truly appreciate how easy it was for me to learn about data warehousing and performance tradeoffs through Redshift. It really is ridiculously easy to get running and there still are a lot of tools that work seamlessly with the AWS ecosystem.

The drawbacks for someone maintaining or scaling the system, though, are painful, and dealing with them is not something I want to do again.