/etc

2019-11-27

For my Book Aligner project (among others), I’ve been relying on Google Fusion Tables to provide a simple database backend which I could query from JavaScript. Since Fusion Tables is going away on December 3rd, 2019, I needed a replacement, and something preferably free or low-cost which could quickly handle querying several million rows of data. After surveying some alternatives, I decided to migrate the data to Amazon Web Services’ DynamoDB offering. This post is just a quick outline of my initial experiences, in case they’re useful to anyone else.

The first problem was getting my data out of Fusion Tables and into DynamoDB. I had a CSV data export of my Fusion Tables from my google-fusion-tables-backup script (though, Google Takeout now also provides CSV exports for Fusion Tables and is much easier to use). But how to get that into DynamoDB? AWS provides a tutorial for “Importing Data into DynamoDB”, but this doesn’t take a CSV - it instead takes an idiosyncratically-delimited file of the same format as a DynamoDB export. It also uses AWS Data Pipelines and Elastic Map Reduce, which may increase the cost of importing data over simply writing the data into DynamoDB directly. After looking around, I found some solutions to crib from for writing a simple Ruby script for reading a CSV into a DynamoDB table, a somewhat-slow 25 records at a time (you could, of course, speed this up by doing the batch inserts asynchronously). Since there are around 7.5 million rows of data in the Book Aligner alignment tables, this wound up costing around $9 worth of writes to import completely into DynamoDB. Ideally, if I were regularly updating the backend tables, I’d also write a smarter process that could just delete deprecated records and add new ones, instead of writing the entire table every time.

The next problem was querying the data. Previously, I used an application-specific Google API key which had read-only permissions for the Fusion Tables, which I could query using AJAX calls from JavaScript. You might think you could do something similar with directly querying DynamoDB tables, and you probably could, but if it started being abused you could find yourself facing a hefty bill (whereas if a Google API key hits its quota limit, it just starts returning errors). Instead, I decided to use the AWS API Gateway service to proxy the DynamoDB requests, since it offers more control (and quotas/throttling/caching). AWS provides a handy tutorial for some of this, which I was mostly able to follow, with a few hiccups. The first was getting the API Gateway service a proper “Execution role” ARN - without it, attempting to test the API would result in the error “Execution failed due to configuration error: API Gateway does not have permission to assume the provided role”. The solution was to manually edit the role’s Trust Relationship, as described in this Stack Overflow answer. The second major issue after testing my API and making sure it worked how I wanted it to was actually deploying the API live on the web. The documentation is not fantastic on this, and it seems like you can only deploy using the command-line aws utility, not the web interface - and the documentation also isn’t clear that you must specify the stage name with --stage-name when creating the initial deployment, otherwise you’ll get the cryptic error “An error occurred (BadRequestException) when calling the CreateDeployment operation: Invalid deployment content specified.CreateDeploymentInput should not be null.

Aside from those relatively minor issues, I’m pretty happy with how easy it was to move everything over. The old Fusion Tables backend for the Book Aligner only got around 4,000 queries per month, so I don’t anticipate read costs being a major issue with DynamoDB (at $0.25 per million read requests). For comparison, a service like AWS Athena or Google BigQuery which costs $5 per terabyte of data scanned would wind up costing around $10/month in reads with the approximately 500MB of data in the Book Aligner tables.