Redshift Research Project




So! Last Friday I finally released a white paper, the first in a little while. It was a lot of work, but mostly work on framework-type code which provides the foundation for a bunch of other investigations I have in mind.

The very next white paper ought to be “Query Compilation”, because it’s up in the air - I need to run the tests in a couple of regions, and then I need to make a Cross-region query compilation checker, to determine the query compilation behaviour in all regions on an ongoing basis (the test suite runs in one region at a time only, and takes a long time - the query checker, now I know what to look for, will be a lot quicker).

Significantly, I am now putting together an AMI, which contains a web-server providing Redshift tools - the Amazon Redshift Cluster Toolkit.

To start with, there’s just a table analysis tool - I want to check that an AMI is a viable platform - but I have a bunch of ideas for essential Redshift tools, like a slow node checker, and Permifrost.

On the subject of slow nodes, I’m thinking now about a white paper which examines the range of node performance for the three small node types - I’ve written all the code I need to do this; I just need to run the benchmarks. That’s pretty cool.

So, lots of ideas, lots of things to get done :-)


systemd Strikes Again

I’ve been working on making the AMI for the Amazon Redshift Cluster Toolkit.

There’s just one tool at the moment, the Table Analyzer, and part of how it works - since it can be long-running - is that you kick off an analysis from the web-based GUI and it detaches itself and gets on with its job.

This means, to be more detailed, that the Python script which starts the analyzer calls subprocess.Popen() and executes the Python script which is the analyzer.
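As a rough illustration of that launch step, here is a minimal sketch of starting a script so that it carries on independently of whatever started it. The function name `start_detached` is made up for this example, it assumes a POSIX system (which is what `start_new_session` needs), and it is not the toolkit's actual code.

```python
import subprocess
import sys


def start_detached(script_path, args=()):
    """Start script_path with the current interpreter and return at once.

    Hypothetical helper for illustration only.
    """
    return subprocess.Popen(
        [sys.executable, script_path, *args],
        # Do not share the parent's standard streams with the child.
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        # POSIX only: put the child in its own session, so it is not
        # tied to the parent's process group.
        start_new_session=True,
        close_fds=True,
    )
```

The caller gets the `Popen` object back immediately and is free to ignore it; the child keeps running on its own.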

Problem is PYTHONPATH.

This is an environment variable which tells Python where to look for modules, and I have written a number of shared libraries of my own.

So, to begin with, I had Apache running a Python script which spawns a Python script, and none of them had PYTHONPATH.

First, I used SetEnv in the site config on Apache to set the variable (which sucks, as it’s now hard coded in the site config file) and this allows the first Python script to see PYTHONPATH, but not the second.

The second Python script I believe is executing as user www-data, but without Apache handing it any environment variables; subprocess.Popen() lets you pass in environment variables, but whatever you pass in replaces the child process's whole environment rather than adding to it, so passing in PYTHONPATH alone is no use at all.
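Because the `env` argument to subprocess.Popen() replaces the child's environment rather than adding to it, one possible way around that is to copy the parent's environment and add PYTHONPATH to the copy. This is a minimal sketch of that idea, not something tried in this setup; `child_env` is a name invented for the example.

```python
import os


def child_env(extra_paths, base=None):
    """Return a copy of the environment with extra_paths put first in PYTHONPATH.

    Hypothetical helper; base defaults to the current process's os.environ.
    """
    env = dict(os.environ if base is None else base)
    existing = env.get("PYTHONPATH", "")
    # Keep anything already in PYTHONPATH, after the new entries.
    parts = list(extra_paths) + ([existing] if existing else [])
    env["PYTHONPATH"] = os.pathsep.join(parts)
    return env
```

The result would then be handed over as `subprocess.Popen([...], env=child_env(["/path/to/shared/libs"]))`, where the path is a placeholder. It still leaves the library location written down somewhere in the calling script, which is its own kind of hard coding.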

I then spent all of today, from 9am to 11pm, trying to figure out how to make PYTHONPATH available to the table analyzer Python script.

The main suspects were /etc/environment and /etc/profile, and maybe /etc/default, and none of them worked; none of them would set PYTHONPATH for a Python script spawned from a Python script.

Then I began to find people writing that systemd now did this.

With systemd, there is an option in the system.conf file where you can specify, on a single line, var=value pairs, separated by spaces. That’s it, and what goes there is visible to systemd and anything it spawns (which presumably includes Apache, and so scripts spawned by Apache or its children), but to nothing else - it doesn’t show up for your own shells, for example. For those, you also have to specify the variables in /etc/environment or in your bashrc or whatever.

So I did this.

It doesn’t work.

A Python script spawned from a CGI still does not have PYTHONPATH.

I feel very much like I’m sinking into the Pripet Marshes.

Moving on, from this unsolved problem, I discovered a second problem : __pycache__.

I have a shared, global Python library directory.

Whoever (“whom” I consider obsolete grammar) first calls a Python script which uses those libraries creates - and so owns - the __pycache__ directory and the cache files created, and no one else can modify the cache files which have been created.

So how the hell am I supposed to use this? umask is per user, not per directory.

The only solution I can see right now is to touch all the cache files which will exist, so they exist, and chmod them to 777, which is completely insane.

Looking at the system-wide Python libraries installed with Python, they’re all owned by root and all the cache files are owned by root. It looks a bit like they’re all made, or pre-populated, and never change, which is no use at all to me, because I’ll be changing my libraries all the time.

I dimly wonder if the intended approach is to publish shared libraries to some shared location, make all their cache files, and do no dev work there; do the dev work in your own copy of those files, and when you’re done, republish and wipe/retouch the cache files.
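If that publish-then-prebuild approach is the intended one, the standard library's `compileall` module can generate the cache files in one go. Below is a minimal sketch, assuming the publishing user owns the shared directory and everyone else only needs to read it; `publish_bytecode` and the permission choices are this example's assumptions, not an established recipe.

```python
import compileall
import os


def publish_bytecode(lib_dir):
    """Compile every .py file under lib_dir and make the caches readable by all.

    Hypothetical helper; returns True if every file compiled.
    """
    # force=True rebuilds the cache files even if they look up to date.
    ok = compileall.compile_dir(lib_dir, quiet=1, force=True)
    for root, _dirs, files in os.walk(lib_dir):
        if os.path.basename(root) == "__pycache__":
            # Owner can write; everyone else can read and traverse.
            os.chmod(root, 0o755)
            for name in files:
                os.chmod(os.path.join(root, name), 0o644)
    return bool(ok)
```

This would be re-run on each republish. Other users then read the caches but never need to write them, which avoids the chmod 777 route.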

There are days when dev work is just one long struggle with problems which shouldn’t exist and have no answers, and this has been one of them.



Oddly, back in Jan, when I made the price tracker, the Chinese regions from the pricing API had no region name given (and if I remember correctly, also no prices).

As of today, they do, prices being in “CNY” (Yuan) as the Yuan is not freely convertible; you have to have permission from the State to sell Yuan (i.e. buy foreign currency), or to buy Yuan. It’s part of the State controlling people.


Near Future Work

So, I had hoped to publish the white paper on query compilation on Saturday (yesterday).

The write-up took some time, more than I expected, and as a part of that process I realised the way the results were presented was really not right, which meant I needed to re-run the tests.

Trying to re-run the tests, I found that ds2.xlarge nodes could not at this time be created in eu-north-1 - which I needed to do.

I’m also noting that other node types are taking one hell of a long time today to start in that region.

Anyways, it’s serendipitous; thinking about it, only dc2.large is actually needed for the white paper, the others just make the data too large and bring almost no value.

While I was re-running the test, I started working on Permifrost.

However, having made some progress, I decided not to make a direct copy of Permifrost.

The design work belongs to someone else, so it’s not okay that I take it, implement it, and then maybe in the future begin to charge for it.

Moreover, it’s a fair bit of work and so will take some time - and I think I can get vital functionality available nowhere else out much sooner.

What I’m thinking to do to start with then is simply a page or two which displays privileges, by group and by user, something Redshift cannot do; so to begin with, displaying information. Management of groups/users - and everything else - can be added incrementally.

So I need to make those few initial pages, and also there’s a bit of work which has to be done on the Slow Node Checker, because currently it doesn’t check network performance, and it has to.



I am as I type finishing off the AZ64 white paper. I need shortly to do a full test run, update the Discussion with those results, and then I can publish.

I’ve been working on an AMI of Redshift tools, and I had thought this would be limited to tools where I could allow the SQL used to be visible (and so copyable) by users, since any SQL issued of course shows up in the system tables (well, leader-node-only stuff shows up only briefly and awkwardly, but still).

I’ve spent years writing a comprehensive set of replacement system tables, and I had not seen a way to release these as a product, because they can so easily be copied.

Then the obvious finally struck me : why not copy the necessary system tables into the AMI, and run the SQL of the replacement system tables there, emitting the output to a web-page?

So that’s what I’m now working on, except for getting the AZ64 paper out this weekend.
