Redshift Research Project



AWS Bill

I spent 600 USD on Redshift last month =-)


Materialized Views

A paper describing materialized views has been published.

I’m looking to stick to a weekly rate, publishing on the weekend, but this one was a lot of work - I kept on figuring out new things.

As such, it’s published, the content is good, but I’m not happy with the presentation of SQL in the results section. In future I’m going to start using pglast to format SQL. It’s an extra dependency (the only dependency now is boto3) but the results have to be readable.


System Table Explorer Update

Redshift 1.0.30840 is out. The System Table Explorer now has a dump from this version.


RA3 Investigation

I’ve finally had a decent idea about how to begin comparing RA3 node disk performance to non-RA3.

An ra3.xlplus has 932 GB of SSD.

So - make 932 one-GB tables - table_0, table_1, and so on, up to table_931.

I know how many rows fit in one raw block of, say, int8, so making the first table is straightforward with INSERT and a self-join; the later tables can then simply select from the first table.
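As a sketch of that setup (table and column names invented, the row arithmetic ignoring block headers and slices, and assuming raw encoding), the statement generation might look like:

```python
# Sketch: generate the SQL to build 932 one-GB tables, table_0 .. table_931.
# All object names are hypothetical; arithmetic ignores per-block headers,
# slice distribution and compression.

BLOCK_BYTES = 1024 * 1024            # Redshift stores columns in 1 MB blocks
ROWS_PER_BLOCK = BLOCK_BYTES // 8    # raw int8 is 8 bytes per value
ROWS_PER_GB = 1024 * ROWS_PER_BLOCK  # 1024 blocks ~= 1 GB for the column

def build_statements(num_tables=932):
    """Return the SQL statements to create table_0 .. table_(num_tables-1)."""
    stmts = [
        "create table table_0 ( column_1 int8 encode raw );",
        "insert into table_0 ( column_1 ) values ( 1 );",
    ]
    # Double the row count with INSERT..SELECT until we reach ~1 GB of raw int8.
    rows = 1
    while rows < ROWS_PER_GB:
        stmts.append("insert into table_0 select column_1 from table_0;")
        rows *= 2
    # The remaining tables just copy the first.
    for n in range(1, num_tables):
        stmts.append(f"create table table_{n} as select column_1 from table_0;")
    return stmts
```

Note the default (even) distribution is assumed here; for the skew experiment described below, the distribution style of the test tables would need to be controlled explicitly.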

Now create one more table - this should evict the first table from the cache.

Now benchmark a full scan (sum(column_1), or something like it - sum() can fail due to overflow) on the most recent table, and a read on the first table.

We should see a big difference (SSD vs S3).

We can repeat the test on a dc2.large, where of course it’ll all be SSD, and so get an idea of the cost of a cold read from S3 on RA3. It would be interesting to do the same on a ds2.2xlarge, but then I’d need to make 2048 tables - that will probably take hours.

Actually, this leads to another thought - maybe there is a much simpler way.

When an RA3 cluster is rebooted, is the SSD cache flushed? If it is, I only need to make one table (well, plus some extras so I can benchmark cleanly from SSD and from S3, but you get the point).

If reboot doesn’t flush, would making a snapshot and restoring the cluster from snapshot clear the cache?

I think I may well try both before taking the original, slow route.

I just had another thought, too… so, imagine the 1 GB table I benchmark is equally distributed across all RA3 slices - okay, I benchmark it. Now imagine it’s skewed, and the entire table is on a single slice/node. Will I find only one slice/node is pulling down data from S3, and so the read time is much longer?

With an SSD, this wouldn’t matter much - it would mean only one slice was accessing the SSD, but then that entire slice will get all the bandwidth from the SSD, so total read time doesn’t change.

With S3, if there is bandwidth per slice, and you only use one slice, then it matters.

One thing though - this is a great start, measuring read performance, but what I really, really want to know about is write performance. S3 lets you perform reads of arbitrary byte ranges, but it only allows you to write an entire file at once. That’s a fundamental difference to SSD/HD. I really want to know what’s happening with that in practice.

I suspect, though, that writing to the SSD is decoupled from writing to S3. If I write 932 one-GB tables and wait, then all of them will be in S3, and writing another table will not be slowed in any way; the oldest table will simply be evicted immediately, as it’s already in S3.

If I can think of a way to measure write performance, then I’ll have enough (with the read performance) for a white paper.



Auto Vacuum

Had an idea for a method to perform some decent basic initial testing of auto vacuum. It’s worked well. The script is written and tested; run time I think will be about 85 minutes, and I’ll kick it off in the morning. Once the results are in, I’ll do the write-up and publish.

(Sneak peek - auto-vacuum is a black box, which was known, but a surprise is that one of its inputs looks to be a second black box.)


Amazon Redshift Cluster Toolkit AMI

So a while back I created some code which would analyze tables - for example, checking that primary keys, foreign keys and unique constraints were valid (no duplicates), figuring out the smallest possible data type for each column, checking whether a column can be made NOT NULL, count(distinct) of values, min() and max(), and so on - and produce a very nice PDF report.

It’s an excellent way to get to know a database, also to validate existing databases, and to see where some types of improvement (data types) can be made.

Problem is, how to make this code available?

Answer is an AMI.

You spin up the AMI in your Redshift cluster VPC, so you have security - the AMI can’t emit data outside of your VPC.

The AMI runs a web site, which provides a browser-based interface.

The interface guides you through creating a read-only user with the minimum privileges necessary to generate the report - you enter a username/password and the schemas/tables you want analyzed, and it emits the list of SQL commands necessary to create and configure the user - and then you kick off report generation.

Report generation takes as long as the amount of work you give it - so it can be ten seconds or some hours. You can view a list of running reports, and of course you can view the list of generated reports, PDFs, and download them.
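As an illustration only - this is not the actual tool's output, and the username and exact privilege set are assumptions - the emitted SQL for a minimal read-only user might look like what this sketch generates:

```python
# Hypothetical sketch of the SQL a guided interface might emit for a minimal
# read-only reporting user. Usage on each schema, plus select on its tables,
# is the least needed to read data for the report.

def readonly_user_sql(username, password, schemas):
    """Return CREATE USER / GRANT statements for read-only access."""
    stmts = [f"create user {username} password '{password}';"]
    for schema in schemas:
        stmts.append(f"grant usage on schema {schema} to {username};")
        stmts.append(
            f"grant select on all tables in schema {schema} to {username};"
        )
    return stmts
```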

I’ve spent today working on an MVP.

Most of it is done; I now have to debug the code which kicks off report generation (which means firing up a cluster and so on) and check reports are produced correctly.

One more day of dev time will have the system ready, then I need to produce an AMI for the first time, and test that out inside a VPC and so on.

In the longer term, there are actually a number of other cluster-level utilities I’ve written (such as copying and pasting a schema from one cluster to another) and those too can become functions offered by the AMI.


New Data Types for Time

Right now, there are three data types for time: date, timestamp and time.

date is four bytes, and delta compression works really well - you get one or two bytes of signed-integer difference, where one unit is one day. That covers most of a year, and most of one hundred years, respectively.

timestamp is eight bytes and has microsecond resolution, so delta (which I believe, though I don’t think I’ve proved it) has one unit being one microsecond, which is useless. All you can do is zstd.

We could imagine a variant of timestamp which has a resolution of seconds. Then the two delta encodings would be about two minutes and about half a day, which looks like it could be useful.

Finally, we have time, which is nuts - it’s still eight bytes because, although it’s only HH:MM:SS, it still has microsecond resolution, which means delta is again no use.

So we’d also want a variant of time which is seconds resolution only. This would be a four byte data type.

Then in fact what we could do is store a timestamp as two columns, one date and one seconds-resolution time, four bytes each, both with delta. You’d get a year or one hundred years on the date (for one and two bytes per value), and two minutes or half a day on the time. So you’d probably go from eight bytes with zstd (what you have to do now) to three bytes across the two columns: delta for the date and delta32k for the time.
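The back-of-envelope arithmetic behind those ranges can be checked directly - a sketch only, with the encoding names as used in the Redshift docs (delta is the one-byte signed delta, delta32k the two-byte):

```python
# Ranges covered by one-byte (delta) and two-byte (delta32k) signed deltas,
# at one-day units (date) and one-second units (the proposed time variant),
# plus the per-value storage of the two-column scheme.

def signed_max(n_bytes):
    """Largest magnitude a signed integer of this width can hold."""
    return 2 ** (n_bytes * 8 - 1) - 1

# date, one unit = one day
one_byte_days = signed_max(1)            # +/-127 days: most of a year of span
two_byte_years = signed_max(2) / 365.25  # ~89.7: most of one hundred years

# seconds-resolution time, one unit = one second
one_byte_minutes = signed_max(1) / 60    # ~2.1: about two minutes
two_byte_hours = signed_max(2) / 3600    # ~9.1: around half a day

# per-value storage: timestamp (8 bytes, zstd) vs date delta + time delta32k
two_column_bytes = 1 + 2                 # 3 bytes per value
```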


Replacement System Tables

Finally been doing some more development work on the replacement system tables.

Modified lf.views_v1 to show the type of the view (normal, materialized, late-binding).

Modified lf.objects_v1 to use the new view code to distinguish between view types.

As ever when working with the system tables, you find issues.

First, with the improved lf.views_v1 code, I made three test views (one of each kind). If I run the query manually, with a WHERE clause indicating I’m only interested in the schema ‘public’, only one row is returned. If I make the query into a view, query the view, and indicate I want the schema public only, I get all three.

Second, there was for quite a while a problem with the system tables where, in a query which used both the leader node and the worker nodes, you could no longer simply join to a system table if it had leader-node-only data types; even though you were not using those columns, the query planner would still reject the query. I had to modify a lot of code in the replacement system tables, converting joins to system tables (like pg_namespace, which contains an array) into sub-queries which pulled out only the columns I actually wanted.

That problem seems to have gone away.

Will have to return the code to its original form, at some point.

Also, I just had cause to examine the official docs for STL_HASH. They document 23 columns. Then I looked at the doc made by the System Table Explorer. It lists 27 columns, and comparing the two documents also shows that 6 of the columns in the official docs no longer exist (although I’ve not yet spun up a cluster to check this manually).

I’m pretty sure all the system table docs are like this. The thing is the system tables are a catastrophe. I think no one is using them, because they’re impossible to use, so no one notices the docs are staggeringly inaccurate.

Fortunately the one column I care about right now is still there :-)


Text Formatting

Very happy - finally found a way, and a fairly good way, to get right justified text :-)

Given a string column, use a window function to find the maximum length in the column, and then use greatest() to choose the larger of that maximum length and the length of the column title - call the resulting length field_width.

Then use lpad( string, greatest(max(len(string)) over (), length('column_title')) ) as right_justified_string.


I think if you compute the length in a subquery and pass that column name to max(), the query planner might be able to do something efficient to figure out the max value, but I’ve not investigated to see if this is so.
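The same logic, mirrored in plain Python (sample data invented), just to show what the window function is computing:

```python
# Mirror of the SQL right-justification: the field width is the larger of the
# longest value in the column and the length of the column title.

def right_justify(values, title):
    width = max(max(len(v) for v in values), len(title))
    # lpad() in the SQL corresponds to str.rjust() here
    return [v.rjust(width) for v in values]
```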


A fascinating question turned up today on StackOverflow.

A Redshift user found that ILIKE was not working correctly for accented characters, but if he issued UPPER(string) ILIKE 'pattern', it worked (the user in this case passing in an upper case version of the pattern).

What’s going on here is that Redshift is not actually Unicode aware. You can store any bit-pattern you like in a varchar.

However, some functions are Unicode aware, like upper() and lower() - they have to be, you can’t change case on a multi-byte Unicode character without knowing about Unicode.

The core code base, though - and that includes the operators, such as ILIKE (as opposed to functions, which are written separately) - does not understand Unicode.

So it turns out LIKE and ILIKE, being operators, are not Unicode aware, and perform byte-by-byte comparison.

The upper() and lower() functions are aware, though.

So what you have to do is help LIKE/ILIKE out. You use upper() (or lower(), either will be fine) on both the string and pattern, so they have the same case in Unicode, and then LIKE/ILIKE will work.
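The byte-level behaviour can be illustrated in Python - this is a toy model of an ASCII-only case-insensitive comparison, not Redshift's actual implementation:

```python
# Toy model: a byte-wise compare that case-folds only the ASCII range, which
# is effectively what a non-Unicode-aware ILIKE does on an equality pattern.

def ascii_ilike_eq(a, b):
    """Byte-by-byte compare, folding only a-z to A-Z."""
    ab, bb = a.encode("utf-8"), b.encode("utf-8")
    if len(ab) != len(bb):
        return False
    def fold(x):
        return x - 32 if 0x61 <= x <= 0x7A else x  # ASCII a-z -> A-Z only
    return all(fold(x) == fold(y) for x, y in zip(ab, bb))

assert ascii_ilike_eq("cafe", "CAFE")       # pure ASCII folds fine
assert not ascii_ilike_eq("café", "CAFÉ")   # é and É differ at the byte level
# The workaround: apply a Unicode-aware upper() to both sides first,
# reducing the match to an exact byte comparison.
assert ascii_ilike_eq("café".upper(), "CAFÉ".upper())
```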
