Redshift Research Project



MIME Considered Harmful

I’ve finally got round to writing my own email client (I’m on Linux and there are none which I’m happy with).

I’ve now fully encountered MIME, and I have to say, I have come to the view that MIME was a mistake. It has too much flexibility and too much complexity, which makes implementation complex - and almost all of that complexity goes unused in ordinary email.

MIME has made something simple - email - into something complex, where that complexity is almost never used (but you have to support it anyway, for the occasional case where it arrives, so your email client doesn’t fall over).


End of the Road for Apache

I’ve just been burned, again, by Apache web-site config.

This has happened too many times.

I’ve ended up serving the wrong sites, on the wrong IP names, again.

The way Apache works, which is to silently match to a default site rather than throw an error, is the core problem.

Silently doing the wrong thing is not a good idea.

The secondary problem is that the docs are a sprawling mess, so there’s nothing you can find to read to save yourself.

I use Apache at the moment because I use CGI, which is quick and easy to use, but not generally supported by modern web-servers.

Time has come to find one of the successors and use that.

Snapshot Isolation

Well, I just had a surprise.

A little while back the RS devs implemented Snapshot Isolation, a second transaction isolation level (alongside the existing serializable isolation).

I assumed it was superior to the existing implementation, because Postgres implemented it in 9.1 (I think it was), which was after 8.0 (which RS came from).

I’ve finally now sat down and looked at what it is and how it works, and at first glance it looks to me like a basket case.

It’s a weakened form of transaction isolation, which allows types of failure (such as write skew) the existing serializable method does not allow.

I need to read further into it, to see what benefits come from this - presumably performance, for whatever performance means here: faster, or perhaps better able to cope correctly with situations which would otherwise cause serialization isolation violations - but at first glance it’s a case of: why on earth would you do this?
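The canonical example of such a failure is write skew. A minimal sketch in plain Python - no database, just two "transactions" each reading from its own snapshot - showing how two committed transactions can break an invariant which any serial execution would preserve:

```python
# A minimal sketch of write skew. The invariant is that at least one doctor
# must remain on call; serial execution preserves it, snapshot isolation
# does not. The scenario and names are the textbook example, not Redshift.

def run_under_snapshot_isolation(db):
    # Both transactions take their snapshot before either commits.
    snapshot_1 = dict(db)
    snapshot_2 = dict(db)
    # Tx 1: "Bob is on call, so Alice can go off call."
    if snapshot_1["bob_on_call"]:
        db["alice_on_call"] = False   # commits - no write-write conflict
    # Tx 2: same reasoning, against its own (now stale) snapshot.
    if snapshot_2["alice_on_call"]:
        db["bob_on_call"] = False     # also commits - disjoint write set
    return db

db = run_under_snapshot_isolation({ "alice_on_call": True, "bob_on_call": True })
# Both transactions committed, neither saw a conflict, invariant broken.
assert db == { "alice_on_call": False, "bob_on_call": False }
```

Under serializable isolation, one of the two transactions would instead be aborted with a serialization failure.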

I’m thinking maybe a lot of the people who are using this with RS have switched to it thinking it’s better and with no idea about what they’ve actually bought into.



I’ve been digging into the system tables to try to find out whether it is possible to tell when a table has been ANALYZE-ed.

So far, it looks like;

  1. If reltuples in pg_class is 0.001 (it’s a float4, that’s the printed value from SELECT) then ANALYZE has never been run (and also anything else which would update that column has never been run).

  2. In pg_statistic_indicator, if staiins or staidels is greater than 0, then the table has seen changes since the most recent ANALYZE (and the larger the numbers in staiins or staidels, which are row counts, the more change).

  3. In an up-to-date table, pg_statistic_indicator.stairows equals pg_class.reltuples and staiins and staidels are both 0.
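The three heuristics above can be sketched as a plain function. The function and its return strings are my own invention; the argument names are the columns from pg_class (reltuples) and pg_statistic_indicator (staiins, staidels, stairows):

```python
# Hypothetical classifier for the three observations; not an official API,
# just the heuristics from the text expressed as code.

def analyze_state(reltuples, staiins, staidels, stairows):
    if reltuples == 0.001:            # the printed float4 value for "never run"
        return "never analyzed"
    if staiins > 0 or staidels > 0:   # row changes since the most recent ANALYZE
        return "stale"
    if stairows == reltuples:
        return "up to date"
    return "unknown"

assert analyze_state(0.001,  0, 0,    0) == "never analyzed"
assert analyze_state(1000., 50, 0, 1050) == "stale"
assert analyze_state(1000.,  0, 0, 1000) == "up to date"
```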



I’m investigating COPY.

In particular, the first question I had is whether or not the amount of memory available in the slot for the query makes a difference to performance.

I’ve just discovered that it looks like an all-or-nothing scenario; if you have enough memory, the sort step runs in memory. If not, on disk.

If the sort phase runs on disk, then the amount of memory allocated to the slot makes zero difference to performance. I have a slot with 77% of system memory (about 6 GB) loading at the same speed as a slot with 1/20th of system memory (1 MB).

However, if the data is small enough to be sorted in the memory of the slot, it will be.

Next questions;

  1. Can you end up in a situation where a COPY runs in memory but ends up swapping a lot and is worse than running from disk?

  2. What’s the performance difference between in memory and on disk? (I suspect very little in absolute terms, because to be in memory the data has to be fairly small, so the sort time even on disk is never going to be very large).



My second question regarding COPY has an answer.

When COPY runs, each slice loads one file at a time (well, in theory - sometimes, as with normal queries, you see some slices doing 2x or even 3x work, and other slices doing no work; but broadly speaking).

The question was; is there a limit of one file at a time per slice regardless of the number of queries running COPY, or can a slice load one file per query running COPY?

And the answer, unexpectedly, at least for dc2.large and dc2.8xlarge, which I have tested on, is that there is a limit of 15 (fifteen) queries per slice.

One minor exception was on dc2.8xlarge, where of the 32 slices (two nodes), I saw 4 slices max out at 16. This might just be some sort of off-by-one behaviour. In any event, it is not a substantial change from 15.

So, on the face of it, the correct config for COPY is to set up a single queue of 15 slots, give that queue 1% of memory, and direct all COPY commands to that queue.

I have not, repeat NOT, tried this on a real, live cluster.
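A sketch of what that config might look like, expressed as the wlm_json_configuration parameter value. The queue layout follows the text; the query group name and parameter group name are my own placeholders, and I have not - repeat NOT - run this against a real cluster either:

```python
# Sketch only: single 15-slot queue with 1% of memory for COPY, everything
# else in a default queue. The key names (query_group, query_concurrency,
# memory_percent_to_use) are those used by Redshift's wlm_json_configuration
# parameter; the "copy" routing group is a placeholder of my own.
import json

wlm = [
    { "query_group"          : [ "copy" ],  # route COPYs here: SET query_group TO 'copy'
      "query_concurrency"    : 15,
      "memory_percent_to_use": 1 },
    { "query_concurrency"    : 5,           # default queue takes everything else
      "memory_percent_to_use": 99 },
]

wlm_json = json.dumps(wlm)

# Applying it (not run here - needs AWS credentials; the parameter group
# name is a placeholder):
#
#   import boto3
#   boto3.client("redshift").modify_cluster_parameter_group(
#       ParameterGroupName = "my-parameter-group",
#       Parameters         = [ { "ParameterName" : "wlm_json_configuration",
#                                "ParameterValue": wlm_json } ] )
```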



It looks to me like another problem with SQA and also the system tables is that when a query is routed to SQA, you have no direct way of knowing via the system tables which queue it was originally intended for.



I’ve just been looking into UNLOAD.

On dc2.large, there is a limit of three concurrently running UNLOADs.

Issue more than that, and the excess are queued under the hood.

Also, UNLOAD for simple queries (plain SELECT with no join) is wholly insensitive to memory.

It looks then like the correct config for simple UNLOADs is a single three-slot queue with 1% of memory.

I have not, repeat NOT, tried this on a real cluster.

I’ve not yet had a look at other node types.


PSA : AQUA Has Become Mandatory

This is bad news.

On or about the 10th September 2022, AWS emailed end-users who were running RA3 clusters and had used AQUA in the past (there was no public announcement) with the following message;

We want to inform you that Amazon Redshift will now automatically determine when to apply performance acceleration techniques that leverage AQUA technology. Thus, the ability to enable and disable AQUA will be removed from the console starting the week of September 12, 2022.

That grandiose first statement made me think AQUA was being made mandatory.

This prompted me to drop what I was doing and spend what turned out to be the following eleven days, all day, every day, investigating while I still could (to investigate, you need to be able to compare with-AQUA to without-AQUA, which means being able to disable AQUA).

In fact, as a bit of time passed, it became clear all AWS had done was… …remove the option from the Redshift console.

From Python using boto3, the AQUA configuration options are unchanged.

However, as of 18th May 2024, I happened purely by chance to look at the create_cluster command for Redshift in boto3 and saw this;

AquaConfigurationStatus (string) – This parameter is retired. It does not set the AQUA configuration status. Amazon Redshift automatically determines whether to use AQUA (Advanced Query Accelerator).

To my knowledge there has been no announcement of this change. It looks like AQUA really is now mandatory - or rather, it’s up to Redshift whether it is in use or not.

On my clusters, where at bring-up I do set this parameter to False, I see the following;

  "AquaStatus": "disabled",
  "AquaConfigurationStatus": "auto"

I’ve also seen this on a client’s cluster. It looks to say that Redshift can enable or disable AQUA as it sees fit, and that it is currently disabled.
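For what it’s worth, a sketch of reading those two values from Python. To my knowledge boto3’s describe_clusters returns an AquaConfiguration dict per cluster; the call is mocked here with the values quoted above:

```python
# Sketch: extract AQUA state from a describe_clusters-shaped cluster dict.
# The response shape is my understanding of the boto3 Redshift client;
# the cluster dict here is a mock, not a real API call.

def aqua_state(cluster):
    # On a live system:
    #   cluster = boto3.client("redshift").describe_clusters()["Clusters"][0]
    aqua = cluster["AquaConfiguration"]
    return aqua["AquaStatus"], aqua["AquaConfigurationStatus"]

mock_cluster = { "AquaConfiguration": { "AquaStatus"             : "disabled",
                                        "AquaConfigurationStatus": "auto" } }

status, config = aqua_state(mock_cluster)
assert (status, config) == ("disabled", "auto")
```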

This is a fundamental problem. To put it simply : it is no longer completely safe to use LIKE or SIMILAR TO on Redshift.

The problem is that AQUA makes a copy of the tables used by a query which runs on AQUA, so the first run of such a query is very slow (tables being copied), with the excess time gradually recovered over the course of further runs of that query, or of queries using the same tables.

In other words, if you run queries on AQUA just once or twice or a few times, AQUA can perfectly well hammer your system, and by a lot, not speed it up.
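The arithmetic behind this is worth making explicit. As a toy model - every number here is invented purely for illustration - if the first AQUA run pays a copy cost C, and each run then saves s versus non-AQUA, AQUA is only a net win once n*s exceeds C:

```python
# Toy amortization model; all numbers are made up for illustration only.

def aqua_net_benefit(n_runs, copy_cost, saving_per_run):
    # First run pays copy_cost; every run saves saving_per_run versus
    # running the same query without AQUA.
    return n_runs * saving_per_run - copy_cost

# Run a query once or twice: the table copy dominates and you are worse off.
assert aqua_net_benefit(1, copy_cost=60.0, saving_per_run=2.0) < 0

# Run it many times: the copy cost amortizes away and AQUA wins.
assert aqua_net_benefit(50, copy_cost=60.0, saving_per_run=2.0) > 0
```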

Note that I’ve no information on how long AQUA keeps its copy of a table on Redshift, or what happens when that table changes.

So we now have, in the heart of your system, an undocumented black box, controlled by a third party, which can sabotage your system performance; it can be enabled or disabled (or rapidly toggled, who knows?) at any time, for reasons of which you know nothing, and you have no control over whether it is enabled or disabled - you cannot disable it if it is a problem.

The devs are lunatics - actually, I think they’re academics, and wholly lack real-world experience.

There was an incident a few years back with the Redshift version string. It had been like this since day one, so ten years plus;

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.28422

Then one fine day back in late 2021, without warning, the devs changed it, to this;

dev=# select version();
 Redshift 1.0.32574 on Amazon Linux, compiled by gcc-7.3.0
(1 row)

It’s much nicer (though it turned out the original version string had been lying about the GCC version for about a year, and I imagine about the OS for much longer).

The only problem, as you can imagine, is that the original version string is unstructured, and software around the world uses a thousand different ways to pull the version out of it - so changing the format broke a lot of systems. It broke SQLAlchemy. It broke AWS’s own JDBC driver (at which point it instantly became clear that the RS test suite, which has long been considered weak and questionable, could not include even a single connection to RS over AWS’s own JDBC driver).
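As an illustration (these are my own regexes, not those of any of the systems that broke): a parser which assumes the old leading "PostgreSQL <version>" finds nothing in the new string, while one keyed on the "Redshift x.y.z" substring, which is present in both formats, survives the change:

```python
# Illustrative only: two ways of parsing the Redshift version string,
# using the two real formats quoted above.
import re

old = ("PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) "
       "3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.28422")
new = "Redshift 1.0.32574 on Amazon Linux, compiled by gcc-7.3.0"

# Fragile: assumes the string starts "PostgreSQL <version>".
fragile = re.compile(r"^PostgreSQL (\d+\.\d+\.\d+)")
assert fragile.match(old)
assert not fragile.match(new)        # this is the breakage

# More robust: look for the Redshift version, wherever it appears.
robust = re.compile(r"Redshift (\d+\.\d+\.\d+)")
assert robust.search(old).group(1) == "1.0.28422"
assert robust.search(new).group(1) == "1.0.32574"
```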

Now you and I would know without a microsecond of thought this would happen.

I heard (directly - not via the usual channels) from someone on the Support team who said “we were going to introduce an option to use the old format of version string in the next release”.

These are the devs responsible for the black box which is AQUA sitting in the heart of your business critical Redshift cluster.

Until this point, you could turn off everything the devs had done which was a fundamental problem - this is the first time that is no longer so. I knew this had to come sooner or later; the most recent close call was with auto-MV, but you can disable that in the cluster configuration. With AQUA, you could only work around the risk by not using LIKE or SIMILAR TO any more - and that’s central SQL functionality.


I’m afraid I must retract one of the earlier claims about UNLOAD.

My deepest apologies.

The earlier post about UNLOAD was correct with regard to memory use, but incorrect with regard to an internal under-the-hood three query limit.

The reason for this is that I’m actually currently working on a cluster cost reduction contract.

I originally threw some test code together as a means to an end (I’m currently looking at WLM config) rather than an end in itself; the problem with that is that when code lacks structure you’re more likely to make mistakes and also to miss them. The code gradually became more complex and got to a point where I had to sort it out, because it was no longer structured enough to be safe to rely on for correctness.

This was after I emailed the list.

I then found a bug in the test code for parallelism, such that queries were going to the wrong queue.

Having fixed this and re-structured the code, I’ve found UNLOAD has parallelism for at least 20 queries (which is the max I’ve tried for now - probably there is no limit).

I’m about to release a PDF for UNLOAD, with the test code and proper write up.

My apologies again.


Strange Goings-On In The Land of Version Numbers

So, late reporting on this, I’ve been very busy with a client.

Strange goings-on in the land of Redshift version numbers - so much so that there is a possibility here this is actually a bug in my code, so keep a bit of caution in mind.

What’s happened is that on 2024-05-16 (probably earlier - I’ve not been checking as often as usual, busy), Redshift went back to a previously released version number (back to 1.0.66954, which came out on 2024-04-23).

This has never happened before. When RS does a rollback, you get the old version but the version number increases.

Emergency? blunder? change in policy? no clue.

In other news, on 2024-05-26, the maintenance track moved forward, in a few regions, to 1.0.67699.

Again, this turns out to be a version which has never been in current. Given that RS, when rolling back, takes an old version but gives it a new version number, I would guess these are older releases which have been given new version numbers - but this means we have no clue from the version number how far back the maintenance track actually is, or indeed, simply, what version it actually is. Which is bloody stupid. It also means there’s no compile cache “warming” on the maintenance track.

However, we can make a guess as to the original version by looking at the system tables.

We see 1.0.67699 has 1255/756/2978 tables/views/functions, a combination first seen 2024-02-02 with version 1.0.62614, and last seen 2024-04-27 with version 1.0.66960.



Every system table is broken.

When you come to actually need to use them, you become a dwarf, because now you dig like crazy to find truth at the bottom of it all.


Here’s the docs page;

Here’s my dump of the table (view, really) in the latest version of RS at the time of writing;

So, first things first.

What basically (exceptions will be discussed) happens when there’s a commit is;

  1. the transaction goes into the queue for commit
  2. once out of the queue the commit begins
  3. the commit begins with a bunch of work on each worker node - any number of commits can be doing this concurrently
  4. then a final stage on the leader node - and only a single commit can be doing this at any one time
  5. and we’re done

So to be clear - there is a stage in the commit process where one and only one commit can proceed at a time, which is the leader node completing the commit.

(I’ve not checked, but it may be the column queuelen is measuring the queue into this bottleneck stage - it should be possible to (at least roughly) check this by computing the cumulative number of queries where the workers have finished but the leader node has yet to start.)

If you want to measure the performance of the single-commit stage of the commit queue (which I expect - but I’ve not actually checked - will be the bottleneck), you’re looking at the time between startwork and endtime for when node is -1, and that’s it (well, after you’ve handled all the issues in the table so you have a sanitized version - see below).

If you feed transactions into the commit queue faster than the leader node can commit them, you get a huge backlog forming on the leader node.

This table basically (we’ll get to the exceptions) for each transaction has one row for the leader node and one row for each worker node.

The leader node row has node set to -1, and this is the only row which sets startqueue, and this row does not set endflush, endstage, or endlocal (as these timestamps are for work performed only on worker nodes).

Note also that each node sets times using its local clock, and different clocks are never exactly synchronized. As such, you can see seemingly out-of-order events.

Now for the fun stuff.

  1. A timestamp of 2000-01-01 00:00:00 means NULL.
  2. There are a number of transactions for xid 0. Presumably something to do with Redshift itself; I don’t know what these are, so I exclude them.
  3. There are some transactions where startqueue is 2000-01-01 00:00:00 - I’m not sure what this means (there was no queuing?), so I ignore them.
  4. VACUUM queries totally spam this table with garbage. You get a ton of commits for the same transaction - yes, really. It’s incredible. Go to STL_VACUUM, get distinct xid from there, and ignore them all.
  5. This leaves you with a good chunk of transactions which have two identical leader node rows (node set to -1, every other column with the same value). This looks like a bug to me - unintended duplicate rows - so I use a CTE to produce a sanitized version of STL_COMMIT_STATS by de-duplicating node -1 rows.

Having done all that, I cannot now see further errors (and I mean exactly what I write :-), and that’s enough for tonight.

Here’s the current code to produce some stats on that single-commit bottleneck;

  -- MG2 : de-dupe what look like duplicate leader-node rows

  with valid_stl_commit_stats as
  (
    select
      xid, node, startqueue, startwork, endflush, endstage, endlocal, startglobal, endtime, queuelen, permblocks, newblocks, dirtyblocks, tombstonedblocks, tossedblocks, headers, numxids, oldestxid, extwritelatency, metadatawritten, batched_by
    from
      stl_commit_stats
    where
      node != -1

    union all

    select
      xid, node, startqueue, startwork, endflush, endstage, endlocal, startglobal, endtime, queuelen, permblocks, newblocks, dirtyblocks, tombstonedblocks, tossedblocks, headers, numxids, oldestxid, extwritelatency, metadatawritten, batched_by
    from
      stl_commit_stats
    where
      node = -1
    group by
      xid, node, startqueue, startwork, endflush, endstage, endlocal, startglobal, endtime, queuelen, permblocks, newblocks, dirtyblocks, tombstonedblocks, tossedblocks, headers, numxids, oldestxid, extwritelatency, metadatawritten, batched_by
  ),

  -- MG2 : get rid of all questionable rows (VACUUM, odd data values, the unexplained xid 0, etc)

  valid_xids as
  (
    select
      xid,
      extract( epoch from (endtime - startwork) ) as duration
    from
      valid_stl_commit_stats
    where
          node       = -1
      and xid       != 0
      and startwork != '2000-01-01 00:00:00'::timestamp
      and xid not in ( select distinct xid from stl_vacuum )
  )

  select
    min( duration )        as min_duration,
    max( duration )        as max_duration,
    avg( duration )        as avg_duration,
    stddev_pop( duration ) as dur_sd,
    count(*)               as count,
    sum( sum_dirtyblocks ) as sum_dirtyblocks,
    sum( sum_headers     ) as sum_headers
  from
    (
      select
        valid_xids.xid,
        min( duration )    as duration,
        sum( dirtyblocks ) as sum_dirtyblocks,
        sum( headers     ) as sum_headers
      from
        valid_stl_commit_stats
        join valid_xids on valid_xids.xid = valid_stl_commit_stats.xid
      group by
        valid_xids.xid
    ) as per_xid;
This should be easy. I’ve spent hours discovering and then working around the issues - and I’ve spent ten-plus years getting used to the system tables and Redshift, to be able to do so this quickly. I could be wrong, but I think a lot of people using this table are going wrong, and everyone using it is having to repeat, with varying measures of success, the same work to get something useful from it. Given how much AWS charges for Redshift, this is not good enough.
