Redshift Research Project



Amazon Redshift Weather Report

  1. In af-south-1, ap-northeast-2, ap-northeast-3, ap-southeast-1, ap-southeast-2, ap-southeast-3 (no ra3 nodes in this region), ca-central-1, eu-north-1, eu-south-1, eu-west-1, eu-west-3, me-south-1 and sa-east-1, dc2.large and ra3.xlpus nodes have received the one third improvement in processor performance, first seen in a number fo other regions two weeks ago (other node types are not tested in this region).

  2. In ap-east-1, a region where all node types are tested, has seen a one third improvement in all node type processor benchmarks. Curiously, the dc2.8xlarge read benchmark became slow (0.14s vs the usual 0.06s).

Apart from that, all quiet on the Western Front.


Compilation Information In The System Tables

Compilation information all comes from the system table svl_compile, which originally recorded compilation times, but there has been so much change with regard to compilation over the years, without corresponding changes to that system table, that the meanings of what you see in that system table are now a complete mess.

Back when Redshift first came out, compilation always happened on the leader node, and there was a compiled segment cache on the leader node (note queries which run purely on the leader-node never compile; they are interpreted, rather than compiled, by the Postgres-descended leader node).

Typically compilation time for a single segment on an unloaded dc2.large leader node were a few seconds, and so crazy queries (such as those issued by BI tooling) with dozens or even most of a hundred segments, on a busy leader node, would take several minutes to compile.

The compiled segment cache was kept in memory on the leader node, and so was small, but fast; if a segment in the cache could be re-used, it would be retrieved from the cache in about ten microseconds.

(Whether or not a segment can be re-used is a rather vague matter. It’s not just a case of checking the SQL and seeing if it matches the SQL of an existing segment; segments apparently are constructed rather like function calls - the internal code is fixed, but you can pass in values, and those values are used. So a segment compiled for one query can in principle be used by another query.)

When there was a cache hit, and so an already-compiled segment was re-used, obviously no compilation occurred, and the “compile time” (in quotes, as no compilation actually occurred) in the system table was exceedingly short - tens of microseconds. What this means is that the times in svl_compile are not really compile times, but something more like “how long it took to obtain this segment”.

Compilation occurred serially on a per-query basis; so multiple queries could be compiling concurrently, but each would be compiling one segment after another, and all segments compiled before a query began to execute.

Now, in svl_compile is a column which indicates if compilation occurred, or if the cache served the segment instead; and if it did not, then it must be that the segment cache was used.

Originally all was clear - either the leader node has compiled (and the column was “true”), or the segment came from the leader node segment cache (and then the column was “false”).

Then in time a second segment cache was added, external to the cluster and originally on a per-account basis but then later made per-region (but always also per-Redshift version - a segment compiled on one version of Redshift cannot be used by a different version of Redshift), and now the compilation flag if True meant either cache, where the only way to know which was by looking at how long it took to get the segment; 10 microseconds meant the leader node cache, 300 or so microseconds meant the external cache.

(Already at this point svl_compile should have been improved so it could represent this new behaviour.)

AWS has always (as ever, though) spoken in the most fulsome terms of the efficacy of the segment caches. I recall claims of 95% hit rates, then 99% hit rates, then 100%.

In truth, in my experience, it’s very hit and miss; I would actually also call it erratic - and here you have to remember Redshift appears not to get much in the way of testing. I suspect the caches are actually unreliable, but we have no way of knowing, other than observing from the outside their behaviour.

For example, with one particular client, we ran a test where an ETL system ran the same UPDATE query 100 times, where the SQL did not change, but the values being updated did change.

48 of those queries were compiled, which took a few minutes in total.

One of the chaps I was working with told me he, just once, had the entire set of 100 queries run through in seven seconds - which is to say, although he did not realise what it mean - without any compilation at all.

I usually find that segments I have had compiled one day are often re-compiled a day or two later.

Another client ran an overnight test, with a moderate range of queries, but generally varying by the value used, being generate by BI tooling; they saw about 25% compilation.

My general view on the segment cache is that yes, it’s there, but you can’t predict what you’ll get from it.

Also, of course, being a cache, it’s no use at all for novel queries, and the flexibility of segments to be re-used in different queries is certainly not so great as to be able to address this problem.

As mentioned, the external cache is per-Redshift version. (I’ve not checked if it’s per-AZ, and I’ve not checked if it’s per-node-type.)

As such, when a new Redshift version is released, you’re facing an empty cache. You need to make sure your ETL system can handle the situation where all queries are compiling - and there’s no way to test for this, because there is no way to disable caching. This is a major design flaw, and to my eye symptomatic of a lack of real-world experience in the dev team. They simply do not know to think about this issue, but this is a critical issue for users.

Moving on; a couple more years passed, and what happened next was that some regions had compilation off-load added. This is a system whereby the leader node no longer performs compilation, but farms the work out to helpers, who concurrently compile, and then the leader node serially (but extremely quickly) retrieves those compiled segments.

Compilation off-load is the one major feature I’ve seen added to Redshift in the last however-many years which I thoroughly and completely approve of. The implementation isn’t perfect (when I investigated it, I found ways in which to crash the cluster) but these issues were not something you’d run into in normal use.

In particular, there was one important (but improper) use of Redshift where compilation off-load removed one of the key problems; the use of BI tools.

BI tools have no clue about sorting, and so cannot handle Big Data and are entirely inappropriate for use with Redshift. Everyone and their dog uses them, because neither AWS nor the BI tooling vendors inform users about these issues (indeed, Tableau’s marketing is a constant wall of how scalable it is - in fact, it scales only for number of concurrent users; it does not scale for data volume).

Setting aside scalability for a moment, a second killer problem was of course compile time. BI tools emit crazy and novel queries, and humans do not respond well to multi-minute pauses before their dashboard updates.

Compilation off-load reduced compile times to a consistent 4 to 8 seconds, and so the compilation problem went away. This is important, because, as we will come to later, this improvement has recently been undone.

Now, this new functionality, compilation off-load, did for svl_compile what the Yucatán meteorite did for the dinosaurs.

In regions where there is no compile off-load, the compilation flag if True means compilation occurred, and this is always on the leader-node, and if False, means either the leader node or external segment cache were used - same as before.

In regions where there is compile off-load, the compilation flag now has a different meaning.

First, it is now that the leader node can access only the leader node segment cache; the compilation off-load system is now what accesses the external segment cache.

When the compilation flag is set to False, it means the leader node obtained the segment from the leader node cache.

When the compilation flag is set to True, it means the leader invoked the compilation off-load system to compile the segment. However, the compilation off-load system checks the external segment cache, and can return a segment from there. If the segment is not in the external segment cache, then it really is compiled.

If all segments were found in the leader node segment cache, then the compilation off-load system would not be invoked, and the “compilation” phase would be extremely brief - 10 microseconds per segment.

If however any segment was not in the leader node cache, then the first segment would always be marked as compiled, and also the segment(s) which were not found in the leader node cache would be marked as compiled.

I think the first segment was not really compiled, but rather it was doing the work to invoke the compilation off-load system and measuring the time taken for the compilation off-load to complete to the point the compiled segments are ready to be retrieved, as the duration for the first segment is always be about 4 to 8 seconds, and then all the other segments would “compile” very quickly, where the “compile” now in fact means the leader node retrieving the already compiled segments from the compilation off-load system.

Then another year or two pass, bringing us to about Q3 2022, and two more major changes occur.

First, it used to be that all segments were compiled before query execution, but this changed and now the compilation off-load system can be invoked per stream, and that this invocation occurs while the query is running (which fairly often induces pauses in query execution, where the query must wait for segments to be compiled), and each compilation takes a few seconds - and so where-as before, there was only one compilation per query, there can now be two, three, or even more.

This is absolutely fine for Big Data queries, where the query execution time massively dominates compile time, but it once again meant BI tools were no longer viable for human users, because now there are often multiple 4 to 8 second delays, one for each invocation of the compile off-load system, and this is too slow for interactive use. I think the dev team did not perceive the consequence of their change to compilation off-load.

Second, it used to be that the compiled segment cache could only be interrogated for queries which had compiled. What happens now is that once compilation begins, the properties of the segment are registered with the cache, so that other queries which will be able to use that segment once it is compiled, can find that segment - and when they do so, those queries will pause, waiting for the compilation to complete and the compiled segment to be placed in the compiled segment cache, and then they will retrieve the compiled segment from the cache.

However, and yet another blunder to add onto the… collective nous for blunders, anyone? the catastrophe of blunders? but anyway - yet another blunder, when a query waits for compilation to occur, the compilation flag for that query for the segments which were waited for is set to False - and this when the query has waited usually the full time needed for compilation (as this situation of waiting usually happens when something like a BI tool issues essentially simultaneously a number of very similar queries).

The only way you can tell this is happening is to look at the compile plans (a term I use to indicate the set of segment compilation information for a query) for the immediately preceding queries, examine their query text, and see that when the earlier similar query finishes compiling a particular segment or segments, the later queries, which have very long “compile times” for a non-compiled segment, immediately retrieve their segment.

This catastrophe of blunders is representative of the system tables in general.

I am a Redshift specialist. I’ve been investigating Redshift full time for many years. As such, I have some idea of what’s going on; but there is no other way to obtain this information. I am of the view AWS publish no meaningful information about Redshift, and indeed I aver what they do publish is dangerous, profoundly misleading, and should not be read unless you already know what Redshift is actually doing, so you are safe from being misled.

Given the knowledge I have accumulated, and with some specific investigation into svl_compile, I am in a position to figure out what’s going on, and so to write this post and describe the situation.

If you’re a normal person, who does not spend you life investigating Redshift, what possible chance do you have?

Redshift being sorted is a highly knowledge-intensive database, where AWS publish no meaningful information, and the system tables are catastrophically neglected.

I have the impression many businesses now are migrating from Redshift to Snowflake, and I think this is the primary reason. Other reasons include Support (I say no more - if you’ve interacted with Support you know exactly what I mean and you’re nodding right now), and lack of reliability.


Robust ETL Design and Implementation for Amazon Redshift

Usually ETL systems are provided with only minimal information and so are capable only of minimal actions. This leads to simple ETL systems, which for example check for new data files in some location, load them to a table, then move those files out of the way, and have no other capabilities. Such ETL systems are not robust, as they require arbitrary, unplanned human intervention to fix and to recover from bugs or data errors. The key to robustness in ETL systems is the provision of additional information such that the ETL system can undertake a wider range of actions, reducing the need for human intervention. A simple but profound enabler is the provision to the ETL system of the information of the set of available data files, and the set of data files which have been loaded into the database. This allows the automation of a wide range of behaviour, including the recovery process once human intervention has fixed bugs or data errors, and so provides a robust ETL system.


Weather Report

  1. Asia-Pacific region for and only for dc2.large, is experiencing significant to large slow-downs running the network benchmark.

    1. ap-northeast-2 and ap-southeast-1 both slowed by about 5 seconds, from about 2.5 seconds to about 7.5 seconds.
    2. ap-northeast-1 and ap-northeast-3 slowed by 3 seconds (from 2s to 5s, and 9s to 12s, respectively).
    3. ap-south-1 and ap-southeast-2 both slowed by about 2 seconds (from about 2s to to about 4s).

    In all cases, however, ra3.xlplus is unaffected (larger node types are not tested in these regions).

  2. us-east-1 dc2.large nodes have become slow for disk-read, going from 0.06 seconds to 0.18 seconds.


Queries in the System Tables

Redshift has a leader node, and worker nodes.

Queries can execute solely on the leader node, or on the leader node and the worker nodes (which I’ll call “worker node queries” from now on, for brevity).

Leader node queries have a pid and transaction ID (also known as an xid), but no query ID.

Worker node queries have a pid, a transaction ID and a query ID.

A query which has a query ID went into WLM, because the query ID is an artefact of WLM, and WLM is an artefact of the worker nodes.

(Leader node queries will internally have a query ID, but it’s a different query ID, leader node only, and as far as I can tell it is exposed nowhere in the system tables. The only query IDs we see in the system tables are for worker node queries.)

Queries additionally are either basically running or completed (there are lots of other states - queued, aborted, etc, but I’m not worried about them - they all come down to either “not running yet”, or “completed” by whatever means).

When managing a cluster it is necessary to have a full list of all queries; sooner or later, to fix some problem or other, you need to look at what’s been going on.

Getting this list is in fact, as far as I can tell, problematic : Redshift does not expose to users a way to see the full list of all queries which are or have been executed.

So, first, we have four fundamental classes of query;

  1. running leader node queries (running, pid, xid, no query ID)
  2. completed leader node queries (completed, pid, xid, no query ID)
  3. running worker node queries (running, pid,xid`, query ID)
  4. completed worker node queries (completed, pid, xid, query ID)

Not all information about a query is consistently present in all tables, as we will see; stv_recents for example does not hold a transaction ID, stv_wlm_query_state does not hold a pid.

Note the transaction IDs on the leader node and on worker nodes are in very different number ranges. It doesn’t change anything, far as I can see; you simply use them as you normally would. Transactions involving leader node and worker node tables appears to function correctly, no problems due to crossing the leader/worker boundary, but remember that STV tables-like objects do not participate in MVCC and so do not participate in transactions.

(The STV tables are really table-like interfaces to in-memory state, with no support for MVCC, and so when you read them, you get whatever rows are present a the table is scanned - rows can perfectly well be being added and remove while you read rows. By contrast, MVCC ensures that the rows you see, in all tables, are those which were present at the time your query started; you never see any changes, and so have a consistent view of the data.)

Note that queries which are handled by the leader node query result cache are not run, and so are not present in any system tables.

Finally, note the char columns for query text actually contain UTF-8. If you export the system tables, to archive them, and they contain any non-ASCII characters, you cannot load them back into the DDL of the original tables, because you will end up trying to load UTF-8 into char, and Redshift will not permit this.

(An open question is what happens to a multi-byte UTF-8 character which would be truncated by the 200 byte limit. I suspect part of it is in one row, and the rest in the next, and this in some situations may be problematic, where command-line based SQL clients in my experience fail hard on broken UTF-8; such clients will not be able to read such rows, because their UTF-8 parser will fall over. However, I could be wrong - I will need to conduct this little investigation at some point.)

System Tables for Queries

So, now, let us turn to the system tables which hold information about queries.

  1. stv_recents

    This table holds one row for each of the most recently started 100 queries, of all types - leader node, worker node, running and completed. Note I say “started” and I mean this very particularly; queued queries are not shown. Only those queries which have started, and those which have (by whatever means) completed.

    Historically, this table showed running queries; however, checking it now right now, it appears to show only completed queries. I cannot get it to show a running query of any kind. I can see completed leader node only queries (select 1;) and completed worker node queries.

    I do not know if queries running via CSC are listed.

    The table holds little information about a query. There’s the user ID, query status (running, completed, etc), when the query started (but this timestamp differs by a small fraction of a second to the start time for the query as recorded in other tables holding information about queries), and the pid, but the pid is set to -1 if the query has completed, so you cannot use it once the query has completed. There is no transaction ID.

    To some extent the lack of information is because the table is holding information about leader node queries and worker node queries, which are rather different to each other, but some of it is just seems poor implementation. The transaction ID could have been given, for example, and I can’t see why the pid is set to -1 on completion; the leader node knew the process ID when the query was running, so it does know. Why throw the information away?

  2. stv_wlm_query_state

    This table holds per-query information about worker node queries, in every state, but I think completed queries are rapidly removed (and placed into stl_query and stl_wlm_query, which are the archive of completed queries).

  3. stv_inflight

    This table holds per-query information about running queries only, and I think it holds information only for worker queries.

    This table has a column slice, which makes it look like the table should hold per-query, per-slice information. I have never seen this; there has been one and only one row per query, and the slice is always set to the slice number of the leader node (that number had varied over the years, and these days appears to vary a bit by cluster - it looks to be a number larger than about 12000).

    I do not know if CSC queries are listed.

  4. stv_exec_state

    This table holds per-query, per-slice, per-step information about worker node running queries only. I do not know if CSC queries are listed.

  5. stl_query

    This table holds per-query information about completed worker node queries.

    CSC queries are listed. Aborted queries are listed.

    This was the original table for completed query information. This table contains some columns which you can need when dealing with queries but which are not available in stl_wlm_query (such as label and pid) and so I find stl_wlm_query often needs to to joined to stl_query.

    Note also here the querytxt column a while ago began to be used not only for query texts, but also for status messages. If you parse it, you need to be able to tell the difference between SQL and messages.

  6. stl_wlm_query

    This table holds per-query information about worker node completed queries.

    CSC queries are listed. Aborted queries are listed.

    This is a later table, which more comprehensive WLM information.

  7. stl_ddltext

    This table provides the full query texts of a particular set of SQL commands (relating to table and schema creation and modification, but the docs do not provide a complete list), with one row for every 200 bytes of query text.

    It turns out these commands are leader node only, do not have a query ID, and so this table is how you find out such queries have been run. Where there is one row for every 200 bytes of query text, if you’re building a list of all queries, you need to select only one row from every query.

  8. stl_utilitytext

    This table provides the full query texts of a particular set of SQL commands (which looks like pretty much everything which is not in stl_ddltext and which does not run on a worker node, but the docs do not provide a complete list, but it’s a different list to that for stl_ddltext - obviously begs the question, why split this information over two tables?), with one row for every 200 bytes of query text.

    It turns out these commands are leader node only, do not have a query ID, and so this table is how you find out such queries have been run. Where there is one row for every 200 bytes of query text, if you’re building a list of all queries, you need to select only one row from every query.

Building the Query List

So, we have four classes of query, and we need to find a way to produce each of them.

  1. running leader node queries

    If a query is in stv_recents and not in stv_inflight, and the query status is Running, then it is a running leader node query.

    Currently, I am not seeing any Running queries in stv_recents, and so this set of queries, to my eye, is not available.

    (Note we have to use stv_inflight here (rather than stv_wlm_query_state) because stv_wlm_query_state does not contain pid, and pid is the only column in stv_recents which allows us to join it to other query-information tables.)

  2. completed leader node queries

    Leader node queries which are commands which are stored in stl_ddltext or stl_utilitytext can be found in those two tables. Commands which are not stored in those two tables, such as a plain select, to my eye, are not available in the system tables.

    In principle, stv_recents offers limited visibility of all completed leader node queries, even those not in stl_ddltext or stl_utilitytext. In practise, it seems to be not possible to figure out which queries in stv_recents are both leader node and completed.

    It is easy to check if a query is completed - the pid is -1 and/or status is Completed, but because I no longer have the pid I can see no way to check if the query is present in stl_query or stl_wlm_query, and indeed even the start timestamp cannot be used because it is not consistent across these tables, and I need to perform this check, for it is if query is not in those tables that it is a leader node query.

    So, in short, there is a considerable but not full record of completed leader node queries.

  3. running worker node queries

    This is simple; all queries in stv_wlm_query_state which have state Running. We could also simply take all rows from stv_inflight. In fact, where stv_wlm_query_state lacks some important information, which is present in stv_inflight (such as pid), it’s often necessary to join the two tables.

  4. completed worker node queries

    This is simple; all queries in stl_query, or stl_wlm_query. I believe all worker node queries are in both of these tables, and they often need to be joined, again, due to missing columns stl_wlm_query.


This is a mess.


Explosions in Kyiv this morning, within earshot. Three louder explosions (I don’t have enough experience yet to tell the difference between explosion size and distance, although I begin to think I am starting to know the difference between an in-air explosion, something being shot down, and a ground explosion, when the missile or drone hits the ground to explode), one or two more further away. A bunch of car alarms going off after the first, then after a bit, emergency services sirens. Birds all unsettled and up in the air. Almost nothing visible from my window - I can see an arc about 130 degrees, so it’s not hard for events to be out of sight.

Explosion in Sky

However, one photo - see the little dark puff a little to the left of center? something exploded in the sky, to make that.

It’s interesting how independent the body is, in its reactions. In my mind, and as far as I can tell in myself, I feel fine. I was never much bothered by this, the danger is zero because Kyiv is huge, and I’m also used to it now. However, someone just pulled some furniture around in a nearby apartment - so a sudden, deep sound. My guts jumped, because it was sudden and nearby.


So here’s a cool little thing.

This is the page showing information about normal views (not late-binding, not materialized).

What’s cool is that the AMI figures out how many and which tables are used by each view - fully recursing into the view, so if a view uses other views, it’s the total set of tables for all of them - and it computes the sorted, unsorted and total blocks and rows for each view.

Now you can finally know how big that “big view” actually is :-)

Views Page


Major Ongoing Missile Strikes In Kyiv

  1. Incoming missile and explosion (audio only, AAC)
  2. Incoming missile and explosion (audio only, mp3)

Well, that got busy. About 15 or so large detonations in the last few minutes, often in rapid succession. I’m pretty sure many were explosions in the air - so, missiles or drones being destroyed by defensive fire - but I saw one plume of smoke, big enough I would say to be from a ground hit - although whether it was the intended target, or a shot down drone coming to ground and then exploding, I know not.

That’s the largest number of explosions, in the shortest time, I’ve had experience of, but I have almost no experience, and you only know of what’s close enough to hear or see. I also suspect a lot of incoming drones are being blown up in the air, and that the sound from that travels a lot further. I felt one or two blast waves, but I would say they felt like they were coming downwards, not upwards.

Couple of late-comers going off now, further away.

Hoping and praying (metaphorically - not religious) for the people near those explosions, but it’s pretty certain a couple of people will have caught it.

Home 3D Друк Blog Bring-Up Times Consultancy Cross-Region Benchmarks Email Forums IRC Mailing Lists Reddit Redshift Price Tracker Redshift Version Tracker Redshift Workbench System Table Tracker The Known Universe Twitter White Papers