Redshift Research Project



System Table Tracker

The STT now has an MVP for diffs.

Each release of Redshift now has a link to the diff with the previous version.

Currently shows tables and views (not functions), and the diff is computed by comparing columns - what columns there are, their names, data types and constraints. So it doesn’t compare, say, view text (which I do want to do).

Also, if a table or view is dropped, there’s no indication of this.

As I say - MVP. I was kicking back for a bit and finally got something going for diffs; I’d been thinking of doing so for a long time.


Weather Report

I recently investigated variability across dc2.large nodes, and found it to be high. As such, I am now ignoring dc2.large variations unless they are outside of the normal (large) range.

  1. ap-northeast-2 dc2.large network performance is slow again, at 7 seconds rather than 4 seconds.

  2. ap-northeast-2, ca-central-1, eu-west-1, eu-west-2, us-west-1 and us-west-2 ra3.xlplus: the unusual slowness of the disk-read-write test two weeks ago has passed, and performance has returned to normal.

  3. eu-central-1 dc2.large disk-read-write has significantly improved, now about 3 seconds instead of 4.5s. If this persists, it’s an upgrade.

  4. us-east-1: all node types except dc2.large have received significant processor performance upgrades, losing about one second each from the processor benchmark. This is big news, given how widely used this region is.


Missile Attacks This Morning in Kyiv

Woken up at about 7am by one of those big rolling booms.

First missile attack in a month, and there was a month or so of peace and quiet before the previous attack, too.

I would have been asleep when the air raid warning went off.

I think Russia saves up its missiles for a month, from current production, and then launches an attack.

varbyte sorting

So, I’ve been focusing on the Comboboulator, until it’s released, but I took an hour or two off recently to have a quick look at how the varbyte data type sorts.

This is not a full investigation, just something done by hand, looking at two data types, which throws some light on this otherwise uninvestigated matter.

You can’t put a varbyte into a sortkey, but the Zone Map is always present, and so Redshift is computing a sorting value for every value in a varbyte column, just as it does for every column.

The method by which sorting values are computed is intuitive and easy to reason about for about half of the data types, and is varying degrees of screwy and/or crazy for the others.

A sorting value is an int8, so it’s eight bytes. I obtain it by making a table (table_1) with a single varbyte(64) column (so plenty of length), issuing the following commands, and then converting minvalue or maxvalue to a bit pattern (they should always be the same, as there is only one value; this is not in fact always the case in Redshift, although it does seem to be the case with varbyte).

dev=# truncate table table_1;
dev=# insert into table_1( column_1 ) values ('\000'::char(1)::varbyte);
dev=# select minvalue, maxvalue from stv_blocklist where tbl = 110205 and col = 0;
     minvalue      |     maxvalue      
 72057594037927968 | 72057594037927968
(1 row)
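Converting the signed int8 minvalue to its 64-bit pattern is a one-liner; here’s a sketch in Python (the function name is mine), masking first so that negative sorting values display as their two’s complement bit pattern:

```python
def to_bit_pattern(sorting_value: int) -> str:
    # Mask to 64 bits so a negative int8 shows as its two's
    # complement bit pattern, then zero-pad to 64 binary digits.
    return format(sorting_value & 0xFFFFFFFFFFFFFFFF, '064b')

print(to_bit_pattern(72057594037927968))
# 0000000100000000000000000000000000000000000000000000000000100000
```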

My initial guess was that the first eight bytes of the varbyte value would be used, and if the value was shorter than eight bytes, the padding would be with binary 0.

Here’s the first set of test data;

      0::int2::varbyte ->  144115188075855872 -> 0000001000000000000000000000000000000000000000000000000000000000
      1::int2::varbyte ->  144115188075856128 -> 0000001000000000000000000000000000000000000000000000000100000000
      2::int2::varbyte ->  144115188075856384 -> 0000001000000000000000000000000000000000000000000000001000000000
    255::int2::varbyte ->  144115188075921152 -> 0000001000000000000000000000000000000000000000001111111100000000
    256::int2::varbyte ->  144115188075855873 -> 0000001000000000000000000000000000000000000000000000000000000001
    257::int2::varbyte ->  144115188075856129 -> 0000001000000000000000000000000000000000000000000000000100000001
  32767::int2::varbyte ->  144115188075921279 -> 0000001000000000000000000000000000000000000000001111111101111111

So, what do we have?

We have the two-byte value, the int2, which - Intel being little-endian - has its most significant byte to the right (i.e. the byte order looks reversed), and then, finally, in the most significant byte of the sorting value, we have the decimal value 2 (00000010).

Let’s gather some more evidence, and see what happens with int4.

      0::int4::varbyte ->  288230376151711744 -> 0000010000000000000000000000000000000000000000000000000000000000
      1::int4::varbyte ->  288230376168488960 -> 0000010000000000000000000000000000000001000000000000000000000000
      2::int4::varbyte ->  288230376185266176 -> 0000010000000000000000000000000000000010000000000000000000000000
    255::int4::varbyte ->  288230380429901824 -> 0000010000000000000000000000000011111111000000000000000000000000
    256::int4::varbyte ->  288230376151777280 -> 0000010000000000000000000000000000000000000000010000000000000000

We see the four bytes of the int4, and now the most significant byte contains the value 4.

So what happens with the first byte when we have an int8, which is going to occupy the whole sorting value?

      0::int8::varbyte ->                   0 -> 0
      1::int8::varbyte ->   72057594037927936 -> 0000000100000000000000000000000000000000000000000000000000000000
      2::int8::varbyte ->  144115188075855872 -> 0000001000000000000000000000000000000000000000000000000000000000
      3::int8::varbyte ->  216172782113783808 -> 0000001100000000000000000000000000000000000000000000000000000000
     15::int8::varbyte -> 1080863910568919040 -> 0000111100000000000000000000000000000000000000000000000000000000
     16::int8::varbyte -> 1152921504606846976 -> 0001000000000000000000000000000000000000000000000000000000000000
    255::int8::varbyte ->  -72057594037927936 -> 1111111100000000000000000000000000000000000000000000000000000000
    256::int8::varbyte ->     281474976710656 -> 0000000000000001000000000000000000000000000000000000000000000000

And the answer is, it goes away.


Let’s take a quick peek at char, before drawing some conclusions.

\000::char(1)::varbyte ->   72057594037927968 -> 0000000100000000000000000000000000000000000000000000000000100000
\001::char(1)::varbyte ->   72057594037927937 -> 0000000100000000000000000000000000000000000000000000000000000001
\002::char(1)::varbyte ->   72057594037927938 -> 0000000100000000000000000000000000000000000000000000000000000010

Mmm. So, I think \000 writes a binary 0, but what I get in the sorting value is binary 32 - a space.

I understand that char terminates with a NULL, C-style, and so I suspect what’s happened here is I’ve fooled Redshift into thinking the string is actually an empty string, and then Redshift has padded with spaces.

So we get one byte of a space, and the most significant byte (which seems to hold the number of bytes in the sorting value, when that number is 7 or less) holds a 1 (00000001).

Moving on to \001, now I really do get a binary 1, no space padding, and the count byte remains 1.

We see that \002 then does as we would expect.

So what happens with longer chars?

\001::char(2)::varbyte ->  144115188075864065 -> 0000001000000000000000000000000000000000000000000010000000000001
\001::char(3)::varbyte ->  216172782115889153 -> 0000001100000000000000000000000000000000001000000010000000000001
\001::char(4)::varbyte ->  288230376690688001 -> 0000010000000000000000000000000000100000001000000010000000000001
\001::char(8)::varbyte -> 2314885530818453505 -> 0010000000100000001000000010000000100000001000000010000000000001

Well now, isn’t this interesting. In all cases we’re writing binary 1. What we see is that the binary 1 is written (it’s the right-most byte in the displayed sorting value), the char is then padded with spaces (00100000) to its length, after its length it is padded with binary 0, and then - if we’re 7 bytes or less - we get a count byte.

That’s quite something.
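Pulling the observations together, a single rule seems to reproduce every bit pattern above. I’ll stress this is my reconstruction from the test data, not documented behaviour: the first byte of the varbyte goes into the least significant byte of the sorting value, each subsequent byte one position higher, the remainder is zero padded, and when the value is 7 bytes or fewer the byte count goes into the most significant byte. A Python sketch (the function name is mine):

```python
def sorting_value(data: bytes) -> int:
    # Reconstructed rule, guessed from the observed bit patterns:
    # first varbyte byte in the least significant byte, each
    # subsequent byte one position higher, zero padding above that,
    # and a byte count in the top byte for lengths of 7 or less.
    assert len(data) <= 8
    value = sum(b << (8 * i) for i, b in enumerate(data))
    if len(data) <= 7:
        value |= len(data) << 56
    if value >= 1 << 63:    # present as a signed int8, as stv_blocklist does
        value -= 1 << 64
    return value

# The char(2) test row: binary 1, then one space of char padding.
print(sorting_value(b'\x01\x20'))   # 144115188075864065
```

This matches the char rows directly. The int rows also match, but only if the int-to-varbyte cast produces big-endian bytes - for example, sorting_value((256).to_bytes(8, 'big')) gives 281474976710656, the 256::int8 row - and that big-endian assumption is mine, not something I’ve verified independently.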

So, where does this leave us?

This is not easy to reason about.

The varbyte type stores bytes, which can come from any source - we could be issuing queries which convert char, or int, or what-have-you, into varbyte, all going into the same column.

Firstly, we see that the data type of the original data matters - not just the value of the original data.

If we put the value 5 into a varbyte, its sorting value will differ depending on whether it is an int4 or an int8.

That’s not good. That’s hard to reason about.

Secondly, for values which are 7 bytes or less, we have this seemingly completely strange byte-count to reason about. What is that about?

Where it is in the most significant byte, the effect of it is to sort short values such that values with the same number of bytes are sorted together - we get all the 1 byte values, which are sorted with respect to themselves, then all the 2 byte values, which are sorted with respect to themselves, and so on.

Why? Why would I want to sort my data first according to its length in bytes and then by its value, rather than only by its value?

Why would a four-byte decimal 10 be sorted as a larger value than a two-byte decimal 10?

I can’t see a reason for this, and I think sorting should be based on and only on value.
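To make the four-byte-versus-two-byte point concrete, here’s a sketch using the patterns in the tables above (the sorting values for decimal 10 are extrapolated from the observed rows, not measured):

```python
# Sorting values for decimal 10, extrapolated from the observed patterns:
# count byte at the top, the value's bytes reversed at the bottom.
ten_as_int2 = (2 << 56) | 0x0A00        # 10::int2::varbyte
ten_as_int4 = (4 << 56) | 0x0A000000    # 10::int4::varbyte

# The count byte dominates the comparison, so every two-byte value,
# even the largest possible, sorts below every four-byte value.
assert ten_as_int2 < ten_as_int4
assert ((2 << 56) | 0xFFFF) < (4 << 56)
```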

Thirdly, the padding behaviour of char shows up in sorting, so we have different sorting behaviour depending on whether we’re looking at bytes padded with a space (up to the end of the char’s length) or with binary 0 (up to the end of the sorting value) - so here again the data type, rather than the value, is driving sorting.

I may be wrong, but I think this is too hard to reason about, and so we must limit our use of varbyte to that which we can reason about.

I completely understand the third issue (though not why it is not documented), because char is defined as being padded with spaces, so that really is its value; but sorting in general should have been arranged so that it depends on, and only on, value. That’s the only way it can be reasoned about.

What this means is that if you actually want to knowingly control the sorting of your varbyte, you need to ensure all the data in the column comes from the same single data type. Anything else is likely rapidly going to become too complicated to reason about correctly.

Sheltering from Missile Attacks

So, I read a lot in the press about people sheltering from or during missile attacks, in their basements, or in shared shelters, in subways, or what-have-you.

I have never done this, not once, ever.

Now it might be the shelters are populated and people are using them, I wouldn’t know; I’ve never been in them.

But when the air raid warnings go off while I’m outside, absolutely nothing happens - everyone just gets on with their business. I do typically go home, because all the public buildings close so I normally can’t do whatever I need to do, but that’s purely practical.

Kyiv is huge, the number of missiles relative to its size is tiny, the area of any single explosion is relatively minute, no one is aiming at me, and I’m not near a target (which would bring real risk, from near-misses).

Of course, when there is a missile attack, people do die and are maimed - Kyiv has three million people in it. If you fire missiles into it, you are going to kill people.

But the odds of it being me are so staggeringly minute that going into shelter is completely needless.

I’m of the view that people who do shelter are those who are risk averse, perhaps who have kids who could be nervous, and who lack the knowledge to accurately judge the risks involved during these attacks.

So today was a typical missile raid.

I slept through the air raid alarm, was woken up by the deep, rolling boom of a missile strike (I’d missed that one), listened for a bit to hear if there were any more (which would mean there might be enough going on that it was worth getting out of bed to have a look), but there weren’t, so I stayed in bed (I was warm, comfy and sleepy) and went back to sleep.

The power hasn’t been interrupted, I have a meeting in an hour or so, and then I’ll be going swimming as usual.

But we must also note that, much as this was uneventful for me, Putin has taken goodness knows how much money from ordinary people in Russia, used it to make missiles, had those missiles fired into Kyiv, and about a dozen completely ordinary people - people who are you and me, people who did the work which made the wealth which Putin took to build missiles - have been killed or maimed.



For any given entity, there is how complex it is, and to what extent the documentation can explain that complexity.

I am coming to the view that IAM is too complex to use, in part because of its intrinsic complexity, and in part because of the confused and confusing documentation.

Whenever I find I must turn to IAM for functionality, I then spend days blundering around in dense fog, getting nowhere. It is the death of productivity.

I am currently trying to figure out how to share an AMI, which is in one region only, with another account, to another region.

It’s been two days.

Last time I was involved with something like this was VPC.

It took me five days of sheer hell to figure out how to start a Redshift cluster in a VPC.

I am now starting to wonder if there are other ways to share an AMI which do not involve IAM, because anything is more productive than using IAM.


New Version Rolled Back (and Diffs!)

So, interesting news.

A few days ago, a new Redshift release came out in four regions, 1.0.47470.

This version has been rolled back, and these four regions are now back on 1.0.47357.

This is very unusual. I think I’ve seen it once before, in over a year.

Note this is fine if you’re on maintenance track - it’s there to insulate against such issues - but remember AWS removed tracks from serverless; you get whatever they give you.

I archive the system tables for each release, and I recently introduced a diff between a release and the previous release (currently for tables and views - not functions yet). The diff is currently based on changes to columns, either adding, removing or changing, and so shows new objects, and changed objects, but not yet deleted objects.
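As a sketch of the idea (the shapes and names here are mine, not the tracker’s actual code, and svv_copy_job is a made-up example object), a column-level diff between two snapshots looks something like:

```python
def diff_snapshots(old, new):
    # old and new map object name -> {column name: data type}.
    # Dropped objects (in old but not in new) are deliberately
    # not reported, matching the current MVP behaviour.
    added = sorted(name for name in new if name not in old)
    changed = sorted(name for name in new
                     if name in old and new[name] != old[name])
    return added, changed

old = {'svv_table_info': {'table_id': 'integer'}}
new = {'svv_table_info': {'table_id': 'integer',
                          'create_time': 'timestamp'},
       'svv_copy_job': {'job_id': 'bigint'}}   # hypothetical new object
print(diff_snapshots(old, new))   # (['svv_copy_job'], ['svv_table_info'])
```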

For the last couple of releases, there have been no system table changes but now, finally, there are!

And from this we can get an idea of what work has gone in since the previous release, which in this particular case may perhaps shed some light on why the rollback occurred.

However, right now, where this is MVP, it’s hard work to use. It needs to show which objects are new, and the number of changed columns, and the system table tracker in general needs to show whether an entity is accessible or not.

There are about one hundred changed objects, which is too many to inspect by hand.

I’ve had a look round, manually, and what I noticed is that the new version, which has been rolled back, brought in what I think is the beta functionality to automatically copy files from S3 into Redshift.

We see this new table;

Which is now gone.

This is a SYS view, and the SQL of SYS views is censored by AWS, so we can’t know what it does; but I note there is no matching table - I suspect this view may be a veneer wrapper over a function. Once I add in diffs over functions, we’ll be able to see.

There are a few interesting new views (which hopefully are accessible);

We can also note details like SVV_TABLE_INFO having a new column (create_time) added;

I think there’s a fair chance the docs will not be updated to include this new column, but we’ll see. I’ll keep an eye out for it.


Site Domain Name Change

It has come to my attention that AWS have a trademark on “Amazon Redshift”, so I’ve moved the site from “Amazon Redshift Research Project” to “Redshift Research Project”.

Weather Report

A little late this fortnight, as I’ve been travelling.

  1. ap-northeast-2 dc2.large networking remains slow, at 7s rather than 4s.

  2. ca-central-1 dc2.large disk-read-write unusually fast, at 3.29s (normally 4s to 5.5s).

  3. eu-south-1 dc2.large disk-read-write unusually fast, at 3.18s (normally 4.5s).

  4. eu-west-1 dc2.large disk read staggeringly slow, at 0.64s (normally 0.06s) - and the standard deviation here is 0.04s, so this was not the result of a freak test result, although it could be the result of a freak cluster.

  5. us-east-1 dc2.large processor benchmark faster than it has ever been, at 2.90s. Processor benchmarks are normally extremely consistent, so changes are meaningful. Previous benchmark was 3.35s, before that 3.81s, and has been at 3.8s for a long time.

There are quite a few new AWS regions; I need to add in those which support Redshift.

Hopefully for the next benchmark run.



Surprise for me just now - I issued an EXPLAIN, and it turned up in the query history page. I’m pretty sure explains did not show up there before.

VACUUM Requirement for Merge Joins

So I’ve just spent half an hour investigating something which has been in the back of my mind for quite a while now.

The official docs do not actually define what is required for a merge join to occur, but one requirement is specified (although now I look for it, I cannot find it - all I can find is an AWS blog post which mentions it), which is that the tables involved must be 80% or more sorted.

This is an interesting statement - what does 80% sorted actually mean?

Remember that a table is distributed over slices, each slice holding a part (ideally equal in number of rows) of the table. Does 80% sorted mean every single slice must be over 80% sorted? Or does it mean all slices, taken together, are 80% or more sorted? Does it mean blocks, or does it mean rows?

No one knows.

So I’ve done a little digging.

It looks like 80% sorted means by rows and for the table as a whole, so for all slices.

In fact, it looks like Redshift is using the unsorted column in SVV_TABLE_INFO, which to me is just mind-blowing. That column counts the total number of unsorted rows and the total number of rows, and uses those two numbers to say how sorted the table is.

I consider this flatly wrong, because the speed of a query is that of the least sorted slice. A table, to my eye, is as sorted as its least sorted slice, and sorting must be considered on the basis of blocks (disk I/O), not rows.

So I made two tables, both with a single int8 column, key distribution, compound sorting on that one column.

I put ten rows in both, vacuum full to 100 percent, and I get a merge join.

If I add enough extra rows (one by one) to a single slice that unsorted in svv_table_info hits 20%, I get a hash join.

If I truncate both tables, populate again to ten rows, and now add an extra row to every slice, so four unsorted rows at a time (2x dc2.large), then when I get to my third insert, and so have 12 unsorted rows, I get to a bit more than 20% unsorted and the hash join returns.

I think Redshift is using the wrong metric to determine whether or not to merge join.
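The difference between the two candidate metrics is easy to see with made-up numbers (the slice distribution below is hypothetical):

```python
# Hypothetical (sorted_rows, unsorted_rows) for each slice of a 4-slice table.
slices = [(10, 0), (10, 0), (10, 0), (10, 12)]

# Table-wide metric - what the unsorted column in SVV_TABLE_INFO appears to use:
table_pct = 100 * sum(u for _, u in slices) / sum(s + u for s, u in slices)

# Per-slice metric - the table is only as sorted as its least sorted slice:
worst_pct = 100 * max(u / (s + u) for s, u in slices)

print(round(table_pct), round(worst_pct))   # 23 55
```

The table-wide figure says the table is about 77% sorted, and so still eligible for a merge join, while one slice is in fact more than half unsorted.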

This leads to another question.

If I have a slice which is, say, 50% sorted, while the other slices are 100% sorted, and I’m getting a merge join, what’s actually going on under the hood?
