Redshift Research Project



AUTO Sorting

I’ve just rediscovered something I bumped into quite a while ago.

It is not possible for me to write a view to tell if a table has AUTO sorting. This is because the system table which contains this information is not accessible to the user (admin or otherwise).

The only way to find out is to use SVV_TABLE_INFO, as this view is owned by Amazon, and has the privs necessary to access this information.
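By way of example, something like this (a sketch; I believe the sortkey1 column in SVV_TABLE_INFO reports AUTO(SORTKEY) for tables with automatic sorting);

select "table", sortkey1
from svv_table_info
where sortkey1 like 'AUTO%';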

I’ve said it before and I’ll say it again : the system tables are an unbelievable mess.

Note though that AUTO is the default - if you do not specify sorting, that’s what you get - and so if you actually want an unsorted table, you now need, after making the table, to issue an ALTER TABLE command to deactivate AUTO.
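Something like this, I believe (table name hypothetical; alter sortkey none being the documented method to deactivate automatic sorting);

create table example_table ( column_1 int );  -- sort style defaults to AUTO(SORTKEY)
alter table example_table alter sortkey none; -- now actually unsorted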

Somebody hand me a weeping emoji, please.


Current Timestamps

There is at times a need in a query to obtain the current timestamp.

In principle, a current timestamp can be the time the current transaction started, the time the current query started, or the time the function producing the timestamp was called (so multiple calls in one row will produce multiple differing timestamps).

Redshift however has only three documented aliases or functions to return a current timestamp.

name         type      output     occurrence   docs
getdate()    function  timestamp  transaction  here
sysdate      alias     timestamp  transaction  here
timeofday()  function  varchar    call         here

So, what have we actually got here?

First, getdate() returns no fractional seconds. It’s a deformed timestamp. No use to anyone, for anything. Why does it even exist? I could just use date_trunc().

dev=# select getdate();
 2022-04-17 19:33:04
(1 row)
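If you did want that truncated output, you could, I believe, produce it anyway with date_trunc();

select date_trunc( 'second', sysdate );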

Second, sysdate. This gives us the timestamp for the beginning of the current transaction. Good!

dev=# begin;
dev=# select sysdate;
 2022-04-17 15:22:11.981487
(1 row)

dev=# select sysdate;
 2022-04-17 15:22:11.981487
(1 row)

Third, timeofday(). This, rather baroquely, gives us a varchar human-readable timestamp for the time the function was called. Weird, but also the only alias or function which provides a timestamp of any kind for the time of the function call. You can then go right ahead and convert it back to the timestamp it originally was by casting to timestamp, but… like… the engineer in me is crying on the inside, you know?
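So recovering a usable timestamp ends up looking something like this (a sketch; I would expect the cast to parse the string, but verify it on your cluster);

select timeofday()::timestamp from stl_bcast limit 1;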

So this leaves us with fundamental functionality missing.

What we need, for the basic full set of functionality, is this;

  1. timestamp for start of current transaction
  2. timestamp for start of current query
  3. timestamp at time of the function call

What we have is this;

  1. timestamp for start of current transaction
  2. nothing
  3. a human-readable varchar of the timestamp at time of the function call

This is really basic missing functionality. Ga-ga baby-talk “do you have a real database” stuff. Redshift keeps adding large blocks of new external functionality - Concurrency Scaling Clusters, Spectrum, AQUA, you name it - but this is an example of the bread and butter basics, the stuff everyone uses, not properly implemented and not being fixed.

Now let’s take a look at the docs, and see what problems they have.

  1. getdate().

    The following example uses the GETDATE function to return the full timestamp for the current date.

    Except it doesn’t provide a full timestamp, as the timestamp has no fractional seconds.

  2. sysdate.

    There’s only the one error here; the page talks of sysdate as a function, when it’s an alias.

  3. timeofday()

    This is where all the cogs whirr and fling themselves out of the mechanism in a desperate bid for freedom.

    Here’s the one-liner of documentation;

    TIMEOFDAY is a special alias used to return the weekday, date, and time as a string value. It returns the time of day string for the current statement, even when it is within a transaction block.

    It’s not an alias, it’s a function. More importantly, this function does not return the current time for the current statement. It returns the current time at the time the function is called; so if you called it say twice in one row, you’d have two different times.

    We can see this with the following demonstration;

    dev=# select timeofday(), timeofday() from stl_bcast limit 5;
                  timeofday              |              timeofday              
     Sun Apr 17 15:06:25.590826 2022 UTC | Sun Apr 17 15:06:25.590833 2022 UTC
     Sun Apr 17 15:06:25.590837 2022 UTC | Sun Apr 17 15:06:25.590839 2022 UTC
     Sun Apr 17 15:06:25.590842 2022 UTC | Sun Apr 17 15:06:25.590844 2022 UTC
     Sun Apr 17 15:06:25.590846 2022 UTC | Sun Apr 17 15:06:25.590848 2022 UTC
     Sun Apr 17 15:06:25.590850 2022 UTC | Sun Apr 17 15:06:25.590852 2022 UTC
    (5 rows)

Finally, we can now take a detour into the world of undocumented current timestamp aliases and functions.

Here though, we must now remind ourselves to be careful of the difference between the leader node and the worker nodes; very likely everything from Postgres will run on the leader node, but it is an entirely different matter as to whether any given functionality is available on the worker nodes.

Everything that follows has been tested on worker nodes, by dint of having a Redshift table in the from clause.

name               type   output       occurrence
current_time       alias  timetz       transaction
current_timestamp  alias  timestamptz  transaction
localtimestamp     alias  timestamp    transaction

So, basically, a weird little bunch.

All issue timestamps for the transaction, and so are the same as sysdate, but the output varies.

Obviously current_time returns only the time part, but both it and current_timestamp return their output with a time zone, not as a plain time or timestamp; localtimestamp, however, returns a timestamp, rather than a timestamp with time zone in our local time zone. Inconsistent behaviour, to my eye.
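You can see all three side by side on the worker nodes (the from clause forcing worker node execution);

select current_time, current_timestamp, localtimestamp from stl_bcast limit 1;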

On a Redshift system, all these will normally be UTC, and you can convert to a timestamp easily enough by using at time zone 'utc' (you cannot mix timestamp with time zone and timestamp).

Still, if you’re porting code to or from Postgres, which lacks sysdate (it comes from Oracle), current_timestamp at time zone 'utc' is useful, as it works in the same way on both systems.
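So the portable form looks something like this (a sketch; drop the from clause on Postgres, where there is no stl_bcast);

select current_timestamp at time zone 'utc' from stl_bcast limit 1;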

It’s hard to know if these aliases are actually supported or not, as the lack of documentation may just be an oversight (there is at least one other docs page I know of which simply does not exist - that for the SQL command ALTER FUNCTION).

As an aside, there’s a docs page here, which lists “Deprecated leader node-only functions”. In one sense, the page is correct - all of these aliases and functions do run on the leader node, they’re part of Postgres - but in another it’s not, because as we’ve seen, the devs somewhere along the line have actually and deliberately spent the time to implement three of these aliases on the worker nodes. Aliases and functions don’t turn up on the worker nodes unless the RS devs put them there. So… is this page out of date? or are we implicitly being told not to use these aliases even though they are on the worker nodes? who knows. The Redshift docs should come with a crystal ball.

Note that Redshift doesn’t actually understand timezones, and so just has a single “current time”, which AWS keep set to UTC. You can nowadays set a per-session offset to that time, and the offset for that timezone will be applied to the output of the time functions, but the cluster still only knows about a single time, not about timezones, and it still isn’t using your time; you’re just getting the offset you specified for your session.
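The per-session offset is set like so (a sketch; I believe set timezone accepts the usual zone names);

set timezone to 'Europe/London';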

Not all of the Postgres aliases and functions made it to the worker nodes, though. Note in particular the alias localtime does not work, which is curious, given that localtimestamp does. Probably an oversight.



Intervals and extract

So, to begin with; where Redshift has a leader node which is derived from Postgres, and worker nodes which are not derived from Postgres, any given functionality (such as a function or data type) is implemented either only on the leader node (often the case), only on the worker nodes (very rarely the case), or is implemented on both (often the case).

When functionality is implemented on both, behaviour and output can be different on the leader node and the worker nodes. The docs say absolutely nothing about this, and so it’s a booby-trap just waiting to catch the unwary.

We can see a simple (and in this case, harmless) example of this with the interval data type.

First, on the leader node;

dev=# select '2020-01-01 10:00:00.123456'::timestamp - '2020-01-01 09:00:00.000000'::timestamp;
 01:00:00.123456
(1 row)

Second, on a worker node;

dev=# select '2020-01-01 10:00:00.123456'::timestamp - '2020-01-01 09:00:00.000000'::timestamp from stl_bcast limit 1;
 3600123456
(1 row)

So we see here the leader node renders an interval to text as HH:MM:SS.ssssss, whereas a worker node prints the total number of microseconds (I suspect internally, they’re both a count of microseconds).

A query runs on the worker nodes if it uses any resource on the worker nodes, such as a normal Redshift table or a worker node specific function. So here, the first query, which uses no worker node resources, runs on the leader node. The second query, which has a from clause specifying a table on the worker nodes, executes on the worker nodes.

Now, the data type here is an interval, not an integer, and it cannot simply be cast to an integer; both the leader node and worker nodes object to this. The problem is that the database doesn’t know what integer you want - the number of days in the interval? number of minutes? number of microseconds?

You need to be able to convert interval to integer, though, because although some functions, such as avg(), work directly with interval, others, such as stddev_pop(), do not; and also because sometimes you want to divide a quantity by an interval, to figure out number-of-whatevers per minute or second or whatever, and integers cannot be divided by an interval. You need to convert the interval to an integer, so that the division gives you 1 unit of whatever per however many microseconds, and you then convert the microseconds integer into something more human-friendly, like seconds or minutes.

To convert an interval to an integer, you need to use extract.

Now, the very first thing to notice is that the documentation is wrong, because it indicates extract does not support interval. The SQL specification requires that interval be supported, and if interval were not supported, there would be no way to convert an interval to an integer, and so fundamental SQL functionality would be missing; so I think it’s a documentation error, and it’s not like it’s going to be lonely in there, you know?

The extract function takes two arguments, separated by the keyword from, like so;

extract( unit-of-time from time );

The second argument is a time, timestamp, date or interval - so, one of the time/date data types - and the first argument is the unit of time to extract, such as the number of seconds, or days, or years, and so on.

Now, where I write “the unit of time to extract”, what I mean is that we pick a unit of time - say, day, or hour, or millisecond - and extract gives us not the value of the entire time in that unit, but rather, the number of that unit specified in the time.

So if we have say five days, ten hours and six minutes, and we ask for hours, we’ll get 10.

Here’s an example, where we have an interval of 1 day, 3 hours, 15 minutes, 0 seconds, 123456 microseconds.

dev=# select extract( hour from '2020-01-02 12:15:00.123456'::timestamp - '2020-01-01 09:00:00.000000'::timestamp );
 3
(1 row)

dev=# select extract( minute from '2020-01-02 12:15:00.123456'::timestamp - '2020-01-01 09:00:00.000000'::timestamp );
 15
(1 row)

dev=# select extract( microsecond from '2020-01-02 12:15:00.123456'::timestamp - '2020-01-01 09:00:00.000000'::timestamp );
 123456
(1 row)

So, on the face of it, this is pretty useless for converting a time of any kind into an integer, because what we want is the entire value of the time.

However, there is a unit of time to extract called epoch.

The docs are great (as in completely not great). What does epoch give you? who knows. No clue. Nothing. Nada. And it’s only the most important unit extract offers; I think this is an example of a general and widespread problem with the docs - the author is not technical, and no technical staff ever read the docs, so oversights like this (or the number of flat-out blatant factual errors I’ve found over the years) are never noticed.

What epoch gives you is the total time in seconds, including the fractional part of seconds, like so;

dev=# select extract( epoch from '2020-01-02 12:15:00.123456'::timestamp - '2020-01-01 09:00:00.000000'::timestamp );
 98100.123456
(1 row)

So, bingo, right? problem solved! we now have our interval as a float, and we can do maths.

Well… no. Why do you think I’m writing this article? :-)

All these queries, you’ll note, have been on the leader node.

The leader node was originally Postgres, and in my experience, it does the right thing; but it’s totally irrelevant because there’s very little work you ever do which is leader node only (some system table work, maybe).

We have to look at what happens on the worker nodes.

Let’s try epoch on a worker node.

dev=# select extract( epoch from '2020-01-02 12:15:00.123456'::timestamp - '2020-01-01 09:00:00.000000'::timestamp ) from stl_bcast limit 1;
 98100.123456
(1 row)

Well, seems reasonable, right? it’s the same as before.

Sorry. Still wrong. I have a card up my sleeve ;-)

You see, what you’ve seen there is a worker node operating on an interval computed from two literal timestamps; that works.

But let’s now make a table and put our two example timestamps in there.

dev=# create table table_1 ( column_1 timestamp, column_2 timestamp );
dev=# insert into table_1( column_1, column_2 ) values ( '2020-01-02 12:15:00.123456'::timestamp, '2020-01-01 09:00:00.000000'::timestamp );
dev=# select * from table_1;
          column_1          |      column_2       
 2020-01-02 12:15:00.123456 | 2020-01-01 09:00:00
(1 row)

And now epoch again…

dev=# select extract( epoch from column_1 - column_2 ) from table_1;
 98100
(1 row)

Oh dear. Not great. It turns out - as we’ll see in more detail - that extract on worker nodes does not work correctly with an interval computed from two timestamps taken from a row from a table.

But there’s more - it also turns out that extract on worker nodes does work correctly when you’re using interval literals.

It’s only computed intervals which go wrong.

dev=# select extract(millisecond from interval '1 day 500 millisecond') from stl_bcast limit 1;
 500
(1 row)

dev=# select extract(epoch from interval '1 day 500 millisecond') from stl_bcast limit 1;
 86400.5
(1 row)

Finally, included for completeness, we also have extract on single timestamps, both literals and from tables.

With this, everything works except epoch, which is missing the fractional part of the seconds.

First, here’s from a literal;

dev=# select extract( microsecond from '2020-01-02 12:15:00.123456'::timestamp );
 123456
(1 row)

dev=# select extract( year from '2020-01-02 12:15:00.123456'::timestamp );
 2020
(1 row)

dev=# select extract( epoch from '2020-01-02 12:15:00.123456'::timestamp );
 1577967300
(1 row)

Second, the same value but from our test table;

dev=# select extract( microsecond from column_1 ) from table_1;
 123456
(1 row)

dev=# select extract( year from column_1 ) from table_1;
 2020
(1 row)

dev=# select extract( epoch from column_1 ) from table_1;
 1577967300
(1 row)

So we have five different sets of behaviour;

  1. The leader node
  2. Worker nodes with intervals computed from literal timestamps
  3. Worker nodes with intervals computed from timestamps (from rows)
  4. Worker nodes with literal intervals
  5. Worker nodes with single timestamps from literals or from rows

Of these five, #1 and #2 work correctly, but you basically never use them, or they exist within their own bubble (leader node queries on the system tables).

The important use case, #3, is broken.

Just to make life a bit more complicated, #4, which at times you will use in conjunction with #3, works correctly; so when the two are used together, the output from #3 and #4 has to be normalized, so they’re in the same terms (either both microseconds, or both seconds with the fractional part), because if you use microsecond with #3 you get the full time, but if you use microsecond with #4, you get only the microseconds part of the time.

Finally, with #5, epoch is broken.

(As an aside, there is one other function which is very similar to extract(), this being date_part(). On worker nodes this function does not support interval - it will throw an error - but it does support timestamp, in this case both for literals and from rows, and when used with epoch it works correctly and provides fractional seconds; so you can use date_part() to work around this bug with epoch in extract(). Horrific mess, isn’t it?)
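A sketch of that workaround, reusing table_1 from above (note date_part() takes the unit as a bare keyword, not a string, and here should provide fractional seconds);

select date_part( epoch, column_1 ) from table_1;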

Let’s look at #3 and #4 in more detail, to see what’s going wrong.

To begin with, looking at #3, let’s try minute, and see what we get.

dev=# select extract( microsecond from column_1 - column_2 ) from table_1;
 98100123456
(1 row)

dev=# select extract( minute from column_1 - column_2 ) from table_1;
 1635
(1 row)

dev=# select extract( epoch from column_1 - column_2 ) from table_1;
 98100
(1 row)

What’s happening is that rather than getting the number of the specified unit-of-time, we’re getting the entire period of the interval in the specified unit-of-time, but never with any fractional part.

So rather than getting 15 for minute, we get 1635.

The epoch unit of time is giving us seconds, which is correct, but with no fractional part, which is not, and we have no way of getting the fractional part.

So, generally speaking, in case #3, where extract is broken and where we would normally use epoch to get seconds and fractional seconds, we must now use microsecond, as this is the only way to get the whole time; that means converting the value to what we expect by dividing by 1000000 (the number of microseconds in a second), or, alternatively, converting the values we obtain from a correctly working extract (which will use epoch) into microseconds, by multiplying by 1000000.

So for example, say I had a sum of intervals, let’s say a total time spent on processing something, and I wanted to put that in terms of minutes; I need to divide the sum by the interval of one minute. This is what has to be done;

extract(microseconds from total_processing_time) / (extract(epoch from interval '1 minute') * 1000000)

Or, alternatively;

(extract(microseconds from total_processing_time) / 1000000) / extract(epoch from interval '1 minute')
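As a worked sketch, reusing table_1 from above to put its interval in terms of minutes (the 1000000.0, with the .0, is me guarding against any possibility of integer division);

select ( extract( microseconds from column_1 - column_2 ) / 1000000.0 )
       / extract( epoch from interval '1 minute' )
from table_1;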

This is simple enough, but it took quite a bit of investigation to work out what was happening.

Two final notes.

First, the Redshift docs say nothing of it, but interval literals in Redshift do not support microseconds or milliseconds. It’s just terrible. No one checks, no one notices, Support deflect all incoming customer feedback. The easiest way I’ve found around this is to use a fractional second, e.g. interval '0.000001' second.

Second, this leaves us with the problem of making millisecond or microsecond intervals from a column of integer milliseconds or microseconds (which you will sometimes find in the system tables), rather than as literals. I was not able to work out how to do this myself; I had to ask in the dbt Slack, where someone pointed out you can use dateadd() for this - and you can - and the reason I missed it is that the documentation has a bug in the prototype for the function: it lists the second argument as an interval, rather than an integer.

So what you do is add your microseconds to an arbitrary timestamp, and then subtract that same arbitrary timestamp, like so;

dateadd( 'microsecond', stv_wlm_query_state.queue_time, '2000-01-01 00:00:00.000000'::timestamp ) - '2000-01-01 00:00:00.000000'::timestamp as query_duration

Bi-Weekly Benchmarks

Bi-weekly Redshift benchmarks are up.

  1. The ra3 node types appear to now be available in af-south-1, ap-northeast-3 and eu-south-1. The only region now (of those monitored - no China, no Gov) which lacks ra3 is ap-southeast-3.

  2. No further regions have received the disk read performance upgrade seen over the last two benchmarks (which halves disk read benchmark duration). Currently nine regions have this, thirteen do not.

  3. The disk read improvement has been revoked from dc2.large in ap-northeast-3.

  4. Node/region benchmark performance is quite a mish-mash, but there seem to be certain common themes;

     1. dc2.large in some regions, but not all, is particularly slow for disk write (normal region ~5 seconds, slow regions ~15 seconds)
     2. a bit less than half of all regions have received the disk read performance improvement
     3. some regions with the disk read performance improvement have not applied it to dc2.large
