Redshift Research Project

Blog

2021-10-01

Redshift Improvements

I’ve made a list of some improvements I’d like to see in Redshift.

2021-10-08

CTE

I have almost never used CTEs. I’ve never had a need to.

I’m now on a gig (contract) where I have learned something; it looks to me like one of the reasons CTEs are used is in fact because the underlying data is badly disorganized and non-relational.

You end up needing CTEs to get from the starting data to something which works with the relational data model.

2021-10-09

octet_length()

Just found a bug, either in octet_length() or in the docs, since they differ.

Docs are here.

Docs sayeth;

Length calculations do not count trailing spaces for fixed-length character strings but do count them for variable-length strings.

My cluster sayeth;

dev=# select column_01 from table_1;
    column_01     
------------------
 oink            
(1 row)

dev=# select octet_length(column_01) from table_1;
 octet_length 
--------------
           16
(1 row)

The column column_01 is a char(16).

2021-10-11

White Paper Delayed

I’m afraid the white paper for this week is late. I hope to publish on Monday. I’ve been investigating random numbers, and there’s been much more development of investigative method than usual, in part because of unrelated discoveries during investigation, one of which is central to Redshift architecture and which is described (otherwise the white paper can’t make any sense), but not investigated (as this white paper is about random numbers), and in part because some of the tests are time consuming, as large numbers of single queries must be issued.

UNLOAD

I very rarely use UNLOAD.

I’m using it now, bit of work for client, I’ve just noticed when you output to CSV, it forces all column titles to lower case. That’s a problem, because I’m trying to exactly match an existing CSV.

2021-10-12

Delayed White Paper

It’s been another full evening and early morning trying to figure out why two of the PRNG tests are not producing the same results on each run - which they should do, since seed is set to 0.

I ended up going down into the lowest level of debugging and I’ve finally figured it out. I believe the conclusion is that due to design flaws in the PRNG, these tests cannot produce the same results, even with the same seed.

Exactly why will be explained in the white paper.

PRNG White Paper Update

I have finished the investigative work.

I now need to generate the bitmap images of the various PRNG outputs, which I think will take about four or five hours, and write up the results.

2021-10-14

White Paper Update

Last night, overnight, the final run of the evidence-generating script occurs, which took just over 19,000 seconds.

I’m now fully into write-up mode; unless I discover a flaw or bug or something in the data which is strange and had been overlooked, dev work has finished.

2021-10-17

Yet Another Update

Well, I’m still working on the PRNG white paper.

Every time I think I’m done, there’s some loose thread, which I have to pull on to check, and then it turns everything I thought I understood upside-down. Again =-)

Fortunately, the weekend now, so I have much more time to work on this.

Doing my very best to publish tomorrow (Sunday 17th Oct).

2021-10-18

Published - Finally

You know, I thought it was done and I would be publishing Saturday morning, for once, but again there was yet another thread which didn’t quite make sense, and pulling on it led to another round of everything I thought I understood being turned upside-down :-)

Finally, though, all the pieces fit and there are no more threads to pull on.

Three weekends and most evenings during the two weeks. Problem is I’m contracting at the moment, so I’m working full-time, so I don’t have much time left over to work on research. With most of the white papers, I’m basically tidying up and making a script for material I’ve investigated over the last few years, but pinning down exactly what was wrong with the PRNG was new, and so it took as long as it took.

2021-10-20

Redshift Compiler Change

The latest release of Redshift has moved from GCC 3 (from about 2008) to GCC 7.3.0.

This is a scary change to make on any large (and I suspect unmaintainable) code base which needs to be reliable and where you have large numbers of existing customers, but it’s absolutely necessary, something which I’ve been unhappy about for a very long time, and I’m very, very, very glad to see it’s been done.

Unfortunately, in doing so, the output from version() has changed, to this;

dev=# select version();
                          version                          
-----------------------------------------------------------
 Redshift 1.0.32574 on Amazon Linux, compiled by gcc-7.3.0
(1 row)

It used to look like this;

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.28422

This is going to break a lot of existing code, which is going to be processing this string using fixed offsets of various kinds. I’ve seen a confirmed report that it’s breaking SQL Alchemy.

I’m a bit alarmed that the team which moved from GCC 3 to 7 is also the team which did not realise changing the version string would cause problems. But it might have been they did know this, and this change is accidental.

Version String Update / Remediation

So, new clusters are now returning the old style of version string. The Redshift version has risen, but the other version numbers are unchanged (and so are now in fact incorrect, but given the fall-out and the need for speed, changing them now would be imprudent).

Existing clusters which produce the problem version string can be fixed by Support, by them making a parameter change (which you cannot make) followed by a reboot.

It’s that or shut down your cluster and start it up again with the fixed version which is now in use.

Support are now aware of the issue, and if you contact them, they’ll know what you’re talking about

You can either contact Support, or if you don’t have support (I don’t :-), tack your name on to the end of this thread, which Support are watching;

https://forums.aws.amazon.com/thread.jspa?messageID=998260

View SQL Dump Broken

In Redshift, there are one or two ways to get hold of the SQL for a view. There are functions which get the SQL, given a view oid, and the system table pg_views lists the SQL (almost certainly using one of the functions).

In all cases, until today, the function returned either SQL or NULL.

Now the RS devs have it returning arbitrary error text in that column - which SQL parsers barf on, oddly enough.

The same thing happens in the system tables showing query texts.

This is absolutely and categorically wrong.

What you do here is add a column; you don’t start using an existing column in a wholly new way which breaks existing code. This is bloody obvious.

This problem is of course exactly the same in its nature as the problem we just saw earlier today where the version string format was changed.

This is a database - absolutely central to large, commercial systems. It’s practically an OS kernel in terms of its significance and criticality at the center of things. Users should never find any code broken because of a version upgrade.

I am of the view, from what I’ve seen over time, that the devs do not have the sense and experience to know that changes like this are wrong, cannot be done, and will cause problems if they are done.

New Redshift Release

As you will have noticed from the recent flurry of posts, there’s a new RS version out.

I’ve published the system table dump over in the System Table Explorer.

2021-10-21

Leading Questions

It looks like the recent version string change issue breaks the AWS RS JDBC driver - which is to say, their own driver.

I’ve read one direct report, citing driver version 2.0.0.3 with RS version 1.0.32574, and that the previous version works fine, and I have a link to the relevant source in the driver;

https://github.com/aws/amazon-redshift-jdbc-driver/blob/51a529e844e1f742ba2ccc5e6395247d6048735e/src/main/java/com/amazon/redshift/core/ServerVersion.java#L103

It’s obviously not great that the AWS RS team were unaware of the behaviour of their own driver, but it might be a different team make the driver.

However, this leads to another question - does this not imply that the RS test suite does not contain a check that JDBC connectivity is working?

2021-10-24

Encodings

So, this week’s white paper turned out to be about and only about bytedict encoding.

I have investigated about half the encodings.

I wanted to produce a white paper covering all encodings, but this was utterly unrealistic over the course of a weekend :-)

I completed the introduction (programmatic enumeration of data types and encodings) and the investigation of raw, by the end of Friday evening, and then got started on bytedict.

That took until about 5pm Sunday, and that was one of the encodings I’d already investigated! I learned quite a bit more in the process of making the white paper.

So I’m thinking to publish one white paper per encoding and once they’re all done, bind them together into a single white paper, the original idea, which covers all encodings.

2021-10-25

SUPER

I’ve now finally read the docs for SUPER. I think it’s a stupendously bad idea. The whole point of SQL is strict typing to reduce data bugs. If you make a type which has relaxed typing, you are now relying on software engineers to write code with no bugs, as opposed to relying on the SQL engine with strict typing to tell them when they have made bugs.

2021-10-26

hllsketch Dummy Data

So I’ve been spending a bit of time now to get to know this data type, and SUPER, and geometry (I’ve done GIS before, so it’s more getting to know what Redshift is doing with it).

I’m now trying to make some dummy data for hllsketch, to populate a test table, one of the columns is hllsketch.

Mmm.

Problems.

So, there’s only four HLL functions. One takes an argument indicating what data to examine and returns an int8 - it returns the cardinality. Then there’s a function which takes a hllsketch and gives you the cardinality from it. That leaves two more - one combines two hllsketch data types and combines them, returning a hllsketch (so you need two of them in the first place) and, finally, we have hll_create_sketch(), which takes an argument of what data to examine and returns a hllsketch.

Problem is that’s an aggregate function. I can’t use it in INSERT with VALUES when there’s more than row being inserted.

It’d be like doing this;

insert into
  test_table( column_1 )
values
  ( max(10) ),
  ( max(5) );

Doesn’t work, ERROR: cannot use aggregate function in multi-row VALUES.

Problem is, there is no other way to generate values of this data type.

So I think what I have to do now is an INSERT from SELECT, but where the SELECT is one row per select, UNIONing them all together, and then I think it’ll work. Not nice though.



Home 3D Друк Blog Bring-Up Times Consultancy Cross-Region Benchmarks Email Forums IRC Mailing Lists Reddit Redshift Price Tracker Redshift Version Tracker Redshift Workbench System Table Tracker The Known Universe Twitter White Papers