Table (analysis) v1

Description

It is often the case that table data type and encoding choices are not optimal. For encoding choices, this is always the case if you are using the choices made by Redshift (by not specifying encodings, or by using create table as); I consider its choices appalling.

My usual experience is that manually choosing data types and encodings reduces table size by a factor of two, which broadly speaking means reducing query times by a factor of two. This reduction in storage requirements and improvement in performance is perfectly likely to materially reduce the size of your cluster, and so your monthly AWS bill.

Once the sorting order has been chosen, selecting the correct data type and encoding for each column is almost a mechanical operation; one particular data type and one particular encoding almost always stand out as being correct.

The sorting order can only be chosen by knowing what queries will be issued upon the given table; I can’t help you with this, and it has nothing to do with any data which can be gathered about the table. You need to understand sorting, and what it means for tables and queries, and that’s a big subject which will be covered in a future white paper (“Introduction to the Fundamentals of Amazon Redshift”, coming soonish to a Redshift research project web-site near you :-)

However, once the sorting order has been chosen, the selection of the data type and encoding for each column, although almost a mechanical operation, does require a set of information about each column - the minimum and maximum values, and the counts of distinct, NULL and NOT NULL values.
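
As a rough sketch of the sort of query involved in gathering this information (the schema, table and column names here are hypothetical, and you’d run one such query per column), consider:

  SELECT
    MIN(some_column)                                      AS minimum_value,
    MAX(some_column)                                      AS maximum_value,
    COUNT(DISTINCT some_column)                           AS distinct_values,
    COUNT(some_column)                                    AS not_null_values,  -- COUNT(column) ignores NULLs
    SUM(CASE WHEN some_column IS NULL THEN 1 ELSE 0 END)  AS null_values
  FROM
    some_schema.some_table;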

The analysis performed here provides this information, and figures out for you, as a guide, the smallest possible data type for the current data. It does not make a recommendation for the encoding - this requires too much knowledge about the data in the column, now and in the future. This is also true, really, for the data type - an int2 for example might be fine now, but not in a year - but I found actually being shown the minimum possible data type quite useful, so I’ve included it here.

The code for figuring out the minimum possible data type is, I believe, exhaustive. I’ll go into the details below, but for example, if you have a float4 column but the fractional part is always zero and the integer part is within the range of one of the integer types, the suggested new data type is the smallest integer type which can hold all the existing data.

Similarly, if you have a varchar, but the content is actually all, say, floats, the proposed type will be the smallest float type which can hold those values, or if you have a timestamp but the time part is always identical (usually that means midnight, of course), the proposed new type is date.
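
To illustrate, here are sketches of the two checks just described, again with hypothetical schema, table and column names:

  -- Can this float4 column become an int2?  It can if no value has a
  -- fractional part and all values fit the int2 range (-32768 to 32767).
  SELECT
    COUNT(*) = 0 AS int2_viable
  FROM
    some_schema.some_table
  WHERE
       some_float4_column <> FLOOR( some_float4_column )
    OR some_float4_column < -32768
    OR some_float4_column > 32767;

  -- Can this timestamp column become a date?  It can if the time part is
  -- always midnight.
  SELECT
    COUNT(*) = 0 AS date_viable
  FROM
    some_schema.some_table
  WHERE
    some_timestamp_column <> DATE_TRUNC( 'day', some_timestamp_column );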

There’s a column at the end of each row which explains the reasoning for the proposed new data type. Remember - an expert system does two things: it produces a recommendation, and it explains its reasoning.

ML/AI is not an expert system, because it cannot explain its reasoning; it’s really one of those loot boxes you get in games these days :-)

Additionally, the analysis performs validation work, and checks that;

  1. primary key and unique columns really are unique
  2. foreign key columns also really are unique

This is not done in situ, on the original column - which to my eye is a common blunder - as the query planner assumes these constraints are correct, and so if you for example ask for the count of a given value in a unique-constrained column, the planner is perfectly within its rights, as soon as it finds one row with the given value, to stop scanning the table and return a count of 1.
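
To make that concrete, with a hypothetical table where unique_col carries a unique constraint:

  -- The planner trusts the unique constraint; it is entitled to stop
  -- scanning at the first matching row and return 1, even if the data
  -- actually contains duplicates of this value.
  SELECT COUNT(*) FROM some_table WHERE unique_col = 42;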

When these checks fail, you get a nice big fat warning in the output. It’s currently in caps lock, but I might even make it red :-) (but no blink - I’m not a sadist :-)

What’s done here is that a temp table is created, with a single column which matches the column being examined but without the constraint. The original column is copied over, and the check for uniqueness is performed on the temp table.
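
In SQL terms, the check looks something like this (hypothetical names, and assuming for the sake of the example that the column being examined is an int8):

  -- A single column matching the examined column's type, but with no
  -- uniqueness constraint, so the planner cannot take short-cuts.
  CREATE TEMP TABLE check_uniqueness ( col int8 );

  INSERT INTO check_uniqueness ( col )
  SELECT some_column FROM some_schema.some_table;

  -- Any rows returned here are duplicated values.
  SELECT col, COUNT(*) AS occurrences
  FROM check_uniqueness
  GROUP BY col
  HAVING COUNT(*) > 1;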

Finally, the analysis code also does simple obvious things, like letting you know for each column the constraints which could be applied (but currently are not). So if a column is marked NULL, but has no NULL values, you’ll see an indication saying NOT NULL could be applied.

Note you should always use NOT NULL whenever you can, not simply for data integrity, but because it saves you 1 bit per value, except for varchar, where it saves you 1 byte per value.

Columns

Name                          Datatype
schema                        name
table                         name
column                        name
data type:current             varchar
data type:proposed            varchar
encoding                      varchar
sortkey                       int4
constraints:NOT NULL          bool
constraints:primary key       varchar
constraints:unique            varchar
constraints:foreign key       varchar
counts:NOT NULL               int8
counts:NULL                   int8
counts:distinct               int8
values:minimum                varchar
values:maximum                varchar
proposed data type reasoning  varchar

Column Descriptions

schema

The schema name.

table

The table name.

column

The column name.

data type:current

Current column data type.

data type:proposed

Proposed column data type. The goal is to minimize data storage cost.

encoding

Value      Encoding Type             Encoding ID
raw        raw                       0, 128
bytedict   Byte Dictionary           1
delta      Delta (one byte)          2
lzo        LZO                       3, 131
runlength  Runlength                 4
delta32k   Delta (two byte)          5
text255    Text255 (255 byte limit)  7
mostly8    Mostly (one byte)         15
mostly16   Mostly (two byte)         16
mostly32   Mostly (four byte)        17
text32k    Text32k (no limit)        18
zstd       ZSTD                      19, 147
az64       AZ64                      20, 148
unknown    Unknown                   n/a

I’ve included the encoding IDs because they’re a little intriguing. First, we can see some encodings have more than one ID, which is unexpected. Then we can see what looks like a missing encoding, number six. What was that? What happened to it?
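
If you want to poke at the raw IDs yourself, they can be seen in pg_attribute - attencodingtype holds the ID, and format_encoding() renders it as a name. A sketch (I believe these catalog fields exist in current Redshift, but treat this as an assumption):

  -- attencodingtype is Redshift-specific; format_encoding() maps the
  -- numeric ID to the encoding name.
  SELECT
    n.nspname,
    c.relname,
    a.attname,
    a.attencodingtype,
    format_encoding( a.attencodingtype ) AS encoding
  FROM pg_attribute  AS a
  JOIN pg_class      AS c ON c.oid = a.attrelid
  JOIN pg_namespace  AS n ON n.oid = c.relnamespace
  WHERE a.attnum > 0
    AND n.nspname = 'public';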

No automated recommendation for encoding can be made, because encoding choices depend upon actually understanding the data in the table, not just analyzing it. For example, encoding choice is dependent upon sorting order, and sorting order is in my experience with clients often incorrect.

sortkey

The sortkey ordinal of the column, NULL if the column is not part of the sortkey.
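
You can see these ordinals for yourself via pg_table_def (which, remember, only shows tables in schemas on your search_path); for example:

  -- As I understand pg_table_def's conventions, sortkey is 0 for columns
  -- not in the sortkey, a positive ordinal for compound sortkey columns,
  -- and negative for interleaved.
  SELECT "column", sortkey
  FROM   pg_table_def
  WHERE  schemaname = 'public'
    AND  tablename  = 'some_table'
    AND  sortkey   <> 0;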

constraints:NOT NULL

Value    Meaning
True     NOT NULL is set
(empty)  NOT NULL is not set
Viable   NOT NULL is not set, but it could be set

When a column is NULL, this field is left empty, to make reading easier.

When a column is NULL, but has no NULL values, it is marked as “Viable”.

Always set NOT NULL when the data permits it, both for data quality and because it saves storage space. A NULLable column consumes one extra bit per value, to store the NULL flag, except for varchar, which consumes one extra byte per value.

constraints:primary key

This column provides information only about single-column primary keys.

Value    Meaning
Valid    Primary key is set and is valid
(empty)  Primary key is not set
INVALID  Primary key is set and is NOT valid
Viable   Primary key is not set, but it could be set

When a column is not a primary key, this field is left empty, to make reading easier.

constraints:unique

This column provides information only about single-column unique constraints.

Value    Meaning
Valid    Unique is set and is valid
(empty)  Unique is not set
INVALID  Unique is set and is NOT valid
Viable   Unique is not set, but it could be set

When a column is not unique, this field is left empty, to make reading easier.

constraints:foreign key

Value    Meaning
Valid    A foreign key is set and is valid
(empty)  A foreign key is not set
INVALID  A foreign key is set and is NOT valid

When a column has no foreign key, this field is left empty, to make reading easier.

Foreign key columns must be primary key columns in their own tables. The analysis checks that those primary keys are valid, and this is the validity/invalidity information shown here.
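
If you want to see which foreign keys exist, the information is in pg_constraint; a sketch:

  -- contype 'f' marks foreign key constraints; conrelid is the
  -- referencing table, confrelid the referenced table.
  SELECT
    con.conname,
    child.relname  AS referencing_table,
    parent.relname AS referenced_table
  FROM pg_constraint AS con
  JOIN pg_class      AS child  ON child.oid  = con.conrelid
  JOIN pg_class      AS parent ON parent.oid = con.confrelid
  WHERE con.contype = 'f';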

counts:NOT NULL

Number of rows which are not NULL.

counts:NULL

Number of rows which are NULL.

The sum of counts:NOT NULL and counts:NULL is the number of rows in the table.

counts:distinct

Number of distinct values. Very important for encoding choices.
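
On large tables an exact distinct count is expensive; if an estimate will do, Redshift’s HyperLogLog-based approximation is much cheaper (hypothetical names again):

  -- Exact; slow on big tables.
  SELECT COUNT( DISTINCT some_column ) FROM some_schema.some_table;

  -- Approximate (HyperLogLog); much faster, at the cost of a small error.
  SELECT APPROXIMATE COUNT( DISTINCT some_column ) FROM some_schema.some_table;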

values:minimum

The smallest value in the table.

values:maximum

The largest value in the table.

proposed data type reasoning

There is a fixed set of rules which drives the proposal of new data types. When a new data type is proposed, this column gives the reasoning which led to the proposal being made.

Remember: an expert system has two properties. It provides a recommendation, and it explains its reasoning.