It is often the case that table data type and encoding choices are not optimal. For encoding choices, this is always the case if you are using the encoding choices made by Redshift (by not specifying encodings, or by using `create table as`); I consider its choices appalling.
My usual experience is that manually choosing data types and encodings reduces table size by a factor of two, which broadly speaking also reduces query times by a factor of two. This reduction in storage requirements and improvement in performance is perfectly likely to materially reduce the size of your cluster, and so your monthly AWS bill.
Once the sorting order has been chosen, selecting the correct data type and encoding for each column is almost a mechanical operation; one particular data type and one particular encoding almost always stand out as being correct.
The sorting order can only be chosen by knowing what queries will be issued upon the given table; I can’t help you with this, and it has nothing to do with any data which can be gathered about the table. You need to understand sorting, and what it means for tables and queries, and that’s a big subject which will be covered in a future white paper (“Introduction to the Fundamentals of Amazon Redshift”, coming soonish to a Redshift research project web-site near you :-)
However, once the sorting order has been chosen, the selection of the data type and encoding for each column, although almost a mechanical operation, does require a set of information about each column - the minimum and maximum values, and the counts of distinct, `NULL`, and `NOT NULL` values.
The analysis performed here provides this information, and figures out for you, as a guide, the smallest possible data type for the current data. It does not make a recommendation for the encoding - this requires too much knowledge about the data in the column, now and in the future. This is also true, really, for the data type - an `int2` for example might be fine now, but not in a year - but I found actually being shown the minimum possible data type quite useful, so I’ve included it here.
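As a sketch of the statistics involved - not the analysis tool’s actual query, and with SQLite standing in for Redshift and a hypothetical table `t` with column `c` - the minimum, maximum, and counts of distinct, `NULL`, and `NOT NULL` values can all be gathered in a single pass:

```python
import sqlite3

# Gather the per-column statistics the analysis needs: minimum, maximum,
# and counts of distinct, NULL, and NOT NULL values.
# Table and column names here are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (c INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(3,), (7,), (7,), (None,)])

row = con.execute(
    "SELECT MIN(c), MAX(c), COUNT(DISTINCT c), "
    "SUM(c IS NULL), COUNT(c) FROM t"  # COUNT(c) counts NOT NULL values only
).fetchone()

minimum, maximum, distinct, nulls, not_nulls = row
print(minimum, maximum, distinct, nulls, not_nulls)  # 3 7 2 1 3
```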
The code for figuring out the minimum possible data type is I believe exhaustive. I’ll go into the details below, but for example, if you have a `float4` column but the fractional part is always zero and the integer part is within the range of one of the integer types, the suggested new data type is the smallest integer type which can take all the existing data. Similarly, if you have a `varchar`, but the content is actually all, say, floats, the proposed type will be the smallest float type which can hold those values; or if you have a `timestamp` but the time part is always identical (usually that means midnight, of course), the proposed new type is `date`.
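As an illustration of the narrowing idea, here’s a minimal sketch - not the analysis tool’s actual code - covering just the integer and zero-fraction-float cases described above; the type names mirror Redshift’s `int2`/`int4`/`int8`:

```python
# Given the observed values of a column, propose the smallest integer type
# which can hold them all; floats whose fractional parts are all zero are
# treated as integers. This is an illustration, not an exhaustive rule set.

INT_RANGES = [
    ("int2", -2**15, 2**15 - 1),
    ("int4", -2**31, 2**31 - 1),
    ("int8", -2**63, 2**63 - 1),
]

def propose_type(values):
    vals = [v for v in values if v is not None]
    # a float column whose fractional parts are all zero can become an integer
    if all(isinstance(v, float) for v in vals) and all(v == int(v) for v in vals):
        vals = [int(v) for v in vals]
    if vals and all(isinstance(v, int) for v in vals):
        lo, hi = min(vals), max(vals)
        for name, type_lo, type_hi in INT_RANGES:
            if type_lo <= lo and hi <= type_hi:
                return name
    return None  # no narrowing found

print(propose_type([1.0, 250.0, -7.0]))  # int2
print(propose_type([0, 2**40]))          # int8
```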
There’s a column at the end of each row which explains the reasoning for the proposed new data type. Remember - an expert system does two things; it produces a recommendation, and it explains its reasoning.
ML/AI is not an expert system, because it cannot explain its reasoning; it’s really one of those loot boxes you get in games these days :-)
Additionally, the analysis also performs validation work, checking that constraints actually hold. This is not done in situ, which would be a subtle and common blunder, as the query planner assumes these constraints are correct; if you were, for example, to ask for the count of a given value in a unique-constrained column, the planner is perfectly within its rights, as soon as it finds one row with that value, to stop scanning the table and return the count as 1.
When these checks fail, you get a nice big fat warning in the output. It’s currently in caps lock, but I might even make it red :-) (but no blink - I’m not a sadist :-)
What’s done here is that a temp table is created, with a single column which matches the column being examined but without the constraint. The original column is copied over, and the check for uniqueness is performed on the temp table.
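The temp-table method can be sketched like so - SQLite stands in for Redshift here, the table and column names are hypothetical, and the real tool of course runs this as Redshift SQL:

```python
import sqlite3

# Copy the column into a temp table with no constraint, then test uniqueness
# there, so the check cannot be short-circuited by a planner trusting the
# (possibly invalid) constraint on the original table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (c INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (2,)])

con.execute("CREATE TEMP TABLE check_c AS SELECT c FROM t")
total, distinct = con.execute(
    "SELECT COUNT(c), COUNT(DISTINCT c) FROM check_c"
).fetchone()
is_unique = total == distinct
print(is_unique)  # False - the duplicated 2 means a unique constraint is not valid
```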
Finally, the analysis code also does simple obvious things, like letting you know for each column the constraints which could be applied (but currently are not). So if a column is marked `NULL`, but has no `NULL` values, you’ll see an indication saying `NOT NULL` could be applied.
Note you should always use `NOT NULL` whenever you can, not simply for data integrity, but because it saves you 1 bit per value, except for `varchar`, where it saves you 1 byte per value.
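The arithmetic is worth seeing; for a hypothetical billion-row column:

```python
# Rough arithmetic for the storage saved by NOT NULL, per the note above:
# 1 bit per value for most types, 1 byte per value for varchar.
rows = 1_000_000_000  # a billion-row column, as an example

bits_saved = rows * 1           # non-varchar: one NULL-flag bit per value
bytes_saved_varchar = rows * 1  # varchar: one byte per value

print(bits_saved // 8 // 1024 // 1024)      # ~119 MiB for a non-varchar column
print(bytes_saved_varchar // 1024 // 1024)  # ~953 MiB for a varchar column
```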
Name | Datatype |
---|---|
schema | name |
table | name |
column | name |
data type:current | varchar |
data type:proposed | varchar |
encoding | varchar |
sortkey | int4 |
constraints:NOT NULL | bool |
constraints:primary key | varchar |
constraints:unique | varchar |
constraints:foreign key | varchar |
counts:NOT NULL | int8 |
counts:NULL | int8 |
counts:distinct | int8 |
values:minimum | varchar |
values:maximum | varchar |
proposed data type reasoning | varchar |
The schema name.
The table name.
The column name.
Current column data type.
Proposed column data type. The goal is to minimize data storage cost.
Value | Encoding Type | Encoding ID |
---|---|---|
raw | raw | 0, 128 |
bytedict | Byte Dictionary | 1 |
delta | Delta (one byte) | 2 |
lzo | LZO | 3, 131 |
runlength | Runlength | 4 |
delta32k | Delta (two byte) | 5 |
text255 | Text255 (255 byte limit) | 7 |
mostly8 | Mostly (one byte) | 15 |
mostly16 | Mostly (two byte) | 16 |
mostly32 | Mostly (four byte) | 17 |
text32k | Text32k (no limit) | 18 |
zstd | ZSTD | 19, 147 |
az64 | AZ64 | 20, 148 |
unknown | Unknown | n/a |
I’ve included the encoding IDs, because they’re a little intriguing. First, we can see some encodings have more than one ID, which is unexpected. Then we can see what looks like a missing encoding, number six. What was that? What happened to it?
No automated recommendation for encoding can be made, because encoding choices depend upon actually understanding the data in the table, not just analyzing it. For example, encoding choice is dependent upon sorting order, and sorting order is in my experience with clients often incorrect.
The sortkey ordinal of the column, `NULL` if the column is not part of the sortkey.
Value | Meaning |
---|---|
True | NOT NULL is set |
(empty) | NOT NULL is not set |
Viable | NOT NULL is not set, but it could be set |
When a column is `NULL`, this field is left empty, to make reading easier.
When a column is `NULL`, but has no `NULL` values, it is marked as “Viable”.
Always set `NOT NULL` when the data permits it; for data quality, but also because this saves storage space. A `NULL`able column consumes one extra bit per value, to store the `NULL` flag, except for `varchar`, which consumes one extra byte per value.
This column provides information only about single-column primary keys.
Value | Meaning |
---|---|
Valid | Primary key is set and is valid |
(empty) | Primary key is not set |
INVALID | Primary key is set and is NOT valid |
Viable | Primary key is not set, but it could be set |
When a column is not a primary key, this field is left empty, to make reading easier.
This column provides information only about single-column unique constraints.
Value | Meaning |
---|---|
Valid | Unique is set and is valid |
(empty) | Unique is not set |
INVALID | Unique is set and is NOT valid |
Viable | Unique is not set, but it could be set |
When a column is not unique, this field is left empty, to make reading easier.
Value | Meaning |
---|---|
Valid | A foreign key is set and is valid |
(empty) | A foreign key is not set |
INVALID | A foreign key is set and is NOT valid |
When a column has no foreign key, this field is left empty, to make reading easier.
Foreign key columns must be primary key columns in their tables. The analysis checks that those primary keys are valid, and this is the validity/invalidity information shown here.
Number of rows which are not `NULL`.
Number of rows which are `NULL`.
The sum of `counts:NOT NULL` and `counts:NULL` is the number of rows in the table.
Number of distinct values. Very important for encoding choices.
The smallest value in the table.
The largest value in the table.
There is a fixed set of rules which drives the proposal of new data types. When a new data type is proposed, this column gives the reasoning which led to the proposal being made.
Remember: an expert system has two properties. It provides a recommendation, and it explains its reasoning.