Clusters, Nodes And Slices

A cluster is composed of nodes, which are virtual machines; nodes in turn contain slices, and slices need to be explained.

When you issue a query to Redshift, the SQL is compiled into a set of binaries; that set of binaries is distributed to the nodes, the nodes execute it, and the work it performs is the query.

Now here’s the key to it all - Redshift executes multiple copies of the set of binaries, all running in parallel - and each running copy is a slice.
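
This per-slice parallelism is visible in the system tables. As a sketch (using the svl_query_report system view, with 12345 standing in for a real query ID), the following shows every slice executing the same steps, each producing its own share of the rows:

    -- per-slice, per-step execution record for one query
    -- (12345 is a placeholder query ID)
    select slice, segment, step, label, rows
    from svl_query_report
    where query = 12345
    order by segment, step, slice;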

Now it used to be there was only one type of slice, what I’m here calling a full slice.

This was the normal type of slice; it held its (ideally equally sized) part of every table, as row distribution is on a per-slice basis, and participated fully in all queries - it could run every step.
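
To make this concrete, here is a sketch (using the stv_tbl_perm system table, with ‘lineitem’ as a hypothetical table name) of how to see the per-slice row distribution of a table; ideally the counts are near-identical across slices:

    -- rows held by each slice for one table
    -- ('lineitem' is a placeholder table name)
    select slice, sum(rows) as rows
    from stv_tbl_perm
    where trim(name) = 'lineitem'
    group by slice
    order by slice;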

Each node type in its specification indicates how many slices it has, and these are full slices. A dc2.large has two slices; a ds2.8xlarge has sixteen.
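
You can check this on a running cluster; a minimal sketch, using the stv_slices system table:

    -- number of slices on each node
    select node, count(*) as slices
    from stv_slices
    group by node
    order by node;

On a dc2.large this should show two per node; on a ds2.8xlarge, sixteen.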

Now cluster re-sizing comes into play.

There used to be only one type of cluster re-size, which these days is called “classic re-size”.

What would happen is that the number of nodes would change, which necessarily changed the number of slices, and every row would be redistributed to the new slices.

For any non-trivial data volume, this is a slow process; for larger data volumes, you’re looking at days.

AWS in time introduced a second form of re-size, “elastic re-size”.

This is a faster, but less capable re-size.

What happens is that the number of nodes changes, but the number of full slices does not. The full slices you have are redistributed, en bloc, over your new nodes. What’s more, the move of all the data does not need to complete before a slice can begin processing queries; so the slices are moved around, the cluster comes back on-line, and in the background the data is moved. If a query needs any data which has not yet arrived at the slice, the query blocks until that data arrives. (In theory, the slice will prioritize data which is needed by queries; I have strong doubts about how effective this is, but I’ve not investigated.)
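
One way to observe this (again a sketch, using stv_slices) is to record the slice-to-node mapping before and after an elastic re-size; the same global slice numbers reappear, spread over the new number of nodes:

    -- global slice numbers and the node each currently lives on;
    -- run before and after an elastic re-size and compare
    select slice, node
    from stv_slices
    order by slice;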

One of the downsides to this is that you are now changing the number of full slices per node; if you have too few, you’re missing opportunities for performance gain through parallelism, and if you have too many, you begin to overload the hardware. This is why there are limits on how many nodes you can add or remove using elastic re-size; you can’t have fewer than one full slice per node, and you’re not allowed to have too many full slices per node.

(Another downside relates to Redshift Spectrum. Each full slice is allocated up to 10 Redshift Spectrum workers, so if you use elastic re-size to increase the number of nodes in a cluster, you will not be increasing the number of Redshift Spectrum workers, because the number of full slices does not change. You will need a classic re-size, which does change the number of slices, to obtain the full complement of full slices and so the full complement of Redshift Spectrum workers.)
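
Taking the figure of up to 10 workers per full slice at face value, a rough ceiling can be sketched from the slice count (note that on ra3 nodes, after an elastic re-size, some of these slices will be the partial slices described below, which on this basis bring no workers with them):

    -- rough upper bound on Redshift Spectrum workers,
    -- at up to 10 workers per slice
    select count(*) * 10 as max_spectrum_workers
    from stv_slices;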

Now, a recent innovation is what I’ve called “partial slices”.

These are available on and only on ra3 type nodes. They are not available on dc2 or ds2 type nodes.

When, after an elastic re-size, a node has fewer than its specification number of full slices, partial slices are added to that node, in lieu of the missing full slices.

A partial slice owns no rows, can perform no disk access, and can execute only certain steps, so it cannot participate fully in query execution.

So it’s better than nothing, but it’s not a full slice.

AWS in their terminology call full slices “data slices”, and partial slices “compute slices”.
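
Newer releases expose this in stv_slices as a type column; my understanding - treat it as an assumption if your release lacks the column - is that ‘D’ marks a data (full) slice and ‘C’ a compute (partial) slice:

    -- slice types: 'D' = data (full) slice, 'C' = compute (partial) slice
    select node, slice, type
    from stv_slices
    order by node, slice;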

I’ve heard directly conflicting information about whether or not partial slices can access Redshift Spectrum, so I just don’t know right now.

Note that as far as I know, the system tables do not indicate the type of node in a cluster, so I can here provide no information about the nominal number of slices.

It’s also worth noting here that an elastic re-size can result in an unequal number of slices per node. This is bad, because the nodes then have unequal amounts of work to perform, and the cluster runs as slowly as the slowest node.
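
A sketch of a check for this, counting full slices per node (again assuming the stv_slices type column; on dc2 and ds2 every slice is a full slice, so a plain count suffices):

    -- full (data) slices per node; unequal counts mean unequal work
    select node,
           sum(case when type = 'D' then 1 else 0 end) as full_slices
    from stv_slices
    group by node
    order by node;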

Finally, and most importantly, be aware that cluster re-size, classic or elastic, is not a wholly reliable operation. Sooner or later, your cluster will enter re-size and not emerge, and then it’ll be crisis time. I advise clients never to use cluster re-size, but to build the ETL system such that multiple clusters can run concurrently, where a cluster re-size is achieved by bringing up a new cluster of the desired size, loading it with data, moving users over to it, and then shutting down the original cluster.

This seems, I’m sure, like a lot of trouble, and it is, but the alternative is a business-critical system going out of service for an indeterminate period and you having to fix that problem in a hurry while under enormous pressure, where the solution may well anyway involve bringing up a new cluster and populating it with data.

Disks (dc2 and ds2)

With the dc2 and ds2 node types, which use local disk only for storage, nodes are over-provisioned for disk; their actual disk space exceeds their specification disk space by something like 15%. I think the over-provisioning varies by node type; I’ve seen 12.5% for large nodes and 16% for small nodes, although both figures carry some uncertainty.

The system tables provide no information about the disk space in the published specification for node types; they contain only information on the actual amount of disk space, which is the over-provisioned size.

As such, with these node types, the disk space values shown on the page are for the over-provisioned size.

Note that free disk is a node-level concept, not a slice-level concept, because slices can and do use different amounts of disk; when a slice allocates disk, it consumes from the total free space on the node.
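
Both the over-provisioned capacity and the node-level free space can be seen in the stv_partitions system table; a minimal sketch (capacity and used are, as I understand it, in units of 1 MB blocks, and the part_begin = 0 filter follows the usual disk-usage query, so as to count each disk once):

    -- actual (over-provisioned) capacity, usage and free space per node
    select owner as node,
           sum(capacity) / 1024 as capacity_gb,
           sum(used)     / 1024 as used_gb,
           (sum(capacity) - sum(used)) / 1024 as free_gb
    from stv_partitions
    where part_begin = 0
    group by owner
    order by owner;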

Disks (ra3)

With the ra3 node types, the information provided by the system tables is simply the size of the local disk. I understand that the disk is used essentially as an LRU cache over RMS, except that a block does not exist on the local disk and in RMS at the same time; blocks move - they exist in only one place at a time. K-safety is provided by some other means (of which I am unaware).

As such, over-provisioning is not relevant here.