Limitations Of The Queries, Store And Io Columns

The system tables are very difficult to use, because they are essentially undocumented and such documentation as exists is in places incorrect, are quite poorly designed, have quite a lot of unfriendly and unexpected behaviour, and are presenting information about Redshift, where what Redshift is actually doing is also undocumented, and Redshift has changed considerably over time, such that even poor as the system tables were originally, they have become even less able to express the actual events or state in the system.

I’ve spent many years working full-time with the system tables, and the columns I’ve created here are an attempt to produce reliable and useful information, as best can be done, from the system tables.

Where the system tables are very difficult to work with, what I’ve ended up with is necessarily imperfect and limited. Exactly how needs to be explained, so you are aware of the limitations and nature of what you’re looking at.

Limitations of The Cluster-Level System Tables

With Redshift, there is the cluster and there are databases, and information about the cluster and the databases are stored in the system tables, which are tables Redshift uses to keep track of what it’s doing, and also to publish information to users.

Many system tables are arranged such that when you connect to a database, you see only data for that database. Some system tables, however, provide information for all databases.

When connection to a Redshift cluster, a database must be specified, and that is the database which is connected to; only one database can be connected to at any time, and there can (at the time of writing) be up to sixty databases.

It is awkward and difficult, I would even say unnatural, to collect information from all databases. A connection would be needed for each database, and then the data taken from each would have be joined in the back-end of the Redshift Workbench.

It is then that this cluster level pages use information only from the system tables which contain information about all databases.

However, in these system tables, which contain information about all databases, it is not possible to know which user owns a table.

As such, it is not possible to show store information for groups or users at the cluster level, because although I know which user belongs to which group, I do not know which block belongs to which user.

Queries

For leader node queries, the system tables do not provide a complete list of all queries executed on the leader node.

To my knowledge there are three system tables which provide information.

The first is stv_recents, which lists the most recent 100 issued queries, be they leader node or worker node, but carrying as it does only the most recent 100 queries, this table is only useful for finding queries which are running on the leader node, but currently there appears to be an regression, and this table appears now no longer seems to show running queries.

The other two are the table stl_ddltext and stl_utilitytext (actually intended to provide query texts) provide a lot of coverage and the only record of leader node queries, but in particular they do not record select statements.

For leader node queries then the count is as accurate as I can make it.

For worker node queries (on main, or CSC clusters), I believe the count is correct.

Store

The values shown in these columns come from stv_blocklist. Whatever shows up in there, shows up in these columns. I need to investigate and pin down exactly what shows up in this table. Standard Redshift tables, of course, and temporary tables, but transient results? spill to disk? I do not yet know.

By and large then, I consider the store values correct and meaningful.

Remember though, there’s no easy way to figure out how many rows (and so blocks) in a table have been deleted, and are now old enough to never be seen by any executing query, and so would be expunged by VACUUM. All the block and row counts include these rows, because they’re all in stv_blocklist.

I/O Bytes Read

Queries are composed of streams, which execute serially, where streams are in turn composed of segments, which execute in parallel, and segments are in turn composed of steps, all of which naturally also execute in parallel.

Steps are the atomic units of work - like reading from a disk, performing a hash, or an aggregation, and so on.

Most steps have a system table which records their work, with one row per step, per slice, per query (a query may well have multiple steps of the same type, for example, one scan step for every table which is read).

This column is derived from one of these step tables, stl_scan.

The scan step reads data from a source, and that source can be disk (of various types - leader node tables, worker node tables, temporary tables, etc) or network, and there are various network sources, such as rows broadcast by other slices, data from S3, and so on.

There is a column in stl_scan which indicates the type of scan performed, but it’s a little hazy, and it’s also not kept up to date. Every now and then a net step type turns up (there’s one now, type 9) which is not documented.

The two columns for “byte read” then show the total number of bytes read from disk (both leader node and worker node), and from network (all sources - S3, RMS, other nodes, etc), as best I can judge the types indicated in stl_scan.

Note, and importantly, for network, I am counting only the receive side of network activity - so we’re not counting bytes twice.

I/O Rows Inserted / Deleted / Returned

One of the few really simple and robust measures of work present in the system tables is the count of the number of rows inserted into, and deleted from, a table. Although obviously that data has limitations, where simple and robust data is hard to find, it is valuable.

Note that for tables with all distribution, this is the physical number of rows (i.e. one per node), not the logical number of rows.

In addition, there is often a third column, which is the number of rows returned, either by the leader node to the user, or by worker nodes to the leader node. The table which provides this information is stl_return, which has no column to indicate a table (which makes sense, as rows are returned by a query, not by a table), and so this column is missing when a page does not examine queries (such as the Tables page) when computing these columns.

I/O Bytes Processed

These are the most “thumb in the air” columns. They are most definitely incomplete, and the numbers presented, “bytes processed”, are the most vague here, as it’s not clear what that really, specifically means, excepting that “more bytes” can probably be read to mean “more work”, and that in memory is okay, while on disk is bad.

So, as described earlier, steps are in Redshift the atomic unit of work and most of them have their own system table, which records work they do, and so there’s quite a lot of these low-level tables.

Now, for a few steps, but important steps, the step has a choice of whether it performs its work in memory, or on disk. A step will use memory if it can, and it can if the amount of data is less than the memory available to the step. The amount of memory available to the step depends on how much memory has been allocated to the queue and so to the slot the query is running in; and the amount of data to be processed, using that memory, depends on the query the user has issued.

If there is not enough memory for the step to run in memory, then perforce it goes to disk. This is always bad. The more data there is, the more work the disk will have to do, and this is harmful to all users on the cluster, as disk is the slowest shared resource.

Of the thirty of so steps, only five have system tables which indicate both the number of bytes processed and whether or not the step ran in memory or on disk, these being stl_aggr, stl_hash, stl_save, stl_sort and stl_unique. However, this short list does include the hash step, and that I think is the most important of all steps with regard to performance, because improper queries usually fail in that they end up issuing huge hash steps.

The two columns here then reflect the count of bytes from these five steps, for when they run in memory or on disk. The other steps are not considered (as they have no column indicating the number of bytes processed, usually I think because it’s not a meaningful value for the work they do).

Broadly speaking, large values for on disk are a very bad sign, and that’s the intent here - to provide visibility into how much this is happening.