Queries v1

Description

This source provides information about all queries - queued, running, aborted, completed, the lot.

Query Ordering

The first question is the ordering of queries.

Now, in Redshift, there are leader node queries and there are worker node queries, and they are not quite the same.

Leader node queries run directly on the leader node; they do not participate in WLM (the queuing system for worker node queries), and so have no notion of queuing and are not compiled; they have only a start time, when they began executing.

Worker node queries participate in WLM, have a notion of queuing, and are compiled; so they have the time they enter the WLM system, the time they begin queuing and, once they begin executing, the time they began executing (note that these days compilation is intermixed with execution - it used to be that all compilation occurred prior to execution - more about this in cluster_query_compile_plan).

(Leader node queries have only a process ID and a transaction ID. Worker node queries also have a task ID and a query ID. Note the transaction IDs for leader node queries are in a very different number range to the transaction IDs for worker node queries. However, transactions which contain both leader node and worker node queries do seem to work correctly.)

For ordering, leader node queries use their start time, and worker node queries use the time they enter WLM; the actual state of the query (queued, running, etc) is not considered in the ordering of queries.

The idea here is to see the flow of queries in actual chronological order, to convey the query flow as it arrives at the cluster.
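
By way of illustration, here's a minimal sketch of the ordering rule, expressed directly against the system tables; the choice of stl_wlm_query for the WLM entry time and svl_statementtext for leader node start times is my assumption about where such times can be found, not necessarily what this page actually reads from.

  -- worker node queries: ordered by the time they entered WLM
  select q.query                     as qid,
         q.xid,
         q.pid,
         w.service_class_start_time  as ordering_time
  from   stl_query          as q
         join stl_wlm_query as w on w.query = q.query
  union all
  -- leader node queries: ordered by their start time (no WLM, no queuing)
  -- (in practice this side would be filtered down to leader node statements only)
  select null::int4                  as qid,
         s.xid,
         s.pid,
         s.starttime                 as ordering_time
  from   svl_statementtext as s
  where  s.sequence = 0  -- one row per statement; text is split into 200 character chunks
  order by ordering_time asc;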

A query - with the same query ID - can queue and execute more than once, and so can have multiple rows here. As far as I can tell, this is rare, so I’ve not taken any particular steps to deal with it, but be aware that query ID is, surprisingly, not unique.

Queries which use Python functions obey different and more restrictive queuing rules, so you can see queries which apparently are being leap-frogged in their queue; this is why.

  1. Limitations of the queries, store and I/O columns
  2. Volume of records

Columns

Name                           Type
start                          timestamp
end                            timestamp
state                          varchar
routing:condition              varchar
queue_id                       int8
routing:queue                  varchar
routing:csc                    varchar
owner_user_id                  int4
owner                          varchar
pid                            int4
xid                            int8
tid                            int4
qid                            int4
segments:compiled              int8
segments:total                 int8
durations:queue                interval
durations:compile              interval
durations:execute              interval
durations:perceived            interval
i/o:bytes read:disk            int8
i/o:bytes read:network         int8
i/o:rows:inserted              int8
i/o:rows:deleted               int8
i/o:rows:returned              int8
i/o:bytes processed:in memory  int8
i/o:bytes processed:on disk    int8
text                           varchar
related pages                  varchar

Column Descriptions

start

For leader node queries, the time the query began executing, as there is no queuing on the leader node.

For worker node queries, the time the query was recognized by the WLM system (which is to say, the very first moment the query existed, prior even to the routing rules dispatching the query to a queue).

end

The time the query completed.

state

The state of the query. According to stv_wlm_query_state, some possible values are:

State
Classified
Completed
Dequeued
Evicted
Evicting
Initialized
Invalid
Queued
QueuedWaiting
Rejected
Returning
Running
TaskAssigned

The docs are ten years old, manually updated, and I think not reviewed by technical staff, so they are quite often incorrect in some way; they also specifically indicate this is not an exhaustive list, so take it as a rough guide.

It’s not obvious to me what all of these mean, and the documentation is silent beyond providing this list.
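
To see which states are actually present on a given cluster at a given moment, something like this trivial sketch against stv_wlm_query_state itself can be run;

  -- count the worker node queries currently in each WLM state
  select state,
         count(*) as number_of_queries
  from   stv_wlm_query_state
  group by state
  order by number_of_queries desc;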

routing:condition

The full text of the conditions of the routing rule which routed this query.

queue_id

The queue ID. This column is emitted in CSV exports only.

routing:queue

The name of the queue which received this query.

routing:csc

CSC if this query went to a CSC cluster, NULL otherwise.

owner_user_id

The owner user ID. This column is emitted in CSV exports only.

owner

The owner user name.

pid

The process ID.

When a user connects to the cluster, a process is spawned, which accepts that connection. That’s the process the ID is for. That process is also the process which issues queries, and when you need to cancel a query, you in fact cancel the process, which, by being cancelled, inherently also cancels the query it is running.
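
So, to kill a query, you find the process ID and cancel that; a sketch, where the pid shown is simply made up for illustration;

  -- find the pid of the offending query
  select pid, starttime, duration, user_name, query
  from   stv_recents
  where  status = 'Running';

  -- cancel the process, which cancels the query it is running
  cancel 18764;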

xid

The transaction ID.

tid

The task ID.

I’m not clear what tasks are really about. I think it’s to do with queries being bounced around queues by routing rules, but I’ve not looked into it. I did think it would mean that a query ID was used for one and only one query, but this is not the case; to my considerable surprise, you can find the same query ID showing up more than once in stl_query, for example.
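
A quick way to see this for yourself, as a sketch;

  -- query IDs which appear more than once in stl_query
  select query,
         count(*) as number_of_rows
  from   stl_query
  group by query
  having count(*) > 1
  order by number_of_rows desc;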

qid

The query ID. This is really the WLM query ID; leader node queries have no query ID, at least, not that you see in the system tables. They do really have query IDs, of course, but they’re on the leader node only and they never show up in the system tables. All we see from the leader node are transaction IDs and process IDs.

segments:compiled

Of the total number of segments, the number which were marked as compiled in svl_compile_info.

However, this system table has not changed over the years, and so is now inadequate to represent what’s actually going on with compilation.

segments:total

The total number of segments for this query.
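
The per-query counts behind these two columns can be sketched along these lines; I’m using svl_compile here, which carries a per-segment compile flag, and whether this matches exactly what this page reads is an assumption on my part.

  -- per query, the number of segments and how many were actually compiled
  select query,
         count(*)      as segments_total,
         sum(compile)  as segments_compiled   -- compile is 0 or 1 per segment
  from   svl_compile
  group by query
  order by query;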

durations:queue

The time spent queuing. I take this from stv_wlm_query_state. I know the same query can be queued multiple times; I do not yet know what happens to the queue time, whether it represents the cumulative queue time, or only the time of the most recent queuing.
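
For what it’s worth, the raw value looks like this; the division by 1,000,000 assumes queue_time is in microseconds, which is what the docs say.

  -- queue time per currently tracked worker node query
  select query,
         state,
         queue_time / 1000000.0 as seconds_queued   -- queue_time is in microseconds
  from   stv_wlm_query_state
  order by queue_time desc;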

durations:compile

It used to be compiling occurred in full, prior to execution. These days, compiling occurs in-line with execution, with each stream, if it needs it, performing compilation before it begins.

Additionally, it is possible for a query to be executed more than once - for example, it might enter SQA, be executed for a while, then be stopped, and be queued again, and execute again, in a normal queue.

In this situation, it could be some or even all compilation occurs during the first execution in SQA, and so occurs potentially long before the second execution of the query.

Given all this, I can’t see any reliable way now to untangle compilation time from execution time.

So this column is simply the total time of every compile that has been performed, regardless of when or how it happened.
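
In other words, the column amounts to something like the following sketch, again against svl_compile and its per-segment start and end times, which I assume correspond to what this page uses.

  -- total time spent compiling, per query, summed over every compiled segment
  select query,
         sum(datediff(microsecond, starttime, endtime)) / 1000000.0 as compile_seconds
  from   svl_compile
  where  compile = 1
  group by query;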

durations:execute

It used to be compiling occurred in full, prior to execution. These days, compiling occurs in-line with execution, with each stream, if it needs it, performing compilation before it begins.

What this means is that the times recorded in the system tables, which are the start of execution and the end of execution, no longer mean what they used to mean.

When compilation occurred prior to execution, the start and end execution times showed how long the query took to execute.

These days, compilation time is mixed into the execution time, so you would need to subtract the compilation time.

Unfortunately, it is also now possible for a query to be executed more than once - for example, it might enter SQA, be executed for a while, then be stopped, and be queued again, and execute again, in a normal queue.

In this situation, it could be some or even all compilation occurs during the first execution in SQA, and so occurs potentially long before the second execution of the query.

I think, although I am not yet certain, that the execution start time is immediately prior to the beginning of the first compilation.

As such, in the usual case, where a query executes only the once, subtracting the compile time from the execution time should give the actual execution time.

In the cases with multiple execution, of course, this will be wrong.

Right now, I simply subtract the execution start from the execution end, and that’s the execution time.

You need to remember it includes compile time.
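
In terms of the system tables, then, the value is simply something like this sketch, using the exec start and end times in stl_wlm_query; remember the result includes compile time.

  -- execution duration as recorded by WLM (includes in-line compile time)
  select query,
         datediff(second, exec_start_time, exec_end_time) as execute_seconds
  from   stl_wlm_query;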

durations:perceived

This is the time from the query entering the WLM system, until the query completes - this is how long the user waits for their query to complete.

Note that it is possible for a query - with the same query ID, no less - to run more than once. I suspect in this case the times in the system tables for each run of the query are for that run only, and so the perceived time for the user is actually from the first run entering WLM to the final run completing.

Right now I do nothing about this - I look at only the current query.
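
A sketch of the perceived time, for completed worker node queries, joining the WLM entry time to the query end time; my assumption being that stl_query.endtime is the completion time used here.

  -- time from entering WLM to query completion - what the user experiences
  select q.query,
         datediff(second, w.service_class_start_time, q.endtime) as perceived_seconds
  from   stl_query          as q
         join stl_wlm_query as w on w.query = q.query;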

i/o:bytes read:disk

This column shows the total number of bytes read from disk, as best I can judge from the types indicated in stl_scan.

i/o:bytes read:network

This column shows the total number of bytes read from the network, as best I can judge from the types indicated in stl_scan.

Importantly, it only counts the receive side of network activity - the step is scan, after all, not broadcast, so we’re not counting bytes twice.
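
As a rough sketch of how such a split can be made from stl_scan - the particular type values used here, 2 for scans of permanent tables on disk and 1 for data arriving over the network, are my reading of the stl_scan docs, and as the text above says, judging the types is not an exact business.

  -- bytes read per query, split into disk and network, judging by scan type
  select query,
         sum(case when type = 2 then bytes else 0 end) as bytes_read_disk,
         sum(case when type = 1 then bytes else 0 end) as bytes_read_network
  from   stl_scan
  group by query;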

i/o:rows:inserted

The number of rows inserted into the table.

For tables with all distribution, this is the physical number of rows (i.e. one per node), not the logical number of rows.

i/o:rows:deleted

The number of rows deleted from the table.

For tables with all distribution, this is the physical number of rows (i.e. one per node), not the logical number of rows.
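
These two counts can be sketched from stl_insert and stl_delete, which record per-slice row counts - hence the physical, not logical, counts mentioned above.

  -- physical rows inserted and deleted, per query (per-slice rows, summed)
  select query,
         sum(rows_inserted) as rows_inserted,
         sum(rows_deleted)  as rows_deleted
  from   (
           select query, rows as rows_inserted, 0    as rows_deleted from stl_insert
           union all
           select query, 0    as rows_inserted, rows as rows_deleted from stl_delete
         ) as x
  group by query;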

i/o:rows:returned

The number of rows returned from the leader node to the SQL client.

i/o:bytes processed:in memory

This column shows the total number of bytes processed by the stl_aggr, stl_hash, stl_save, stl_sort and stl_unique steps, when running in memory rather than on disk.

i/o:bytes processed:on disk

This column shows the total number of bytes processed by the stl_aggr, stl_hash, stl_save, stl_sort and stl_unique steps, when running on disk rather than in memory.
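
The split is made on the is_diskbased flag carried by each of these step tables; a sketch, for a single query, where the query ID 123456 is just a placeholder.

  -- bytes processed by aggregate/hash/save/sort/unique steps,
  -- split by whether the step ran in memory or spilled to disk
  select sum(case when is_diskbased = 'f' then bytes else 0 end) as bytes_in_memory,
         sum(case when is_diskbased = 't' then bytes else 0 end) as bytes_on_disk
  from   (
           select is_diskbased, bytes from stl_aggr   where query = 123456
           union all
           select is_diskbased, bytes from stl_hash   where query = 123456
           union all
           select is_diskbased, bytes from stl_save   where query = 123456
           union all
           select is_diskbased, bytes from stl_sort   where query = 123456
           union all
           select is_diskbased, bytes from stl_unique where query = 123456
         ) as steps;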

text

The query text, truncated at the 48 character mark.