Abstract
The Table Query Language (TaQL) is an SQL-like high-level language for operations such as selection, sorting, update, and other modifications of a casacore table. It is a very versatile language
with full support for table columns containing array data. It has inherent support for masked arrays,
units, and astronomical coordinates. Its very rich set of functions (such as cone search and
array reduction) makes it very suitable for astronomical applications. User-defined functions can
be added easily. It also has full support for equi-joins, grouping/aggregation, and nested queries.
For example, matching two sky catalogues is an operation that can be expressed in a single command.
It can be used from C++, Python, and the Casacore program taql.
Version | Date | Changes |
1.0 | 1997 Feb 9 | Original version |
2.0 | 2010 Nov 5 | UPDATE, INSERT, DELETE and COUNT commands |
3.0 | 2015 Jul 29 | GROUPBY and HAVING clause |
3.1 | 2016 Apr 4 | Masked arrays; ALTER TABLE and SHOW commands |
3.2 | 2018 Feb 14 | WITH clause |
3.3 | 2018 Jun 10 | Meas support for frequency, doppler, radialvelocity, earthmagnetic |
3.4 | 2022 Jan 25 | LIKE clause in table or column creation, COPY COLUMN and DROP TABLE |
3.5 | 2022 Sep 20 | AROUND/IN intervals |
3.6 | 2022 Dec 20 | JOIN clause |
A pdf version of this note is available.
The Table Query Language (TaQL, rhymes with bagel, though some people pronounce it as tackle) is a language for querying and manipulating data in Casacore tables. It makes it possible to get information from the data content in the columns and keywords of arbitrary tables. It supports arbitrarily complex expressions including units, extended regular expressions, and many functions. User-defined functions written in C++ are supported; this mechanism is used to implement coordinate conversions in TaQL. TaQL also makes sorting and column selection possible. Furthermore, TaQL has commands to update, add, or delete rows, columns, and keywords in a table and to create, restructure, or delete a table.
The first sections of this document explain the syntax and show the options. The last sections give several examples and show the interface to TaQL using C++ or the program taql.
TaQL is modeled after SQL and contains a subset of SQL’s functionality. Some familiarity with SQL makes it easier to understand the TaQL syntax. The most important ways in which TaQL differs from SQL are:
TaQL has a keyword that makes it possible to time the various parts of a TaQL command.
A command can be followed by a semicolon and/or a comment, indicated by a leading hash-sign. They are ignored. For example:
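A sketch of such a command (the table name my.ms and the selection are illustrative):

```
SELECT FROM my.ms WHERE ANTENNA1 = 0;   # select the rows for the first antenna
```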
The available TaQL commands are shown below. The square brackets are not part of the syntax, but indicate the optional parts of the commands.
can be used to give some TaQL explanation or to show table information. A sole show command shows the possible options. HELP is a synonym for SHOW.
It can be used to get an optionally sorted subset from a table. It can also be used to do a subquery (see section 4.11 for more information on subqueries).
It can be used to update data in (a subset of) the first table in the first table list.
It can be used to add and fill new rows in the first table in the table list.
It can be used to delete some or all rows from the first table in the table list.
It can be used to count occurrences of column values. Although the command can still be used, it is
basically obsolete, because the same (and more) can be achieved with the GROUPBY clause and
aggregate functions in the SELECT command. Furthermore, GROUPBY is usually faster.
It can be used to calculate an expression, in which columns in a table can be used. It returns a list of values instead of a table.
It can be used to create a new table with the given columns and number of rows. Optionally specific table and data manager info can be given.
It can be used to drop (delete) one or more tables.
The subcommands of the ALTER TABLE command can be used to
Multiple such subcommands can be given, separated by white space.
The SELECT, COUNT, CREATE TABLE and ALTER TABLE commands can be used as a table in another command making it possible to directly use the resulting table. The following example creates a table (with column NAME) and puts values into the column.
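A sketch of such a combined command (the table name, column name, and values are illustrative; S denotes the string data type):

```
INSERT INTO [CREATE TABLE my.tab NAME S] VALUES ('CS001'), ('CS002')
```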
All TaQL commands, except SHOW, can be preceded by the clause
which can be used to create one or more temporary tables to be used in the subsequent clauses. Section 3.2 contains a detailed description of this clause.
The clauses and verbs in the commands are case-insensitive, but case is important in string values and in
names of columns and keywords. Whitespace (blanks and tabs) can be used at will. The HELP
command can be used to obtain brief information about each command and the available functions.
The SELECT command is fully explained in section 3.1 (Making a selection from a table).
The UPDATE, INSERT and DELETE commands are explained in section 7 (Modifying a table).
The CREATE TABLE command is explained in section 8 (Creating a table).
The DROP TABLE command is explained in section 9 (Removing a table).
The ALTER TABLE command is explained in section 10 (Modifying the table structure).
The COUNT command is explained in section 11 (Counting in a table).
The CALC command is explained in section 12 (Calculations on a table).
TaQL can be used from different languages, in particular Python and Glish. Each has its own conventions, which break down into three important categories:
The user can set the style (convention) to be used by preceding a TaQL statement with
The possible (case-independent) values are:
The following values are also possible and are described in the next subsections.
If multiple values are given for a category, the last one will be used. The default style used is GLISH, which is the way TaQL always worked before this feature was introduced.
It is important to note that the interpretation of axis numbers depends on the style being
used. E.g., when using glish style, axis numbers are 1-based and in Fortran order, thus axis 1 is
the most rapidly varying axis. When using python style, axis 0 is the most slowly varying axis.
Casacore arrays are in Fortran order, but TaQL maps them to the style being used. Thus when using python
style, the axes will be reversed (the data will not be transposed). Note: unless stated otherwise, all examples
in this document use the python style.
The style feature has to be used with care. A given TaQL statement will behave differently if used with another style.
The style clause can also be used to define synonyms for the library names of user defined functions. For example:
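A sketch of such a definition (the function mscal.ha1(), giving the hour angle of the first antenna, is illustrative):

```
USING STYLE PYTHON, MSCAL=DERIVEDMSCAL SELECT mscal.ha1() FROM my.ms
```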
defines the synonym mscal. Synonyms make it easier (i.e., less typing) to specify user defined functions.
Note that the synonym in the example above is automatically defined by TaQL, as is the synonym py for pytaql.
It is possible to get some tracing output during the execution of a TaQL command by using the case-insensitive value TRACE in the using style clause. It can be useful for debugging purposes.
It is possible to time a TaQL command by using the case-insensitive value TIME in the using style clause. For historical reasons it is also possible to use the case-insensitive keyword TIME before or after the optional style clause.
Timing shows the total execution time and the times needed for various parts of the TaQL command on stdout. For example:
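A sketch of a timed command (the table and column names are illustrative):

```
TIME SELECT DISTINCT ANTENNA1, ANTENNA2 FROM my.ms WHERE any(FLAG)
```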
shows the time to do the where part (i.e., row selection on FLAG), projection (selection of columns), and distinct (unique column values).
TaQL uses the following reserved words as part of its language.
Furthermore, some word combinations are reserved.
Note that the words in the TaQL vocabulary are case insensitive, thus the lowercase (or any mixed case) versions are also reserved.
The reserved words cannot directly be used as column name, keyword name, or unit. However, a reserved word can be used that way by escaping it with a backslash like \AS. When reading further, the meaning of
might become clear. It means: use unit in (inch) for column IN and test if it is in the given set.
Note this is unlike SQL where quotes have to be used to use a reserved word as a column name.
SELECT is the main TaQL command. It can be used to select a subset of rows and/or columns from a table and to generate new columns based on expressions.
As explained above, the result of a selection is usually a reference table. This table can be used as any other table, thus it is possible to do another selection on it or to update it (which updates the selected rows in the underlying original table). It is, however, not possible to insert rows in a reference table or to delete rows from it.
If the select column list contains expressions, it is not possible to generate a reference table. Instead a normal plain table is generated (which can take some time if it contains large data arrays). It should be clear that updating such a table does not update the original table.
The FROM clause can be omitted from the select. In that case no columns can be used in the selection, but functions like rand and rowid make variable output possible. Clauses like ORDERBY can be given. The GIVING (or INTO) clause can be used to store the result in a table.
The JOIN clause can be used to left join a table with another table. It is also possible to join tables on row number, in which case the tables involved must have the same number of rows. One can also join the main table of a MeasurementSet on row number with a subtable such as the ANTENNA table using a subquery. Joins are explained further in section 3.6.
The SELECT command consists of various clauses of which most are optional. The full command looks as follows where the optional parts are shown in square brackets.
The clauses are executed in a somewhat different order.
All clauses are explained in full detail in the subsequent sections.
Expressions in the various clauses will normally use column names to select, sort, or group a table. It is also possible to use table keywords or column keywords by giving their names. Furthermore, it is possible to use a column, created in the SELECT clause, in the HAVING and ORDERBY clauses. This can save time in both specifying and executing the command, because a possibly complicated expression can be used to create such a column. If such columns are used, that part of the SELECT is executed before HAVING.
TaQL uses the following lookup scheme for column/keyword names.
See the discussion of column names and keyword names for more details.
Sometimes it is useful to create a temporary table to be used in a SELECT command (or another TaQL command) containing subqueries. It can make the command more clear or it can optimize the command by factoring out subqueries that are used multiple times. For example:
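A sketch of how such a command might look (the table and column names are illustrative; gntrue counts the true values per group, and the outer brackets in the FROM clause concatenate the two subquery results):

```
WITH [SELECT ANTENNA1, ANTENNA2, gntrue(FLAG) AS NFLAG
      FROM my.ms GROUPBY ANTENNA1, ANTENNA2] AS t1
SELECT ANTENNA1 AS ANTENNA, gsum(NFLAG) AS NFLAGGED
  FROM [[SELECT FROM t1],
        [SELECT ANTENNA2 AS ANTENNA1, NFLAG FROM t1]]
  GROUPBY ANTENNA1
```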
This command counts the number of flagged visibilities per antenna. It looks somewhat complicated and can
only be fully understood once the entire TaQL note has been read.
The important thing is that the WITH command creates a small temporary table containing the number of
flagged visibilities per baseline. It is used twice in the subsequent SELECT command (by concatenation).
First for ANTENNA1, thereafter ANTENNA2. The above query command is about twice as fast as
something like
because it processes all flags in the MeasurementSet only once instead of twice.
It is important to note that, unlike the SQL WITH clause, the order is ’WITH table AS alias’. This is the
same order as used in the FROM clause.
Another property is that WITH is nestable, thus it can be used in a nested query.
Columns to be selected can be given as a comma-separated list with names of columns that have to be selected from the tables in the table_list (see below). If no column_list is given, all columns of the first table will be selected. A selection results in a so-called reference table. Optionally a selected column can be given another name in the reference table using AS name (where AS is optional). For example:
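For instance (the names are illustrative):

```
SELECT TIME, DATA AS MYDATA FROM my.ms
```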
It is possible to precede a column name with a table shorthand indicating which table in the FROM or JOIN clause has to be used. If not given, a column will be looked up in the first FROM table. Note that if equally named columns from different tables are used, one has to get a new name, otherwise a ’duplicate name’ error will occur. For example:
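A sketch of such a command (the second table name is illustrative); the second DATA column gets a new name to avoid a duplicate-name error:

```
SELECT t1.DATA, t2.DATA AS DATA2 FROM my.ms t1, other.ms t2
```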
Apart from giving exact column names, it is also possible to use wildcards by means of a UNIX filename-like
pattern (p/pattern/) or a regular expression (as f/regex/ for a full match or m/regex/ for a partial
match). They can be suffixed with an i indicating case-insensitive matching. See section 4.3.6 for a
discussion of these constants. An operator has to be given before the pattern or regex. Operator ~
means inclusion of the matching columns. Operator !~ means exclusion of the matching columns
included so far by means of a pattern or regex since the last explicit column name or expression.
A special pattern is * (which is the same as p/*/). If !~ is used at the first pattern or regex, it is assumed
that all columns are included (as if * was given before).
The pattern or regex (except *) can be preceded by a table shorthand denoting that the columns have to be
taken from that table. For example:
selects all columns except the ones ending in _DATA.
selects columns with a name containing DATA except the ones ending in _DATA.
does select the CORRECTED_DATA column (in the first case because it is explicitly selected).
Note it is not possible to change the name or data type of wildcarded columns.
It is also possible to use expressions in the column list to create new columns based on the contents of other columns. When doing this, the resulting table is a plain table (because a reference table cannot contain expressions). The new column can be given a name by giving AS name after the expression (where AS is optional). If no name is given, a unique name like Col_1 is constructed. After the name a data type string can be given for the new column. If no data type is given, the expression data type is used.
Note that unit conversion can be (part of) an expression. For example:
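A sketch, assuming the TIME column has unit s; giving the unit d after the expression converts the value to days:

```
SELECT TIME d AS TIMED FROM my.ms
```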
to store the time in unit d (days). Units are discussed in section 4.9.
It is possible to change the data type of a column by specifying a data type (see below) after the new column name. Giving a data type (even if the same as the existing one) counts as an expression, thus results in the generation of a plain table. For example:
Note that for subqueries the GIVING clause offers a better (faster) way of specifying a result expression. It also makes it possible to use intervals.
Special aggregate functions (e.g., gmin) exist to calculate an aggregated value (minimum in this example) per group of rows where the grouping is defined by the GROUPBY clause. The entire column is a single group if no GROUPBY is given. Aggregation is discussed in more detail in section 5.
If a column_list is given and if all columns (and/or expressions) are scalars, the column_list can be
preceded by the word DISTINCT. It means that the result is made unique by removing the rows with
duplicate values in the columns of the column_list. Instead of DISTINCT the synonym NODUPLICATES or
UNIQUE can also be used. To find duplicate values, some temporary sorting is done, but the original order of
the remaining rows is not changed.
Note that this keyword is mainly supported for SQL compliance. The same (and more) can be achieved
with the DISTINCT keyword in the ORDERBY clause, with the difference that ORDERBY DISTINCT will
change the order.
For full SQL compliance it is also possible to give the keyword ALL which is the opposite of DISTINCT, thus
all values are returned. This is the default. Because there is an ambiguity between the keyword ALL and
function ALL, the first element of the column list cannot be an expression starting with a parenthesis if the
keyword ALL is used.
If an expression in the column_list is a masked array, it is possible to create two columns from it: one for the data, one for the mask. This can be done by combining them in parentheses like (DATA,MASK). A possible data type given after the column names only applies to the data column, since the mask column always has data type Bool. For example:
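A sketch of such a command, assuming FLAG marks the elements to be excluded:

```
SELECT means(DATA[FLAG], 0) AS (MD, MM) C4 FROM my.ms
```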
The select results in a masked array containing the means along axis 0. Both column MD and MM are filled with the contents of the masked array. MD (with data type C4) contains the means over the first axis of the unmasked elements; MM contains the resulting mask.
This indicates that the ultimate result of the SELECT command should be written to a table (with the given name). This table can be a reference table, a plain table, or a memory table. It can also be a subtable of an existing table using the :: notation. For example:
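A sketch (the selection itself is illustrative):

```
SELECT FROM my.ms WHERE ANTENNA1 = 0 INTO my.tab::SUBTAB
```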
creates subtable SUBTAB in parent table ’my.tab’. It also defines the keyword SUBTAB in the parent table to refer to the subtable.
The table argument gives the name of the resulting table. It can be omitted if a memory table is created.
The options argument is optional and can be a single value or a list, enclosed in square brackets, consisting of values and key=value. They can be used to specify the table and storage type. All keys and values are case-insensitive.
For backward compatibility, it is possible to specify an option directly without having to use ’key=value’.
The standard TaQL way to define the output table is the GIVING clause. INTO is available for SQL compliance.
If the INTO (or GIVING) clause is not given, the query result will be written into a memory table. In this way queries done in a readonly directory will not fail if a result table cannot be created. However, if the result is expected not to fit in memory (which will seldom be the case), type SCRATCH should be used.
If the result is stored in a plain table, it is possible to give detailed data manager info for the result table using the DMINFO clause. See section 8.2 for how the data manager info can be specified.
It is also possible to store the result in a subtable of another table using the :: notation, which is similar to specifying an input subtable as described in the next subsection. This will also create a keyword in the main table referring to the subtable. For example:
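A sketch (the selection itself is illustrative):

```
SELECT FROM my.ms WHERE ANTENNA1 != ANTENNA2 GIVING my.ms::NEWSUB
```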
will create subtable NEWSUB of table my.ms and define keyword NEWSUB to refer to the subtable.
The FROM part defines the tables used in the query. It is a comma-separated list of tables, each followed by an optional shorthand (alias).
The full syntax is:
Similar to SQL and OQL the shorthand can also be given using AS or IN. E.g.
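Both forms might look like this (the table name is illustrative):

```
SELECT FROM my.ms AS t1
SELECT FROM t1 IN my.ms
```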
Note that if using IN, the shorthand has to precede the table name. It can be seen as an iterator variable.
The shorthand can be used in the query to qualify the table to be used for a column, for example t0.DATA. The first table in the list is the primary table which will be used if a column is not qualified by a shorthand. Often a query uses a single table in which case a shorthand is not needed. Multiple tables require a shorthand and are useful if:
If the table is a normal table with a fully alphanumeric name, the shorthand defaults to that name. In practice, however, a shorthand is always needed if multiple tables are used.
The FROM clause can be omitted, in which case the input is a virtual table with no columns. The number of rows in it is defined by the LIMIT and OFFSET values; it defaults to 1 row. It makes it possible to select column-independent expressions (such as function random) in the SELECT command. Note that these expressions do not need to be constant. For example:
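A sketch, assuming python style, in which rowid() gives the 0-based row numbers:

```
SELECT rowid() LIMIT 31
```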
creates a temporary table with column Col_1 and 31 rows containing the values 0..30.
A table can be given in a variety of ways.
In this example key is a table keyword of column col2 in table mytable (note that tab is the
shorthand for mytable and could be left out).
It can also be used for another table in the main query. E.g.,
In this example the keyword key1 is taken from the subtable given by the table keyword key in the
main table.
If a keyword is used as the table name, the keyword is searched in one of the tables previously given.
The search starts at the current query level and proceeds outwards (i.e., up to the main query level). If
a shorthand is given, only tables with that shorthand are taken into account. If no shorthand is given,
only primary tables are taken into account.
The colons refer to the latest table used, thus my.ms.
is a command to find shadowed antennas for the VLA. Without the query in the FROM command the
subqueries in the remainder of the command would have been more complex. Furthermore, it would
have been necessary to execute that select twice.
The command above is quite complex and cannot be fully understood before reading the rest of this
note. Note, however, that the command uses the shorthand TIMESEL to be able to use the temporary
table in the subqueries.
Also note the use of :: in the second line which refers to my.ms.
Finally note that the new WITH clause is an easier way to use temporary tables.
does a query on the three parts of an MS which are seen as a single table.
It is possible to use glob filename patterns in such a list. For example
is the same as the example above if no other files with such a name exist. An error is given if no table is found matching the pattern.
Subtables of the concatenated tables can be concatenated as well. Alternatively, they can be assumed to be the same for all tables meaning that the subtable of the concatenation is the subtable of the first table. For example, when partitioning a MeasurementSet in time, the ANTENNA subtable is the same for all parts, while the POINTING and SYSCAL subtables depend on time, thus have to be concatenated as well. Concatenation of subtables can be achieved by giving them as a comma-separated list of names after the SUBTABLES keyword. For example:
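A sketch of how this might look (the table names and selection are illustrative):

```
SELECT FROM [my.ms1, my.ms2 SUBTABLES POINTING, SYSCAL] WHERE ANTENNA1 = 0
```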
Usually the result of a TaQL query references the first table given in the FROM. In this example the
FROM table is the concatenation, which is only known during the query. In such a case the
concatenation must be made persistent, which can be done by using a GIVING (or INTO) inside
the concatenation specification. Only the table name can be given, because the persistent
concatenation only keeps the original table names; it does not make a copy of all data.
For example:
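A sketch of such a command (the part table names my.ms1 and my.ms2 are illustrative):

```
SELECT FROM [my.ms1, my.ms2 GIVING ms.conc]
 WHERE ANTENNA1 != ANTENNA2 GIVING ms.cross
```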
selects the cross-correlation baselines from the concatenation. Note the two GIVING clauses. The first one makes the concatenation persistent; the second one stores the query result in ms.cross, which references the matching rows in the persistent concatenation ms.conc, which in its turn references the original parts.
TaQL has various ways of joining tables which are described in the next sections.
The JOIN clause can be used to left join tables in order to find (meta) data. For example, a MeasurementSet
can be joined with its ANTENNA subtable to find the name or position of the antenna in each row in the
MeasurementSet. Columns from the join table can be used in any other clause (e.g., WHERE) in a TaQL
command.
TaQL only supports left joins: joins where each row in the left (main) table is matched
against a single row in the joined table. If no matching row is found, ’none’ values are used for data
taken from the join table. If multiple rows match, the first matching row is used. Other types of joins
(such as a full join) are not supported.
The full syntax is:
Multiple JOIN clauses can be given to join with multiple tables. It is also possible to join a joined table with another table. This is shown in an example below.
The table-list is similar to that in the FROM clause, thus one or more table names followed by a
shorthand. If multiple tables are given, they must have the same number of rows. The tables are used in the
join expression, so usually a shorthand is needed.
The join-expression tells how to join tables. It is an expression consisting of one or more subexpressions
separated by AND. Each subexpression compares a column in one table with a column (or the row number)
in the join table. For example for MSv3:
For each row in my.ms it finds the row in the ANTENNA subtable matching ANTENNA1. In the SELECT clause t2.NAME can be used to get, for instance, the name of the antenna like:
Another join can be used to get the name for ANTENNA2.
In MSv2 quite often the row number is used as an implicit id. This can also be used in a join like:
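A sketch, assuming python style, in which the row numbers given by rownumber() are 0-based:

```
SELECT FROM my.ms t1 JOIN ::ANTENNA t2 ON t1.ANTENNA1 = t2.rownumber()
```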
In MSv3 an explicit ANTENNA_ID is used instead of a row number. To make it possible to use the same query for an MSv2 and MSv3, the MSID function can be used for a column in a table. It takes a column name as argument. If the column exists, it will be used. Otherwise the rownumber function is used. A shorthand can be given before the column name. For example:
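A sketch of such a command:

```
SELECT FROM my.ms t1 JOIN ::ANTENNA t2 ON t1.ANTENNA1 = msid(t2.ANTENNA_ID)
```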
Besides joining using the equality operator, it is also possible to join using an interval in the join table. For example:
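A sketch of such a join, assuming the SYSCAL columns TIME and INTERVAL give the center and width of each interval:

```
SELECT FROM my.ms t1 JOIN ::SYSCAL t2
  ON t1.TIME AROUND t2.TIME IN t2.INTERVAL
```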
In this example the SYSCAL subtable contains information for time intervals defined by its center and width. The AROUND/IN interval specification can be used to find the matching row for each TIME in the main table of the MS.
Alas, finding SYSCAL information is not as easy as shown here, because it also depends on antenna and spectral window. So the full join is more complex. For MSv2 it should look like:
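A sketch of such a full join (assuming python style with 0-based rownumber(), and the standard MSv2 subtable columns):

```
SELECT FROM my.ms t1
  JOIN ::DATA_DESCRIPTION t2 ON t1.DATA_DESC_ID = t2.rownumber()
  JOIN ::SYSCAL t3
    ON t1.ANTENNA1 = t3.ANTENNA_ID
   AND t2.SPECTRAL_WINDOW_ID = t3.SPECTRAL_WINDOW_ID
   AND t1.TIME AROUND t3.TIME IN t3.INTERVAL
```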
First a join with the DATA_DESCRIPTION subtable is done to find the spectral window. Thereafter a join with the SYSCAL subtable is done using the antenna, spectral window and time.
Multiple tables can be given in the FROM or JOIN clause. They are implicitly joined on row number as long as a column is used from such a table. Hence they must have the same size. If only a keyword is used from a table, that table is not joined.
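A sketch selecting the rows in which the data differ (the tolerance 1e-5 is illustrative):

```
SELECT FROM mytable t1, othertable t2
 WHERE !all(near(t1.DATA, t2.DATA, 1e-5))
```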
This command can be used to check whether the data in mytable are about equal to the data in othertable. Both
tables must have the same number of rows.
The join is done on row number, thus the data in corresponding rows are compared.
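A sketch of how this might look, assuming the subquery result array can be indexed directly:

```
SELECT TIME, [SELECT NAME FROM ::ANTENNA][ANTENNA1] AS NAME1 FROM my.ms
```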
This example shows how a subquery is used to join the main table of a MeasurementSet with its
ANTENNA subtable. The subquery returns a list with the names of all antennae, which subsequently
is indexed with the antenna number to get the antenna name for each row in the main table.
The join is done using the ANTENNA1 column which gives the row number in the subtable, thus the index
in the subquery result.
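A sketch, assuming core station names start with CS (as for LOFAR):

```
SELECT FROM my.ms WHERE ANTENNA1 IN
  [SELECT rowid() FROM ::ANTENNA WHERE NAME ~ p/CS*/]
```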
This example shows another way to use a subquery for a join of the main table of a MeasurementSet with its ANTENNA subtable. It selects all baselines for which the first station is a core station. The subquery returns a set containing the ids of the core stations, which is used to select the correct stations in the main table.
Several UDFs in the derivedmscal library make it possible to easily join a MeasurementSet or CASA Calibration Table with a subtable like ANTENNA or SPECTRAL_WINDOW. These functions know which columns to use, making the join straightforward, as in
The library also contains the more general SUBCOL function, which makes it possible to join any table with a subtable. For example:
to get the parameter name for a LOFAR ParmDB table. A ParmDB table has a subtable NAMES containing the NAME and other info of a parameter. The column NAME_ID is used to reference that subtable.
It defines the selection expression, which must have a boolean scalar result. A row in the primary table is selected if the expression is true for the values in that row. The syntax of the expression is explained in section 4.
It defines how rows have to be grouped. Usually a result per group will be calculated using aggregate functions. A group consists of all rows for which the columns (or expressions) given in the group_list have the same value. The (aggregate) expressions in the SELECT clause are calculated for the entire group. In this way one can get, for example, the mean XX amplitude and the number of time slots per baseline like:
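A sketch, assuming python style, in which the last axis of DATA is the correlation axis and index 0 is XX:

```
SELECT ANTENNA1, ANTENNA2, gmean(amplitude(DATA[,0])) AS XXAMP,
       gcount() AS NTIME
  FROM my.ms GROUPBY ANTENNA1, ANTENNA2
```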
It results in a table containing nbaseline rows with in each row the antenna ids, mean amplitude, and
number of rows.
If no aggregate function is used for a column, the value of the last row in the group is used. Note that in this
example ANTENNA1 and ANTENNA2 are the same for the entire group. However, if TIME was also
selected, only the last time would be part of the result.
Note that each expression in the group_list has to result in a scalar value of type bool, integer, double, date,
or string.
Aggregate functions are discussed in more detail in section 5.
This clause can be used to select specific groups. Only the groups (defined by GROUPBY) are selected for
which the HAVING expression is true.
Note that HAVING can be given without GROUPBY, although that will hardly ever be useful. If no
GROUPBY is given, but the SELECT statement contains an aggregate function, the result is a single
group. HAVING cannot be used if neither GROUPBY nor SELECT aggregate functions are used.
It is discussed in more detail in section 5.
It defines the order in which the result of the selection has to be sorted. The sort_list is a comma-separated
list of expressions. It operates on the output of the SELECT, thus after a possible GROUPBY and HAVING
clause have been executed.
The sort_list can be preceded by the word ASC or DESC indicating if the given expressions are by default
sorted in ascending or descending order (default is ASC). Each expression in the sort_list can
optionally be followed by ASC or DESC to override the default order for that particular sort key.
To be compliant with SQL, whitespace can be used between the words ORDER and BY.
The word ORDERBY can optionally be followed by DISTINCT which means that only the first row of multiple rows with equal sort keys is kept in the result. To be compliant with SQL dialects the word UNIQUE or NODUPLICATES can be used instead of DISTINCT.
An expression can be a scalar column or a single element from an array column. In these cases some
optimization is performed by reading the entire column directly.
It can also be an arbitrarily complex expression with exactly the same syntax rules as the expressions in
the WHERE clause. The resulting data type of the expression must be a standard scalar one,
thus it cannot be a Regex or DateTime (see below for a discussion of the available data types).
E.g.
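A sketch of such a sort on an expression (ordering the rows by baseline length, assuming python style slicing of the UVW column):

```
SELECT FROM my.ms ORDERBY sumsqr(UVW[:2])
```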
It indicates which of the matching and sorted rows should be selected. If not given, all of them are selected.
The word TOP can also be used instead of LIMIT.
LIMIT and OFFSET are applied after ORDERBY and SELECT DISTINCT, so they are particularly useful in
combination with those clauses to select, for example, the highest 10 values.
It can be given in two ways:
For example:
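A sketch of the slice form (the table and column names are illustrative):

```
SELECT FROM my.ms ORDERBY DISTINCT TIME LIMIT 10:12
```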
sorts uniquely by time, skips the first 10 rows, and selects the next two rows.
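The increment form might look like:

```
SELECT FROM my.ms LIMIT ::100
```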
selects every 100-th row.
It indicates that the ultimate result of the SELECT command should be written to a table (with the given
name).
Another (more SQL compliant) way to define the output table is the INTO clause. See INTO for a more
detailed description including the possible types.
It is also possible to specify a set in the GIVING clause instead of a table name. This is very useful if the result of a subquery is used in the main query. Such a set can contain multiple elements. Each element can be a single value, range, and/or interval, as long as all elements have the same data type. The parts of each element have to be expressions resulting in a scalar.
In the main query and in a query in the FROM or JOIN clause the GIVING clause can only result in a
table and not in a set.
To be compliant with SQL dialects, the word SAVETO can be used instead of GIVING. Whitespace can be
given between SAVE and TO.
An expression is the basic building block of TaQL. Expressions are similar to those in other languages. An expression is formed by applying an operator or a function to operands, which can be a table column or keyword, a constant, or a subexpression. An operand can be a scalar value, an array, or a set. The next subsections discuss them in detail.
An expression can be used in several places:
The expression in the clause can be as complex as one likes using arithmetic, comparison, and logical
operators. Parentheses can be used to group subexpressions.
The operands in an expression can be table columns, table keywords, constants, units, functions, sets and
intervals, and subqueries.
The index operator can be used to take a single element or a subsection from an array expression.
For example,
The last example shows a set with a continuous interval.
Internally TaQL uses the following data types:
Scalars and arbitrarily shaped arrays of these data types can be used. However, arrays of Regex are not possible.
If an operand or function argument with a non-matching data type is used, TaQL will do the following
automatic conversions:
- from Integer to Double or Complex.
- from Double to Complex.
- from String or Double to DateTime.
In this document some special data types are used when describing the functions.
- Real means Integer or Double.
- Numeric means Integer, Double, or Complex.
- DNumeric means Double or Complex.
TaQL supports any possible data type of a table column or keyword. In some commands (column list and CREATE TABLE) columns are created where it is possible to specify the data type of a column. The following case-insensitive values can be used to specify a type:
The TIME type is a special data type. It means that the column gets data type DOUBLE and that a MEASINFO record will be defined in the column keywords to designate the column as an epoch.
TaQL supports the use of extended regular expressions and string distances. They can be specified in various ways as discussed in section 4.3.6. There are three basic types of regular expressions.
matches 3c_ and 3c_xx, but not 3caxx.
For example:
do the same as the pattern examples above.
Furthermore it is possible to specify maximum string distances (known as the Levenshtein or edit distance). This is explained in section 4.3.6.
Scalar constants of the various data types can be formed in a way similar to Python and Glish. Array constants can be formed from scalar constants.
A Bool constant is the value T or F (both in uppercase) or the value true or false (any case).
An integer constant is a numeric value without decimal point or exponent. It can also be given as a hexadecimal value like 0xffff.
A floating-point constant is given with a decimal point and/or exponent. 'E' or 'e' can be used to
specify the exponent. An integer number followed by a unit is also regarded as a double constant.
Another way to define a Double constant is by means of a Time or Position. Such a constant is always
converted to radians. It can be given in several ways:
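The conversion itself is straightforward; a minimal Python sketch of what TaQL does when turning a Time (HMS) or Position (DMS) constant into radians (the function names are illustrative, not TaQL's):

```python
import math

def hms_to_rad(h, m, s):
    """Convert hours, minutes, seconds of time to radians
    (24h corresponds to 2*pi)."""
    hours = h + m / 60.0 + s / 3600.0
    return hours / 24.0 * 2.0 * math.pi

def dms_to_rad(d, m, s):
    """Convert degrees, arcminutes, arcseconds to radians."""
    sign = -1.0 if d < 0 else 1.0
    degrees = abs(d) + m / 60.0 + s / 3600.0
    return sign * math.radians(degrees)

print(hms_to_rad(12, 0, 0))    # half a day -> pi
print(dms_to_rad(-30, 0, 0))   # -30 degrees -> -pi/6
```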
The imaginary part of a Complex constant is formed by an Integer or Double constant immediately followed by a lowercase i or j. A full Complex constant is formed by adding another Integer or Double constant as the real part. E.g.
Note that a full Complex constant has to be enclosed in parentheses if, say, a multiplication is performed on it. E.g.
A String constant has to be enclosed in " or ' and can be concatenated (as in C++). E.g.
A regular expression constant can be given directly or using a function.
All examples but the last one do the same: matching a name starting with 3c or 3C.
The last example shows a glob-style pattern to find files on /usr not ending in .h or .cc.
do the same.
Case-insensitive matching can only be done as shown in the example above by downcasing the string to
be matched.
Please note that these functions are not limited to constants. They can also be used to form regular
expressions from variables.
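As an analogue of how a glob-style pattern relates to a regular expression, Python's fnmatch can do a similar translation (illustrative only; TaQL's own translation rules may differ in detail):

```python
import fnmatch
import re

# A glob-style pattern and the regular expression derived from it.
pat = '3[cC]*'
rx = re.compile(fnmatch.translate(pat))

print(bool(rx.match('3C48')))   # True
print(bool(rx.match('4C48')))   # False

# Case-insensitive matching by downcasing the string to be matched:
print(bool(re.compile(fnmatch.translate('3c*')).match('3C48'.lower())))
```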
A maximum string distance constant can be specified in a similar way. Such a distance is known as the Levenshtein or edit distance. It measures the similarity of two strings by counting the minimum number of edits (deletions, insertions, substitutions, and swaps of adjacent characters) needed to make the strings equal.
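A minimal sketch of this distance (the variant with adjacent swaps, often called Damerau-Levenshtein); the function name is illustrative:

```python
def edit_distance(a, b):
    """Edit distance with deletions, insertions, substitutions,
    and swaps of adjacent characters."""
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # substitution
            if i > 1 and j > 1 and a[i-1] == b[j-2] and a[i-2] == b[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)  # adjacent swap
    return d[len(a)][len(b)]

print(edit_distance("color", "colour"))  # 1 (one insertion)
print(edit_distance("abcd", "abdc"))     # 1 (one swap)
```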
This tests if the strings in the given column are within the maximum distance of the string given in the constant. The following qualifiers can be given (in any order):
A DateTime constant can be formed in two ways:
A DateTime constant with the current date/time can be made by using the function datetime without arguments.
N-dimensional arrays of all data types can be created with the exception of regular expressions.
It is possible to form a 1-dimensional array from a constant bounded discrete set. When needed such a set is
automatically transformed to an array. E.g.
The first example results in an integer array of 10 elements with values 0..9. The others result in a string array of 3 elements. The second version already shows that strings can be concatenated (as explained further on).
A multi-dimensional array can be formed by giving a set of arrays. A nested list resembles the numpy way. For example:
results in a 2-dim array. However, it is also possible to use arrays created in other ways such as arrays in a column or arrays created with the array function described below. For example:
results in a 3-dim array.
Furthermore it is possible to use the array function to create an array of any shape. The values are given in the first argument as a scalar, set, or another array. The shape is given in the latter arguments as scalars or as a set. The array is initialized to the values given which are wrapped if the array has more elements.
The first examples create an array with shape [10,4] containing the values 1..10 in each line. The latter results in a boolean array having the same shape as the DATA array and filled with False.
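A sketch of the wrapping behaviour of the array function, restricted to the 2-dim case for brevity (the helper name is illustrative):

```python
from itertools import cycle, islice

def taql_array(values, shape):
    """Fill an array of the given shape with the values,
    cycling (wrapping) them if the array has more elements."""
    nrow, ncol = shape
    flat = list(islice(cycle(values), nrow * ncol))
    # reshape the flat list row by row
    return [flat[i * ncol:(i + 1) * ncol] for i in range(nrow)]

print(taql_array([1, 2, 3], (2, 4)))  # [[1, 2, 3, 1], [2, 3, 1, 2]]
```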
An array can have an optional mask. Similar to numpy’s masked array, a mask value True means that the
value is masked off, thus not taken into account in reduce functions like calculating the mean.
Note that this definition is the same as the FLAG column in a MeasurementSet, but is different from a mask
in a Casacore Image where True means good and False means bad.
All operations on arrays will take the possible mask into account. Reduce functions like median only use
the unmasked array elements. Furthermore, partial reduce functions like medians will set an
output mask element to True if the corresponding input array part has no unmasked elements.
Operators like + and functions like cos operate on all array elements. The mask in the resulting array is
the logical OR of the input masks. Of course, the result has no mask if no input array has a
mask.
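These rules can be sketched in plain Python (mask value True means masked off, as in numpy; the function names are illustrative):

```python
def masked_mean(values, mask):
    """Mean of the unmasked elements only; 0 if everything is masked."""
    kept = [v for v, m in zip(values, mask) if not m]
    return sum(kept) / len(kept) if kept else 0.0

def add_masked(a, amask, b, bmask):
    """Elementwise addition; the output mask is the OR of the input masks."""
    return ([x + y for x, y in zip(a, b)],
            [ma or mb for ma, mb in zip(amask, bmask)])

print(masked_mean([1.0, 2.0, 9.0], [False, False, True]))  # 1.5
print(add_masked([1, 2], [True, False], [3, 4], [False, False]))
```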
A masked array is created by applying a boolean array to an array using the square brackets operator. Both arrays must have the same shape. For example:
The first example applies the FLAG column in a MeasurementSet to the DATA column. The second example masks off high DATA values.
The functions arraydata, arraymask, and flatten can be used to get the array data or mask. The last one flattens the array to a vector while removing all masked elements.
The TaQL commands putting values into a table accept two columns (in parentheses) for a masked array. This is described in more detail in the appropriate sections. For example:
to write the data averaged over the first axis (frequency channel) into column MD. Only the unflagged data points are taken into account. The output contains the resulting flags in column MM; a flag is set to True if all channels were flagged.
A cell in a table column containing variable shaped arrays can be empty. Such a cell does not contain an array and is represented in TaQL as a null array. Note it is different from a cell containing an empty array, which is an array without values.
Null arrays can be used with any operator and in any function. If one of the operands or function arguments is a null array, the result will be a null array; only array functions reducing to a scalar (such as sum and mean) give a valid value (usually 0).
The UPDATE and INSERT commands will ignore a null array result; no value is written in that row.
A table column can be used in a query by giving its name in the expression, possibly qualified with a table shorthand name. A column can contain a scalar or an array value of any data type supported by the table system. It will be mapped to the available TaQL data types. If the column keywords define a unit for the column, the unit will be used by TaQL.
The name of a column can contain alphanumeric characters and underscores. It should start with an
alphabetic character or underscore. A column name is case-sensitive.
It is possible to use other characters in the name by escaping them with a backslash. e.g., DATE\-OBS.
In the same way a numeric character can be used as the first character of the column name. e.g., \1stDay.
A reserved word cannot be used directly as a column name. It can, however, be used by escaping it with a
backslash. e.g., \IN.
Note that in programming languages like C++ and Python a backslash itself has to be escaped by another backslash. e.g., in Python:
tab.query('DATE\\-OBS > 10MAR1996').
If a column contains a record, one has to specify a field in it using the dot operator; e.g., col.fld means
use field fld in the column. It is fully recursive, so col.fld.subfld can be used if field fld is a record in
itself.
Alas, records in columns are not really supported yet. One can specify fields, but thereafter an error message
will be given.
Usually a column used in an expression will be a column in one of the tables specified in the FROM or JOIN clause. However, it is possible to use a column created in the SELECT clause, in expressions given in the HAVING or ORDERBY clause. In fact, a column name not preceded by a table shorthand, is first looked up in the SELECT columns and thereafter in the first FROM table.
It can be advantageous to use a SELECT column if that column is an expression; it saves both typing and execution time, because that expression is executed only once.
It is possible to use table or column keywords, which can have a scalar or an array value or a record, possibly
nested. A table keyword has to be specified as ::key. In an expression the :: part can be omitted if
there is no column with the same name. A column keyword has to be specified as column::key.
Note that the :: syntax is chosen, because it is similar to the scope operator in C++.
As explained in the FROM clause, keywords in the primary table and in other tables can be used. If
used from another table, it has to be qualified with the (shorthand) name of the table. E.g.,
sh.key or sh.::key
takes table keyword key from the table with the shorthand name sh.
If a keyword value is a record, it is possible to use a field in it using the dot operator. e.g., ::key.fld to use field fld. It is fully recursive, so if the field is a record in itself, a subfield can be used like col::key.fld.subfld
A keyword can be used in any expression. It is evaluated immediately and transformed to a constant value.
TaQL has a fair amount of operators which have the same meaning as their C and Python counterparts. The operator precedence order is:
Operator names are case-insensitive. For SQL compliancy some operators have a synonym.
All operators can be used for scalars and arrays and a mix of them. Note that arrays of regular expressions cannot be used.
The following table shows all available operators and the data types that can be used with them.
Operator | Data Type | Description |
** | numeric | power. It is right associative, thus 2**1**2 results in 2. |
* | numeric | multiplication |
/ | numeric | non-truncated division, thus 1/2 results in 0.5 |
// | real | truncated division (a la Python) resulting in an integer, thus 1./2. results in 0 |
% | real | modulo; 3.5%1.2 results in 1.1; -5%3 results in -2 |
+ | no bool | addition. If a date is used, only a real (converted to unit day) can be added to it. String addition means concatenation. |
- | numeric,date | subtraction. Substracting a date from a date results in a real (with unit day). Subtracting a real (converted to unit day) from a date results in a date. |
& | integer | bitwise and |
| | integer | bitwise or |
^, XOR | integer | bitwise xor |
==, = | all | comparison for equal. The norm is used when comparing complex numbers. |
> | no bool | comparison for greater |
>= | no bool | comparison for greater or equal |
< | no bool | comparison for less |
<= | no bool | comparison for less or equal |
!=, <> | all | comparison for not equal |
~= | numeric | shorthand for the NEAR function with a tolerance of 1e-5 |
!~= | numeric | shorthand for NOT NEAR with a tolerance of 1e-5 |
&&, AND | bool | logical and |
||, OR | bool | logical or |
!, NOT | bool | logical not |
~ | integer | bitwise negation |
+ | numeric | unary plus |
- | numeric | unary minus |
~ | string | test if string matches a regular expression constant. |
!~ | string | test if string does not match a regular expression constant. |
(I)LIKE | string | test if a string matches an SQL pattern (I for case-insensitive). |
NOT (I)LIKE | string | test if string does not match an SQL pattern. |
IN | all | test if a value is present in a set of values, ranges, and/or intervals. (See the discussion of sets). |
NOT IN | all | negation of IN |
BETWEEN | no bool | x BETWEEN b AND c is similar to x>=b AND x<=c and x IN [b=:=c] |
NOT BETWEEN | no bool | x NOT BETWEEN b AND c is the negation of above. |
AROUND | real,datetime | x AROUND mid IN width is similar to x>=mid-width/2 AND x<=mid+width/2 and x IN [mid<:>width] |
NOT AROUND | no bool | x NOT AROUND mid IN width is the negation of above. |
INCONE | | cone search. (See the discussion of cone search functions). |
NOT INCONE | | negation of INCONE |
EXISTS | | test if a subquery finds at least N matching rows. The value for N is taken from its LIMIT clause; if LIMIT is not given it defaults to 1. The subquery loop stops as soon as N matching rows are found. E.g. EXISTS(select from ::ANTENNA where NAME='somename' LIMIT 2) results in true if at least 2 matching rows in the ANTENNA table were found. |
NOT EXISTS | | negation of EXISTS |
As in SQL the operator IN can be used to do a selection based on a set. E.g.
The result of operator IN is true if the column value matches one of the values in the set. A set can contain any data type except a regex.
This example shows that (in its simplest form) a set consists of one or more values (which can be arbitrary expressions) separated by commas and enclosed in square brackets. The elements in a set have to be scalars and their data types have to be the same or convertible to a common data type. The square brackets can be left out if the set consists of only one element. For SQL compliance parentheses can be used instead of square brackets if the set contains more than one element.
An array is also a set, so IN can also be used on an array like:
where expr1 is the array result of some expression. It is also possible to use a scalar as the righthand of operator IN. So if expr1 is a scalar, operator IN gives the same result as operator ==.
The lefthand operand of the IN operator can also be an array or set. In that case the result is a boolean array telling for each element in the lefthand operand if it is found in the righthand operand.
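These IN semantics can be sketched in Python (the function name is illustrative):

```python
def taql_in(left, right):
    """Sketch of TaQL's IN: a scalar lefthand gives a scalar bool;
    an array lefthand gives a bool per element. The righthand may
    be a scalar or a set/array."""
    rset = right if isinstance(right, (list, tuple, set)) else [right]
    if isinstance(left, (list, tuple)):
        return [v in rset for v in left]
    return left in rset

print(taql_in(3, [1, 2, 3]))       # True
print(taql_in([1, 4], [1, 2, 3]))  # [True, False]
print(taql_in(3, 3))               # scalar righthand acts like ==
```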
An element in a set can be more complicated than a single value. It can define multiple discrete values and a continuous interval. The possible forms of a set element are:
These examples show constants only, but start, end, and incr can be any expression.
Note that :: used here can conflict with the :: in the keywords. e.g., a::b is scanned as a keyword
specification. If the intention is start::incr, whitespace should be used as in a: :b. In practice this
conflict will hardly ever occur.
which is equal to
Thus the BETWEEN/AND and AROUND/IN syntax can be used directly to define a single interval or be used to define intervals in a set.
It is very important to note that the 2nd form of set specification results in discrete values, while the 3rd and later forms result in continuous intervals.
Each element in a set can have its own form, i.e., one element can be a single value while another can be an interval. If a set consists of single or bounded discrete start:end:incr values only, the set will be expanded to an array. This makes it possible for array operators and functions (like mean) to be applied to such sets. E.g.
If a set on the right side of the IN operator contains a single element (either a value, range, or interval), it does not need to be enclosed in square brackets or parentheses.
Another form of constructing a set is using a subquery as described in section 4.11.
It is possible to take a subsection or a single element from an array column, keyword or expression using the index operator [index1,index2,...]. This syntax is similar to that used in Python or Glish. Similar to Python a negative value can be given meaning counting from the end. However, a negative stride cannot be given. Taking a single element can be done as:
Taking a subsection can be done as:
If a start value is left out it defaults to the beginning of that axis. An end value defaults to the end of the
axis and an increment defaults to one. If an entire axis is left out, it defaults to the entire axis.
E.g., an array with shape [10,15,20] can be subsectioned as:
The examples show that an index can be a simple constant (as it will usually be). It can also be an
expression which can be as complex as one likes. The expression has to result in a real value which will be
truncated to an integer.
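Since the index operator is modelled on Python, plain Python slicing shows the same defaults and negative-index behaviour (with the one exception noted in the comment):

```python
a = list(range(10))  # stand-in for one axis of an array

print(a[3])         # single element
print(a[2:8:2])     # start:end:incr -> [2, 4, 6]
print(a[:5])        # a left-out start defaults to the beginning
print(a[-2])        # negative index counts from the end -> 8
# Unlike Python, TaQL does not allow a negative stride.
```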
For fixed shaped arrays checking if array bounds are exceeded is done at parse time. For variable shaped
arrays it can only be done per row. If array bounds are exceeded, an exception is thrown. In the future a
special undefined value will be assigned if bounds of variable shaped arrays are exceeded to prevent the
selection process from aborting due to the exception.
Note that the index operator will be applied directly to a column. This results in reading only the required part of the array from the table column on disk. It is, however, also possible to apply it to a subexpression (enclosed in parentheses) resulting in an array. E.g.
can both be used and have the same result. However, the first form is faster, because only a
single element is read (resulting in a scalar) and 1 is added to it. The second form results in
reading the entire array. 1 is added to all elements and only then the requested element is taken.
From this example it should be clear that indexing an array expression has to be done with
care.
TaQL has full support of units, both basic and compound units. Each value or subexpression can be followed by a unit telling that the value or subexpression result gets that unit or will be converted to that unit. All basic units supported by module Quanta can be used. Compound units (such as ’m/s’) can be given as well or are formed by a TaQL expression with units (such as ’10m/30s’). Note that units are case sensitive. Most common units use lowercase characters. A basic unit can be preceded by a scaling prefix (like k for kilo). The basic units and prefixes can be shown using the show units command of the program taql.
Most basic units can be given literally (i.e., as an unquoted string) after a value or subexpression.
Whitespace between value or subexpression and unit is optional.
A compound unit can be given literally if it only contains digits, underscores and/or dots (e.g., m2, fl_oz, or
m.m). Otherwise the unit has to be quoted (e.g., 'm/s') or escaped with a backslash (e.g., m\/s). Whitespace
between value or subexpression and compound unit is mandatory unless the unit is quoted.
For example:
Units can be converted to another (conforming) unit by giving that unit after a (sub)expression. E.g., 3 deg rad converts 3 degrees to radians. Note that the empty string ('') is an empty unit, which can be used to make a value unitless. If ever needed, it can be used to set a non-conforming unit for a value. E.g. (3deg '') kg.
There is no real distinction between giving a unit as part of a value (as in 3deg) or using whitespace
between value and unit (as in 3 deg). Also composite units (enclosed in quotes) can be given right after a
value without whitespace.
However, a few units are identical to reserved TaQL keywords (e.g., ’in’ for inch or ’as’ for arcsecond). Such
units have to be quoted or escaped with a backslash, unless given after a value without whitespace (as in
3in). For example:
Arguments to functions such as sin are converted to the appropriate unit (radians) as needed. In a similar way, the units of operands to operators like addition, will be converted as needed to make their units the same. An exception is thrown if a unit conversion is not possible.
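A minimal sketch of this conversion logic, restricted to a few angular units (the table and function names are illustrative, not TaQL's internals):

```python
import math

# scale factors to radians for a few conforming angular units
TO_RAD = {"rad": 1.0, "deg": math.pi / 180.0, "arcmin": math.pi / 10800.0}

def convert(value, unit, to_unit):
    """Convert a value between conforming angular units."""
    return value * TO_RAD[unit] / TO_RAD[to_unit]

def sin_quantity(value, unit):
    """Like TaQL's sin on a value with a unit: convert to radians first."""
    return math.sin(convert(value, unit, "rad"))

print(convert(3.0, "deg", "rad"))   # what '3 deg rad' does in TaQL
print(sin_quantity(90.0, "deg"))    # 1.0
```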
Units can be given (or derived) in various ways.
Units will probably mostly be used in an expression in the WHERE clause or in a CALC command. However, it is also possible to use a unit in the selection of a column in the SELECT clause. For example:
In such a case the selection is an expression and the unit is stored in the column keywords. Thus in this example, TIME is stored in a column TIMED with keyword QuantumUnits=d and the values are converted to days.
More than 200 functions exist to operate on scalar and/or array values. Some functions have two names. One
name is the CASA/Glish name, while the other is the name as used in SQL. In the following tables the
function names are shown in uppercase, while the result and argument types are shown in lowercase. Note,
however, that function names are case-insensitive.
Furthermore it is possible to have user defined functions that are dynamically loaded from a shared library.
In section Writing user defined functions it is explained how to write user defined functions.
A set of standard UDFs exists dealing with Measure conversions, for example to convert J2000 to
apparent. Another set of UDFs deals with values and relations in MeasurementSets and Calibration
Tables.
Sets, and in particular subqueries, can result in a 1-dim array. This means that the functions accepting an array argument can also be used on a set or the result of a subquery.
These functions can be used on a scalar or an array argument.
Apart from using regex/pattern constants, it is possible to use functions to form a regex or pattern. These functions can only be used on a scalar argument.
A regex formed this way can only be used in a comparison == or !=. E.g.
object == pattern('3C*')
to find all 3C objects in a catalogue.
A few remarks:
These functions make it possible to handle dates/times and can be used on a scalar or an array argument. The syntax of a date/time string or constant is explained in section 4.3.7.
All functions can be used without an argument in which case the current date/time is used. e.g., DATE() results
in the current date.
It is possible to give a string argument instead of a date. In this case the string is parsed and converted to a
date (i.e., the function DATETIME is used implicitly).
Note that the function STR discussed in the next section can also be used for pretty-printing a date/time. It
gives more control over the number of decimals and date format.
Angles (scalar or array) can be returned as strings in HMS and/or DMS format. Currently, they are always formatted with 3 decimals in the seconds.
The functions mentioned above and the date/time functions in the previous subsection can format a value in a
predefined way only.
The STRING (shorthand STR) function makes it possible to convert values to strings using an optional format
string or width.precision value. It also makes it possible to format dates, times, and angles in a variety of
ways.
The value can be of any type (except Regex) and can be a scalar or array. The optional format must be a
scalar string or numeric value. If no format is given, an appropriate default format will be used.
By default a value is right adjusted, but can be left adjusted by giving a negative width.
A bool value is prettified as 'True' or 'False'. Using format '%d' it is prettified as 1 or 0.
Note that precision represents all digits, not only the ones behind the decimal point. Thus 10.3 is not the same as '%10.3f', as the latter defines 3 decimals.
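Python's format mini-language makes the same distinction, so it can illustrate the difference between "all digits" precision and "decimals" precision:

```python
x = 12.3456

# 'g' counts significant digits (like TaQL's width.precision),
# 'f' counts only the decimals (like a printf %w.pf format).
print(format(x, '10.3g'))    # 3 significant digits, width 10
print(format(x, '10.3f'))    # 3 decimals, width 10
print(format(x, '<10.3g'))   # left adjusted (negative width in TaQL)
```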
Apart from a printf-style format string, it is also possible to define a string to format
date/time and angle values (which are automatically converted to radians if containing units).
Such a format string contains one or more format values as defined in class MVTime. A vertical bar or
a comma (with optional whitespace) must be used as separator; they cannot be mixed. A numeric value
can be part of the string to define the precision of the time/angle. The default precision is 6 (thus
hh:mm:ss).
The optional time/angle formats and modifiers are:
Format | Description |
ANGLE | +ddd.mm.ss.ttt |
TIME | hh:mm:ss.ttt |
ALPHA | use d,m instead of . in angles and h,m instead of : in times |
YMD | yyyy/mm/dd/hh:mm:ss.sss |
YMD_ONLY | YMD without the time (same as YMD,NO_TIME) |
DMY | dd-Mon-yyyy/hh:mm:ss.sss |
FITS | yyyy-mm-ddThh:mm:ss.sss |
ISO | yyyy-mm-dd hh:mm:ss.sssZ (same as FITS,USE_Z,USE_SPACE,CLEAN) |
BOOST | the same as DMY,USE_SPACE |
NO_H, NO_D | suppress the output of hours (or degrees): useful for offsets |
NO_HM, NO_DM | suppress the degrees and minutes |
CLEAN | suppress leading or trailing periods or colons if not all time/angle parts are printed (e.g., when giving NO_H or 4 decimals) |
DAY | precede the output with Day- (e.g., Wed-) |
NO_TIME | suppress printing of time |
USE_SPACE | use a space between date and time (and day and date) |
USE_Z | put a Z after the time to denote UTC |
DIG2 | get angle/time in range -90:+90 or -12:+12 |
LOCAL | local time; in FITS mode append time zone as +hh:mm |
For example:
If such a format string contains an invalid part, it is assumed that the entire string is a printf-style format string.
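As rough analogues of some of these date/time formats, Python's strftime can produce the same layouts (illustrative only; the exact MVTime output may differ in decimals):

```python
from datetime import datetime, timezone

t = datetime(2024, 1, 25, 12, 34, 56, tzinfo=timezone.utc)

print(t.strftime('%Y-%m-%dT%H:%M:%S'))        # FITS: yyyy-mm-ddThh:mm:ss
print(t.strftime('%Y-%m-%d %H:%M:%S') + 'Z')  # ISO: space separator, trailing Z
print(t.strftime('%d-%b-%Y/%H:%M:%S'))        # DMY: dd-Mon-yyyy/hh:mm:ss
```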
The exact comparison of floating point values is quite tricky. Two functions make it possible to compare 2
double or complex values with a tolerance. They can be used on scalar and array arguments (and a mix of
them). The tolerance must be a scalar though.
Note that operator ~= is the same as NEAR with a tolerance of 1e-5.
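The two kinds of tolerant comparison (relative and absolute) can be sketched with Python's stdlib (the function names mirror TaQL's NEAR/NEARABS, but the bodies are illustrative):

```python
import math

def near(a, b, tol=1e-5):
    """Relative closeness test, like TaQL's NEAR (and operator ~=)."""
    return math.isclose(a, b, rel_tol=tol)

def nearabs(a, b, tol=1e-5):
    """Absolute closeness test, like TaQL's NEARABS."""
    return abs(a - b) <= tol

print(near(1.0, 1.0 + 1e-7))   # True
print(near(1.0, 1.1))          # False
print(nearabs(1.0, 1.000001))  # True
```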
Standard mathematical functions can be used on scalar and array arguments (and a mix of them).
Note that the arguments or results of the trigonometric functions are in radians. They are converted automatically if units are given.
The following functions reduce an array to a scalar. They are meant for an array, but can also be used for a scalar.
These functions reduce an array to a smaller array by collapsing the given axes using the given function. The
axes are the last argument(s). They can be given in two ways:
- As a single set argument; for example, maxs(ARRAY,[1,2])
- As individual scalar arguments; for example, maxs(ARRAY,1,2)
For example, using MINS(array,0,1) for a 3-dim array results in a 1-dim array where each value is the
minimum of each plane in the cube.
It is important to note that the interpretation of the axes numbers depends on the style being
used. e.g., when using glish style, axes numbers are 1-based and in Fortran order, thus axis 1 is
the most rapidly varying axis. When using python style, axis 0 is the most slowly varying axis.
Axes numbers exceeding the dimensionality of the array are ignored. For example, maxs(ARRAY,[1:10])
works for arrays of virtually any dimensionality and results in a 1-dim array.
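A plain-Python sketch of a partial reduce on a nested list (python style, where the last axis varies most rapidly; illustrative only):

```python
# A [2,2,2] cube as nested lists, python style.
cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

# Collapse the last axis with max, analogous to maxs(ARRAY, 2)
# in python style: one maximum per remaining [2,2] position.
maxs_last = [[max(row) for row in plane] for plane in cube]
print(maxs_last)  # [[2, 4], [6, 8]]
```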
The function names are the 'plural' forms of the functions in the previous section. They can only be used for
arrays, thus not for scalars.
These functions are a generalization of the functions in the previous section. They downsample an
array by taking, say, the mean of every n*m elements. The functions in the previous section
downsample by taking the mean of a full line or plane, etc. The most useful one is probably
calculating the boxed mean, but the other ones can be used similarly. The width of each window axis
has to be given. Missing axes default to 1. Similarly to the partial reduce functions described
above, the axes must be given as the last argument(s) and can be given as scalars or as a set.
For example, BOXEDMEAN(array,3,3) calculates the mean in each 3x3 box. At the end of an axis the box
used will be smaller if it does not fit integrally.
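The boxing behaviour, including the smaller last box, can be sketched in one dimension (the function name is illustrative):

```python
def boxed_mean_1d(values, width):
    """Mean per box of 'width' elements; the last box is smaller
    if the length is not a multiple of the width."""
    return [sum(values[i:i + width]) / len(values[i:i + width])
            for i in range(0, len(values), width)]

print(boxed_mean_1d([1, 2, 3, 4, 5], 2))  # [1.5, 3.5, 5.0]
```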
The functions can only be used for arrays, thus not for scalars.
These functions transform an array into an array with the same shape by operating on a rectangular window
around each array element. The most useful one is probably calculating the running median, but the other
ones can be used similarly. The half-width of each window axis has to be given; the full width is 2*halfwidth
+ 1. Missing axes default to a half-width of 0. Similarly to the partial reduce functions described
above, the axes must be given as the last argument(s) and can be given as scalars or as a set.
For example, RUNNINGMEDIAN(array,1,1) calculates the median in a 3x3 box around each array element. See
the examples how it is applied to an image.
In the result the edge elements (i.e., the elements where no full window can be applied) are set to 0 (or
False).
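A one-dimensional sketch of this windowing, including the half-width convention and the zeroed edges (the function name is illustrative):

```python
import statistics

def running_median_1d(values, halfwidth):
    """Median over a window of 2*halfwidth+1 elements around each
    position; edge elements without a full window are set to 0."""
    n = len(values)
    out = [0] * n
    for i in range(halfwidth, n - halfwidth):
        out[i] = statistics.median(values[i - halfwidth:i + halfwidth + 1])
    return out

print(running_median_1d([5, 1, 9, 2, 7, 3], 1))  # [0, 5, 2, 7, 3, 0]
```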
The functions can only be used for arrays, thus not for scalars.
Explicit type conversions can be done using one of the functions below. They can operate on scalars and arrays.
The following functions create an array value with or without a mask. Function marray creates a new masked array, the other functions return a masked array if the input was masked, otherwise an unmasked array.
The GXXX aggregate functions calculate an aggregated value for all rows in a group, usually defined with a
GROUPBY clause. For example, when grouping in TIME, an aggregate function like GNTRUE(FLAG) counts
per time slot the number of flagged data points. Aggregate functions can only be used in the SELECT and
the HAVING clause.
Most functions listed below reduce the values in a group to a scalar value, also if the value in a
row is an array (as in the GNTRUE example above). The arrays in a group can have different
shapes.
However, there are several aggregate functions returning an array as done by the last three functions (GHIST, GAGGR, and GROWID) shown below. Furthermore, most scalar functions have a plural form (e.g., GNTRUES) returning an array. They are described at the end of this section.
Note that the aggregate function names differ from their SQL counterparts; they all have the prefix G, because TaQL functions like MAX already exist for array operations. This naming scheme also makes it more clear which TaQL functions are aggregate functions.
A technical detail is how aggregate functions are implemented. TaQL walks sequentially through a table. Non-lazy functions operate directly on the value in a row making the table access purely sequential. It requires that the results of all groups are held in memory. For some functions, in particular GAGGR, this could lead to a very high memory usage. Therefore, some functions are implemented in a lazy way. They keep the row numbers of a group and access the data when the aggregated result of a group is needed. In this way only the data of a single group needs to be held in memory, but the access to the table might be non-sequential making it somewhat slower. Currently, only GAGGR and the User Defined aggregate functions are implemented in a lazy way.
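The two strategies can be sketched for a simple mean-per-group (illustrative data; both give the same result, but the lazy form only ever holds the row numbers plus one group's data):

```python
from collections import defaultdict

rows = [("t1", 1.0), ("t1", 3.0), ("t2", 5.0)]  # (TIME, value) per row

# Non-lazy: one sequential pass, partial results of all groups in memory.
sums = defaultdict(lambda: [0.0, 0])
for key, val in rows:
    sums[key][0] += val
    sums[key][1] += 1
eager = {k: s / n for k, (s, n) in sums.items()}

# Lazy (as GAGGR does): keep only row numbers per group and fetch the
# data when a group's result is needed.
groups = defaultdict(list)
for rownr, (key, _) in enumerate(rows):
    groups[key].append(rownr)
lazy = {k: sum(rows[r][1] for r in nrs) / len(nrs)
        for k, nrs in groups.items()}

print(eager)          # {'t1': 2.0, 't2': 5.0}
print(lazy == eager)  # True
```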
Most functions above have a plural counterpart. They calculate the aggregated value per array index, thus the
result has the same shape as the arrays in the group. Similar to function GAGGR, they require that all
arrays in a group have the same shape.
For instance, for a MeasurementSet the expression GMEANS(DATA) calculates the mean in a group per
channel/polarization. Not only is it a shorthand for MEANS(GAGGR(DATA), 0), but it usually works faster
because, unlike GAGGR, it is non-lazy.
The functions available are:
Cone search functions make it possible to test if a source is within a given distance of a given sky position. The expression
could be used to test if sources with their sky position defined in columns RA and DEC are within 1 arcmin of
the given sky position.
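Written out by hand, such a test is essentially a spherical-distance expression. A hedged sketch (the catalogue name and the centre position in radians are assumptions, not taken from the original example):

```
SELECT FROM my.cat WHERE
   acos(sin(DEC)*sin(0.5236) +
        cos(DEC)*cos(0.5236)*cos(RA - 2.1817)) < 1arcmin
```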
The cone search functions implement this expression making life much easier for the user. Because they can
also operate on arrays of positions, searching in multiple cones can be done simultaneously. That makes it
possible to find matching source positions in two catalogues as shown in an example at the end of this
section.
The arguments of all functions are described below. All of them have to be given in radians. However, usually one does not need to bother, because TaQL makes it possible to specify positions in many formats, which are automatically converted to radians.
The following cone search functions are available.
Please note that ANYCONE(SOURCE,CONES) does the same as ANY(CONES(SOURCE,CONES)), but is faster because it stops as soon as a matching cone is found.
Function CONES makes it possible to do catalogue matching. For example, to find sources matching other
sources in the same catalogue (within a radius of 10 arcseconds):
Note that in this example the SELECT clause returns an array with positions which are used as the cone centers. So each source in the catalogue is tested against every source, making it an N-squared operation and thus potentially very expensive. The result is a 4-dim boolean array with shape (in glish style) [1,nrow,1,nrow] which can be processed in Glish. Please note that the CONES function results for each row in an array with shape [1,nrow,1].
The query can also be done with multiple radii, for example with 1 arcsecond and 1 arcminute, resulting in an array with glish shape [3,nrow,1,nrow]. In this way one can get a better indication of how close sources are to the cone centers.
TaQL can be extended with so-called User Defined Functions (UDFs). These are dynamically loaded functions, written either in C++ or in Python. In TaQL the name of a UDF written in C++ consists of the name of the library (without the lib prefix and the extension) followed by a dot and the function name. For example, meas.hadec denotes function hadec in shared library libmeas.so or libcasa_meas.so. For OS-X the extension .dylib will be used.
The physical shared library name must be fully lowercase, but the UDF name used in TaQL is
case-insensitive. The name of a UDF written in Python is like py.module.func where the module part
is optional. In the USING STYLE clause it is possible to define synonyms for the UDF library
names. By default, mscal is defined as a synonym for derivedmscal and py as a synonym for
pytaql.
Usually a UDF will operate on the arguments given to the function and will not itself operate on a table given in a query command. However, some UDFs (most notably the mscal ones) do not have arguments, but operate directly in a specific way on a table. Normally they use the first table given in the FROM clause, but the UDF name can be preceded by a table shorthand to specify another table. For example:
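A sketch of such a command (hedged; the table names, the shorthands t1 and t2, and the use of the HA function are assumptions):

```
SELECT t1.mscal.HA(), t2.mscal.HA() FROM my.ms t1, other.ms t2
```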
to get the hourangle from two different tables. Of course, both tables need to have the same number of rows.
Note that UDFs not directly operating on a table will ignore a shorthand.
In section Writing user defined functions it is explained how to write user defined functions.
The Casacore package comes with several predefined UDFs in library libcasa_derivedmscal. It contains
four groups of UDFs, all operating on a MeasurementSet and several on a CalTable, the CASA calibration
table (both old and new format).
Although the library is called derivedmscal, for ease of use it is possible to use the synonym
mscal.
Get derived values
The first group calculates derived values like hourangle and azimuth for each row in the MeasurementSet or CalTable given in the FROM clause. It uses the time, the direction, and the array center or the first or second antenna of a baseline from the MeasurementSet or CalTable. For a CalTable, where a row contains a single antenna, functions like PA1 are the same as PA2. All angles are returned in radians.
By default all these functions will use the direction given in column PHASE_DIR of the FIELD subtable. It is possible to use another column in the FIELD table by giving its name as a string argument (e.g., HA('DELAY_DIR')).
Except for the last 2 functions, it is possible to use an explicit direction which must be given as [RA,DEC] in
J2000 or as a case-insensitive name of a planetary object (as defined by the Casacore Measures) or a known
source (such as CygA). For example:
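Hedged sketches of such calls (the direction values are invented, and AZEL1 is assumed to be one of the derived-value functions using the position of ANTENNA1):

```
SELECT mscal.AZEL1('SUN') FROM my.ms
SELECT mscal.AZEL1([3h14m15.9, 26d31m51.8]) FROM my.ms
```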
The examples above give the azimuth and elevation of the given directions for each selected row in the MeasurementSet, using the position of ANTENNA1 and the times in these rows.
If a string value is given, it is first tried as a planetary object. Theoretically it is possible that a column has the same name as a planetary object. In such a case the name can be escaped by a backslash to indicate that a column name is meant. For example:
means that column SUN in the FIELD table has to be used.
Stokes conversion
The STOKES function makes it possible to convert the Stokes parameters of a DATA column in a
MeasurementSet, for instance from linear or circular to iquv. It is also possible to convert the weights or
flags, i.e., to combine them in the same way as the data would be combined.
In all cases the case-insensitive string argument defines the output Stokes axes. It must be a comma separated list of Stokes names. All values defined in the Casacore class Stokes are possible. Most important are:
If not given, the string argument defaults to 'IQUV'. For example:
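A hedged sketch of such a command (the table names are assumptions; the Stokes list uses the comma-separated form described above):

```
SELECT mscal.STOKES(DATA, 'RR,RL,LR,LL') AS CIRCDATA FROM my.ms GIVING circ.ms
```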
creates a table with column CIRCDATA containing the circular polarization data.
CASA style selection
The BASELINE function makes it possible to do selection on baselines in a MeasurementSet or CalTable using
the special CASA selection syntax described in note 263. Similar functions CORR, TIME, FIELD, FEED,
SCAN, SPW, UVDIST, STATE, OBS, and ARRAY can be used to do selection based on other meta data. The
functions accept a string containing a selection string and return a Bool value telling if a row matches the
selection string. For example,
selects the cross-correlation baselines containing an antenna whose name matches the pattern in the function
argument.
Note that there is a difference in how CASA and TaQL handle unknown antennas given in the baseline selection string. CASA tasks give an error, while TaQL will not complain and not even report it, because doing a selection this way should not behave differently from doing it like NAME='RTX'.
Also note that in CASA tasks only one selection string per type can be given and the final selection is the AND of them. TaQL has the AND and OR operators, making it possible to combine the selections in all kinds of ways, possibly using multiple selection strings of the same type.
Get values from a subtable
Several functions exist to get information like the name of an antenna from the subtable for each row in the
main table. Basically they do a join of the main table and a subtable. For example:
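For instance (a hedged sketch; ANT1NAME and ANT2NAME are assumed to be among the subtable-join functions of derivedmscal):

```
SELECT mscal.ANT1NAME(), mscal.ANT2NAME() FROM my.ms
```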
gets the names of the antennae used in each baseline.
The following functions can be used:
Note that the following are equivalent. The first versions are shorthands for the latter ones.
In the last example the id-column must be given as such, thus must not be a string.
These functions make it possible to convert Casacore measures (e.g., directions) from one reference frame to another. The prefix MEAS. has to be used for all these functions. The MEAS library libcasa_meas.so (or .dylib) will be loaded if not loaded yet. All conversions supported by Casacore’s Measures are possible. It is quite flexible; for instance, source names can be used instead of right ascension and declination. Also it recognizes nested MEAS functions and table columns containing measures. For example:
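Hedged sketches of such calls (the coordinate values and the date/time are invented):

```
meas.dir('GALACTIC', [10h20m30, 20d30m40], 'J2000')
meas.azel('MOON', 10Jun2018/12:00, 'WSRT')
```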
The first example converts a J2000 position to galactic coordinates. The second example gives the moon’s azimuth/elevation at the WSRT for the current date/time.
Below it is described how the measure values can be specified. Further down it is described in detail for each measure type.
Many functions are available, but they come down to a few basic functions. The others (described further
down) are synonyms or shorthands for the basic functions described below.
A function will operate on each element of the Cartesian product of the function arguments.
The first six suffixes can also be used with the Moon.
See stjarnhimlen.se for additional information.
For ease of use several functions have a shorthand synonym.
For even more ease of use several functions are defined with an implicit ’toref’ argument.
The function arguments can be given in a variety of ways. Coordinate values (such as directions) can be
followed by a ’ref’ argument telling the reference frame used for them (e.g., J2000). If not given, a default
reference frame is assumed.
Where needed, the argument data types and units are used to distinguish arguments. However, a string value for a reference frame cannot be distinguished by type from a string giving the name of a source, observatory or line. In such a case the contents of the string value are used to distinguish them.
can be used to see the possible frames for each Measure type.
For example:
For example:
Note that in the last example 'UTC' is not necessary, because it is the default.
If needed, the reference type (with optional suffix) can be given in the next argument. The reference
type defaults to ITRF if xyz coordinates are used, otherwise to WGS.
For example:
For example:
For example:
Note that in the last example 'LSRK' is not necessary, because it is the default.
For example:
Below a few examples are given showing how the MEAS functionality can be used.
calculates the local apparent sidereal time for the given date/time and position. The second example shows that an observatory name can be used for the position. It also shows that the date/time can be given as a string.
calculates Jupiter’s azimuth/elevation for WSRT and VLA for all times returned by the subquery (see next section for subqueries).
converts the PHASE_DIR directions in the FIELD table to B1950. Note that no frame information is needed for such a conversion.
calculates the azimuth/elevation of the given source direction for the LOFAR site for the next 24 hours on the given date. The result is an array with shape [24,2]. The direction in the second example is given in B1950, the first as the default J2000. The result of the first example is double values with unit deg (given at the end of the expression). The result of the second example is strings in DMS format (because function DMS is used).
calculates the rest frequency for the given radial velocity, direction, date/time and position. The result will have unit Hz.
As in SQL it is possible to create a set from a subquery. A subquery has the same syntax as a main query, but has to be enclosed in square brackets or parentheses. Basically it looks like:
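A sketch of such a query (the column names and the windspeed threshold are assumptions; the table names follow the description below):

```
SELECT FROM my.ms WHERE TIME IN
   [SELECT TIME FROM othertable WHERE WINDSPEED < 5]
```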
The subquery on othertable results in a constant set containing the times for which the windspeed matches. Subsequently the main query is executed and selects all rows from the main table with times in that set. Note that like other bounded sets this set is transformed to a constant array, so it is possible to apply functions to it (e.g., min, mean).
This example shows how a subquery is used to join the main table of a MeasurementSet and its ANTENNA subtable. The subquery returns a list with the names of all antennae, which subsequently is indexed with the antenna number to get the antenna name for each row in the main table.
is a newer and easier way to obtain the name of ANTENNA1. It makes use of the new user defined functions in derivedmscal which can do an implicit join of a MeasurementSet and its subtables.
This example contains another subquery to get all windspeeds and to take the mean of them. So
the first subquery selects all times where the windspeed is less than the average windspeed.
A subquery result should contain only one column, otherwise an exception is thrown.
It may happen that a subquery has to be executed twice because two columns from the other table are needed. E.g.
In this case othertable contains the time range for each windspeed. For big tables it is expensive to execute the subquery twice. A better solution is to store the result of the subquery in a temporary table and reuse it.
However, this has the disadvantage that the table tmptab still exists after the query and has to be deleted explicitly by the user. Below a better solution for this problem is shown.
TaQL has a few extensions to support tables better, in particular the Casacore MeasurementSets.
However, below an even nicer solution is given.
The set expression in the GIVING clause is filled with the results from the subquery and used in the main query. So if the subquery results in 5 rows, the resulting set contains 5 intervals. Thereafter the resulting intervals are sorted and combined where possible. In this way the minimum number of intervals have to be examined by the main query.
In this example the other table is a subtable of table my.ms. Its name is given by keyword WEATHER of my.ms.
Note that the function ROWNUMBER cannot be used here, because it gives the row number in the selection and not (as ROWID does) the row number in the original table. Furthermore, ROWID gives a 0-relative row number, which is needed to be able to use it as a selection criterion on the 0-relative values in the MeasurementSet.
In a MeasurementSet the UVW coordinates are stored in meters, so they have to be multiplied by the frequency and divided by the speed of light to get them in wavelengths.
The first join finds the SPECTRAL_WINDOW_ID for each row; the second join finds the channel
frequencies.
A derivedmscal function can be used for an easier solution as shown below.
It shows how the UVWWVLS function in derivedmscal can be used to obtain the UVW coordinates in wavelengths.
Similar to SQL it is possible to do aggregation and grouping in TaQL and to do selection on the groups using the HAVING clause.
One or more aggregated values can be calculated for a group defined by the GROUPBY clause. The aggregate functions described in section 4.10.13 can be used. For example:
A group is formed for each unique combination of values of the columns given in the GROUPBY clause. In the example above a group per baseline is formed. Usually an aggregate function is used to calculate a value for the group. In the example above the aggregate function gcount() counts the number of rows per baseline.
Often only the GROUPBY columns and aggregated values are part of the SELECT clause, but the example
shows that other values (here the baseline length) can also be selected. Non-aggregated values get the values
in the last row of a group.
Usually aggregated values and GROUPBY are used jointly, but it is possible to leave out one of them. If GROUPBY is not given, the entire table is a single group. For example:
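For instance (a hedged sketch; the table name is an assumption):

```
SELECT gcount() FROM my.ms
```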
does not have groups, thus shows the total number of rows in the MS.
does not use aggregate functions, but shows the unique baselines in the MS. Apart from the order, it has the same result as
but is somewhat faster.
In the examples above a sole aggregate function is used, but it is also possible to use it in an expression. Similarly, an expression can be used in the GROUPBY. For example:
groups the MS in chunks of 5 time slots. Note that the nested query gets the TIME of the first time slot. The result is a set, hence the 0th element has to be taken.
Note that an aggregate function can only be used in the SELECT and HAVING clause, so TaQL will give an error message if used elsewhere.
The HAVING clause can be used to select specific groups. For example:
groups by time, but only selects the groups for which the maximum amplitude of the DATA is more than 100. Both examples give the same result, but the first one is more efficient. Not only is it less typing, but it is also faster because it reuses the result column MAXA of the SELECT part.
Similar to WHERE, any expression can be used in HAVING, but the result has to be a bool scalar
value.
As shown in the example, HAVING will normally use aggregate functions, but it is not strictly needed.
However, selections without an aggregate function could as well be done in the WHERE clause.
Usually HAVING will be used in combination with GROUPBY, but it can be used without. It can
also be used without an aggregate function in the SELECT. However, it is an error if both are
omitted.
A lot of development work could be done to improve the query optimization. At this stage only a few simple optimizations are done.
can generate many identical or overlapping intervals. They are sorted and combined where possible to make the set as small as possible.
TaQL does not recognize common subexpressions, nor does it attempt to optimize the query. This means that the user can optimize a query by specifying the expression carefully. When using operator || or &&, attention should be paid to the contents of the left and right branches. Both operators evaluate the right branch only if needed, so if possible the left branch should be the shortest one, i.e., the fastest to evaluate.
The user should also use functions, operators, and subqueries in a careful way.
could also be expressed as
The latter (as a set) is slower. So, if possible, the column should be returned directly. This is also easier
to write.
An even more important optimization for this query is writing it as:
Using the DISTINCT qualifier has the effect that duplicates are removed which often results in a much smaller set.
The second form is by far the best, because in that case the subquery will stop the matching process as
soon as N matching rows are found. The first form will do the subquery for the entire table.
Furthermore in the first form a column has to be selected, which is not needed in the second
form.
give the same result. Operator IN is faster because it stops when finding a match. If using ANY all elements are compared first and thereafter ANY tests the resulting bool array.
Usually TaQL will be used to get a subset from a table. However, as described in the first sections, it can also be used to change the contents of a table using the UPDATE, INSERT, or DELETE command. Note that a table has to be writable, otherwise those commands exit with an error message.
updates all or some rows in the first table. More input tables can be given in the FROM clause and used in
clauses like SET and WHERE. Unlike SQL it is possible to specify more tables in the UPDATE part which is
the same as specifying them in the FROM clause. However, using the FROM clause makes it more clear that
only the first table is updated.
update_list is a comma-separated list of column=expression parts. Each part tells to update the given
column using the expression. Both scalar and array columns are supported. E.g.
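A sketch of such an update (the table name is an assumption):

```
UPDATE my.ms SET ANTENNA1=ANTENNA1-1, ANTENNA2=ANTENNA2-1
```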
to make the antenna numbers zero-based if they were accidentally written one-based.
are equivalent. They copy the DATA and FLAG column of that.ms to this.ms for rows where all data in this.ms are flagged. Note the use of the shorthand (alias) t2.
If an array gets an array value, the shape of the array can be changed (provided it is allowed for that table column). Arrays can also be updated with a scalar value causing all elements in the array to be set to that scalar value.
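For instance (a hedged sketch; the table name is an assumption):

```
UPDATE my.ms SET FLAG=False
```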
It sets all elements of the arrays in column FLAG to False.
Type promotion and demotion will be done where possible. For example, an integer column can get the
value of a double expression (the result will be truncated).
Unit conversion will be done as needed. Thus if a column and its expression have different units, the
expression result is automatically converted to the column’s unit. Of course, the units must be of the same
type to be able to convert the data.
Note that if multiple column=expression parts are given, the columns are changed in the order as specified in the update-list. It means that if an updated column is used in an expression for a later column, the new value is used when evaluating the expression. e.g., in
the SUMD update uses the new DATA values.
Thus to swap the values of the ANTENNA1 and ANTENNA2 columns, one cannot do:
To solve this problem a temporary table (in this case in memory) can be used to save the value of e.g., ANTENNA1:
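A sketch of such a command (hedged; the GIVING AS MEMORY clause and the shorthand t1 are assumptions):

```
UPDATE my.ms, [SELECT ANTENNA1 AS A1 FROM my.ms GIVING AS MEMORY] t1
   SET ANTENNA1=ANTENNA2, ANTENNA2=t1.A1
```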
It is possible to update part of an array using array indexing and slicing. E.g.,
The first example sets only a single array element, while the second one sets an entire row in the array. Similar to numpy it is also possible to use a mask like
which sets the flag for the DATA values being a NaN. The data and mask must have the same shape. Note this is easier to write than the similar command
Masking and slicing can be combined making it possible to use masking on a part of an array. If the mask is given first, the slice is taken from both the data and mask. If the slice is given first, it is only applied to the data; the mask should have the same shape as the slice. For example:
Both commands set the flag for NaN data in the XX polarization. The first one is somewhat easier to write, but processes the entire DATA and FLAG before taking the slice. The second one only reads and processes the required parts of DATA and FLAG, thus is more efficient.
If a column is updated with the value of a masked array, only the array part of the masked array is used. However, it is also possible to jointly update the data column and mask column from a masked array by combining them in parentheses like:
It writes the data part into DATA and the mask into FLAG. As above it is possible to use a slice or mask operator on the combination like:
The slice or mask is applied to both columns.
The INSERT command adds rows to the table. It can take three forms:
The first form adds a single row setting the values in the same way as the UPDATE command. The second form is the SQL syntax and can add multiple rows. In this form the optional LIMIT part can also be given right after the INSERT keyword. In both forms it is possible to jointly specify data column and mask column if the value is a masked array. This is done by combining them in parentheses like (DATA,FLAG) as described in the previous subsection for the UPDATE command.
The first form adds one row to the table and puts the values given in the expressions into the columns.
For example:
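A sketch of the first form (the table name is an assumption):

```
INSERT INTO my.ms SET ANTENNA1=0, ANTENNA2=1
```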
adds one row, puts 0 in ANTENNA1 and 1 in ANTENNA2.
The second form can add multiple rows to the table. It puts the values given in the expression lists into
the columns given in the column list. If the column list is not given, it defaults to all stored
columns in the table in the order as they appear in the table description. Multiple expression
lists can be given; each list results in the addition of a row (however, see LIMIT clause below).
Each expression in the expression list can be as complex as needed; for example, a subquery can also be
given. Note that a subquery is evaluated before the new row is added, so the new row is not taken into
account if the subquery is done on the table being modified.
It should be clear that the number of columns has to match the number of expressions.
Note that row cells not mentioned in the column list are not written, thus may contain rubbish in the new rows.
The data types and units of expressions and columns have to conform in the same way as for the UPDATE
command; values have to be convertible to the column data type and unit.
For example:
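A sketch of the second form (the table name is an assumption):

```
INSERT INTO my.ms (ANTENNA1,ANTENNA2) VALUES (0,1),(2,3)
```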
adds two rows, putting 0 and 2 in ANTENNA1 and 1 and 3 in ANTENNA2.
The LIMIT clause can be used to add multiple rows while giving fewer expressions. LIMIT can be given at the beginning or the end of the command. For example:
The first example will add 100 rows where the value in each row is the row number. The second example shows that multiple expression lists can be given. It will iterate through them while adding rows. Thus COL1 and COL2 will have the values 0, 1, 0, 1, and 0 in the new rows.
The third form evaluates the SELECT command and adds the rows found in the selection to the table
being modified (which is given in the INTO part). The columns used in the modified table are
defined in the column list. As above, they default to all stored columns. The columns used in the
selection have to be defined in the column-list part of the SELECT command. They also default to
all stored columns. Column names and data types have to match, but their order can differ.
For example:
appends all rows and columns of my.ms to itself. Please note that only the original number of rows is copied.
copies the rows from other.ms matching the WHERE condition on ANTENNA1. It swaps the values of ANTENNA1 and ANTENNA2. All other columns are not written, thus may contain rubbish.
deletes some or all rows from a table.
deletes the rows matching the WHERE expression.
If no selection is done, all rows will be deleted.
It is possible to specify more than one table in the FROM clause to be able to use, for example,
keywords from other tables. Rows will be deleted from the first table mentioned in the FROM
part.
TaQL can be used to create a new table. The data managers to be used can be given in full detail. The syntax is:
The command consists of 5 parts, all of them optional.
The CREATE TABLE command can be used in a nested query making it possible to fill it immediately. For example:
creates a table with one column and ten rows. The column is filled with the row number. Note that the following command would do the same.
The colspecs part defines the column names, their data types, and optional shapes and units. It can optionally be enclosed in square brackets or parentheses (for SQL compatibility). It is a comma separated list of column specifications. Each specification looks like:
A column specification describes the column and consists of various parts.
The datamanagers part makes it possible for the expert user to define the data managers to be used by columns. It is a comma separated list of data manager specifications looking like the output of the table.getdminfo command in Python. Each specification has to be enclosed in square brackets. For example:
The case of the keyword names used (e.g., NAME) is important. They have to be given in uppercase. The
following keywords can be given:
NAME defines the unique name of the data manager.
TYPE defines the type of data manager.
SPEC is a list of keywords giving the characteristics of the data manager. This is highly data manager type
specific. If shapes have to be given here, they always have to be in Casacore format, thus in Fortran order.
TaQL has no knowledge about these internals.
COLUMNS is a list of column names defining all columns that have to be bound to the data manager.
TaQL can be used to remove one or more tables. If a plain table is removed, its subtables are removed as well. The syntax is:
The table-list is a comma separated list of table names, which can also use :: to denote subtables.
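For instance (a hedged sketch matching the description below):

```
DROP TABLE my.ms, that.ms::EXTRA
```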
removes the table 'my.ms' and subtable EXTRA of 'that.ms'.
TaQL can be used to modify the table structure, i.e., to add, rename, and remove columns and keywords. It is also possible to add rows. The syntax is:
It changes the table with the given name. The tables given in the optional FROM clause can be used in expressions defining keyword values. Any number of subcommands can be given, separated by whitespace and/or comma. The following subcommands can be given. They are explained in the next subsections.
The nouns COLUMN and KEYWORD can also be given in the plural form. The whitespace
between verb and noun is optional. For SQL-compatibility DROP can be used instead of DELETE.
For example:
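A sketch of such a command (the table name is an assumption):

```
ALTER TABLE mytab RENAME COLUMN Col1 TO Col1A, ADD COLUMN Col1 I4
```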
renames column Col1 to Col1A and adds a new column Col1 with data type I4.
Note that TaQL has no way of showing keywords having a record value. The program showtableinfo can be used for that purpose.
adds one or more columns to the table. The specification of the columns and the optional data managers is the same as used in the CREATE TABLE command. Thus for each column a data type, dimensionality or shape, and unit can be given. The data manager(s) for the new columns can be specified in the DMINFO part. If not given, StandardStMan will be used. For example:
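A hedged sketch (the table and column names and the NDIM syntax are assumptions based on the column specification described for CREATE TABLE):

```
ALTER TABLE mytab ADD COLUMN Col1 R4, Col2 R8 NDIM=3
```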
adds two columns, a 4-byte floating point scalar column and an 8-byte floating point 3-dim array column. They will be stored with StandardStMan.
makes it possible to copy the data in a column to a new column, which does not need to exist yet. If the columns reside in different tables, the tables must have the same number of rows. The new column gets its description from the input column. The optional DMINFO part can be used to define the data manager(s) for the new columns.
Note that copying to existing columns can be done with the UPDATE command. For example:
The first example creates column NCol1 taking its description from Col1; thereafter the contents of Col1 are
copied into it.
The second example is similar, but takes Col1 from the table with shorthand ’t’, which is given in the FROM
part. Because the data are copied, mytab and othertab must have the same number of rows.
The third example is similar to the first one, but has to be used if NCol1 already exists.
renames one or more columns in a table. For example:
removes one or more columns. Note that if multiple columns are combined in a TiledStMan, they have to be removed at the same time. Thus in that case
are not the same, because the second example might fail.
DROP is a synonym for DELETE.
adds a keyword with the given value or replaces the value if the keyword already exists. The value of a keyword can be a scalar, array, or arbitrarily deeply nested record. See section 4.5 how to specify a keyword name in a column or nested record. The AS dtype part can be used to explicitly set the data type of a new keyword. For an existing keyword, the data type of the new value has to match the data type of the current value.
The value can be an expression, possibly using values from another table given in the FROM clause. It has to be a constant expression, thus cannot depend on column values. Of course, column values can be used when aggregated to a single value. If no data type is given, the data type of the expression result is used. If given, upward and downward coercion is possible (e.g., integer to float and also float to integer). For example:
The 1st example sets table keyword key1 to 4. As no data type is given, it gets the data type of the expression, in this case I8.
The 2nd example sets key1 to 9, but as an unsigned 4 byte integer. Note that the :: part is redundant.
The 3rd example copies the value of keyword otherkey while converting its data type to I4. Note that if no
data type is given, the data type of otherkey is NOT preserved, because it is seen as a TaQL expression which
has data type I8 (or R8).
The 4th example sets the ckey.subrec.fld1 in column col to the given vector. It is a nested structure, thus field
fld1 in field subrec of column keyword ckey will be set. Its data type will be R8.
Note that the command in the 4th example does not create the higher level records. If not existing yet, the
5th example can be used to create them, where [=] denotes an empty record (it is the old Glish syntax for an
empty struct).
The last example shows how to create a key with an empty integer vector as value. In such a case the data
type must be given, because it cannot be derived from the value.
Setting a keyword to the value of another keyword is easily possible. For instance:
However, it has two problems.
1) As explained above the data type might not be preserved.
2) Keywords having a record value cannot be copied this way, because TaQL expressions do not support
record values.
copies the value of keyword otherkey to key. It can be used for any keyword value, thus also for records. The optional AS dtype part can be used to change the data type.
renames one or more table or column keywords. If the old keyword is a field in a column or a nested record, the new name should only contain the new field name, not the full keyword path. For example:
The first example renames the table keyword NAME and the keyword CNAME of column Col1.
The second example renames a field in the nested records of table keyword KEYS.
removes one or more table or column keywords.
DROP is a synonym for DELETE.
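For example (names are illustrative):

```
ALTER TABLE mytab DELETE KEYWORD key1, Col1::ckey
```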
adds the given number of rows to the table.
where nrows can be any expression. For example,
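A hedged sketch using a nested query to count the rows of another table (the exact form of the nested count may differ):

```
ALTER TABLE mytab ADD ROW [select gcount() from othertab]
```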
makes mytab the same size as othertab (assuming it was empty).
Before TaQL had the GROUPBY command, the COUNT command was the way to count the number of occurrences in a table; nowadays the gcount aggregate function can be used instead. For backward compatibility the COUNT command is still available, but its use is discouraged, also because GROUPBY is usually faster.
The exact syntax is:
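A sketch of the form (square brackets denote optional parts):

```
COUNT [column-list] FROM table-list [WHERE expression]
```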
It counts the number of rows for each unique tuple in the column list of the table (after the possible WHERE selection is done). For example:
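For instance (the MeasurementSet name my.ms is illustrative):

```
COUNT TIME FROM my.ms
```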
counts the number of rows per timestamp.
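Similarly (illustrative):

```
COUNT ANTENNA1, ANTENNA2 FROM my.ms
```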
counts the number of rows per baseline.
As in the other TaQL commands a column in the column list can be any expression, but that will be slower than using plain columns.
TaQL can be used to get derived values from a table by means of an expression. The expression can result in any data type and value type. For example, if the expression uses an array column, the result might be a vector of arrays (an array for each row). If the expression uses a scalar column, the result might be a vector of scalars or even a single scalar if a reduce function like SUM is used.
The CALC command was developed before the GROUPBY clause was available and before SELECT could be used without the FROM part. Currently, SELECT is more powerful than the CALC command. For example, multiple expressions can be given in a SELECT command. However, especially in Python sessions CALC has the advantage that it returns the results as a numpy array or a list instead of a Casacore table.
The exact syntax is:
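A sketch of the form:

```
CALC expression [FROM table-list]
```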
The part in square brackets can be omitted if no column is (directly) used in the expression. The examples
will make clear what that means.
The following syntax is still available for backward compatibility:
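A hedged sketch: the old form puts the table list between two CALC keywords; the final line is a table-less example converting 1 inch to cm (the exact unit-conversion syntax may differ):

```
CALC FROM table-list CALC expression

calc 1in cm
```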
is a simple expression not using a table. It shows how the CALC command can be used as a desk calculator to convert 1 inch to cm.
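For example (illustrative MS name):

```
calc mean(DATA) from my.ms
```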
gives a vector of scalars containing the mean per row.
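Similarly (illustrative):

```
calc sum([select mean(DATA) from my.ms])
```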
gives a single scalar giving the sum of the means in each row. Note that in this command the CALC command does not need the FROM clause, because it does not use a column itself. Columns are only used in the nested query which has a FROM clause itself.
Some examples are given starting with simple ones.
The result of the following queries is a reference table, because no expressions have been given in the column-list. This will be the most common case when using TaQL.
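For instance (the MS name and columns are illustrative); no expression or data type appears in the column list, so the result is a reference table:

```
SELECT FROM my.ms WHERE ANTENNA1 != ANTENNA2
SELECT TIME, ANTENNA1, ANTENNA2 FROM my.ms WHERE ANTENNA1 != ANTENNA2
```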
The following examples result in a plain table, thus in a deep copy of the query results, because the column-list contains an expression or a data type.
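Illustrative sketches (the expression and the data-type notation are assumptions):

```
SELECT TIME, mean(DATA) AS MDATA FROM my.ms
SELECT DATA AS DATA4 C4 FROM my.ms
```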
The following command shows how a running median can be applied to a Casacore image.
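A hedged sketch (the image name my.img and the image column name map are assumptions):

```
update my.img set map = map - runningmedian(map, 25, 25)
```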
The running medians are subtracted from the data in the copy. It uses a half window size of 25x25, thus the
full window is 51x51.
When doing this, one should take care that in case of a spectral line cube the image is not too large,
otherwise it won’t fit in memory. If too large, it should be done in chunks like:
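An illustrative sketch of the chunked form, where sc and ec are placeholders for the channel numbers (the exact slicing syntax may differ):

```
update my.img set map[,,sc:ec,] = map[,,sc:ec,] - runningmedian(map[,,sc:ec,], 25, 25)
```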
where sc and ec are the start and end frequency channel. In this example it is assumed that the axes of the
image are RA, DEC, freq, Stokes.
Note that the image is updated, so it should have been copied before if the original data needs to be
kept.
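The next examples use CALC on a MeasurementSet selection; an illustrative sketch:

```
calc mean(DATA) from [select from my.ms where ANTENNA1 == 0]
```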
calculates for each row the mean of the data for the selected subset of the measurement set.
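Similarly (illustrative):

```
calc mean([select mean(DATA) from my.ms where ANTENNA1 == 0])
```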
looks like the previous example. It, however, calculates the mean of the mean of the data in each row for the selected subset of the measurement set.
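A hedged sketch (the reduction-axis notation is an assumption):

```
calc max(abs(VIDEO_POINT - means(DATA, 2))) from my.ms
```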
shows the maximum absolute difference between VIDEO_POINT of each correlation and the mean of the DATA for that correlation. Note that the 2 indicates averaging over the first axis, thus the frequency axis.
The Miriad program uvflux estimates the source I flux density and its standard deviation at the phase center without having to make an image. A single, not too complicated TaQL command (courtesy Dijkema, Heald) provides the same functionality on a MeasurementSet using the XX and YY data. For LOFAR it is best to use baselines with a length between 5 and 10 km. The command shows various aspects of TaQL that are explained below. The numbers at the beginning of the lines point to the text following the example.
A subquery is used to get the average flux (I = 0.5*(XX+YY)) per time slot.
The example below counts per antenna the number of fully flagged baselines, excluding the autocorrelations. It uses grouping and aggregate functions twice; first per baseline, thereafter per antenna. It uses the concatenation and the WITH features. The timings of the various query parts are shown by using the time keyword. It shows that the processing time is dominated by the first query on the MeasurementSet used (which has a size of 1.3 GByte).
There is a lot to say about this query, which is quite complex. It shows that the WITH clause and table concatenation are nice features.
User and programmer interfaces to TaQL are available. The program taql and some Python and Glish functions form the user interface, while C++ classes and functions form the programmer interface.
The main TaQL interface in Python is formed by the query function in module table. The function can be used to compose and execute a TaQL command using the various (optional) arguments given to the query function. E.g.
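An illustrative python-casacore sketch matching the description below (table, column, and result names are assumptions):

```python
from casacore.tables import table

t = table('mytable')                                 # open the table
seltab1 = t.query('COL1 > 10')                       # temporary result table
seltab2 = seltab1.query('COL2 < 5', name='my.sel')   # persistent result table
```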
The first command opens the table mytable. The second command does a simple query resulting in a temporary table. That temporary table is used in the next command resulting in a persistent table. The latter function call is transformed to the TaQL command:
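Assuming a query on 'COL2 < 5' with result name my.sel (illustrative), that is roughly:

```
select from $1 where COL2 < 5 giving my.sel
```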
During execution $1 is replaced by table seltab1.
Note that the name argument generates the GIVING part to make the result persistent.
The functions sort and select exist as convenience functions for a query consisting of a sort or column selection only. Both functions have an optional second name parameter to make the result persistent.
The calc function can be used to execute a TaQL calc command on the current table. The result can be kept in a variable. For example, the following returns a vector containing the median of the DATA column in each table row:
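An illustrative sketch, assuming t is an opened table object:

```python
med = t.calc('median(DATA)')   # one median value per table row
```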
It is possible to embed Python variables and expressions in a TaQL command using the syntax $variable and $(expression). A variable can be a standard numeric or string scalar or vector. It can also be a table tool. An expression has to result in a numeric or string scalar or vector. E.g.
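An illustrative sketch (the MS name and antenna selection are assumptions):

```python
t = table('my.ms')
sel1 = t.query('ANTENNA1 == 3')
ant = 3
sel2 = t.query('ANTENNA1 == $ant')
sel3 = t.query('ANTENNA1 == $(ant + 2 - 2)')
```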
These three queries give the same result.
The substitution mechanism is described in more detail in pyrap.util.
The most generic function that can be used is taql (or its synonym tablecommand). The full TaQL command has to be given to that command. The result is a table object. E.g.
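An illustrative sketch:

```python
from casacore.tables import taql

t = taql('select from my.ms where ANTENNA1 != ANTENNA2')
```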
By default, these commands will use the Python style for a TaQL statement. The style argument can be used to choose another style.
The Glish interface is formed by script table.g. By default, it will use the Glish style for a TaQL statement. For example:
The program taql makes it possible to execute TaQL commands from the shell. Commands can be given in different ways:
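An illustrative sketch (the interactive mode and commands are assumptions; -f is described below):

```shell
taql 'update my.ms set FLAG=True where ANTENNA1==ANTENNA2'  # command as argument
taql                                                        # interactive session
taql -f mycommands.taql                                     # commands from a file
```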
The commands can be given in a fully recursive way. For example, a command in an input file can invoke
another TaQL command file using -f.
The following commands can be given:
A command can be preceded by zero or more options to specify if, how and where the results of a TaQL selection are printed. The options can be given at various levels:
The output of a TaQL command can be printed. Most commands (such as UPDATE) will only show the expanded command and the number of rows affected. However, the CALC and SELECT commands can also show the results of the selected expressions.
The result of a CALC command is always printed.
Selected columns in a SELECT command are optionally printed. If an implicit SELECT is done (thus if
SELECT was added to the command) or if -ps is in effect, all results are printed. Otherwise if -pa is in effect,
the first N rows are printed where N is defined by the -m option.
The following print options are available.
All -p options can be preceded by no to negate settings.
The initial default settings are '-nops -pa -ph -pm -nopc -pr -m 50'.
Note that an implicit SELECT always uses -noph -nopc -nopr regardless of their settings.
A few other options are available.
Note that '--' can be used to indicate the end of the options. This can be useful if the following TaQL command starts with a minus sign.
For example:
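An illustrative sketch, assuming an implicit SELECT of an expression starting with a minus sign:

```shell
taql -- '-1+2'
```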
The C++ programmer can use TaQL commands and expressions at various levels:
The function tableCommand in TableParse.h can be used to execute a TaQL command. The result is a TaQLResult object. Its function isTable() tells if the result contains a Table object or a TableExprNode object. The latter results from a CALC command. E.g.,
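An illustrative sketch (table, column, and result names are assumptions):

```cpp
#include <casacore/tables/TaQL/TableParse.h>
using namespace casacore;

// Execute a query; the result is a temporary (reference) table.
Table seltab1 = tableCommand("select from mytable where COL1 > 10").table();
// $1 refers to the Table object passed as extra argument.
Table seltab2 = tableCommand("select from $1 where COL2 < 5 giving my.sel",
                             seltab1).table();
```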
These examples do the same as the Python ones shown above.
Note that in the second function call the table name $1 is replaced by the object seltab1 passed to the
function.
There is no style argument, so if an explicit style is needed it should be the first part of the TaQL statement.
Note that the Glish style is the default style.
The function parse in RecordGram.h can be used to parse a TaQL expression. The result is a TableExprNode object that can be evaluated for each row in the table. E.g.
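An illustrative sketch (table and column names are assumptions; the exact parse overload may differ):

```cpp
#include <casacore/tables/Tables/Table.h>
#include <casacore/tables/TaQL/RecordGram.h>
using namespace casacore;

Table tab("mytable");
TableExprNode expr = RecordGram::parse(tab, "COL1 > 10");
Bool result;
for (rownr_t row = 0; row < tab.nrow(); ++row) {
  expr.get(row, result);   // evaluate the expression for this row
}
```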
The example above does the same as the first example in the previous section. There are, however, better ways to use this functionality.
The example above shows a boolean scalar expression, but it can also be a numeric expression or an array expression as shown in the example below. Note that TaQL expression results have data type Bool, Int64, Double, DComplex, String, or MVTime.
Class RecordGram can also be used to apply TaQL to C++ vectors of values or Records. The RecordGram class documentation and its test program describe these features in more detail.
The other expression interface is a true C++ interface having the advantage that C++ variables can be used directly. Class Table contains functions to sort a table or to select columns or rows. When selecting rows class TableExprNode (in ExprNode.h) has to be used to build a WHERE expression which can be executed by the overloaded function operator in class Table. E.g.
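An illustrative sketch (table and column names are assumptions):

```cpp
#include <casacore/tables/Tables/Table.h>
#include <casacore/tables/TaQL/ExprNode.h>
using namespace casacore;

Table tab("mytable");
// Build a WHERE expression from the column and apply it via operator().
Table seltab = tab(tab.col("COL1") > 10);
```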
does the same as the first example shown above. See classes Table, TableExprNode, and TableExprNodeSet for more information on how to construct a WHERE expression.
A C++ user defined function has to be written as a class derived from the abstract base class UDFBase. The documentation of this base class describes how to write a UDF. Furthermore one can look at class UDFMSCal that contains the UDFs described in subsection User defined functions.
It is possible to write a UDF that operates on an individual expression (for each table row) and returns the result. It is, however, also possible to write a UDF acting as an aggregate function. In that case it will return a result based on the values of all rows in a group. See the description of the GROUPBY clause for more information on grouping and aggregate functions.
Note that a class can contain multiple UDFs as done in UDFMSCal. Also note that a single UDF can operate on multiple data types which is similar to a function like min that can operate on scalars and arrays of different data types.
A UDF class can contain a HELP function, which should return help information. This function is called by a help command like
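For example (the library name mscal is illustrative):

```
show functions mscal
```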
It returns an overview of the functions in the UDF class and possible other information. The optional subtype argument can be used to return more specific information. Note that the same result is given by
TaQL finds a UDF by looking in a dictionary mapping the UDF name to a function constructing an object of the UDF class. If not found, it tries to load the shared library with the lowercase name of the library part of the UDF (like in derivedmscal.pa1). If the load is successful, it calls an initialization function in the shared library that should add all UDF functions in the library to the dictionary. The description of the UDFBase class shows how this should be done.
NOTE: This section is for a future version of TaQL. It has not been fully implemented yet.
For performance reasons User Defined Functions will usually be implemented in C++. It is, however, possible to implement them in Python, both regular functions and aggregate functions. This can be done by means of the pytaql module of Casacore.
A UDF has to be implemented in Python by subclassing pytaqlbase, which can be imported from Casacore.python. The subclass has to implement a few functions, some of which are optional. The functions are called in the order given below.
The UDF should check if the argument types are correct and determine the result type. It has to return a dict containing the following fields:
Such a UDF can be called in TaQL like py.module.class where the class defaults to the module name.
An example of UDFs in Python is given below. The first one is a regular UDF, the second one an aggregate
UDF.
In the near or far future TaQL can be enhanced by adding new features and by doing optimizations.