An Introductory SQL Tutorial for Gaia-ESO Science Archive Users
This document pertains to release iDR1 of the GES archive.
In this document we provide a brief introduction to the use of Structured Query Language (SQL) for accessing data in the Gaia-ESO Survey Science Archive (hereafter GSA). Sections included in this document are a Primer aimed at users new both to the GSA and to SQL, and a Reference which should be of use to more experienced and returning users. Readers wanting a fuller introduction to SQL should consult an online tutorial or one of the legion of SQL books available: O'Reilly's SQL in a Nutshell is a good introduction. The document concludes with some examples of cross-querying GSA and external survey data, which we expect to be a common usage of the archive.
This GSA Cookbook is based on the SuperCOSMOS Science Archive (SSA) version. Currently, the latter provides a more comprehensive introduction to SQL and in particular gives real-world examples of imaging astronomy application queries that were used in the construction of that all-sky legacy archive based on the Schmidt telescope survey plate collections.
2.1. Relational databases
The GSA is a relational database, which means that it stores data in tables composed of rows and columns. Each row comprises the information stored for one data entry – e.g. a spectrum, spectral feature, astronomical source or a stellar atmospheric parameter for a particular source in the case of the GSA – and there is one column for each of the attributes recorded for that entry – e.g. RA, Dec, Teff, logG, [Fe/H], etc. The different tables comprising a database may be linked (or related), if they each have columns representing the same data value (integrity constraints can be included in the table definitions which ensure consistency between two related tables, e.g. by preventing the deletion of only one of a pair of rows in different tables thus linked). For ease of use, it is possible to define virtual tables - called views - which are super- and/or sub-sets of the data in one or more tables and which can be queried using the same syntax as ordinary tables (which are sometimes called base tables, to distinguish them from these virtual tables). In addition to tables and views, the major constituents of a relational database are indexes (the database community prefer that spelling to "indices"), which can speed up the identification of records which satisfy the particular condition expressed in a query, and various stored procedures and functions which extend the range of operations which can be performed on data held in the tables. The collection of definitions of columns, tables, views, indexes, stored procedures and functions in a database is called its schema.
The GSA schema is described in detail elsewhere, but we recap here the basic features which we shall use later. The main science tables of the GSA are those pertaining to the spectra themselves, and those pertaining to the atmospheric and chemical compositional parameters derived for the target stars from those spectra. Spectrum lists all spectra obtained as part of the survey, and each Spectrum is derived from a TwoDFrame, which corresponds to the images from which spectra are calibrated and derived. There are tables associated with these: TwoDProvenance tracks the processing history of each TwoDFrame via a set of identifier pairs that indicate the dependency of one frame on another; likewise, OneDProvenance performs the same role for Spectrum. A set of astrophysical analysis tables is provided with various parameters (depending on the type of star/spectrum being analysed, and the analysis process, or Working Group, being employed), for example GiraffeAstroAnalysisWG12 and UvesAstroAnalysisWG12. Views of the common attributes across this set of tables are defined for convenience when the user is unconcerned as to the originator of the analysis, e.g. RecommendedAstroAnalysis. Finally, Target (and its various views, e.g. TargetBulge, TargetOpen) gives details of the survey targets, while AtomicParam provides a compilation of the atomic data employed in the astrophysical analysis of the spectra. In addition to these major tables, there are also a number of metadata tables, which store ancillary information describing the instrumentation and processes involved in obtaining GES spectroscopic data, and which enable the provenance of derived measurements to be traced all the way back to the source spectrographs.
To aid spatial matching of GES objects and external catalogue datasets (such as the 2MASS Point Source, Vista Hemisphere Survey and VST ATLAS catalogues) there are also "neighbour" and "cross-neighbour" tables which record pairs of sources within 10 to 30 arcsec of one another, depending on the catalogues joined.
SQL is the standard language for accessing and manipulating data stored in a relational database. In fact, several versions of the SQL standard exist, and most database management systems (DBMSs) actually support a subset of standard SQL, with some vendor-specific additions. The GSA is currently implemented in Microsoft's SQL Server 2008 DBMS, so GSA users will employ its SQL dialect (known as Transact-SQL, or T-SQL), although we have tried to restrict the use of vendor-specific features to a minimum. A fuller reference on the T-SQL dialect than is presented here is available online.
The first thing to understand about SQL is that it is a set-based language, not a procedural language, like Fortran or C. A user submitting an SQL query to a relational database is defining the set of properties of the records to be returned from the database, not specifying the list of operations which will lead to their delivery; this latter is the responsibility of the DBMS engine, which will decide the best way to execute a given query from a set of possible execution plans. Many database vendors are adding procedural capabilities to the SQL dialects they support, and these constitute one of the main areas of difference between those dialects. These extensions will not be discussed here, as we shall concentrate on the basics of standard SQL.
For security reasons, the GSA does not allow users to execute queries which affect the basic structure and contents of the database, only those which can extract data from it. In SQL terms, this means that only SELECT statements are allowed: N.B. in this tutorial we write all SQL keywords in upper case italics and some column names in mixed case, both for clarity, although the GSA's SQL dialect is case insensitive by default. There are three basic classes of SELECT statement:
A projection is the retrieval of a set of full columns from a table. To retrieve the target name and astrophysical parameters Teff, logG and FeH from analyses of all spectra available in the GSA, one would type:
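A sketch of such a projection, reconstructed here from the column and view names quoted in the surrounding text:

```sql
SELECT targetID, TEff, logG, FeH
FROM   RecommendedAstroAnalysis
```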
where RecommendedAstroAnalysis is the name of the GSA table view which records information about all the best astrophysical analyses, and targetID, TEff, logG and FeH are the names of the relevant columns in that table.
A selection is the retrieval of the data values in particular columns for those rows in a table which satisfy certain criteria. So, if one were interested only in sub-solar metallicity stars down to Fe/H = -0.5 in the previous query, the appropriate SQL query would be:
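A sketch of such a selection, using the same view and columns as the projection example:

```sql
SELECT targetID, TEff, logG, FeH
FROM   RecommendedAstroAnalysis
WHERE  (FeH BETWEEN -0.5 AND 0.0)
```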
In this example the SQL statement has been split into three lines to emphasise
the SELECT…FROM…WHERE syntax, but this is still one SQL
statement. The SQL Query Form in the GSA interface ignores the
whitespace at the end of each line of text and generates a single query string
from valid multi-line text like this. (Note that this means that users
should not use double hyphens to indicate comments in multi-line
queries, since this will result in all text after the
first comment being ignored.)
while all other stars could be selected using the following statement:
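A sketch of the complementary selection, on the same view and columns:

```sql
SELECT targetID, TEff, logG, FeH
FROM   RecommendedAstroAnalysis
WHERE  (FeH < -0.5) OR (FeH > 0.0)
```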
The parentheses in these examples have been included for clarity – they are only required when needed to avoid ambiguity, or to overrule the standard order of precedence amongst operators, outlined in Section 3.3.6. (Users should note that the accidental omission of the WHERE clause from a selection turns it not into an invalid query, but into the projection of the columns contained in its SELECT clause, which for large tables will return a lot of data.)
Note that attributes in tables that have been indexed in the GSA for efficient filtering are indicated in the Schema Browser as highlighted rows in the lists of table attributes - selections predicated on those attributes will be particularly fast.
A join is the retrieval of data entries from one or more related tables in a database matched under some criterion. For example, one might want to query Spectrum for some subset of attributes of interest (e.g. object and radial velocity in a Declination range) but at the same time choosing only those within a restricted magnitude range in a specific passband, where the relevant quantity (in this case J-band magnitude JMag) is present in a related table. The SQL query retrieving the desired data here would be:
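A sketch of such a join; the column names (object, rv, dec, JMag, targetID) are taken from the surrounding discussion, while the Declination limits are purely illustrative values:

```sql
SELECT Spectrum.object, Spectrum.rv
FROM   Spectrum, Target
WHERE  Spectrum.targetID = Target.targetID
  AND  (Target.dec BETWEEN -30.0 AND -20.0)  -- illustrative Dec range
  AND  (Target.JMag BETWEEN 13.0 AND 14.0)
```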
In this query, records in the Spectrum and the Target tables are joined on condition that their values for the targetID attribute are equal. Furthermore, we select only those rows for which the J band magnitude is between 13 and 14.
The AS keyword can be used to rename the attributes in the SELECT clause so that their names in the output result set differs from those by which they are known in the database table. For example, a user thinking that the column names referred to in the query above are unfriendly could rewrite it as follows:
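A sketch of the renamed version of the join described above (the Declination limits are again illustrative):

```sql
SELECT Spectrum.object AS nameOfObject, Spectrum.rv AS radialVelocity
FROM   Spectrum, Target
WHERE  Spectrum.targetID = Target.targetID
  AND  (Target.dec BETWEEN -30.0 AND -20.0)  -- illustrative Dec range
  AND  (Target.JMag BETWEEN 13.0 AND 14.0)
```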
and the columns returned in the output result set will be headed nameOfObject and radialVelocity. This useful facility can be misused by the unwary or the very stupid. For example, it would be possible to have a query which started "SELECT ra AS dec, dec AS ra", which could confuse the subsequent analysis of the extracted result set. In the particular case of extracting data in VOTable format from the GSA, the assignment of Unified Content Descriptors (UCDs) to columns - i.e. setting the values of the ucd attributes of the <FIELD> elements in the metadata section at the start of the VOTable document - is performed on the basis of the column name in the output result set, so a user swapping ra and dec, as in the example above, would end up with the wrong UCD for both columns, causing problems were that VOTable to be used in conjunction with other Virtual Observatory tools reliant on UCDs. For the moment, users wishing to have UCDs included in VOTable output from the GSA must not rename table columns, while, more generally, it is left to the user to ensure that any renaming they do perform is sensible (note: UCDs in the GSA have been initially assigned according to the UCD1 definitions).
It is also possible to write join queries in a way that makes it more explicit that they are joining the two tables, i.e. the example above becomes:
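A sketch with an explicit INNER JOIN clause (the Declination and magnitude limits are illustrative):

```sql
SELECT Spectrum.object, Spectrum.rv
FROM   Spectrum INNER JOIN Target
       ON Spectrum.targetID = Target.targetID
WHERE  (Target.dec BETWEEN -30.0 AND -20.0)  -- illustrative Dec range
  AND  (Target.JMag BETWEEN 13.0 AND 14.0)
```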
This is an inner join, meaning it only returns the (object,rv) tuples for matched rows, but there are other types of join, which return different combinations of data (see the SSA Cookbook for more details). Note that when joining N tables, there should usually be at least N-1 predicates in the WHERE clause, and more often than not the attributes used in those predicates will be identifiers used as a key in the referenced table. Sometimes it is possible to join on other attributes, although this can be rather inefficient (e.g. joining two catalogue tables on equality of RA and Dec within some tolerance, in other words a spatial cross-match, is terribly inefficient for large catalogues). If we feel that users are likely to make joins on attributes other than unique identifier keys, we will make indexes on those attributes to make the join execution more efficient.
The SQL Server dialect of SQL allows the construction of nested SELECT statements, in which the WHERE clause includes a subquery which is itself a valid SELECT statement. For example, a join on SpecFrame and Spectrum could be written in the following way:
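A sketch of such a nested query; the predicate on SpecFrame (a restriction to one GIRAFFE grating setting, assuming a column of that name exists) is a hypothetical illustration:

```sql
SELECT object, rv
FROM   Spectrum
WHERE  specFrameID IN
       (SELECT specFrameID
        FROM   SpecFrame
        WHERE  grating = 'HR15N')  -- hypothetical predicate
```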
The subquery generates a list of SpecFrameID values and matches between this and the SpecFrameID column of the Spectrum table are made by use of the IN operator.
This query could also be written using a second logical operator, ANY, as follows:
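A sketch of the ANY formulation (the restriction on SpecFrame is a hypothetical illustration):

```sql
SELECT object, rv
FROM   Spectrum
WHERE  specFrameID = ANY
       (SELECT specFrameID
        FROM   SpecFrame
        WHERE  grating = 'HR15N')  -- hypothetical predicate
```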
where the ANY operator is used to match rows in the Spectrum table with any of the rows in the output result set from the subquery which have the same specFrameID value. Note that in both subquery formulations the list of attributes in the SELECT clause of the subquery must be consistent with the rest of the WHERE clause in the main query, since they are to be matched by value.
These last two queries illustrate the equivalence of IN and = ANY, but care must be taken if the logical operator NOT is added to the outer query, so that it seeks matches with the complement of the set for which matches are sought in the original query. The operator which is equivalent to NOT IN is not <> ANY, as one might initially expect, but rather <> ALL - where ALL is another logical operator, which evaluates to TRUE if all of a set of comparisons are TRUE - which is clear when one thinks through the logic of that query, but perhaps not immediately apparent.
The most common use for subqueries is to express complicated joins in a simpler fashion. Up to 32 levels of nesting are allowed, in principle, although memory restrictions may prevent that from being achieved in practice. To evaluate a complicated join as a nested series of queries would often be much slower, but the database engine should recognise the equivalence of the two forms of the query and recast it in the most efficient form as part of its query optimisation process. So, there should be no difference in the performance of queries submitted as complicated joins or as nested subqueries, and the latter are to be preferred if they help ensure that the query that is executed really is the one that the user wanted to run.
Our discussion to this point has implicitly assumed that the values of the attributes corresponding to each column in every row in a database table are known. This need not always be the case, as a simple example from the GSA illustrates. The *AstroAnalysis* tables in the GSA merge atmospheric parameters and abundance determinations for stellar spectra. What happens if a given elemental abundance, e.g. [Ca/H], cannot be measured from a given spectrum? One answer would be a null value, which is a special type of entry to be included in a table if the value of a given attribute is not known (or is indeterminate or is not applicable) for a particular row. In designing the GSA we have decided not to use nulls in these cases, but to define default values for use in these situations instead: e.g. in the example above, we would set the abundance to be a recognisably meaningless value; in the case of floating point numbers, -0.9999995e9. Nulls and default values are semantically different: the query processor in a database management system (DBMS) recognises that a value marked as null is unknown or indeterminate and will not include it in, say, the computation of the mean value of all entries in a column, while, to the query processor, a default value is like any other value, and will include it, unless the user explicitly excludes it - e.g. by computing the mean abundance only for those objects with values higher than -0.9999995e9, in this case. Other default values in the GSA include -99999999 for 4- and 8-byte integer attributes, -9999 for 2-byte integers, and 0 for 1-byte (unsigned) integers. The schema browser generally indicates the default value for many of the table attributes, and they have all been chosen to lie well beyond the range of legitimate values found in the GSA, so it is simple to exclude them:
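For example, a sketch of a mean-metallicity query which excludes the floating point default value:

```sql
SELECT AVG(FeH)
FROM   RecommendedAstroAnalysis
WHERE  FeH > -0.9999995e9   -- excludes rows holding the default value
```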
As a result of defining default values for some columns, there have to be dummy rows in some tables (i.e. rows for which every attribute takes the appropriate default value). The reason for this is illustrated by consideration of the SpecFrame and Spectrum tables in the GSA. If the attribute specFrameID in the Spectrum table takes its default value because that particular spectrum's ancestor frame is missing or has not yet been ingested into the database for some reason, and if specFrameID references the attribute of the same name in table SpecFrame, then SpecFrame needs an entire row of defaults for specFrameID=-99999999 in order to maintain the referential integrity of the database.
It is the responsibility of the user to ensure that the presence of default values and dummy rows will not corrupt the results of queries run on the GSA, but our decision to use them, rather than nulls, greatly simplifies the logic involved in all queries run on the database. The inclusion of null values for an attribute means that an expression involving it can evaluate to TRUE, FALSE or UNKNOWN, and we believe that the simplification for users in avoiding this three-valued logic greatly outweighs the burden of having to remember that default values exist for some columns in the GSA.
Previous sections have described the basic SELECT…FROM…WHERE… structure of an SQL query. This is the basic syntax to be employed for querying the GSA, but there are some additional options in the SELECT clause which users may find useful. Once again, a fuller reference than is presented below is available online here.
SQL offers a number of useful aggregate functions, which can be used for deriving summary information on columns or selections from them.
The meanings of those mathematical aggregate functions which apply only to numeric columns are very straightforward: AVG, MAX, MIN, SUM, STDEV, STDEVP, VAR, and VARP return, respectively, the mean, maximum, minimum, sum, standard deviation, population standard deviation, variance and population variance of all values in the specified column. They can be used in conjunction with a WHERE clause or not, i.e.
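a sketch of such a query:

```sql
SELECT MAX(TEff)
FROM   RecommendedAstroAnalysis
```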
will return the maximum effective temperature found in the combined view of all astrophysical parameter analyses, while
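a sketch with a WHERE clause (the log(g) range is illustrative):

```sql
SELECT MAX(TEff)
FROM   RecommendedAstroAnalysis
WHERE  logG BETWEEN 4.0 AND 4.5   -- illustrative range
```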
returns the maximum value found for stars within the specified range of log(g).
N.B. Strictly speaking, these functions only apply to non-null values found within the particular column. As discussed above, the GSA contains no null values by design, but uses default values instead. These will not be ignored in the computation of aggregate functions. For example, the SQL query
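(a sketch, using the FeH column of the view named below)

```sql
SELECT MIN(FeH)
FROM   RecommendedAstroAnalysis
```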
returns the value -0.9999995e9, which is clearly nonsensical astrophysically, and just illustrates the unintentional selection of the dummy row in the RecommendedAstroAnalysis view.
There are additional aggregate functions which can be run on columns of all types. The most useful of these is COUNT, which can be used in a number of ways. The total number of rows in a table can be obtained using the following syntax:
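A sketch of such a query, using the Spectrum table as an illustration:

```sql
SELECT COUNT(*)
FROM   Spectrum
```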
Using COUNT(*) like this is a very good way of ensuring that a query is sensible before getting data returned by running it, and users are strongly encouraged to use this method to develop and debug SQL queries before running them on the full GSA. The performance overhead involved in this two-query process is not as high as it might seem, since, depending on how heavily the database is being used, some of the data from the COUNT(*) query will still be in cache when the query is run a second time to extract the desired attributes.
An interesting function specific to SQL Server’s SQL dialect is TOP, which is illustrated as follows. The query
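a sketch of such a query:

```sql
SELECT TOP 10 TEff
FROM   RecommendedAstroAnalysis
```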
would return the effective temperatures for ten rows in the RecommendedAstroAnalysis view. This will generally not be the ten highest values in the table; remember that SQL is a set-based language, so a query yields the set of rows satisfying the criteria specified in the query, but with, by default, no particular ordering within that set. The ten highest temperature values can be obtained using TOP, however, with the addition of an ORDER BY clause, which does impose an ordering of the rows in the result set: i.e.:
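a sketch of the ordered version:

```sql
SELECT TOP 10 TEff
FROM   RecommendedAstroAnalysis
ORDER BY TEff DESC
```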
Note the presence of the DESC (for descending) keyword in the ORDER BY clause, which is required because the default behaviour is for that clause to list rows in ascending order.
3.1.4 GROUP BY and HAVING
The GROUP BY clause allows aggregate functions to return more than a single value. For example:
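a sketch of such a query (the output column name numAnalyses is our own choice):

```sql
SELECT targetID, COUNT(*) AS numAnalyses
FROM   RecommendedAstroAnalysis
GROUP BY targetID
```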
returns a count of the number of astrophysical analysis items there are for each target identifier.
The following mathematical functions are supported by SQL Server's SQL dialect.
Arithmetic functions (such as ABS, CEILING, DEGREES, FLOOR, POWER, RADIANS, and SIGN) return a value having the same data type as the input value, while trigonometric functions and others (such as EXP, LOG, LOG10, SQUARE, and SQRT), cast their input values to float and return a float value; this probably is of no concern to the average GSA user. All mathematical functions, except for RAND, are deterministic functions - i.e. they return the same results each time they are called with a specific set of input values - and RAND is deterministic only when a seed parameter is specified.
An operator in SQL is a symbol specifying an action that is performed on one or more expressions. For the present purposes, their major use is to provide greater flexibility in the possible forms of WHERE clauses of queries and in the columns of the result sets they can produce, which need not simply be columns drawn from the table(s) being queried. There are several classes of operator to consider.
3.3.1 Arithmetic operators
SQL Server's SQL dialect supports five arithmetic operators: the four basic ones - Addition (+), Subtraction (-), Multiplication (*) and Division (/) - plus the Modulo operator (%), which returns the remainder on dividing one integer by another, and is used in the format "dividend%divisor". The use of the four basic arithmetic operators follows straightforwardly from their definitions.
3.3.2 Comparison operators
The Boolean comparison operators are used most frequently to filter rows via the WHERE clause of a SQL query. The simplest comparison operators (<, >, =) were used above without introduction, but there are a total of nine comparison operators which can be applied to pairs of expressions in the SQL Server dialect of SQL: = (Equal to); > (Greater than); < (Less than); >= (Greater than or equal to); <= (Less than or equal to); <> (Not equal to); != (Not equal to); !> (Not greater than); and !< (Not less than).
3.3.3 Logical operators
In a similar fashion, we have used a number of the logical operators (e.g. AND, BETWEEN, etc) above without introduction, but the full list of logical operators supported by SQL Server is: ALL, AND, ANY, BETWEEN, EXISTS, IN, LIKE, NOT, OR and SOME.
The use of some of these operators (e.g. AND, BETWEEN, OR) has been illustrated above, but it is worth making a few comments on the remaining ones.
The LIKE operator is used for pattern matching. This is most commonly used for string matching. For example, a user interested in knowing how many open cluster target SpecFrames are present in the GSA could issue the query:
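A sketch of such a query; the pattern shown is a hypothetical illustration, since the exact object naming convention should be checked in the Schema Browser:

```sql
SELECT COUNT(*)
FROM   SpecFrame
WHERE  object LIKE '%oc%'   -- hypothetical open cluster name pattern
```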
Note the use of the percentage sign wildcard character which matches all the object name strings defined for GES. Several other wildcard characters can be used in conjunction with LIKE; these are described further in the SSA Cookbook, as are further examples of logical operators.
The plus sign, +, is used as a string concatenation operator in the SQL Server dialect of SQL. This is most likely to be of use to GSA users in the formatting of result sets - i.e. in the definition of SELECT clauses.
3.3.5 Unary operators
The SQL Server dialect of SQL defines three unary operators - i.e. operators which have only one operand - although none of these are likely to be of much use to most GSA users. The first of these is the positive unary operator, +, which is used principally in SQL statements not allowed by the GSA query interface. The second, -, is the negative unary operator, which returns the negative value of a numeric expression, as shown in the following query:
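a sketch of such a query, returning the negated radial velocities from the Spectrum table:

```sql
SELECT object, -rv
FROM   Spectrum
```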
The final unary operator, ~, the Bitwise NOT operator, converts each bit in a numeric expression of integer data type into its ones complement (i.e. 0s are changed to 1s and vice versa).
3.3.6 Operator precedence
The operators described in this subsection have the following descending levels of precedence:
1. ~ (Bitwise NOT)
2. * (Multiplication), / (Division), % (Modulo)
3. + (Positive), - (Negative), + (Addition), + (Concatenation), - (Subtraction)
4. Comparison operators (=, >, <, >=, <=, <>, !=, !>, !<)
5. NOT
6. AND
7. ALL, ANY, BETWEEN, IN, LIKE, OR
When two operators in an expression have the same operator precedence level, they are evaluated left to right based on their position in the expression.
The GSA Target table is cross-matched to a number of external survey catalogue datasets held locally, e.g. the 2MASS and VHS catalogues, and the all-sky SSA legacy catalogue generated from digitisation of the Schmidt photographic surveys (for a complete list, click on "Browser" on the navigation bar or see below). Rather than prejoining any two datasets to create a static, merged table of what are assumed to be associated sources on the basis of a fixed joining algorithm, and choosing a subset of what are assumed to be useful attributes from each of the two catalogue tables to propagate into that merged table, the philosophy is to create a table of pointers between any two matched datasets. This means, for example, that any externally catalogued source coincident with or near a GES target is readily available, out to some predefined maximum angular search radius, and all sorts of science usages are possible because the exact matching criteria (e.g. maximum allowable displacement, consistency in morphological parameters or classification, or even the number of possible external matches for a given GES source) can be tuned at query time by simple expressions in SQL. Furthermore, all attributes of both datasets are available for querying, because no decision has been made as to which attributes to propagate into a static merged set. The flip side to this flexibility is the rather opaque syntax for querying cross-matched data in SQL: instead of querying a single merged table, e.g. SELECT * FROM MergedTable WHERE ..., in general you have to query three tables: the two cross-matched tables and the table of pointers, e.g. SELECT * FROM Table1, ExternalDB..Table2, CrossNeighboursTable2 WHERE ..., using the joining techniques discussed above to select out the rows that you require.
Some real-world examples best illustrate how to query cross-matched data in the GSA, but first we discuss the naming of GSA objects (databases, tables and attributes) that pertain to catalogue joining. Note the units of spherical coordinates in the GSA: these are almost always decimal degrees for both RA and Dec. Where preferred, you can easily convert units in selections or WHERE clauses: e.g. SELECT ra/15.0 AS raHours ..., or WHERE (ra/15.0) > 23.0 ... etc.
In GSA parlance, a table of pointers that associates a set of externally catalogued sources to the table of GES targets is called a cross-neighbour table. The naming convention is simple: cross-neighbour tables are named by concatenating the two associated table names with an X, e.g. the cross-neighbour table between the GES Target table and the 2MASS point source catalogue is called TargetXtwomass_psc. Within the cross-neighbour tables, there will always be the following three attributes: masterObjID, which points to the source for which neighbours have been created (the central source of the neighbourhood, if you like); slaveObjID, which points to the neighbouring sources (sources lying within the neighbourhood); and finally distanceMins which is the angular separation between the central source and the neighbour, in units of arcminutes. Depending on the external catalogue joined, there may be other attributes in cross-neighbour tables (use the Browser to examine these).
At the time of writing, external catalogue databases held and cross-matched with GES targets consist of:
Alternatively, you can use the GSA Browser to look at all the neighbour tables, their attributes and also the external databases that are available in the GSA.
Suppose a user wishes to select the identifiers and co-ordinates of all 2MASS point-source catalogue sources that are within 6 arcsec of a GES target. The SQL to do this is as follows:
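A sketch of such a query; the external database name TWOMASS and the 2MASS identifier column pts_key are assumptions to be checked against the Schema Browser:

```sql
SELECT ges.targetID, ges.ra, ges.dec,
       twomass.ra, twomass.dec
FROM   Target AS ges,
       TWOMASS..twomass_psc AS twomass,   -- assumed external DB name
       TargetXtwomass_psc
WHERE  ges.targetID = TargetXtwomass_psc.masterObjID
  AND  twomass.pts_key = TargetXtwomass_psc.slaveObjID  -- assumed 2MASS key
  AND  TargetXtwomass_psc.distanceMins <= 0.1
```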
Note: i) the use of table aliases ges and twomass which are a convenient short-hand; ii) the use of attribute prefixes (e.g. ges.ra) to distinguish between attributes in different tables that happen to have the same name; iii) the table joining condition in the WHERE clause, which selects associated rows (if you omit this condition, you will get all rows of each table joined to all other rows, i.e. an extremely large and useless dataset!); and finally iv) the specification of a maximum radius of 0.1 arcmin (=6 arcsec) for this query, where the maximum available for 2MASS joins is 10 arcsec. When selecting nearby cross-matches, users should note that one or more rows may result for each master object, since there may be more than one slave neighbour within the neighbourhood defined by the maximum join criterion specified. If you want the nearest match only, then the next section explains how to do this.
Suppose a user wishes to select the GES/2MASS cross-matches as detailed in the previous example, but taking the nearest match in each to a limit of 2 arcseconds. The following query will do the job:
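A sketch of such a query (the external database name and 2MASS identifier column are assumptions to be checked against the Schema Browser); 2 arcsec corresponds to 2.0/60.0 arcmin in the units of distanceMins:

```sql
SELECT ges.targetID, twomass.ra, twomass.dec, distanceMins
FROM   Target AS ges,
       TWOMASS..twomass_psc AS twomass,   -- assumed external DB name
       TargetXtwomass_psc AS x
WHERE  ges.targetID = x.masterObjID
  AND  twomass.pts_key = x.slaveObjID     -- assumed 2MASS key
  AND  distanceMins <= 2.0/60.0           -- 2 arcsec limit
  AND  distanceMins =
       (SELECT MIN(distanceMins)          -- closest neighbour only
        FROM   TargetXtwomass_psc
        WHERE  masterObjID = x.masterObjID)
```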
Note the following: i) here, we have used table aliases as attribute prefixes only where necessary (i.e. where attributes are not unambiguously identified by their names alone); and ii) a subquery is used to select the closest neighbour in each case, and the main query selects this nearest object if it is within the specified 2 arcsec maximum radial tolerance.
In general, this cross-neighbour query is likely to be the most useful one, since most catalogue cross-match applications simply assume that the nearest association is most likely to be the correct one in each case.
WFAU, Institute for Astronomy,