An Astronomy Data Query Language cookbook to accompany Gaia Data Release 1

Fundamentals of relational databases and structured query language

Structured Query Language (SQL) is the lingua franca of relational database management systems (RDBMS). A relational database consists simply of tabular sets of data where relationships exist between the tables: the data are organised in a relational way for ease of maintenance and efficiency of storage. Moreover the relational organisation presents a logical arrangement to the user of the data. Astronomical Data Query Language (ADQL) is a specialised variant of SQL developed for use with the proliferation of astronomical datasets available within the burgeoning Virtual Observatory on the Internet. ADQL standardises queryable resources and extends the functionality of SQL in an astronomical context. Standardisation is needed because although there are existing SQL standards, various open-source and commercial relational database management systems (MySQL, PostgreSQL, Microsoft SQL Server, Oracle DB, ...) do not always implement the SQL standard in exactly the same way syntactically. Extensions are required because certain common astronomical constructs (notably spherical polar coordinate systems and geometries) are not easily expressed in standard SQL.

SQL basics

SQL is not a programming language as such especially when limited, for example, in a read-only context via a web form accessing a static dataset stored in an RDBMS. In such circumstances the end-user is limited to SELECT ... FROM ... WHERE ... queries but these can get quite sophisticated when combining (joining) row sets from different tables, or when creating "virtual" row sets using subquery constructs in place of explicit table names. SQL is set-based in that the statement provided by the end-user defines the set of properties of the records to be returned from the stored data rather than a list of operations to be executed in order to deliver those records - at first sight this can be a little confusing for someone familiar with procedural programming. There are three main types of SELECT statement:

Examples of all three are introduced below with reference to the Gaia database tables. A good general introduction to SQL is given by SQL in a nutshell (O'Reilly 2008); there are also good tutorials available online. A more theoretical and algebraic description of the relational logic that underpins SQL is provided by Wikipedia.

ADQL implementation

ADQL simply defines just one operational statement, not surprisingly it is a SELECT:

SELECT [TOP setLimit] selectList FROM fromClause 
[WHERE condition(s)] [GROUP BY columnExpression(s) [HAVING groupCondition(s)]] [ORDER BY column(s)]

The Gaia Archive as a relational database

The Gaia Archive presents an interactive User Interface for web browsers (hereafter the Gaia archive) and a programmatic Table Access Protocol web service that feature an ADQL interface at their core. Behind this interface is a set of tables in an RDBMS, e.g.

To browse the tables and their contents, simply click on the "Search" tab in the Gaia archive, then the ADQL form sub-tab, and explore via the tree-view in the panel on the left hand side. Note that the Gaia archive contains several databases to track different Gaia releases and public non-Gaia data (e.g. from cross-matched surveys). Gaia DR1 data are held in database "gaiadr1" while the other public datasets are contained in database "public". Names of tables containing Gaia data should be prefixed with "gaiadr1" while the others should be prefixes with "public".

Here is a simple example ADQL projection limited to the first 10 rows:

SELECT TOP 10 source_id, parallax FROM gaiadr1.tgas_source

and here is a simple selection:

SELECT TOP 10 * FROM gaiadr1.tgas_source WHERE dec < 0 ORDER BY parallax DESC

while here is a simple join query:

SELECT COUNT(*) FROM gaiadr1.tgas_source AS t, public.igsl_source_catalog_ids AS i WHERE t.source_id = i.source_id

N.B. by definition an unpredicated join yields all valid combinations of the two (or more) row sets in the FROM list so it is essential to predicate join queries to associate the related rows. In general, a join of N tables will require at least N-1 predicates in order to do achieve this. If your join query on N tables has less than N-1 predicates on attribute(s) that relate the rows then be prepared for a long wait and a large results set containing useless rows of unrelated combinations. Indeed, a good way of validating a join query is to use COUNT(*) initially to check if the predicates are correct and return a sensible row count before making the attribute selection itself. If join predicates are missing or incorrectly expressed, then the row count is likely to be huge, always assuming that the query doesn't time-out (which is highly likely). In this example rows from the two tables are associated by the unique identifier key that is common to both (source_id).

Given the above primer in set-based SQL/ADQL, it should now be clear why the row sets returned by this query:

SELECT TOP 1 'Hello world!' AS greeting, ra, dec FROM gaiadr1.tgas_source

are as they are with and without the TOP 1 and/or the column projection of ra and dec.

To get the best out of the archive at DR1:

Note also the following:

The implications of NULL

In tabulated Gaia data it is not always the case that every attribute in every row has a value available. For example, in DR1 only the Tycho subset of sources has full five-parameter astrometric solutions yielding proper motion and parallax. In the Gaia archive database this is dealt with using null values, where null is a special type of entry included in a table if the value of a given attribute is not known (or is indeterminate or is not applicable) for a particular row. The query processor in an RDBMS recognises that a value marked as null is unknown or indeterminate and will not include it in, say, the computation of the mean value of all entries in a column, or in selections predicated on, say, a range of values for that column. The inclusion of null values for an attribute means that an expression involving it can evaluate to true, false or unknown. While in many (most?) circumstances this is unlikely to be an issue and indeed would be what the end user would want and expect, for complex queries it should be borne in mind when checking the results especially if the row count is unexpectedly low or high or the results set itself is missing one or more expected records. Consult Wikipedia for a detailed discussion on NULL and three-valued logic in an SQL context (the section on comparisons is particularly relevant and useful). Note that NULL can be checked for explicitly in ADQL:

SELECT COUNT(*) FROM gaiadr1.gaia_source WHERE parallax IS NULL

and

SELECT COUNT(*) FROM gaiadr1.gaia_source WHERE pmra IS NOT NULL

for example, but normally this should not be necessary.

Fully qualified attribute identifiers

ADQL allows a fully-qualified identifier string for databases, tables and fields that follows from the underlying SQL convention whereby full stops are used, i.e. database_name.table_name.attribute_name. For example in the Gaia archive:

SELECT gaiadr1.tgas_source.ra FROM gaiadr1.tgas_source 

This is unnecessarily verbose in that there is no need to specify database nor table names in the SELECT clause because there is no ambiguity when processing the query (table identifiers must always be fully qualified as above). If however a query joins two tables which contain the same attribute name for example, then it is necessary to qualify the identifier to the level that removes any ambiguity. In general, fully qualifying attribute identifiers may be necessary when joining tables within one database or when joining tables across different databases. An example of the latter is if a user uploads data to the Gaia archive, thereby creating a table in a database identified by their login name. Column names ra, dec and source_id are particularly common amongst the public database tables, so any selections involving those in a join query would need to specify from which table the selection should be made. For examples of this see queries below.

ADQL reference

A full technical specification of ADQL is available online as part of the specification and reference documentation developed under the auspices of the International Virtual Observatory Alliance. Here we present a summary of the most important features and excluding any that are not implemented in the Gaia archive (the ADQL standard itself does not require strict adherence to the entire specification so a given service need not implement everything within it).

Query syntax

The query syntax is as follows (taken from IVOA Astronomical Data Query Language, version 2.0, 30th October 2008):

SELECT [ ALL | DISTINCT ]
  [ TOP unsigned_integer ]
  { * | { value_expression [ [AS] column_name ] }, ... } 
FROM {
  { table_name [ [AS] identifier ] | 
  ( SELECT ....) [ [AS] identifier ] |
  table_name [NATURAL] [ INNER | { LEFT | RIGHT | FULL [OUTER] } ] JOIN table_name
  [ON search_condition | USING ( column_name,...) ] } 
  , ...}
[ WHERE search_condition ]
[ GROUP BY column_name, ... ]
[ HAVING search_condition ]
[ ORDER BY { column_name | unsigned_integer } [ ASC | DESC
  ],...]

The query returns a subset of the table(s) specified where the column ordering is as specified in the SELECT expression, or as in the tables if an asterisk is specified to select all columns. The row ordering is arbitrary unless ORDER BY is specified. As described previously, TOP N will limit the results set to N rows (again an arbitrary set unless ORDER BY is specified).

ADQL supports table subqueries and joins, both implicit and explicit. Table subqueries can appear as part of one or more WHERE predicates (e.g. using IN or BETWEEN) and/or in the FROM clause when building derived, transient row sets for convenience (see later for examples of these features).

ADQL supports explicit joins of types INNER (the default for JOIN ... ON and when joining implicitly via comma-separated entities in the FROM clause) and OUTER (LEFT, RIGHT and FULL). For example, the simple join query

SELECT COUNT(*) FROM gaiadr1.tgas_source AS t, public.igsl_source_catalog_ids AS i WHERE t.source_id = i.source_id

can be written explicitly as an inner join:

SELECT COUNT(*) FROM gaiadr1.tgas_source AS t INNER JOIN public.igsl_source_catalog_ids AS i ON t.source_id = i.source_id

ADQL search filters can be part of the WHERE, JOIN and HAVING clauses and consist of conditions separated by the standard logical operators AND, OR and NOT. Standard arithmetic comparators (=, !=, <, >, <>, <=, >=) are supported (note == is not used for equality). Additional SQL functions BETWEEN, LIKE, NULL and EXISTS are supported. Standard mathematical operators are supported: +, -, *, / but note that there is no ** exponentiation - use the POWER function instead (see below). String concatenation || and wildcards % (in conjunction with LIKE) are supported.

ADQL trigonometric functions (arguments or results in radians) and associated conversions:

ADQL exponentiation and logarithmic functions:

Truncation and rounding in ADQL:

Other ADQL functions:

Geometric functions

For the purposes of queries involving geometric constructs in spherical polar coordinate systems that are common in astronomy applications ADQL specifies a number of geometric extension functions to SQL. A complete list of those functions currently defined is given in the full ADQL specification. This section describes those implemented in the Gaia archive.

The geometrical functions fall into the following broad categories:

The functions are best described by way of simple examples (see also the example queries in the following main section). Note that all angular coordinate arguments are in degrees.

Data type functions

POINT(coordsys, longitude, latitude) specifies the simplest possible geometry, that of a point on the celestial sphere, in a coordinate system specified by coordsys. This coordinate system string follows the format as defined in Space Time Coordinate Metadata for the Virtual Observatory for full flexibility but for standard equatorial coordinates at equinox J2000.0 simply specify 'ICRS'.

BOX(coordsys, longitudeCentre, latitudeCentre, longitudeExtent, latitudeExtent) specifies a rectangular polygon via the string coordsys (see below), the central longitude and latitude, and extent.

For example:

BOX('ICRS', 0, 0, 10, 1)

defines a 10 square degree rectangular strip along the celestial equator across the South Galactic Cap. This combines with utility and predicate functions as follows when used in an ADQL query:

SELECT ra, dec FROM gaiadr1.tgas_source WHERE CONTAINS(POINT('ICRS', ra, dec), BOX('ICRS', 0, 0, 10, 1)) = 1

which selects all TGAS objects in the rectangular region so defined. Note that the interpretation of BOX is as described in the specification for literal values, but when using column names the BOX sides are great-circles.

CIRCLE(coordsys, longitudeCentre, latitudeCentre, radius) specifies a circular region on the celestial sphere centred at the given coordinates and with the given radius in degrees. In all other respects it functions like BOX.

POLYGON(coordsys, longitude_1, latitude_1, ..., longitude_N, latitude_N) specifies an arbitrary polygon with vertices as given by the coordinate pairs in coordinate system coordsys. For example

SELECT ra, dec FROM gaiadr1.tgas_source WHERE CONTAINS(POINT('ICRS', ra, dec), POLYGON('ICRS', 0, 0, 10, 0, 0, 10)) = 1

selects all TGAS sources in the (spherical) triangle bounded by the great circles connecting coordinates [(0, 0), (10, 0), (0, 10)].

Predicate functions

CONTAINS(geometry_1, geometry_2) is a boolean-valued function used to determine if geometry_1 is wholly contained within geometry_2. It is most commonly used to test if a point is contained within (or is on the boundary of) a shape as illustrated above, but can be employed for any two arbitrary geometries. In the ADQL WHERE predicate the function should be compared to 1 (= true, i.e. does contain) or 0 (= false, i.e. does not contain).

INTERSECTS(geometry_1, geometry_2) is a boolean-valued function used to determine if geometry_1 intersects with geometry_2 and can be employed to test of two geometries overlap. Once again, in the ADQL WHERE predicate the function should be compared to 1 (= true, i.e. does intersect) or 0 (= false, i.e. does not intersect).

Utility functions

AREA(geometry') returns the area of the given geometry in square degrees. For example

SELECT TOP 1 AREA(BOX('ICRS', 0, 0, 1, 1)) AS areaInSqDegs FROM gaiadr1.tgas_source

demonstrates the use of this function in returning the area of the BOX geometry specified.

DISTANCE(pointGeometry_1,pointGeometry_2) returns the great circle distance between the two points on the celestial sphere specified by the two geometry arguments pointGeometry_1 and pointGeometry_2. The great circle arc length is returned in degrees. For example:

SELECT TOP 1 DISTANCE(POINT('ICRS', 0, 0), POINT('ICRS', 1, 1)) AS dist FROM gaiadr1.tgas_source

COORD1(POINT(...)) extracts the first coordinate of the pair in the POINT geometry argument (COORD2(...) extracts the second). COORDSYS(...) extracts the coordinate system string from the given geometry. These final three utility functions are unlikely to be used generally.

Note that database table column references are allowed for coordinate pairs in all geometry functions, not just POINT().

Some common SQL syntax that will not work (or is unimplemented in Gaia archive TAP) ADQL

The following currently do not to work in the Gaia archive:

Common syntax and other errors and how to avoid them

Ten example ADQL queries

This section expresses in ADQL ten of the usage scenarios presented in "Gaia data access scenarios summary" (hereafter AB-026) to illustrate some of the features of the language when querying the DR1 database. Gaia Data Access Scenario (GDAS) codes are references to the scenario labels used in that document.

I want data of all objects contained in a rectangular/circular region centered on a given sky position (GDAS-BR-07)

This is straightforwardly done using the relevant ADQL geometry functions (note angular arguments are in degrees, not the radians default of most SQL trigonometric functions!). For example, to select all sources within 1 degree of central position 3h45m, +24 (J2000), i.e. the centre of the Pleiades star cluster, use the following ADQL:

SELECT ra,dec
FROM gaiadr1.gaia_source
WHERE CONTAINS(POINT('ICRS', ra, dec), CIRCLE('ICRS', 56.25, 24.0, 1.0)) = 1

Notes:

  • For a rectangular region selection, simply replace CIRCLE(...) with for example BOX('ICRS', 56.25, 24.0, 1.0, 2.0) for a 1 degree x 2 degree rectangle centred at the same position
  • Strictly speaking, any reference system (e.g. as specified in these geometric functions via 'ICRS') should also specify a reference position (see for example Space Time Coordinate Metadata for the Virtual Observatory). Available options include GEOCENTER (centre of the Earth) and BARYCENTER (barycentre of the solar system) amongst others. These are simply specified via 'ICRS BARYCENTER' etc. however note that such reference systems are currently ignored by the archive. No on-the-fly conversion is available and operations are made in the provided coordinates so ensure these correspond to the system specified for coordinates stored in the archive.
  • This example is relevant also to the following scenario from AB-026: GDAS-ST-06

I liked the statistical plots presented in Section 3 of Volume 1 of the Hipparcos and Tycho Catalogues. Show me the same for Gaia and allow me to specify the statistic to explore (GDAS-BR-06)

Embedded within the Gaia sourceID attribute used as a unique identifier for all detected sources is a level-12 Hierarchical Equal Area iso-Latitude Pixelization (HEALPix) index value, and this can be conveniently employed to form statistical aggregates in ADQL using "... GROUP BY..". The HEALPix index uses bits 36 and upwards in sourceID and dividing that attribute by 2^35 extracts the relevant information. The trick is to note that for practical visualisation purposes the number of distinct sky pixels afforded by level-12 HEALPix indexation (around 200 million) is too high but this can be degraded straightforwardly to a practical yet still useful level by dividing by further factors of 4 per level. For example, degrading to level 7 (just under 200000 distinct sky pixels) which is 5 powers of 4, or 10 of 2, we require to divide sourceID by 2^45:

SELECT source_id/35184372088832 AS hpx7, COUNT(*) AS nsrc
FROM gaiadr1.gaia_source
WHERE phot_g_mean_mag BETWEEN 16.5 AND 17.0
GROUP BY hpx7

Notes:

  • Options other than COUNT(*) for statistical aggregates are
    • MAX() - to get the maximum value (of the attribute specified in the argument) amongst the group, e.g. MAX(...)
    • MIN()
    • AVG() - to get the average value amongst the group, e.g. AVG(SQRT(pmra*pmra + pmdec*pmdec))
    • SUM()
  • Unfortunately there is no provision for higher order nor robust statistics in ADQL (e.g. the absence of a median aggregate is particularly limiting).
  • This example is relevant also to the following scenario from AB-026: GDAS-BR-15

I want astrometric and/or photometric and or/spectroscopic measurements of a specific type of source ... but possibly for each and every epoch of observation / Information about flux variation and position among objects with multi-epoch photometry (GDAS-EG-09 & GDAS-EG-10)

For the current data release (DR1) these two scenarios can be illustrated by example queries on the Ecliptic poles scanning datasets which supplement the main Gaia catalogue releases. They also illustrate the principles of joining two related tables in ADQL. For example, suppose a user is interested in Cepheid variables including statistical parameters pertaining to their variability, along with time-resolved photometry - here is the query:

SELECT p.*, c.*
FROM gaiadr1.cepheid AS c, gaiadr1.phot_variable_time_series_gfov AS p
WHERE p.source_id = c.source_id
ORDER BY p.source_id, p.observation_time

Notes:

  • The FROM clause specifies the two tables containing the relevant information (table of Cepheids, and table of photometric data from all photometrically variable sources including Cepheids) while the WHERE clause filters associated records via the unique source identifier
  • the results set will consist of a concatenation of time-resolved measurements for each source; the ORDER BY clause ensures that each distinct source's measurements appear sequentially in the results set, in time order within each source set.
  • In DR1, the other specific type of variable source that can be queried in this convenient way are RR Lyraes (table rrlyrae)

Display the information for a given source for variability analysis... (GDAS-ST-25)

In DR1, variability information is available in the Ecliptic Pole Scanning Law (EPSL) data supplements. Suppose (for illustrative purposes) we wish to analyse the light curve of a specific source selected by high amplitude variability: this can be done using a sub-query to pick out the specific source identifier as follows:

SELECT observation_time, g_magnitude
FROM gaiadr1.phot_variable_time_series_gfov AS p, (
    SELECT TOP 1 source_id
    FROM gaiadr1.phot_variable_time_series_gfov_statistical_parameters
    ORDER BY range DESC
) AS t
WHERE p.source_id = t.source_id
ORDER BY observation_time

Notes:

  • This example selects the light curve of the highest amplitude variable recorded
  • The results set consists of time-stamped (Barycentric JD) magnitudes in each row, ordered on increasing time

I want all quasars observed so far with Gaia to be plotted on the celestial sphere (GDAS-ST-08)

A selection of around 170,000 sources made via the (ground-based) identification of QSOs in the Initial Gaia Source List is possible in DR1 using the following join between the IGSL and the main source catalogue:

SELECT g.ra, g.dec
FROM gaiadr1.gaia_source AS g, public.igsl_source AS i
WHERE aux_lqrf = 'true' AND g.source_id = i.source_id 

Notes:

  • boolean (logical) literals can be specified as above or alternatively '1' , 'y', 'yes', 't' or 'on' for true (or conversely '0', 'n', 'no', 'f', 'off' or 'false')
  • The results set can be displayed in topcat in a variety of projections i.e. 3d spherical, Aitoff or Plate Carree
  • The Gaia archive also contains a much smaller list of the ICRF2-defining quasars in table aux_qso_icrf2_match which may be useful in this context
  • This example is relevant also to the following scenario from AB-026: GDAS-EG-06

I want any measurements of all stars except Cepheids (GDAS-GA-05)

This is trivial if it is assumed that the table of Cepheids derived from variability information is in some useful sense complete. In DR1 care must be taken since variability content is confined to the EPSL regions, so:

SELECT ra, dec, phot_g_mean_mag
FROM gaiadr1.gaia_source 
WHERE (
  (ra BETWEEN 268.74298 AND 271.25702 AND dec BETWEEN 66.061389 AND 67.061389)
  OR (ra BETWEEN 88.74298 AND 91.25702 AND dec BETWEEN -67.061389 AND 66.061389)
  )
  AND source_id NOT IN (SELECT source_id FROM gaiadr1.cepheid)

Notes:

  • The first two predicates limit this query to the North and South Ecliptic Poles at 18:00:00+66:33:41 and 06:00:00-66:33:41 respectively, which is the extent of variability information in DR1
  • With over 4 million rows in the results set this query exceeds the limits for download currently set in the Gaia archive for non-registered users
  • This example is relevant also to the following scenario from AB-026: GDAS-ED-03

I want to find stars with exceptional properties, i.e. those stars that are outside a certain statistical range. E.g. extreme proper motion, extreme variability, ... (GDAS-ST-04)

This kind of scenario can be approached via two ADQL queries, the first to make an appropriate aggregate statistical summary to inform the second which simply selects the unusual objects as being outlying on the statistical distribution. Some statistical summaries are available in the Gaia archive pages (click on the "Statistics" tab), but histogram counts on arbitrary attributes are particularly useful and easy in ADQL as follows:

SELECT ROUND(SQRT(pmra*pmra + pmdec*pmdec), 0) AS muTot, COUNT(*)
FROM gaiadr1.tgas_source
GROUP BY muTot
ORDER BY muTot DESC

Here the ROUND function (which returns the nearest double value that is equal to a mathematical integer) is used to quantise the total proper motion in units of milliarcseconds per year. In conjunction with the COUNT(*) aggregate and GROUP BY, a results set is returned that usefully summarises millions, or even billions, of rows in order that the extremes of the distribution can be studies. The results set from the above can be plotted up as a histogram with bars of width 1 mas/yr, but simply specifying descending order presents the highest proper motion objects at the top of the results set: it can be seen that there are a few dozen objects with total proper motion of several arcseconds/yr. These can of course then be studied in more detail via:

SELECT *, SQRT(pmra*pmra + pmdec*pmdec) AS muTot
FROM gaiadr1.tgas_source
WHERE SQRT(pmra*pmra + pmdec*pmdec) > 3000

which will list the full records of sources having proper motions greater than 3 arcsec/yr.

Notes:

  • An alias defined in the SELECT clause (e.g. muTot in this example) can be used in the aggregate clauses but not in the WHERE clause.
  • groups with zero counts do not appear in the results set of course: care should be taken when handling the data as it might be easy to overlook this and assume a uniform distribution of bins.
  • For aggregate functions other than COUNT, see the second example above.
  • You can further filter results sets on properties of the groups using a HAVING clause, e.g. "... GROUP BY muTot HAVING COUNT(*) > 10 ..." would remove bins with counts of <= 10.
  • This example is relevant also to the following scenario from AB-026: GDAS-OA-15

Linear combinations of quantities should be queryable (GDAS-ST-05)

This is straightforward in ADQL: a filter predicate in the WHERE clause is not limited to simple mathematical equalities involving literal constants. For example, to select how many TGAS sources have a total proper motion measured with greater than 5σ significance (cf. Hambly et al. (2008) Section 4.2.2):

SELECT COUNT(*)
FROM gaiadr1.tgas_source
WHERE pmra*pmra + pmdec*pmdec > 
           5*SQRT(pmra*pmra*pmra_error*pmra_error + pmdec*pmdec*pmdec_error*pmdec_error)

Notes:

  • For simplicity the above query assumes propagation of errors with no account taken of covariance between the combined astrometric parameters - this is inadvisable (for further information see Section 5.1 of the DR1 astrometry paper)
  • WHERE expressions can be linear combinations of any appropriate table attributes, or in fact arbitrarily complex mathematical expressions involving all the usual mathematical operators and common functions.
  • Be aware that sometimes the expression may not evaluate, e.g. if a quantity is unavailable and null.

Provide a list of point-like sources which have absolute proper motions and parallaxes consistent with zero (GDAS-EG-05)

This is similar to the previous example but with an additional filter predicate in the WHERE clause. This time with 1σ significance criteria on both total proper motion (assumed "absolute" in some sense) and absolute parallax (Gaia parallaxes are certainly absolute by design):

SELECT TOP 10 *
FROM gaiadr1.tgas_source
WHERE pmra*pmra + pmdec*pmdec <
            SQRT(pmra*pmra*pmra_error*pmra_error + pmdec*pmdec*pmdec_error*pmdec_error)
    AND parallax < parallax_error AND (astrometric_excess_noise < 2 OR astrometric_excess_noise_sig <= 2)

Notes:

  • Here we limit the results set to the first ten rows for illustration only
  • Astrometric error filters should be used in conjunction with the "excess noise" attributes to allow for the presence of systematic errors (see, for example the Gaia Astrometric Global Iterative Solution paper or the online documentation).
  • Once again no account is taken here of error correlation

Examine the properties of Gaia data in detail without querying the whole Gaia database (GDAS-OA-17)

There are a number of ADQL tools and techniques that are relevant here, but note that the main Gaia source catalogue has an attribute random_index that allows selection of subsamples. Hence it is possible extract detailed properties of a random subsample without downloading the entire source list as follows:

SELECT *
FROM gaiadr1.gaia_source
WHERE MOD(random_index, 1000000) = 0

which randomly selects one out of every million objects.

Notes:

  • the ADQL MOD(x,y) function returns the remainder after division of x by y
  • queries of this type are deterministic in that if you execute the query twice you will get the same results set (although not necessarily in the same row order - remember SQL does not guarantee the same ordering unless you explicitly specify ORDER BY ...) because random_index was precomputed (i.e. populated once) when the database was set up
  • see also GDAS-BR-06 and GDAS-ST-04 above
  • This example is relevant also to the following scenario from AB-026: GDAS-ST-10

Crossmatch queries using precomputed neighbour tables

The Gaia archive includes a set of precomputed crossmatch tables that provide cross-identification of Gaia sources with those from other large-scale surveys. The approach taken in creating these "neighbour" tables reflects the inevitable complications in making a reliable identification of the counterpart of each Gaia source in the generally ground-based, lower angular resolution surveys observed at a different epoch. Extensive online documentation is available within the Gaia archive but briefly a neighbourhood is defined around each Gaia source and all potential matches from the external catalogue are listed in the neighbourhood table. In addition one of the sources in the neighbourhood is identified as the most likely match in each case, where "most likely" considers not only positional proximity but other aspects, and an additional separate table of best neighbour matches is also created. In general the neighbourhood table has an optional many-to-one relationship with the Gaia source table in that in general there may be none, one or many potential matches in the external catalogue for a given Gaia source. The neighbour table has an optional one-to-one relationship. Many usage scenarios will utilise the best neighbour list, but others may examine the content of the neighbourhood to make their own specialised decision on which counterpart to use. Attributes provided to aid in this decision include the angular distance (proximity), incorporating proper motion correct where that is possible, and also the number of external matches for the given Gaia source along with the number of "mates" that each external source has in the Gaia catalogue. This allows, for example, assessment of the reliability of a given cross-identification at query time.

Survey datasets held locally within the Gaia archive for crossmatch are generally large (millions of rows or larger) or catalogues having special relevance to astrometry. The following catalogues have precomputed neighbour tables in DR1:

Note that the set of source attributes included for a given external catalogue is generally only a subset of those available in the original catalogue.

The table design of the neighbour tables follows a template in which

and the angular distance and error, number of matches and mates, and a few other attributes are available to allow query filters on the reliability of the association.

Note that because the Hipparcos and Tycho2 catalogues have been used for DR1 (e.g. as prior information in AGIS for TGAS) the one-to-one matches employed within the Astrometric Global Iterative Solution are also available as extra fields in the TGAS source tables. Hence a neighbour match between Gaia TGAS sources and Tycho2 can also be achieved as follows:

SELECT TOP 100 *
FROM public.tycho2, gaiadr1.tgas_source 
WHERE tgas_source.tycho2_id = tycho2.id OR tgas_source.hip = tycho2.hip

and in this case the counterparts will be those used in applying astrometric constraints (if any) in AGIS. Note that the Hipparcos and Tycho2 catalogues are contained within schema "public" and that join criteria have to be combined with the OR condition because one or the other, but not both, are specified for each row in tgas_source.

There now follows several example queries to illustrate the use of the precomputed neighbour and neighbourhood tables.

I want whatever measurements of MW G2 stars only in a given sky region (GDAS-GA-04)

One way of making such a selection in absence of any optical colours or detailed spectroscopic information in DR1 is to use 2MASS photometry and this provides a simple example of the use of the best neighbour table. This then requires a crossmatch query employing the precomputed 2MASS crossmatch to enable filtering by near-infrared colours (which are much less susceptible to reddening compared to optical colours anyway). Here is an example employing colour cuts taken from Table 4 of Wegner (2014):

SELECT g.ra, g.dec, t.j_m, t.j_m - t.ks_m AS jminusk
FROM gaiadr1.gaia_source AS g, gaiadr1.tmass_best_neighbour AS x, gaiadr1.tmass_original_valid AS t
WHERE g.source_id = x.source_id AND t.tmass_oid = x.tmass_oid
  AND g.ra BETWEEN 175 AND 185 AND g.dec BETWEEN -0.5 AND +0.5 
  AND t.j_m - t.ks_m BETWEEN 0.41 AND 0.43 
  AND (t.j_msigcom *t.j_msigcom) + (t.ks_msigcom *t.ks_msigcom) < 0.2

Notes:

  • ADQL geometry functions are provided for convenience but an alternative is to coarsely filter on (indexed) spherical coordinates, in this case "WHERE ... ra BETWEEN 175 AND 185 AND dec BETWEEN -0.5 AND +0.5 ..." and for some geometries this may be faster
  • This example is relevant also to the following scenario from AB-026: GDAS-GA-20

An example query to select the nearest neighbour is slightly more complicated:

SELECT g.ra, g.dec, t.j_m, t.j_m - t.ks_m AS jminusk
FROM gaiadr1.gaia_source AS g, gaiadr1.tmass_neighbourhood AS x, gaiadr1.tmass_original_valid AS t, (
    SELECT source_id, MIN(angular_distance) AS mindist
    FROM   gaiadr1.tmass_neighbourhood
    GROUP BY source_id
) AS m
WHERE g.source_id = x.source_id AND t.tmass_oid = x.tmass_oid
  AND g.ra BETWEEN 175 AND 185 AND g.dec BETWEEN -0.5 AND +0.5 
  AND t.j_m - t.ks_m BETWEEN 0.41 AND 0.43 
  AND (t.j_msigcom *t.j_msigcom) + (t.ks_msigcom *t.ks_msigcom) < 0.2
  AND g.source_id = m.source_id AND x.angular_distance = m.mindist

This is not necessarily the same as the 'best' neighbour because the algorithm that chooses the best match may take into account prior conditions other than (proper motion corrected) proximity when choosing the counterpart, but nonetheless may be useful in some situations. At the very least it shows how to filter the neighbourhood records using a subquery.

Note that cast as a single ADQL query this will time out in the Gaia archive. This can be worked around easily by creating a table from the small set resulting from the main filter predicates, create a user table from the results set (by clicking the appropriate gadget icon in the Gaia archive results pane), and then making a join of this very small set on the neighbourhood table.

I have a list of N members and member candidates in a young open cluster further than 300 pc... I want to obtain the parallaxes, proper motions, broad-band photometry and accurate coordinates of all of them in the Gaia dataset (GDAS-GA-01)

This scenario illustrates several features of the Gaia Archive Core Systems user interface in a 3-stage workflow that includes a final ADQL query. There are several ways to approach this but the easiest is to make use of the user table upload feature of the Gaia archive, the use of which requires the user to be registered and to log in their session as opposed to querying anonymously. Then, step 1 is to upload the list of cluster candidates using the "Upload user table" gadget actioned via the button on the top-left of the search tab (disk icon with upward pointing arrow). Available upload formats are VOTable, or space- or comma-separated text (ASCII or CSV). Step 2 is to crossmatch this with the relevant Gaia data table - in DR1, full astrometry is provided for Tycho-Gaia Astrometric Solution (TGAS) sources only, so action a crossmatch using the "Cross-match tables" gadget (third button along on the search tab, double star icon) between the user table and tags_source. At both stages 1 and 2 a results table will appear in the table browser under "User tables". Finally, a selection can then be made in ADQL between the two matched tables in conjunction with the crossmatch table:

SELECT x.dist*3600.0 AS distArcsec, *
FROM user_xyz.table1 AS t1, gaiadr1.tgas_source AS t2, user_xyz.xmatch_tgas_source_table1 AS x
WHERE t1.table1_oid = x.table1_table1_oid AND t2.source_id = x.tgas_source_source_id
ORDER BY t1.table1_oid, x.dist

Notes:

  • The "ORDER BY" clause ensures that multiple matches for the same source appear on consecutive lines in the results set, ordered on increasing distance
  • Obviously "xyz" should be substituted by the registered user name of the login session
  • This example is relevant also to the following scenarios from AB-026: GDAS-GA-16, GDAS-GA-21, GDAS-BR-03, GDAS-ST-23

I would like to have Gaia distances and proper motions for stars that were observed by RAVE ... the radial velocities and physical parameters derived from RAVE (GDAS-GA-22)

This requires a standard join query using the precomputed neighbour table for the RAVE survey:

SELECT TOP 10 *
FROM gaiadr1.gaia_source AS s, gaiadr1.rave4_best_neighbour AS x, gaiadr1.rave4_original_valid AS r
WHERE s.source_id = x.source_id AND x.rave4_oid = r.rave4_oid

N.B. the RAVE catalogue and crossmatch will not be available immediately at the DR1 release data but will be added as soon as possible thereafter

Multiply imaged quasars ... (GDAS-FP-01)

Multiply imaged quasars consist of multiple (typically 2, 3, 4 or even more) point-like images of a same quasar lensed by a foreground intervening galaxy. The typical angular separations between the multiple point-like images range between 0.1″ and 3″-5″. In the absence of detailed spectrophotometric information from Gaia itself in DR1, one way of beginning to tackle such a scenario is to use the precomputed crossmatch "neighbourhood" tables to examine the environments of previously known quasars. The on-ground identification of quasars is expressed in the Initial Gaia Source list to which the main source table is cross referenced. Furthermore we can employ the precomputed crossmatch with SDSS DR9 as a set of cross-identifications of Gaia sources that a common neighbour. Here is some example ADQL that performs the search:

SELECT TOP 100 *
FROM gaiadr1.gaia_source AS g, gaiadr1.sdss_dr9_neighbourhood as x,
(
 SELECT source_id
 FROM   public.igsl_source_catalog_ids
 WHERE  id_lqrf > 0
) AS q,
(
 SELECT sdssdr9_oid
 FROM   gaiadr1.sdss_dr9_neighbourhood
 WHERE  angular_distance <= 5
 GROUP BY sdssdr9_oid
 HAVING COUNT(*) > 1
) AS s
WHERE g.source_id = q.source_id
  AND g.source_id = x.source_id
  AND s.sdssdr9_oid = x.sdssdr9_oid
ORDER BY s.sdssdr9_oid, x.angular_distance

Notes:

  • The first subquery (with results table labelled "AS q") identifies the set of Gaia source IDs corresponding to known quasars
  • The second subquery (with results table labelled "AS s") creates a set of SDSS sources that have more than one Gaia source within neighbourhood radius of 5 arcseconds
  • The outer query joins these together with the main Gaia source table and SDSS neighbourhood table in order to associate multiple Gaia sources with the previously identified quasar, the assumption being that one of those Gaia sources is the known quasar while the other are possible lensed counterparts.
  • Once again, and depending Gaia archive server load levels at execution time, this may time out and as before the work around is to split into two or more stages, saving intermediate results to a user table

Further information

For convenience here is a set of links to follow for further useful information:

Acknowledgements

This work has received funding from the European Community's Seventh Framework Programme (FP7-SPACE-2013-1) under grant agreement number 606740. Funds have also been provided by the UK Science and Technology Facilities Council.