Large Survey Database

The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching, and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in “shared nothing” architectures.

An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, the tables are partitioned into ‘column groups’, sets of logically related columns (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping “cells” by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, and it provides a natural unit for distributing data and tasks. The design was inspired by the success of Google BigTable (Chang et al., 2006).
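The two kinds of partitioning described above can be illustrated with a toy sketch. This is not LSD's actual pixelization or storage code: the cell sizes, the `cell_id` and `partition` helpers, and the column names are all hypothetical, the cells here do not overlap, and plain Python lists stand in for HDF5 files.

```python
import math
from collections import defaultdict

# Hypothetical cell sizes, chosen only for illustration.
CELL_DEG = 10.0    # spatial cell width in degrees
CELL_DAYS = 30.0   # temporal cell width in days

# Vertical partitioning: logically related columns are stored together.
COLUMN_GROUPS = {
    "astrometry": ("lon", "lat", "t"),
    "photometry": ("mag", "mag_err"),
}

def cell_id(lon, lat, t):
    """Map a (lon, lat, t) position to a simple grid-cell key."""
    i = int(math.floor((lon % 360.0) / CELL_DEG))
    j = int(math.floor((lat + 90.0) / CELL_DEG))
    k = int(math.floor(t / CELL_DAYS))
    return (i, j, k)

def partition(rows):
    """Split rows into cells, and each cell into per-group column tuples."""
    cells = defaultdict(lambda: {g: [] for g in COLUMN_GROUPS})
    for row in rows:
        key = cell_id(row["lon"], row["lat"], row["t"])
        for group, cols in COLUMN_GROUPS.items():
            cells[key][group].append(tuple(row[c] for c in cols))
    return dict(cells)

rows = [
    {"lon": 12.5, "lat": -3.0, "t": 45.0, "mag": 18.2, "mag_err": 0.03},
    {"lon": 14.0, "lat": -1.0, "t": 50.0, "mag": 19.1, "mag_err": 0.05},
    {"lon": 200.0, "lat": 40.0, "t": 10.0, "mag": 17.4, "mag_err": 0.02},
]
cells = partition(rows)
```

In this sketch, a lookup by position and time only needs to touch the cell whose key matches, and a query that reads only photometry never loads the astrometry group.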

Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, MapReduce “kernels” that operate on query results on a per-cell basis can be written, with the framework taking care of their distribution, scheduling, and execution. The combination leverages the users’ familiarity with SQL, while offering a fully distributed computing environment.
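As a rough illustration of the per-cell kernel idea, the following is a minimal, serial sketch in plain Python. It does not use LSD's API: `map_kernel`, `reduce_kernel`, and `run` are hypothetical names, the per-cell rows stand in for query results, and a real framework would schedule the cells across workers rather than loop over them.

```python
from collections import defaultdict

def map_kernel(cell_key, rows):
    """Per-cell map step: emit (key, value) pairs, here a count per 1-mag bin."""
    for row in rows:
        yield int(row["mag"]), 1

def reduce_kernel(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)

def run(cells, mapper, reducer):
    """Toy serial driver: map each cell, group by key, then reduce."""
    shuffled = defaultdict(list)
    for cell_key, rows in cells.items():
        for k, v in mapper(cell_key, rows):
            shuffled[k].append(v)
    return dict(reducer(k, vs) for k, vs in sorted(shuffled.items()))

# Hypothetical query results, already grouped by cell.
cells = {
    (1, 8): [{"mag": 18.2}, {"mag": 18.9}],
    (20, 13): [{"mag": 17.4}, {"mag": 18.1}],
}
histogram = run(cells, map_kernel, reduce_kernel)
```

Because each call to the mapper sees only one cell's rows, the cells can in principle be processed independently, which is what makes the per-cell formulation easy to distribute.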

LSD adds little overhead compared to direct Python file I/O. In tests, we swept through 1.1 billion rows of Pan-STARRS+SDSS data (220 GB) in less than 15 minutes on an 8-core machine. In a cluster environment, data rates of 14 Gbit/s (I/O limited) were achieved. Based on current experience, we believe LSD will scale to be useful for analysis and storage of LSST-scale datasets.
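For a sense of scale, the figures quoted above can be converted into per-second rates. This is plain arithmetic, assuming decimal units (1 GB = 10^9 bytes) and the full 15 minutes, so the single-machine numbers are lower bounds.

```python
# Figures quoted in the text; decimal GB is an assumption.
n_rows = 1.1e9          # rows swept
size_bytes = 220e9      # 220 GB
seconds = 15 * 60       # upper bound on the sweep time

rows_per_second = n_rows / seconds           # about 1.2 million rows/s
mb_per_second = size_bytes / seconds / 1e6   # about 244 MB/s
bytes_per_row = size_bytes / n_rows          # about 200 bytes/row

# Cluster rate: 14 Gbit/s expressed in gigabytes per second.
cluster_gb_per_second = 14 / 8               # 1.75 GB/s
```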

LSD has an active user community and has been successfully applied to the storage and analysis of many large astronomical data sets, including Pan-STARRS PS1, SDSS, PHAT, WISE, PTF, 2MASS, GALEX, and others.

 

More Information