Distributed processing with NoSQL databases enables fast geoprocessing of big spatial data
I was lucky enough to attend the High Performance Geoprocessing Symposium in Ottawa. It is amazing how far some very exciting geospatial projects focused on “big spatial data” under the LocationTech umbrella have come in the last few years. If you are not familiar with LocationTech, it is a working group hosted by the Eclipse Foundation with the objective of developing advanced open source geospatial technologies. The Eclipse Foundation is a vendor-neutral, not-for-profit community collaborating on commercially friendly open source software.
The talks today described projects that have been developed under the impetus of “big spatial data” and user demand for faster, more responsive geoprocessing.
GeoTrellis – high performance raster processing
Robert Cheetham of Azavea described how his project GeoTrellis is achieving “advanced spatial analysis on the web.” The challenges that this project strives to overcome are:
- performance and scalability – overlays involving large rasters for multi-criteria suitability, optimal siting analysis, or simulation require serious processing; doing this in real time, with the kind of response a web user expects, requires new algorithms
- large datasets – such as social media streams like Twitter, the huge volumes of data required to model a smart city, or just the volume of data required to monitor and run a modern city on a daily basis
- user interface – an ArcGIS-style menu bar with thousands of options is fine for trained GIS professionals, but an urban planner or other non-GIS professional needs something better
GeoTrellis has been designed to address these challenges. It is open source, released under an Apache licence, which means that all you need to do is preserve the attribution; otherwise you can do pretty much anything you like with it. It provides high performance raster input/output, geoprocessing, and web services, using distributed processing to achieve quite amazing throughput for large raster datasets. Robert described overlay operations with 6000 x 5000 pixel rasters, hundreds of megabytes in file size, that GeoTrellis performs in 200 milliseconds, effectively real-time from the point of view of a web user.
GeoTrellis relies on other open source libraries to achieve these phenomenal processing speeds. It stores data in the Hadoop Distributed File System (HDFS), but replaces Hadoop’s MapReduce (software for processing vast amounts of data in parallel on large clusters of commodity hardware) with Apache Spark for distributed processing. Robert said that one way to look at GeoTrellis is as a geospatially enabled version of Spark.
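To make the tile-based distribution concrete, here is a minimal sketch written against Spark’s plain Java API, not GeoTrellis’s actual (Scala) API. Each raster is cut into keyed tiles, and a weighted overlay becomes a join on tile keys followed by a pixel-by-pixel combine; the tile layout and weights here are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class TileOverlay {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("raster-overlay").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each raster is cut into tiles keyed by a tile id. In a real
        // pipeline the tiles would be read from HDFS, not built in memory.
        List<Tuple2<Integer, double[]>> rasterA = new ArrayList<>();
        List<Tuple2<Integer, double[]>> rasterB = new ArrayList<>();
        rasterA.add(new Tuple2<>(0, new double[] {0.2, 0.8, 0.5}));
        rasterB.add(new Tuple2<>(0, new double[] {0.9, 0.1, 0.4}));

        JavaPairRDD<Integer, double[]> a = sc.parallelizePairs(rasterA);
        JavaPairRDD<Integer, double[]> b = sc.parallelizePairs(rasterB);

        // Weighted overlay: join matching tiles, then combine pixel by pixel.
        JavaPairRDD<Integer, double[]> suitability = a.join(b).mapValues(pair -> {
            double[] out = new double[pair._1().length];
            for (int i = 0; i < out.length; i++) {
                out[i] = 0.6 * pair._1()[i] + 0.4 * pair._2()[i]; // example weights
            }
            return out;
        });

        System.out.println(suitability.collect().size() + " tiles combined");
        sc.stop();
    }
}
```

Because each tile pair is processed independently, the combine step spreads across however many cores or machines are available, which is where the sub-second response times come from.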
Robert described a number of applications that involve overlaying very large raster databases for analysis or simulation, where different scenarios can be run in real-time so that the user can quickly assess alternatives. For example, OpenTreeMap is a crowdsourced urban forest database; using GeoTrellis, users can quickly assess the costs and benefits of urban trees and optimize planting for different objectives: shading, air quality, reducing runoff, and so on. This would not be possible with traditional GIS and geospatial databases.
GeoJinni – Spatial Hadoop
Ahmed Eldawy from the University of Minnesota described a LocationTech project he has been working on for several years that was originally called Spatial Hadoop, but has been renamed GeoJinni. Hadoop, perhaps the best known platform in the NoSQL ecosystem, knows nothing about spatial data. It doesn’t support spatial datatypes, spatial indexes, or spatial analytics.
GeoJinni is a comprehensive extension to Hadoop that allows efficient processing of spatial data. It injects spatial awareness into the different layers and components of Hadoop to make it more efficient to store and process big spatial data. The result is a spatially-enabled Hadoop that supports spatial datatypes, spatial indexes, spatial analysis, and spatial predicates (such as “intersects”). GeoJinni can be installed as an extension to an existing Hadoop cluster, which means that it can be run without giving up existing Hadoop installations. It is portable across a number of Hadoop distributions, including Apache Hadoop, Cloudera, and Hortonworks.
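GeoJinni’s own API isn’t reproduced here, but a naive MapReduce job shows what the framework is accelerating. Without a spatial index, a range query like “all points in this bounding box” has to scan every record, as in this hedged sketch (the comma-separated input format and the bounding box are invented); GeoJinni’s spatial indexes let Hadoop skip the file blocks that cannot possibly contain matches.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Naive range query: keep only the points inside a bounding box.
// This mapper must visit every input record; a spatial index avoids that.
public class RangeQueryMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Hypothetical query window (roughly the Ottawa area).
    private static final double MIN_LON = -76.0, MAX_LON = -75.0;
    private static final double MIN_LAT = 45.0, MAX_LAT = 46.0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input format: one "lon,lat" pair per line.
        String[] parts = value.toString().split(",");
        double lon = Double.parseDouble(parts[0]);
        double lat = Double.parseDouble(parts[1]);
        if (lon >= MIN_LON && lon <= MAX_LON
                && lat >= MIN_LAT && lat <= MAX_LAT) {
            context.write(value, NullWritable.get());
        }
    }
}
```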
Some of the applications where GeoJinni is currently used are modeling world traffic and querying and visualizing spatio-temporal satellite data.
Currently it uses MapReduce to distribute processing across multiple machines, but in the future the plan is to extend support to Spark.
GeoMesa – high performance geospatial (vector) analytics
Chris Eichelberger from the GeoMesa project described an approach to big spatial data that focusses on vector data and compliance with existing, widely used geospatial APIs.
The problem, as outlined by Chris, is that if you want to find all the points within a certain distance of a fixed point, your strategy depends on the total number of points you are dealing with.
- hundreds of points – can be handled reasonably well (but not optimally) by a full table scan with an SQL database such as Oracle or PostgreSQL.
- hundreds of thousands of points – you need PostGIS or Oracle Spatial to handle this efficiently. The widely used GeoTools API is an open source Java library that provides tools for geospatial data management and analytics (see the sketch after this list). GeoTools supports plugins for various databases, including MySQL, Oracle, SpatiaLite, and SQL Server, but the most widely used database with GeoTools is PostGIS.
- hundreds of millions (typical of Twitter streams, for example) – this volume of data is in the realm of big spatial data and requires a new and different approach
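For that middle tier, a GeoTools point-in-radius query against PostGIS is only a few lines. This is a hedged sketch: the connection parameters, the table name points, and the attribute geom are invented, distance semantics depend on the data’s coordinate reference system, and package names vary between GeoTools versions (the imports below follow the older layout).

```java
import java.util.HashMap;
import java.util.Map;

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureSource;
import org.geotools.filter.text.cql2.CQL;
import org.opengis.filter.Filter;

public class DistanceQuery {
    public static void main(String[] args) throws Exception {
        // Connection parameters for a hypothetical PostGIS database.
        Map<String, Object> params = new HashMap<>();
        params.put("dbtype", "postgis");
        params.put("host", "localhost");
        params.put("port", 5432);
        params.put("database", "geo");
        params.put("user", "geo");
        params.put("passwd", "secret");

        DataStore store = DataStoreFinder.getDataStore(params);
        SimpleFeatureSource source = store.getFeatureSource("points");

        // "All points within 10 km of a fixed location" as a CQL filter.
        Filter filter = CQL.toFilter("DWITHIN(geom, POINT(-75.7 45.4), 10000, meters)");
        SimpleFeatureCollection matches = source.getFeatures(filter);
        System.out.println(matches.size() + " points found");
        store.dispose();
    }
}
```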
NoSQL (“not only SQL”) databases
I have already mentioned Hadoop. To deal with data volumes that are too large for traditional SQL databases, Google began developing “BigTable” in 2004, a compressed, high performance, proprietary data storage system built on the Google File System that is used by a number of Google applications, including Google Maps. Apache Accumulo is a distributed “key/value store” based on Google’s BigTable design and built on top of Apache Hadoop and other Apache projects. “Key/value store” simply means that every record has a unique identifier, often a hash.

For putting geospatial data into a key/value store, the key concept is the “geohash”: converting a 2D, 3D, or 4D coordinate (lon and lat; lon, lat, and elevation; or lon, lat, elevation, and time) into an integer index, in the same spirit as a quadtree or R-tree index, that can be used to order and rapidly retrieve spatial data. The geohash means that you can now take advantage of key/value databases such as Accumulo and programs such as MapReduce – Accumulo uses MapReduce for distributed processing. GeoMesa aims to provide as much of the spatial querying and data manipulation to Accumulo as PostGIS does to Postgres.
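Here is a minimal, self-contained sketch of the geohash idea: quantize longitude and latitude onto a grid and interleave their bits into a single Z-order key. Nearby points end up with numerically close keys, so a sorted key/value store like Accumulo can answer spatial queries with cheap range scans. The bit depth and the test coordinates are arbitrary.

```java
public class GeoKey {
    // Interleave the bits of quantized lon/lat into a single Z-order index.
    public static long zOrder(double lon, double lat, int bitsPerAxis) {
        long x = quantize(lon, -180.0, 180.0, bitsPerAxis);
        long y = quantize(lat, -90.0, 90.0, bitsPerAxis);
        long key = 0L;
        for (int i = bitsPerAxis - 1; i >= 0; i--) {
            key = (key << 1) | ((x >> i) & 1L); // longitude bit
            key = (key << 1) | ((y >> i) & 1L); // latitude bit
        }
        return key;
    }

    // Map a coordinate onto 2^bits cells along one axis.
    private static long quantize(double v, double min, double max, int bits) {
        double scaled = (v - min) / (max - min); // 0.0 .. 1.0
        long cells = 1L << bits;
        long cell = (long) (scaled * cells);
        return Math.min(cell, cells - 1); // clamp the upper edge
    }

    public static void main(String[] args) {
        // Two nearby points in Ottawa get close keys; Sydney does not.
        System.out.println(zOrder(-75.697, 45.421, 20));
        System.out.println(zOrder(-75.698, 45.422, 20));
        System.out.println(zOrder(151.209, -33.868, 20));
    }
}
```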
An especially exciting thing about GeoMesa is that it is an implementation of the GeoTools API on Apache Accumulo instead of PostGIS. If you have an application that was developed with GeoTools, you can simply replace PostGIS with GeoMesa to handle big spatial data volumes. There is even a plug-in for GeoServer that allows geospatial data in Accumulo to be shared and visualized via Open Geospatial Consortium (OGC) standard services, such as WFS. Chris also announced that the National Geospatial-Intelligence Agency’s (NGA) GeoWave project, which shares many goals with GeoMesa, will be folded into GeoMesa.
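The practical upshot is that, for a GeoTools application, switching backends is mostly a matter of changing the connection parameter map. A hedged sketch follows; the GeoMesa parameter names below are illustrative and vary between GeoMesa releases, so check the documentation for your version.

```java
import java.util.HashMap;
import java.util.Map;

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;

public class SwapBackend {
    public static void main(String[] args) throws Exception {
        // The only PostGIS-specific code in many GeoTools applications is
        // the parameter map; pointing it at GeoMesa's Accumulo data store
        // leaves the rest of the application unchanged.
        // (Parameter names are assumptions and differ between GeoMesa versions.)
        Map<String, Object> params = new HashMap<>();
        params.put("instanceId", "myCloud");
        params.put("zookeepers", "zoo1:2181,zoo2:2181");
        params.put("user", "root");
        params.put("password", "secret");
        params.put("tableName", "geomesa_catalog");

        DataStore store = DataStoreFinder.getDataStore(params);
        // From here on, the same GeoTools calls used against PostGIS apply:
        // store.getFeatureSource(...), queries, filters, and so on.
        System.out.println(store != null ? "connected" : "no matching data store on classpath");
    }
}
```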
Spatio-temporal
A fairly common query when dealing with social media streams such as Twitter is to “find all the tweets from locations within 10 km and within 15 minutes of where I was in my car at 9:45 this morning.” Chris outlined how GeoMesa can handle that type of query through GeoMesa’s support for an Open Geospatial Consortium standard called the Web Processing Service (WPS).
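Whatever the transport (WPS in the case Chris described), the predicate itself is easy to express with GeoTools’ ECQL parser, combining a spatial DWITHIN with a temporal DURING. In this hedged sketch the attribute names geom and dtg and the timestamps are invented:

```java
import org.geotools.filter.text.ecql.ECQL;
import org.opengis.filter.Filter; // older GeoTools package layout

public class SpatioTemporalFilter {
    public static void main(String[] args) throws Exception {
        // "Tweets within 10 km and within 15 minutes of 09:45 this morning."
        Filter filter = ECQL.toFilter(
            "DWITHIN(geom, POINT(-75.697 45.421), 10000, meters)"
            + " AND dtg DURING 2015-03-12T09:30:00Z/2015-03-12T10:00:00Z");
        System.out.println(filter);
    }
}
```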