Scaling exactEarth’s satellite vessel-tracking service using GeoMesa on Google Cloud Platform
By Anthony Fox, Director of Data Science & System Architecture at CCRi, and Misha Brukman, Product Manager for Google Cloud Bigtable
Google Cloud Bigtable, Google Cloud Dataproc and GeoMesa are the main components in this solution for global geospatial analytics
Automatic Identification Systems (AISs) are used by ships and vessel-tracking services to provide real-time awareness of vessel locations for safe navigation, routing and historical analysis, similar to what air-traffic control systems do for aircraft. The data services company exactEarth uses a constellation of satellites to collect more than 7 million AIS vessel-position reports per day. That number is expected to grow by an order of magnitude when 58 new satellite payloads come online soon.
Over its history, exactEarth built a scalable architecture to give its customers real-time views of AIS vessel positions anywhere across our oceans, as well as insight into patterns of maritime activity through historical archives — contributing to collision avoidance, cargo tracking, navigation assistance, environmental protection and other applications. (It’s worth noting that the data attributes and scalability requirements associated with this workload, such as location and time-series information, are similar to those in many spatial-temporal applications across industries such as Retail and Manufacturing.)
However, as a forward-thinking technology company, exactEarth is constantly exploring options for scaling its services to the next level. Recently, it partnered with machine-learning and geospatial analytics firm CCRi to help evaluate the viability of Google Cloud Platform (GCP) for managing and analyzing satellite-collected data more efficiently.
In this post, we’ll summarize the resulting solution, including the role of the open source GeoMesa spatial-temporal framework.
What is GeoMesa?
GeoMesa is a suite of tools for bringing spatial-temporal data, real-time IoT and sensor workloads to the cloud. It can run on a variety of distributed databases including Apache Accumulo, Apache Cassandra, Apache HBase and the original massively scalable, columnar NoSQL database that inspired them all, Google Cloud Bigtable.
Cloud Bigtable is a managed service in the GCP ecosystem. It supports the Apache HBase API for seamless adoption and portability across self-hosted HBase instances or managed Bigtable instances. Bigtable scales to huge volumes of data, while maintaining low-latency access, by separating compute from storage and utilizing the Colossus distributed file system across any number of provisioned server nodes.
As we’ll describe next, the combination of Cloud Bigtable scalability and spatial-temporal analytic functionality in GeoMesa proved to be a key ingredient for exactEarth’s use case.
Solution details
As a first step, CCRi loaded one year of exactEarth data (spanning November 2015 through October 2016) into a GeoMesa instance on top of a 3-node Cloud Bigtable setup; this data amounted to 2.62 billion position reports. Through GeoMesa, users can query the AIS data using a variety of spatial relationships.
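As a quick sanity check on these figures, the sketch below divides the loaded report count by the number of days in the load window. It assumes the window covers whole months, from 1 November 2015 through 31 October 2016; the function name is purely illustrative.

```python
from datetime import date

# Figure quoted in the post: position reports loaded into GeoMesa.
TOTAL_REPORTS = 2_620_000_000

# Assumption: the load covers whole months, 1 Nov 2015 through 31 Oct 2016.
START = date(2015, 11, 1)
END_EXCLUSIVE = date(2016, 11, 1)


def average_reports_per_day(total: int, start: date, end_exclusive: date) -> float:
    """Mean number of position reports per day over [start, end_exclusive)."""
    days = (end_exclusive - start).days
    return total / days


if __name__ == "__main__":
    avg = average_reports_per_day(TOTAL_REPORTS, START, END_EXCLUSIVE)
    print(f"{avg / 1e6:.2f} million reports per day")
```

Under that assumption the average comes out slightly above 7 million reports per day, which is consistent with the collection rate quoted earlier.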
For example, to find the number of unique vessels that passed through the Panama Canal on a given day, an analyst can combine a bounding box with a time interval; the result is approximately 40 ships. To drill down to a particular vessel, you’d enter a query with a predicate on the Maritime Mobile Service Identity (MMSI) number of the vessel of interest. In the latter case, GeoMesa’s cost-based query optimizer will kick in and delegate the processing of the query to the secondary index table. With GeoMesa, you can also select which attributes get their own indexes and control the amount of data that gets stored in each additional index to trade off performance against storage costs, as needed.
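The two kinds of query described above can be written as ECQL filter strings, the text filter language used by GeoTools-based tools. The following is a minimal sketch, not exactEarth’s actual queries: the helper names are hypothetical, the attribute names (geom, Time, MMSI) are borrowed from the SQL examples later in this post, and the bounding box is only a rough, illustrative box around the canal.

```python
from datetime import datetime, timezone


def _iso(ts: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC instant."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def bbox_during_filter(min_lon, min_lat, max_lon, max_lat, start, end,
                       geom_attr="geom", time_attr="Time"):
    """Build an ECQL filter combining a bounding box with a time interval.

    The attribute names are assumptions taken from the SQL in this post.
    """
    return (
        f"BBOX({geom_attr}, {min_lon}, {min_lat}, {max_lon}, {max_lat}) "
        f"AND {time_attr} DURING {_iso(start)}/{_iso(end)}"
    )


def vessel_filter(mmsi, attr="MMSI"):
    """Build an ECQL equality predicate on a (hypothetically numeric) MMSI."""
    return f"{attr} = {mmsi}"


if __name__ == "__main__":
    # Illustrative coordinates and dates only.
    day_start = datetime(2016, 3, 1, tzinfo=timezone.utc)
    day_end = datetime(2016, 3, 2, tzinfo=timezone.utc)
    print(bbox_during_filter(-80.0, 8.8, -79.4, 9.5, day_start, day_end))
    print(vessel_filter(123456789))
```

The first filter is the spatial-plus-temporal shape; the second is the kind of single-attribute predicate that, per the description above, would be served from a secondary index on MMSI if one is configured.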
After ingesting all the AIS data, CCRi set up a Google Compute Engine instance to run GeoServer, an open-source Open Geospatial Consortium (OGC) service provider that enables access to exactEarth data via standards such as Web Feature Service (WFS) and Web Map Service (WMS). Through these standard protocols, any OGC-compliant mapping client such as OpenLayers, Leaflet or QGIS can consume exactEarth’s data. GeoMesa’s optimizations for typical requests such as on-demand heat maps can also be triggered via standard styling rules.
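To illustrate what an OGC client sends, here is a minimal sketch that assembles a WFS 2.0.0 GetFeature request URL with the Python standard library. The host name and the layer name ais:exactEarth are hypothetical placeholders; cql_filter is GeoServer’s vendor-specific parameter for passing a CQL/ECQL filter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wfs_getfeature_url(base_url, type_name, cql_filter, max_features=100):
    """Assemble a WFS 2.0.0 GetFeature URL that returns GeoJSON.

    base_url and type_name are placeholders for a real GeoServer
    endpoint and published layer.
    """
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        "count": max_features,          # cap on the number of returned features
        "outputFormat": "application/json",
        "cql_filter": cql_filter,       # GeoServer vendor parameter
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    url = wfs_getfeature_url(
        "https://geoserver.example.com/geoserver/wfs",  # hypothetical host
        "ais:exactEarth",                               # hypothetical layer
        "MMSI = 123456789",
    )
    print(url)
    print(parse_qs(urlsplit(url).query)["cql_filter"])
```

Any HTTP client can issue the resulting URL, which is what makes the data reachable from OpenLayers, Leaflet or QGIS without custom code.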
To handle many user requests, GeoServer can be set up behind a GCP load balancer that will distribute OGC requests across a configurable cluster of instances. Combined with Bigtable’s load balancing of requests across Bigtable server instances, this setup can scale to high volumes of requests while maintaining good performance.
GeoMesa and Cloud Dataproc at work
For analysis, GeoMesa provides deep integration with Apache Spark and the Spark SQL query optimizer (Catalyst). Because the GeoMesa API is consistent across backends such as Cloud Bigtable, Accumulo and Cassandra, and because Google Cloud Dataproc is the managed service for Apache Hadoop and Spark clusters on GCP, Cloud Dataproc was a useful target architecture for this deployment.
GeoMesa extends Spark SQL to add spatial data types, topological SQL predicates, geometry functions and optimization rules for efficiently loading data out of a GeoMesa backend and into Spark’s in-memory execution engine. Used in combination with Jupyter Notebook, GeoMesa and Cloud Dataproc, which offers fast data load and 90-second cluster spin-up time, provide an interactive ad-hoc analysis and visualization environment over massive spatio-temporal data sets.
For example, it takes about 7 to 8 hours to traverse the Panama Canal. An analyst can determine the busiest hours of the day in the canal by executing the following SQL statement and then viewing a graph of the results using Vega-Lite charts.
SELECT hour, avg(count) AS avgcount
FROM (
  SELECT hour, day, count(DISTINCT MMSI) AS count
  FROM (
    SELECT MMSI, dayofyear(Time) AS day, hour(Time) AS hour
    FROM exactEarth
    WHERE st_contains(st_geomFromWKT('POLYGON((-79.88 9.18,-79.83...))'), geom)
  )
  GROUP BY day, hour
)
GROUP BY hour
ORDER BY hour ASC
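To make the nested aggregation in the statement above concrete, here is a plain-Python sketch of the same two steps: count distinct vessels per (day, hour), then average those counts per hour. It is an illustration only, not GeoMesa or Spark code; the function name is hypothetical, and the input is assumed to be already restricted to reports inside the canal polygon.

```python
from collections import defaultdict
from datetime import datetime


def busiest_hours(reports):
    """Average number of distinct vessels seen per hour of day.

    reports: iterable of (mmsi, timestamp) pairs, assumed to be
    pre-filtered to the area of interest.
    """
    # Inner query: distinct MMSI per (day of year, hour).
    vessels = defaultdict(set)
    for mmsi, ts in reports:
        vessels[(ts.timetuple().tm_yday, ts.hour)].add(mmsi)

    # Outer query: average the per-day counts for each hour.
    per_hour = defaultdict(list)
    for (_day, hour), seen in vessels.items():
        per_hour[hour].append(len(seen))
    return {hour: sum(c) / len(c) for hour, c in sorted(per_hour.items())}


if __name__ == "__main__":
    sample = [
        (1, datetime(2016, 1, 1, 10, 5)),
        (1, datetime(2016, 1, 1, 10, 40)),  # same vessel, same hour: counted once
        (2, datetime(2016, 1, 1, 10, 15)),
        (3, datetime(2016, 1, 2, 10, 0)),
        (1, datetime(2016, 1, 1, 11, 0)),
    ]
    print(busiest_hours(sample))
```

As in the SQL, an hour with no vessels on a given day contributes nothing to the average, and dates in different calendar years that share a day-of-year number fall into the same bucket.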
GeoMesa intercepts the st_contains(st_geomFromWKT($panama), geom) predicate and loads only the relevant data into the Spark DataFrame. From there, the full power and flexibility of Spark is available to the programmer or analyst to compute answers to complex questions like “Which vessels that traveled through the Panama Canal had the longest voyage in the past two months?”
SELECT
  MMSI, line AS track,
  st_lengthSpheroid(line) AS length
FROM (
  SELECT
    MMSI,
    collect_list(geom) AS list,
    st_makeLine(collect_list(geom)) AS line
  FROM (
    SELECT MMSI, Time, geom
    FROM exactEarth
    WHERE MMSI IN (<>)
    SORT BY Time ASC
  )
  GROUP BY MMSI
)
WHERE size(list) > 1
ORDER BY length DESC
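The idea behind the query above (order each vessel’s positions by time, join them into a line, measure the line, rank the vessels) can be sketched in plain Python. Note the substitution: this sketch uses the haversine formula on a sphere with a mean Earth radius, which only approximates the spheroid-based length that st_lengthSpheroid computes. All function names here are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; a spherical approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points (degrees)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def track_length_m(reports):
    """Length of a track given (time, lon, lat) reports in any order."""
    points = sorted(reports, key=lambda r: r[0])  # order positions by time
    return sum(
        haversine_m(a[1], a[2], b[1], b[2])
        for a, b in zip(points, points[1:])
    )


def longest_voyages(reports_by_mmsi):
    """Rank vessels by track length, skipping vessels with a single report."""
    lengths = {
        mmsi: track_length_m(reports)
        for mmsi, reports in reports_by_mmsi.items()
        if len(reports) > 1
    }
    return sorted(lengths.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    tracks = {
        "a": [(0, 0.0, 0.0), (1, 1.0, 0.0)],
        "b": [(1, 2.0, 0.0), (0, 0.0, 0.0)],  # out of order on purpose
        "c": [(0, 0.0, 0.0)],                 # dropped: only one report
    }
    for mmsi, length in longest_voyages(tracks):
        print(mmsi, round(length))
```

Sorting explicitly inside track_length_m plays the role that SORT BY Time plays in the SQL: the measured length is only meaningful if the positions are joined in chronological order.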
To get even more out of this data, CCRi deployed an instance of its web-based, high-volume spatio-temporal visualization tool, Stealth. Static data maps don’t do justice to the dynamic nature of spatio-temporal data, which is best appreciated through animation. Stealth can animate millions of records in a standard browser without the need for plugins or WebGL. Check out the following short demo to get a sense of the data that exactEarth provides, and how visualization can lead to additional insights.
(View other examples here and here.)
Next steps
As demonstrated here, GCP, and specifically Cloud Bigtable and Cloud Dataproc, proved to be a high-performance, highly interactive backend for exactEarth’s high-volume, real-time needs. Furthermore, thanks to GeoMesa’s standard API that works across multiple backends, exactEarth can migrate this application between cloud-based and on-premises architectures as it desires.
To learn more:
- Explore Cloud Bigtable Quickstarts
- Explore Cloud Dataproc Quickstarts