In the previous post we finished setting up the Hadoop environment. Today we will use the gis-tools-for-hadoop toolkit provided by Esri to get a hands-on feel for how Hadoop works, and to witness Hadoop's first date with GIS.
Related environment:
Esri/gis-tools-for-hadoop: https://github.com/Esri/gis-tools-for-hadoop
- Sample tools that demonstrate full stack implementations of all the resources provided to solve GIS problems using Hadoop
- Templates for building custom tools that solve specific problems
- Resources for building custom tools:
- Spatial Framework for Hadoop
- Java helper utilities for Hadoop developers
- Hive spatial user-defined functions
- Esri Geometry API Java – Java geometry library for spatial data processing
- Geoprocessing Tools – ArcGIS Geoprocessing tools for Hadoop
hive-0.11.0: http://www.apache.org/dyn/closer.cgi/hive/
Before we use the tools Esri provides, we need to install Hive.
- Hive is a data warehouse tool built on top of Hadoop. It maps structured data files onto database tables and provides full SQL-style querying, translating SQL statements into MapReduce jobs for execution. Its advantage is a low learning curve: simple MapReduce statistics can be produced quickly with SQL-like statements, without writing a dedicated MapReduce application, which makes it a very good fit for the statistical analysis typical of a data warehouse.
Hive was contributed by Facebook, and its syntax is quite similar to MySQL's, so anyone who regularly writes database SQL will not find it unfamiliar. That, in fact, is why the Hive project exists: the SQL user base is simply enormous.
------------------------------------------------------------------
************************** Divider *****************************
------------------------------------------------------------------
Installing Hive is just like installing Hadoop: a single decompress command, and it only needs to be installed on the namenode machine (the master).
After the installation, the related files need to be configured.
1: Install Hive under the hadoop user, and make sure the hadoop user has permissions on the hive directory
- drwxrwxr-x. 10 hadoop hadoop 4096 Jun 25 18:55 hive-0.11.0
2: We need to add the Hadoop and JDK environment variables to the following files:
/home/hadoop/hive-0.11.0/conf/hive-env.sh
/home/hadoop/hive-0.11.0/bin/hive-config.sh
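For reference, a minimal sketch of what the additions to hive-env.sh might look like, assuming the same paths used elsewhere in this post (adjust to your own installation):

# /home/hadoop/hive-0.11.0/conf/hive-env.sh (sketch; paths as used in this post)
export JAVA_HOME=/home/jdk/jdk1.7.0_25
export HADOOP_HOME=/home/hadoop/hadoop-1.2.0
export HIVE_CONF_DIR=/home/hadoop/hive-0.11.0/conf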
3: Add the HIVE_HOME information to the hadoop user's environment variables
- export JAVA_HOME=/home/jdk/jdk1.7.0_25
- export PATH=$JAVA_HOME/bin:$PATH
- export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib
- #Hadoop
- export HADOOP_HOME=/home/hadoop/hadoop-1.2.0
- export HIVE_HOME=/home/hadoop/hive-0.11.0
- export HADOOP_HOME_WARN_SUPPRESS=1
- export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin
- export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib
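After editing the hadoop user's profile, reload it and confirm the variables took effect; a quick check along these lines:

# reload the environment and verify (run as the hadoop user)
source ~/.bashrc
echo $HADOOP_HOME    # expect /home/hadoop/hadoop-1.2.0
echo $HIVE_HOME      # expect /home/hadoop/hive-0.11.0
which hive           # should resolve to $HIVE_HOME/bin/hive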
4: By default there are no hive-default.xml or hive-site.xml files under $HIVE_HOME/conf, only hive-default.xml.template; simply make copies with the cp command
- [hadoop@namenode conf]$ cp hive-default.xml.template hive-default.xml
- [hadoop@namenode conf]$ cp hive-default.xml.template hive-site.xml
- [hadoop@namenode conf]$ ll
- total 248
- -rw-rw-r--. 1 hadoop hadoop 75005 Jun 26 02:19 hive-default.xml
- -rw-rw-r--. 1 hadoop hadoop 75005 May 11 12:06 hive-default.xml.template
- -rw-rw-r--. 1 hadoop hadoop 2714 Jun 26 02:34 hive-env.sh
- -rw-rw-r--. 1 hadoop hadoop 2378 May 11 12:06 hive-env.sh.template
- -rw-rw-r--. 1 hadoop hadoop 2465 May 11 12:06 hive-exec-log4j.properties.template
- -rw-rw-r--. 1 hadoop hadoop 2941 Jun 26 02:38 hive-log4j.properties
- -rw-rw-r--. 1 hadoop hadoop 2870 May 11 12:06 hive-log4j.properties.template
- -rw-rw-r--. 1 hadoop hadoop 75005 Jun 26 02:19 hive-site.xml
5: You can change the default warehouse path by editing $HIVE_HOME/conf/hive-site.xml (a fairly large file)
- <property>
- <name>hive.metastore.warehouse.dir</name>
- <value>/user/hive/warehouse</value>
- <description>location of default database for the warehouse</description>
- </property>
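Note that this warehouse path lives in HDFS, not on the local file system. If the directories do not exist yet, something along these lines (a sketch using the default path above) creates them and makes them group-writable, which Hive expects:

# create the Hive scratch and warehouse directories in HDFS
hadoop fs -mkdir /tmp
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse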
Testing hive
- [hadoop@namenode ~]$ hive
- Logging initialized using configuration in file:/home/hadoop/hive-0.11.0/conf/hive-log4j.properties
- Hive history file=/tmp/hadoop/hive_job_log_hadoop_4317@namenode.com_201306261028_23687925.txt
- hive>
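Once the hive> prompt appears, a couple of throwaway statements are enough to confirm that the metastore and HDFS are both reachable (a sketch; the table name is arbitrary):

-- quick smoke test from the hive> prompt
SHOW TABLES;
CREATE TABLE hive_smoke_test (id INT, name STRING);
DESCRIBE hive_smoke_test;
DROP TABLE hive_smoke_test;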
------------------------------------------------------------------
************************** Divider *****************************
------------------------------------------------------------------
An overview of Esri's gis-tools-for-hadoop
- [hadoop@namenode samples]$ ll
- total 20
- drwxr-xr-x. 4 hadoop hadoop 4096 Jun 26 04:54 data
- drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 04:54 lib
- drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 05:30 point-in-polygon-aggregation-hive
- drwxr-xr-x. 5 hadoop hadoop 4096 Jun 26 04:54 point-in-polygon-aggregation-mr
- -rw-r--r--. 1 hadoop hadoop 98 Jun 26 04:54 README.md
It contains a data folder with a set of earthquake records (csv) and US county boundary data (json)
- [hadoop@namenode data]$ ll
- total 556
- drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 04:54 counties-data
- drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 04:54 earthquake-data
- -rw-r--r--. 1 hadoop hadoop 560721 Jun 26 04:54 samples.gdb.zip
The lib folder holds two very important jar files
- [hadoop@namenode lib]$ ll
- total 908
- -rw-r--r--. 1 hadoop hadoop 794165 Jun 26 04:54 esri-geometry-api.jar
- -rw-r--r--. 1 hadoop hadoop 135121 Jun 26 04:54 spatial-sdk-hadoop.jar
The point-in-polygon…-hive folder contains a SQL file that lays out the usage steps in detail
- [hadoop@namenode point-in-polygon-aggregation-hive]$ ll
- total 8
- -rw-r--r--. 1 hadoop hadoop 3195 Jun 26 04:54 README.md
- -rw-r--r--. 1 hadoop hadoop 1161 Jun 26 04:54 run-sample.sql
The point-in-polygon…-mr folder contains the Java source code for developing the same sample against Hadoop
- [hadoop@namenode point-in-polygon-aggregation-mr]$ ll
- total 28
- -rw-r--r--. 1 hadoop hadoop 4885 Jun 26 04:54 aggregation-sample.jar
- -rw-r--r--. 1 hadoop hadoop 1012 Jun 26 04:54 build.xml
- drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 04:54 cmd
- drwxr-xr-x. 2 hadoop hadoop 4096 Jun 26 04:54 gp
- -rw-r--r--. 1 hadoop hadoop 1913 Jun 26 04:54 README.md
- drwxr-xr-x. 3 hadoop hadoop 4096 Jun 26 04:54 src
------------------------------------------------------------------
************************** Divider *****************************
------------------------------------------------------------------
Before executing the SQL, the data has to be loaded into HDFS. We cannot do this with the Linux cp and mv commands; we must use the tools that ship with hadoop.
- [hadoop@namenode ~]$ hadoop fs
- Usage: java FsShell
- [-ls <path>]
- [-lsr <path>]
- [-du <path>]
- [-dus <path>]
- [-count[-q] <path>]
- [-mv <src> <dst>]
- [-cp <src> <dst>]
- [-rm [-skipTrash] <path>]
- [-rmr [-skipTrash] <path>]
- [-expunge]
- [-put <localsrc> ... <dst>]
- [-copyFromLocal <localsrc> ... <dst>]
- [-moveFromLocal <localsrc> ... <dst>]
- [-get [-ignoreCrc] [-crc] <src> <localdst>]
- [-getmerge <src> <localdst> [addnl]]
- [-cat <src>]
- [-text <src>]
- [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
- [-moveToLocal [-crc] <src> <localdst>]
- [-mkdir <path>]
- [-setrep [-R] [-w] <rep> <path/file>]
- [-touchz <path>]
- [-test -[ezd] <path>]
- [-stat [format] <path>]
- [-tail [-f] <file>]
- [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
- [-chown [-R] [OWNER][:[GROUP]] PATH...]
- [-chgrp [-R] GROUP PATH...]
- [-help [cmd]]
- Generic options supported are
- -conf <configuration file> specify an application configuration file
- -D <property=value> use value for given property
- -fs <local|namenode:port> specify a namenode
- -jt <local|jobtracker:port> specify a job tracker
- -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
- -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
- -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
- The general command line syntax is
- bin/hadoop command [genericOptions] [commandOptions]
1: Load the data into HDFS
- [hadoop@namenode ~]$ hadoop fs -put /home/hadoop/gis-tools-for-hadoop-master/samples/ /home/hadoop/hadoop-1.2.0/tmp/
The -put command above takes the data physically stored under /home/hadoop/gis-tools-for-hadoop-master/samples/ and loads it into /home/hadoop/hadoop-1.2.0/tmp/ inside HDFS. Once the data is in, you can see a large number of block fragments on any of the datanode machines
- [hadoop@datanode1 current]$ pwd
- /home/hadoop/hadoop-1.2.0/tmp/dfs/data/current
- [hadoop@datanode1 current]$ ll
- total 7052
- -rw-rw-r--. 1 hadoop hadoop 98 Jun 26 06:12 blk_2562006058303613171
- -rw-rw-r--. 1 hadoop hadoop 11 Jun 26 06:12 blk_2562006058303613171_1050.meta
- -rw-rw-r--. 1 hadoop hadoop 560721 Jun 26 06:12 blk_3056013857537171121
- -rw-rw-r--. 1 hadoop hadoop 4391 Jun 26 06:12 blk_3056013857537171121_1052.meta
- -rw-rw-r--. 1 hadoop hadoop 2047 Jun 26 06:12 blk_3813361044238402711
- -rw-rw-r--. 1 hadoop hadoop 23 Jun 26 06:12 blk_3813361044238402711_1063.meta
- -rw-rw-r--. 1 hadoop hadoop 2060 Jun 26 06:12 blk_5126515286091847995
- -rw-rw-r--. 1 hadoop hadoop 27 Jun 26 06:12 blk_5126515286091847995_1064.meta
- -rw-rw-r--. 1 hadoop hadoop 794165 Jun 26 06:12 blk_5144324295121310544
- -rw-rw-r--. 1 hadoop hadoop 6215 Jun 26 06:12 blk_5144324295121310544_1055.meta
- -rw-rw-r--. 1 hadoop hadoop 1913 Jun 26 06:12 blk_7055687596152865845
- -rw-rw-r--. 1 hadoop hadoop 23 Jun 26 06:12 blk_7055687596152865845_1062.meta
- -rw-rw-r--. 1 hadoop hadoop 5742811 Jun 26 06:12 blk_7385460214599207016
- -rw-rw-r--. 1 hadoop hadoop 44875 Jun 26 06:12 blk_7385460214599207016_1053.meta
- -rw-rw-r--. 1 hadoop hadoop 1045 Jun 26 06:12 blk_-787033794569559952
- -rw-rw-r--. 1 hadoop hadoop 19 Jun 26 06:12 blk_-787033794569559952_1056.meta
- -rw-rw-r--. 1 hadoop hadoop 3195 Jun 26 06:12 blk_8646433984325059766
- -rw-rw-r--. 1 hadoop hadoop 35 Jun 26 06:12 blk_8646433984325059766_1048.meta
- -rw-rw-r--. 1 hadoop hadoop 772 Jun 26 06:12 dncp_block_verification.log.curr
- -rw-rw-r--. 1 hadoop hadoop 159 Jun 26 03:30 VERSION
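Besides browsing the datanode's current directory, you can ask the namenode for the same picture with fsck, which reports the files, blocks, and block locations for a given HDFS path (a sketch using the path from this post):

# run on the namenode; lists each block and the datanodes holding it
hadoop fsck /home/hadoop/hadoop-1.2.0/tmp/ -files -blocks -locations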
2: List the HDFS files
- [hadoop@namenode ~]$ hadoop fs -ls /home/hadoop/hadoop-1.2.0/tmp/
- Found 6 items
- -rw-r--r-- 1 hadoop supergroup 98 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/README.md
- drwxr-xr-x - hadoop supergroup 0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/data
- drwxr-xr-x - hadoop supergroup 0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/lib
- drwxr-xr-x - hadoop supergroup 0 2013-06-26 09:38 /home/hadoop/hadoop-1.2.0/tmp/mapred
- drwxr-xr-x - hadoop supergroup 0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/point-in-polygon-aggregation-hive
- drwxr-xr-x - hadoop supergroup 0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/point-in-polygon-aggregation-mr
3: List the files under a particular directory in HDFS
- [hadoop@namenode ~]$ hadoop fs -ls input /home/hadoop/hadoop-1.2.0/tmp/data
- ls: Cannot access input: No such file or directory.
- Found 3 items
- drwxr-xr-x - hadoop supergroup 0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/data/counties-data
- drwxr-xr-x - hadoop supergroup 0 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/data/earthquake-data
- -rw-r--r-- 1 hadoop supergroup 560721 2013-06-26 09:30 /home/hadoop/hadoop-1.2.0/tmp/data/samples.gdb.zip
4: View the contents of a file under an HDFS directory
- hadoop fs -cat /home/hadoop/hadoop-1.2.0/tmp/data/earthquake-data/earthquake.csv
5: Delete a directory from HDFS
- hadoop fs -rmr /home/hadoop/hadoop-1.2.0/tmp/data
Note: once you are inside the HDFS directory structure, the ordinary Linux commands no longer apply; for example, you cannot use ls to view the files under a directory or cd into a path.
The commands above can be used to verify whether the data has actually been uploaded to HDFS.
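For example, a couple of quick spot checks against the earthquake CSV (a sketch; the file name inside earthquake-data may differ in your copy of the samples):

# list the uploaded earthquake data and peek at the first records
hadoop fs -ls /home/hadoop/hadoop-1.2.0/tmp/data/earthquake-data
hadoop fs -cat /home/hadoop/hadoop-1.2.0/tmp/data/earthquake-data/earthquake.csv | head -n 5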
------------------------------------------------------------------
************************** Divider *****************************
------------------------------------------------------------------
Next, we run the analysis using the SQL statements that Esri provides
- -- add the jar files
- add jar
- /home/hadoop/esri-geometry-api.jar
- /home/hadoop/spatial-sdk-hadoop.jar;
- -- create temporary functions; the SQL below uses ST_Point and ST_Contains,
- -- and any other functions you need are created the same way
- create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
- create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
- -- create an external table with the appropriate structure, pointing at the csv file above
- -- note in particular that the location must be the path the data was loaded into inside HDFS
- -- if the load succeeded, SELECT * FROM earthquakes1; will return records;
- -- if nothing comes back, the data was not loaded successfully
- -- Hive has managed tables and external tables; dropping an external table removes only the
- -- metadata kept about the HDFS data, not the data itself; see the Hive help for details
- CREATE EXTERNAL TABLE IF NOT EXISTS earthquakes1 (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
- LOCATION '/home/hadoop/hadoop-1.2.0/tmp/data/earthquake-data';
- -- create a counties1 table in the same way
- CREATE EXTERNAL TABLE IF NOT EXISTS counties1 (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
- ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
- STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
- OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
- LOCATION '/home/hadoop/hadoop-1.2.0/tmp/data/counties-data';
- -- then execute the following SQL
- SELECT counties1.name, count(*) cnt FROM counties1
- JOIN earthquakes1
- WHERE ST_Contains(counties1.boundaryshape, ST_Point(earthquakes1.longitude, earthquakes1.latitude))
- GROUP BY counties1.name
- ORDER BY cnt desc;
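If that final query comes back empty, first confirm that both external tables actually see their data, as the comments above suggest; a concrete check (LIMIT keeps the output small):

-- sanity check that the external tables are wired to the HDFS data
SELECT * FROM earthquakes1 LIMIT 5;
SELECT name, state FROM counties1 LIMIT 5;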
Note: Hive is really quite similar to the SQL we use all the time, so while working with it you can reason by analogy with SQL; if you have learned MySQL it will feel even more familiar. SQL has simple statements such as SELECT * FROM table, INSERT INTO ..., and DELETE FROM table, and it also has more complex constructs, essentially user-written stored procedures. The steps above are a very typical example of a Hive user-defined function (UDF), which you can likewise compare to a SQL stored procedure: with it we can write complex statements to implement whatever functionality we need. Because Hive itself is written in Java, user-written UDFs must also be written in Java. There are actually three kinds of UDFs; see the Hive documentation for the specifics.
So how do we use a UDF once it is written? Following the example above:
1: Package the Java UDF into a jar
2: Register that jar in Hive with add jar
3: Give the Java class an alias
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
Steps 2 and 3 have to be repeated every time hive is started (a small convenience script is sketched after this list)
4: Create the table in Hive and load the data
5: Execute the statements that produce the result you want
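Since steps 2 and 3 have to be repeated in every Hive session, one convenient pattern is to collect them in a script file and load it each time; a sketch (the file name is arbitrary):

-- st_functions.hql: register the spatial jars and the UDFs in one shot
add jar /home/hadoop/esri-geometry-api.jar /home/hadoop/spatial-sdk-hadoop.jar;
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';

You can then run source /home/hadoop/st_functions.hql; at the hive> prompt, or start the CLI with hive -i /home/hadoop/st_functions.hql so the registrations happen automatically.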
If you are interested, you can also look at the relevant source code under the following directory
%Hadoop%\spatial-framework-for-hadoop-master\spatial-framework-for-hadoop-master\hive\src\com\esri\hadoop\hive
The toolkit Esri provides also includes spatial constructor and spatial relationship functions broadly comparable to Oracle's ST_Geometry.
They are all listed in %Hadoop%\spatial-framework-for-hadoop-master\spatial-framework-for-hadoop-master\hive\function-ddl.sql
- create temporary function ST_AsBinary as 'com.esri.hadoop.hive.ST_AsBinary';
- create temporary function ST_AsGeoJSON as 'com.esri.hadoop.hive.ST_AsGeoJson';
- create temporary function ST_AsJSON as 'com.esri.hadoop.hive.ST_AsJson';
- create temporary function ST_AsText as 'com.esri.hadoop.hive.ST_AsText';
- create temporary function ST_GeomFromJSON as 'com.esri.hadoop.hive.ST_GeomFromJson';
- create temporary function ST_GeomFromGeoJSON as 'com.esri.hadoop.hive.ST_GeomFromGeoJson';
- create temporary function ST_GeomFromText as 'com.esri.hadoop.hive.ST_GeomFromText';
- create temporary function ST_GeomFromWKB as 'com.esri.hadoop.hive.ST_GeomFromWKB';
- create temporary function ST_PointFromWKB as 'com.esri.hadoop.hive.ST_PointFromWKB';
- create temporary function ST_LineFromWKB as 'com.esri.hadoop.hive.ST_LineFromWKB';
- create temporary function ST_PolyFromWKB as 'com.esri.hadoop.hive.ST_PolyFromWKB';
- create temporary function ST_MPointFromWKB as 'com.esri.hadoop.hive.ST_MPointFromWKB';
- create temporary function ST_MLineFromWKB as 'com.esri.hadoop.hive.ST_MLineFromWKB';
- create temporary function ST_MPolyFromWKB as 'com.esri.hadoop.hive.ST_MPolyFromWKB';
- create temporary function ST_GeomCollection as 'com.esri.hadoop.hive.ST_GeomCollection';
- create temporary function ST_GeometryType as 'com.esri.hadoop.hive.ST_GeometryType';
- create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
- create temporary function ST_PointZ as 'com.esri.hadoop.hive.ST_PointZ';
- create temporary function ST_LineString as 'com.esri.hadoop.hive.ST_LineString';
- create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon';
- create temporary function ST_MultiPoint as 'com.esri.hadoop.hive.ST_MultiPoint';
- create temporary function ST_MultiLineString as 'com.esri.hadoop.hive.ST_MultiLineString';
- create temporary function ST_MultiPolygon as 'com.esri.hadoop.hive.ST_MultiPolygon';
- create temporary function ST_SetSRID as 'com.esri.hadoop.hive.ST_SetSRID';
- create temporary function ST_SRID as 'com.esri.hadoop.hive.ST_SRID';
- create temporary function ST_IsEmpty as 'com.esri.hadoop.hive.ST_IsEmpty';
- create temporary function ST_IsSimple as 'com.esri.hadoop.hive.ST_IsSimple';
- create temporary function ST_Dimension as 'com.esri.hadoop.hive.ST_Dimension';
- create temporary function ST_X as 'com.esri.hadoop.hive.ST_X';
- create temporary function ST_Y as 'com.esri.hadoop.hive.ST_Y';
- create temporary function ST_MinX as 'com.esri.hadoop.hive.ST_MinX';
- create temporary function ST_MaxX as 'com.esri.hadoop.hive.ST_MaxX';
- create temporary function ST_MinY as 'com.esri.hadoop.hive.ST_MinY';
- create temporary function ST_MaxY as 'com.esri.hadoop.hive.ST_MaxY';
- create temporary function ST_IsClosed as 'com.esri.hadoop.hive.ST_IsClosed';
- create temporary function ST_IsRing as 'com.esri.hadoop.hive.ST_IsRing';
- create temporary function ST_Length as 'com.esri.hadoop.hive.ST_Length';
- create temporary function ST_GeodesicLengthWGS84 as 'com.esri.hadoop.hive.ST_GeodesicLengthWGS84';
- create temporary function ST_Area as 'com.esri.hadoop.hive.ST_Area';
- create temporary function ST_Is3D as 'com.esri.hadoop.hive.ST_Is3D';
- create temporary function ST_Z as 'com.esri.hadoop.hive.ST_Z';
- create temporary function ST_MinZ as 'com.esri.hadoop.hive.ST_MinZ';
- create temporary function ST_MaxZ as 'com.esri.hadoop.hive.ST_MaxZ';
- create temporary function ST_IsMeasured as 'com.esri.hadoop.hive.ST_IsMeasured';
- create temporary function ST_M as 'com.esri.hadoop.hive.ST_M';
- create temporary function ST_MinM as 'com.esri.hadoop.hive.ST_MinM';
- create temporary function ST_MaxM as 'com.esri.hadoop.hive.ST_MaxM';
- create temporary function ST_CoordDim as 'com.esri.hadoop.hive.ST_CoordDim';
- create temporary function ST_NumPoints as 'com.esri.hadoop.hive.ST_NumPoints';
- create temporary function ST_PointN as 'com.esri.hadoop.hive.ST_PointN';
- create temporary function ST_StartPoint as 'com.esri.hadoop.hive.ST_StartPoint';
- create temporary function ST_EndPoint as 'com.esri.hadoop.hive.ST_EndPoint';
- create temporary function ST_ExteriorRing as 'com.esri.hadoop.hive.ST_ExteriorRing';
- create temporary function ST_NumInteriorRing as 'com.esri.hadoop.hive.ST_NumInteriorRing';
- create temporary function ST_InteriorRingN as 'com.esri.hadoop.hive.ST_InteriorRingN';
- create temporary function ST_NumGeometries as 'com.esri.hadoop.hive.ST_NumGeometries';
- create temporary function ST_GeometryN as 'com.esri.hadoop.hive.ST_GeometryN';
- create temporary function ST_Centroid as 'com.esri.hadoop.hive.ST_Centroid';
- create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
- create temporary function ST_Crosses as 'com.esri.hadoop.hive.ST_Crosses';
- create temporary function ST_Disjoint as 'com.esri.hadoop.hive.ST_Disjoint';
- create temporary function ST_EnvIntersects as 'com.esri.hadoop.hive.ST_EnvIntersects';
- create temporary function ST_Envelope as 'com.esri.hadoop.hive.ST_Envelope';
- create temporary function ST_Equals as 'com.esri.hadoop.hive.ST_Equals';
- create temporary function ST_Overlaps as 'com.esri.hadoop.hive.ST_Overlaps';
- create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';
- create temporary function ST_Relate as 'com.esri.hadoop.hive.ST_Relate';
- create temporary function ST_Touches as 'com.esri.hadoop.hive.ST_Touches';
- create temporary function ST_Within as 'com.esri.hadoop.hive.ST_Within';
- create temporary function ST_Distance as 'com.esri.hadoop.hive.ST_Distance';
- create temporary function ST_Boundary as 'com.esri.hadoop.hive.ST_Boundary';
- create temporary function ST_Buffer as 'com.esri.hadoop.hive.ST_Buffer';
- create temporary function ST_ConvexHull as 'com.esri.hadoop.hive.ST_ConvexHull';
- create temporary function ST_Intersection as 'com.esri.hadoop.hive.ST_Intersection';
- create temporary function ST_Union as 'com.esri.hadoop.hive.ST_Union';
- create temporary function ST_Difference as 'com.esri.hadoop.hive.ST_Difference';
- create temporary function ST_SymmetricDiff as 'com.esri.hadoop.hive.ST_SymmetricDiff';
- create temporary function ST_SymDifference as 'com.esri.hadoop.hive.ST_SymmetricDiff';
- create temporary function ST_Aggr_Union as 'com.esri.hadoop.hive.ST_Aggr_Union';
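Once registered, these behave like any other Hive function. A couple of simple one-liners to try (a sketch; they ride on the earthquakes1 table created earlier only because Hive 0.11 expects a FROM clause, and the coordinate-list form of ST_Polygon is assumed here):

-- constructor plus conversion: should return something like POINT (1 2)
SELECT ST_AsText(ST_Point(1, 2)) FROM earthquakes1 LIMIT 1;
-- spatial predicate: the square (1,1)-(4,4) contains the point (2,3), so this returns true
SELECT ST_Contains(ST_Polygon(1,1, 1,4, 4,4, 4,1), ST_Point(2, 3)) FROM earthquakes1 LIMIT 1;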
Once you are sure all the data has been loaded, execute the SQL statement; the output is as follows
- hive> SELECT counties1.name, count(*) cnt FROM counties1
- > JOIN earthquakes1
- > WHERE ST_Contains(counties1.boundaryshape, ST_Point(earthquakes1.longitude, earthquakes1.latitude))
- > GROUP BY counties1.name
- > ORDER BY cnt desc;
- Total MapReduce jobs = 3
- Launching Job 1 out of 3
- Number of reduce tasks determined at compile time: 1
- In order to change the average load for a reducer (in bytes):
- set hive.exec.reducers.bytes.per.reducer=<number>
- In order to limit the maximum number of reducers:
- set hive.exec.reducers.max=<number>
- In order to set a constant number of reducers:
- set mapred.reduce.tasks=<number>
- Starting Job = job_201306260649_0003, Tracking URL = http://namenode.com:50030/jobdetails.jsp?jobid=job_201306260649_0003
- Kill Command = /home/hadoop/hadoop-1.2.0/libexec/../bin/hadoop job -kill job_201306260649_0003
- Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
- 2013-06-26 09:57:13,072 Stage-1 map = 0%, reduce = 0%
- 2013-06-26 09:57:27,152 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 5.83 sec
- 2013-06-26 09:57:28,160 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 5.83 sec
- 2013-06-26 09:57:29,167 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 5.83 sec
- 2013-06-26 09:57:30,174 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 5.83 sec
- 2013-06-26 09:57:31,187 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:32,200 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:33,210 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:34,219 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:35,237 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:36,246 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:37,256 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:38,265 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:39,271 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:40,278 Stage-1 map = 100%, reduce = 70%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:41,286 Stage-1 map = 100%, reduce = 70%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:42,294 Stage-1 map = 100%, reduce = 70%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:43,301 Stage-1 map = 100%, reduce = 70%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:44,308 Stage-1 map = 100%, reduce = 70%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:45,314 Stage-1 map = 100%, reduce = 70%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:46,323 Stage-1 map = 100%, reduce = 71%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:47,330 Stage-1 map = 100%, reduce = 71%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:48,337 Stage-1 map = 100%, reduce = 71%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:49,343 Stage-1 map = 100%, reduce = 72%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:50,354 Stage-1 map = 100%, reduce = 72%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:51,360 Stage-1 map = 100%, reduce = 72%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:52,369 Stage-1 map = 100%, reduce = 73%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:53,379 Stage-1 map = 100%, reduce = 73%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:54,385 Stage-1 map = 100%, reduce = 73%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:55,391 Stage-1 map = 100%, reduce = 74%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:56,397 Stage-1 map = 100%, reduce = 74%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:57,403 Stage-1 map = 100%, reduce = 74%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:58,411 Stage-1 map = 100%, reduce = 76%, Cumulative CPU 10.4 sec
- 2013-06-26 09:57:59,418 Stage-1 map = 100%, reduce = 76%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:00,425 Stage-1 map = 100%, reduce = 76%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:01,433 Stage-1 map = 100%, reduce = 76%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:02,439 Stage-1 map = 100%, reduce = 76%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:03,448 Stage-1 map = 100%, reduce = 76%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:04,464 Stage-1 map = 100%, reduce = 77%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:05,476 Stage-1 map = 100%, reduce = 77%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:06,482 Stage-1 map = 100%, reduce = 77%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:07,488 Stage-1 map = 100%, reduce = 79%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:08,497 Stage-1 map = 100%, reduce = 79%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:09,503 Stage-1 map = 100%, reduce = 79%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:10,516 Stage-1 map = 100%, reduce = 80%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:11,524 Stage-1 map = 100%, reduce = 80%, Cumulative CPU 10.4 sec
- 2013-06-26 09:58:12,533 Stage-1 map = 100%, reduce = 80%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:13,541 Stage-1 map = 100%, reduce = 81%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:14,547 Stage-1 map = 100%, reduce = 81%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:15,554 Stage-1 map = 100%, reduce = 81%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:16,559 Stage-1 map = 100%, reduce = 82%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:17,566 Stage-1 map = 100%, reduce = 82%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:18,575 Stage-1 map = 100%, reduce = 82%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:19,582 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:20,592 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:21,599 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:22,606 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:23,614 Stage-1 map = 100%, reduce = 84%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:24,620 Stage-1 map = 100%, reduce = 84%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:25,626 Stage-1 map = 100%, reduce = 84%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:26,632 Stage-1 map = 100%, reduce = 85%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:27,638 Stage-1 map = 100%, reduce = 85%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:28,650 Stage-1 map = 100%, reduce = 85%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:29,656 Stage-1 map = 100%, reduce = 86%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:30,661 Stage-1 map = 100%, reduce = 86%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:31,668 Stage-1 map = 100%, reduce = 86%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:32,677 Stage-1 map = 100%, reduce = 87%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:33,682 Stage-1 map = 100%, reduce = 87%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:34,690 Stage-1 map = 100%, reduce = 87%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:35,700 Stage-1 map = 100%, reduce = 89%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:36,706 Stage-1 map = 100%, reduce = 89%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:37,713 Stage-1 map = 100%, reduce = 89%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:38,719 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:39,726 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:40,734 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:41,741 Stage-1 map = 100%, reduce = 91%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:42,747 Stage-1 map = 100%, reduce = 91%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:43,754 Stage-1 map = 100%, reduce = 91%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:44,760 Stage-1 map = 100%, reduce = 92%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:45,767 Stage-1 map = 100%, reduce = 92%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:46,773 Stage-1 map = 100%, reduce = 92%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:47,780 Stage-1 map = 100%, reduce = 93%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:48,786 Stage-1 map = 100%, reduce = 93%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:49,791 Stage-1 map = 100%, reduce = 93%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:50,802 Stage-1 map = 100%, reduce = 94%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:51,807 Stage-1 map = 100%, reduce = 94%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:52,814 Stage-1 map = 100%, reduce = 94%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:53,820 Stage-1 map = 100%, reduce = 95%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:54,826 Stage-1 map = 100%, reduce = 95%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:55,834 Stage-1 map = 100%, reduce = 95%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:56,841 Stage-1 map = 100%, reduce = 96%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:57,847 Stage-1 map = 100%, reduce = 96%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:58,853 Stage-1 map = 100%, reduce = 96%, Cumulative CPU 43.27 sec
- 2013-06-26 09:58:59,861 Stage-1 map = 100%, reduce = 97%, Cumulative CPU 43.27 sec
- 2013-06-26 09:59:00,869 Stage-1 map = 100%, reduce = 97%, Cumulative CPU 43.27 sec
- 2013-06-26 09:59:01,876 Stage-1 map = 100%, reduce = 97%, Cumulative CPU 43.27 sec
- 2013-06-26 09:59:02,881 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 43.27 sec
- 2013-06-26 09:59:03,887 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 43.27 sec
- 2013-06-26 09:59:04,893 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 43.27 sec
- 2013-06-26 09:59:05,906 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 43.27 sec
- 2013-06-26 09:59:06,912 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 43.27 sec
- 2013-06-26 09:59:07,925 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 99.03 sec
- 2013-06-26 09:59:08,932 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 99.03 sec
- 2013-06-26 09:59:09,938 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 99.03 sec
- 2013-06-26 09:59:10,948 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 99.03 sec
- MapReduce Total cumulative CPU time: 1 minutes 39 seconds 30 msec
- Ended Job = job_201306260649_0003
- Launching Job 2 out of 3
- Number of reduce tasks not specified. Estimated from input data size: 1
- In order to change the average load for a reducer (in bytes):
- set hive.exec.reducers.bytes.per.reducer=<number>
- In order to limit the maximum number of reducers:
- set hive.exec.reducers.max=<number>
- In order to set a constant number of reducers:
- set mapred.reduce.tasks=<number>
- Starting Job = job_201306260649_0004, Tracking URL = http://namenode.com:50030/jobdetails.jsp?jobid=job_201306260649_0004
- Kill Command = /home/hadoop/hadoop-1.2.0/libexec/../bin/hadoop job -kill job_201306260649_0004
- Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
- 2013-06-26 09:59:19,107 Stage-2 map = 0%, reduce = 0%
- 2013-06-26 09:59:23,132 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:24,137 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:25,142 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:26,148 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:27,157 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:28,166 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:29,178 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:30,187 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:31,200 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:32,211 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:33,225 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 0.99 sec
- 2013-06-26 09:59:34,236 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.27 sec
- 2013-06-26 09:59:35,242 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.27 sec
- 2013-06-26 09:59:36,247 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.27 sec
- 2013-06-26 09:59:37,253 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.27 sec
- 2013-06-26 09:59:38,267 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3.27 sec
- MapReduce Total cumulative CPU time: 3 seconds 270 msec
- Ended Job = job_201306260649_0004
- Launching Job 3 out of 3
- Number of reduce tasks determined at compile time: 1
- In order to change the average load for a reducer (in bytes):
- set hive.exec.reducers.bytes.per.reducer=<number>
- In order to limit the maximum number of reducers:
- set hive.exec.reducers.max=<number>
- In order to set a constant number of reducers:
- set mapred.reduce.tasks=<number>
- Starting Job = job_201306260649_0005, Tracking URL = http://namenode.com:50030/jobdetails.jsp?jobid=job_201306260649_0005
- Kill Command = /home/hadoop/hadoop-1.2.0/libexec/../bin/hadoop job -kill job_201306260649_0005
- Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
- 2013-06-26 09:59:46,788 Stage-3 map = 0%, reduce = 0%
- 2013-06-26 09:59:52,817 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
- 2013-06-26 09:59:53,824 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
- 2013-06-26 09:59:54,835 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
- 2013-06-26 09:59:55,840 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
- 2013-06-26 09:59:56,851 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
- 2013-06-26 09:59:57,865 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
- 2013-06-26 09:59:58,874 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
- 2013-06-26 09:59:59,884 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
- 2013-06-26 10:00:00,890 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.67 sec
- 2013-06-26 10:00:01,896 Stage-3 map = 100%, reduce = 33%, Cumulative CPU 1.67 sec
- 2013-06-26 10:00:02,905 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 3.48 sec
- 2013-06-26 10:00:03,910 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 3.48 sec
- 2013-06-26 10:00:04,915 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 3.48 sec
- 2013-06-26 10:00:05,921 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 3.48 sec
- 2013-06-26 10:00:06,929 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 3.48 sec
- MapReduce Total cumulative CPU time: 3 seconds 480 msec
- Ended Job = job_201306260649_0005
- MapReduce Jobs Launched:
- Job 0: Map: 2 Reduce: 1 Cumulative CPU: 99.03 sec HDFS Read: 6771646 HDFS Write: 541 SUCCESS
- Job 1: Map: 1 Reduce: 1 Cumulative CPU: 3.27 sec HDFS Read: 1002 HDFS Write: 541 SUCCESS
- Job 2: Map: 1 Reduce: 1 Cumulative CPU: 3.48 sec HDFS Read: 1002 HDFS Write: 199 SUCCESS
- Total MapReduce CPU Time Spent: 1 minutes 45 seconds 780 msec
- OK
- Kern 36
- San Bernardino 35
- Imperial 28
- Inyo 20
- Los Angeles 18
- Monterey 14
- Riverside 14
- Santa Clara 12
- Fresno 11
- San Benito 11
- San Diego 7
- Santa Cruz 5
- San Luis Obispo 3
- Ventura 3
- Orange 2
- San Mateo 1
- Time taken: 183.06 seconds, Fetched: 16 row(s)
From the output above we can see that this SQL statement spawned three jobs, each with its own map and reduce phases. Analyzing Esri's small sample data set under the Hadoop architecture took 183 seconds, which confirms Hadoop's limitations on small data volumes: every technology has the scenarios it suits, and here it is a bit of a hero with nowhere to show its prowess. As an extension, anyone working in telecom or electric-utility GIS will find none of this unfamiliar, because it is exactly the development pattern they are used to: operating on ST_Geometry with SQL against ArcSDE, writing SQL directly in the database and using the relational operators SDE provides. Hive, as used above, is a SQL-like interface whose operators are essentially the same as those SQL operations, so if the data volumes in those industries are very large and the analysis fairly complex, this model is worth considering.
Likewise, Esri provides geoprocessing (GP) tools, a visual way of driving Hadoop; the functionality these tools offer is fairly basic.
Download: https://github.com/Esri/geoprocessing-tools-for-hadoop
1: Enable WebHDFS
Add the following to the hdfs-site.xml configuration file on the jobtracker (master) machine
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
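After restarting HDFS, you can verify that WebHDFS is answering with a plain HTTP request against the namenode's web UI port (a sketch; namenode.com and the Hadoop 1.x default port 50070 are the values used in this series, adjust as needed):

# a JSON FileStatuses response means WebHDFS is enabled
curl -i "http://namenode.com:50070/webhdfs/v1/?op=LISTSTATUS"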
2: Deploy the requests and webhdfs Python packages included in the GP toolkit
Copy these two folders into Python's site-packages folder
C:\Python27\ArcGIS10.1\Lib\site-packages
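A quick, hypothetical check that the ArcGIS Python interpreter can see both packages after the copy (module names assumed to match the folder names copied above):

C:\Python27\ArcGIS10.1\python.exe -c "import requests, webhdfs; print 'ok'"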
For specifics on how to use the tools, refer to the help provided with them.