Hudi 0.9 -- Getting Started
1. Building
Official docs: https://hudi.apache.org/docs/spark_quick-start-guide.html
Build guide: https://github.com/apache/hudi#building-apache-hudi-from-source
Prerequisites
- Unix-like system (like Linux, Mac OS X)
- Java 8 (Java 9 or 10 may work)
- Git
- Maven
Download the source
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
Modify the Maven settings and add the Aliyun repository to the Hudi pom.xml to speed up dependency downloads:
<!-- hudi pom.xml: add inside the existing <repositories> section -->
<repository>
    <id>nexus-aliyun</id>
    <name>nexus-aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>
<!-- Maven settings.xml: add inside <mirrors> -->
<mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
Build
# Default build (Scala 2.11, Spark 2.4.4)
mvn clean package -DskipTests
# Build with Scala 2.12
mvn clean package -DskipTests -Dscala-2.12
# Build with Spark 3.0.0
mvn clean package -DskipTests -Dspark3
Problems encountered
The build can fail because the following jars are missing from public repositories; download them manually and install them into your local Maven repository (one way is sketched below):
org.pentaho:pentaho-aggdesigner:5.1.5-jhyde
org.pentaho:pentaho-aggdesigner-algorithm:5.1.5-jhyde
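A minimal sketch of installing one of these jars by hand with Maven's install-file goal, assuming the jar has already been downloaded to the current directory (the file path is an assumption; adjust it to wherever you saved the jar):
# Install the manually downloaded jar into the local Maven repository
mvn install:install-file \
  -DgroupId=org.pentaho \
  -DartifactId=pentaho-aggdesigner-algorithm \
  -Dversion=5.1.5-jhyde \
  -Dpackaging=jar \
  -Dfile=./pentaho-aggdesigner-algorithm-5.1.5-jhyde.jar
Repeat the same command with the coordinates of the other missing artifact.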
2. A simple Spark demo
Official example: https://hudi.apache.org/docs/spark_quick-start-guide.html
- Launch script
# spark-avro must match the Spark version (2.4.4 here); Hudi requires Kryo serialization;
# --jars points at the Hudi bundle built above
spark-shell \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --jars /data/software/hudi/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar
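If you built with -Dspark3 instead, the launch line changes accordingly. A sketch, assuming Spark 3.0.x and that the build produced the Scala 2.12 / Spark 3 bundle (the exact jar name and version in your target directory may differ):
spark-shell \
  --packages org.apache.spark:spark-avro_2.12:3.0.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --jars /data/software/hudi/packaging/hudi-spark-bundle/target/hudi-spark3-bundle_2.12-0.9.0-SNAPSHOT.jar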
- Inserting and querying Hudi data
// Imports
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
// Generate random trip records and insert them
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
// Query the data (snapshot query)
val tripsSnapshotDF = spark.
read.
format("hudi").
load(basePath + "/*/*/*/*")
//load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
scala> spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
+------------------+-------------------+-------------------+-------------+
| fare| begin_lon| begin_lat| ts|
+------------------+-------------------+-------------------+-------------+
| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1624978473449|
| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1624621326973|
| 93.56018115236618|0.14285051259466197|0.21624150367601136|1625167431901|
| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1624780551080|
| 43.4923811219014| 0.8779402295427752| 0.6100070562136587|1624616132563|
| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1624993336931|
|34.158284716382845|0.46157858450465483| 0.4726905879569653|1625104334309|
| 41.06290929046368| 0.8192868687714224| 0.651058505660742|1624933798500|
+------------------+-------------------+-------------------+-------------+
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
| 20210702130717|8beede37-5359-42e...| americas/united_s...|rider-213|driver-213| 27.79478688582596|
| 20210702130717|db8cd50b-0713-43c...| americas/united_s...|rider-213|driver-213| 64.27696295884016|
| 20210702130717|6696a1a8-c653-464...| americas/united_s...|rider-213|driver-213| 93.56018115236618|
| 20210702130717|fe3b38f2-9012-4a8...| americas/united_s...|rider-213|driver-213| 33.92216483948643|
| 20210702130717|c72d0c53-b2e0-486...| americas/united_s...|rider-213|driver-213|19.179139106643607|
| 20210702130717|27dde682-6134-464...| americas/brazil/s...|rider-213|driver-213| 43.4923811219014|
| 20210702130717|8e939051-9dda-4f3...| americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
| 20210702130717|37ee46c0-2c31-48f...| americas/brazil/s...|rider-213|driver-213|34.158284716382845|
| 20210702130717|57e7921e-620c-4e7...| asia/india/chennai|rider-213|driver-213|17.851135255091155|
| 20210702130717|5f15c2d5-744c-4e4...| asia/india/chennai|rider-213|driver-213| 41.06290929046368|
+-------------------+--------------------+----------------------+---------+----------+------------------+
- Other operations to follow; see the official docs (the update flow is sketched below).
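As a taste of those operations, here is the update (upsert) flow from the same official quickstart: generate updates for already-inserted keys, write them in Append mode, and Hudi merges the new values into the table by record key:
// Generate updates for existing records and upsert them
val updates = convertToStringList(dataGen.generateUpdates(10))
val dfUpdates = spark.read.json(spark.sparkContext.parallelize(updates, 2))
dfUpdates.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
Re-running the snapshot query above should then show the updated values for the touched records.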