ApacheSparkでの結合の謎を解き明かす

こんにちは、Habr。「EcosystemHadoop、Spark、Hive」コースの将来の学生のために、資料の翻訳を準備しました。



また、「Sparkアプリケーションのテスト」ウェビナーにすべての人を招待しますこのオープンレッスンでは、Sparkアプリケーションのテストにおける問題(統計データ、部分的な検証、重いシステムの開始/停止)について検討します。ソリューションのライブラリを調べて、テストを書いてみましょう。






この記事では、Apache Sparkでの結合操作にのみ焦点を当て、Spark結合テクノロジーが構築されている基盤の概要を説明します。

結合は、2つのデータセットを相互に関連付けるために、一般的なデータマイニングストリームでよく使用されます。統合分析エンジンであるApacheSparkは、さまざまな結合シナリオを実行するための強固な基盤も提供しています。





Join , , , , . ( ) Join , Joined . .





Join:

, Join Apache Spark. :





1) : Join. , Join, Join.





2) Join: , , (Join Condition). () , . , : Join Joins.





, , . . , (A.x == B.x) ((A.x == B.x) (A.y == B.y)) -   x, y  A B, Join.





. , . , (A.x < B.x) ((A.x == B.x) (A.y == B.y)) -   x, y  A B, Join.





3) Join type: Join Join Join . Join:





(Inner Join): Inner Join Joined ( Join) .





(Outer Join): Outer Join , . , () .





(Semi Join): Semi Join , , , . , , , (Semi Join) (Anti Join).





: Cross Join , .





Join, Apache Spark Join.





Join

, Join, .





Apache Spark Join. :





  • (Shuffle Hash Join)





  • (Broadcast Hash Join)





  • (Sort Merge Join)





  • (Cartesian Join)





  • (Broadcast Nested Loop Join)





Broadcast Hash Join: «Broadcast Hash Join» ( Join) . - , , -.





“Broadcast Hash Join" . , . Spark , .





Shuffle Hash Join: 'Shuffle Hash Join' () ( , ”Guide to Spark Partitioning ( Spark)”. , , (shuffle) Join.





, , Shuffle Hash Join, , Hash Join. , - , -.





"Shuffle Hash Join" "Broadcast Hash Join". , - . , , Join 'Shuffle Hash Join'. , 'Broadcast Hash Join', Spark .





Sort Merge Join: 'Sort Merge Join' 'Shuffle Hash Join'. () . , , (shuffle) Join.





, , Sort Merge Join , Sort Merge Join.





'Sort Merge Join' 'Shuffle Hash Join' 'Broadcast Hash Join', , 'Sort Merge Join' , 'Shuffle Hash' 'Broadcast Hash'. , 'Shuffle Hash Join', , (shuffle) , , 'Sort Merge Join'.





Cartesian Join: Cartesian Join . . , . , .





Cartesian Join . Join, Cartesian - .





Broadcast Nested Loop Join: 'Broadcast Nested Loop Join' . Nested Loop Join .





«Broadcast Nested Loop Join» , . , , .





Spark Join?

Join Join, , Spark :





Spark Join, :









  • Join









  • Join





  • (Equi or Non-Equi Join)





Spark API Join Join Join. Join, 'broadcast', 'merge', 'shuffle_hash' 'shuffle_replicate_nl', , Join.





, Spark Join :





'Broadcast Hash Join'





  • Equi Join





  • 'Full Outer' Join





, :





  • 'Broadcast', Join - 'Right Outer', 'Right Semi' 'Inner'.





  • , 'spark.sql.autoBroadcastJoinThreshold



    ( 10 )' Join - 'Right Outer', 'Right Semi', or 'Inner'.





  • 'Broadcast' , Join - 'Left Outer', 'Left Semi' 'Inner'.





  • , 'spark.sql.autoBroadcastJoinThreshold



    ( 10 )' Join - 'Left Outer', 'Left Semi', or 'Inner'.





  • 'Broadcast' , Join - 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' 'Inner'.





  • , 'spark.sql.autoBroadcastJoinThreshold



    ( 10 )' Join - 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi' 'Inner'.





'Shuffle Hash Join'





  • Equi Join





  • 'Full Outer' Join





  • 'spark.sql.join.prefersortmergeJoin



    ( true)' false





, :





  • 'shuffle_hash' , Join - 'Right Outer', 'Right Semi', 'Inner'.





  • , , Join - 'Right Outer', 'Right Semi' 'Inner'.





  • 'shuffle_hash' , Join - 'Left Outer', 'Left Semi', 'Inner'.





  • , , Join - 'Left Outer', 'Left Semi', 'Inner'.





  • 'shuffle_hash' , Join - 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', 'Inner'.





  • , , Join - 'Left Outer', 'Left Semi', 'Right Outer', 'Right Semi', 'Inner'.





'Sort Merge Join'





  • Equi Join





  • Join Keys, Equi Join,





  • 'spark.sql.join.prefersortmergeJoin ( true)' true.





, :





  • 'merge' , Join .





  • , Join .





'Cartesian Join'





  • 'Inner'





, :





  • 'shuffle_replicate_nl' , Join Equi Non-Equi.





  • , Join Equi Non-Equi.





'Broadcast Nested Loop Join'

'Broadcast Nested Loop Join' - Join ; , 'Broadcast Nested Loop Join' Join Join.





, Join , 'Broadcast Hash Join', 'Sort Merge Join', 'Shuffle Hash Join', 'Cartesian Join'.





Cartesian Broadcast Nested Loop Join, Broadcast Nested Loop Inner, Non-Equi Joins, Cartesian Join, , .





, : Join. , .





, Join Apache Spark. - , , .






« Hadoop, Spark, Hive»





« Spark »








All Articles