Amazon Redshift Analytics Engine +クラウドのメリット

Amazon Redshift Analytics Engine +クラウドのメリット







こんにちは、Habr!







AnalyticsチームのArtemyKozyrが連絡を取り合っており、引き続きWheelyを紹介します。この問題で:







  • アジャイルクラスターコンピューティングの基礎
  • 列の保存とデータの圧縮
  • インデックスの代わりに:セグメンテーションとソートキー
  • アクセス、権利、リソースの管理
  • S3またはDataleykとの統合


Wheely , Redshift Amazon. ?







. -, . .









leader node () compute node ( ). , , . — , .







Redshiftは、マシンのクラスター上で並列データ処理を提供します







Redshift







MPP (massively parallel processing, - ). , : BigQuery, Vertica, Greenplum, Teradata, Azure SQL DW.







compute node : CPU, memory, disk, .







AmazonRedshiftで利用可能な計算ノードタイプの仕様







compute node Amazon Redshift







Wheely 8 dc2.large, 16 vCPU, 120 GiB , 1.3 TB . - , , : , , .







(elasticity), . : (resize), (node upgrade).







同時実行スケーリングを使用すると、ピーク負荷に対処するために一時クラスターを上げることができます。







Concurrency Scaling







concurrency scaling, compute nodes, "" .









(PostgreSQL, MySQL, SQL Server), (row-oriented storage), Redshift (column-oriented storage). , .







. ( - ). 100+ ; 5. 5 . 100+ , I/O .







列ストレージは、単一の列データを(行ではなく)ブロックに編成します。







( !)







-, , (compression). , (Business, First, Luxe) . , : {100, Luxe}. , . : () , .







ランレングスエンコーディングアルゴリズムを使用したデータ圧縮の例







Runlength encoding







, , Amazon Redshift. Compressing Redshift columnar data even further with proper encodings.







データ圧縮アルゴリズムの改訂により、ディスク容量が21%節約されました







21%









Redshift () (B-Tree, Bitmap) . , :







, : (EVEN), (ALL), (KEY). (distribution style).







クラスター内のデータの各行の3種類の分布:KEY、ALL、EVEN







3 : KEY, ALL, EVEN







, . , , . , distribution key, sort key — SORT MERGE JOIN:







ORDER BY request_id:
{{
   config(
       dist='request_id',
       sort='request_id'
   )
}}
      
      





ORDER BY SQL-, . — (compound sort key).







ORDER BY city, date, product_name:
{{
   config(
       materialized='table',
       dist="city",
       sort=['city', 'date', 'product_name']
   )
}}
      
      





, Interleaved sort key, ( ) . , .







INTERLEAVED SORT BY completed_ts_loc, city, country, service_group, is_airport, is_wheely_journey:
{{
   config(
       materialized='table',
       dist="journey_id",
       sort_type='interleaved',
       sort=["completed_ts_loc"
               , "city"
               , "country"
               , "service_group"
               , "is_airport"
               , "is_wheely_journey"]
   )
}}
      
      





.









, , :









. :







  • -
  • :)


Amazon Redshift PostgreSQL :







:
----------------------
-- USER MANAGEMENT ---
----------------------

CREATE USER etl WITH PASSWORD '' ;
CREATE USER hevo WITH PASSWORD '' ;

CREATE USER dbt WITH PASSWORD '' ;

CREATE USER da WITH PASSWORD '' ;
CREATE USER nb WITH PASSWORD '' ;

CREATE USER looker WITH PASSWORD '' SYSLOG ACCESS UNRESTRICTED ;

CREATE USER ar WITH PASSWORD '' ;
CREATE USER ak WITH PASSWORD '' ;

-------------------------
--- SCHEMA MANAGEMENT ---
-------------------------

CREATE SCHEMA IF NOT EXISTS hevo AUTHORIZATION hevo ;
CREATE SCHEMA IF NOT EXISTS ext AUTHORIZATION etl ;

CREATE SCHEMA IF NOT EXISTS flatten AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS staging AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS intermediate AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS analytics AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS ad_hoc AUTHORIZATION dbt ;
CREATE SCHEMA IF NOT EXISTS meta AUTHORIZATION dbt ;

CREATE SCHEMA IF NOT EXISTS looker_scratch AUTHORIZATION looker ;

-----------------------
-- GROUP MANAGEMENT ---
-----------------------

CREATE GROUP etl
WITH USER etl, hevo ;

CREATE GROUP dbt
WITH USER dbt ;

CREATE GROUP analytics
WITH USER da, nb ;

CREATE GROUP bi
WITH USER looker ;

CREATE GROUP business_users
WITH USER ar, ak ;

----------------------------
-- PRIVILEGES MANAGEMENT ---
----------------------------

-- GROUP etl
GRANT USAGE ON SCHEMA hevo, staging, flatten, analytics, intermediate TO GROUP etl ;

-- GROUP dbt
GRANT USAGE ON SCHEMA hevo, snapshots TO GROUP dbt ;
GRANT USAGE, CREATE ON SCHEMA ext TO GROUP dbt ;

GRANT SELECT ON ALL tables IN SCHEMA hevo TO GROUP dbt ;
GRANT ALL PRIVILEGES ON ALL tables IN SCHEMA meta, ext, snapshots TO GROUP dbt ;

ALTER DEFAULT PRIVILEGES FOR USER hevo IN SCHEMA hevo
   GRANT SELECT ON tables TO GROUP dbt ;

ALTER DEFAULT PRIVILEGES FOR USER etl IN SCHEMA ext
   GRANT ALL PRIVILEGES ON tables TO GROUP dbt ;

-- GROUP analytics
GRANT USAGE ON SCHEMA hevo, ext, flatten, staging, intermediate, restricted, analytics, meta TO GROUP analytics ;
GRANT USAGE, CREATE ON SCHEMA ad_hoc TO GROUP analytics ;

GRANT SELECT ON ALL tables IN SCHEMA hevo, ext, flatten, staging, intermediate, restricted, analytics, meta TO GROUP analytics ;
GRANT ALL PRIVILEGES ON ALL tables IN SCHEMA ad_hoc TO GROUP analytics ;

ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA flatten, staging, intermediate, restricted, analytics, meta
   GRANT SELECT ON tables TO GROUP analytics ;

ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA ad_hoc
   GRANT ALL PRIVILEGES ON tables TO GROUP analytics ;

-- GROUP bi
GRANT USAGE ON SCHEMA analytics, meta, ad_hoc, snapshots TO GROUP bi ;

GRANT SELECT ON ALL tables IN SCHEMA analytics, meta, ad_hoc TO GROUP bi ;

ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA analytics, meta, ad_hoc
   GRANT SELECT ON tables TO GROUP bi ;

-- GROUP business users
GRANT USAGE ON SCHEMA analytics TO GROUP business_users ;
GRANT USAGE, CREATE ON SCHEMA ad_hoc TO GROUP business_users ;

GRANT SELECT ON ALL tables IN SCHEMA analytics TO GROUP business_users ;
GRANT ALL PRIVILEGES ON ALL tables IN SCHEMA ad_hoc TO GROUP business_users ;

ALTER DEFAULT PRIVILEGES FOR USER dbt IN SCHEMA analytics
   GRANT SELECT ON tables TO GROUP business_users ;
      
      





Amazon Redshift automatic workload management (WLM), .. . - . Redshift :







  • ELT
  • BI Ad-hoc
  • “”
  • ,


キュー構成(WLM)により、クラスター負荷を柔軟に管理できます







(WLM)







, .







short-query acceleration. Amazon Redshift ML- , (~ 30 .) , . , Wheely , BI .







Data Lake



? ! Redshift Spectrum :







  • SQL DDL ( S3)
  • : (SELECT), (WHERE), (JOINs), (INSERT)
  • - : JSON, ORC, PARQUET


EXTERNAL TABLE _Hive. _ : Hive Metastore.







Spectrumは、Redshiftで広く普及しているデータ形式を幅広く提供します







Spectrum Redshift







Wheely Spectrum:







1. Data Quality Pipeline







S3 Spectrum - (MongoDB) (Redshift). : QA , .







2. S3







>3 S3 parquet. , Redshift , .







S3 ( parquet)
UNLOAD ('SELECT * FROM "hevo"."events_prod_clickstream_archive"')
 TO 's3://wheely_analytics/dwh/hevo/events_prod/clickstream_archive' 
 IAM_ROLE ''
 MANIFEST
 FORMAT AS PARQUET
 NULL AS ''
 MAXFILESIZE AS 256 
 ENCRYPTED AUTO
 ;
      
      





: -. .









千の言葉の代わりに













, . . , . production-, ad-hoc .







, (target table). . , . . , , , .







削除または破損したテーブルをバックアップから回復することが、これまでになく簡単かつ迅速になりました













, , 7 .







: in-database ML, native semi-structured data support



. Wheely .







, Amazon Redshift, preview ( ):









, . ! , , .







? Wheely. — .







: https://t.me/enthusiastech.







.








All Articles