Hivemall

What is Hivemall
  • Apache homepage: http://hivemall.incubator.apache.org/index.html
  • Covers: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering
  • ML algorithms include: Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta
  • Supported engines: Hive, Spark, Pig
  • Architecture
Hive on Spark vs. Hive on MapReduce
Hive on Spark with Hivemall
AUC calculation
  • Single node

    -- Sample predictions: prob is the predicted score, label is the true class.
    with data as (
      select 0.5 as prob, 0 as label
      union all
      select 0.3 as prob, 1 as label
      union all
      select 0.2 as prob, 0 as label
      union all
      select 0.8 as prob, 1 as label
      union all
      select 0.7 as prob, 1 as label
    )
    select
      auc(prob, label) as auc
    from (
      -- auc() expects its input sorted by score in descending order.
      select prob, label
      from data
      ORDER BY prob DESC
    ) t;
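
As a cross-check on the five sample rows in the query above, AUC can be computed by hand as the fraction of (positive, negative) pairs in which the positive row has the higher score. The Python sketch below is an independent calculation, not Hivemall's internal implementation; the function name `pairwise_auc` is made up for illustration.

```python
# Sample rows from the query above: (predicted probability, true label).
rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]


def pairwise_auc(rows):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [p for p, y in rows if y == 1]
    neg = [p for p, y in rows if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative row")
    score = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                score += 1.0
            elif p == n:
                score += 0.5
    return score / (len(pos) * len(neg))


print(round(pairwise_auc(rows), 5))  # 0.83333
```

Five of the six positive/negative pairs are ordered correctly (only 0.3 vs. 0.5 is not), so the result is 5/6. The single-node query should return roughly the same value.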

  • Parallel approximate computation

    -- Same sample predictions as in the single-node example.
    with data as (
      select 0.5 as prob, 0 as label
      union all
      select 0.3 as prob, 1 as label
      union all
      select 0.2 as prob, 0 as label
      union all
      select 0.8 as prob, 1 as label
      union all
      select 0.7 as prob, 1 as label
    )
    select
      auc(prob, label) as auc
    from (
      select prob, label
      from data
      -- Rows with the same bucket key go to the same reducer,
      -- and each reducer sorts its own rows by prob.
      DISTRIBUTE BY floor(prob / 0.2)
      SORT BY prob DESC
    ) t;
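
To see what `DISTRIBUTE BY floor(prob / 0.2)` followed by `SORT BY prob DESC` does to the sample rows, the Python sketch below groups them by the same bucket key and sorts each group in descending order. It models only the bucket key and the per-group sort; how Hive maps keys onto actual reducers is not modeled, and `bucket_rows` is a hypothetical helper name.

```python
import math
from collections import defaultdict

# Sample rows from the query above: (predicted probability, true label).
rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]


def bucket_rows(rows, width=0.2):
    """Group rows by floor(prob / width), then sort each group by prob descending."""
    buckets = defaultdict(list)
    for prob, label in rows:
        buckets[math.floor(prob / width)].append((prob, label))
    return {key: sorted(group, reverse=True) for key, group in sorted(buckets.items())}


for key, group in bucket_rows(rows).items():
    print(key, group)
```

Rows are only ordered inside each bucket, not globally, which is why the result is an approximation rather than the exact single-node value.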
Compile from source
Hive & JSON

Parsing JSON in Hive:

  • json_split: Brickhouse UDF that splits a JSON array into an array of elements
  • get_json_object: built-in Hive UDF that extracts a value by JSON path