Hi, I am able to run my Pig script in local mode, but it fails when I run it in mapreduce mode.
My Pig script is:
REGISTER 'pyudf.py' USING streaming_python AS myfuncs;
a = LOAD 'nyse_daily_price' USING org.apache.hcatalog.pig.HCatLoader();
stocks = GROUP a BY stock_symbol;
res = FOREACH stocks GENERATE group, FLATTEN(myfuncs.summary_stats(a.stock_price_adj_close));
STORE res INTO 'stock_stats_py_output';
The Python UDF is in a single file, pyudf.py:
from pig_util import outputSchema
import numpy as np

@outputSchema("(Q25:double, MEDIAN:double, Q75:double, Q99:double)")
def summary_stats(input):
    # Each item in the bag is a tuple; take its first field as a float.
    input = [float(item[0]) for item in input]
    feature_25 = np.percentile(input, 25)
    feature_50 = np.median(input)
    feature_75 = np.percentile(input, 75)
    feature_99 = np.percentile(input, 99)
    return feature_25, feature_50, feature_75, feature_99
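
To make clear what the UDF is expected to return, here is a minimal standalone sketch of the same percentile logic that runs without Pig or numpy. It is a stand-in, not the UDF itself: `percentile` and `summary_stats_local` are hypothetical helper names, `percentile` reimplements the linear interpolation that np.percentile uses by default, and the sample bag is made-up data that assumes Pig passes the bag as a list of single-field tuples (which is what `item[0]` in the UDF implies).

```python
def percentile(values, q):
    """Percentile with linear interpolation (np.percentile's default method)."""
    data = sorted(values)
    if not data:
        raise ValueError("cannot take a percentile of empty data")
    # Fractional index of the q-th percentile in the sorted data.
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    # Interpolate between the two neighbouring values.
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def summary_stats_local(bag):
    # Assumption: the bag arrives as a list of single-field tuples.
    prices = [float(item[0]) for item in bag]
    return tuple(percentile(prices, q) for q in (25, 50, 75, 99))


# Made-up sample data, for illustration only.
sample_bag = [(10.0,), (20.0,), (30.0,), (40.0,), (50.0,)]
print(summary_stats_local(sample_bag))
```

If this produces sensible values for a sample of the real data, the percentile logic itself is probably not what differs between local and mapreduce mode.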