\n"}]},"apps":[],"jobName":"paragraph_1563216591603_-309578561","id":"20190715-144951_470679494","dateCreated":"2019-07-15T14:49:51-0400","dateStarted":"2019-08-19T18:04:50-0400","dateFinished":"2019-08-19T18:04:52-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45088"},{"text":"%md\nWe now would like to build a feature matrix for our predictive model.\n\nLet's look at possible predictive variables for our model:\n\n1. month: winter months should have more delays than summer months\n2. day of month: this is likely not a very predictive variable, but let's keep it in anyway\n3. day of week: weekend vs. weekday\n4. hour of the day: later hours tend to have more delays\n5. Carrier: we might expect some carriers to be more prone to delays than others\n6. Destination airport: we expect some airports to be more prone to delays than others\n7. Distance: interesting to see if this variable is a good predictor of delay\n\nAnother **generated** feature is the number of days from closest national holiday, with the assumption that holidays tend to be associated with more delays.\n\nWe implement this \"feature generation\" process using Pig and some simple Python user-defined-functions (UDFs). First, let's create some Python UDFs using a shell *heredoc* method. (The *heredoc* method allows the Notebook to write a file from the notebook to the local working directory. In the case below we will create the file `util.py`)\n","user":"deadline","dateUpdated":"2019-08-20T09:01:05-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
We now would like to build a feature matrix for our predictive model.
\n
Let’s look at possible predictive variables for our model:
\n
\n - month: winter months should have more delays than summer months
\n - day of month: this is likely not a very predictive variable, but let’s keep it in anyway
\n - day of week: weekend vs. weekday
\n - hour of the day: later hours tend to have more delays
\n - Carrier: we might expect some carriers to be more prone to delays than others
\n - Destination airport: we expect some airports to be more prone to delays than others
\n - Distance: interesting to see if this variable is a good predictor of delay
\n
\n
Another generated feature is the number of days from the nearest national holiday, with the assumption that dates close to holidays tend to be associated with more delays.
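To make this generated feature concrete, here is a minimal pure-Python sketch of the days-from-nearest-holiday computation. The holiday list is abbreviated for illustration; the full 2007-2008 list appears in util.py below.

```python
from datetime import date

# Abbreviated holiday list for illustration only; the notebook's util.py
# defines the full 2007-2008 list.
holidays = [date(2007, 1, 1), date(2007, 7, 4), date(2007, 12, 25)]

# Distance (in days) to the nearest holiday in the list
def days_from_nearest_holiday(year, month, day):
    d = date(year, month, day)
    return min(abs((d - h).days) for h in holidays)

print(days_from_nearest_holiday(2007, 7, 2))  # 2 (two days before July 4)
```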
\n
We implement this “feature generation” process using Pig and some simple Python user-defined functions (UDFs). First, let’s create some Python UDFs using a shell heredoc method. (The heredoc method lets the notebook write a file to the local working directory; in the case below we will create the file util.py.)
\n
"}]},"apps":[],"jobName":"paragraph_1563280955319_1876266887","id":"20190716-084235_1849126309","dateCreated":"2019-07-16T08:42:35-0400","dateStarted":"2019-08-20T09:01:05-0400","dateFinished":"2019-08-20T09:01:05-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45089"},{"text":"%sh\n# This paragraph uses the \"sh\" command to run a shell command in the working directory.\n# The command below will \"print the working directroy path\" (pwd) and\n# list the files (ls)\n\npwd\nls\n","user":"deadline","dateUpdated":"2019-08-20T08:58:42-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"sh","editOnDblClick":false},"editorMode":"ace/mode/sh","editorHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"/home/zeppelin\nderby.log\nfigure\npig_1566236659327.log\npreprocess1.pig\npreprocess2.pig\ntemp\ntests\nutil.py\n"}]},"apps":[],"jobName":"paragraph_1563298064472_-867236689","id":"20190716-132744_253871478","dateCreated":"2019-07-16T13:27:44-0400","dateStarted":"2019-08-20T08:58:42-0400","dateFinished":"2019-08-20T08:58:42-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45090"},{"text":"%sh\n# user heredoc to write util.py to the working directory\n cat << EOF > util.py\n#\n# Python UDFs for our PIG script\n#\nfrom datetime import date\n\n# get hour-of-day from HHMM field\n@outputSchema(\"value: int\")\ndef get_hour(val):\n return int(val.zfill(4)[:2])\n \n# get calender date that matches weather date format \n@outputSchema(\"onestring:chararray\")\ndef to_date(year, month, day):\n d = date(year, month, day)\n return d.strftime(\"%Y%m%d\")\n\n# this array defines the dates of holiday in 2007 and 2008\nholidays = [\n date(2007, 1, 1), date(2007, 1, 15), date(2007, 2, 19), date(2007, 5, 28), date(2007, 6, 7), date(2007, 7, 4), \\\n date(2007, 9, 3), date(2007, 10, 8), date(2007, 11, 11), date(2007, 11, 22), date(2007, 12, 25), 
\\\n date(2008, 1, 1), date(2008, 1, 21), date(2008, 2, 18), date(2008, 5, 22), date(2008, 5, 26), date(2008, 7, 4), \\\n date(2008, 9, 1), date(2008, 10, 13), date(2008, 11, 11), date(2008, 11, 27), date(2008, 12, 25) \\\n ]\n# get number of days from nearest holiday\n@outputSchema(\"days: int\")\ndef days_from_nearest_holiday(year, month, day):\n d = date(year, month, day)\n x = [(abs(d-h)).days for h in holidays]\n return min(x)\nEOF\n","user":"deadline","dateUpdated":"2019-08-20T08:59:32-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"sh","editOnDblClick":false},"editorMode":"ace/mode/sh"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[]},"apps":[],"jobName":"paragraph_1563217294361_223435234","id":"20190715-150134_2025525704","dateCreated":"2019-07-15T15:01:34-0400","dateStarted":"2019-08-20T08:59:32-0400","dateFinished":"2019-08-20T08:59:32-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45091"},{"text":"%md\nOur Pig script (below) is relatively simple:\n\n1. Load the dataset (2007 or 2008)\n2. Filter out flights that were cancelled or that are NOT originating in ORD\n3. Project only variables that we want to use in the analysis\n4. Generate the output feature matrix, using the Python UDFs\n\n**Note:** this script is not working when run though this notebook. We will run this from the command line. (working on a solution)\n","user":"deadline","dateUpdated":"2019-08-20T09:01:45-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Our Pig script (below) is relatively simple:
\n
\n - Load the dataset (2007 or 2008)
\n - Filter out flights that were cancelled or that did not originate at ORD
\n - Project only variables that we want to use in the analysis
\n - Generate the output feature matrix, using the Python UDFs
\n
\n
Note: this script does not work when run through this notebook. We will run this from the command line. (working on a solution)
\n
"}]},"apps":[],"jobName":"paragraph_1563281500390_252035464","id":"20190716-085140_912506178","dateCreated":"2019-07-16T08:51:40-0400","dateStarted":"2019-08-20T09:01:45-0400","dateFinished":"2019-08-20T09:01:45-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45092"},{"text":"%pig\nRegister 'util.py' USING jython as util;\nDEFINE preprocess(year_str, airport_code) returns data\n{\n -- load airline data from specified year (need to specify fields since it's not in HCat)\n airline = load 'flights/$year_str.csv' using PigStorage(',') \n as (Year: int, Month: int, DayOfMonth: int, DayOfWeek: int, DepTime: chararray, \n CRSDepTime: chararray, ArrTime, CRSArrTime, Carrier: chararray, FlightNum, TailNum, ActualElapsedTime, \n CRSElapsedTime, AirTime, ArrDelay, DepDelay: int, Origin: chararray, Dest: chararray, Distance: int, \n TaxiIn, TaxiOut, Cancelled: int, CancellationCode, Diverted, CarrierDelay, WeatherDelay, \n NASDelay, SecurityDelay, LateAircraftDelay);\n\n -- keep only instances where flight was not cancelled and originate at ORD\n airline_flt = filter airline by Cancelled == 0 and Origin == '$airport_code';\n\n -- Keep only fields I need\n $data = foreach airline_flt generate DepDelay as delay, Month, DayOfMonth, DayOfWeek, \n util.get_hour(CRSDepTime) as hour, Distance, Carrier, Dest,\n util.days_from_nearest_holiday(Year, Month, DayOfMonth) as hdays;\n};\n\nORD_2007 = preprocess('2007', 'ORD');\nrmf airline/fm/ord_2007_1\nstore ORD_2007 into 'airline/fm/ord_2007_1' using PigStorage(',');\n\nORD_2008 = preprocess('2008', 'ORD');\nrmf airline/fm/ord_2008_1\nstore ORD_2008 into 'airline/fm/ord_2008_1' using 
PigStorage(',');","user":"deadline","dateUpdated":"2019-07-15T17:19:57-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"pig","editOnDblClick":false},"editorMode":"ace/mode/pig"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1563216639820_1122533466","id":"20190715-145039_1112446165","dateCreated":"2019-07-15T14:50:39-0400","dateStarted":"2019-07-15T17:19:57-0400","dateFinished":"2019-07-15T17:19:57-0400","status":"ERROR","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:45093"},{"text":"%md\nLet's take a look at data in HDFS using `%sh` interpreter and `hdfs dfs` commands. Note that the results are spread accross six files indicating that the Pig MapReduce was operating in parallel. \n","user":"deadline","dateUpdated":"2019-08-20T09:04:33-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Let’s take a look at data in HDFS using %sh
interpreter and hdfs dfs
commands. Note that the results are spread across six files, indicating that the Pig MapReduce job was operating in parallel.
\n
"}]},"apps":[],"jobName":"paragraph_1563283388839_1761887550","id":"20190716-092308_194144036","dateCreated":"2019-07-16T09:23:08-0400","dateStarted":"2019-08-20T09:04:33-0400","dateFinished":"2019-08-20T09:04:33-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45094"},{"text":"%sh \necho \"2007 ORD data in HDFS\"\nhdfs dfs -ls airline/fm/ord_2007_1\necho \"2008 ORD data in HDFS\"\nhdfs dfs -ls airline/fm/ord_2008_1\n","user":"deadline","dateUpdated":"2019-08-19T12:35:24-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"sh","editOnDblClick":false},"editorMode":"ace/mode/sh","tableHide":false,"editorHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"2007 ORD data in HDFS\nFound 7 items\n-rw-r--r-- 2 zeppelin hdfs 0 2019-08-19 12:32 airline/fm/ord_2007_1/_SUCCESS\n-rw-r--r-- 2 zeppelin hdfs 1683189 2019-08-19 12:32 airline/fm/ord_2007_1/part-v000-o000-r-00000\n-rw-r--r-- 2 zeppelin hdfs 1840431 2019-08-19 12:32 airline/fm/ord_2007_1/part-v000-o000-r-00001\n-rw-r--r-- 2 zeppelin hdfs 1917108 2019-08-19 12:32 airline/fm/ord_2007_1/part-v000-o000-r-00002\n-rw-r--r-- 2 zeppelin hdfs 1618681 2019-08-19 12:32 airline/fm/ord_2007_1/part-v000-o000-r-00003\n-rw-r--r-- 2 zeppelin hdfs 1983023 2019-08-19 12:32 airline/fm/ord_2007_1/part-v000-o000-r-00004\n-rw-r--r-- 2 zeppelin hdfs 378754 2019-08-19 12:32 airline/fm/ord_2007_1/part-v000-o000-r-00005\n2008 ORD data in HDFS\nFound 7 items\n-rw-r--r-- 2 zeppelin hdfs 0 2019-08-19 12:33 airline/fm/ord_2008_1/_SUCCESS\n-rw-r--r-- 2 zeppelin hdfs 1455488 2019-08-19 12:33 airline/fm/ord_2008_1/part-v000-o000-r-00000\n-rw-r--r-- 2 zeppelin hdfs 1789885 2019-08-19 12:33 airline/fm/ord_2008_1/part-v000-o000-r-00001\n-rw-r--r-- 2 zeppelin hdfs 1807637 2019-08-19 12:33 airline/fm/ord_2008_1/part-v000-o000-r-00002\n-rw-r--r-- 2 zeppelin hdfs 1653248 2019-08-19 12:33 airline/fm/ord_2008_1/part-v000-o000-r-00003\n-rw-r--r-- 
2 zeppelin hdfs 1767646 2019-08-19 12:33 airline/fm/ord_2008_1/part-v000-o000-r-00004\n-rw-r--r-- 2 zeppelin hdfs 321380 2019-08-19 12:33 airline/fm/ord_2008_1/part-v000-o000-r-00005\n"}]},"apps":[],"jobName":"paragraph_1563283218291_1903781629","id":"20190716-092018_1008065710","dateCreated":"2019-07-16T09:20:18-0400","dateStarted":"2019-08-19T12:35:24-0400","dateFinished":"2019-08-19T12:35:27-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45095"},{"text":"%md\nWe have the files in `ord_2007_1` and `ord_2008_1` under `airline/fm` folder in HDFS (these contain the 359,169 ORD flights). Let's read those files into Python (using `read_csv_from_hdfs()` defined above), and prepare the training and testing (validation) datasets as Pandas DataFrame objects.\n\nThere are comments in the code that begin with `# VIEW`. The lines after this comment can be “uncommented” and used to view the data as the analysis progresses.\n\nInitially, we use only the numerical variables:\n","user":"deadline","dateUpdated":"2019-08-20T09:09:52-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
We have the files in ord_2007_1
and ord_2008_1
under airline/fm
folder in HDFS (these contain the 359,169 ORD flights). Let’s read those files into Python (using read_csv_from_hdfs()
defined above), and prepare the training and testing (validation) datasets as Pandas DataFrame objects.
\n
There are comments in the code that begin with # VIEW
. The lines after this comment can be “uncommented” and used to view the data as the analysis progresses.
\n
Initially, we use only the numerical variables:
\n
"}]},"apps":[],"jobName":"paragraph_1563283025326_-466143545","id":"20190716-091705_1541223639","dateCreated":"2019-07-16T09:17:05-0400","dateStarted":"2019-08-20T09:09:52-0400","dateFinished":"2019-08-20T09:09:52-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45096"},{"text":"%python\nfrom itertools import islice\n\n# read processed file (from above Pig script)\ncols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday']\ncol_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int, \n 'carrier': str, 'dest': str, 'days_from_holiday': int}\ndata_2007 = read_csv_from_hdfs('airline/fm/ord_2007_1', cols, col_types)\ndata_2008 = read_csv_from_hdfs('airline/fm/ord_2008_1', cols, col_types)\n\n# Create training set and test set (numberic variables only)\ncols = ['month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday']\ntrain_y = data_2007['delay'] >= 15\ntrain_x = data_2007[cols]\n\ntest_y = data_2008['delay'] >= 15\ntest_x = data_2008[cols]\n\n# VIEW look at data (first 10 lines)\n#print(train_y[:10])\n#print(train_x[:10])\n#print(train_x.shape)\n","user":"deadline","dateUpdated":"2019-08-20T14:02:32-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python","tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[]},"apps":[],"jobName":"paragraph_1563218824506_-2041978325","id":"20190715-152704_1664933868","dateCreated":"2019-07-15T15:27:04-0400","dateStarted":"2019-08-20T14:02:32-0400","dateFinished":"2019-08-20T14:02:32-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45097"},{"text":"%md\n### Logistic Regression and Random Forest (Iteration 1) ###\nOur data is 359,169 rows and 6 features in our model.\n\nNow we use Python's Scikit-learn machine learning package to to build two predictive models (Logistic 
regression and Random Forest) and compare their performance. To tell if we are making progress we print the confusion matrix, which counts the true positive, true negatives, false positives and false negatives. Then from the confusion matrix, we compute precision, recall, F1 metric and accuracy. We start with a logistic regression model and evaluate its performance on the testing dataset.\n","user":"deadline","dateUpdated":"2019-07-16T09:40:30-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Logistic Regression and Random Forest (Iteration 1)
\n
Our data has 359,169 rows and 6 features in our model.
\n
Now we use Python’s Scikit-learn machine learning package to build two predictive models (Logistic regression and Random Forest) and compare their performance. To tell if we are making progress we print the confusion matrix, which counts the true positives, true negatives, false positives and false negatives. Then from the confusion matrix, we compute precision, recall, the F1 metric and accuracy. We start with a logistic regression model and evaluate its performance on the testing dataset.
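As a sketch of how these metrics follow from the confusion matrix (the `metrics` helper below is illustrative only, not part of the notebook; scikit-learn computes all of these for us):

```python
# Derive precision, recall, F1 and accuracy from a 2x2 confusion matrix
# laid out as [[TN, FP], [FN, TP]], with "delayed" as the positive class.
def metrics(tn, fp, fn, tp):
    precision = tp / (tp + fp)          # of predicted delays, how many were real
    recall = tp / (tp + fn)             # of actual delays, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

# Counts taken from the logistic regression confusion matrix output below
p, r, f1, acc = metrics(149547, 90347, 33923, 61513)
print("precision=%.2f recall=%.2f F1=%.2f accuracy=%.2f" % (p, r, f1, acc))
```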
\n
"}]},"apps":[],"jobName":"paragraph_1563283446604_-53899040","id":"20190716-092406_829447660","dateCreated":"2019-07-16T09:24:06-0400","dateStarted":"2019-07-16T09:40:30-0400","dateFinished":"2019-07-16T09:40:30-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45098"},{"text":"%python\n# Create logistic regression model object with L2 regularization \nclf_lr = linear_model.LogisticRegression(penalty='l2', class_weight='balanced')\n\n# Train the model using the training sets\nclf_lr.fit(train_x, train_y)\n\n# Predict output labels on test set\npr = clf_lr.predict(test_x)\n\n# display evaluation metrics\ncm = confusion_matrix(test_y, pr)\nprint(\"Confusion matrix\")\nprint(pd.DataFrame(cm))\nreport_lr = precision_recall_fscore_support(list(test_y), list(pr), average='binary')\nprint(\"\\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\\n\" % \\\n (report_lr[0], report_lr[1], report_lr[2], accuracy_score(list(test_y), list(pr))))","user":"deadline","dateUpdated":"2019-08-19T16:49:47-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"... ... LogisticRegression(C=1.0, class_weight='balanced', dual=False,\n fit_intercept=True, intercept_scaling=1, max_iter=100,\n multi_class='warn', n_jobs=None, penalty='l2', random_state=None,\n solver='warn', tol=0.0001, verbose=0, warm_start=False)\n... ... Confusion matrix\n 0 1\n0 149547 90347\n1 33923 61513\n... 
\nprecision = 0.41, recall = 0.64, F1 = 0.50, accuracy = 0.63\n\n"}]},"apps":[],"jobName":"paragraph_1563226243733_1657973340","id":"20190715-173043_568146155","dateCreated":"2019-07-15T17:30:43-0400","dateStarted":"2019-08-19T16:49:47-0400","dateFinished":"2019-08-19T16:49:53-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45099"},{"text":"%md\nLogistic regression model got overall accuracy of 60%. Now let's try Random Forest:\n","user":"deadline","dateUpdated":"2019-07-16T09:28:47-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
The logistic regression model achieved an overall accuracy of about 63%. Now let’s try Random Forest:
\n
"}]},"apps":[],"jobName":"paragraph_1563283683291_-1422665416","id":"20190716-092803_658853650","dateCreated":"2019-07-16T09:28:03-0400","dateStarted":"2019-07-16T09:28:47-0400","dateFinished":"2019-07-16T09:28:47-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45100"},{"text":"%python\n# Create Random Forest classifier with 50 trees\nclf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)\nclf_rf.fit(train_x, train_y)\n\n# Evaluate on test set\npr = clf_rf.predict(test_x)\n\n# print results\ncm = confusion_matrix(test_y, pr)\nprint(\"Confusion matrix\")\nprint(pd.DataFrame(cm))\nreport_svm = precision_recall_fscore_support(list(test_y), list(pr), average='binary')\nprint(\"\\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\\n\" % \\\n (report_svm[0], report_svm[1], report_svm[2], accuracy_score(list(test_y), list(pr))))","user":"deadline","dateUpdated":"2019-07-16T14:24:29-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python","tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"... RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n max_depth=None, max_features='auto', max_leaf_nodes=None,\n min_impurity_decrease=0.0, min_impurity_split=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,\n oob_score=False, random_state=None, verbose=0,\n warm_start=False)\n... ... Confusion matrix\n 0 1\n0 197476 42418\n1 65692 29744\n... 
\nprecision = 0.41, recall = 0.31, F1 = 0.35, accuracy = 0.68\n\n"}]},"apps":[],"jobName":"paragraph_1563226362929_-274884185","id":"20190715-173242_1888587451","dateCreated":"2019-07-15T17:32:42-0400","dateStarted":"2019-07-16T14:24:29-0400","dateFinished":"2019-07-16T14:24:37-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45101"},{"text":"%md\nAs we can see, Random Forest has overall better accuracy, but lower F1 score. For our problem -- we are trying to predict delays, so the higher level of true positives (197K vs. 143K) is better!\n\nWith any supervised learnign algorithm, one typically needs to choose values for the parameters of the model. For example, we chose \"L1\" regularization for the logistic regression model, and 50 trees for the Random Forest. Such choices are based on some experimentation and [hyperparameter tuning] (http://en.wikipedia.org/wiki/Hyperparameter_optimization). We are not addressing this topic in this class, although such choices (experimentation) are important to achieve the overall best model.\n","user":"deadline","dateUpdated":"2019-07-16T09:31:43-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
As we can see, Random Forest has better overall accuracy, but a lower F1 score. For our problem, we are trying to predict delays, so the higher number of correctly classified on-time flights (197K vs. 150K) is better!
\n
With any supervised learning algorithm, one typically needs to choose values for the parameters of the model. For example, we chose “L2” regularization for the logistic regression model, and 50 trees for the Random Forest. Such choices are based on some experimentation and hyperparameter tuning. We are not addressing this topic in this class, although such choices (experimentation) are important to achieve the overall best model.
\n
"}]},"apps":[],"jobName":"paragraph_1563283740531_1260047457","id":"20190716-092900_1395771202","dateCreated":"2019-07-16T09:29:00-0400","dateStarted":"2019-07-16T09:31:43-0400","dateFinished":"2019-07-16T09:31:43-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45102"},{"text":"%md\n### One Hot Encoding (Iteration 2) ###\nIt is very common in data science to work iteratively, and improve the model with each iteration.\n\nIn this iteration, we improve our feature by converting existing variables that are categorical in nature (such as \"hour\", or \"month\") as well as categorical variables that are strings (like \"carrier\" and \"dest\"), into what is known as \"dummy variables\". Each \"dummy variable\" is a binary (0 or 1) that indicates whether a certain category value is \"on\" or \"off.\n\nscikit-learn has the OneHotEncoder functionality to make this easy:","user":"deadline","dateUpdated":"2019-08-19T13:17:16-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
One Hot Encoding (Iteration 2)
\n
It is very common in data science to work iteratively, and improve the model with each iteration.
\n
In this iteration, we improve our feature matrix by converting existing variables that are categorical in nature (such as “hour” or “month”), as well as categorical variables that are strings (like “carrier” and “dest”), into what is known as “dummy variables”. Each “dummy variable” is a binary (0 or 1) value that indicates whether a certain category value is “on” or “off”.
\n
scikit-learn has the OneHotEncoder functionality to make this easy:
\n
"}]},"apps":[],"jobName":"paragraph_1563283991732_1719020602","id":"20190716-093311_1567955121","dateCreated":"2019-07-16T09:33:11-0400","dateStarted":"2019-07-16T09:39:21-0400","dateFinished":"2019-07-16T09:39:21-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45103"},{"text":"%python\nfrom sklearn.preprocessing import OneHotEncoder\n\n# read files\ncols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday']\ncol_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int, \n 'carrier': str, 'dest': str, 'days_from_holiday': int}\ndata_2007 = read_csv_from_hdfs('airline/fm/ord_2007_1', cols, col_types)\ndata_2008 = read_csv_from_hdfs('airline/fm/ord_2008_1', cols, col_types)\n\n# Create training set and test set\ntrain_y = data_2007['delay'] >= 15\ncateg = [cols.index(x) for x in ('hour', 'month', 'day', 'dow', 'carrier', 'dest')]\nenc = OneHotEncoder(categorical_features = categ)\ndf = data_2007.drop('delay', axis=1)\ndf['carrier'] = pd.factorize(df['carrier'])[0]\ndf['dest'] = pd.factorize(df['dest'])[0]\ntrain_x = enc.fit_transform(df)\n\ntest_y = data_2008['delay'] >= 15\ndf = data_2008.drop('delay', axis=1)\ndf['carrier'] = pd.factorize(df['carrier'])[0]\ndf['dest'] = pd.factorize(df['dest'])[0]\ntest_x = enc.transform(df)\n\nprint(train_x.shape)","user":"deadline","dateUpdated":"2019-08-19T16:52:02-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"... ... ... 
(359169, 409)\n"}]},"apps":[],"jobName":"paragraph_1563226713265_-791483177","id":"20190715-173833_689674818","dateCreated":"2019-07-15T17:38:33-0400","dateStarted":"2019-08-19T16:52:02-0400","dateFinished":"2019-08-19T16:52:03-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45104"},{"text":"%md\nNow we have ~359K rows and 409 (!) features in our model. Let's re-run the Random Forest model and see if this improved our model:\n","user":"deadline","dateUpdated":"2019-07-16T09:35:52-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Now we have ~359K rows and 409 (!) features in our model. Let’s re-run the Random Forest model and see if this improved our model:
\n
"}]},"apps":[],"jobName":"paragraph_1563284082873_369845772","id":"20190716-093442_181156229","dateCreated":"2019-07-16T09:34:42-0400","dateStarted":"2019-07-16T09:35:52-0400","dateFinished":"2019-07-16T09:35:52-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45105"},{"text":"%python\n# Create Random Forest classifier with 50 trees\nclf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)\nclf_rf.fit(train_x.toarray(), train_y)\n\n# Evaluate on test set\npr = clf_rf.predict(test_x.toarray())\n\n# print results\ncm = confusion_matrix(test_y, pr)\nprint(\"Confusion matrix\")\nprint(pd.DataFrame(cm))\nreport_svm = precision_recall_fscore_support(list(test_y), list(pr), average='binary')\nprint(\"\\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\\n\" % \\\n (report_svm[0], report_svm[1], report_svm[2], accuracy_score(list(test_y), list(pr))))","user":"deadline","dateUpdated":"2019-08-19T16:52:09-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"... RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n max_depth=None, max_features='auto', max_leaf_nodes=None,\n min_impurity_decrease=0.0, min_impurity_split=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,\n oob_score=False, random_state=None, verbose=0,\n warm_start=False)\n... ... Confusion matrix\n 0 1\n0 216740 23154\n1 75662 19774\n... 
\nprecision = 0.46, recall = 0.21, F1 = 0.29, accuracy = 0.71\n\n"}]},"apps":[],"jobName":"paragraph_1563226945938_-1062424668","id":"20190715-174225_1884403148","dateCreated":"2019-07-15T17:42:25-0400","dateStarted":"2019-08-19T16:52:10-0400","dateFinished":"2019-08-19T16:52:49-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45106"},{"text":"%md\nThis clearly helped -- accuracy is higher at ~70%, and true positive are also better at 216K (vs 197K previously). (notice that the run took longer due to more features)","user":"deadline","dateUpdated":"2019-08-19T09:05:21-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
This clearly helped: accuracy is higher at ~71%, and correctly classified on-time flights are also better at 216K (vs. 197K previously). (Notice that the run took longer due to the larger number of features.)
\n
"}]},"apps":[],"jobName":"paragraph_1563284186975_-896393507","id":"20190716-093626_1846011644","dateCreated":"2019-07-16T09:36:26-0400","dateStarted":"2019-07-16T09:37:49-0400","dateFinished":"2019-07-16T09:37:49-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45107"},{"text":"%md\n### Logic Regression with PySpark ###\nStart with the cleaned prepocessed data (airline/fm/ord_2007_1 and airline/fm/ord_2008_1) and develop a logical regression model using PySaprk\nThere are comments in the code that begin with `# VIEW`. The lines after this comment can be \"uncommented\" and used to view the data as the analysis progresses.\n\n**Note:** Using Spark2 and Python 3 (The PySpark V1 Logical Regression model does not seem to work correctly)\n","user":"deadline","dateUpdated":"2019-08-20T09:11:56-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Logistic Regression with PySpark
\n
Start with the cleaned, preprocessed data (airline/fm/ord_2007_1 and airline/fm/ord_2008_1) and develop a logistic regression model using PySpark.
There are comments in the code that begin with # VIEW
. The lines after this comment can be “uncommented” and used to view the data as the analysis progresses.
\n
Note: Using Spark2 and Python 3 (The PySpark V1 Logical Regression model does not seem to work correctly)
\n
"}]},"apps":[],"jobName":"paragraph_1565976962113_1236495954","id":"20190816-133602_771523531","dateCreated":"2019-08-16T13:36:02-0400","dateStarted":"2019-08-20T09:11:56-0400","dateFinished":"2019-08-20T09:11:56-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45108"},{"text":"%spark2.pyspark\nfrom pyspark.sql import SQLContext\nfrom pyspark.sql.types import *\nfrom pyspark.sql import Row\nfrom pyspark.sql.functions import UserDefinedFunction\nfrom pyspark.sql.types import IntegerType\nfrom itertools import islice\n\n# A user defined function (UDF) that adjusts dealys (greater then 15 minutes=1.0, less=0.0)\ndef delay(d):\n if (d >= 15): return 1.0\n else: return 0.0\n\n#report Python version, Spark2 uses Python 3\nimport sys\nprint(\"Python Version:\")\nprint(sys.version)\n\n# read preprocessed data into RDD, split on commas\nrdd_2007 = sc.textFile('airline/fm/ord_2007_1').map(lambda p: p.split(\",\"))\nrdd_2008 = sc.textFile('airline/fm/ord_2008_1').map(lambda p: p.split(\",\"))\n\n# VIEW check first ten listings\n#for y in islice(rdd_2007.collect(), 10): print y\n\n# the columns that are used include delay, month, day, dow, hour, distance, carrier, dest, days_from_holiday\n# only read numberic values and import to a PySpark DataFrame\ndf_2007_raw = rdd_2007.map(lambda p: Row(delay = int(p[0]), month = int(p[1]), day=int(p[2]), dow=int(p[3]), hour=int(p[4]), distance=int(p[5]),dfh=int(p[8]) )).toDF()\ndf_2008_raw = rdd_2008.map(lambda p: Row(delay = int(p[0]), month = int(p[1]), day=int(p[2]), dow=int(p[3]), hour=int(p[4]), distance=int(p[5]),dfh=int(p[8]) )).toDF()\n\n# VIEW print dataframe and schema\n#df_2008_raw.printSchema()\n#df_2007_raw.show(5)\n\nadjust_delay=UserDefinedFunction(delay,DoubleType())\n\ndf_2007 = df_2007_raw.withColumn('delay', adjust_delay(df_2007_raw.delay))\ndf_2008 = df_2008_raw.withColumn('delay', adjust_delay(df_2008_raw.delay))\n\n# VIEW inspect the changes to delay 
column\n#df_2007.show(5)\n#df_2008.show(5)","user":"deadline","dateUpdated":"2019-08-20T14:01:52-0400","config":{"colWidth":12,"enabled":true,"results":{"0":{"graph":{"mode":"table","height":334,"optionOpen":false}}},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Python Version:\n3.7.3 (default, Mar 27 2019, 22:11:17) \n[GCC 7.3.0]\n"}]},"apps":[],"jobName":"paragraph_1565976976011_717084939","id":"20190816-133616_193341461","dateCreated":"2019-08-16T13:36:16-0400","dateStarted":"2019-08-20T14:01:52-0400","dateFinished":"2019-08-20T14:01:52-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45109"},{"text":"%md\n\n### Prepare Training Data ###\nThere are three steps:\n\n1. The VectorAssember - creates a single feature column with all features that will be used by the model\n2. The StandardScaler - standardizes a set of features to have zero mean and a standard deviation of 1. Scaled data can improve the convergence rate during the optimization process, and also prevents against features with very large variances exerting an overly large influence during model training.\n3. The StringIndexer - adds a new column \"called label\" to the DataFrame. The data in the label column comes from the delay data.\n\nUncomment the statements under the `# VIEW` comments to see progression\n","user":"deadline","dateUpdated":"2019-08-20T09:15:22-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Prepare Training Data
\n
There are three steps:
\n
\n - The VectorAssember - creates a single feature column with all features that will be used by the model
\n - The StandardScaler - standardizes a set of features to have zero mean and a standard deviation of 1. Scaled data can improve the convergence rate during the optimization process, and also prevents against features with very large variances exerting an overly large influence during model training.
\n - The StringIndexer - adds a new column “called label” to the DataFrame. The data in the label column comes from the delay data.
\n
\n
Uncomment the statements under the # VIEW
comments to see progression
\n
"}]},"apps":[],"jobName":"paragraph_1566217134213_1994604707","id":"20190819-081854_32864940","dateCreated":"2019-08-19T08:18:54-0400","dateStarted":"2019-08-20T09:15:22-0400","dateFinished":"2019-08-20T09:15:22-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45111"},{"text":"%spark2.pyspark\n\nfrom pyspark.ml.feature import VectorAssembler\nfrom pyspark.ml.feature import StandardScaler\nfrom pyspark.ml.feature import StringIndexer\n\n# These are the \"feature\" columns\ncols = ['month', 'day', 'dow', 'hour', 'distance', 'dfh']\n\n# set the assembert to use \"cols\" and output to new \"features\" column\n# Also note, converts everthing to floats\nassembler = VectorAssembler(inputCols=cols,outputCol=\"features\")\ndf_2007_va = assembler.transform(df_2007)\ndf_2008_va = assembler.transform(df_2008)\n\n# VIEW first 5 lines of the new DataFrame,\"False\" indicates no column truncation.\n#df_2007_va.show(5,False)\n#df_2008_va.show(5, False)\n\n# Normalize each feature to have unit standard deviation.\nscaler = StandardScaler(inputCol=\"features\", outputCol=\"scaledFeatures\",withStd=True, withMean=True)\ndf_2007_scaled_train = scaler.fit(df_2007_va).transform(df_2007_va)\ndf_2008_scaled_test = scaler.fit(df_2008_va).transform(df_2008_va)\n\n# VIEW first 5 lines of the new scaled DataFrame\n#df_2007_scaled_train.show(5,False)\n#df_2008_scaled_test.show(5,False)\n\n# Use StringIndexer to add \"label\" column using the \"delay\" data\nlabel_Indexer = StringIndexer(inputCol = 'delay', outputCol = 'label')\ndf_2007_scaled_train_label = label_Indexer.fit(df_2007_scaled_train).transform(df_2007_scaled_train)\ndf_2008_scaled_test_label = label_Indexer.fit(df_2008_scaled_test).transform(df_2008_scaled_test)\n\n#VIEW The final DataFRAMES ready for the model\n#df_2007_scaled_train_label.show(5)\n#df_2008_scaled_test_label.show(5)\n 
","user":"deadline","dateUpdated":"2019-08-19T17:57:37-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python"},"editorMode":"ace/mode/python"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[]},"apps":[],"jobName":"paragraph_1565983026574_1405069570","id":"20190816-151706_34127126","dateCreated":"2019-08-16T15:17:06-0400","dateStarted":"2019-08-19T17:57:37-0400","dateFinished":"2019-08-19T17:57:54-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45112"},{"text":"%md\n### Run and Test the Model\n\nWe use the LogisticRegression model from mlib. This model works with DataFrames. \n","user":"deadline","dateUpdated":"2019-08-20T12:44:30-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Run and Test the Model
\n
We use the LogisticRegression model from mlib. This model works with DataFrames.
\n
"}]},"apps":[],"jobName":"paragraph_1566219148325_2038167223","id":"20190819-085228_756900292","dateCreated":"2019-08-19T08:52:28-0400","dateStarted":"2019-08-20T12:44:30-0400","dateFinished":"2019-08-20T12:44:30-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45113"},{"text":"%spark2.pyspark\n\n# utility to print confusion matrix and metrics\ndef print_cm(tp,tn,fn,fp):\n print(\"\\nConfusion Matrix\")\n print(\"\\n Prediction\")\n print(\"\\n 0 1\")\n print(\"\\n Actual 0 %d | %d\" % (tp,fn))\n print(\"\\n -----------------\")\n print(\"\\n 1 %d | %d\" % (fp,tn))\n a=(tp+tn)/(tp+tn+fp+fn)\n p=tp/(tp+fp)\n r=tp/(tp+fn)\n f1=2*(p*r/(p+r)) \n print(\"\\n \")\n print(\"\\n\\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\\n\" % (p,r,f1,a))\n print(\"\\nCount total = %d, Correct = %d, Incorrect = %d\" % (tp+tn+fp+fn, tp+tn,fp+fn))\n print(\"\\nTrue Positives = %d, True Negatives = %d, False Negatives = %d, False Positives = %d\" % (tp,tn,fn,fp))\n return\n","user":"deadline","dateUpdated":"2019-08-20T13:54:07-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1566323448987_1105862426","id":"20190820-135048_150430299","dateCreated":"2019-08-20T13:50:48-0400","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:50555","dateFinished":"2019-08-20T13:54:07-0400","dateStarted":"2019-08-20T13:54:07-0400","results":{"code":"SUCCESS","msg":[]}},{"text":"%spark2.pyspark\nfrom pyspark.ml.classification import LogisticRegression\n\n# run the model using scaledFeatures and label column\nlr = LogisticRegression(featuresCol = 'scaledFeatures', labelCol = 'label', maxIter=25)\nlrModel = lr.fit(df_2007_scaled_train_label)\n\n# test the result\npredict_test=lrModel.transform(df_2008_scaled_test_label)\n\n# VIEW inspect resulting 
prediction\n#predict_test.show(5)\n#predict_test.select(\"label\",\"prediction\").show(10,False)\n\n# Calculate metrics from results\n# times delay occured and was predicted correctly\ntrueN = predict_test.filter(predict_test.prediction == 1.0).filter( predict_test.label == predict_test.prediction).count()\n# times no-delay occured and was predicted correctly\ntrueP = predict_test.filter(predict_test.prediction == 0.0).filter( predict_test.label == predict_test.prediction).count()\n# times delay occured and was not predicted\nfalseN = predict_test.filter(predict_test.prediction == 1.0).filter(predict_test.label != predict_test.prediction).count()\n# times there was no-delay and delay was predicted\nfalseP = predict_test.filter(predict_test.prediction == 0.0).filter(predict_test.label != predict_test.prediction).count()\n\n# print confustion matrix\nprint_cm(trueP,trueN,falseN,falseP)\n","user":"deadline","dateUpdated":"2019-08-20T13:56:13-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\nConfusion Matrix\n\n Prediction\n\n 0 1\n\n Actual 0 236673 | 3221\n\n -----------------\n\n 1 92255 | 3181\n\n \n\n\nprecision = 0.72, recall = 0.99, F1 = 0.83, accuracy = 0.72\n\n\nCount total = 335330, Correct = 239854, Incorrect = 95476\n\nTrue Positives = 236673, True Negatives = 3181, False Negatives = 3221, False Positives = 92255\n"}]},"apps":[],"jobName":"paragraph_1566146700160_-1847219667","id":"20190818-124500_1972656133","dateCreated":"2019-08-18T12:45:00-0400","dateStarted":"2019-08-20T13:54:12-0400","dateFinished":"2019-08-20T13:54:25-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45114"},{"text":"%md\nThe PySpark model provides good results, however there are a lot of false positives. 
This result means we predict no-delay when there were actualy delay. Some more work on the model is required. \n","user":"deadline","dateUpdated":"2019-08-20T14:01:04-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
The PySpark model provides good results, however there are a lot of false positives. This result means we predict no-delay when there were actualy delay. Some more work on the model is required.
\n
"}]},"apps":[],"jobName":"paragraph_1566219944265_-68956362","id":"20190819-090544_1837670598","dateCreated":"2019-08-19T09:05:44-0400","dateStarted":"2019-08-20T14:01:04-0400","dateFinished":"2019-08-20T14:01:04-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45115"},{"text":"%md\n### Adding Weather Data (Iteration 3) ###\n\n(Back to SKLearn approach)\n\nAnother common path to improve accuracy is by bringing in new types of data - enriching our dataset - and generating more features. Our idea is to layer-in weather data. We can get this data from a publicly available dataset [here] (http://www.ncdc.noaa.gov/cdo-web/datasets/). See above paragraph for downloading data.\n\nWe will look at daily temperatures (min/max), wind speed, snow conditions and precipitation in the flight origin airport (ORD). Clearly, weather conditions in the destination airport also affect delays, but for simplicity of this demo we just include weather at the origin (ORD).\n\nFirst, let's re-write our Pig original script to add these new features to our feature matrix:","user":"deadline","dateUpdated":"2019-08-20T14:01:30-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Adding Weather Data (Iteration 3)
\n
(Back to SKLearn approach)
\n
Another common path to improve accuracy is by bringing in new types of data - enriching our dataset - and generating more features. Our idea is to layer-in weather data. We can get this data from a publicly available dataset here. See above paragraph for downloading data.
\n
We will look at daily temperatures (min/max), wind speed, snow conditions and precipitation in the flight origin airport (ORD). Clearly, weather conditions in the destination airport also affect delays, but for simplicity of this demo we just include weather at the origin (ORD).
\n
First, let’s re-write our PIG original script to add these new features to our feature matrix:
\n
"}]},"apps":[],"jobName":"paragraph_1563284292028_-408117854","id":"20190716-093812_1095272170","dateCreated":"2019-07-16T09:38:12-0400","dateStarted":"2019-08-19T09:10:22-0400","dateFinished":"2019-08-19T09:10:22-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45116"},{"text":"%pig\nregister 'util.py' USING jython as util;\n\n-- Helper macro to load data and join into a feature vector per instance\nDEFINE preprocess(year_str, airport_code) returns data\n{\n -- load airline data from specified year (need to specify fields since it's not in HCat)\n airline = load 'flights/$year_str.csv' using PigStorage(',') \n as (Year: int, Month: int, DayOfMonth: int, DayOfWeek: int, DepTime: chararray, CRSDepTime:chararray, \n ArrTime, CRSArrTime, Carrier: chararray, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, \n ArrDelay, DepDelay: int, Origin: chararray, Dest: chararray, Distance: int, TaxiIn, TaxiOut, \n Cancelled: int, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, \n SecurityDelay, LateAircraftDelay);\n\n -- keep only instances where flight was not cancelled and originate at ORD\n airline_flt = filter airline by Cancelled == 0 and Origin == '$airport_code';\n\n -- Keep only fields I need\n airline2 = foreach airline_flt generate Year as year, Month as month, DayOfMonth as day, DayOfWeek as dow,\n Carrier as carrier, Origin as origin, Dest as dest, Distance as distance,\n CRSDepTime as time, DepDelay as delay, util.to_date(Year, Month, DayOfMonth) as date;\n\n -- load weather data\n weather = load 'weather/$year_str.csv' using PigStorage(',') \n as (station: chararray, date: chararray, metric, value, t1, t2, t3, time);\n\n -- keep only TMIN and TMAX weather observations from ORD\n weather_tmin = filter weather by station == 'USW00094846' and metric == 'TMIN';\n weather_tmax = filter weather by station == 'USW00094846' and metric == 'TMAX';\n weather_prcp = filter weather by station == 'USW00094846' and 
metric == 'PRCP';\n weather_snow = filter weather by station == 'USW00094846' and metric == 'SNOW';\n weather_awnd = filter weather by station == 'USW00094846' and metric == 'AWND';\n\n joined = join airline2 by date, weather_tmin by date, weather_tmax by date, weather_prcp by date, \n weather_snow by date, weather_awnd by date;\n $data = foreach joined generate delay, month, day, dow, util.get_hour(airline2::time) as tod, distance, carrier, dest,\n util.days_from_nearest_holiday(year, month, day) as hdays,\n weather_tmin::value as temp_min, weather_tmax::value as temp_max,\n weather_prcp::value as prcp, weather_snow::value as snow, weather_awnd::value as wind;\n};\n\nORD_2007 = preprocess('2007', 'ORD');\nrmf airline/fm/ord_2007_2;\nstore ORD_2007 into 'airline/fm/ord_2007_2' using PigStorage(',');\n\nORD_2008 = preprocess('2008', 'ORD');\nrmf airline/fm/ord_2008_2;\nstore ORD_2008 into 'airline/fm/ord_2008_2' using PigStorage(',');","user":"deadline","dateUpdated":"2019-07-15T19:26:21-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"pig","editOnDblClick":false},"editorMode":"ace/mode/pig"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1563231625050_-743019092","id":"20190715-190025_1460200128","dateCreated":"2019-07-15T19:00:25-0400","dateStarted":"2019-07-15T19:26:21-0400","dateFinished":"2019-07-15T19:26:21-0400","status":"ERROR","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:45117"},{"text":"%md\nWe now read this data in, convert temperatures to Fahrenheit (note the original temperature is in Celsius*10), and prepare the training and testing datasets for OHE 
modeling.\n","user":"deadline","dateUpdated":"2019-08-19T09:11:14-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":false,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
We now read this data in, convert temparatures to Fahrenheit (note original temp is in Celcius*10), and prepare the training and testing datasets for OHE modeling.
\n
Note: there are some errors with this step (investigatin), however it does produce data.
\n
"}]},"apps":[],"jobName":"paragraph_1563284622336_1513593988","id":"20190716-094342_759493363","dateCreated":"2019-07-16T09:43:42-0400","dateStarted":"2019-07-16T09:52:13-0400","dateFinished":"2019-07-16T09:52:13-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45118"},{"text":"%python\nfrom sklearn.preprocessing import OneHotEncoder\n\n# Convert Celsius to Fahrenheit\ndef fahrenheit(x): return(x*1.8 + 32.0)\n\n# read files\ncols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'carrier', 'dest', 'days_from_holiday',\n 'origin_tmin', 'origin_tmax', 'origin_prcp', 'origin_snow', 'origin_wind']\ncol_types = {'delay': int, 'month': int, 'day': int, 'dow': int, 'hour': int, 'distance': int, \n 'carrier': str, 'dest': str, 'days_from_holiday': int,\n 'origin_tmin': float, 'origin_tmax': float, 'origin_prcp': float, 'origin_snow': float, 'origin_wind': float}\n\ndata_2007 = read_csv_from_hdfs('airline/fm/ord_2007_2', cols, col_types)\ndata_2008 = read_csv_from_hdfs('airline/fm/ord_2008_2', cols, col_types)\n\ndata_2007['origin_tmin'] = data_2007['origin_tmin'].apply(lambda x: fahrenheit(x/10.0))\ndata_2007['origin_tmax'] = data_2007['origin_tmax'].apply(lambda x: fahrenheit(x/10.0))\ndata_2008['origin_tmin'] = data_2008['origin_tmin'].apply(lambda x: fahrenheit(x/10.0))\ndata_2008['origin_tmax'] = data_2008['origin_tmax'].apply(lambda x: fahrenheit(x/10.0))\n\n# Create training set and test set\ntrain_y = data_2007['delay'] >= 15\ncateg = [cols.index(x) for x in ('hour', 'month', 'day', 'dow', 'carrier', 'dest')]\nenc = OneHotEncoder(categorical_features = categ)\ndf = data_2007.drop('delay', axis=1)\ndf['carrier'] = pd.factorize(df['carrier'])[0]\ndf['dest'] = pd.factorize(df['dest'])[0]\ntrain_x = enc.fit_transform(df)\n\ntest_y = data_2008['delay'] >= 15\ndf = data_2008.drop('delay', axis=1)\ndf['carrier'] = pd.factorize(df['carrier'])[0]\ndf['dest'] = pd.factorize(df['dest'])[0]\ntest_x = 
enc.transform(df)\n\nprint(train_x.shape)","user":"deadline","dateUpdated":"2019-08-19T16:49:07-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"... ... ... ... ... ... ... (359169, 414)\n"}]},"apps":[],"jobName":"paragraph_1563233120739_-1062047094","id":"20190715-192520_1378207123","dateCreated":"2019-07-15T19:25:20-0400","dateStarted":"2019-08-19T16:49:08-0400","dateFinished":"2019-08-19T16:49:21-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45119"},{"text":"%md\nNext, rerun the training set with new features (even though there was the error above). Update the number of trees to 100.\n","user":"deadline","dateUpdated":"2019-07-16T09:53:00-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Next, rerun the training set with new features (even though there was the error above). Update the number of trees to 100.
\n
"}]},"apps":[],"jobName":"paragraph_1563284667476_195105480","id":"20190716-094427_1775537670","dateCreated":"2019-07-16T09:44:27-0400","dateStarted":"2019-07-16T09:53:00-0400","dateFinished":"2019-07-16T09:53:00-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45120"},{"text":"%python\n# Create Random Forest classifier with 100 trees\nclf_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)\nclf_rf.fit(train_x.toarray(), train_y)\n\n# Evaluate on test set\npr = clf_rf.predict(test_x.toarray())\n\n# print results\ncm = confusion_matrix(test_y, pr)\nprint(\"Confusion matrix\")\nprint(pd.DataFrame(cm))\nreport_rf = precision_recall_fscore_support(list(test_y), list(pr), average='binary')\nprint(\"precision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\\n\" % \\\n (report_rf[0], report_rf[1], report_rf[2], accuracy_score(list(test_y), list(pr))))","user":"deadline","dateUpdated":"2019-08-15T18:11:39-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"python","editOnDblClick":false},"editorMode":"ace/mode/python"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"... RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n max_depth=None, max_features='auto', max_leaf_nodes=None,\n min_impurity_decrease=0.0, min_impurity_split=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,\n oob_score=False, random_state=None, verbose=0,\n warm_start=False)\n... ... Confusion matrix\n 0 1\n0 223912 15982\n1 70123 25313\n... 
precision = 0.61, recall = 0.27, F1 = 0.37, accuracy = 0.74\n\n"}]},"apps":[],"jobName":"paragraph_1563239757286_570421863","id":"20190715-211557_1640139280","dateCreated":"2019-07-15T21:15:57-0400","dateStarted":"2019-08-15T18:11:39-0400","dateFinished":"2019-08-15T18:12:53-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45121"},{"text":"%md\nThe results better with the weather data. Our first model gave 63% accuracy, not the modle is up to 74%\n","user":"deadline","dateUpdated":"2019-08-19T16:51:26-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
The results better with the weather data. Our first model gave 63% accuracy, not the modle is up to 74%
\n
"}]},"apps":[],"jobName":"paragraph_1563240463575_-1807543644","id":"20190715-212743_1416883405","dateCreated":"2019-07-15T21:27:43-0400","dateStarted":"2019-08-19T16:51:26-0400","dateFinished":"2019-08-19T16:51:26-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45122"},{"text":"%md\n### Conclusion ###\nThis concludes the Data Science Notebook. Updates will include fixing the issues found above and additional tools (including pyspark)\n\n","user":"deadline","dateUpdated":"2019-07-16T09:50:32-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","editorHide":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
\n
Conclusion
\n
This concludes the Data Science Notebook. Updates will include fixing the issues found above and additional tools (including pyspark)
\n
"}]},"apps":[],"jobName":"paragraph_1563284899363_727550341","id":"20190716-094819_13555703","dateCreated":"2019-07-16T09:48:19-0400","dateStarted":"2019-07-16T09:50:25-0400","dateFinished":"2019-07-16T09:50:25-0400","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:45123"},{"text":"%md\n","user":"deadline","dateUpdated":"2019-07-16T09:49:53-0400","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1563284993339_1181492746","id":"20190716-094953_1222092705","dateCreated":"2019-07-16T09:49:53-0400","status":"READY","progressUpdateIntervalMs":500,"$$hashKey":"object:45124"}],"name":"Scalable Analytics Airline Delays","id":"2EF9YGA8N","angularObjects":{"2EFNH2N1T:shared_process":[],"2C4U48MY3_spark2:shared_process":[],"2EF3MC3PD:shared_process":[],"2CUSV2RC8:shared_process":[],"2CXH6CR6P:shared_process":[],"2CW1CJHSN:shared_process":[],"2CXFTT4PT:shared_process":[],"2C8A4SZ9T_livy2:shared_process":[],"2CXAF2UNW:shared_process":[],"2CX3VRR2H:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}