Assignment – 4
Experiment Design & Result Analysis
Big Data Predictive Analytics to Overcome Flight Delays
Masters in Applied Information Technology (NMIT) Victoria University, Melbourne, Victoria
Table of Contents
1a. Identification and selection of available data sources
1b. Collection of Raw Data
2a. Data pre-processing
2b. Feature Selection or Dimension Reduction
2c. Experiment Design
2d. Experiment Implementation Records
1a. Identification and selection of available data sources
To conduct the experimental analysis, the available data sources were identified and assessed. The following table briefly describes them.
Data Source Name | Source Organization | Data Description | Data File Format | URL | Charge/Fee | Target Data Source
Flight Delay Data 1 | Department of Transportation, Washington, United States | Commercial Airline Flight Delay Records in 2015 | .csv | https://www.kaggle.com/usdot/flight-delays/data | Free | Yes
Flight Delay Data 2 | Bureau of Transportation Statistics | Commercial Airline (US) Flight Delay Records in 2017 | .csv | https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time | Free | Yes
Flight Delay Data 3 | Open Flights Airport Database | Select the delay criteria or reasons | .csv, .txt | https://openflights.org/data.html | $50 | Yes
Flight Delay Data 4 | Data World Organization | Departure Delay Record | .csv | https://data.world/data-society/airlines-delay/workspace/file?filename=airlinedelaycauses%2FDelayedFlights.csv | Free | Yes
Flight Delay Data 5 | Bureau of Infrastructure, Transport and Regional Economics | International Airline Activity Link | .csv | https://data.gov.au/dataset/international-airline-activity | Free | Yes
1b. Collection of Raw Data
The relevant data for the experiment was downloaded from the web and saved in a folder called ‘Raw Dataset’. The files are in comma-separated values (.csv) format, which can be opened in Microsoft Excel. The details of these records are summarised in the following table.
Data Source Name | Date of Collection | Saved File Location | Saved File Name | Saved File Format | No. of Data Records
Flight Data 1 | 19/10/2017 | C:\Users\LIZA\Desktop\Introduction to Research\Raw Dataset | AirlineDelayCauses.csv | csv (Excel) | 1048576
Flight Data 2 | 19/10/2017 | C:\Users\LIZA\Desktop\Introduction to Research\Raw Dataset | Delya_T_Ontime.csv | csv (Excel) | 450018
Flight Data 3 | 21/10/2017 | C:\Users\LIZA\Desktop\Introduction to Research\Raw Dataset | Flights.csv | csv (Excel) | 1048500
Flight Data 4 | 22/10/2017 | C:\Users\LIZA\Desktop\Introduction to Research\Raw Dataset | PredictingAirlineDelays.csv | csv (Excel) | 560002
Flight Data 5 | 22/10/2017 | C:\Users\LIZA\Desktop\Introduction to Research\Raw Dataset | InternationalAirlineActivity.csv | csv (Excel) | 402050
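The record count of 1048576 listed for Flight Data 1 equals the maximum number of rows in an Excel worksheet, so it may reflect what Excel could display rather than the full contents of the file. A minimal sketch of counting records directly from the CSV text is given below. The function names and the sample columns are hypothetical and are used only for illustration.

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit in Excel 2007 and later

def count_records(lines, has_header=True):
    """Count data records in CSV text read from any iterable of lines."""
    reader = csv.reader(lines)
    total = sum(1 for row in reader if row)  # blank lines are skipped
    return total - 1 if has_header and total > 0 else total

def possibly_truncated(row_count):
    """True when a row count reaches the Excel worksheet limit."""
    return row_count >= EXCEL_MAX_ROWS

# Small in-memory stand-in for a real file (column names are assumed)
sample = io.StringIO("Year,Month,ArrDelay\n2008,1,14\n2008,1,-3\n")
print(count_records(sample))
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="")`.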
2a. Data pre-processing
A large amount of raw data is available for the research experiment, but not all of it can be used directly. The collected data therefore needs to be pre-processed before the experiment is conducted.
2b. Feature Selection or Dimension Reduction
The collected data files contain many features, not all of which are relevant to the experiment. A few fields have therefore been removed from the existing records, and the updated records have been saved to new files. Reducing the dimensionality of the collected data in this way simplifies data processing during the experiment analysis. The resulting data sets are recorded in the following sample table.
Date | Data Source Name | Purpose of Pre-processing | Pre-processing Method | No. of Original Data Records | No. of Result Data Records | No. of Original Features | No. of Result Features | New Data File Name
23/10/17 | Flight Data 1 | Feature selection | Manual data processing | 1048576 | 2000 | 46 | 20 | AirlineDelayCauses_Updated.csv
23/10/17 | Flight Data 2 | Clean the missing data | Pre-fill the missing values | 450018 | 4000 | 32 | 15 | Delya_T_Ontime_Updated.csv
23/10/17 | Flight Data 3 | Discard data that is more than 5 years old | Manual data processing | 1048500 | 2000 | 30 | 15 | Flights_Updated.csv
23/10/17 | Flight Data 4 | Report-making followed by better analysis | Automated data processing using Excel features | 560002 | 2000 | 35 | 15 | PredictingAirlineDelays_Updated.csv
23/10/17 | Flight Data 5 | Feature selection | Manual data processing | 402050 | 3000 | 28 | 15 | InternationalAirlineActivity_Updated.csv
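The pre-processing steps in the table (feature selection, pre-filling missing values, and reducing the number of records) could also be scripted instead of being done by hand. The sketch below is one possible approach, not the procedure actually used. The column names, the fill value of "0", and the choice to keep the first records in file order are all assumptions made for illustration.

```python
import csv
import io

def preprocess(lines, keep_columns, fill_value="0", max_records=None):
    """Keep selected columns, pre-fill empty cells, and cap the record count."""
    reader = csv.DictReader(lines)
    result = []
    for row in reader:
        if max_records is not None and len(result) >= max_records:
            break
        cleaned = {}
        for name in keep_columns:
            value = (row.get(name) or "").strip()
            # An empty or missing cell is replaced by the assumed fill value
            cleaned[name] = value if value else fill_value
        result.append(cleaned)
    return result

# Hypothetical raw records with five features and two missing values
raw = io.StringIO(
    "Year,Carrier,DepDelay,ArrDelay,TailNum\n"
    "2017,AA,5,,N123\n"
    "2017,DL,,12,N456\n"
    "2017,UA,0,-4,N789\n"
)
rows = preprocess(raw, keep_columns=["Carrier", "DepDelay", "ArrDelay"], max_records=2)
for r in rows:
    print(r)
```

Here five original features are reduced to three and three records to two, mirroring on a small scale the reductions recorded in the table.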
2c. Experiment Design
Date | Experiment | Purpose of Experiment | Description of Procedure | Input Data | Expected Output | Result File Format
24/10/2017 | Experiment 1 | Evaluate Method 1 | The Ground Delay Program (GDP) procedure | Historical data and weather information using MapReduce | A join key and table tag | Output1.csv
24/10/2017 | Experiment 2 | Evaluate Method 2 | Regression prediction mechanism | Database input to the Naive Bayes algorithm | Result for the prediction of departure delays | Output2.csv
24/10/2017 | Experiment 3 | Evaluate Method 3 | Flight delay propagation and delay probability distribution | The itineraries of passengers who have missed a flight; Reschedule_Pax algorithm | New passenger itineraries | Output3.txt
24/10/2017 | Experiment 4 | Evaluate Method 4 | Heuristic algorithm: schedule minimization for passenger trip delay | The flight schedule itineraries; regression-based algorithm | The updated flight schedule | Output4.txt
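Experiment 1 combines historical flight data with weather information and lists a join key and a table tag as its expected output. This matches the pattern usually called a reduce-side join in MapReduce. The sketch below imitates the map, shuffle and reduce phases in plain Python rather than on a Hadoop cluster. The field names, the tags "flight" and "weather", and the sample records are hypothetical.

```python
from collections import defaultdict

def map_phase(records, table_tag, key_fields):
    """Emit (join_key, (table_tag, record)) pairs for one input table."""
    for record in records:
        join_key = tuple(record[f] for f in key_fields)
        yield join_key, (table_tag, record)

def shuffle(pairs):
    """Group mapped values by join key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Merge each flight record with every weather record sharing its key."""
    for key, tagged in groups.items():
        flights = [rec for tag, rec in tagged if tag == "flight"]
        weather = [rec for tag, rec in tagged if tag == "weather"]
        for f in flights:
            for w in weather:
                yield {**f, **w}

# Made-up sample records for illustration only
flights = [
    {"airport": "JFK", "date": "2015-01-01", "dep_delay": 35},
    {"airport": "LAX", "date": "2015-01-01", "dep_delay": 0},
]
weather = [{"airport": "JFK", "date": "2015-01-01", "condition": "snow"}]

pairs = list(map_phase(flights, "flight", ("airport", "date")))
pairs += list(map_phase(weather, "weather", ("airport", "date")))
joined = list(reduce_phase(shuffle(pairs)))
print(joined)
```

The table tag lets the reducer tell which source each record came from. Flights without a matching weather record are dropped here, which corresponds to an inner join.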
2d. Experiment Implementation Records
A simple delay model can be built with the help of an empirical cumulative distribution model. Kernel density estimation is available as a basic function of the programming language that will be used. A MapReduce algorithm will split the input data set into independent chunks, which will be processed by the map tasks in a completely parallel manner. The linear regression model of the average daily delay analyses the effects of arrival delay, airport capacity, traffic congestion and weather conditions.
After conducting the experiments described above, certain results are expected. They are analysed under the following headings:
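To make two of these models concrete, the sketch below computes an empirical cumulative distribution function of delays and fits an ordinary least squares line. It is a simplified illustration: the regression uses a single predictor, whereas the model described above involves several, and all numbers are made up for the example.

```python
from bisect import bisect_right

def ecdf(samples):
    """Return a function F where F(x) is the share of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical delays in minutes
delays = [0, 5, 5, 12, 30, 45, 60, 90]
F = ecdf(delays)
print(F(15))  # share of flights delayed by at most 15 minutes

# Hypothetical predictor and response values
arrival_delay = [0, 10, 20, 30]
avg_daily_delay = [2, 7, 12, 17]
intercept, slope = fit_line(arrival_delay, avg_daily_delay)
print(intercept, slope)
```

Extending the regression to airport capacity, traffic congestion and weather conditions would require a multiple regression, which is outside the scope of this sketch.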
4.1 Data Analysis
4.1.1 Data Pre-processing and Transformation
4.1.2 Target Data Creation
4.1.3 Model descriptions and variables
4.1.3.1 The training dataset
4.1.3.2 Decision Trees
4.1.3.3 Random Forest Model
4.2 Delay Prediction
4.2.1 Classification Technique
4.2.2 Hadoop MapReduce
4.3 Analysis of Variance (ANOVA) on the average daily arrival delay
4.3.1 ANOVA test on seasonal pattern
4.4 BRYAGH: Basic Reduction Yare Approach for Flights
4.4.1 BRYAGH Algorithm
4.4.2 Pseudo code
4.5 Conclusion