In my last post we looked at how to load data into Microsoft Azure Machine Learning using the browser-based ML Studio. We also started to look at the data for predicting delayed flights and identified some problems with it. This post is all about getting that data into the right shape so that the predictive algorithms in MAML have the best chance of giving us the right answer.
Our approach is threefold:
- To discard the data we don’t need: columns that aren’t relevant or are derived from other data, and rows where data is missing for the columns we keep (features, in machine learning speak).
- To tag the features correctly as numbers or strings, and as categorical or not. Categorical in this context means that the value puts a row in a group rather than being continuous. AirportID is categorical because it puts a row into a group of rows for the same airport, and AirportID 1 has nothing to do with ID 3 or 4. Temperature, by contrast, is a continuous variable whose values do represent points on a line.
- To join the flight delay dataset to the weather dataset on the airport and the date/time. In my last post I mentioned that we could join the weather data in twice, once to the departure airport and once to the arrival airport, and indeed the sample experiment on flight delay prediction does exactly this. I think a simpler approach is to model the arrival delay on the fact that some flights have a delayed departure time, which may or may not be influenced by the weather at the departure airport.
Let’s get started.
Open ML Studio, create a new experiment, give it a suitable name and drag the flight delays and weather datasets onto the design surface so it looks like this:
Clean the data
As before, we can right-click on the circle at the bottom of the dataset and select visualize data to see what we are working with. For example, here’s the weather data.
What is odd here is that the data is not properly typed: some of the numeric data is in columns marked as string, such as the temperature columns in the weather dataset. I spent ages trying to work out how to fix this, and the answer turns out to be the Convert to Dataset module, which does this automatically. So our first step is to drag two of them onto the design surface and connect one to each of our datasets.
If we run our model (Run is at the bottom of the screen) we can then visualize the output of the Convert to Dataset steps, and our data is now correctly identified as being numeric etc.
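To make the idea of re-typing a string column concrete, here is a minimal Python sketch of the same kind of type inference done outside ML Studio. It is not how Convert to Dataset is implemented; `infer_column` is a hypothetical helper and the sample values are made up for illustration.

```python
def infer_column(values):
    """Convert a column of strings to floats when every non-empty value is numeric.

    Empty strings become None (a missing value). If any value fails to parse,
    the column is returned unchanged as a list of strings.
    """
    converted = []
    for value in values:
        text = value.strip()
        if text == "":
            converted.append(None)
            continue
        try:
            converted.append(float(text))
        except ValueError:
            return list(values)  # not a numeric column, so leave it as strings
    return converted


# Hypothetical sample columns, not taken from the real weather dataset.
temperature = ["21.5", "19", "", "23.1"]
sky_condition = ["CLR", "OVC", "FEW", "CLR"]

temperature_typed = infer_column(temperature)  # parsed as numbers
sky_typed = infer_column(sky_condition)        # stays as strings
```

The point is only that a column is treated as numeric when all of its values parse as numbers, which is the effect we see in the visualized output.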
The next step is to get rid of any unwanted columns, which is simply a case of using the Project Columns module (to find it, use the search at the top of the modules list). You can either start with a full list of columns and remove what you don’t need, or start with an empty list and add in what you do need. So let’s drag it onto the design surface and then drag a line from the Flight Delays data to it. It’ll have a red X against it as it’s not configured yet, and we can fix this from the select columns option on the task pane.
Here I have selected all columns and then excluded Year, Cancelled, ArrDelay, DepDelay15 and CRSDeptime. At this point we can check that we get what we wanted by clicking the Run button at the bottom of the screen.
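For comparison, the two ways of choosing columns (start with everything and exclude, or start with nothing and include) can be sketched in a few lines of Python. `project_columns` is a hypothetical helper, and the row below is invented sample data that only borrows a few column names from this post.

```python
def project_columns(rows, exclude=(), include=None):
    """Return copies of the rows that contain only the wanted columns.

    Pass `include` to start from an empty list and name what to keep,
    or pass `exclude` to start from all columns and name what to drop.
    """
    projected = []
    for row in rows:
        if include is not None:
            projected.append({name: row[name] for name in include})
        else:
            projected.append({name: value for name, value in row.items()
                              if name not in exclude})
    return projected


# Invented sample row for illustration only.
flights = [
    {"Year": 2013, "Month": 4, "DayofMonth": 19, "Carrier": "DL",
     "DepDelay": -3, "Cancelled": 0},
]

trimmed = project_columns(flights, exclude=("Year", "Cancelled"))
picked = project_columns(flights, include=["Month", "Carrier"])
```

Both calls leave the original rows untouched, which mirrors the way each module in an experiment produces a new output rather than changing its input.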
Note: It’s only when we run something in ML Studio that we are charged for computation time; the rest of the time we are just charged for the storage we are using (for our own datasets and experiments).
As before, at each stage we can visualize the data that’s produced by right-clicking on its output node..
Here we can see that one column, Depdelay, has missing values, so the next thing we need to do is get rid of the affected rows. We can use the Missing Values Scrubber module for this, so search for it, drag it onto the design surface and drag a connector from the output of the Project Columns module to it. We then need to set its properties to say how to deal with the missing values. As we have such a lot of clean data we can simply drop any rows with missing values by setting the top option to remove entire row.
We can now run the experiment again to check we have no more missing values.
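The "remove entire row" behaviour amounts to keeping only the rows in which every column has a value. A minimal Python sketch, assuming missing values show up as `None` or empty strings (an assumption about the data, not something ML Studio documents), with made-up rows:

```python
def remove_rows_with_missing(rows):
    """Keep only the rows in which no column is None or an empty string."""
    return [
        row for row in rows
        if all(value is not None and value != "" for value in row.values())
    ]


# Made-up rows; the second one is missing its departure delay.
flights = [
    {"Month": 4, "Carrier": "DL", "DepDelay": -3},
    {"Month": 4, "Carrier": "AA", "DepDelay": None},
    {"Month": 5, "Carrier": "UA", "DepDelay": 12},
]

clean = remove_rows_with_missing(flights)
```

Dropping rows like this is only reasonable because so much clean data is left over; with a smaller dataset, filling in a replacement value might be the better choice.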
Now we need to do some of this again for the weather dataset. We add another Project Columns module to select the columns we need, and this time I am starting with an empty list and specifying which columns to add..
and the Missing Values Scrubber module again, set to remove the entire row…
Tag the features
Now we need to change the metadata for some of the columns to ensure ML Studio handles them properly. Remember that some of the numbers in our data are codes rather than continuous values, for example the airport codes and the airline code. We need to tell ML Studio that these are categorical by using the Metadata Editor module. To do this we are going to cheat, which also shows off another feature of ML Studio, by simply copying that module from another experiment. Open another browser window, go to the ML Studio home page and navigate to the Flight Delay prediction sample. Find the Metadata Editor module in there and copy it to the clipboard, then go back to the browser window with our experiment and paste it in. You should see that this module is set to make Carrier, OriginalAirportID and DepAirPortID categorical...
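To see why the categorical tag matters, here is a small Python sketch of one common way to treat a categorical column, one-hot encoding, in which each distinct code becomes its own 0/1 indicator. This is an illustration of the concept only; I am not claiming it is what ML Studio does internally, and the airport IDs are made up.

```python
def one_hot(values):
    """Encode categorical values as 0/1 indicator lists, one slot per distinct value."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded


# Made-up airport IDs: the numbers are labels, not quantities.
airport_ids = [10397, 12478, 10397, 13930]

categories, encoded = one_hot(airport_ids)
```

Rows with the same airport ID get identical indicator lists, and no airport is treated as "bigger" than another, which is the group-membership meaning we want for these columns.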
Join the datasets
Now that we have two sets of clean data we need to join them. They both have an airport ID, month and day, and the flight delay dataset has an arrival time to the nearest minute. However, the weather data is taken at 56 minutes past the hour every hour and is in local time with a separate time zone column. So what we need to do is round both the flight arrival time and the weather reading time to a whole hour, as follows:
For the flight delay arrival time:
1. Divide the arrival time by 100
2. Round down the arrival time to the nearest hour
For the weather data time:
3. Divide the weather reading time by 100 to give the local time in hours
4. Round up to the nearest hour
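The arithmetic in these four steps can be sketched in Python as follows. The function names are my own, and I am assuming the times are stored as whole numbers in hhmm form (so 1435 means 14:35), which is what dividing by 100 implies.

```python
import math


def flight_hour(arrival_time):
    """Steps 1 and 2: divide the hhmm arrival time by 100, then round down."""
    return math.floor(arrival_time / 100)


def weather_hour(reading_time):
    """Steps 3 and 4: divide the hhmm reading time by 100, then round up."""
    return math.ceil(reading_time / 100)


print(flight_hour(1435))   # 1435 -> 14.35 -> 14
print(weather_hour(1356))  # 1356 -> 13.56 -> 14
print(weather_hour(2356))  # 2356 -> 23.56 -> 24
```

So a flight arriving at 14:35 is matched to the weather reading taken at 13:56. Note that a reading taken at 23:56 rounds up to hour 24, which no flight hour (0 to 23) will match unless it is handled separately.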
So how do we do that in ML Studio? The answer is one step at a time, making repeated use of the Apply Math Operation module. Help is pretty much non-existent for most of these modules at the time of writing, so experimentation is the name of the game, and I hope I have done that for you. We’ll place four copies of the Apply Math Operation module on the design surface, one for each step above (so two linked to the weather dataset and two to the flight delay dataset)..
Notice the comments I have added to each module (right-click and select add comment). Here are the settings for each step..
Step 1
Note the output mode of inplace, which means that the value is overwritten and we get all the other columns in the output as well. Make sure this is set for each of the four steps.
Step 2
Step 3
Step 4
Now we can use the Join module (again, just search for Join and drag it onto the design surface) to connect our datasets together. Not surprisingly this module has two inputs and one output, and we’ll see several modules with multiple inputs and outputs in future. Connect the last module in each of our dataset chains to the Join module and set the properties for the join as shown..
So on the left (the flight data) we have Month, DayofMonth, CRSArrTime, DestAirportID and on the right (the weather data) we have Month, Day, Time, AirportID.
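The same join can be sketched in plain Python to show how the two key lists pair up. I am assuming an inner join here (only rows with a match on both sides survive); `inner_join` is a hypothetical helper, the rows are invented, and Temperature is just a stand-in for whichever weather columns were kept.

```python
def inner_join(left_rows, right_rows, left_keys, right_keys):
    """Join two lists of dict rows where the key columns match position by position."""
    # Index the right-hand rows by their key values for quick lookup.
    index = {}
    for right in right_rows:
        key = tuple(right[name] for name in right_keys)
        index.setdefault(key, []).append(right)

    joined = []
    for left in left_rows:
        key = tuple(left[name] for name in left_keys)
        for right in index.get(key, []):
            merged = dict(left)
            # Add the right-hand columns, leaving out its copy of the keys.
            merged.update({name: value for name, value in right.items()
                           if name not in right_keys})
            joined.append(merged)
    return joined


# Invented rows; the time columns are assumed to be rounded to hours already.
flights = [{"Month": 4, "DayofMonth": 19, "CRSArrTime": 14,
            "DestAirportID": 10397, "DepDelay": -3}]
weather = [
    {"Month": 4, "Day": 19, "Time": 14, "AirportID": 10397, "Temperature": 21.5},
    {"Month": 4, "Day": 19, "Time": 15, "AirportID": 10397, "Temperature": 22.0},
]

joined = inner_join(
    flights, weather,
    left_keys=["Month", "DayofMonth", "CRSArrTime", "DestAirportID"],
    right_keys=["Month", "Day", "Time", "AirportID"],
)
```

The order of the key lists matters: the first name on the left is compared with the first name on the right, and so on, just as in the module's properties.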
I have to be honest: it took a while to get here, and initially I got zero rows back. Even now it’s not quite perfect, as I get slightly more rows than I started with, which I have tracked down to the odd hour in the weather data having two readings. Finding that kind of data problem is beyond what you can do in ML Studio in the preview, so in my next post I’ll show you your options for examining this data outside of ML Studio.
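As a small taste of that kind of check, here is a hedged Python sketch that counts how many weather readings share the same join key. Any key that appears more than once will produce an extra output row for every flight it matches. `duplicate_keys` is a hypothetical helper and the readings are made up.

```python
from collections import Counter


def duplicate_keys(rows, keys):
    """Return each key combination that appears more than once, with its count."""
    counts = Counter(tuple(row[name] for name in keys) for row in rows)
    return {key: count for key, count in counts.items() if count > 1}


# Made-up weather readings: two of them fall in the same rounded hour.
weather = [
    {"AirportID": 10397, "Month": 4, "Day": 19, "Time": 14, "Temperature": 21.5},
    {"AirportID": 10397, "Month": 4, "Day": 19, "Time": 14, "Temperature": 21.9},
    {"AirportID": 10397, "Month": 4, "Day": 19, "Time": 15, "Temperature": 22.0},
]

repeats = duplicate_keys(weather, ["AirportID", "Month", "Day", "Time"])
print(repeats)  # {(10397, 4, 19, 14): 2}
```

An empty result would mean each airport has at most one reading per hour, in which case the join could not add rows.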