*Posted: 2 November 2018*

- TLDR
- The Challenge
- Understanding the Data
- Synthesized Understanding of Data
- Exploring the Data in Pandas
- Feature Engineering
- Building the Model
- Deploying in a production environment?
- Conclusion

I normalized the data by subtracting the mean and dividing by the standard
deviation of each feature. I created 5 new features: `duration`

, `speed`

,
`tip`

, `day_of_week`

, `weekend`

. I didn’t comment in the code extensively. I
focused more on the process and this post. The idea of deploying this in a
production environment is below. You can see the header in the contents page.
The classifier has an overall accuracy of 83.8%, 1% higher than the baseline,
and the regressor has an RMSE of 2.316, with 93% of predictions within 2 dollars
of the actual tip amount. The weakness of this model is that it tends to simply
predict all trips will have a tip, as the data is quite skewed. Having more
features that are more representative would greatly improve the model. If there
were constraints, I would lean towards optimizing the classifier as the
regressor tends to simply return the mean. More details below.

The challenge is to implement a machine learning model to predict for a given taxi trip, if a tip will be paid and for a trip with a tip, what is the expected tip amount. The data that we will be using is the 2017 Green Taxi Trip Data dataset.

As with all problems related to data, we first have to understand the data. I don’t have to use deep learning, but I would first like to try a classic fully connected neural network for this problem and see how it performs. Other methods are possible too. The first thing that I think of would be normalizing the features. Continuous variables have to be normalized, and categorical variables have to be one hot encoded. TensorFlow has a really good guide on feature columns. I will be using TensorFlow for the entire process.

`VendorID`

- A code indicating the LPEP provider that provided the record.
- Categorical variable (2 classes, Creative or VeriFone)
- It’s probably worth finding out more information about the signifiance of this variable.
- I’m guessing that this does not have that much significance - could be worth pruning.

`lpep_pickup_datetime`

- The date and time when the meter was engaged.
- Multiple categorical variables? Day and Night could be a set of categorical variables. Peak hours, lunch hours, clubbing hours could be a set of categorical variables.
- Definitely not a continuous variable as 0 is mapped to 0000H, but it doesn’t make sense to map 0.9999 to 2359H as it is technically really close to 0000H.

`lpep_dropoff_datetime`

- The date and time when the meter was disengaged.
- Same as
`lpep_pickup_datetime`

.

`store_and_fwd_flag`

- This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server
- Categorical variable (2 classes, Yes or No)
- It’s probably worth finding out more information about the signifiance of this variable.
- I’m guessing that this does not have that much significance - could be worth pruning.
- A check on the histogram shows that over 99.9% are ‘N’. Definitely prune.

`RatecodeID`

- The final rate code in effect at the end of the trip.
- Categorical variable (6 classes, Standard rate, JFK, Newark, Nassau or Westchester, Negotiated fare, Group ride).
- A check on the histogram shows over 99.9% are standard rate. It’s probably only going to help in the edge cases like negotiated fares. Definitely prune. Perhaps explore it if we are using a wide and deep model to memorize these exceptions. I read the paper briefly in the past but will have to explore it deeper.

`PULocationID`

- This is an interesting one. The attached PDF shows that there is a
`pickup_longtitude`

and`pickup_latitude`

field, but looking at the data, there clearly isn’t a longtitude or latitude. They seem to have converted it into zones. That’s great, as we don’t have to do our own preprocessing to convert the latitudes and longtitudes into zones. There probably is a mapping dataset out there to account for the slight tilt of Manhattan, which can map the latlongs into proper grids. - Categorical variable (? classes)
- A quick look at the data shows values of 42 and 255. It could potentially range from 0 - 255. It’s important to note this as we could potentially have a very large input vector. Perhaps the TF Hashed Column function?
- Here, we can clearly see that there are more pickups in some area than others.

- This is an interesting one. The attached PDF shows that there is a
`DOLocationID`

- Same as
`PULocationID`

- Interestingly, 40 - 50 is a zone where both pickup and dropoff is high. I suspect this is the business area…?
- Turns out it isn’t! A quick search on Google shows exactly what’s going. New York is zoned accordingly and you can see Manhattan here. Why do these taxis pick up and drop so many people at Zone 40 - 50?
- The highest PU and DO happens at 74. Not sure what goes on in Zone 74 but that’s East Harlem. Probably worth checking that out.
- At this point in time, I decided to read more about why this is the case, and found out that the Green Taxi actually means Boro Taxi. They can only pick people up from East 96th and West 110th and above! Everything else below is for “Yellow Taxis”. Also, these cars can drop passengers off anywhere, but can’t pick people from the yellow zones. A quick look at the histogram shows that some of these drivers actually did pick people from yellow zones. I mean if you drive someone from East 96th and West 110th to FiDi, you will probably want to pick someone to go back up north isn’t it? This also means that almost all pick-ups in yellow zones won’t be tracked. The Green Taxi drivers would probably negotiate a fare and not run the meter. This is really really interesting.

- Same as
`passenger_count`

- The number of passengers in the vehicle. This is a driver-entered value.
- Technically discrete but will model as continuous variable.
- Potentially ignore? Driver-entered values may not be that accurate.

`trip_distance`

- The elapsed trip distance in miles reported by the taximeter.
- Continuous variable.

`fare_amount`

- The time-and-distance fare calculated by the meter.
- Continuous variable.

`extra`

- Miscellaneous extras and surchages. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
- Continuous variable.
- Given that there is a rush hour charge and overnight charge, perhaps we don’t have to discretize pickup and dropoff time as it’s implicitly factored into this.
- It includes this currently - does the new data have more values? Given that the data sheet is outdated.
- Yup - it clearly does have some strange values that we have to clean up, or ignore as noise.

`mta_tax`

- $0.50 MTA tax that is automatically triggered based on the metered rate in use.
- It’s not true that it’s strictly $0.50. It’s surprising that there are values like $0.81, $0.60, and -$0.50. The negative one surprises me the most.

`tip_amount`

- This field is automatically populated for credit card tips. Cash tips are not included.
- Continuous variable.
- Does this mean that all cash transactions have a 0
`tip_amount`

? Need to check this. - In the histogram below, we can see that those with 0 dollar tips are mostly cash transactions. Those with tips are exclusively credit card transactions. We can almost safely say that if it’s a cash transaction, there will not be a tip. This is a very strong feature.

`ehail_fee`

- Not in the data sheet.
- A quick histogram shows that all of them are null too. Simply remove this.

`improvement_surchage`

- $0.30 improvement surchage assessed on hailed trips at the flag drop. This is a confusing term, so I did some research on it.
- A histogram shows that over 10M points are $0.30, 220K are $0.00, and 25K are -$0.30.
- The negative one stands out the most. Does that mean it’s deducted from their fare?
- It seems like this probably won’t have an impact on tipping, so we could probably prune this.

`total_amount`

- The total amount charged to passengers. Does not include cash tips.
- What does this mean? Should probably check how this sum was computed.
- Continuous variable.

`payment_type`

- A numeric code signifying how the passenger paid for the trip.
- Categorical variable (6 classes, Credit card, Cash, No charge, Dispute, Unknown, Voided trip).
- The histogram analysis doesn’t show “Voided trip”. Looks like it’s only 5 classes.

`trip_type`

- A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver.
- Categorical variable (2 classes, Street-hail, Dispatch).
- Since it can be altered by the driver, I assume one has a higher metered rate than the other. Changing this could probably cause a dispute and thus no tip. Might be an important variable.

- It is important to understand exactly how the Green Taxi functions. That would guide certain decisions into choosing the right features to use.
- Some of the variables have dirty values and have to be cleaned. It might be worth ignoring them for now and letting the neural network handle the noise. We can then clean it and show an improvement.
- Additional verification on certain variables is necessary as well.
`total_amount`

and`tip_amount`

is one such example. - It is important to go through every variable and understand what it is. The discrepancy in the data sheet also shows that we can’t always trust what is given to us. More probing is needed on the side.
- For
`lpep_pickup_datetime`

and`lpep_dropoff_datetime`

, it’s going to be hard to model them as they are and the peak hours seem to be factored in in`extra`

already. We might want to construct a derived feature of duration of journey. If the journey is long and time taken is short, a tip might be given. Conversely, if the journey is short and time taken is long, there might be a traffic jam and a tip isn’t given. This might even predict most of the tips!

```
VendorID 1
lpep_pickup_datetime 01/01/2017 12:13:13 AM
lpep_dropoff_datetime 01/01/2017 01:08:34 AM
store_and_fwd_flag N
RatecodeID 5
PULocationID 80
DOLocationID 265
passenger_count 1
trip_distance 46
fare_amount 0
extra 0
mta_tax 0
tip_amount 45
tolls_amount 5.54
ehail_fee NaN
improvement_surcharge 0
total_amount 50.54
payment_type 1
trip_type 2
```

This was an interesting one. The `fare_amount`

was actually 0, but the
`total_amount`

was 50.54! Turns out that this is a 46Km long trip and had a
negotiated fare. Does this mean that all negotiated fares have this
characteristic?

I also found out that the max `fare_amount`

was `6003.5`

and the min is
`-480.0`

. How can a fare be negative?! I explored the data a little more and did
some cleaning up, which you can find in `clean_data.py`

. There’s a lot more to
the exploration which I didn’t describe here.

Instead of using the actual time of pickup and dropoff, we can come up with an easy metric - the length of the journey.

As an extension of `duration`

, we can have `speed`

. We simply divide
`trip_distance`

by `duration`

.

A boolean value. If the tip is non-zero, then this returns `True`

, or `1`

, in
our case. We build a classifier with this target value later on.

0 - 6. Perhaps people tip less on Monday, and more on Friday?

0 or 1. Perhaps people simply tip on weekends, and generally don’t on weekdays?

With these features, and after a clean up of values that don’t make sense, we prune the data from approximately 11m rows to 5m rows. Most of the pruning comes from the fact that we don’t include cash transactions, which accounts for 5.7m rows. When making a prediction, we simply cast cash transactions to have no tip and hence 0 value.

At this point in time, I decided to code the basic model first and get a feel of
the current accuracy, before diving deeper. We first build a `DNNClassifier`

, to
get a feel of how many trips it predicted correctly. We will then build a
`DNNRegressor`

on top of that. Perhaps end to end models could be explored, or a
simply regression with a threshold could work too.

Side Note: TensorFlow GPU took less than 3 minutes to install (including CUDA cuBLAS, etc. I still remember the days of TensorFlow 0.6 where installing CUDA and the dependencies were a nightmare). Tested the install by running the sample code with

`CUDA_VISIBLE_DEVICE=`

, to prevent any GPU usage. There was a significant difference when the environment variable was set versus when it was not set, thus verifying the GPU indeed works.

TensorFlow has changed significantly over the past year, and I’ve been using
their new APIs. `tf.eager.execution()`

is a HUGE game changer. I can’t emphasize
how important that is. You should definitely check it out on their docs.
`tf.data.Dataset`

is also pretty awesome. No more janky queue runners and the
likes. When I compare what I wrote now to what I wrote a year ago, the change is
pretty significant. TensorFlow has improved significantly. I coded up the first
model pretty quickly, and then ran a simple experiment to make sure that the
TensorFlow code works. It did.

I then added all the relevant features, and it turned out that the loss was
diverging. I didn’t really have an idea why at first. I thought it was because
the features weren’t normalized. I added the normalization, but it still didn’t
work. I dug deeper into the data and realized that `speed`

was returning
infinite speeds. This was because we had `duration`

of value 0. I cleaned the
data again, and the model ran. The standard deviation of the columns also
strongly suggest that we should clean the data even more to account for these
situations.

```
passenger_count 1.3654899165449583 1.0452344257422215
trip_distance 3.0419148386103907 3.0557791184587324
fare_amount 13.032481964650795 9.848360928439012
extra 0.35816182232546484 0.3885605888383032
mta_tax 0.4920059988529926 0.06271483421703997
tip_amount 2.2817976371139967 2.595756212593663
tolls_amount 0.14511203827191288 1.2329375456225349
ehail_fee nan nan
improvement_surcharge 0.29547378107209477 0.036570196927254564
total_amount 16.72000659328439 11.768082912394933
trip_type 1.0146244501985067 0.12004406982922693
duration 1238.402000942381 5656.356759908103
speed 13.804815278283268 103.47450536216657
```

An average duration of 20 minutes make sense, but how can the standard deviation be 1.5 hours? I’m not saying that it isn’t possible, but we probably should dig deeper. We can see the same thing for speed. A 13MPH average speed does make sense in NY, but an SD of 103 MPH? That doesn’t sound right. When I looked at the data, it turns out that some of these rides have ridiculous speeds.

On the other hand, metrics like `passenger_count`

does make sense.
`trip_distance`

makes sense too. A distance of 3 miles with a SD of 3 miles.
Same for `tip_amount`

.

The basic model should be able to account for these outliers. Nonetheless, pruning this even more could further improve the accuracy.

There are `974180`

rows with a tip and and `208463`

rows without a tip. We would
expect a similar distribution in the train dataset as well. This would mean that
without any training at all, we can use a black box and simply predict
everything as a ride with a tip - and we would get an accuracy of 82.37%. A good
model would have to do better than 82.37%.

We used a really basic network with the following parameters:

```
classifier = tf.estimator.DNNClassifier(
feature_columns=feature_columns,
n_classes=2,
hidden_units=[2048,2048,1024,512,256,128,64],
batch_norm=True,
activation_fn=tf.nn.leaky_relu,
optimizer=tf.train.AdamOptimizer(0.0005)
)
```

We get the following results:

```
loss 115.284325
accuracy_baseline 0.82373124
global_step 739160
recall 0.9799513
auc 0.5073563
prediction/mean 0.9598021
precision 0.8375532
label/mean 0.82373124
average_loss 1.8013374
auc_precision_recall 0.8981662
accuracy 0.8269224
```

This is really interesting. TensorFlow has already computed the baseline accuracy for us. 974180 / 1182643 gives us exactly 0.82373124. Similar to what I typed above, we have to do better than this. We can see that our actual accuracy is 0.8269224, which is slightly better than the baseline.

Recall is high, but it isn’t 1.0, which weakly shows that the model isn’t simply
throwing out a 1 for every value it sees. Based on the recall, we can compute
the number of trips which had a tip, but was classified wrongly: ```
(1 -
0.9799513) * 974180
```

, which gives us `19531`

. You can do more simple
calculations which I won’t describe here, to arrive at `185158`

out of `208463`

trips which had no tip, but was classified as a having a tip. This gives us an
astoundingly low accuracy of 11.1% for trips without tips - if we already know
the answer of course.

I dug deep into some of the samples. Here’s one. I noticed that the model predicted this very confidently as having a tip. The probability was 0.998.

```
VendorID 2
lpep_pickup_datetime 2017-01-01 00:40:20
lpep_dropoff_datetime 2017-01-01 00:49:28
store_and_fwd_flag N
RatecodeID 1
PULocationID 74
DOLocationID 41
passenger_count 3
trip_distance 2.02
fare_amount 9
extra 0.5
mta_tax 0.5
tip_amount 0
tolls_amount 0
ehail_fee NaN
improvement_surcharge 0.3
total_amount 10.3
payment_type 1
trip_type 1
duration 548
day_of_week 6
weekend 1
speed 13.2701
tip 0
```

Well, this looks like a classic one which will have a tip intuitively. The
average speed is good at 13 MPH. It’s a weekend. The amount was 10.3. The trip
distance was short. At this point in time, I would actually hypothesize that
**tipping is mostly based on the person**, and if we could collect anonymized
data, that would probably be optimal. Nonetheless, let’s see if we can do
better. We could test a few hypothesis right now:

- The model capacity is too big and this model is overfitted. We trained for 10 epochs as well.
- The model needs to memorize exceptions better. Perhaps Wide and Deep Learning
would work? We could use
`PULocationID`

and`DOLocationID`

as pseudo variables for classifying a “person”, as where that person stays and goes could determine his or her propensity to tip.

Normally, if you had lots of GPUs and lots of time, you could test a few hypothesis at once on a cluster, but I’ll just lump a few changes into one and run the model:

- Let’s add some regularization and reduce the number of layers.
- Let’s use Wide and Deep Learning.
- Let’s do more feature engineering and add a cross column, to account for the fact that there exist a cross relationship between pickup and dropoff. Think of it as if I am going from Upper Manhattan to FiDi on a weekend to meet friends, I could be happy and tip. Or if I am going from Upper Manhattan to the hospital at 135 St to visit a sick relative, I could be sad and not tip. Or actually I could be sad and have a lot of empathy, and tip. Honestly, there are many possibilities, but a correlation could probably exist so let’s add this cross feature.

Trained on 2 epochs, contrary to 10 epochs in experiment 1, to save time.

Results:

```
loss 48.302036
accuracy_baseline 0.82373124
global_step 147832
recall 0.9992527
auc 0.75207496
prediction/mean 0.9770326
precision 0.8373608
label/mean 0.82373124
average_loss 0.7547276
auc_precision_recall 0.94602615
accuracy 0.839512
```

We’ve done a lot better than on our baseline, but recall has increased greatly
to `0.9992527`

. We have better predictions on trips that have a tip, but are
still doing poorly on trips that do not have a tip. Using this accuracy alone is
not the best measure. There was no regularization as it will be quite involved
to add it through the high level estimator APIs.

It’s time to read the Wide & Deep Learning paper. The similarity in that paper is It mentioned that a 32-dimensional embedding vector is learned for each categorical feature.

- Reduced the number of layers from 7 to 3, with each layer having the number of hidden units: [1024, 512, 256], as per the paper.
- Used
`leaky_relu`

previously, as I had better performance with it on certain datasets. Using`relu`

now to model the paper directly. - Used
`Adam`

previously. Using`Adagrad`

now, as per the paper. - Keeping
`batch_norm`

on still, as it “always” helps. - Architecture tweaks. Previously, only numeric columns were used in the deep, and categorical columns in the wide part. Now, numeric columns and categorical features (not all of them) that are converted to embeddings are in the deep part, and the single cross feature is in the wide part.

Results:

```
loss 3.5321705
accuracy_baseline 0.82373124
global_step 147832
recall 0.9923176
auc 0.9944181
prediction/mean 0.8270121
precision 0.99146277
label/mean 0.82373124
average_loss 0.05519077
auc_precision_recall 0.9987084
accuracy 0.9866333
```

Hmm. Too good to be true? We need to dive deeper into this. 7484 trips that had a tip were classified as a trip with no tip, and 8324 trips that had no tip were classified as a trip with a tip. Now these results are a lot more interesting. This gives us an effective accuracy of 99.2% for trips with tips, and 96% for trips without tips.

I was convinced initially, and dug deeper into the features I was using. I
forgot to remove `total_amount`

, which meant the model could have learnt to
subtract `total_amount`

with `fare_amount`

to derive if there was a tip or not.
True enough, that was what happened.

I added more features this time by using the crossed column concept for the
paper. I could also add `trip_type`

in future, but this requires cleaning the
datasets again (the code here definitely can be improved so this won’t happen).
That could be future work.

```
loss 25.42006
accuracy_baseline 0.82373124
global_step 147832
recall 0.9817806
auc 0.75852734
prediction/mean 0.82457703
precision 0.84597015
label/mean 0.82373124
average_loss 0.3971928
auc_precision_recall 0.924248
accuracy 0.83774394
```

We will use this as our final model and proceed to the regression.

For the regression model, we use the exact same parameters as the classifier. Technically, the weights for these models can be shared, but for now, we simply train them as a different model altogether. There aren’t any experiments needed as most of them have been knocked out in the classifier.

We can see most of the values are clustered to the bottom left quadrant. However, this graph isn’t really clear what the distribution of points is like.

We plot the empirical CDF, and it clearly shows that most of errors are actually
really small. Digging deeper into the values, we actually see that 93% of the
errors are actually between `+-2`

. I had a suspicion that this would happen
from the beginning - for those that tip, most of them would just tip the
standard rate of 15% - 20% in NY. Perhaps more would be on the side of 20% as it
is easier to compute. This actually means that a simple regressor could do
really well too. In any case, the RMSE for this model is `2.316`

. It’s pretty
decent - happy to stick with it.

What if we could deploy this model as a component of a mobile app solution that helps the taxi driver estimate the expected tip for a particular trip?

If the features are all available at the point of pick up, the data can be sent from the client to the server for processing. At the server, we will predict whether or not there will be a tip and how much the tip will be.

Implementing interpretability models would allow us to ascertain exactly which levers to pull to make this trip one with a tip. For example, we could identify that for this particular trip, if the duration is significantly shorter, then we might want to flag to this to the driver. The driver might then speed up a little so as to earn that extra tip. The downsides of this of course, would be that there might be dangerous driving, and the passenger might end up not tipping because of dangerous driving. These are just ideas of course.

The fundamental principle would be to predict whether or not there will be a tip, and then sending the levers which the driver can pull to earn or maximize the tip.

This was a pretty interesting exercise - I suspect nothing complicated was really needed. We could really try our best to push the limits of the accuracy of this, but it would take too much time. Tipping is perhaps based on the driver and the passenger, more than the distance of the journey or the features given in this dataset.