Posted: 2 November 2018
I normalized the data by subtracting the mean and dividing by the standard deviation of each feature. I created 5 new features: duration, speed, tip, day_of_week, and weekend. I didn't comment the code extensively; I focused more on the process and this post. The idea of deploying this in a production environment is below - you can find the header in the contents page. The classifier has an overall accuracy of 83.8%, about 1% higher than the baseline, and the regressor has an RMSE of 2.316, with 93% of predictions within 2 dollars of the actual tip amount. The weakness of this model is that it tends to simply predict that all trips will have a tip, as the data is quite skewed. Having more, and more representative, features would greatly improve the model. If there were constraints, I would lean towards optimizing the classifier, as the regressor tends to simply return the mean. More details below.
The challenge is to implement a machine learning model that predicts, for a given taxi trip, whether a tip will be paid, and for a trip with a tip, what the expected tip amount is. The data that we will be using is the 2017 Green Taxi Trip Data dataset.
As with all problems related to data, we first have to understand the data. I don’t have to use deep learning, but I would first like to try a classic fully connected neural network for this problem and see how it performs. Other methods are possible too. The first thing that I think of would be normalizing the features. Continuous variables have to be normalized, and categorical variables have to be one hot encoded. TensorFlow has a really good guide on feature columns. I will be using TensorFlow for the entire process.
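To make that concrete, here is a minimal sketch of what those feature columns might look like. The column names come from the dataset, but the normalization statistics and bucket size are illustrative placeholders (the means and standard deviations roughly match the summary statistics further down), not necessarily what my code uses.

```python
import tensorflow as tf

# Continuous features: normalize via normalizer_fn (illustrative mean/std).
trip_distance = tf.feature_column.numeric_column(
    "trip_distance", normalizer_fn=lambda x: (x - 3.04) / 3.06)
fare_amount = tf.feature_column.numeric_column(
    "fare_amount", normalizer_fn=lambda x: (x - 13.03) / 9.85)

# Categorical features: one-hot encode with an indicator column.
payment_type = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_identity(
        "payment_type", num_buckets=7))  # payment_type codes are small ints

feature_columns = [trip_distance, fare_amount, payment_type]
```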
The dataset has the following columns:

VendorID
lpep_pickup_datetime
lpep_dropoff_datetime
store_and_fwd_flag
RatecodeID
PULocationID - There is no pickup_longitude or pickup_latitude field; looking at the data, there clearly isn't a longitude or latitude. They seem to have converted these into zones. That's great, as we don't have to do our own preprocessing to convert the latitudes and longitudes into zones. There is probably a mapping dataset out there that accounts for the slight tilt of Manhattan and can map the lat/longs into proper grids. [Figure: PU Location ID values]
DOLocationID - [Figure: DO Location ID values]
passenger_count
trip_distance
fare_amount
extra - [Figure: Extra histogram]
mta_tax - [Figure: MTA tax histogram]
tip_amount - ? Need to check this. [Figure: Tip amount stacked with payment type]
ehail_fee
improvement_surcharge
total_amount
payment_type
trip_type
Some of these columns are closely related to each other - total_amount and tip_amount are one such example. As for lpep_pickup_datetime and lpep_dropoff_datetime, it's going to be hard to model them as they are, and the peak hours already seem to be factored into extra. We might want to construct a derived feature for the duration of the journey. If the journey is long and the time taken is short, a tip might be given. Conversely, if the journey is short and the time taken is long, there might be a traffic jam and a tip isn't given. This might even predict most of the tips!

Here is one sample row that stood out:

VendorID 1
lpep_pickup_datetime 01/01/2017 12:13:13 AM
lpep_dropoff_datetime 01/01/2017 01:08:34 AM
store_and_fwd_flag N
RatecodeID 5
PULocationID 80
DOLocationID 265
passenger_count 1
trip_distance 46
fare_amount 0
extra 0
mta_tax 0
tip_amount 45
tolls_amount 5.54
ehail_fee NaN
improvement_surcharge 0
total_amount 50.54
payment_type 1
trip_type 2
This was an interesting one. The fare_amount was actually 0, but the total_amount was 50.54! It turns out that this is a 46-mile trip with a negotiated fare. Does this mean that all negotiated fares have this characteristic?

I also found out that the max fare_amount was 6003.5 and the min was -480.0. How can a fare be negative?! I explored the data a little more and did some cleaning up, which you can find in clean_data.py. There's a lot more to the exploration which I didn't describe here.
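I won't reproduce clean_data.py here, but the kind of filters it applies might look roughly like this (the thresholds and the file name are illustrative, not the actual contents of the script):

```python
import pandas as pd

df = pd.read_csv("2017_green_taxi_trip_data.csv")  # hypothetical file name

# Drop rows with obviously invalid monetary values (e.g. negative fares).
df = df[df["fare_amount"] >= 0]
df = df[df["total_amount"] >= 0]
df = df[df["tip_amount"] >= 0]

# Drop rows where no distance was travelled.
df = df[df["trip_distance"] > 0]
```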
duration - Instead of using the actual times of pickup and dropoff, we can come up with an easy metric: the length of the journey.
speed - As an extension of duration, we can have speed. We simply divide trip_distance by duration.
tip - A boolean value. If the tip is non-zero, then this is True, or 1, in our case. We build a classifier with this target value later on.
day_of_week - 0 to 6. Perhaps people tip less on Monday, and more on Friday?
weekend - 0 or 1. Perhaps people simply tip on weekends, and generally don't on weekdays?
With these features, and after a clean-up of values that don't make sense, we prune the data from approximately 11m rows to 5m rows. Most of the pruning comes from the fact that we don't include cash transactions, which account for 5.7m rows. When making a prediction, we simply treat cash transactions as having no tip, and hence a value of 0.
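A minimal pandas sketch of the derived features and the cash filter, under the assumption that duration is measured in seconds and speed in miles per hour (which is what the summary statistics further down suggest); the actual code may differ:

```python
import pandas as pd

pickup = pd.to_datetime(df["lpep_pickup_datetime"])
dropoff = pd.to_datetime(df["lpep_dropoff_datetime"])

df["duration"] = (dropoff - pickup).dt.total_seconds()         # journey length in seconds
df["speed"] = df["trip_distance"] / (df["duration"] / 3600.0)  # miles per hour
df["tip"] = (df["tip_amount"] > 0).astype(int)                 # classifier target
df["day_of_week"] = pickup.dt.dayofweek                        # 0 = Monday ... 6 = Sunday
df["weekend"] = (df["day_of_week"] >= 5).astype(int)

# Cash transactions are excluded from training; at prediction time they are
# simply assigned a tip of 0. In this dataset, payment_type 2 is cash.
df = df[df["payment_type"] != 2]
```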
At this point in time, I decided to code the basic model first and get a feel for the current accuracy before diving deeper. We first build a DNNClassifier, to get a feel for how many trips it predicts correctly. We will then build a DNNRegressor on top of that. Perhaps end-to-end models could be explored, or a simple regression with a threshold could work too.
Side Note: TensorFlow GPU took less than 3 minutes to install (including CUDA, cuBLAS, etc. I still remember the days of TensorFlow 0.6, where installing CUDA and the dependencies was a nightmare). I tested the install by running the sample code with CUDA_VISIBLE_DEVICES= set, to prevent any GPU usage. There was a significant difference when the environment variable was set versus when it was not, verifying that the GPU was indeed being used.
TensorFlow has changed significantly over the past year, and I've been using their new APIs. Eager execution (tf.enable_eager_execution()) is a HUGE game changer. I can't emphasize how important that is. You should definitely check it out in their docs. tf.data.Dataset is also pretty awesome. No more janky queue runners and the like. When I compare what I wrote now to what I wrote a year ago, the change is pretty significant. TensorFlow has improved a lot. I coded up the first model pretty quickly, and then ran a simple experiment to make sure that the TensorFlow code works. It did.
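As an illustration of what this buys us, here is a hedged sketch of an Estimator input function built on tf.data.Dataset (the helper name, batch size, and shuffle buffer are placeholders rather than my actual code):

```python
import tensorflow as tf

def make_input_fn(df, label_col="tip", batch_size=512, shuffle=True):
    """Builds an input_fn that feeds a pandas DataFrame to an Estimator."""
    def input_fn():
        # Assumes df holds only numeric feature columns plus the label.
        features = {name: df[name].values for name in df.columns if name != label_col}
        dataset = tf.data.Dataset.from_tensor_slices((features, df[label_col].values))
        if shuffle:
            dataset = dataset.shuffle(buffer_size=10000)
        return dataset.batch(batch_size)
    return input_fn
```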
I then added all the relevant features, and it turned out that the loss was diverging. I didn't really have an idea why at first. I thought it was because the features weren't normalized. I added the normalization, but it still didn't work. I dug deeper into the data and realized that speed was returning infinite values. This was because some rows had a duration of 0. I cleaned the data again, and the model ran. The standard deviations of the columns also strongly suggest that we should clean the data even more to account for these situations.
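The fix itself is a one-liner along these lines (a sketch, not the exact code):

```python
# Trips with a duration of 0 produce infinite speeds, so drop them
# before recomputing speed.
df = df[df["duration"] > 0]
df["speed"] = df["trip_distance"] / (df["duration"] / 3600.0)
```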
column                  mean                  standard deviation
passenger_count         1.3654899165449583    1.0452344257422215
trip_distance           3.0419148386103907    3.0557791184587324
fare_amount             13.032481964650795    9.848360928439012
extra                   0.35816182232546484   0.3885605888383032
mta_tax                 0.4920059988529926    0.06271483421703997
tip_amount              2.2817976371139967    2.595756212593663
tolls_amount            0.14511203827191288   1.2329375456225349
ehail_fee               nan                   nan
improvement_surcharge   0.29547378107209477   0.036570196927254564
total_amount            16.72000659328439     11.768082912394933
trip_type               1.0146244501985067    0.12004406982922693
duration                1238.402000942381     5656.356759908103
speed                   13.804815278283268    103.47450536216657
An average duration of 20 minutes makes sense, but how can the standard deviation be 1.5 hours? I'm not saying that it isn't possible, but we should probably dig deeper. We can see the same thing for speed. A 13 MPH average speed does make sense in NY, but an SD of 103 MPH? That doesn't sound right. When I looked at the data, it turned out that some of these rides have ridiculous speeds.
On the other hand, metrics like passenger_count do make sense. trip_distance makes sense too: a mean of 3 miles with an SD of 3 miles. Same for tip_amount.
The basic model should be able to account for these outliers. Nonetheless, pruning this even more could further improve the accuracy.
There are 974180 rows with a tip and 208463 rows without a tip. We would expect a similar distribution in the train dataset as well. This means that without any training at all, we could use a black box that simply predicts every ride as one with a tip - and we would get an accuracy of 82.37%. A good model would have to do better than 82.37%.
We used a really basic network with the following parameters:
classifier = tf.estimator.DNNClassifier(
feature_columns=feature_columns,
n_classes=2,
hidden_units=[2048,2048,1024,512,256,128,64],
batch_norm=True,
activation_fn=tf.nn.leaky_relu,
optimizer=tf.train.AdamOptimizer(0.0005)
)
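Training and evaluation then go through the usual Estimator calls, roughly like this (reusing the make_input_fn sketch from earlier; the split into train_df and eval_df is assumed):

```python
# Train over the training split, then evaluate on the held-out split.
classifier.train(input_fn=make_input_fn(train_df))
metrics = classifier.evaluate(input_fn=make_input_fn(eval_df, shuffle=False))
print(metrics)
```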
We get the following results:
loss 115.284325
accuracy_baseline 0.82373124
global_step 739160
recall 0.9799513
auc 0.5073563
prediction/mean 0.9598021
precision 0.8375532
label/mean 0.82373124
average_loss 1.8013374
auc_precision_recall 0.8981662
accuracy 0.8269224
This is really interesting. TensorFlow has already computed the baseline accuracy for us: 974180 / 1182643 gives us exactly 0.82373124. As mentioned above, we have to do better than this. We can see that our actual accuracy is 0.8269224, which is only slightly better than the baseline.
Recall is high, but it isn't 1.0, which weakly shows that the model isn't simply throwing out a 1 for every value it sees. Based on the recall, we can compute the number of trips which had a tip but were classified wrongly: (1 - 0.9799513) * 974180, which gives us 19531. You can do more simple calculations, which I won't describe here, to arrive at 185158 out of 208463 trips which had no tip but were classified as having a tip. This gives us an astoundingly low accuracy of 11.1% for trips without tips - if we already know the answer, of course.
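Written out as code, the arithmetic looks like this (the reported metrics are rounded, so the derived counts are off by a few trips from the figures above):

```python
positives = 974180   # trips with a tip in the evaluation set
negatives = 208463   # trips without a tip
recall = 0.9799513
precision = 0.8375532

true_positives = recall * positives                              # ~954,649
false_negatives = (1 - recall) * positives                       # ~19,531
false_positives = true_positives * (1 - precision) / precision   # ~185,158
true_negatives = negatives - false_positives                     # ~23,305
print(true_negatives / negatives)                                # ~0.112 accuracy on no-tip trips
```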
I dug deep into some of the samples. Here’s one. I noticed that the model predicted this very confidently as having a tip. The probability was 0.998.
VendorID 2
lpep_pickup_datetime 2017-01-01 00:40:20
lpep_dropoff_datetime 2017-01-01 00:49:28
store_and_fwd_flag N
RatecodeID 1
PULocationID 74
DOLocationID 41
passenger_count 3
trip_distance 2.02
fare_amount 9
extra 0.5
mta_tax 0.5
tip_amount 0
tolls_amount 0
ehail_fee NaN
improvement_surcharge 0.3
total_amount 10.3
payment_type 1
trip_type 1
duration 548
day_of_week 6
weekend 1
speed 13.2701
tip 0
Intuitively, this looks like a classic trip that would have a tip. The average speed is good at 13 MPH. It's a weekend. The amount was 10.3. The trip distance was short. At this point, I would actually hypothesize that tipping is mostly based on the person, and if we could collect anonymized data about the passenger, that would probably be optimal. Nonetheless, let's see if we can do better. We could test a few hypotheses right now:
Use PULocationID and DOLocationID as pseudo variables for classifying a "person", as where that person stays and goes could determine his or her propensity to tip.

Normally, if you had lots of GPUs and lots of time, you could test a few hypotheses at once on a cluster, but I'll just lump a few changes into one and run the model:

Trained on 2 epochs, as opposed to 10 epochs in experiment 1, to save time.
Results:
loss 48.302036
accuracy_baseline 0.82373124
global_step 147832
recall 0.9992527
auc 0.75207496
prediction/mean 0.9770326
precision 0.8373608
label/mean 0.82373124
average_loss 0.7547276
auc_precision_recall 0.94602615
accuracy 0.839512
We've done a lot better than our baseline, but recall has increased further, to 0.9992527. We have better predictions on trips that have a tip, but are still doing poorly on trips that do not have a tip. Accuracy alone is not the best measure here. There was no regularization, as it would be quite involved to add it through the high-level estimator APIs.
It's time to read the Wide & Deep Learning paper. One similarity with that paper is that a 32-dimensional embedding vector is learned for each categorical feature. The changes for this experiment:
I used leaky_relu previously, as I had better performance with it on certain datasets. Using relu now to model the paper directly.
I used Adam previously. Using Adagrad now, as per the paper.
batch_norm stays on, as it "always" helps.
Results:
loss 3.5321705
accuracy_baseline 0.82373124
global_step 147832
recall 0.9923176
auc 0.9944181
prediction/mean 0.8270121
precision 0.99146277
label/mean 0.82373124
average_loss 0.05519077
auc_precision_recall 0.9987084
accuracy 0.9866333
Hmm. Too good to be true? We need to dive deeper into this. 7484 trips that had a tip were classified as trips with no tip, and 8324 trips that had no tip were classified as trips with a tip. Now these results are a lot more interesting. This gives us an effective accuracy of 99.2% for trips with tips, and 96% for trips without tips.
I was convinced initially, but dug deeper into the features I was using. I had forgotten to remove total_amount, which meant the model could have learnt to subtract fare_amount from total_amount to derive whether there was a tip or not. True enough, that was what happened.
I added more features this time, using the crossed column concept from the paper. I could also add trip_type in future, but this requires cleaning the datasets again (the code here can definitely be improved so this won't happen). That could be future work.
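A sketch of what the crossed and embedded location columns might look like (the hash bucket size is a placeholder; the 32-dimensional embeddings follow the paper; zone IDs run from 1 to 265):

```python
# Learn 32-dimensional embeddings for pickup and dropoff zones.
pu = tf.feature_column.categorical_column_with_identity("PULocationID", num_buckets=266)
do = tf.feature_column.categorical_column_with_identity("DOLocationID", num_buckets=266)
pu_embedding = tf.feature_column.embedding_column(pu, dimension=32)
do_embedding = tf.feature_column.embedding_column(do, dimension=32)

# Cross pickup x dropoff so the model can learn per-route behaviour.
pu_x_do = tf.feature_column.crossed_column(
    ["PULocationID", "DOLocationID"], hash_bucket_size=10000)
pu_x_do_embedding = tf.feature_column.embedding_column(pu_x_do, dimension=32)
```

Results: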
loss 25.42006
accuracy_baseline 0.82373124
global_step 147832
recall 0.9817806
auc 0.75852734
prediction/mean 0.82457703
precision 0.84597015
label/mean 0.82373124
average_loss 0.3971928
auc_precision_recall 0.924248
accuracy 0.83774394
We will use this as our final model and proceed to the regression.
For the regression model, we use the exact same parameters as the classifier. Technically, the weights for these models could be shared, but for now, we simply train them as separate models altogether. There aren't any experiments needed here, as most of the questions were already knocked out while building the classifier.
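A sketch of the regressor, assuming it simply mirrors the classifier's configuration (the Adagrad learning rate is a placeholder, and tip_amount is the label):

```python
# Same architecture as the classifier; the label is now tip_amount.
regressor = tf.estimator.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[2048, 2048, 1024, 512, 256, 128, 64],
    batch_norm=True,
    activation_fn=tf.nn.relu,
    optimizer=tf.train.AdagradOptimizer(0.05),  # placeholder learning rate
)

regressor.train(input_fn=make_input_fn(train_df, label_col="tip_amount"))
```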
[Figure: Regression errors]
We can see most of the values are clustered to the bottom left quadrant. However, this graph isn’t really clear what the distribution of points is like.
[Figure: Regression errors, empirical CDF]
We plot the empirical CDF, and it clearly shows that most of the errors are really small. Digging deeper into the values, we see that 93% of the errors are within +-2 dollars. I had a suspicion from the beginning that this would happen - for those that tip, most would just tip the standard rate of 15% - 20% in NY. Perhaps more would be on the 20% side, as it is easier to compute. This actually means that a simple regressor could do really well too. In any case, the RMSE for this model is 2.316. It's pretty decent - happy to stick with it.
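The two numbers quoted here come from calculations along these lines (a sketch over hypothetical arrays of predicted and actual tips):

```python
import numpy as np

errors = predicted_tips - actual_tips        # hypothetical numpy arrays
rmse = np.sqrt(np.mean(errors ** 2))         # reported above as 2.316
within_two = np.mean(np.abs(errors) <= 2.0)  # reported above as ~93%
```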
What if we could deploy this model as a component of a mobile app solution that helps the taxi driver estimate the expected tip for a particular trip?
If the features are all available at the point of pick up, the data can be sent from the client to the server for processing. At the server, we will predict whether or not there will be a tip and how much the tip will be.
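One way to wire up the server side would be to export the trained estimators as SavedModels and put them behind something like TensorFlow Serving. A rough sketch, assuming the same feature_columns used in training and placeholder export paths:

```python
# Build a serving input function from the training-time feature columns.
feature_spec = tf.feature_column.make_parse_example_spec(feature_columns)
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)

# Export both models; the server predicts tip / no tip, then the amount.
classifier.export_savedmodel("export/tip_classifier", serving_input_fn)
regressor.export_savedmodel("export/tip_regressor", serving_input_fn)
```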
Implementing interpretability methods would allow us to ascertain exactly which levers to pull to make a trip one with a tip. For example, we could identify that for a particular trip, a significantly shorter duration would make a tip more likely, and flag this to the driver. The driver might then speed up a little so as to earn that extra tip. The downsides, of course, are that this might encourage dangerous driving, and the passenger might end up not tipping because of the dangerous driving. These are just ideas, of course.
The fundamental principle would be to predict whether or not there will be a tip, and then send the driver the levers they can pull to earn or maximize the tip.
This was a pretty interesting exercise - I suspect nothing complicated was really needed. We could really try our best to push the limits of the accuracy of this, but it would take too much time. Tipping is perhaps based on the driver and the passenger, more than the distance of the journey or the features given in this dataset.