Posted: 2 November 2018
I normalized the data by subtracting the mean and dividing by the standard deviation of each feature. I created 5 new features: `duration`, `speed`, `tip`, `day_of_week`, and `weekend`. I didn't comment the code extensively; I focused more on the process and this post. An idea for deploying this in a production environment is described below - you can find the section via the contents page.
The classifier has an overall accuracy of 83.8%, 1% higher than the baseline, and the regressor has an RMSE of 2.316, with 93% of predictions within 2 dollars of the actual tip amount. The weakness of this model is that it tends to simply predict that all trips will have a tip, as the data is quite skewed. More representative features would greatly improve the model. Under constraints, I would lean towards optimizing the classifier, as the regressor tends to simply return the mean. More details below.
The challenge is to implement a machine learning model that predicts, for a given taxi trip, whether a tip will be paid, and for a trip with a tip, the expected tip amount. The data we will be using is the 2017 Green Taxi Trip Data dataset.
As with all problems related to data, we first have to understand the data. I don't have to use deep learning, but I would first like to try a classic fully connected neural network for this problem and see how it performs; other methods are possible too. The first thing I think of is normalizing the features. Continuous variables have to be normalized, and categorical variables have to be one-hot encoded. TensorFlow has a really good guide on feature columns. I will be using TensorFlow for the entire process.
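As a rough sketch of that preprocessing (plain Python rather than TensorFlow's feature-column API, with made-up numbers), normalizing a continuous column and one-hot encoding a categorical one look like this:

```python
def normalize(values):
    """Z-score a continuous feature: subtract the mean, divide by the std."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over its vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

distances_z = normalize([1.0, 2.0, 3.0])            # roughly [-1.22, 0.0, 1.22]
ratecode_oh = one_hot(2, vocabulary=[1, 2, 5, 6])   # -> [0, 1, 0, 0]
```

In the real pipeline, `tf.feature_column.numeric_column` (with a `normalizer_fn`) and `tf.feature_column.indicator_column` do the equivalent work.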
The dataset documentation mentions a `pickup_latitude` field, but looking at the data, there clearly isn't a longitude or latitude - they seem to have been converted into zones. That's great, as we don't have to do our own preprocessing to convert the latitudes and longitudes into zones. There is probably a mapping dataset out there that accounts for the slight tilt of Manhattan and maps the lat/longs into proper grids.
Are tips always recorded in `tip_amount`? Need to check this. Some fields can't be taken at face value, and `tip_amount` is one such example.
As for `lpep_pickup_datetime` and `lpep_dropoff_datetime`, it's going to be hard to model them as they are, and peak hours seem to be factored into `extra` already. We might want to construct a derived feature: the duration of the journey. If the journey is long and the time taken is short, a tip might be given. Conversely, if the journey is short and the time taken is long, there might be a traffic jam and a tip isn't given. This might even predict most of the tips!
```
VendorID                                    1
lpep_pickup_datetime   01/01/2017 12:13:13 AM
lpep_dropoff_datetime  01/01/2017 01:08:34 AM
store_and_fwd_flag                          N
RatecodeID                                  5
PULocationID                               80
DOLocationID                              265
passenger_count                             1
trip_distance                              46
fare_amount                                 0
extra                                       0
mta_tax                                     0
tip_amount                                 45
tolls_amount                             5.54
ehail_fee                                 NaN
improvement_surcharge                       0
total_amount                            50.54
payment_type                                1
trip_type                                   2
```
This was an interesting one. The `fare_amount` was actually 0, but the `total_amount` was 50.54! It turns out that this is a 46 km long trip with a negotiated fare. Does this mean that all negotiated fares look like this? Need to check.
I also found out that the max `fare_amount` is 6003.5 and the min is -480.0. How can a fare be negative?! I explored the data a little more and did some cleaning up, which you can find in `clean_data.py`. There's a lot more to the exploration that I didn't describe here.
Instead of using the actual times of pickup and dropoff, we can come up with some easy derived features:

- `duration`: the length of the journey.
- `speed`: as an extension of `duration`, we simply divide `trip_distance` by `duration`.
- `tip`: a boolean value. If the tip is non-zero, this returns 1 in our case. We build a classifier with this target value later on.
- `day_of_week`: 0 - 6. Perhaps people tip less on Monday, and more on Friday?
- `weekend`: 0 or 1. Perhaps people simply tip on weekends, and generally don't on weekdays?
With these features, and after a clean-up of values that don't make sense, we prune the data from approximately 11m rows to 5m rows. Most of the pruning comes from the fact that we don't include cash transactions, which account for 5.7m rows. When making a prediction, we simply cast cash transactions to have no tip, and hence a value of 0.
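A sketch of deriving those features from one row, using plain Python datetimes. The column names come from the dataset; the timestamp format is the `MM/DD/YYYY` style shown in the sample record above (some records use ISO timestamps instead, so a real pipeline would need to handle both):

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"  # e.g. "01/01/2017 12:13:13 AM"

def derive_features(row):
    """Compute duration, speed, tip, day_of_week, and weekend for one trip."""
    pickup = datetime.strptime(row["lpep_pickup_datetime"], FMT)
    dropoff = datetime.strptime(row["lpep_dropoff_datetime"], FMT)
    duration = (dropoff - pickup).total_seconds()
    return {
        "duration": duration,
        # guard against the zero-duration rows that produced infinite speeds
        "speed": row["trip_distance"] / (duration / 3600) if duration > 0 else None,
        "tip": 1 if row["tip_amount"] > 0 else 0,
        "day_of_week": pickup.weekday(),             # 0 = Monday .. 6 = Sunday
        "weekend": 1 if pickup.weekday() >= 5 else 0,
    }
```

Running this on the negotiated-fare record above gives a duration of 3321 seconds and a speed of about 49.9 MPH, which already hints at the outliers discussed later.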
At this point in time, I decided to code the basic model first and get a feel of the current accuracy, before diving deeper. We first build a `DNNClassifier` and get a feel of how many trips it predicts correctly. We will then build a `DNNRegressor` on top of that. Perhaps end-to-end models could be explored, or a simple regression with a threshold could work too.
Side note: TensorFlow GPU took less than 3 minutes to install (including CUDA, cuBLAS, etc. I still remember the days of TensorFlow 0.6, when installing CUDA and the dependencies was a nightmare). I tested the install by running the sample code with `CUDA_VISIBLE_DEVICES=` set, to prevent any GPU usage. There was a significant difference between when the environment variable was set and when it was not, verifying that the GPU indeed works.
TensorFlow has changed significantly over the past year, and I've been using their new APIs. Eager execution (`tf.enable_eager_execution()`) is a HUGE game changer. I can't emphasize enough how important that is. You should definitely check it out in their docs.
`tf.data.Dataset` is also pretty awesome. No more janky queue runners and the like. When I compare what I wrote now to what I wrote a year ago, the change is pretty significant - TensorFlow has improved a lot. I coded up the first model pretty quickly, and then ran a simple experiment to make sure the TensorFlow code works. It did.
I then added all the relevant features, and it turned out that the loss was diverging. I didn't really have an idea why at first. I thought it was because the features weren't normalized. I added the normalization, but it still didn't work. I dug deeper into the data and realized that `speed` was returning infinite values, because some rows had a `duration` of 0. I cleaned the data again, and the model ran. The standard deviations of the columns also strongly suggest that we should clean the data even more to account for these outliers:
```
column                 mean                   std
passenger_count        1.3654899165449583     1.0452344257422215
trip_distance          3.0419148386103907     3.0557791184587324
fare_amount            13.032481964650795     9.848360928439012
extra                  0.35816182232546484    0.3885605888383032
mta_tax                0.4920059988529926     0.06271483421703997
tip_amount             2.2817976371139967     2.595756212593663
tolls_amount           0.14511203827191288    1.2329375456225349
ehail_fee              nan                    nan
improvement_surcharge  0.29547378107209477    0.036570196927254564
total_amount           16.72000659328439      11.768082912394933
trip_type              1.0146244501985067     0.12004406982922693
duration               1238.402000942381      5656.356759908103
speed                  13.804815278283268     103.47450536216657
```
An average duration of 20 minutes makes sense, but how can the standard deviation be 1.5 hours? I'm not saying that it isn't possible, but we should probably dig deeper. We can see the same thing for speed. A 13 MPH average speed does make sense in NY, but an SD of 103 MPH? That doesn't sound right. When I looked at the data, it turned out that some of these rides have ridiculous speeds.
On the other hand, metrics like `passenger_count` do make sense, and `trip_distance` makes sense too: a distance of 3 miles with an SD of 3 miles.
The basic model should be able to account for these outliers. Nonetheless, pruning this even more could further improve the accuracy.
The test dataset has 974180 rows with a tip and 208463 rows without a tip. We would expect a similar distribution in the train dataset as well. This means that, without any training at all, we could use a black box that simply predicts every ride has a tip - and we would get an accuracy of 82.37%. A good model has to do better than 82.37%.
We used a really basic network with the following parameters:
```python
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=2,
    hidden_units=[2048, 2048, 1024, 512, 256, 128, 64],
    batch_norm=True,
    activation_fn=tf.nn.leaky_relu,
    optimizer=tf.train.AdamOptimizer(0.0005),
)
```
We get the following results:
```
loss                  115.284325
accuracy_baseline     0.82373124
global_step           739160
recall                0.9799513
auc                   0.5073563
prediction/mean       0.9598021
precision             0.8375532
label/mean            0.82373124
average_loss          1.8013374
auc_precision_recall  0.8981662
accuracy              0.8269224
```
This is really interesting. TensorFlow has already computed the baseline accuracy for us: 974180 / 1182643 gives us exactly 0.82373124. As noted above, we have to do better than this. Our actual accuracy is 0.8269224, which is slightly better than the baseline.
Recall is high, but it isn't 1.0, which weakly shows that the model isn't simply throwing out a 1 for every value it sees. Based on the recall, we can compute the number of trips which had a tip but were classified wrongly: (1 - 0.9799513) * 974180, which gives us 19531. You can do more simple calculations, which I won't describe here, to arrive at 185158 out of 208463 trips which had no tip but were classified as having a tip. This gives us an astoundingly low accuracy of 11.1% for trips without tips - if we already know the answer, of course.
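The confusion-matrix arithmetic above can be reproduced from the reported recall and precision alone (numbers from experiment 1):

```python
pos, neg = 974_180, 208_463                  # test rows with / without a tip
recall, precision = 0.9799513, 0.8375532

tp = round(recall * pos)                     # tipped trips predicted correctly
fn = pos - tp                                # tipped trips missed: 19531
fp = round(tp / precision) - tp              # no-tip trips predicted as tipped: 185158
tn = neg - fp                                # no-tip trips predicted correctly: 23305

no_tip_accuracy = tn / neg                   # roughly 0.11
overall_accuracy = (tp + tn) / (pos + neg)   # roughly 0.8269, matching the report
```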
I dug deep into some of the samples. Here’s one. I noticed that the model predicted this very confidently as having a tip. The probability was 0.998.
```
VendorID                               2
lpep_pickup_datetime   2017-01-01 00:40:20
lpep_dropoff_datetime  2017-01-01 00:49:28
store_and_fwd_flag                     N
RatecodeID                             1
PULocationID                          74
DOLocationID                          41
passenger_count                        3
trip_distance                       2.02
fare_amount                            9
extra                                0.5
mta_tax                              0.5
tip_amount                             0
tolls_amount                           0
ehail_fee                            NaN
improvement_surcharge                0.3
total_amount                        10.3
payment_type                           1
trip_type                              1
duration                             548
day_of_week                            6
weekend                                1
speed                            13.2701
tip                                    0
```
Well, intuitively this looks like a classic trip that would have a tip. The average speed is good at 13 MPH. It's a weekend. The amount was 10.3. The trip distance was short. At this point, I would actually hypothesize that tipping is mostly based on the person, and if we could collect anonymized data about the passenger, that would probably be optimal. Nonetheless, let's see if we can do better. We could test a few hypotheses right now:
Use `PULocationID` and `DOLocationID` as pseudo-variables for classifying a "person", as where a person stays and goes could determine his or her propensity to tip.
Normally, if you had lots of GPUs and lots of time, you could test a few hypotheses at once on a cluster, but I'll just lump a few changes into one and run the model. Trained for 2 epochs, versus 10 epochs in experiment 1, to save time.
```
loss                  48.302036
accuracy_baseline     0.82373124
global_step           147832
recall                0.9992527
auc                   0.75207496
prediction/mean       0.9770326
precision             0.8373608
label/mean            0.82373124
average_loss          0.7547276
auc_precision_recall  0.94602615
accuracy              0.839512
```
We've done a lot better than our baseline, but recall has increased greatly to 0.9992527. We have better predictions on trips that have a tip, but are still doing poorly on trips that do not have a tip, so accuracy alone is not the best measure. There was no regularization, as it would be quite involved to add through the high-level estimator APIs.
It's time to read the Wide & Deep Learning paper. One similarity to our setup: the paper mentions that a 32-dimensional embedding vector is learned for each categorical feature.
- I used `leaky_relu` previously, as I had better performance with it on certain datasets. Using `relu` now to model the paper directly.
- Using `Adagrad` now, as per the paper.
- Keeping `batch_norm` on still, as it "always" helps.
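For intuition, a learned embedding is just a trainable lookup table: each categorical value indexes one row of a matrix, and those rows are updated during training. TensorFlow's `tf.feature_column.embedding_column` handles this for you; a toy sketch with hypothetical sizes:

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """One trainable row of `dim` floats per categorical value."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

# e.g. a 32-dimensional embedding per pickup zone, as in the paper
table = make_embedding_table(vocab_size=266, dim=32)
vector_for_zone_80 = table[80]  # the 32-d vector the model learns for zone 80
```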
```
loss                  3.5321705
accuracy_baseline     0.82373124
global_step           147832
recall                0.9923176
auc                   0.9944181
prediction/mean       0.8270121
precision             0.99146277
label/mean            0.82373124
average_loss          0.05519077
auc_precision_recall  0.9987084
accuracy              0.9866333
```
Hmm. Too good to be true? We need to dive deeper into this. 7484 trips that had a tip were classified as a trip with no tip, and 8324 trips that had no tip were classified as a trip with a tip. Now these results are a lot more interesting. This gives us an effective accuracy of 99.2% for trips with tips, and 96% for trips without tips.
I was convinced initially, but then dug deeper into the features I was using. I forgot to remove `total_amount`, which meant the model could have learnt to subtract `fare_amount` (and the surcharges) from `total_amount` to derive whether there was a tip or not. True enough, that was what happened.
I added more features this time by using the crossed column concept from the paper. I could also add `trip_type` in the future, but this requires cleaning the datasets again (the code here can definitely be improved so that this won't happen). That could be future work.
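A crossed column hashes the joint value of two (or more) categorical features into a fixed number of buckets, so the model can learn effects specific to a pickup-dropoff pair. A toy version (TensorFlow's `tf.feature_column.crossed_column` uses its own internal hash, and the bucket count here is hypothetical):

```python
def crossed_bucket(pu_location, do_location, num_buckets=10_000):
    # each (pickup, dropoff) pair lands deterministically in one bucket;
    # different pairs may collide, which is the usual hashing trade-off
    return hash((pu_location, do_location)) % num_buckets

bucket = crossed_bucket(74, 41)  # the pair from the sample trip above
```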
```
loss                  25.42006
accuracy_baseline     0.82373124
global_step           147832
recall                0.9817806
auc                   0.75852734
prediction/mean       0.82457703
precision             0.84597015
label/mean            0.82373124
average_loss          0.3971928
auc_precision_recall  0.924248
accuracy              0.83774394
```
We will use this as our final model and proceed to the regression.
For the regression model, we use the exact same parameters as the classifier. Technically, the weights for these models can be shared, but for now, we simply train them as a different model altogether. There aren’t any experiments needed as most of them have been knocked out in the classifier.
We can see most of the values clustered in the bottom-left quadrant. However, this graph doesn't really make clear what the distribution of points is like.
We plot the empirical CDF, and it clearly shows that most of the errors are actually really small. Digging deeper into the values, we see that 93% of the errors are between +-2. I had a suspicion from the beginning that this would happen - for those who tip, most would just tip the standard rate of 15% - 20% in NY, perhaps skewed towards 20% as it is easier to compute. This actually means that a simple regressor could do really well too. In any case, the RMSE for this model is 2.316. It's pretty decent - happy to stick with it.
What if we could deploy this model as a component of a mobile app solution that helps the taxi driver estimate the expected tip for a particular trip?
If the features are all available at the point of pick up, the data can be sent from the client to the server for processing. At the server, we will predict whether or not there will be a tip and how much the tip will be.
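That server-side step could combine the two models like this (a sketch; `classify` and `regress` are stand-ins for the trained `DNNClassifier` and `DNNRegressor`, and the cash handling mirrors the pruning step from training):

```python
CASH = 2  # payment_type code for cash trips, where tips aren't recorded

def predict_tip(features, classify, regress):
    """Return the expected tip for one trip's feature dict."""
    if features.get("payment_type") == CASH:
        return 0.0                  # cash trips are cast to no tip, as in training
    if not classify(features):      # stage 1: will a tip be paid at all?
        return 0.0
    return regress(features)        # stage 2: expected tip amount
```

Sharing weights between the two stages, as noted above, would be a further refinement.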
Implementing interpretability models would allow us to ascertain exactly which levers to pull to make a trip one with a tip. For example, if we identify that for a particular trip a significantly shorter duration would flip the prediction, we might want to flag this to the driver. The driver might then speed up a little so as to earn that extra tip. The downsides, of course, are that this might encourage dangerous driving, and the passenger might end up not tipping because of it. These are just ideas, of course.
The fundamental principle is to predict whether or not there will be a tip, and then send the driver the levers they can pull to earn or maximize the tip.
This was a pretty interesting exercise - I suspect nothing complicated was really needed. We could really try our best to push the limits of the accuracy of this, but it would take too much time. Tipping is perhaps based on the driver and the passenger, more than the distance of the journey or the features given in this dataset.