Sometimes the very way data is represented can, by itself, provide a huge amount of information and point you towards a good analysis. In this article, I will walk through some interesting plotting methods provided by R, which are pivotal when you are working with geodata.

I will use the famous NYC Taxi Dataset, which contains many data frames with a huge amount of data. If you want to download and prepare the data yourself, the whole process is explained here. After a first cleaning pass, you can combine the tables according to your task. Here, I joined some tables to obtain the following information:

- Pickups and dropoffs coordinates
- Number of passengers
- Tip amount
- Trip distance
- Payment type
- Neighborhood code
- Neighborhood name
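The join itself can be sketched roughly as follows. Note that the table and column names below (`trips`, `neighborhoods`) are hypothetical, since they depend on how you prepared the raw files:

```r
# a rough sketch of the join, assuming hypothetical table names:
# 'trips' holds the cleaned taxi records and 'neighborhoods' maps
# each borough code to its borough name
library(dplyr)

sample = trips %>%
  select(pickup_longitude, pickup_latitude,
         dropoff_longitude, dropoff_latitude,
         passenger_count, tip_amount, trip_distance,
         payment_type, borocode) %>%
  left_join(neighborhoods, by = "borocode")  # adds the boroname column
```

A left join keeps every trip even when a borough code has no match, which makes missing neighborhood information easy to spot afterwards.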

Here are the first few rows of my resulting dataset:

Now I want to introduce some graphical tools. After installing the *ggplot2* library, I used the Staten Island map coordinates (again, this dataset can be obtained by following the instructions linked above) to display NYC and its boroughs, with two different separation techniques:

ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, colour = BoroName))

ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               alpha = 1/5, colour = "black")

Now I’m going to plot the coordinates of pickup (in green) and dropoff (in red) points:

# I added transparency to the red dropoff points so they don't hide the green ones
ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               color = "#080808", alpha = 1/10) +
  coord_cartesian(xlim = c(-74.1, -73.7), ylim = c(40.5, 40.95)) +
  geom_point(data = sample, aes(x = pickup_longitude, y = pickup_latitude),
             size = 1, fill = 'green', colour = "black", pch = 21) +
  geom_point(data = sample, aes(x = dropoff_longitude, y = dropoff_latitude),
             size = 1, fill = 'red', alpha = 1/10, colour = "black", pch = 21)

Now you can really unleash your creativity and add new variables. For instance, I’m interested in investigating tip amounts, since they might depend on the neighborhood of the pickup point:

ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               color = "#080808", alpha = 1/10) +
  coord_cartesian(xlim = c(-74.1, -73.7), ylim = c(40.5, 40.95)) +
  geom_point(data = sample,
             aes(x = dropoff_longitude, y = dropoff_latitude, size = tip_amount),
             color = 'blue', alpha = 1/5)

Again, I could add another variable, or color the dots depending on the pickup neighborhood:

# adding payment type as a new variable
ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               color = "#080808", alpha = 1/10) +
  coord_cartesian(xlim = c(-74.1, -73.7), ylim = c(40.5, 40.95)) +
  geom_point(data = sample,
             aes(x = dropoff_longitude, y = dropoff_latitude,
                 size = tip_amount, colour = payment_type))

# using different colours depending on the pickup neighborhood
ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               color = "#080808", alpha = 1/10) +
  coord_cartesian(xlim = c(-74.1, -73.7), ylim = c(40.5, 40.95)) +
  geom_point(data = sample[!sample$boroname == 'New Jersey', ],
             aes(x = pickup_longitude, y = pickup_latitude,
                 size = tip_amount, fill = boroname),
             colour = "black", pch = 21)

Finally, let’s have a look at the longest trip:

longest_trip = max(sample$trip_distance)

ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               color = "#080808", alpha = 1/10) +
  coord_cartesian(xlim = c(-74.1, -73.7), ylim = c(40.5, 40.95)) +
  geom_point(data = sample, aes(x = pickup_longitude, y = pickup_latitude),
             size = 0.5, color = 'green') +
  geom_point(data = sample, aes(x = dropoff_longitude, y = dropoff_latitude),
             size = 0.5, color = 'red') +
  geom_curve(data = sample[sample$trip_distance == longest_trip, ],
             aes(x = pickup_longitude, y = pickup_latitude,
                 xend = dropoff_longitude, yend = dropoff_latitude),
             color = 'blue', arrow = arrow(length = unit(0.25, "cm")))

Nice, I’ve had my fun. Now let’s make sense of these data.

What I’m going to do now is apply specific algorithms to solve some tasks. I will use Random Forest to predict the tip amount given the trip distance, the pickup neighborhood and the payment type. Then, a Support Vector Machine will help me predict which trips are more likely to be paid in a specific way (cash, credit, etc.) depending on tip amount and trip distance.

Be aware that the combinations of features I’ve just described are only some of the many you can arrange, depending on your task and the complexity of the model you want to build.

So, let’s start with Random Forest. Random Forest is an algorithm which combines several decision trees and aggregates their predictions. To provide a foretaste of how a single decision tree works, let’s look at the following output (here, I trained one tree on my sample dataset):

install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

sample$boroname = factor(sample$boroname)
sample$payment_type = factor(sample$payment_type)

single_tree = rpart(tip_amount ~ pickup_longitude + pickup_latitude +
                      dropoff_longitude + dropoff_latitude,
                    data = sample)
prp(single_tree)

One of the reasons why decision trees are so popular is that they mimic how the human brain approaches the decision process. Let’s read a branch of this tree together, starting from the root.

“Is the longitude of the dropoff point smaller than -74? If yes, is the longitude of the pickup point smaller than -74? If yes, the tip you are supposed to give the driver is $0.39.” (Please note that the split value is displayed truncated at -74, but the algorithm is, of course, also considering the decimals; otherwise the two consecutive splits wouldn’t make sense.)
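If reading the plotted tree out loud feels awkward, rpart.plot can also print each leaf as a plain if/then rule, using the `single_tree` model fitted above:

```r
# print every leaf of the fitted tree as a readable rule;
# roundint = FALSE keeps the split values at full precision
library(rpart.plot)
rpart.rules(single_tree, roundint = FALSE)
```

Each printed row is one root-to-leaf path, so the branch we just read appears as a single line.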

Now let’s imagine an algorithm capable of building thousands of trees and then aggregating their predictions into an optimal combination. This is exactly what Random Forest does, and, in this case, we will train and combine 2000 trees.

install.packages("randomForest")
library(randomForest)

Before training the algorithm, let’s split our data into a training and a test set:

install.packages("caTools")
library(caTools)

set.seed(123)
split = sample.split(sample$payment_type, SplitRatio = 0.75)
train = subset(sample, split == TRUE)
test = subset(sample, split == FALSE)

Now we can fit our model on our train set:

model = randomForest(tip_amount ~ trip_distance+boroname+payment_type, data=train, importance=TRUE, ntree=2000)
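Since the model was fitted with `importance=TRUE`, it is worth checking which predictors matter most before moving on to predictions:

```r
# %IncMSE measures how much the prediction error grows when a
# variable is randomly permuted: higher means more important
importance(model)

# the same information as a dot chart, one panel per importance measure
varImpPlot(model)
```

This is a quick sanity check: if a feature you expected to matter ranks near zero, the model is effectively ignoring it.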

Finally, let’s make predictions on our test set:

Prediction = predict(model, test)

As for any predictive task, the first thing to do is inspect the residuals, which ideally should be normally distributed.

res = test$tip_amount - Prediction
plot(res)

The residuals do not exhibit a specific pattern, which suggests our model is not missing available information.
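Beyond the scatter plot, a histogram and a Q-Q plot give a quicker visual check of whether the residuals `res` look roughly normal:

```r
# histogram of residuals: should look roughly bell-shaped around zero
hist(res, breaks = 50, main = "Residuals", xlab = "tip_amount - prediction")

# Q-Q plot: points should fall close to the reference line if
# the residuals are approximately normal
qqnorm(res)
qqline(res)
```

Heavy tails or strong skew in these plots would suggest the model systematically under- or over-predicts some trips.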

Now we can start visualizing our results on our maps. In the following plot, the color indicates the pickup neighborhood, while the size of each dot is proportional to the error term.

# storing the absolute residuals in the test set so ggplot can map them to size
test$residuals = abs(res)

ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               color = "#080808", alpha = 1/10) +
  coord_cartesian(xlim = c(-74.1, -73.7), ylim = c(40.5, 40.95)) +
  geom_point(data = test[!test$boroname == 'New Jersey', ],
             aes(x = dropoff_longitude, y = dropoff_latitude,
                 size = residuals, fill = boroname),
             colour = "black", pch = 21)

The first thing we can gather from here is that only two drivers leaving from the Bronx received a tip; nevertheless, the algorithm was able to predict the amount of those two tips well (their dot size is close to zero).

Now let’s proceed with our second task: predicting payment type. As anticipated, I will use an SVM, which is particularly suitable for classification tasks (if you are interested in a full experimental setup using SVM, you can read my article here).

install.packages("e1071")
library(e1071)

model_2 = svm(payment_type ~ tip_amount + trip_distance, data = train,
              scale = TRUE, kernel = "radial", cost = 5)
payment_prediction = predict(model_2, test)

Let’s check its accuracy:

# computing a vector of TRUE/FALSE values
z = payment_prediction == test$payment_type
length(z[z == TRUE])
accuracy = length(z[z == TRUE]) / length(test$payment_type)
accuracy
[1] 0.8047987
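Overall accuracy hides which classes get confused with which; a confusion matrix, built with a simple `table()` call on the predictions computed above, breaks the errors down per payment type:

```r
# rows: predicted class, columns: actual class;
# off-diagonal counts are the misclassified trips
table(predicted = payment_prediction, actual = test$payment_type)
```

If one class dominates the data, a model can reach high accuracy while misclassifying the rare classes entirely, and this table makes that visible immediately.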

Some considerations are in order. First, in the original dataset, there were four methods of payment (cash, credit, no charge and… dispute). However, the most frequent were the first two, and I want to check whether I can restrict my analysis to them.
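A quick frequency count shows how dominant the first two payment types actually are (this assumes payment types are coded as integers 1–4, as in the raw TLC data):

```r
# count trips per payment type; if types 3 and 4 are rare,
# restricting the analysis to cash and credit loses little information
table(sample$payment_type)
```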

# adding the true/false vector to the dataframe
test$accuracy = c(z, rep(NA, length(test$payment_type) - length(z)))

# adding predictions to the dataframe, taking care of converting integer values
# into the original labels
test$temporary = c(payment_prediction,
                   rep(NA, length(test$payment_type) - length(payment_prediction)))

# checking whether to restrict the analysis to 'cash' and 'credit'
test[test$temporary == 3, ]
test[test$temporary == 4, ]

# since the output is an empty dataframe, I can work with just two labels
# replacing numbers with labels
test$payment_predictions = factor(test$temporary, levels = c(1, 2),
                                  labels = c("cash", "credit"))

# removing the temporary column
test = subset(test, select = -temporary)

Now that I have added my predictions column to the test set, let’s plot these predictions on our map:

# plotting prediction accuracy at each dropoff point
ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               color = "#080808", alpha = 1/10) +
  coord_cartesian(xlim = c(-74.1, -73.7), ylim = c(40.5, 40.95)) +
  geom_point(data = test,
             aes(x = dropoff_longitude, y = dropoff_latitude, colour = accuracy))

# plotting the predicted payment type at each dropoff point
ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               color = "#080808", alpha = 1/10) +
  coord_cartesian(xlim = c(-74.1, -73.7), ylim = c(40.5, 40.95)) +
  geom_point(data = test[!is.na(test$payment_predictions), ],
             aes(x = dropoff_longitude, y = dropoff_latitude,
                 colour = payment_predictions))

And, finally, let’s check the distribution of well predicted values:

ggplot() +
  geom_polygon(data = ex_staten_island_map,
               aes(x = long, y = lat, group = group, fill = BoroName),
               color = "#080808", alpha = 1/10) +
  coord_cartesian(xlim = c(-74.1, -73.7), ylim = c(40.5, 40.95)) +
  geom_point(data = test[!is.na(test$accuracy), ],
             aes(x = dropoff_longitude, y = dropoff_latitude, colour = accuracy))

Now let’s say next summer I want to go to NYC. I will land at JFK, and the first place I want to visit is Times Square. I’m not familiar with local habits, and I’m wondering what a fair tip would be.

The first thing to do is add the information I know about my trip to my data frame, then use the single tree I trained before to predict the fair tip I’m supposed to offer the driver:

newRow = data.frame(pickup_longitude = -73.77814, pickup_latitude = 40.64131,
                    dropoff_longitude = -73.985130, dropoff_latitude = 40.758896,
                    passenger_count = 1, trip_distance = 14.7,
                    tip_amount = NA, payment_type = NA,
                    borocode = 4, boroname = "Queens")
sample = rbind(sample, newRow)

# predict on the newly added row (the last one in the data frame)
my_tip = predict(single_tree, sample[nrow(sample), ])

Let’s display the result:

my_tip

Great, I will keep it in mind while paying for my ride.

My journey to NYC aside, what’s really important to keep in mind is that visualizing your data can be as important as inspecting and running analytics on them. In this article, we were able to display geodata on a map, and the maps then suggested some interesting relationships among the variables.

Hence, always remember to set up your analysis environment properly so that you can gather significant information from your data.