What can you learn in a data-for-good coding marathon?


It was promising to be a beautiful, warm weekend in the city. I had signed up for the DataKind data dive. That meant spending the entire weekend nerding out on data to help non-profit organizations. Friends had invited me to go on hikes, to the beach, to drink beer… you know, fun stuff. Yet I chose to be in a downtown office full of people I mostly didn't know. Was it worth it? Read on to find out.

Continue reading


A Comprehensive Analysis of a Very Large Uber Dataset.


Taxis Versus Uber: The NYC Armageddon.

Part 1: Insights from Data Exploration and Visualization

Early in 2017, the NYC Taxi and Limousine Commission (TLC) released a dataset about Uber’s ridership between September 2014 and August 2015. This dataset contains features such as destination, trip distance, and duration that were not available in earlier releases, which others have already analyzed thoroughly.

The combination of trip distance and duration allows for estimating Uber’s revenue for each trip in NYC. On the other hand, the pickup and drop-off locations were anonymized and grouped into taxi zones instead of being given as geographic coordinates. This does more to preserve data privacy, but it makes it impossible to plot the exact locations on a map.
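To illustrate how distance and duration can combine into a per-trip estimate, here is a minimal sketch. It assumes a simple linear fare model (base fare plus a per-mile and a per-minute charge); the three rates and the input column layout are hypothetical placeholders, not Uber’s actual pricing or the TLC file format.

```shell
# Hypothetical fare parameters: placeholders only, NOT Uber's real rates.
base_fare=2.00
per_mile=1.50
per_minute=0.30

# Hypothetical input layout: trip_id,distance_miles,duration_minutes
trips=$(mktemp)
printf 'trip_id,distance_miles,duration_minutes\n1,2.0,10\n2,5.0,20\n' > "$trips"

# Skip the header (NR > 1) and apply: base + per_mile * miles + per_minute * minutes
estimates=$(awk -F, -v b="$base_fare" -v m="$per_mile" -v t="$per_minute" \
  'NR > 1 { printf "%s,%.2f\n", $1, b + m * $2 + t * $3 }' "$trips")

echo "$estimates"
```

Because awk streams the file line by line, the same one-liner would work on a file far larger than memory; only the rates and column positions would need to change.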

Continue reading

Helpful Linux commands when working with large datasets.


The penguin is our best friend.

Every time I work with files whose content I cannot easily preview because their size (usually gigabytes or more) prevents the data from being quickly loaded or processed locally, GNU/Linux is my best friend. I demonstrate below a set of very useful GNU tools on macOS.
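As a taste of the idea, here is a minimal sketch of inspecting a file without loading it into memory. The tiny sample file and its columns are made up for illustration; the commands themselves (`head`, `tail`, `wc`, `cut`, `sort`, `uniq`) are standard and also ship with macOS in their BSD variants.

```shell
# Small stand-in file; in practice this would be a multi-gigabyte CSV.
sample=$(mktemp)
printf 'id,zone,miles\n1,A,2.5\n2,B,0.8\n3,A,4.1\n' > "$sample"

# Peek at the beginning and the end without reading the whole file.
head -n 2 "$sample"
tail -n 1 "$sample"

# Count lines; tr strips the padding that macOS wc adds.
line_count=$(wc -l < "$sample" | tr -d ' ')
echo "lines: $line_count"

# Frequency of each value in the second column, header skipped.
zone_counts=$(cut -d, -f2 "$sample" | tail -n +2 | sort | uniq -c)
echo "$zone_counts"
```

Each command reads the file as a stream, so memory use stays small regardless of file size.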

Continue reading