
Welcome to Practicum 6!

Today we continue working with real data: all BlueBike trips from August 1, 2019. We will learn about outliers in the data, and a few things about the habits of BlueBike users.

Start by downloading the starter file and the data set (it's best to right-click each link, choose "Save link as..." or "Download linked file as...", and save both files to the same folder). The dataset has information about 337,443 trips. Each trip is described in one line of the file using three properties: its start time, its duration in seconds, and the distance between the start and the end station in miles.

For example, a line that looks like this:

2019-08-01 00:01:37.8320,1227,3.679

describes a trip that started 1 minute and 37 seconds after midnight, lasted 20 minutes and 27 seconds, and was 3.679 miles long (the actual ride was longer, since this is the straight-line distance between the two stations).
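To make the format concrete, here is a minimal sketch of how one such line can be split into its three values. It is only an illustration and not necessarily how the loading function in the starter file works; all variable names are made up for this example.

```python
# Illustration only: split the example line into its three values.
line = "2019-08-01 00:01:37.8320,1227,3.679"

parts = line.split(",")       # three strings: timestamp, duration, distance
start_time = parts[0]         # keep the timestamp as a string
duration = int(parts[1])      # duration in seconds
distance = float(parts[2])    # straight-line distance in miles

minutes = duration // 60      # whole minutes
seconds = duration % 60       # remaining seconds
print(start_time, minutes, seconds, distance)
```

The integer division and the remainder confirm the reading above: 1227 seconds is 20 minutes and 27 seconds.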

You will see in the starter file that I already wrote the function for loading the data. We will learn more about loading data next week; for now, just have a look at it and see if you can understand how it works. Run the starter file to see what the data looks like after loading: it's a list of lists, and each inner list has three values.

List comprehension

In previous exercises we computed average values of grades and temperatures using our own function that looked something like this:

def avg(any_list):
    return sum(any_list)/len(any_list)

That function takes a list of values as an argument. How can we use it on our new dataset to calculate the average duration of a trip? We first need to create a list that contains only the durations (the element at index 1, that is, the second element, of each of the inner lists in our dataset):

durations = []
for row in data:
    durations.append(row[1])
print(avg(durations))

As you can see, creating this list took us three lines of code.

List comprehension allows us to do the same operation in one line, making the code cleaner and easier to read:

durations = [row[1] for row in data]
print(avg(durations))

A list comprehension has the format [statement for element in list], where statement is any expression that produces a value. The result is a new list whose elements are the values of statement for each consecutive element of list. In our example, the statement is row[1]: take the element at index 1 from row, for each row (element) in data (list).
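The format can be tried out on a tiny example first. The sketch below uses a made-up three-row list in the same shape as our dataset; only the first row comes from the example line above, and the other rows and all names are invented for illustration.

```python
# A tiny made-up dataset: [start time, duration in seconds, distance in miles]
sample = [
    ["2019-08-01 00:01:37.8320", 1227, 3.679],
    ["2019-08-01 00:05:00.0000", 600, 1.5],
    ["2019-08-01 00:10:00.0000", 300, 0.0],
]

# Take the element at index 1 (the duration) from every row.
sample_durations = [row[1] for row in sample]
print(sample_durations)   # [1227, 600, 300]

# The statement can also compute a new value from each row,
# for example the duration converted from seconds to minutes.
sample_minutes = [row[1] / 60 for row in sample]
print(sample_minutes)     # [20.45, 10.0, 5.0]
```

Note that the original list is not modified; each comprehension builds a new list of the same length.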

Exercise 1: Summary statistics of the dataset

Similarly to our avg() function, the max() and min() functions need a flat list of values, not a list of lists, to give a meaningful answer here.

In your main function use list comprehension and print:

  1. minimum, maximum, and average distance traveled;
  2. average speed of the trips in miles per hour. This is a very rough estimate, since the distance is measured between stations and is not the distance actually biked. Tip: do not create separate lists for duration and distance; instead, create one list called speeds in which you calculate the speed for each trip (row), and then take the average of that list.

List comprehension with conditionals

As you noticed in the previous exercise, the maximum distance of over 5268 miles is probably erroneous: the reported distance is between the starting and the ending station, and the two most distant stations in Boston are only 10 miles apart. Data points like this are called outliers.

Most datasets that you find online will need some cleaning and outlier removal, and we can use list comprehension with conditionals for that purpose. The format looks like this: [statement for element in list if condition]. It means that the statement is evaluated, and its result added to the new list, only for those elements that meet the condition.
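As a small sketch of the conditional format, the example below filters a made-up list of temperature readings. The value -999 is a hypothetical marker for a failed sensor reading, chosen only for this illustration.

```python
# Made-up temperature readings; -999 is a hypothetical "sensor error" marker.
temperatures = [71, 68, -999, 75, 80, -999, 66]

# Keep only the readings that satisfy the condition.
valid = [t for t in temperatures if t > -100]
print(valid)                            # [71, 68, 75, 80, 66]

# Comparing the lengths tells us how many readings were dropped.
print(len(temperatures) - len(valid))   # 2
```

Here the statement is simply t, so each kept element is copied unchanged; only the condition decides what ends up in the new list.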

Exercise 2: Removing outliers

Create a function that takes the dataset as its input argument and uses list comprehension to create a new dataset without the rows that contain outliers:

def remove_outliers(data):
    clean_data = [row for row in data if True]
    return clean_data

We want this function to return the dataset in the same format, where each row is a list of three values, just without the erroneous rows.

You can copy the code above and replace True with the correct condition (one that checks whether the distance is possible; no other modifications are necessary).

Then call this function in your main function, save the cleaned data to a new variable, and answer:

  1. Is the average traveled distance different for the clean data?
  2. How many data points did we remove?

List comprehension to count values

We can also use list comprehension to find the fraction of data points that match a certain criterion, like this:

  1. Create a list of booleans (True if the criterion is met, False if not)
  2. Sum this list: True will be treated as 1 and False will be treated as 0, so the sum will be the number of rows that meet the criterion.
  3. Divide that sum by the length of the list. That gives us the fraction of rows that meet the criterion.

For example, to calculate the fraction of trips that started and ended at the same station (the reported distance is 0), we would write:

fraction = sum([row[2] == 0 for row in data])/len(data)

Let's break it down:

[row[2] == 0 for row in data]

is a list of boolean values: True if a given distance is equal to 0 and False if it's anything other than 0.

When we sum that list, Python will treat True values as 1's and False values as 0's. Therefore, the sum is the count of trips with 0 distance.

Finally, by dividing this count by the length of the dataset, we get the fraction of 0-distance trips among all trips.
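The same three steps can be followed one at a time on a small made-up list. The grades and the passing threshold of 60 below are invented for this sketch.

```python
# Made-up grades; 60 is a hypothetical passing threshold.
grades = [90, 55, 78, 61, 49]

# Step 1: a list of booleans, True where the criterion is met.
passed = [g >= 60 for g in grades]
print(passed)                      # [True, False, True, True, False]

# Step 2: summing counts the True values.
print(sum(passed))                 # 3

# Step 3: dividing by the length gives the fraction.
print(sum(passed) / len(grades))   # 0.6
```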

Exercise 3: Counting values

Single BlueBike trips below 30 minutes cost $2.50. What fraction of trips cost that little?

Exercise 4: Rush hours (additional, non-graded exercise if you finished the previous ones)

In this exercise we will find out when the peak traffic hours were on August 1, 2019. Using list comprehensions will be helpful, but it might be very difficult to do it all in one line, so feel free to mix "usual" for loops and list comprehensions.

  1. Create a function that takes the timestamp string as its argument and extracts just the hour as an integer; for example, when given 2019-08-01 00:01:37.8320 it returns 0.
    Tip 1: The timestamps are strings of length 24. Tip 2: Strings behave much like lists of characters, so you can use list indexing to access particular characters or character ranges.
  2. Create another function for this exercise. In that function:
    1. Loop over the possible hours 0-23. For each hour, use list comprehension to count the number of trips that started at that hour, and store that number in a list.
    2. Sort the list and find the four hours that have the most traffic.
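As a reminder of how indexing works on strings (Tip 2), here is a minimal sketch on an unrelated, made-up string. The string and the variable names are hypothetical; finding the right positions in a timestamp is left to you.

```python
# A made-up string, used only to demonstrate indexing and slicing.
code = "room-042-east"

print(code[0])      # r     (a single character)
print(code[0:4])    # room  (characters at indexes 0, 1, 2, 3)
print(code[5:8])    # 042   (still a string, not a number)

# int() converts a string of digits to an integer.
number = int(code[5:8])
print(number)       # 42
```

A slice [a:b] includes the character at index a and stops just before index b, exactly as it does for lists.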