Project

By next week, you should have a team ready and an initial project proposal (a one- to two-page summary) that answers the following:

What question are you interested in answering? Really start with this - whatever topic actually interests you can be investigated from the data perspective: computer science, music, sports, social justice, finance, and so on.
What could we use your answers for? This is related to why you actually find this problem interesting - once we know the answers from the data, what can we achieve that we couldn't achieve before?
What data will you use to answer this question? We will learn more about obtaining data, so a ready-made dataset doesn't have to exist yet. Ideally you would combine information from multiple sources.
What could be the shortcomings of using this data to answer this question? Think about the ways the data could be incomplete (for example: not everybody uses Twitter, so maybe Twitter is not a perfect source on public opinion) or biased (for example: there are many more men than women who ride blue bikes - is it something about bikes in general, or blue bikes in particular, or maybe it's a societal problem, or maybe a problem of infrastructure?).
What are the potential side effects of the project? Given the limitations of the data and the stated purpose of our answers, what are the potential side effects if we actually acted on what we learned?

It doesn't have to be perfect; this is just to start a conversation about the project before you dig deep into the code.

The practicum proper

Today we continue working with real data: all BlueBike trips from August 2019. We'll learn about loading data from a file, saving it to another file, and a few more things about the habits of BlueBike users.

You can download the solution to last week's Exercise 4 problem here: pr06_solution.py.

Start by downloading the starter file and the data set (it's best if you right-click the links and choose "Save link as..." or "Download linked file as..." and save both to the same folder). The dataset has information about 337,443 trips. Each trip is described in one line of the file using five properties: its start time, duration in seconds, distance between the start and the end station in miles, self-reported year of birth of the user, and self-reported gender of the user (0 - female, 1 - male, 2 - other/unknown).

Exercise 1: Loading data from file

Using the load_data() function from last week as an example, write a new load_data() function that also reads in the year of birth and the gender of the user.

In the main() function, load the data and, using list comprehension, report what fraction of trips are taken by users of each gender (female, male, unknown/other).
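As a reminder of the technique, here is a minimal sketch of computing a fraction with a list comprehension. It uses a made-up list of trip durations rather than the real dataset, and the names durations, long_trips, and fraction_long are only illustrative.

```python
# Toy data: trip durations in seconds (made up for illustration).
durations = [300, 1250, 640, 95, 2210]

# Keep only the trips longer than 10 minutes (600 seconds).
long_trips = [d for d in durations if d > 600]

# Fraction = number of matching trips / number of all trips.
fraction_long = len(long_trips) / len(durations)

print('Fraction of long trips:', fraction_long)
```

The same pattern (filter with a condition, then divide the two lengths) should carry over to any property stored in your loaded data.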

String formatting

Until recently, we have been printing values of variables by passing them to print() separated by commas, like this:

				timestamp = '2019-08-01'

				trip_count = 1000

				avg_distance =  4.123512364

				print('There were', trip_count, 'trips during', timestamp)

				print('The average distance was', avg_distance)

or with pluses like this:

				timestamp = '2019-08-01'

				trip_count = 1000

				avg_distance =  4.123512364

				print('There were ' + str(trip_count) + ' trips during ' + timestamp)

				print('The average distance was ' + str(avg_distance))

The first method was convenient because we didn't have to remember to add spaces or convert integers to strings, but it was also not very flexible (what if we don't want those spaces?). The second method was a bit more flexible but problematic because we had to convert numbers to strings ourselves.

Neither of these methods allowed for controlling precision (the number of digits after the decimal point), often leading to ugly outputs.

All these problems can be addressed with string formatting, like this:

				timestamp = '2019-08-01'

				trip_count = 1000

				avg_distance =  4.123512364

				print('There were {} trips during {}'.format(trip_count, timestamp))

				print('The average distance was {:.2f}'.format(avg_distance))
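To see how the format specification inside the curly braces changes the output, here is a small sketch with made-up values. The variable names station_count and share are only for illustration and are not part of the starter file.

```python
station_count = 12
share = 0.4567891

# {} fills in values in order; numbers are converted to text automatically.
line1 = 'Found {} stations'.format(station_count)

# {:.3f} keeps three digits after the decimal point (the value is rounded).
line2 = 'Share: {:.3f}'.format(share)

# {:.0f} rounds to a whole number; nothing is added between the value and '%'.
line3 = 'Share: {:.0f}%'.format(share * 100)

print(line1)  # Found 12 stations
print(line2)  # Share: 0.457
print(line3)  # Share: 46%
```

Changing the number after the dot changes how many digits are kept, so the same idea works for any precision you need.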

Exercise 2: printing statistics

Update your code for printing the fraction of trips taken by users of each gender so that it shows the percentage down to two decimal digits.

Printing to file

In Python, to save some data to a file, we first open the file for writing (that's what 'w' means), then write content to it using the write() method, and then close it, like this:

				fout = open('filename','w')

				fout.write('The content we are writing, and a new line symbol \n')

				fout.close()

Note that unlike print(), write() does not automatically start a new line, so if we want to write multiple lines, we need to end each with the new line symbol: '\n'.

REMEMBER: using open(filename, 'w') opens the file for writing. That means that if the file doesn't exist, it is created. However, if the file already exists, its contents are removed upon opening without a warning!
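The following sketch illustrates both points: ending each line with '\n' when writing in a loop, and the fact that reopening a file with 'w' erases what was there. The file name write_demo.txt and the toy rows are made up, and the file is placed in the system's temporary folder (an assumption of this example) so that nothing important gets overwritten.

```python
import os
import tempfile

# Hypothetical file name in a temporary folder.
filename = os.path.join(tempfile.gettempdir(), 'write_demo.txt')

rows = [('apple', 3), ('pear', 5)]

# Write one line per row; each line must end with '\n' ourselves.
fout = open(filename, 'w')
for name, count in rows:
    fout.write('{},{}\n'.format(name, count))
fout.close()

# Read the file back to check what was written.
fin = open(filename)
first_content = fin.read()
fin.close()

# Opening with 'w' again silently erases the two lines written above.
fout = open(filename, 'w')
fout.write('only line\n')
fout.close()

fin = open(filename)
second_content = fin.read()
fin.close()

print(first_content)
print(second_content)
```

After the second open(), only the last line remains in the file, which is why it is worth double-checking the file name before opening anything for writing.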

Exercise 3: writing to file

When developing data analysis code, it's easier and faster to first work on a small subset of data, rather than the entire dataset at once. One way to create a smaller dataset is to split the big data by date.

Let's do that with our BlueBike data: let's create a smaller dataset that only has data about the first week of August (from 2019-08-01 to 2019-08-07) and save it to a new file.

A useful trick for subsetting data by timestamp is to use the comparison operator < with strings. It compares two strings character by character and tells us whether they are in alphabetical (dictionary) order, for example:

'a' < 'b' is True,
'c' < 'a' is False,
'2019-08-09 00:00:01' < '2019-08-09 00:00:02' is True,
'2019-08-09 00:00:01' < '2019-08-10' is ... ?
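The trick relies on the timestamps being zero-padded (08 rather than 8), so that dictionary order matches chronological order. A small sketch with made-up dates, where all variable names are illustrative:

```python
# Strings are compared character by character, left to right.

# Zero-padded dates: the first difference is '0' vs '1', so the order is right.
padded = '2019-08-09' < '2019-08-10'

# Without padding: the first difference is '9' vs '1', so the order is wrong.
unpadded = '2019-8-9' < '2019-8-10'

# Sorting zero-padded date strings puts them in chronological order.
ordered = sorted(['2019-08-10', '2019-08-09', '2019-07-31'])

print(padded)    # True
print(unpadded)  # False
print(ordered)   # ['2019-07-31', '2019-08-09', '2019-08-10']
```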

Step 1: create a subset

Create a new variable that holds only the first week of data.

Step 2: print to file

Open a new file for writing.
Use string formatting and the .write() method to write each row of your new data to the file - don't use list comprehension, use a standard for loop where you write one line at a time in the loop body.
Close the file when the loop is over.
Make sure the file is in the same format as the original dataset (the same order of columns, divided by commas with no spaces, etc.)

Step 3: did it work?

Open the folder with your code in Finder/Explorer and see if the new file is there.

Modify the load_data() function so that it takes a filename as an argument, rather than having bluebikes_07.csv hardcoded in the body of the function.

Use your newly changed function to load the first week of data from file and print out the number of lines (it should be 80850).

Exercise 4: Let's learn more about how people use BlueBikes (not graded, but will help a lot with the project)

Answer more questions about the new dataset (just the first week):

Are there more trips on a weekday, or on a weekend?
Are the rush hours the same on Monday, Friday, and Sunday?
What is the average age of riders of each gender?
Do people travel slower on the weekends?
What other questions would you ask?