Python for Beginners: Reading & Manipulating CSV Files

Hi! I'm Greg Sadetsky and I teach Python classes at ProTech. Today, we'll be doing a little bit of Python programming. This tutorial is designed for anyone who is interested in Python, with little to no experience, and curious to learn what's possible with a few basic programming skills.

You can watch the full tutorial below, or skip to the individual sections, right after the table of contents.

Introduction

Here is an example situation: you are the organizer of a party and have hosted this event for two years. You have CSV (comma-separate values) files for both years listing each year's attendees. You would like to know which attendees attended the second bash, but not the first.

Environment setup

We're off to a great start! Err... do you have Python installed on your computer? If you're running under Mac OS X, open the Terminal application from the Utilities directory and type python

Windows users should follow this article to install it. My recommendation would be to get the latest 2.7 version (2.7.6, for instance).

Once you've got the Python executable running, you should see a line beginning with three greater-than signs. It will look like this:

>>>

This is great! Try out some basic arithmetics, for instance typing in 2 * 3 and pressing the Enter key. Once you've seen the interpreter answer back, you can exit it by typing exit() and pressing Enter.

Opening a CSV file

Create a new text file in your favorite editor and give it a sensible name, for instance new_attendees.py. The .py extension is typical of Python program files.

We'll be using the following example CSV data files (all attendee names and emails were randomly generated): attendees1.csv and attendees2.csv. Go ahead and download these files to your computer. Place them in the same directory where your program file, new_attendees.py, lives.

Python allows you to open text files such as these and read their content, either all at once, or line-by-line. In the case of CSV files, we'll make use of a module built-in to Python which will simplify their parsing. The module in question is called, simply, csv.

We will need a few things to get started: first, since we will be using the csv module in our code, we'll need to let Python know about this. Although Python provides you with a number of built-in modules, you need to explicitly declare which modules you'll be using. This will become our program's first line:

import csv

We're now ready to write the rest of the program. We'll open the first attendance CSV file, store a reference to this open file as f and then pass this reference to the csv module. Namely, the csv.reader function which will parse the open file, and return a parsed list of rows. Here's what it would like so far:


import csv
f = open('attendees1.csv')
csv_f = csv.reader(f)

To run this program, in your terminal, go into the directory containing the Python program file and the attendees CSV files, and type python new_attendees.py

Looping through rows

Although the code above is valid, it's a bit unsatisfying as it does not output any data, making it a bit hard to follow along... Let's print out the rows contained in the CSV file by looping over the csv_f list.


import csv

f = open('attendees1.csv')
csv_f = csv.reader(f)

for row in csv_f:
  print row

You will see that we've added two lines -- one is the for loop which defines that for the following indented lines, the row variable should contain each element from the list, and the second line (indented, since we want this line to be part of the for loop) which will print this row variable.

What do we get in return? Beauty!


['Jackson', 'Thompson', 'jackson.thompson@hotmail.com']
['Luke', 'Wallace', 'luke.wallace@lycos.com']
['David', 'Wright', 'david.wright@hotmail.com']
['Nathaniel', 'Butler', 'nathaniel.butler@aol.com']
['Noah', 'Simpson', 'noah.simpson@hotmail.com']
['Eli', 'Mitchell', 'eli.mitchell@aol.com']
...

This list should go on and on. Yes, it is the list of attendees from the attendees1.csv file! Hooray!

Extracting information from a CSV file

Every row that we are looping overis a list object(csv_f is a list of lists). We can see these lists (corresponding to rows in the attendees CSV file) are made up of three elements, the third being the e-mail address we'd like to use to compare attendees.

In Python, square brackets are used to access an element located at some position ("index") in a list. Namely, for a list my_list containing ['a', 'b', 'c'], we would write my_list[0] to access its first element (Python lists are 0-indexed, just as in many other programming languages -- see this). All together, as another example, to print the "2nd" (in human-speak) element of a list, you would write:


new_list = ['red, 'green', 'blue']
print new_list[1]

This should print out green.

Back to our attendee CSV file, to print out the e-mail address of every attendee, we'll modify the code a bit in order to get:


import csv

f = open('attendees1.csv')
csv_f = csv.reader(f)

for row in csv_f:
  print row[2]

Run this code, and you should see the e-mail address of every attendee printed out. Almost there!

Creating lists

We were successful in accessing and printing the e-mail address from every row in the CSV file. Let's now create an empty list object which we will populate with those e-mail addresses. This, again, is to allow us to compare the first year's attendees with the second year's.

First things first -- let's create an empty list. An empty list is assigned to a variable by writing the following:

new_list = []

To append elements to this list (since an empty one won't be of much use), we'll use the list's append method. Considering the list created above, appending a string to it would look this:

new_list.append("a sentence")

Did it work? We can verify by printing the list object:

print new_list

We should see:

['a sentence']

i.e, a list with one element in it! Great! Modifying the code we've been working on so far gives us this:


import csv

f = open('attendees1.csv')
csv_f = csv.reader(f)

attendee_emails = []

for row in csv_f:
  attendee_emails.append(row[2])

print attendee_emails

Once run, this code will print a single list (which should appear as a long, single line) with all of the attendees' emails. Is anything missing? Well, the code above is only running on the attendees from the first year's party (since we're only reading the attendees1.csv file). We should do the same processing (i.e., extracting the e-mail addresses) on the second file as well.

In order to keep this example simple, we've decided to duplicate the code that processes the file attendees1.csv and do almost the same operations for attendees2.csv. A word of warning! Copying & Pasting code as we'll do below is considered bad practice. We're doing it here to keep the code readable -- the typical and better way would be to take out what's common to the code that's being copied/pasted and to put that code into a reusable block (a function). So, do keep in mind that we're sacrificing code quality for ease of understanding.

Once modified, the code above now does the same steps for both attendees1.csv and attendees2.csv files:


import csv

f = open('attendees1.csv')
csv_f = csv.reader(f)

attendee_emails1 = []

for row in csv_f:
  attendee_emails1.append(row[2])


f = open('attendees2.csv')
csv_f = csv.reader(f)

attendee_emails2 = []

for row in csv_f:
  attendee_emails2.append(row[2])

print attendee_emails1
print attendee_emails2

Converting lists to sets

Two more steps left. We will first convert the lists we have (attendee_emails1 and attendee_emails2) into sets. A set is a data type which a bit like a list, in that it stores a number of objects, but with the differences that objects are unordered, and that all of the contained objects are unique. So, a set might contain 2, 5, 4 (in no particular order) but not 2, 2, 4, 6, 6 since sets only contain unique values.

By converting our lists into sets, we will gain the ability to do comparisons of the two lists' values. To convert a list into a set, simply pass the list to the set() function. Thus:


my_list = [2, 6, 5, 5, 3, 4]
my_set = set(my_list)
print my_set

... will result in the following output:


set([2, 3, 4, 5, 6])

You probably noticed that the value 5 appears only once in our set, although it was present twice in my_list. Adding the set conversion operations to our code gives us:


import csv

f = open('attendees1.csv')
csv_f = csv.reader(f)

attendee_emails1 = []

for row in csv_f:
  attendee_emails1.append(row[2])


f = open('attendees2.csv')
csv_f = csv.reader(f)

attendee_emails2 = []

for row in csv_f:
  attendee_emails2.append(row[2])

attendee_emails1 = set(attendee_emails1)
attendee_emails2 = set(attendee_emails2)

Outputting the differences between the 2 sets

One very last step! We created sets out of our attendees' e-mail addresses specifically to compare them. Let's do just that! Sets support multiple operations, one of which will tell us which elements of one set is not present in another set. Let's try a simple example by creating two sets:


small_set = set([2, 3, 4, 5, 6])
large_set = set([1, 2, 3, 4, 5, 6, 7, 8, 9])
print large_set.difference(small_set)

As expected, the result should be 1, 7, 8, 9. Note that the order in which these resulting elements appear might look different -- this is because sets are unordered.

We see that calling the difference method does just what we want to do with our attendees' e-mail addresses! So, without further ado, let's go ahead and determine who attended the second party and was not present at the first!


import csv

f = open('attendees1.csv')
csv_f = csv.reader(f)

attendee_emails1 = []

for row in csv_f:
  attendee_emails1.append(row[2])


f = open('attendees2.csv')
csv_f = csv.reader(f)

attendee_emails2 = []

for row in csv_f:
  attendee_emails2.append(row[2])

attendee_emails1 = set(attendee_emails1)
attendee_emails2 = set(attendee_emails2)

second_year_attendees = attendee_emails2.difference(attendee_emails1)

print second_year_attendees

You should see a set of 19 e-mail addresses printed out. We did it! Bravo! Any questions? Please askthem below.

Want to learn more?

ProTech offers private Python training, delivered onsite or online for teams of 4 or more, as well as classes open to the public:

Copyright © 2018 ProTech. All Rights Reserved.

Sign In Create Account

Navigation

Social Media