What happens before the IJF releases data into our eight public interest databases? Well, a lot of things.

First, the data is collected from government websites. The data that comes in might be difficult to read, contain errors and inconsistent spellings of names and organizations, and be generally unusable. These errors have to be cleaned up and the data standardized before it is published on our website.

It’s a long process, but it doesn’t end there. After the data is posted, we still need to ensure that it’s up to our standards of accuracy and clarity. That’s where news applications developer Lindsay Katz comes in.

Katz is designing the IJF’s data validation systems, which check to make sure that our data is clean and accurate. Creating code to check the data in this way “allows us to catch our own errors, but also catch errors that are systemic in the records,” said Katz.

Prior to working at the IJF, she graduated from the University of Toronto with a master’s in statistics and from the University of Guelph with a bachelor of arts and sciences in international development and mathematical science.

Katz sat down with IJF reporter Hannah Carty to talk about working with and, more importantly, questioning data.

What do you do at the IJF?

I joined the IJF team for the main purpose of implementing a suite of data validation tests to assess the quality of the IJF’s eight public interest databases.

What excites you the most about working for the IJF?

I'm really enjoying just working with the team and learning from everybody. Also, expanding my data skills and being able to work on databases that are so large and so important that real stories are being based on.

The IJF collects data through web scraping but also from PDF records that have to be scanned with Optical Character Recognition technology. How much work goes into making the data that people see on our website actually usable?

A lot. It looks very clean, but it obviously did not start out that way. I think people don’t realize as well [that] the OCR, the Optical Character Recognition technology that went into this, involved so much manual checking, and catching things like where an “S” was recorded as a five, or an “O” was recorded as zero. And those have huge implications with the records that we’re working with. There’s also significant work involved in wrangling the data scraped across provinces and territories to make it clean and usable for analysis.
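For readers curious what an automated check for those OCR mix-ups might look like, here is a minimal sketch in Python. It is an illustration only, not the IJF's actual validation code: the function names `flag_suspect_tokens` and `suggest_fix` are hypothetical, and the only look-alike pairs it covers are the two Katz mentions (S/5 and O/0).

```python
import re

# Digits that OCR can confuse with letters. Only the two pairs mentioned
# in the interview are listed; this mapping is an assumption for illustration.
CONFUSABLE_DIGITS = {"5": "S", "0": "O"}


def flag_suspect_tokens(text):
    """Return tokens that mix letters with a confusable digit, e.g. 'J0HN'."""
    suspects = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_letter = any(ch.isalpha() for ch in token)
        has_confusable = any(ch in CONFUSABLE_DIGITS for ch in token)
        if has_letter and has_confusable:
            suspects.append(token)
    return suspects


def suggest_fix(token):
    """Swap confusable digits for look-alike letters, as a suggestion for human review."""
    return "".join(CONFUSABLE_DIGITS.get(ch, ch) for ch in token)


record = "J0HN 5MITH donated 500 dollars"
for token in flag_suspect_tokens(record):
    print(token, "->", suggest_fix(token))
```

A check like this can only flag candidates. Legitimate mixes of letters and digits, such as the Canadian postal code prefix "M5S", would also be flagged, which is one reason manual review of the kind Katz describes remains necessary.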

This past April, Lindsay presented a poster on her joint work with Dr. Monica Alexander and Michael Chong at the 2023 Population Association of America annual conference, where she won an award for best poster. (Courtesy Lindsay Katz)

What appeals to you about data?

I think that data is very powerful. I like cleaning messy data, I like exploring it, I like figuring out ways that we can test it and just really ensure that the quality is up to certain standards.

Along with IJF news applications developer Callandra Moore, you're working to be able to analyze Canadian political donations data by gender using a tool that predicts the gender of a first name. What are some of the possibilities and limitations of this effort?

Some of the really interesting routes that we want to take the gender-labeled data is to look at differences in donations across political parties. So how different parties receive donations and how that varies by the donor’s gender. And then also to look at the candidate level to see if female candidates receive more donations from women than men.

I think you make a lot of strong assumptions when you assign a gender to someone's name. It doesn't account for the fluidity of gender, it doesn't account for variation over time. Some names are more gender neutral, so it's hard to account for that.

Can you tell me more about your work as a graduate student researcher?

I've also been working on a really cool project with Dr. Monica Alexander, where we have been working with data that another graduate student has collected from Facebook's advertising platform. It contains data on the number of Facebook users traveling in each province in Canada, and each state in the U.S. This is really valuable demographic data that comes from a non-traditional data source.

Lindsay with a lemon meringue tart she made for a friend’s birthday. (Ray Katz)

How do you spend your time outside of work?

My main hobby is baking. I worked at bakeries for a number of years in my teens. I have a big sweet tooth, and I’ve always loved trying out fun, challenging new recipes.

What’s your favourite thing you’ve made?

Oh, that’s a hard question. I think the thing I’m most proud of is a dark chocolate passion fruit cake that I made for my mom’s birthday.