In Canada, Access to Information (ATI) laws provide rights of access to information belonging to or under the control of the government; they also provide for the proactive publication of certain information. This type of legislation gives all Canadians the right to access records of government institutions that are otherwise unavailable to the public.

Open by Default is a database that seeks to house and publish all records previously released under the applicable ATI legislation. Its goal is to support the journalism industry, fight misinformation and strengthen democracy in Canada by making these government records immediately and easily accessible to the public.

Data Collection

The team behind this project collects previously released ATI records through various channels, including donations from journalists and researchers as well as automated mechanisms. Once we receive a record, we process it and verify its validity before hosting it on Open by Default. Our goal is to collect and publish every record from every jurisdiction in Canada released through access to information legislation. Currently, we have records from the federal government. All our data is hosted on DocumentCloud.

Data Cleaning

The federal government sends ATI records in a variety of formats, including email attachments, CDs, USB sticks and paper records sent by mail. Every record available for request has a unique request number issued by the government. This request number, also called the requisition number, is the identifier for every ATI document.

We verify documents by matching the requisition number of each record against existing records published by the Government of Canada. The vast majority of ATI records on Open by Default came directly from the federal government. We are also grateful to trusted partners who have donated their personal archives of documents.
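
The matching step can be pictured with a short Python sketch. It is an illustration only, not our actual pipeline: the function names, the shape of the input data, the sample request numbers and the normalization rule (remove whitespace, convert to uppercase) are all assumptions made for the example.

```python
def normalize_request_number(raw: str) -> str:
    """Remove all whitespace and uppercase, so formatting differences do not block a match."""
    return "".join(raw.split()).upper()


def verify_records(received, published):
    """Split received records into verified and unverified lists.

    received:  iterable of (request_number, filename) pairs (hypothetical shape)
    published: iterable of request numbers taken from a government-published list
    """
    known = {normalize_request_number(number) for number in published}
    verified, unverified = [], []
    for number, filename in received:
        normalized = normalize_request_number(number)
        target = verified if normalized in known else unverified
        target.append((normalized, filename))
    return verified, unverified


# Made-up request numbers and filenames, for illustration only.
published = ["A-2019-00123", "A-2020-00456"]
received = [("a-2019-00123 ", "release_package.pdf"), ("A-2021-00999", "unknown.pdf")]
verified, unverified = verify_records(received, published)
```

In this sketch, a record whose number does not appear in the published list is set aside rather than discarded, so that a person can check it by hand.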

Optical Character Recognition

Many of the documents held by government institutions cannot be searched (i.e., they are not machine-readable). Open by Default uses DocumentCloud’s built-in Optical Character Recognition (OCR) technology to scan every file and generate a text transcript, which improves search and accessibility.

Limitations

File types

A small subset of records consists of audio or video files. These file types are incompatible with our standardized PDF file system, and our solutions for providing transcripts of them are still under development, so we are unable to host them for the time being. The IJF team is working on a separate hosting method so that all file types can be incorporated into our database.
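
As a rough illustration of how incoming files might be sorted by type, the Python sketch below uses the standard library's mimetypes module to guess a file's type from its name. The function name, the three outcome labels and the "review" fallback are assumptions made for the example and do not describe our actual intake process.

```python
import mimetypes


def hosting_decision(filename: str) -> str:
    """Return 'host' for PDFs, 'defer' for audio or video, 'review' for anything else."""
    mime, _ = mimetypes.guess_type(filename)  # guess is based on the file extension only
    if mime == "application/pdf":
        return "host"
    if mime and mime.startswith(("audio/", "video/")):
        return "defer"  # cannot be hosted until a separate method exists
    return "review"  # hypothetical fallback: a person decides


# Made-up filenames, for illustration only.
decisions = {name: hosting_decision(name)
             for name in ["A-2019-00123.pdf", "interview.mp3", "briefing.mp4"]}
```

Because the guess relies on the file extension, a mislabelled file would be sorted incorrectly; inspecting the file contents would be more robust.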

Optical Character Recognition

Some PDF files reach us in poor quality, whether because of the record’s age or carelessness when the record was created. OCR technology has limits, so transcripts of exceptionally rough documents may be unreliable. The IJF team is working to find solutions to this problem.
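
One simple way to spot an unreliable transcript is to measure how much of its text looks like ordinary words or numbers. The Python sketch below shows the idea. It is a heuristic offered for illustration, not the method used by DocumentCloud or by our team, and the function names and the 0.8 threshold are assumptions.

```python
import string


def looks_plausible(token: str) -> bool:
    """True if the token reads as a word or a number once edge punctuation is removed."""
    core = token.strip(string.punctuation)
    if not core:
        return False
    # Allow internal hyphens and apostrophes, as in "machine-readable" or "record's".
    letters = core.replace("-", "").replace("'", "").replace("’", "")
    digits = core.replace(",", "").replace(".", "")
    return letters.isalpha() or digits.isdigit()


def transcript_quality(text: str) -> float:
    """Share of whitespace-separated tokens that look plausible (0.0 for empty text)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(looks_plausible(t) for t in tokens) / len(tokens)


def needs_review(text: str, threshold: float = 0.8) -> bool:
    """Flag a transcript whose quality score falls below the (assumed) threshold."""
    return transcript_quality(text) < threshold


# Made-up sample text, for illustration only.
clean = "The minister approved the request on 12 March 2020."
noisy = "T#e m1n!ster appr0ved th3 r~quest"
```

Here needs_review(clean) is False and needs_review(noisy) is True. A score like this only catches garbled characters; it cannot tell whether a cleanly spelled word is the word that actually appears on the page.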