Over the last decade, we’ve seen an enormous increase in the amount of open data on the internet, published on an array of servers, portals, APIs, and catalogs. Agencies are embracing the call to ‘get it out there’ — which is great news.
Why are they doing this? On the one hand, open data delivers a range of benefits that are specific to government itself.
Beyond government, opening public data also delivers substantial benefits to industry and civil society.
The upshot is that open data has the potential to have a massive positive impact on our economy, environment and society.
However, as the open data movement has matured, the expectations of open data users have risen. While it’s great to see government agencies publishing their data, it’s fair to say that the impact of open data releases to date has been underwhelming.
Why is this? While hundreds of thousands of government datasets have been made open, first-generation technologies have made them extremely difficult to find, access, combine, translate, and use. This is especially the case for geospatial data, where proprietary formats and large files lock out users that aren’t GIS professionals.
In response, an increasing number of users are calling for agencies to do a better job making their data openly available. As the Civic Analytics Network, a group of urban Chief Data Officers based at Harvard University, argued in a recent open letter, the open data movement needs to “set higher goals for open data to make it more accessible and usable.”
It’s no longer good enough to ‘get it out there.’ In 2017, users expect immediate access to the public data they need, in the specifications of their choice. And agencies, for their part, are starting to expect a radical increase in the use and impact of their open data, leading to a marked improvement in the ROI of their data assets.
Before you release your open data, it’s important to know exactly what ‘open’ means. Open is a concept that means different things to different people.
For some, data is open when it is made available on a publicly accessible website. For them, the purpose of open government data is transparency: by making government data publicly available, citizens can more effectively hold governments to account for their decisions on policy and expenditure.
For others — including most open data users — data is only truly open when it is made available under an open licence, in an open, machine-readable format.
This perspective is best summarised by the Open Knowledge Foundation’s Open Definition, which reads: “Open data and content can be freely used, modified, and shared by anyone for any purpose.”
So, the purpose of open government data is much broader than achieving greater transparency. The purpose is to allow anyone to reuse published government data in any way they see fit, without facing any technical or legal barriers.
Open data policy generally includes a range of legal and technical requirements for government agencies. We can think of these requirements as a series of boxes that agencies must check before their data can be considered open: open licensing, machine-readable and open formats, and listing on a central catalogue, each discussed below.
In some jurisdictions, government data is automatically placed in the ‘public domain’, free of copyright restrictions on reuse. Where this is not the case, government agencies need to apply an open licence. An open licence ensures that government data is available for anyone to legally reuse, subject to attribution requirements.
There are several open licences available, with Creative Commons being the most universally recognized and used. Generally speaking, the only restriction that agencies can place on reuse is to require attribution.
Because government data can be large and complex, most high-value uses will involve some degree of processing by software applications — from common spreadsheet programs, such as Excel, to specialized software, such as a Geographic Information System, or GIS, for geospatial data.
Open data policies tend to require that data be released in a format that can be reused in these applications. For example, a common complaint from data reusers concerns tabular data released as PDF rather than in a machine-readable format like CSV. A similar complaint arises when agencies release a visual representation of geospatial data (i.e. a map) without releasing the data itself.
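To make that contrast concrete, here is a minimal Python sketch of what machine-readability buys you. The file and column names are hypothetical; the point is that a CSV can be consumed directly by code.

```python
import csv

# Hypothetical file and column names, for illustration only.
with open("annual_expenditure.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    total = sum(float(row["amount"]) for row in reader)

print(f"Total expenditure: {total:,.2f}")
```

A few lines of standard-library code produce a verifiable total. Extracting the same figures from a PDF would first require fragile, error-prone scraping, and recovering them from a static map image is effectively impossible.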
For many users, open data is not considered truly open unless it is made available under open standards — which includes open formats — that don’t restrict usage to proprietary software applications.
Many governments use a centralized data catalogue to point to data that has been published across the government sector. This is intended to make published data more discoverable, and ensure that data users are not required to scour government websites to find the data they need.
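Many of these catalogues are built on platforms such as CKAN, which exposes a standard search API. As a rough sketch, a user or script can discover datasets programmatically; catalog.data.gov is one public CKAN instance, and the search term here is arbitrary.

```python
import requests

# Query a CKAN catalogue's standard package_search action.
resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "water quality", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

# Print the title of each matching dataset.
for dataset in resp.json()["result"]["results"]:
    print(dataset["title"])
```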
Government agencies, then, are increasingly releasing data on the internet, under an open licence, often using open and machine-readable formats. But there is another, crucial decision that every government agency needs to make, namely: Where should I publish my data?
Over the last decade, a range of approaches have been tried, leading to incrementally greater levels of data reuse.
The first mode of distribution was, and for some agencies still is, to simply make data available on request, behind the organisational firewall. The data itself remains on internal file drives and databases, and is distributed by CD, DVD or USB. Data users, though, have to know exactly what data an agency holds before they can request it.
Similar to firewalled distribution, data remains on internal file drives and databases, and is distributed by CD, DVD or USB. Agencies, though, begin to list descriptions of available data on their website, making it easier for users to know exactly what to request.
Agencies use a range of tools to offer visualizations of data, including simple GIS map viewers and graphing software. This enables the public to view a visual representation of the data, but often doesn’t allow them to easily access and reuse the data itself.
More recently, agencies have started to use early-stage technologies to distribute data, including open data portals and public servers. These enable agencies to meet the requirements of open data policies, and thus check their respective open data boxes, though results to date have been mixed.
The latest development, the data service, expands the use of open data beyond ‘early adopters’ in the geospatial and developer communities, and seeks to make it considerably easier for everyone to discover and use data.
For most organisations, the decision to publish data openly will represent a step-change in how their data is distributed and used. The good news is that much of the heavy lifting can be completed by your choice of technology platform, i.e. the data portal or data service you choose.
Regardless of which solution you choose, there are six steps you’ll need to follow.
First and foremost, you will need to choose a data platform. This decision should be informed by the range of technical functionality you require. For more guidance on making this decision, read our guide on the ten things you need from your open data portal.
Depending on the size and type of your organisation, and the ongoing capacity of your team, it may not be feasible to publish all of your open data at the outset. This means that you will need to decide what to prioritize. Remember that data publishing is an ongoing, iterative process: feedback and analytics on how users interact with your data will help you revise, refine, and expand your investment in the future.
The next step is to prepare your data for publication. While it is ultimately up to each publisher to decide how their data will appear, there are some simple things you can do to improve usability (and increase impact), as the sketch below illustrates.
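As one illustrative example (the file names are hypothetical, and the source file is assumed to already be UTF-8 encoded), normalising column headers before release makes tabular data far easier to load into analysis tools:

```python
import csv
import re

def clean_header(name: str) -> str:
    # e.g. 'Date of Issue ' -> 'date_of_issue'
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")

with open("raw_export.csv", newline="", encoding="utf-8") as src, \
     open("published.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow([clean_header(h) for h in next(reader)])  # header row
    writer.writerows(reader)  # data rows pass through unchanged
```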
Now, you should be ready to add your data to your site. The method you choose to do so will depend on the volume and nature of the data you wish to publish, as well as the technical capacity of your data publishing team.
Ideally, your data should be automatically scanned and added from a data source, such as a Windows SMB/CIFS file share, a public ArcGIS Server, or a WFS service. Your chosen platform may also automatically add site descriptions of each dataset, populated from your metadata, though you may also wish to amend these manually.
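To illustrate the kind of scan involved: the OGC WFS standard defines a GetCapabilities request that lists every feature type a service advertises, which a platform can turn into one catalogue entry per layer. The endpoint URL below is a placeholder; the request itself is standard WFS.

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder endpoint; GetCapabilities is defined by the OGC WFS standard.
resp = requests.get(
    "https://example.govt.nz/geoserver/wfs",
    params={"service": "WFS", "request": "GetCapabilities"},
    timeout=30,
)
resp.raise_for_status()
root = ET.fromstring(resp.content)

# List advertised feature types, ignoring XML namespaces so the same
# loop works across WFS versions.
for elem in root.iter():
    if elem.tag.endswith("FeatureType"):
        name = next((c.text for c in elem if c.tag.endswith("Name")), None)
        title = next((c.text for c in elem if c.tag.endswith("Title")), None)
        print(name, "-", title)
```

Each advertised feature type can then become a catalogue entry, with its name and title drawn straight from your metadata.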
Once your data is added, the final step before launch is to set up the branding and appearance of your site, on your own custom domain. This will involve choosing logos, colours, and featured datasets. Your platform provider may give you a ‘sandbox’ site, where you can experiment with how your data is presented to the user.
Post-launch, your organization will need to update existing datasets from your internal data sources. You may also wish to publish additional datasets, utilizing analytics on how users interact with and reuse data from your site.
As we have already suggested, authoritative data publishing requires an ongoing resource commitment from the publisher. The level of this commitment will depend on the amount of data you wish to publish, the regularity of any updates to that data, and the relationship you wish to cultivate with data users.
In general, this commitment will take two forms:
Subscription fees to the technology platform. The subscription rate can be calculated in a variety of ways, though it will generally depend on the amount and type of data you wish to publish.
Staff FTEs, including salary and overhead. As we have already noted, while your data platform should be able to automate a range of processes, authoritative data publishing will require staff to make decisions on how to prepare and publish your data and communicate with users.
Over the last decade, an increasing number of organisations have sought to distribute their data over the internet. The reasons for doing so are obvious: distributing data online removes the need to manage the time-consuming process of extracting and customising data for each individual data user (and, at times, sending it via physical media in the post).
However, earlier modes of data distribution, such as visualisation, or sharing data in a ZIP file on an FTP server or online data catalogue, tended to make it difficult for users to know exactly what data they were getting, and placed significant barriers in the way of high-value reuse. The upshot for data publishers is that most data made available in these ways has not been well used.
In our experience, the last decade of data publishing has shown that it’s not enough to simply make your data available. A radical increase in data reuse, and the benefits that follow, will only occur if data is published to meet the requirements and workflows of data publishers and users.
By meeting these requirements, publishers can see markedly greater reuse of their authoritative data. This leads to greater brand association with that data, and, with rich analytics, better investment decisions on how they make their authoritative data available to their users.
Users, for their part, can access more authoritative data about our planet in the specifications of their choice. When authoritative data publishing is spread over an entire industry, the result is clear: better decision making, fewer delays and increased innovation. When these benefits are spread over an entire country, we can make a material difference to society, the environment and our economy.