Why the world still struggles with open geospatial data publishing
Realising the potential of open geospatial data will require us to think of data publishing in a fundamentally different way.
Recently I gave a presentation at FOSS4G Oceania 2019. The conference brought together folks from various ‘open’ communities around the region, including open source developers, OpenStreetMap editors, and open data users.
This wasn’t my first open data talk. In fact, we’ve been talking about the potential — but also the limitations — of open data publishing for a few years now. It’s a topic the team at Koordinates is passionate about. And from the very positive response afterwards, it seems most of the industry is as well.
What is open data?
First and foremost, what is open data? The strict definition is that open data is free of technical, price, and most legal restrictions on reuse. The basic principle of open data is that anyone on the planet should be able to access and reuse publicly-funded data — even for commercial purposes — without any requirement except to credit the copyright holders.
The decision to make data ‘open’ is made by the owner of the data, such as the government agency, the researcher, or the NGO that produced the data in the first place. For government agencies and research organisations, this decision is structured by various institutional and governmental policies and mandates.
In New Zealand, we have NZGOAL (a great acronym for the New Zealand Government Open Access and Licensing framework) and the Declaration on Open and Transparent Government. Australia has the originally named AUSGOAL (NZ copied Australia, to be fair), and other federal, state, and local governments have their own variations. Lots of universities and research institutions have open data policies, too.
What’s the problem?
This sounds great, doesn’t it? So what’s the problem? It’s fair to say that, as of 2019, progress with open data publishing has stalled. Some agencies, like Land Information New Zealand, have done an excellent job of publishing their data to tens of thousands of users. Others — and it’s fair to say these are the majority — dump their data on a server or FTP site, or require you to email them formally with a data request.
Where portals are used, they generally haven’t changed or improved in the last six years despite rapid developments elsewhere in the geospatial industry. Metadata publication in portals is inconsistent. Raster imagery often isn’t supported. Service levels are unreliable. Licensing is absent, contradictory, or ‘pretty legal’. Data publishing is clearly an afterthought for most agencies and standards are persistently, frustratingly, low. Why is this still the case, well over a decade into the open data movement?
Lack of progress
There are a few reasons. From an outsider’s perspective, it’s easy to argue that organisations simply aren’t treating data publishing as core business. This is true, but I think it risks letting technology providers off the hook. Data platforms also haven’t made it easy to publish data, especially large and complex geospatial data — or at least not as easy as publishers would like. This blog post from the Civil Analytics Network at Harvard University, “An Open Letter to the Open Data Community,” outlines some of the issues that are limiting government agencies’ publishing activities.
I believe solving these issues will require data platforms to think of data publishing — and data distribution — in fundamentally different ways. Before I outline how our thinking at Koordinates has changed, I’d like to discuss an example of how data distribution works in the real world.
Real world data distribution
You may be aware that the New Zealand Government has an ambitious target of making New Zealand pest free by 2050. There is an enormous network of volunteer groups, local and central government agencies, NGOs, businesses, and individuals all working together towards this goal (noting, of course, this network existed prior to the Government’s target).
One of the ways these people can connect is by sharing data — after all, this is the only way the government is going to be able to track progress. Across the country, there are already hundreds of disparate, frequently updated datasets with no single point of truth or scalable way to collaborate.
Data complexity in the real world
This complexity is hard to incorporate into our usual way of thinking about open data, which is very much a “publisher publishes, user uses”, A-to-B relationship. In the real world, data distribution is simply much more complex. In our example of conservation data, we’re talking about a network, with an enormous number of potential transactions across many different nodes.
At present, this complex distribution of geospatial data — and real data collaboration — isn’t easy to achieve. It is usually managed through a messy mix of cloud file-sharing services, email, and USB drives. It would obviously be a great help for these volunteers to be able to immediately access up-to-date versions of all the geospatial data relevant to their project from one place.
A directly connected network
A more accurate view of data distribution is the model of a directly connected network, where organisations have multiple sources lying ‘upstream’ of them for a particular dataset, and other organisations lying ‘downstream’. An organisation might be the upstream source for dataset A, but a midstream node for dataset B, receiving the source data, making changes, and distributing the augmented data to other stakeholders.
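The network model above can be sketched in a few lines of code. This is purely illustrative — the `Node`, `publish`, and `receive` names are hypothetical, not a real Koordinates API — but it shows how the same organisation can be upstream for one dataset and a midstream node for another, augmenting data as it flows through.

```python
# A minimal sketch of the directly connected network model.
# All names here are hypothetical illustrations, not a real API.

class Node:
    """An organisation in the data network: upstream for some
    datasets, midstream or downstream for others."""

    def __init__(self, name):
        self.name = name
        self.datasets = {}    # dataset name -> current local version
        self.downstream = {}  # dataset name -> list of subscriber Nodes

    def publish(self, dataset, version):
        """Act as the upstream source: record a version and push it on."""
        self.datasets[dataset] = version
        for subscriber in self.downstream.get(dataset, []):
            subscriber.receive(dataset, version)

    def receive(self, dataset, version):
        """Act as a midstream node: take the upstream version,
        augment it, and pass the result further downstream."""
        augmented = f"{version}+{self.name}"
        self.datasets[dataset] = augmented
        for subscriber in self.downstream.get(dataset, []):
            subscriber.receive(dataset, augmented)


# LINZ is upstream for "property-titles"; a council augments the data
# and redistributes it to a volunteer group further downstream.
linz = Node("LINZ")
council = Node("Council")
volunteers = Node("Volunteers")
linz.downstream["property-titles"] = [council]
council.downstream["property-titles"] = [volunteers]

linz.publish("property-titles", "v1")
```

After the publish, the council holds an augmented copy and the volunteer group holds a further-augmented one — each node’s version diverges a little more from the upstream source, which is exactly the supply-chain behaviour the post goes on to describe.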
This view of data distribution helps explain some of the issues we face with the 'A-B' model of open data publishing. At each node we’re seeing a bunch of operations going on to make the data they have sourced fit for their specific purpose. It might be edited in a GIS. It might be clipped. It might have its attributes renamed. It might be combined with one or more other datasets to make a new product. There are actually dozens of potential operations used to make data fit for purpose or to add value.
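To make the fit-for-purpose operations concrete, here is a hypothetical sketch using plain Python dicts as stand-in features (real workflows would use a GIS or a geospatial library; all function names here are illustrative, not from the original post):

```python
# Hypothetical versions of three fit-for-purpose operations:
# clipping, attribute renaming, and combining datasets.

def clip(features, region):
    """Keep only features falling inside a region of interest."""
    return [f for f in features if f["region"] == region]

def rename_attributes(features, mapping):
    """Rename attributes, e.g. to match an internal schema."""
    return [{mapping.get(k, k): v for k, v in f.items()} for f in features]

def combine(*feature_sets):
    """Merge several datasets into one new product."""
    return [f for fs in feature_sets for f in fs]

# A downstream node sources two datasets and makes them fit for purpose:
traps = [
    {"trap_id": 1, "region": "Wellington"},
    {"trap_id": 2, "region": "Auckland"},
]
sightings = [{"sighting_id": 7, "region": "Wellington"}]

local_product = combine(
    rename_attributes(clip(traps, "Wellington"), {"trap_id": "id"}),
    clip(sightings, "Wellington"),
)
```

Each node in the network runs some chain of operations like this, which is why the output at any node is a new product rather than a faithful copy of the upstream data.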
This process is actually similar to how products are made in other industries — it’s a supply chain.
Geospatial data supply chains
Our working hypothesis is that every organisation on the planet producing or using geospatial data is in a geospatial data supply chain. Part of that chain may involve accessing open data — authoritative data from government agencies are fundamental to the operation of the supply chain — but that ‘public’ side of the chain is by no means the entire picture.
At present, data supply chains take a lot of work to maintain. Geospatial data is big, complex, and often requires expensive tools and a lot of labour to maintain. That means people currently at the bottom of the supply chain are often using data which is quite out of date.
It’s all well and good for government agencies to aim to release more frequently updated data — but this in itself can become a problem, as there’s a cost to bringing those updates into the internal systems of organisations across the network. While more updates and more data sound exciting, it can be difficult to realise the potential of that geospatial data. The systems we currently use to manage this supply chain — across government, industry, and civil society — can’t always cope.
Working towards a new model for open data
To solve the problems with open geospatial data publishing, we must solve the problems with geospatial data distribution writ large. We need to build an end-to-end solution for the geospatial data flows across these supply chains, while also significantly lowering the barriers for government agencies to publish their geospatial data openly.
At Koordinates, we’re working extremely hard to solve these problems — and we’ll be releasing some exciting features in the months ahead. The first of these, Data Forking, was released to existing customers last week (we’ll be making a bigger announcement on this blog shortly). Our next step will be new plans for our new-and-vastly-improved Koordinates Data Management, including free open data publishing for individuals and small teams. Watch this space.