Thinking about Open Data, with a little help from the Data Hub

Continuing to explore the adoption of explicit Open Data licenses, I’ve been having a trawl through some of the data in the Open Knowledge Foundation’s Data Hub. I’m disappointed – but not surprised – by the extent to which widely applicable Open Data licenses are (not!) being applied.

For those who are impatient or already aware of the background, feel free to skip straight to the results. For the rest of you, let me begin with a little background and an explicit description of my methodology.


Open Data is, increasingly, recognised as being A Good Thing. Governments are releasing data, making themselves more accountable, (possibly) saving themselves money by avoiding the need to endlessly answer Freedom of Information requests, and providing the foundation upon which a whole new generation of websites and mobile apps is being built. Museums and libraries are releasing data, increasing visibility of their collections and freeing these institutional collections from their decades-long self-imposed exile in the ghetto of their own web sites. Scientists are beginning to release their data, making it far easier for their peers to engage in that fundamental principle of science: the reproduction of published results.

Open Data is good, and useful, and valuable, and increasingly visible. But without a license telling people what they can and cannot do, how much use is it? Former colleague Leigh Dodds did some work a few years ago, to look at the extent to which (notionally open) data was being explicitly licensed. He was concerned with a very specific set of data: contributions to the Linked Data Cloud. At the time, Leigh found that only a third of the data sets carried an explicit license, and several of those license choices were dubious.

I was interested to see how the situation had changed. I’m running a short survey, inviting people to describe their own licensing choices. I’ve also taken a look at the Data Hub, which “contains 4004 datasets that you can browse, learn about and download.” This is a far larger set of data than the one Leigh studied back in 2009, and should hopefully therefore provide a richer picture of licensing choices. It’s worth remembering, though, that data owners must actively choose to contribute their data to the Data Hub. The Hub is run by the Open Knowledge Foundation, and it therefore seems likely that submissions will skew in favour of those who are more than normally enthusiastic about their data and more than normally predisposed toward open. For more, listen to my podcast with the Open Knowledge Foundation’s Rufus Pollock and Irina Bolychevsky.


I began by querying the Data Hub’s API, to discover the set of permissible licenses. This resulted in a set of 15 possible values:

  • Not Specified [notspecified]
  • Open Data Commons Public Domain Dedication & License [odc-pddl]
  • Open Data Commons Open Database License [odc-odbl]
  • Open Data Commons Attribution License [odc-by]
  • Creative Commons CC0 [cc-zero]
  • Creative Commons Attribution [cc-by]
  • Creative Commons Attribution Share Alike [cc-by-sa]
  • GNU Free Documentation License [gfdl]
  • Other Open Licenses [other-open]
  • Other Public Domain Licenses [other-pd]
  • Other Attribution Licenses [other-at]
  • UK Open Government License [uk-ogl]
  • Creative Commons Non-Commercial [cc-nc]
  • Other Non-Commercial Licenses [other-nc]
  • Other closed licenses [other-closed]

I then downloaded the JSON dump from the Data Hub, but found that it was far older (and smaller) than the set of data available to the API. The JSON dump was last updated on 30 August 2011, and contained just over 2,000 entries. At the time of writing, the API offers access to 4,004 entries. With the help of Adrià Mercader, I learned how to submit the correct query to the API itself, giving me access to all 4,004 records. Results included 44 different values for the license_id attribute: the 15 above, 12 numeric values that were presumably errors of some kind, assorted ways of either saying nothing or specifying that the data had no license, and a small number of records associated with specific licenses such as a Canadian Crown Copyright and the MIT License. Of 4,012 records, 874 appear to say nothing whatsoever about their license conditions, not even the

"license_id": ""

used by 523.
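The sorting of license_id values described above can be sketched in a few lines of Python. This is an illustration rather than the script actually used for the survey: it assumes each record is a dictionary that may or may not carry a license_id key, and the bucket names and sample records are hypothetical.

```python
from collections import Counter

# The 15 license ids listed earlier in this post.
KNOWN_LICENSE_IDS = {
    "notspecified", "odc-pddl", "odc-odbl", "odc-by", "cc-zero", "cc-by",
    "cc-by-sa", "gfdl", "other-open", "other-pd", "other-at", "uk-ogl",
    "cc-nc", "other-nc", "other-closed",
}


def classify_license(record):
    """Return a coarse bucket name for one dataset record (a dict)."""
    if "license_id" not in record:
        return "missing"        # the attribute is absent altogether
    raw = record["license_id"]
    value = "" if raw is None else str(raw).strip()
    if value == "":
        return "empty"          # e.g. "license_id": ""
    if value.isdigit():
        return "numeric"        # values such as "1" or "34", presumably errors
    if value == "notspecified":
        return "notspecified"   # explicitly recorded as not specified
    if value in KNOWN_LICENSE_IDS:
        return "listed"         # one of the other 14 listed ids
    return "other"              # anything else, e.g. a one-off license name


def tally(records):
    """Count how many records fall into each bucket."""
    return Counter(classify_license(r) for r in records)


# Hypothetical records, shaped like the fragment quoted above.
sample = [
    {"name": "a", "license_id": "cc-by"},
    {"name": "b", "license_id": ""},
    {"name": "c"},
    {"name": "d", "license_id": "34"},
    {"name": "e", "license_id": "notspecified"},
    {"name": "f", "license_id": "MIT"},
]

for bucket, count in sorted(tally(sample).items()):
    print(bucket, count)
```

Keeping "missing" and "empty" as separate buckets mirrors the distinction drawn above between records that omit the attribute and records that include it with no value.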



Looking at the raw numbers, the first impression must be a depressing one. Fully 50% of the records either explicitly state that there is no license (14), explicitly state that the license is ‘not specified’ (604), explicitly record a null value (523), or fail to include the license_id attribute at all (874). Given all of the effort that has gone into evangelising the importance of data licensing, and all the effort that Data Hub contributors have gone to in collecting, maintaining and submitting their data in the first place, that really isn’t very good at all. But at least it’s an improvement on what Leigh observed back in 2009.

If we remove the 2,015 unlicensed records and the 31 errors (those well-known data licenses, including ‘1’, ‘34’, ‘73’, etc.), the picture becomes somewhat clearer.
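The arithmetic behind these totals is easy to check. The sketch below uses only counts quoted in this post (the 1,966 figure is the number of licensed resources reported in the results); the variable names are illustrative.

```python
# Counts quoted in the post for records with no usable license statement.
unlicensed = {
    "explicitly no license": 14,
    "not specified": 604,
    "empty license_id": 523,
    "license_id attribute missing": 874,
}
errors = 31        # numeric license_id values, presumably errors
licensed = 1966    # licensed resources, as reported in the results

total_unlicensed = sum(unlicensed.values())
total_records = total_unlicensed + errors + licensed
unlicensed_share = 100 * total_unlicensed / total_records

print(total_unlicensed)          # 2015
print(total_records)             # 4012
print(round(unlicensed_share))   # 50
```

The three groups sum to 4,012, which matches the record count used in the analysis rather than the 4,004 advertised by the Data Hub.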

The licenses that many have worked so hard to promote for open data (CC0, the Open Data Commons family and – in some circumstances – CC-BY) are far less prevalent than I’d expected them to be. 125 resources are licensed CC0, 273 CC-BY, 119 ODC-PDDL, 61 ODC-ODBL, and 36 ODC-BY. That’s a total of 614 out of 1,966 licensed resources, or just 31%. 44% of the 614 are licensed CC-BY, an attribution license based upon copyright rather than database rights. At least some of those may therefore be wrongly licensed. The two core data licenses are almost tied (125 for CC0, 119 for ODC-PDDL), but together account for a tiny 12% of all the licensed resources in the Data Hub.

The picture’s not all bad, as there is clearly a move toward the principle of ‘open’ and ‘public domain’ licenses. CC0 (125) and ODC-PDDL (119) are joined by 167 data sets licensed with some other public domain license. And with 444 data sets, ‘other open license’ is the single most popular choice; almost one quarter of the licensed data sets use an open license that is not one of the mainstream ones.

In total, the Creative Commons family of licenses (including the odd ‘sharealike’ variant and the hugely annoying ‘noncommercial’ anachronism) account for 602 data sets, or 31%. The Open Data Commons family account for 216, or 11%.
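For anyone wanting to reproduce the percentages, here is a minimal sketch built from the counts quoted above. The family total of 602 for Creative Commons is taken as given, since the post does not break out the sharealike and noncommercial counts; the helper name is hypothetical.

```python
LICENSED_TOTAL = 1966  # licensed resources, as reported above

# Per-license counts quoted in the post.
counts = {
    "cc-zero": 125,
    "cc-by": 273,
    "odc-pddl": 119,
    "odc-odbl": 61,
    "odc-by": 36,
    "other-pd": 167,
    "other-open": 444,
}


def percent(part, whole=LICENSED_TOTAL):
    """Share of `whole` represented by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


mainstream = sum(counts[k] for k in
                 ("cc-zero", "cc-by", "odc-pddl", "odc-odbl", "odc-by"))
odc_family = counts["odc-pddl"] + counts["odc-odbl"] + counts["odc-by"]
core_pair = counts["cc-zero"] + counts["odc-pddl"]
cc_family = 602  # quoted total, including sharealike and noncommercial

print(mainstream, percent(mainstream))           # 614 31
print(percent(counts["cc-by"], mainstream))      # 44
print(core_pair, percent(core_pair))             # 244 12
print(percent(counts["other-open"]))             # 23
print(percent(cc_family))                        # 31
print(odc_family, percent(odc_family))           # 216 11
```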

By most measures, we should probably welcome the use of any open or public domain license. But the more choices there are, the more scope there is for confusion, contradiction, and a lack of interoperability. Every time I want to take an ‘open’ dataset licensed with Open License A, and combine it with an ‘open’ dataset licensed with Open License B, there’s the nagging doubt that some wording in one of the licenses introduces a problem. Do I need to check with a lawyer? Do I need to check with one or both of the data providers? Is this all too much bother, and should I just go and do something else? License proliferation is friction.

So those are the results. What do they say to you?

It will be interesting to check back over time, and see how the proportions shift. Let’s work to eradicate the ‘None / Not Specified’ category altogether, and then see what we can do to shrink all of the ‘Other’ categories.


More Stories By Paul Miller

Paul Miller works at the interface between the worlds of Cloud Computing and the Semantic Web, providing the insights that enable you to exploit the next wave as we approach the World Wide Database.

He blogs at www.cloudofdata.com.
