Thinking about Open Data, with a little help from the Data Hub

Continuing to explore the adoption of explicit Open Data licenses, I’ve been trawling through some of the data in the Open Knowledge Foundation’s Data Hub. I’m disappointed – but not surprised – by the extent to which widely applicable Open Data licenses are (not!) being applied.

For those who are impatient or already aware of the background, feel free to skip straight to the results. For the rest of you, let me begin with a little background and an explicit description of my methodology.

Background

Open Data is, increasingly, recognised as being A Good Thing. Governments are releasing data, making them more accountable, (possibly) saving themselves money by avoiding the need to endlessly answer Freedom of Information requests, and providing the foundation upon which a whole new generation of websites and mobile apps is being built. Museums and libraries are releasing data, increasing visibility of their collections and freeing these institutional collections from their decades-long self-imposed exile in the ghetto of their own websites. Scientists are beginning to release their data, making it far easier for their peers to engage in that fundamental principle of science: the reproduction of published results.

Open Data is good, and useful, and valuable, and increasingly visible. But without a license telling people what they can and cannot do, how much use is it? Former colleague Leigh Dodds did some work a few years ago to look at the extent to which (notionally open) data was being explicitly licensed. He was concerned with a very specific set of data: contributions to the Linked Data Cloud. At the time, Leigh found that only a third of the data sets carried an explicit license, and several of those license choices were dubious.

I was interested to see how the situation had changed. I’m running a short survey, inviting people to describe their own licensing choices. I’ve also taken a look at the Data Hub, which “contains 4004 datasets that you can browse, learn about and download.” This is a far larger set of data than the one Leigh studied back in 2009, and should hopefully therefore provide a richer picture of licensing choices. It’s worth remembering, though, that data owners must actively choose to contribute their data to the Data Hub. The Hub is run by the Open Knowledge Foundation, and it therefore seems likely that submissions will skew in favour of those who are more than normally enthusiastic about their data and more than normally predisposed toward open. For more, listen to my podcast with the Open Knowledge Foundation’s Rufus Pollock and Irina Bolychevsky.

Methodology

I began by querying the Data Hub’s API to discover the set of permissible licenses. This resulted in a set of 15 possible values:

  • Not Specified [notspecified]
  • Open Data Commons Public Domain Dedication & License [odc-pddl]
  • Open Data Commons Open Database License [odc-odbl]
  • Open Data Commons Attribution License [odc-by]
  • Creative Commons CC0 [cc-zero]
  • Creative Commons Attribution [cc-by]
  • Creative Commons Attribution Share Alike [cc-by-sa]
  • GNU Free Documentation License [gfdl]
  • Other Open Licenses [other-open]
  • Other Public Domain Licenses [other-pd]
  • Other Attribution Licenses [other-at]
  • UK Open Government License [uk-ogl]
  • Creative Commons Non-Commercial [cc-nc]
  • Other Non-Commercial Licenses [other-nc]
  • Other closed licenses [other-closed]

I then downloaded the JSON dump from the Data Hub, but found that it was far older (and smaller) than the set of data available to the API. The JSON dump was last updated on 30 August 2011, and contained just over 2,000 entries. At the time of writing, the API offers access to 4,004 entries. With the help of Adrià Mercader, I learned how to submit the correct query to the API itself, giving me access to all 4,004 records. Results included 44 different values for the license_id attribute: the 15 above, 12 numeric values that were presumably errors of some kind, assorted ways of either saying nothing or specifying that the data had no license, and a small number of records associated with specific licenses such as a Canadian Crown Copyright and the MIT License. Of 4,012 records, 874 appear to say nothing whatsoever about their license conditions; not even the

"license_id": ""

used by 523.
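
A tally of this kind could be sketched as follows. This is a minimal illustration, not the script actually used for this post: the `tally_license_ids` function, the two placeholder labels and the sample records are all hypothetical, and the only thing taken from the data itself is the `license_id` attribute name. The point it shows is the distinction drawn above between a record that omits `license_id` altogether and one that includes it with an empty value.

```python
from collections import Counter

# Hypothetical labels for the two "says nothing" cases described above.
MISSING = "<attribute missing>"   # record has no license_id key at all
EMPTY = "<empty value>"           # record has "license_id": "" (or null)


def tally_license_ids(records):
    """Count how often each license_id value occurs in a list of dicts."""
    counts = Counter()
    for record in records:
        if "license_id" not in record:
            counts[MISSING] += 1
        elif record["license_id"] in ("", None):
            counts[EMPTY] += 1
        else:
            counts[record["license_id"]] += 1
    return counts


# Invented sample records, standing in for the parsed API response.
sample_records = [
    {"name": "dataset-a", "license_id": "cc-by"},
    {"name": "dataset-b", "license_id": ""},
    {"name": "dataset-c"},
    {"name": "dataset-d", "license_id": "cc-by"},
    {"name": "dataset-e", "license_id": "notspecified"},
]

tally = tally_license_ids(sample_records)
for value, count in tally.most_common():
    print(f"{value}: {count}")
```

Keeping the missing and empty cases as separate buckets means they can be reported separately, as the 874 and 523 figures are here, and still be added together later.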

Results

 

Looking at the raw numbers, the first impression must be a depressing one. Fully 50% of the records either explicitly state that there is no license (14), explicitly state that the license is ‘not specified’ (604), explicitly record a null value (523), or fail to include the license_id attribute at all (874). Given all of the effort that has gone into evangelising the importance of data licensing, and all the effort that Data Hub contributors have gone to in collecting, maintaining and submitting their data in the first place, that really isn’t very good at all. But at least it’s an improvement on what Leigh observed back in 2009.
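
For anyone who wants to check the arithmetic, the four counts above can be added up and compared with the total in a few lines. The variable names are mine; the counts are the ones quoted in the paragraph above, and the 4,012 total is the record count given in the methodology.

```python
# Counts quoted in the text for records with no usable license information.
unlicensed = {
    "explicitly no license": 14,
    "license 'not specified'": 604,
    "null license_id value": 523,
    "license_id attribute absent": 874,
}

TOTAL_RECORDS = 4012  # record count quoted in the methodology

unlicensed_total = sum(unlicensed.values())
unlicensed_share = round(100 * unlicensed_total / TOTAL_RECORDS)

print(f"{unlicensed_total} of {TOTAL_RECORDS} records ({unlicensed_share}%)")
```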

If we remove the 2,015 unlicensed records and the 31 errors (those well-known data licenses, including ‘1’, ‘34’, ‘73’, etc.), the picture becomes somewhat clearer.

The licenses that many have worked so hard to promote for open data (CC0, the Open Data Commons family and – in some circumstances – CC-BY) are far less prevalent than I’d expected them to be. 125 resources are licensed CC0, 273 CC-BY, 119 ODC-PDDL, 61 ODC-ODBL, and 36 ODC-BY. That’s a total of 614 out of 1,966 licensed resources, or just 31%. 44% of the 614 are licensed CC-BY, an attribution license based upon copyright rather than database rights. At least some of those may therefore be wrongly licensed. The two core data licenses are almost tied (125 for CC0, 119 for ODC-PDDL), but together account for a tiny 12% of all the licensed resources in the Data Hub.

The picture’s not all bad, as there is clearly a move toward the principle of ‘open’ and ‘public domain’ licenses. CC0 (125) and ODC-PDDL (119) are joined by 167 data sets licensed with some other public domain license. And with 444 data sets, ‘other open license’ is the single most popular choice; almost one quarter of the licensed data sets use an open license that is not one of the mainstream ones.

In total, the Creative Commons family of licenses (including the odd ‘sharealike’ variant and the hugely annoying ‘noncommercial’ anachronism) account for 602 data sets, or 31%. The Open Data Commons family account for 216, or 11%.
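
The percentages quoted in this section can be reproduced from the raw counts. The sketch below is only a cross-check, with names of my own choosing; every number in it is taken from the text above, and shares are rounded to the nearest whole percent.

```python
LICENSED_TOTAL = 1966   # licensed resources, after removing unlicensed and errors
UNLICENSED = 2015
ERRORS = 31

# Counts for the licenses most actively promoted for open data.
promoted = {
    "cc-zero": 125,
    "cc-by": 273,
    "odc-pddl": 119,
    "odc-odbl": 61,
    "odc-by": 36,
}
OTHER_OPEN = 444
CC_FAMILY = 602


def percent(part, whole):
    """Share of `whole` represented by `part`, as a whole-number percentage."""
    return round(100 * part / whole)


promoted_total = sum(promoted.values())
core_total = promoted["cc-zero"] + promoted["odc-pddl"]
odc_family = promoted["odc-pddl"] + promoted["odc-odbl"] + promoted["odc-by"]

print("All records:", UNLICENSED + ERRORS + LICENSED_TOTAL)
print("Promoted licenses:", promoted_total, f"({percent(promoted_total, LICENSED_TOTAL)}%)")
print("CC-BY share of promoted:", f"{percent(promoted['cc-by'], promoted_total)}%")
print("CC0 + ODC-PDDL:", core_total, f"({percent(core_total, LICENSED_TOTAL)}%)")
print("Other open license:", f"{percent(OTHER_OPEN, LICENSED_TOTAL)}%")
print("Creative Commons family:", f"{percent(CC_FAMILY, LICENSED_TOTAL)}%")
print("Open Data Commons family:", odc_family, f"({percent(odc_family, LICENSED_TOTAL)}%)")
```

The three groups (unlicensed, errors and licensed) add up to the 4,012 records mentioned in the methodology.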

By most measures, we should probably welcome the use of any open or public domain license. But the more choices there are, the more scope there is for confusion, contradiction, and a lack of interoperability. Every time I want to take an ‘open’ dataset licensed with Open License A, and combine it with an ‘open’ dataset licensed with Open License B, there’s the nagging doubt that some wording in one of the licenses introduces a problem. Do I need to check with a lawyer? Do I need to check with one or both of the data providers? Is this all too much bother, and should I just go and do something else? License proliferation is friction.

So those are the results. What do they say to you?

It will be interesting to check back over time, and see how the proportions shift. Let’s work to eradicate the ‘None / Not Specified’ category altogether, and then see what we can do to shrink all of the ‘Other’ categories.

More Stories By Paul Miller

Paul Miller works at the interface between the worlds of Cloud Computing and the Semantic Web, providing the insights that enable you to exploit the next wave as we approach the World Wide Database.

He blogs at www.cloudofdata.com.
