
Getting it right with data attribution

There have always, it seems, been people for whom attribution and citation really matter. Some of them passionately engage in arguments that last months or years, debating the merits of comma placement in written citations for the work of others. Bizarre, right?

But, as we all become increasingly dependent upon data sourced from third parties, aspects of this rather esoteric pastime are beginning to matter to a far broader audience. Products, recommendations, decisions and entire businesses are being constructed on top of data sourced from trusted partners, from new data brokers, from crowdsourced communities, or simply plucked from across the open web. Without an understanding of where that data came from, and how it was collected, interpreted or maintained, all of those products, recommendations, decisions and businesses stand upon very shaky foundations indeed.

Data attribution is increasingly important, but it will be essential to make sure that the rules, tools and norms which emerge are both lightweight and pragmatic. Now is not the time to get heavy-handed and pedantic about where the comma goes.

Former colleague Leigh Dodds recently offered a useful discussion of the rationale behind data attribution. Early on, he describes the related (and, often, sloppily interchanged) notions of attribution and citation:

It might also be useful to distinguish between:

  • Attribution — highlighting the creator/publisher of some data to acknowledge their efforts, conferring reputation
  • Citation — providing a link or reference to the data itself, in order to communicate provenance or drive discovery

This distinction is important in some circumstances, but it can also be useful to consider a simpler, more selfish, but ultimately more scalable justification. Attribution (and citation) of data quite simply provides an audit trail, enabling you, your bosses, your investors, or your customers, to know more about the data upon which actions are based.

Creators want credit, consumers want to trust

Much of the serious consideration of attribution comes from a relatively small cadre of data owners or creators, who (understandably, perhaps) want credit for their hard work. Perhaps they want to prove use in order to secure future funding or advancement, or perhaps they simply want to track where their data ends up. Through a series of licenses, contracts and Terms & Conditions statements, these creators have done much to codify the ways that data should be referred to. Leigh discusses some of the licensing terms in his post, but it’s probably fair to say that none of them have really caught on outside a few rather narrowly scoped groups of co-dependent developers and data providers.

Data owners’ requirements for credit run the gamut, from loosely phrased requests for a link back to their website, all the way past this rather excessive example quoted by Leigh, to end up as lengthy tomes of draconian legalese:

The attribution must be no smaller than 70% of the size of the largest bit of information used, or 7px, whichever is larger. If you are making the information available via your own API you need to make sure your users comply with all these conditions.

All too often, that sort of self-defeating prescription is enough to send prospective users back to Google, in search of a less demanding alternative.

For consumers of data, or for those wondering where the data behind a product or decision came from, things are rather simpler. On the whole, those on this side of the divide are simply looking for a pointer which enables them to learn more. Is my company’s multi-million dollar change of direction based upon detailed data from stable governments, credible banks and respected analysts, or did the person responsible use some numbers they found on their friend’s blog?

Carefully crafted rules regarding attribution’s wording, placement, colour, size and typeface are an irrelevance, probably deserving to be ignored or ridiculed.

Far better, and far more likely to succeed, to simply encourage users and re-users of data to sensibly point back (however they like) to their principal sources.

Remembering all the ancestors is a bit daft

Data set A is modified and added to in order to create data set A1. Data set A1 is modified and added to in order to create data set A2. Data set B is modified and added to in order to create data set B1. Data set B1 is modified and added to in order to create data set B2. Data set C modifies and extends data sets A2 and B2. It seems reasonable to acknowledge the contribution made to C by A2 and B2, but some would argue (loudly) that A, B, A1 and B1 also need to be acknowledged in C. This is one aspect of ‘attribution stacking’, and attribution stacking is, quite simply, stupid.

If I am the creator of data set C, I am selecting A2 and B2 because they are the right data sets for my purpose. That selection will be based upon a range of criteria, including the scope and coverage of the data. The selection will also be based upon my impression of the brands responsible for A2 and B2, and that impression (implicitly or explicitly) will include some awareness of the processes they use to select, validate and manage the data they use. It’s for them to carefully select, validate and provide attribution for A1 and B1, not for me. And it’s A1 and B1’s job to do the same for A and B, not mine.
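To make the difference concrete, here is a minimal sketch in Python of the lineage described above. The DERIVED_FROM table and both function names are purely illustrative assumptions, not part of any real attribution tool or license. The point it tries to show is that if each data set acknowledges only its immediate sources, anyone who wants the full audit trail can still recover it by following those pointers one hop at a time.

```python
# Hypothetical lineage for the data sets named above: each entry lists the
# data sets it was directly derived from.
DERIVED_FROM = {
    "A": [],
    "A1": ["A"],
    "A2": ["A1"],
    "B": [],
    "B1": ["B"],
    "B2": ["B1"],
    "C": ["A2", "B2"],
}


def direct_sources(name):
    """What a data set acknowledges itself: its immediate sources only."""
    return sorted(DERIVED_FROM[name])


def full_ancestry(name):
    """Every upstream data set, recovered by following each source's own attribution."""
    seen = set()
    to_visit = list(DERIVED_FROM[name])
    while to_visit:
        current = to_visit.pop()
        if current not in seen:
            seen.add(current)
            to_visit.extend(DERIVED_FROM[current])
    return sorted(seen)


print(direct_sources("C"))  # the lightweight attribution C needs to carry
print(full_ancestry("C"))   # what attribution stacking would force C to list
```

In this sketch C carries two acknowledgements rather than six, and nothing is lost: the longer list is still there for anyone who cares to walk the chain.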

Things get even worse in some open source data projects, where all the individual contributors expect to be acknowledged. Inside the project (and on its website, etc), that’s fine and sensible. Outside, though? It’s ridiculous. So if data set A were created by individuals Aa, Ab, Ac, Ad and all their friends right up to Az, under some licenses there would be an expectation that every single one of those individuals be acknowledged by name in any mention of data sets A, A1, A2 or C. A massive administrative burden for any downstream users of the data set, and of no real benefit to anyone whatsoever. This desire for glory really does need to be challenged, if it is not to stifle free and fair downstream use and reuse of the data.

Within the project building A, it may be vital to know that user Aa is a bit sloppy, or that user Ad has a nasty habit of making the data say what she thinks it should rather than what it actually does. But it is the responsibility of the project behind A to put processes and procedures in place to address these issues, and to ensure that all of its participants receive appropriate credit within the project for their contribution.

By the time we reach A1 or A2, though, those internal details no longer matter. A1 chose to use A because those processes exist. After an initial evaluation of those processes and their implementation, A1 can, and should, simply trust them, rather than endlessly second-guessing them.

Tracking and Trust are different

Ultimately, the motivations of data creators and data re-users are very different. The processes and procedures put in place by creators and owners in search of kudos or statistics may actively obstruct the use and reuse that they profess to want. Complex forms of attribution, aligned to heavy-handed enforcement of infringements, do nothing to encourage a far broader community of use to emerge.

By attempting to count — and manage — the small number of uses today, data creators are stifling growth that otherwise is ready to explode. A perfect example of the saying (which may not translate beyond Britain’s shores!) of ‘biting off your nose to spite your face.’ Think about it… ;-)

Leigh ends his post with

Attribution should be a social norm that we encourage, strongly, in order to acknowledge the sources of our Open Data.

Other than broadening it from ‘Open Data’ to just ‘data,’ I couldn’t agree more. But let’s keep it lightweight, simple, and pragmatic.

Note: or perhaps this post should have been called “When you stand on a giant’s shoulders, it’s a good idea to say thank you.”

Image of Eduardo Paolozzi’s sculpture of Sir Isaac Newton by Flickr user ‘monkeywing’


More Stories By Paul Miller

Paul Miller works at the interface between the worlds of Cloud Computing and the Semantic Web, providing the insights that enable you to exploit the next wave as we approach the World Wide Database.

He blogs at www.cloudofdata.com.
