
Getting it right with data attribution

There have always, it seems, been people for whom attribution and citation really matter. Some of them passionately engage in arguments that last months or years, debating the merits of comma placement in written citations for the work of others. Bizarre, right?

But, as we all become increasingly dependent upon data sourced from third parties, aspects of this rather esoteric pastime are beginning to matter to a far broader audience. Products, recommendations, decisions and entire businesses are being constructed on top of data sourced from trusted partners, from new data brokers, from crowdsourced communities, or simply plucked from across the open web. Without an understanding of where that data came from, and how it was collected, interpreted or maintained, all of those products, recommendations, decisions and businesses stand upon very shaky foundations indeed.

Data attribution is increasingly important, but it will be essential to make sure that the rules, tools and norms which emerge are both lightweight and pragmatic. Now is not the time to get heavy-handed and pedantic about where the comma goes.

Former colleague Leigh Dodds recently offered a useful discussion of the rationale behind data attribution. Early on, he describes the related (and, often, sloppily interchanged) notions of attribution and citation:

It might also be useful to distinguish between:

  • Attribution — highlighting the creator/publisher of some data to acknowledge their efforts, conferring reputation
  • Citation — providing a link or reference to the data itself, in order to communicate provenance or drive discovery

This distinction is important in some circumstances, but it can also be useful to consider a simpler, more selfish, but ultimately more scalable justification. Attribution (and citation) of data quite simply provides an audit trail, enabling you, your bosses, your investors, or your customers, to know more about the data upon which actions are based.

Creators want credit, consumers want to trust

Much of the serious consideration of attribution comes from a relatively small cadre of data owners or creators, who (understandably, perhaps) want credit for their hard work. Perhaps they want to prove use in order to secure future funding or advancement, or perhaps they simply want to track where their data ends up. Through a series of licenses, contracts and Terms & Conditions statements, these creators have done much in codifying the ways that data should be referred to. Leigh discusses some of the licensing terms in his post, but it’s probably fair to say that none of them have really caught on outside a few rather narrowly scoped groups of co-dependent developers and data providers.

Data owners’ requirements for credit run the gamut, from loosely phrased requests for a link back to their website, past this rather excessive example quoted by Leigh, all the way to lengthy tomes of draconian legalese:

The attribution must be no smaller than 70% of the size of the largest bit of information used, or 7px, whichever is larger. If you are making the information available via your own API you need to make sure your users comply with all these conditions.

All too often, that sort of self-defeating prescription is enough to send prospective users back to Google, in search of a less demanding alternative.

For consumers of data, or for those wondering where the data behind a product or decision came from, things are rather simpler. On the whole, those on this side of the divide are simply looking for a pointer which enables them to learn more. Is my company’s multi-million dollar change of direction based upon detailed data from stable governments, credible banks and respected analysts, or did the person responsible use some numbers they found on their friend’s blog?

Carefully crafted rules regarding attribution’s wording, placement, colour, size and typeface are an irrelevance, probably deserving to be ignored or ridiculed.

Far better, and far more likely to succeed, to simply encourage users and re-users of data to sensibly point back (however they like) to their principal sources.

Remembering all the ancestors is a bit daft

Data set A is modified and added to in order to create data set A1. Data set A1 is modified and added to in order to create data set A2. Data set B is modified and added to in order to create data set B1. Data set B1 is modified and added to in order to create data set B2. Data set C modifies and extends data sets A2 and B2. It seems reasonable to acknowledge the contribution made to C by A2 and B2, but some would argue (loudly) that A, B, A1 and B1 also need to be acknowledged in C. This is one aspect of ‘attribution stacking’, and attribution stacking is, quite simply, stupid.

If I am the creator of data set C, I am selecting A2 and B2 because they are the right data sets for my purpose. That selection will be based upon a range of criteria, including the scope and coverage of the data. The selection will also be based upon my impression of the brands responsible for A2 and B2, and that impression (implicitly or explicitly) will include some awareness of the processes they use to select, validate and manage the data they use. It’s for them to carefully select, validate and provide attribution for A1 and B1, not for me. And it’s A1’s and B1’s job to do the same for A and B, not mine.
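For the more code-minded, here is a rough Python sketch of that idea. It is only an illustration, not anybody's actual tool or standard: the Dataset class and its attribution and ancestry methods are names invented for this example. Each data set records nothing but its direct sources, and the full family tree can still be recovered by anyone who cares to follow the pointers back.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical data set that records only the sources it was built from directly."""
    name: str
    sources: tuple[Dataset, ...] = ()

    def attribution(self) -> list[str]:
        """Who this data set credits when published: its direct sources only."""
        return [s.name for s in self.sources]

    def ancestry(self) -> list[str]:
        """Full lineage, recovered on demand by walking each source's own sources."""
        seen: list[str] = []
        queue = list(self.sources)
        while queue:
            ds = queue.pop(0)
            if ds.name not in seen:
                seen.append(ds.name)
                queue.extend(ds.sources)
        return seen


# The chain described above: A -> A1 -> A2 and B -> B1 -> B2, both feeding C.
a = Dataset("A")
a1 = Dataset("A1", (a,))
a2 = Dataset("A2", (a1,))
b = Dataset("B")
b1 = Dataset("B1", (b,))
b2 = Dataset("B2", (b1,))
c = Dataset("C", (a2, b2))

print(c.attribution())  # what C needs to say
print(c.ancestry())     # what a curious auditor can still work out
```

C only ever has to name A2 and B2. Because A2 and B2 each point back to their own sources, nothing is lost: the audit trail is still there for anyone who wants it, without C having to carry the whole list around.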

Things get even worse in some open source data projects, where all the individual contributors expect to be acknowledged. Inside the project (and on its website, etc), that’s fine and sensible. Outside, though? It’s ridiculous. So if data set A were created by individuals Aa, Ab, Ac, Ad and all their friends right up to Az, under some licenses there would be an expectation that every single one of those individuals be acknowledged by name in any mention of data sets A, A1, A2 or C. A massive administrative burden for any downstream users of the data set, and of no real benefit to anyone whatsoever. This desire for glory really does need to be challenged, if it is not to stifle free and fair downstream use and reuse of the data. Within the project building A, it may be vital to know that user Aa is a bit sloppy, or that user Ad has a nasty habit of making the data say what she thinks it should rather than what it actually does. But it is the responsibility of the project behind A to put processes and procedures in place to address these issues, and to ensure that all of its participants receive appropriate credit within the project for their contribution. By the time we reach A1 or A2, though, those internal details no longer matter. A1 chose to use A because those processes exist. After an initial evaluation of those processes and their implementation, A1 can — and should — simply trust them, rather than endlessly second-guessing them.

Tracking and Trust are different

Ultimately, the motivations of data creators and data re-users are very different. The processes and procedures put in place by creators and owners in search of kudos or statistics may actively obstruct the use and reuse that they profess to want. Complex forms of attribution, aligned to heavy-handed enforcement of infringements, do nothing to encourage a far broader community of use to emerge.

By attempting to count — and manage — the small number of uses today, data creators are stifling growth that otherwise is ready to explode. A perfect example of the saying (which may not translate beyond Britain’s shores!) of ‘biting off your nose to spite your face.’ Think about it… ;-)

Leigh ends his post with:

Attribution should be a social norm that we encourage, strongly, in order to acknowledge the sources of our Open Data.

Other than broadening it from ‘Open Data’ to just ‘data,’ I couldn’t agree more. But let’s keep it lightweight, simple, and pragmatic.

Note: or perhaps this post should have been called “When you stand on a giant’s shoulders, it’s a good idea to say thank you.”

Image of Eduardo Paolozzi’s sculpture of Sir Isaac Newton by Flickr user ‘monkeywing’


More Stories By Paul Miller

Paul Miller works at the interface between the worlds of Cloud Computing and the Semantic Web, providing the insights that enable you to exploit the next wave as we approach the World Wide Database.

He blogs at www.cloudofdata.com.
