Sunday, June 16, 2013

How Open Are You? Part 1: Metrics to Measure Openness and Free Availability of Publications

For many years I have been raising a key question in relation to open access publishing: how can we measure how open someone's publications are?  Ideally we would have a way of measuring this in some sort of index.  A few years ago I looked around and asked around and did not find anything out there of obvious direct relevance to what I wanted, so I started mapping out ways to do this.

When Aaron Swartz died I started drafting some ideas on this topic.  Here is what I wrote (in January 2013) but never posted:


With the death of Aaron Swartz on Friday there has been much talk of people posting their articles online (a short term solution) and moving more towards openaccess publishing (a long term solution).  One key component of the move to more openaccess publishing will be assessing people on just how good a job they are doing of sharing their academic work.

I have looked around the interwebs to see if there is some existing metric for this and I could not find one.  So I have decided to develop one - which I call the Swartz Openness Index (SOI).


Let A = # of objects being assessed (could be publications, data sets, software, or all of these together). 
Let B = # of objects that are released to the commons with a broad, open license. 
A simple (and simplistic) metric would be
OI = B / A

This is a decent start but misses out on the degree of openness of different objects. So a more useful metric might be the one below.
A and B as above. 
Let C = # of objects available free of charge but not openly 
OI = ( B + (C/D) ) / A  
where D is the "penalty" for making material in C not openly available
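As a rough illustration of the formula above, here is a minimal Python sketch. The function name, the argument names, and the example penalty D = 2 are assumptions made for illustration only; the draft does not fix a value for D.

```python
def openness_index(total, open_count, free_count=0, penalty=2.0):
    """Compute OI = (B + C/D) / A.

    total      -- A, number of objects being assessed
    open_count -- B, objects released with a broad, open license
    free_count -- C, objects free of charge but not openly licensed
    penalty    -- D, divisor applied to C (hypothetical default of 2)
    """
    if total <= 0:
        raise ValueError("total (A) must be positive")
    if penalty < 1:
        # D below 1 would reward free-but-not-open objects more than open ones
        raise ValueError("penalty (D) should be at least 1")
    return (open_count + free_count / penalty) / total


# Example: 10 objects, 6 fully open, 2 free but not open, D = 2
print(openness_index(10, 6, 2, penalty=2.0))
```

With C = 0 this reduces to the simple OI = B / A.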

This still seems not detailed enough.  A more detailed approach might be to weight diverse aspects of the openness of the objects.  Consider for example the "Open Access Spectrum."  This has divided objects (publications in this case) into six categories in terms of potential openness: reader rights, reuse rights, copyrights, author posting rights, automatic posting, and machine readability.  And each of these is given different categories that assess the level of openness.  Seems like a useful parsing in ways.  Alas, since bizarrely the OAS is released under a somewhat restrictive CC BY-NC-ND  license I cannot technically make derivatives of it.  So I will not.  Mostly because I am pissed at PLoS and SPARC for releasing something in this way.  Inane.

But I can make my own openness spectrum.


And then I stopped writing because I was so pissed off at PLOS and SPARC for making something like this and then restricting its use.  I had a heated discussion with people from PLOS and SPARC about this but I am not sure if they updated their policy.  Regardless, the concept of an Openness Index of some kind fell out of my head after this buzzkill.  And it only just now came back to me. (Though I note - I did not find the Draft post I made until AFTER I wrote the rest of this post below ... ).


To get some measure of openness in publications maybe a simple metric would be useful.  Something like the following
  • P = # of publications
  • A = # of fully open access papers
  • OI = Openness index
A simple OI would be
  • OI = 100 * A/P
However, one might want to account for relative levels of openness in this metric.  For example
  • AR = # of papers with an open but somewhat restricted license
  • F = # of papers that are freely available but not with an open license
  • C = some measure of how cheap the non freely available papers are
And so on.

Given that I am not into library science myself and am not really familiar with playing around with this type of data, I thought a much simpler approach would be to just go to Pubmed (which of course works only for publications in the areas covered by Pubmed).

From Pubmed one can pull out some simple data. 
  • # of publications (for a person or Institution)
  • # of those publications in PubMed Central (a measure of free availability)
Thus one could easily measure the "Pubmed Central" index as

PMCI = 100 * (# publications in PMC / # of publications in Pubmed)
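The calculation is simple enough to sketch in a few lines of Python. The function name is an assumption for illustration; the counts in the example are taken from the table in this post.

```python
def pmc_index(in_pmc, in_pubmed):
    """PMCI = 100 * (# publications in PMC / # publications in Pubmed)."""
    if in_pubmed <= 0:
        raise ValueError("need at least one Pubmed publication")
    return 100.0 * in_pmc / in_pubmed


# (in PMC, in Pubmed) counts as reported in this post
counts = {
    "Eisen MB": (76, 104),
    "Lander ES": (160, 377),
    "Venter JC": (53, 237),
}

for name, (pmc, total) in counts.items():
    print(f"{name}: {pmc_index(pmc, total):.1f}")
```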

Here are some examples of the PMCI for various authors, including some bigger names in my field and some people I have worked with.

Name            PMC/Pubmed   PMCI
Eisen JA        224/269      83.2
Eisen MB        76/104       73.1
Collins FS      192/521      36.8
Lander ES       160/377      42.4
Lipman DJ       58/73        79.4
Nussinov R      170/462      36.7
Mardis E        127/187      67.9
Colwell RR      237/435      54.5
Varmus H        165/408      40.4
Brown PO        164/234      70.1
Darling AE      20/27        74.0
Coop G          23/39        59.0
Salzberg SL     107/162      61.7
Venter JC       53/237       22.4
Ward NL         24/58        41.4
Fraser CM       78/262       29.8
Quackenbush J   95/225       42.2
Ghedin E        47/82        57.3
Langille MG     10/14        71.4




And so on.  Obviously this is of limited value / accuracy in many ways.  Many papers are freely available but not in Pubmed Central.  Many papers are not covered by Pubmed or Pubmed Central.  Times change, so some measure of recent publications might be better than measuring all publications.  Author identification is challenging (until systems like ORCID get more use).  And so on.

Another thing one can do with Pubmed is to identify papers with free full text available somewhere (not just in PMC).  This can be useful for cases where material is not put into PMC for some reason.  And then with a similar search one can narrow this to just the last five years.  As openaccess has become more common maybe some people have shifted to it more and more over time (I have -- so this search should give me a better index).
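One way such searches could be automated is through the NCBI E-utilities `esearch` endpoint. The following is only a sketch under stated assumptions: the helper names are hypothetical, the author is matched with a plain `[au]` tag (which is ambiguous for common names), and the "last five years" window is approximated with `reldate` in days. The filters `free full text[sb]` and `pubmed pmc[sb]` are standard PubMed subset filters. The searches described in this post may well have been run by hand on the Pubmed website.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(author, subset=None):
    """Build a PubMed query for an author, optionally restricted to a subset."""
    term = f"{author}[au]"
    if subset:
        term = f"({term}) AND {subset}"
    return term


def esearch_count_url(term, years=None):
    """Return an esearch URL that asks only for the hit count of a query."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": 0}
    if years is not None:
        # Approximate "last N years" by publication date, counted in days
        params["datetype"] = "pdat"
        params["reldate"] = 365 * years
    return f"{ESEARCH_URL}?{urlencode(params)}"


def fetch_count(url):
    """Query E-utilities and return the number of matching records."""
    with urlopen(url) as response:
        data = json.load(response)
    return int(data["esearchresult"]["count"])


def free_index(author, years=None):
    """Percentage of an author's Pubmed records with free full text."""
    total = fetch_count(esearch_count_url(build_term(author), years))
    free = fetch_count(
        esearch_count_url(build_term(author, "free full text[sb]"), years)
    )
    return 100.0 * free / total if total else 0.0
```

Calling `free_index("Eisen JA", years=5)` would issue two live queries; swapping in `"pubmed pmc[sb]"` as the subset would give a PMC-only count instead. Counts change over time, so results will not match the numbers reported in this post.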

Let's call the % of publications with free full text somewhere the "Free Index" or FI.  Here are the values for the same authors.

Name            PMC/Pubmed   PMCI   Free/Pubmed (5 years)   FI-5   Free (all)   FI-ALL
Eisen JA        224/269      83.2   178/180                 98.9   237          88.1
Eisen MB        76/104       73.1   32/34                   94.1   83           79.8
Collins FS      192/521      36.8   104/128                 81.3   263          50.5
Lander ES       160/377      42.4   78/104                  75.0   200          53.1
Lipman DJ       58/73        79.4   20/22                   90.9   59           80.8
Mardis E        127/187      67.9   90/115                  78.3   135          72.2
Colwell RR      237/435      54.5   31/63                   49.2   258          59.3
Varmus H        165/408      40.4   21/28                   75.0   206          50.5
Brown PO        164/234      70.1   20/21                   95.2   185          79.0
Darling AE      20/27        74.0   18/21                   85.7   21           77.8
Coop G          23/39        59.0   16/20                   80.0   28           71.8
Salzberg SL     107/162      61.7   54/58                   93.1   128          79.0
Venter JC       53/237       22.4   20/33                   60.6   85           35.9
Ward NL         24/58        41.4   18/27                   66.6   30           51.7
Fraser CM       78/262       29.8   9/13                    69.2   109          41.6
Quackenbush J   95/225       42.2   54/75                   72.0   131          58.2
Ghedin E        47/82        57.3   30/36                   83.3   56           68.3
Langille MG     10/14        71.4   11/13                   84.6   11           78.6


Very happy to see that I score very well for the last five years. 180 papers in Pubmed.  178 of them with free full text somewhere that Pubmed recognizes. The large number of publications comes mostly from genome reports in the open access journals Standards in Genomic Sciences and Genome Announcements.  But most of my non genome report papers are also freely available.

I think in general it would be very useful to have measures of the degree of openness.  And such metrics should take into account sharing of other material like data, methods, etc.  In a way this could be a form of the altmetric calculations going on.

But before going any further I decided to look again into what has been done in this area. When I first thought of doing this a few years ago I searched and asked around and did not see much of anything.  (Although I do remember someone out there - maybe Carl Bergstrom - saying there were some metrics that might be relevant - but can't figure out who / what this information in the back of my head is).

So I decided to do some searching anew.  And lo and behold there was something directly relevant. There is a paper in the Journal of Librarianship and Scholarly Communication called: The Accessibility Quotient: A New Measure of Open Access.  By Mathew A. Willmott, Katharine H. Dunn, and Ellen Finnie Duranceau from MIT.

Full Citation: Willmott, MA, Dunn, KH, Duranceau, EF. (2012). The Accessibility Quotient: A New Measure of Open Access. Journal of Librarianship and Scholarly Communication 1(1):eP1025. http://dx.doi.org/10.7710/2162-3309.1025

Here is the abstract:

Abstract
INTRODUCTION The Accessibility Quotient (AQ), a new measure for assisting authors and librarians in assessing and characterizing the degree of accessibility for a group of papers, is proposed and described. The AQ offers a concise measure that assesses the accessibility of peer-reviewed research produced by an individual or group, by incorporating data on open availability to readers worldwide, the degree of financial barrier to access, and journal quality. The paper reports on the context for developing this measure, how the AQ is calculated, how it can be used in faculty outreach, and why it is a useful lens to use in assessing progress towards more open access to research.
METHODS Journal articles published in 2009 and 2010 by faculty members from one department in each of MIT’s five schools were examined. The AQ was calculated using economist Ted Bergstrom’s Relative Price Index to assess affordability and quality, and data from SHERPA/RoMEO to assess the right to share the peer-reviewed version of an article.
RESULTS The results show that 2009 and 2010 publications by the Media Lab and Physics have the potential to be more open than those of Sloan (Management), Mechanical Engineering, and Linguistics & Philosophy.
DISCUSSION Appropriate interpretation and applications of the AQ are discussed and some limitations of the measure are examined, with suggestions for future studies which may improve the accuracy and relevance of the AQ.
CONCLUSION The AQ offers a concise assessment of accessibility for authors, departments, disciplines, or universities who wish to characterize or understand the degree of access to their research output, capturing additional dimensions of accessibility that matter to faculty.

I completely love it.  After all, it is directly related to what I have been thinking about and, well, they actually did some systematic analysis of their metrics.  I hope more things like this come out and are readily available for anyone to calculate.  Just how open someone is could be yet another metric used to evaluate them ...

And then I did a little more searching and found the following which also seem directly relevant

So - it is good to see various people working on such metrics.  And I hope there are more and more.

Anyway - I know this is a bit incomplete but I simply do not have time right now to turn this into a full study or paper and I wanted to get these ideas out there.  I hope someone finds them useful ...

13 comments:

  1. You should count separately papers where the person is first, senior, or middle author, which, as you know, entail varying degrees of control over where papers are published.

    Replies
    1. yes, well, that is certainly fair .. however ... I decline collaborations and sometimes joint manuscripts if the others / senior people refuse to commit to OA ... so this is an imperfect measure since it does not measure how committed one is to openness

  2. I agree with Michael, this approach is more suited to measure the openness of senior authors but that might be enough and if not, it's a great start.

  3. For a long time (8 years?) I've been on-and-off wrestling with an OA metric question; To what extent has data been exploited (the background question for me was economic exploitation) prior to sharing? How much opportunity cost is being given up by the act of sharing this specific extent of data at the present time?

    Now, this is a very different question than the one Jon is asking. What I'm asking is about data sets and less about publications; but it can be connected 'around the back' through the idea of merit and opportunity cost.

    How much does OA cost different authors? Do they have to pay out of grants - if so, at what opportunity cost to new projects, students, equipment? Do they pay with lower article metrics and thus via their CV? What is the real cost to the author to make these publications more broadly available? How much did they give up to do this good deed?

    Indirectly, this addresses the question: How much honor should they be accorded as a result? Was it somehow self-serving (and thus a fine judgement call but not independently meritorious)? Or was it self-sacrificial and consciously for the good of the scientific community or general public?

    Replies
    1. I get your point and some of that information would be useful to have. But I don't think it is directly connected to my metric needs / goals. Just as exercise can be good for a person and good for society (in terms of reduced health care costs for the people who exercise) - open publishing can be good for the researcher and society and still be worth rewarding or measuring (just as health insurers have found that rewarding people for healthy behaviors can be a win win ..).

  4. Did you notice that Swartz' Guerilla Open Access Manifesto is not CC licensed at all? http://cryptome.org/2013/01/swartz-open-access.htm

    I don't think Aaron Swartz was obsessed with metrics, this doesn't seem to be his style at all.

    My own perspective is that we should move away from our societal obsession with metrics - we need less of this, not openness metrics.

    Replies
    1. Well, the only real connection here to Aaron is that I started the first blog post (in Yellow) just after Aaron died.

      As for metrics - I disagree. Metrics have many uses. They can be abused without a doubt, but I think they can also be useful.

  5. Demand Progress, the site Aaron Swartz started, is licensed CC-BY-NC-SA. http://www.demandprogress.org/

    Reddit is All Rights Reserved.

    My suggestion for honouring Swartz today is to stand up for others who are courageous enough to take risks to make things open that should be open. Sign the petition to pardon Edward Snowden: https://petitions.whitehouse.gov/petition/pardon-edward-snowden/Dp03vGYD Join the call to free Bradley Manning.

  6. This Guardian article says what I just said above but much better - the whistleblowers are the next generation of American patriots http://www.guardian.co.uk/commentisfree/2013/jun/16/whistleblowers-new-generation-american-patriots

  7. Part of me thinks Jonathan just did this to show he beats his brother, but to be serious for a moment, maybe we could just stop these silly arguments about which license is best for everyone, and let the humanities put whatever NC-SA-ND stuff they want on their stuff, as long as they don't confuse people and distinguish their now restrictive practices from fully open OA?

  8. Mr. Gunn, our interests in this matter are quite different. You are in the world of industry, benefiting financially from the gifts of others. My concern is building a global sustainable knowledge commons to serve the interests of scholarship and the public. If businesses can make a profit along the way, that's a good thing, but it's not the point and problematic when profit becomes the priority.

  9. Just to say that PLOS did re-release the Open Access Spectrum under a CC BY license which I agree was the right way to do it.

    Replies
    1. That's irritating - I thought it would at least put my name in. That previous comment is from me.

      Cameron Neylon, PLOS

