Missing the point?

[Image: pendulum, by sylvar]

It has been about a month since Science published Electronic Publication and the Narrowing of Science and Scholarship by James Evans. I’ve waited some time to comment because the results were somewhat counterintuitive, leading me to some deeper thinking.

The results seem to indicate that greater access to online journals results in fewer citations. The reasons for this are causing some discussion. Part of what I will maintain is that papers from 15 years ago were loaded with references for two reasons that are no longer relevant today: to demonstrate how hard the author had worked to find relevant information and to help readers in their own searches for information.

Finding information today is so easy that there is far less need to include a multitude of similar references.

Many people feel the opposite: that the ease of finding references, via sites such as PubMed, would result in more papers being cited, not fewer. Bench Marks has this to say:

Evans brings up a few possibilities to explain his data. First, that the better search capabilities online have led to a streamlining of the research process, that authors of papers are better able to eliminate unrelated material, that searching online rather than browsing print “facilitates avoidance of older and less relevant literature.” The online environment better enables consensus, “If online researchers can more easily find prevailing opinion, they are more likely to follow it, leading to more citations referencing fewer articles.” The danger here, as Evans points out, is that if consensus is so easily reached and so heavily reinforced, “Findings and ideas that do not become consensus quickly will be forgotten quickly.” And that’s worrisome–we need the outliers, the iconoclasts, those willing to challenge dogma. There’s also a great wealth in the past literature that may end up being ignored, forcing researchers to repeat experiments already done, to reinvent the wheel out of ignorance of papers more than a few years old. I know from experience on the book publishing side of things that getting people to read the classic literature of a field is difficult at best. The keenest scientific minds that I know are all well-versed in the histories of their fields, going back well into the 19th century in some fields. But for most of us, it’s hard to find the time to dig that deeply, and reading a review of a review of a review is easier and more efficient in the moment. But it’s less efficient in the big picture, as not knowing what’s already been proposed and examined can mean years of redundant work.

But this was just as true of journals stored in library stacks, before online editions. It was such a pain to use Index Medicus or a review article (reading a review article has always been the fastest way to get up to speed; that has nothing to do with being online or not) and then track down the articles that were really needed. So people would include every damn one they found that was relevant. The time spent finding the reference had to have some payoff.

Also, one would simply reuse citations for procedures, adding to those already used in previous papers. The time spent tracking down those references would be paid back by continuing usage, particularly in the Introduction and Materials & Methods sections. Many times, researchers would cite 4 or 5 different articles all saying similar things or using the same technique, just to provide evidence of how hard they had worked to find them (“I had to find these damned articles on PCR-generated mutagenesis and I am going to make sure I get maximum usage out of them.”)

There are other possible explanations for the data that do not mean that Science and Scholarship are narrowing, at least not in a negative sense. A comment at LISNews leads to one possible reason – an artifact of how the publishing world has changed. The comment takes us to a commentary on the Evans article. While this is behind the subscription wall, there is this relevant paragraph:

One possible explanation for the disparate results in older citations is that Evans’s findings reflect shorter publishing times. “Say I wrote a paper in 2007” that didn’t come out for a year, says Luis Amaral, a physicist working on complex systems at Northwestern University in Evanston, Illinois, whose findings clash with Evans’s. “This paper with a date of 2008 is citing papers from 2005, 2006.” But if the journal publishes the paper the same year it was submitted, 2007, its citations will appear more recent.

[As an aside, when did it become Evans’s rather than Evans’? I’d have gotten points off from my English teacher for that. Yet a premier journal like Science now shows that I can use it that way.]

The commentary also mentions work that appears to lead to different conclusions:

Oddly, “our studies show the opposite,” says Carol Tenopir, an information scientist at the University of Tennessee, Knoxville. She and her statistician colleague Donald King of the University of North Carolina, Chapel Hill, have surveyed thousands of scientists over the years for their scholarly reading habits. They found that scientists are reading older articles and reading more broadly–at least one article a year from 23 different journals, compared with 13 journals in the late 1970s. In legal research, too, “people are going further back,” says Dana Neacşu, head of public services at Columbia University’s Law School Library in New York City, who has studied the question.

So scientists are reading more widely and more deeply. They just do not add all that reading to their reference lists. Why? Part of it might be human nature. Since it is so much easier to find relevant papers, having a long list no longer demonstrates how hard one worked to find them. Citing 8 articles at a time no longer means much at all.

That is, stating “PCR has been used to create mutations in a gene sequence [23–32]” no longer demonstrates the hard work put into gathering those references. It is so easy to find a reference that adding more than a few looks like overkill. That does not mean that scientists are not reading all those other papers. They still appear to be, and are even reading more; they just may be including only the most relevant ones in their citations.

Two others put the data into a different perspective. Bill Hooker at Open Reading Frame did more than most of us: he actually went exploring in the paper itself and added his own commentary. Let’s look at his response to the claim that citing older articles is better:

The first is that citing more and older references is somehow better — that bit about “anchor[ing] findings deeply into past and present scholarship”. I don’t buy it. Anyone who wants to read deeply into the past of a field can follow the citation trail back from more recent references, and there’s no point cluttering up every paper with every single reference back to Aristotle. As you go further back there are more errors, mistaken models, lack of information, technical difficulties overcome in later work, and so on — and that’s how it’s supposed to work. I’m not saying that it’s not worth reading way back in the archives, or that you don’t sometimes find overlooked ideas or observations there, but I am saying that it’s not something you want to spend most of your time doing.

It is much harder work to determine how relevant a random 10-year-old paper is than one published last month. In the vast majority of cases, particularly in a rapidly advancing field (say, neuroscience), papers that old will be chock full of errors based on inadequate knowledge, which diminishes their usefulness as references. In general, newer papers will be better to use. I would be curious to see someone examine reference patterns in papers published 15 years ago, to determine how many of the multitude of citations were actually relevant or even correct.

Finally, one reason to include a lot of references is to help your readers find the information they need without having to do the painful work of digging it out themselves. This, to my mind, is the main reason to include lots of citations.

When I started in research, a good review article was extremely valuable. I could use it to dig out the articles I needed. I loved papers with lots of references, since they made my life easier. This benefit is no longer as critical, because other approaches can now find relevant papers much more rapidly than was possible just a few years ago.

Bill discusses this, demonstrating that since it is so much easier to find relevant articles today, this need to help readers with their own searches is greatly diminished.

OK, suppose you do show that — it’s only a bad thing if you assume that the authors who are citing fewer and more recent articles are somehow ignorant of the earlier work. They’re not: as I said, later work builds on earlier. Evans makes no attempt to demonstrate that there is a break in the citation trail — that these authors who are citing fewer and more recent articles are in any way missing something relevant. Rather, I’d say they’re simply citing what they need to get their point across, and leaving readers who want to cast a wider net to do that for themselves (which, of course, they can do much more rapidly and thoroughly now that they can do it online).

Finally, he really examines the data to see whether they actually show what many other reports have claimed. What he finds is that online access is not really equal: much of it is still commercial and requires payment. He has this to say when examining the difference between commercial online content and Open Access (my emphasis):

What this suggests to me is that the driving force in Evans’ suggested “narrow[ing of] the range of findings and ideas built upon” is not online access per se but in fact commercial access, with its attendant question of who can afford to read what. Evans’ own data indicate that if the online access in question is free of charge, the apparent narrowing effect is significantly reduced or even reversed. Moreover, the commercially available corpus is and has always been much larger than the freely available body of knowledge (for instance, DOAJ currently lists around 3500 journals, approximately 10-15% of the total number of scholarly journals). This indicates that if all of the online access that went into Evans’ model had been free all along, the anti-narrowing effect of Open Access would be considerably amplified.

[See, he uses the possessive of Evans the way I was taught. I wish they would tell me when grammar rules change so I could keep up.]

It will take a lot more work to see if there really is a significant difference in citation patterns between Open Access publications and commercial ones. But this give-and-take that Bill engages in is exactly how science progresses. Data are presented along with a hypothesis. Others critique the hypothesis and do further experiments to determine which view is correct. The conclusions from Evans’ paper are still too tentative, in my opinion, and Bill’s criticisms provide ample fodder for further examination.

Finally, Deepak Singh at BBGM provides an interesting perspective. He gets into one of the main points that I think is rapidly changing much of how we do research: finding information is so easy today that one can rapidly gather links. This means that even interested amateurs can find the information they need, something that was almost impossible before the Web.

The authors fail to realize that for the majority of us, the non-specialists, the web is a treasure trove of knowledge that most either did not have access to before, or had to do too much work to get. Any knowledge that they have is better than what they would have had in the absence of all this information at our fingertips. Could the tools they have to become more efficient and deal with this information glut be improved? Of course, and so will our habits evolve as we learn to deal with information overload.

He further discusses the effects on himself and other researchers:

So what about those who make information their life: creating it, parsing it, trying to glean additional insight from it? As one of those, and having met and known many others, all I can say is that to claim that the internet and all this information has made us shallower in our searching is completely off the mark. It’s easy enough to go from A –> B, but the fun part of online discovery is going from A –> B –> C –> D, or even A –> B –> C –> H. I would argue that in looking for citations we can now find citations of increased relevance, rather than rehashing the ones that others use, and that’s only part of the story. We have the ability to discover links through our online networks. It’s up to the user to bring some diversity into those networks, and I would wager most of us do that.

So, even if there is something ‘bad’ about scientists having a shallower set of citations in their publications, this is outweighed by the huge positive of easy access for non-scientists. They can now find information that used to be so hard to locate that only experts ever read it. The citation list may be shorter, but the diversity of the readership could be substantially enlarged.

Finally, Philip Davis at The Scholarly Kitchen may provide the best perspective. He also demonstrates how the Web can obliterate previous routes to disseminate information. After all the to-do about not going far enough back into the past for references, Philip provides not only a link (let’s call it a citation) to a 1965 paper by Derek Price but also a quote:

I am tempted to conclude that a very large fraction of the alleged 35,000 journals now current must be reckoned as merely a distant background noise, and as far from central or strategic in any of the knitted strips from which the cloth of science is woven.

So even forty years ago it was recognized that most publications were just background noise. But what Philip does next is very subtle, since he does not mention it. Follow his link to Price’s paper (which is available on the Web, entitled Networks of Scientific Papers). You can see the references Price had in his paper – a total of 11. But you can also see which papers have used Price’s paper as a reference, and quite a few recent papers have cited this forty-year-old paper. It seems some people maintain quite a bit of depth in their citations!

And now, thanks to Philip, I will read an interesting paper I would never have found before. So perhaps there will be new avenues for finding relevant papers that do not rely on following a reference list back in time. The Web provides new routes that short-circuit this process, routes that are not seen if people only follow databases of article references.

In conclusion, the apparent shallowness may only be an artifact of publishing changes; it may reflect a change in the needs of authors and their readers; it may not correctly factor in differences in online publishing methods; it could be irrelevant; or it could be flat-out wrong. But it is certainly an important work, because it will drive further investigations to tease out just what is going on.

It already has, judging by the online conversations about it. And to think that these conversations would not have been accessible to many people just 5 years ago. The openness displayed here is another of the tremendous advances of online publication.


5 thoughts on “Missing the point?”

  1. I think the point I was trying to make was your basic old-curmudgeon-you-kids-today-need-to-read-the-literature rant. I know from my own research experience the way that results are often twisted (or to be kinder, evolve) over time. You see the original article where the author proposes a possible mechanism. Then a later article cites that one and mentions the possibility. Then a review article cites the second one and states it as a fact. Then 6 papers down the line the unproven hypothesis is now dogma.

    And that’s the problem with using secondhand (and third-, fourth-, and fifth-hand) sources for your references. Yes, this is just as true in a paper-only world as it is online. But in the online world, with instant access to a journal’s backlog and direct links from reference lists, there’s less excuse for scholarly laziness.

    Those who forget the past are doomed to repeat it, as the cliché goes. There seems to be an awful lot of thought being given to the idea of a “dark archive” where people can publish negative results, which will help others save time and money by not having to repeat failed experiments. Forget the dark archive – what about already-published positive results? If we choose to ignore the deep literature in favor of a small subset of recent articles, then we’re likely to end up repeating work that’s already been done. The study in question may not show that this is really happening (how well do paper citations reflect the scholarship an author has actually done?), but it was a conclusion suggested by the study’s author, hence my comment.

    The other issue online is the emphasis on “consensus” rather than “correctness”. If a well-connected person with lots of online friends and linkers puts forth an opinion, or cites a paper stating one possible solution (but not necessarily the correct one), the online world tends to reinforce that. You link to it, it rises in Google, I find it more easily because it’s higher in Google, so I link to it, it gets higher in Google, lather, rinse, repeat. Everyone ends up linking to the same paper, rather than a better paper that isn’t as popular or well-linked. I worry that this provides a mechanism for effectively silencing different voices, rather than the open forum we were all promised the internet would be. Kind of a “mob rules” versus “wisdom of the crowds” sort of thing.

  2. David,

    It’s why I put the question mark ;-) I think it is a little too early to interpret this observation with any definitive conclusion.

    You bring up some really good points. However, I think that this sort of thing was present long before online journals appeared.

    Oddball ideas have always had a hard time beating their way through scientific consensus. Mendel’s paper sat unexamined for years. Science has always had a tendency to make it hard for non-consensus ideas to make their way to the forefront.

    It used to be that an entire generation of scientists had to die before a new idea would gain real purchase. We do not see that today. In fact, it is almost the opposite, with new ideas and changes happening almost hourly.

    The main driver for what you describe is how human social networks are set up. They follow the small-world model, where new nodes attach preferentially to the most highly connected ones. This provides real benefits in scalability and information flow, and it is how the internet itself is set up.

    But this creates power-law-distributed connections (a Long Tail), so that the most popular remain popular. That is how the social network survives. It is not a bad thing, since the popular only remain popular if they provide the network what it needs. The people drive the popularity, not the site.
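
    To make that mechanism concrete, here is a minimal sketch (my own toy illustration in Python, not taken from Evans’ paper or any of the posts discussed) of preferential attachment: each new node links to existing nodes with probability proportional to their current degree, and a handful of early hubs end up dominating while most nodes sit far down the tail.

    ```python
    import random
    from collections import Counter

    def preferential_attachment(n_nodes=10000, links_per_node=2, seed=1):
        """Grow a network where each new node links to existing nodes
        with probability proportional to their current degree
        (a Barabasi-Albert-style process)."""
        random.seed(seed)
        # Each node id appears in `endpoints` once per link it holds, so a
        # uniform draw from this list is a degree-proportional draw.
        endpoints = [0, 1, 0, 1]  # two founding nodes, doubly linked
        degree = Counter(endpoints)
        for new_node in range(2, n_nodes):
            targets = {random.choice(endpoints) for _ in range(links_per_node)}
            for target in targets:
                endpoints += [new_node, target]
                degree[new_node] += 1
                degree[target] += 1
        return degree

    degree = preferential_attachment()
    print("top 5 hubs:", degree.most_common(5))  # a few early nodes dominate
    print("median degree:", sorted(degree.values())[len(degree) // 2])
    ```

    Run it and the top hubs carry far more links than the median node ever will – a power-law-like imbalance in miniature, with no central planner involved.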

    So there was really never any chance that the Web would give equal voices to all. Human social networks do not function that way.

    What the Web can do is provide equal access to all. Now my oddball idea can actually get out there in ways never before possible. Of course it won’t be a big hitter but at least it is findable.

    And if it is actually correct, if it actually describes nature in a better way than current methods, others will find it. We do not all need to be deep searchers. But some people love to do that, and they will find the oddball idea. And they will connect with others. This incipient community can have real power, if the model actually is better, if it represents a better ‘Truth.’

    The Long Tail is fractal and holds the promise that useful ideas will move up the tail, will percolate towards the consensus of the popular.

    Thus it is possible for the odd to get seen. But it is a different path than a sort of town hall where everyone gets their say evenly. I could be wrong though ;-)

    But it is through conversations like the one you and I are having, conversations that would really not have been possible before the Web, that we will help make it work well.

    Because your concerns need to be addressed. If there is some sort of dark archive developing, then we need to develop tools to prevent it. And we can.

  3. As an aside, I am reminded of Steve Martin’s great character, Theodoric of York, Medieval Barber.

    “You know, medicine is not an exact science, but we are learning all the time. Why, just fifty years ago, they thought a disease like your daughter’s was caused by demonic possession or witchcraft. But nowadays we know that Isabelle is suffering from an imbalance of bodily humors, perhaps caused by a toad or a small dwarf living in her stomach.”

    And:

    “Wait a minute. Perhaps she’s right. Perhaps I’ve been wrong to blindly follow the medical traditions and superstitions of past centuries. Maybe we barbers should test these assumptions analytically, through experimentation and a “scientific method”. Maybe this scientific method could be extended to other fields of learning: the natural sciences, art, architecture, navigation. Perhaps I could lead the way to a new age, an age of rebirth, a Renaissance!…Naaaaaahhh!”

  4. I’m certainly not concluding anything yet, and I do think you’re right–this is definitely an age-old problem, not one that sprang up with the rise of online publishing. The question is whether things are getting better or worse. I would hope, with more and more journals putting their archives online (and more and more doing so open access), that finding and reading the primary literature would be easier than ever. But the study in question seems to be stating the opposite: that people are citing the primary literature less and less, and using more recent secondary sources (and fewer and fewer of these). Any time anyone summarizes someone else’s findings, you’re introducing a level of bias, and that worries me. I want to read the original findings and understand them for myself, rather than relying on someone else’s reading of them. Just FYI, the last thing I wrote for publication cited Haeckel’s Gastrea Theory paper from the 1890s, but that’s how I roll.

    To me this is one of the worries of our move online. We’re cutting corners, living in a world of instant tweets instead of well-thought-out essays that really explore a topic. There’s too much “I read it on the internet so it must be true,” or at least “everyone else cites this paper so I will too, without fully investigating for myself” – or at least that’s the implication of the study.

    One interesting note is Rotisserie, which is designed to inspire well-written, thought-out discussions online:
    http://scienceoftheinvisible.blogspot.com/2008/08/how-to-fix-broken-internet.html

    I do agree that the internet just magnifies human nature, and that increased access to things is a positive. But I worry that, as recent studies have shown, the long tail is becoming longer and more obscure while the mainstream hits are getting more and more concentrated. Time will tell if this trend continues.

    Oh, and I guess I didn’t explain the dark archive well enough–it’s a good thing. The idea is to have places where you can publish, or at least upload, your unpublished data, your experiments that yielded negative results. This will allow others to perhaps get something out of your work, work that you’re not going to be able to use for anything. At the very least, it would prevent others from wasting time and money repeating the things you’ve already done that haven’t panned out.

    As for your Theodoric comments, well, you’ll feel a lot better after a good bleeding.

  5. I misunderstood about the dark archive. Sorry.

    Well, to a certain extent, referencing the literature serves to make it easier for the reader to find relevant information and to provide support for the research. Referencing irrelevant papers has always been a problem, but now one can find out faster what is what.

    If it really becomes a problem for online articles (i.e. referencing work that does not accurately reflect the science), then there are a couple of ways to fix this.

    In strongly refereed journals, the referees can easily tell the authors to get their references in shape. This is much easier now than it used to be, thanks to online approaches. Checking references used to be a horrible chore; now it is a click. Make authors use worthwhile citations.

    For lightly refereed papers, this may not be done as much before publication. But it would not be in a journal’s best interest to let references become shoddy (lots of irrelevant ones), because that could have a huge effect on the journal’s impact factor.

    Impact factors will remain a useful tool for ranking journals, even online ones. Those that keep their customers happy (by providing high-quality articles with great references) will have higher impact. Those that put out shoddy papers with poor references will find themselves much lower on the totem pole.
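
    For anyone unfamiliar with the metric, the standard two-year impact factor is simple arithmetic: citations received this year to a journal’s output from the previous two years, divided by the number of citable items it published in those years. A quick sketch, with made-up numbers:

    ```python
    def impact_factor(citations_this_year, citable_items_prev_two_years):
        """Two-year journal impact factor: citations received this year to
        articles from the previous two years, divided by the number of
        citable items published in those two years."""
        return citations_this_year / citable_items_prev_two_years

    # Hypothetical journal: 600 citations in 2008 to its 2006-2007 articles,
    # of which there were 200 citable items.
    print(impact_factor(600, 200))  # -> 3.0
    ```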

    So the market may drive people back to having really great, relevant references.

    An important aspect of this, when it comes to papers, is that if this narrowing is real and if it really is a bad thing, then the scientific community can develop ways to fix the problem. Because now we know it exists.

    About the Long Tail: it has always been there, because its power-law nature maps directly onto human social networks. We can just see it better now, thanks to the metrics developed for web sites.

    So, before the Web, any idea could only percolate up by working its way through the Long Tail using certain defined methods, all social in nature: publication, presentation, sabbaticals, letter writing. These have been the age-old methods for moving up the power-law structure of human social networks – making human connections with the right people, the right publication. But usually the people doing the interacting had to occupy a similar space at similar times. Not so with the Web.

    The Web simply provides another and much more potent avenue to accomplish the same things. The critical aspect, and one I spend a lot of time on, is that there is also an increase in noise and random crud. We need good filters, a sort of Maxwell’s Demon, that can let the good stuff percolate up and keep the bad stuff down.

    Rotisserie sounds interesting. Because information cannot become knowledge without human interaction and discussion. This can occur online and will be greatly enhanced as we develop better tools. A place for good scientific discussions is hard to find online (some blogs come close).

    Because it is through a diversity of viewpoints, such as ours, that we will be able to overcome these difficulties. That is how humans have always solved the complex problems facing them.
