Assessing the accuracy and quality of Wikipedia entries compared to popular online encyclopaedias (2012)
Imogen Casebourne, Dr. Chris Davies, Dr. Michelle Fernandes, Dr. Naomi Norman

Section 6: Discussion

This study was published in 2012. See this blog post for further information.

6. Discussion

6.1 Methodology

Section 3 of this report describes in some detail the full process followed in carrying out this research. In particular, it reports on the processes of selecting languages and encyclopaedias for comparison with Wikipedia articles, the sampling strategies for student and academic expert reviewers, the selection of articles and the review process.

We have nothing to add in terms of the decisions made about which languages to study, beyond saying that nothing which occurred subsequently in the process suggested that these were inappropriate choices. Considerable care was taken in trying to select the online encyclopaedias that were most appropriate for comparison with Wikipedia, in terms of the nature of content, style and readership. We have no doubt, in retrospect, that even given the difficulties in finding articles of equivalent focus and length to Wikipedia articles on a number of occasions, we could not have made better choices with respect to the three languages chosen.

The processes of establishing the samples of students and academic experts proved to be largely appropriate and productive. Achieving an initial pool of 116 students provided an excellent foundation for the selection of 24 students (12 as the main cohort, and a further 12 as back-up). Given the time pressures and commitments of such high-level students, we were pleased that we were able to select this number of committed and capable people who had such a key role to play in the research, both in terms of identifying academic experts and in carrying out their own reviews of articles. The identification of experts was carried out rapidly and productively and resulted in a generally satisfactory outcome in terms of numbers and quality of reviews. However, we were not always able to meet our target of at least two academic experts for each article (as against one student and one academic expert, a minimal requirement that was met on every occasion). Given the considerable efforts and enthusiasm of all involved, especially the students, this does raise serious questions about the viability of a significantly larger-scale study in the future.

Similar questions arise from the difficulties encountered in selecting and preparing pairs of articles for comparison. In the event, it proved extremely difficult to locate encyclopaedias which provided articles that could be compared with Wikipedia in a number of the specialist areas of experts. We had no alternative but to select topics that were broader and less specialist than we would have preferred and which did not match the expertise of academic reviewers as closely as originally intended. This, once again, has significant implications for any future scaling up of the research, although we do believe that the actual process of comparison proved to be extremely valuable (discussed further below).

Another potential problem in preparing articles concerns the inclusion of additional material such as photographs, charts and tables. Images, as was explained in Section 3, had been removed from articles and presented separately as part of the anonymisation process, so that although reviewers were not able to see images in context, they were able to comment on them. A few reviewers commented on the issue of imagery as a concern especially for articles on science and social science. For instance, with respect to the Wikipedia article on Neuronia, Reviewer 1 commented approvingly about the presence of images ("It includes more photographs. These photographs help to understand the role of neuron"). This is a factor that helped to distinguish two articles judged to be generally of commensurate quality. In general, though, imagery was not frequently referenced in comparative comments, perhaps because images had been dislocated from the flow of the text.

It is worth noting that a general audience (rather than academic reviewers, who may well be more used to engaging with largely text-based material) might expect high quality imagery from an encyclopaedia, online or otherwise, and might make overall judgments of quality based on layout and the quality of the imagery included. The methodology of this study (removing images from their original location and using academic reviewers) meant that the focus was very strongly on the words used and may not have fully captured this area of judgment.

The greatest difficulty in anonymising articles involved the removal of information that may be considered integral to the value of the article, such as the Wikipedia article tree, or the name of the author of some articles in other encyclopaedias. The removal of such information was clearly essential in order to achieve the goal of blind reviewing, but it could be argued that information about authorship might, to some extent, compensate for lack of references (although there is no inherent reason why named authorship need preclude the use of references).

The review process appeared to have been productive and appropriate. The criteria contained within the feedback tool (whose development is described in Section 3) provided an appropriate range of distinct perspectives on articles and stimulated a range of judgments and comments that, for the most part, enabled us to gain quite a rich and insightful range of comments about articles from reviewers. However, in developing this tool further, we would recommend a further period of trialling of criteria, especially around concepts such as completeness, conciseness and coherence, which sometimes seemed to generate slightly contradictory comments from some reviewers.

There is, of course, a fundamental problem in trying to reconcile the provision of clear and consistent criteria, so that a wide range of reviewers can be seen to be making comparable judgments, with the need (especially when it comes to asking for qualitative judgments) to capture the language and criteria that academic experts might otherwise have used if simply asked to discuss the strengths and weaknesses of articles as they perceived them. It is certainly only through the latter, more open-ended approach that it would be possible to carry out any systematic form of quantifiable content analysis of experts' qualitative judgments. As it was, in analysing the qualitative aspect of reviewers' judgments, analysts had to make their own judgments to some extent about whether, for instance, a reviewer talked about enjoyment of an article because that term had been put before them in the feedback tool or because they had actually enjoyed reading the article.

This said, we felt that the qualitative comments provided considerable insight into the kinds of criteria and standards that different academics use for making judgments about online encyclopaedia content, and it may well be that we would have had considerably more difficulty in generating the quality of thinking and comment that we did receive without the framework that was provided. It is fair to say that this pilot has not provided a definitive answer to this question, but it has demonstrated that the approach of providing a clear and specific framework can be highly productive. We encountered no evidence, at any rate, that reviewers felt constrained by the criteria provided for the review process. They were clearly capable of introducing their own criteria into the qualitative discussion of articles as appropriate, such as in the following:

"The discussion of the Monologion is fairly good. There are some factual errors or at least infelicities: 'monologium' and 'proslogium' should be 'monologion' and 'proslogion'. Anselm was most likely canonised in 1494. The treatments of the proslogion and cur deus homo are superficial." (Reviewer 2 – St Anselm)

We definitely believe that the comparative approach worked very effectively, and we would certainly recommend using that more widely, all other considerations being equal. This was clearly demonstrated on a number of occasions by comments such as the following:

"Reading the second article made me realise how poor the first article was in that it did not cover the subject comprehensively and focused excessively on one aspect that could be viewed as peripheral. In the second article, the structure and the reference to sources were ideal." (Reviewer 1 – antibiotic resistance)
"The most important differences separating the two articles were conciseness and scope of information." (Reviewer 2 – Mutation)
"Article 1 is superior in almost all respects to article 2. The difference is particularly striking in terms of structure, references, factual accuracy, grammar and language." (Reviewer 2 – Numero Racional)

In addition to the way in which comparing articles managed to focus reviewers' awareness of the qualities and weaknesses of specific articles, it was interesting to see on a number of occasions the way that comparison generated insights into the ways in which an article on a particular topic might usefully combine the insights and approach of multiple articles:

"Although both articles address the same basic issue of global warming, each of them with very different views, I find both very instructive and entertaining. The first privileges the sociological items, while the second starts with a more geophysical outlook. I find them both very good and complementary." (Reviewer 1 – Cambio Climatico)
"I preferred the first article to the second. It is written in a more scholarly manner and it provides a lot of references. I found the second paper still a draft, and this might be the case. Ideally you would combine the two to give a more comprehensive picture of preschool education." (Reviewer 2 – Preschool Education)

One methodological issue that raises perplexing questions is the fact that reviewers' judgments were not always in exact agreement with one another on specific articles. To some extent, such differences reflect the previous point about the different perspectives on particular topics that emerge from different articles. We do not consider, certainly, that different viewpoints on the same topic are necessarily invalid; indeed, they are part and parcel of academic life, as is variation in academics' judgment of quality. Just as an article submitted to a journal for peer review will very often receive diverging judgments, so did a number of the online encyclopaedia articles in the present sample (such as, for instance, the following, where one reviewer differed markedly from the other reviewer or reviewers: Energia Renovable, Antibiotic Resistance, Preschool Education, Egypt). For the most part, such disagreement concerned issues of emphasis and style rather than accuracy; in a small number of cases it may also have reflected negative perspectives on online encyclopaedias in general. This raises questions that are not possible to resolve here, but which might need clarifying before further work of this kind is carried out. These concern philosophical and epistemological perspectives on issues such as: the nature of knowledge within different cultural settings, the traditional role of encyclopaedias as sources of authoritative knowledge, perspectives on the Internet in general (and Wikipedia in particular) as a medium for the co-construction and sharing of knowledge, and so on. It is a mistake, perhaps, to assume that everyone involved in a project of this kind is actually agreed on the fundamental perspectives, which are bound to influence judgments made about individual articles.

Finally, given the successful outcome in terms of return of reviews, we believe that the tool created for this pilot study, using Moodle, proved to be usable, and provides a good basis for further development. Indeed, we can say with some confidence that the decisions made for carrying out this pilot proved generally to be appropriate and effective, both in terms of securing the valuable co-operation of many busy academics within quite a tight timescale, and in terms of generating an illuminating and satisfactory dataset. We recognise that were the study to be substantially extended, some of the challenges in securing the necessary content for analysis and the desired range of reviewers might prove hard to surmount on a far larger scale.

6.2 Findings

The quantitative and qualitative findings from this project are more or less in agreement with one another, as might be expected. But they do lead to slightly differing perspectives on the judgments made by reviewers in one or two respects. While it is inevitable that quantitative results offer a more precise account of reviewers' judgments, we would suggest that the qualitative perspectives provided by the data are also of considerable value.

6.2.1 Quantitative Findings

The quantitative findings demonstrate that, across the piece, Wikipedia articles scored more highly on accuracy, amount and quality of references, style/readability and overall judgment (which is to say, citability). With respect to citability, though, it must be emphasised that at no time did articles from online encyclopaedias, whether Wikipedia or others, score highly with respect to this key criterion. This was also made quite clear in the qualitative comments. While many reviewers felt that some of the online encyclopaedia articles they reviewed were suitable for use in non-academic contexts (as useful or interesting overviews and introductions on particular topics) they did not consider that such articles could be considered on equal terms to material in refereed journals or textbooks from established publishers. Indeed, for academic reviewers in general, this was not likely to be otherwise and should not be seen either as a particularly surprising outcome, or as a particularly negative reflection on such articles.

This simply reflects the reality that scholarly knowledge and scientific research have to go to far greater lengths than are possible within a relatively short encyclopaedia article in order to justify knowledge claims in general. By 'far greater lengths', we should add, we are talking especially about issues such as extent of evidence provided in support of a knowledge claim, clarity about methodological issues and evidence of peer review. None of these things can reasonably be expected of articles in online encyclopaedias in sufficient measure. It is important here to focus on the qualities that can reasonably be expected of such sources of knowledge, in order to see whether – on the basis of this quite small sample at least – we were able to collect evidence which, if collected on a far larger scale, would provide definitive judgments about the quality of Wikipedia in its own terms, which is to say, as a leading online encyclopaedia.

Within this small sample, Wikipedia scored well in many key respects, as we have indicated above, and these positive scores were reflected when considering the findings in relation to the specific perspectives of articles in different languages, and in different disciplines. Indeed, as the quantitative results show clearly, it was only with respect to articles in the Arabic encyclopaedias that Wikipedia did not earn markedly higher scores. In the case of those two encyclopaedias, Mawsoah and Arab Encyclopaedia, Wikipedia came out lower on style, and more or less the same on the other key criteria of accuracy, references, overall judgment and overall quality score. In all other comparisons, Wikipedia fared somewhat better on references and, with the exception of articles in the Humanities and MPLS (Mathematics, Physics and Life Sciences), where Wikipedia scored no better, on accuracy, style/readability, overall judgment and overall quality score. This was more or less the case with articles in the Social Sciences, with the difference that Wikipedia scored relatively poorly there on style/readability. In Medical Sciences, though, Wikipedia scored well on accuracy, references and overall judgment.

6.2.2 Qualitative Findings

In terms of qualitative analysis, the picture is less easy to summarise. It is, in theory, possible to total the number of positive and negative comments and the overall number of preferences expressed regarding the full spread of articles in the sample. However, reviewers were generally quite measured in their comments and sometimes expressed no distinct preference, or highlighted strengths and weaknesses across both articles whilst marginally preferring one. For some articles overall preference is too close to call and in others where a preference is expressed, it is not a strong preference. Additionally, some subjects had four reviewers, whereas others only had two, so any overall count of preferences will be necessarily skewed by this. If a particularly well received article from one publication happened to have more reviewers than a less well received article from the same publication, that publication would make an unrepresentatively strong showing in any rough total of reviewer preferences.
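The skew described above can be illustrated with a small worked example. This is only a sketch with invented data, not the counting method used in this study: the article labels, the reviewer numbers and the per-article weighting scheme are all assumptions made for illustration. It compares a rough total of stated preferences with a total in which each article carries the same weight however many reviewers it had.

```python
from collections import defaultdict

# Hypothetical reviews: (article, source whose article the reviewer preferred).
# Article "A" has four reviewers and article "B" only two; the data are invented.
reviews = [
    ("A", "Wikipedia"), ("A", "Wikipedia"), ("A", "Wikipedia"), ("A", "Wikipedia"),
    ("B", "Other"), ("B", "Other"),
]

def raw_totals(reviews):
    """Count every stated preference once, regardless of article."""
    totals = defaultdict(int)
    for _article, source in reviews:
        totals[source] += 1
    return dict(totals)

def per_article_totals(reviews):
    """Give each article a total weight of 1, split equally among its reviewers."""
    reviewers_per_article = defaultdict(int)
    for article, _source in reviews:
        reviewers_per_article[article] += 1
    totals = defaultdict(float)
    for article, source in reviews:
        totals[source] += 1 / reviewers_per_article[article]
    return dict(totals)

print(raw_totals(reviews))          # {'Wikipedia': 4, 'Other': 2}
print(per_article_totals(reviews))  # {'Wikipedia': 1.0, 'Other': 1.0}
```

In this invented case each source is preferred for exactly one article, yet the rough total suggests a two-to-one preference simply because one article attracted more reviewers. Weighting by article removes that particular distortion, but it does nothing about the other difficulties noted above, such as weak or absent preferences.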

It would, at any rate, be pernicious to attempt to quantify qualitative judgments too precisely. Above all, the analysis of qualitative data aims to capture things that are hard to quantify precisely: feelings, attitudes and opinions of reviewers that are important and illuminating but are often also imprecise and hard to compare.

In comparing 'the accuracy, quality, style, references and judgment of Wikipedia entries as rated by experts to analogous entries from popular online alternative encyclopaedias' through the medium of the qualitative data, we were able to identify a number of issues both about the way academics make judgments about online encyclopaedia content in general and in particular about the characteristics that distinguish Wikipedia entries. It is evident from these qualitative judgments that, apart from a small number of articles considered to be quite weak, Wikipedia articles in general emerge creditably from this comparison in a number of respects. More usefully, it was possible to identify a pattern of qualities that appeared to be particularly characteristic of Wikipedia entries. They were generally seen as being more up to date than others, were generally considered to be better referenced and appeared to be at least as strong as other sources in terms of comprehensiveness, lack of bias and even readability.

This latter judgment is worth emphasising here, because the quantitative data suggests that Wikipedia generally performed less well than the other sources when it came to style/readability. The qualitative data does show, though, that despite the readability issues associated with multi-authorship, such as lack of cohesion, repetition and poor structure, there was no clear impression across the qualitative data that Wikipedia articles were all less satisfying or engaging to read than other articles. This should, in our opinion, definitely be considered a crucial aspect of readability. In the reviews, for example, of the Wikipedia articles on Polinomia ("it has the right encyclopaedic tone"), Memory ("concise and well-written") and Attention ("very carefully written"), it is clear that reviewers approved of the reading experience as much as they valued the accuracy and references. By the same token, many articles from other encyclopaedias were criticised for their poor quality of writing.

Nonetheless, the multi-authored nature of Wikipedia did frequently lead to negative judgments regarding repetition, poor structure and lack of coherence within articles. But the analysis of qualitative data did allow us to build a more interesting composite picture, in which the balance of qualities in a particular article was often seen to outweigh specific shortcomings. Insofar as Wikipedia articles were more often judged to provide more comprehensive and up to date content, useful references and at least comparable levels of accuracy and citability, it can be argued that reviewers (though critical when asked about readability) were prepared to forgo readability to some extent given the presence of the other qualities.

Indeed, in seeking to characterise what it was about a high proportion of the Wikipedia articles that led to them being preferred in the qualitative findings as a whole, we would suggest that the answer can be found in the marked impression that the strengths of multi-authorship were often judged to outweigh its weaknesses. Regardless of problems of readability, style and structure on occasions, the greater likelihood that it was a Wikipedia article that would be judged to be up to date, comprehensive and well referenced in the qualitative comments offers, at the very least, a hypothesis about the particular qualities of Wikipedia that is well worth exploring in a more substantive study.

6.3 Recommendations

6.3.1 Methodological Considerations for Future Research

We are well aware that the findings of this study are merely indicative of the kinds of approaches and issues that might be considered relevant to a future large-scale study, and we must repeat that although the findings for this small sample were quite positive with regard to Wikipedia, this has limited significance beyond indicating the kinds of questions or hypotheses that might profitably be focused on in a subsequent study on a larger scale. Although it is perfectly legitimate to look in depth, as we have attempted to do, at the themes emerging from this small sample, in order to sharpen the focus of further research of this kind in the future, it is obviously not possible to make generalisations about Wikipedia as a whole on the basis of the comparative analysis of 22 articles. Therefore, it is important to consider what kinds of further study are most feasible.

The use of the feedback tool devised for this research produced results that were comparable across reviewers, and allowed for consistency of measurement of views, especially in terms of quantitative data. Therefore, in any future study, we consider that it would definitely be worthwhile to develop and refine this instrument. Further pilot work would be needed to test the appropriateness of specific criteria, such as conciseness, readability and enjoyment, in order that reviewers are consistent in their application of these to specific articles. Whether it would be appropriate to modify the tool for use in different disciplinary areas also needs examination: our findings indicated that the notion of conciseness, for instance, was used differently by reviewers in science areas than in humanities and social science areas. It is also worth exploring whether an element of unstructured questioning of academics might produce more valuable qualitative data.

In terms of methods used, we do have considerable concerns about the feasibility of substantially scaling up the exact approach used in the present study. This is for a number of reasons:

The difficulty of finding a substantially increased number of articles from other encyclopaedias against which to compare Wikipedia articles.
The difficulty of securing and managing the participation of a sufficient number of academic experts for a large-scale representative study.
The complexity of carrying out a sufficiently rigorous analysis of qualitative data on a large scale without recourse to some degree of quantitative content analysis, which is not feasible with the present model of feedback tool.

In other words, we would say that the present project has produced a considerable amount of interesting and illuminating data, but we recognise that there are serious logistical problems in replicating it on a far larger and wider scale. Although it would presumably be possible to do so, given considerable investment in research staff to carry this out across multiple sites, the question arises as to whether or not this would be worthwhile.

6.3.2 The Focus for Future Research

As the end of Section 6.2 indicated, this study has indeed raised some lines of enquiry for further research of a similar kind. As mentioned above, the qualitative results from this study indicated that academic reviewers are generally open-minded about what constitutes quality in articles in online encyclopaedias. We would therefore suggest that, on the basis of these findings, the following tentative hypothesis is worth testing in a future study: the perceived quality of online encyclopaedia articles is as much dependent on a coherent and engaging narrative about a topic as it is on extensive provision of technical information.

Other issues arising from this study that merit further investigation along the same lines include the extent to which multi-authored articles present multiple, repetitive or even contradictory perspectives on their topics. Also of importance is the appropriateness of different kinds of reference to other sources, with respect to different kinds of content: is it possible to identify variation in terms of academic discipline/topic in order to judge when it is appropriate to draw on internet-based references as well as, or instead of, references to peer-reviewed journals and published books?

We would also recommend considering a wider disciplinary focus to include academic topics that are particularly relevant to Wikipedia's distinctive strengths, such as disciplines around which there is substantial internet-based discourse and dissemination: cultural studies, information sciences, communications, journalism, media studies and a wide range of interdisciplinary studies that bring together areas such as: economics, geography, sociology, future studies and sociolinguistics.

In conclusion, though, we must again ask whether the considerable investment needed to carry out large-scale research into the question of academic approval for Wikipedia articles constitutes the best way forward. It is important that ongoing effort continues to be made to secure the judgments and engagement of academic experts in monitoring the quality of Wikipedia, but the experience of this study suggests that a subsequent study of this kind on a considerably larger scale is not necessarily the most appropriate way of achieving this.

Previous studies, such as those referred to in the Introduction to this report, have been inconclusive, and we would not claim that this study has demonstrated a feasible means of replicating or extending the findings of the original study by Nature, although we do believe that it has made a respectable addition to studies of that kind. The criteria and terms of reference of previous studies have never been consistent from one study to the next and we suggest that a possible way forward is to seek to establish a more consistent set of criteria and questions (drawing on the experience of the present study) that can form the basis for a continuing series of manageable, small-scale studies of quality in the future. These would perhaps form the basis of regular snapshots that help to monitor what is, inevitably, a continuously shifting picture.

While we recognise and applaud the efforts of Wikipedia to maintain high standards in all its articles, this study has indicated that even the highest standards are not likely to convince academics (quite reasonably, we suggest) that Wikipedia articles can expect to be citable alongside peer-reviewed journal articles, or even published books. These academics were, on the other hand, very prepared and able to recognise the qualities and values in the best online encyclopaedia articles (the majority of which, in the present study, were found in Wikipedia) in their own right.

We suggest that it is also worth considering whether future research into Wikipedia might, in addition to the rigorous small-scale studies of accuracy suggested above, usefully devote more attention to the ways in which this exceptional resource is being used as a crucial source of knowledge by a wide range of readers: academics, students, workers in various industries and self-directed learners. In this respect, questions of considerable interest might include: What do they expect from it? What do they gain from it? What credit do they attribute to the accuracy of content they encounter? How do they follow up the openings to new knowledge that it provides? And so on. The academics who contributed to this research have very helpfully recognised that the quality of online encyclopaedias resides not so much in exhaustiveness of content as in their capacity to make knowledge accessible and engaging to a wide readership.