Popular Science Monthly/Volume 65/June 1904/On the Significance of Characteristic Curves of Composition
ON THE SIGNIFICANCE OF CHARACTERISTIC CURVES OF COMPOSITION.
UNIVERSITY OF NEBRASKA.
A FEW months ago, while studying the variation and interrelation of certain sentence constants, as average sentence-lengths, predication averages and simple-sentence frequencies in prose composition, my attention was called to an allied investigation, directed by Dr. T. C. Mendenhall, which takes for its basis the words used by an author rather than the sentences. The investigation in which I was then employed made it clear that the theory which asserts that an author uses invariable average sentence proportions is not true except when modified in essential respects, and I recognized at once that similar modifications would become necessary if the word instead of the sentence were taken as the element of composition.
The allied investigation to which I refer is set forth in two papers by Dr. T. C. Mendenhall, one in Science, March 11, 1887, entitled 'The Characteristic Curves of Composition,' the other, 'A Mechanical Solution of a Literary Problem' in The Popular Science Monthly, December, 1901.
These papers deal with the relative frequency of words of different lengths employed by an author. It was found that different groups of a thousand words each, taken from the same author, manifested a rather remarkable uniformity in the frequency of words containing a given number of letters. Larger groups showed still greater uniformity, and hence it was inferred that if sufficiently large groups of words from the same writer were examined, they would yield practically the same relative frequencies of words with a given number of letters.
The results were exhibited graphically. The number of letters per word were used as abscissas, the number of words per thousand containing a definite number of letters were taken for ordinates, and the resulting points connected by straight lines. Thus a graph or diagram was obtained which presents to the eye in a simple manner the relative frequencies of words of different lengths. Two such diagrams from the same author will agree more or less closely, depending upon the number of words in the groups upon which the averages are based. In the writer's own words: "When the number of words in each group is increased there is, of course, closer agreement of their diagrams, and this became so evident in the earlier stages of the investigation that the conclusion was soon reached that if a diagram be made representing a very large number of words from a given author, it will not differ sensibly from any other diagram representing an equally large number of words from the same author. Such a diagram would then reflect the persistent peculiarities of this author in the use of words of different lengths and might be called the characteristic curve of his composition. Curves similarly formed from anything that he had ever written could not differ materially from this." (The italics are mine.) After some preliminary work which seemed to bear out the conclusion ventured above, the writer states: "From the examination thus far made I am convinced that 100,000 words will be necessary and sufficient to furnish the characteristic curve of a writer—that is to say, if a curve is constructed from 100,000 words of a writer, taken from any one of his productions, then a second curve constructed from another 100,000 words would be practically identical with the first—and that this curve would, in general, differ from that formed in the same way from the composition of another writer, to such an extent that one could always be distinguished from the other."
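The construction just described can be sketched in a few lines of Python. This is only an illustration: the function names and the tokenizing rule (a word is a run of letters, apostrophes inside a word being ignored when counting letters) are assumptions of the sketch, since the papers quoted here do not state how such cases were counted.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally joined by apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def word_lengths(text):
    """Number of letters in each word of the text, in order."""
    return [sum(ch.isalpha() for ch in w) for w in WORD.findall(text)]

def word_curve(lengths):
    """Map letters-per-word (abscissa) to words per thousand (ordinate)."""
    counts = Counter(lengths)
    total = len(lengths)
    return {n: 1000 * counts[n] / total for n in sorted(counts)}

def curves_by_group(text, group_size=1000):
    """One word-curve for each consecutive full group of group_size words."""
    lengths = word_lengths(text)
    full = len(lengths) - len(lengths) % group_size  # drop the incomplete tail
    return [word_curve(lengths[i:i + group_size])
            for i in range(0, full, group_size)]
```

Plotting the pairs of any one returned mapping and joining them by straight lines gives the kind of diagram described above; a larger `group_size` corresponds to the larger word-groups of the later experiments.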
Such is the author's own statement of his theory, which the facts adduced apparently support. The culminating test consisted in the examination of different groups of 100,000 or more words from each of several authors, and it was found that the corresponding graphs did actually coincide. This, in the words of the author, 'must be regarded as convincing evidence of the soundness of the original assumption.'
The existence and uniqueness of characteristic curves being granted, their practical application as a test of disputed authorship is obvious. To quote again, "If it can be proved that the characteristic curve exhibited by an analysis of 'David Copperfield' is identical with that of 'Oliver Twist,' of 'Barnaby Rudge,' of 'Great Expectations,' etc., and that it differs sensibly from that of 'Vanity Fair,' or 'Eugene Aram,' or 'Robinson Crusoe,' or 'Don Quixote,' or anything else, in fact, then the conclusion will be tolerably certain that whenever it appears it means Dickens."
The title of Dr. Mendenhall's second paper, 'A Mechanical Solution of a Literary Problem,' refers to the application of this theory to the Bacon-Shakespeare controversy, which, we are told, formed the objective point of the whole investigation. The characteristic curves resulting from 400,000 words of the plays, and 200,000 words from Bacon's 'Henry VII.,' 'Advancement of Learning' and the 'Essays' were constructed and exhibited together as in Fig. 20. The concluding remark that 'the reader is at liberty to draw any conclusion he pleases from this diagram' only strengthens the impression that the conclusion intended is considered unavoidable, though we are told at the outset that 'the investigation is not to be looked upon as a final solution of the principal problem.'
Considering that it is now over fifteen years since the theory of characteristic curves was first outlined and that no denial of it has appeared, it must be taken for granted that the theory has found general acceptance. It is for this reason that I undertook an investigation, which proved laborious and unattractive in the main, in order to combat with facts an error which to me seemed obvious from the outset. The data which I have now at hand, though necessarily meager, are amply sufficient to establish a duality, if not a multiplicity, of characteristic curves for many authors. But this amounts to a denial of Dr. Mendenhall's major premise, and consequently invalidates his conclusion. Fig. 20, instead of furnishing a convincing proof, or even contributory evidence, leaves the problem of disputed authorship wholly untouched. In fact, my results throw considerable doubt upon the very existence of characteristic curves in the sense in which the term has been employed by Dr. Mendenhall. I shall, therefore, use the term word-curve when referring to curves representing the relative frequencies of words of different lengths used in composition.
Dr. Mendenhall states that the validity of his method as a test of authorship implies two assumptions: first, that the author makes use of a vocabulary which is peculiar to himself, and the character of which does not change from year to year during his productive period; and second, that in the use of that vocabulary in composition, personal peculiarities in the construction of sentences will, in the long run, recur with such regularity that short words, long words and words of medium length will occur with definite relative frequencies.
These two assumptions are of course independent. Suppose it be granted that authors use vocabularies peculiarly their own. It does not at all follow that these peculiarities will manifest themselves in varying word-lengths. Obviously an indefinite number of different vocabularies is conceivable, each yielding the same average word-length or even fitting to the same word-curve. Now, it is true that if authors are endowed with a word-sense or word-instinct by means of which personal traits are reflected through their vocabularies (first assumption), and if, moreover, this word-sense manifests itself in measurable differences in the relative frequencies of words of a given length (second assumption), then these personal traits or peculiarities of an author will in general modify the contour of the word-curve. But the converse of this by no means follows, that the differences in the contours of word-curves are necessarily due to any personal peculiarities in the respective authors.
The average word-length may be reasonably assumed to depend upon other factors besides the author's word-sense, as the form of composition, the subject matter, etc. A man's gait differs according as he is walking for pleasure or on business, alone or in the company of others, on a long journey or to escape from danger. Similarly the average word-length of the language current in the market place, the street or the drawing-room differs from that employed on the rostrum, in learned discourse or in polite conversation, even though used by the same person. Why should not this difference manifest itself in the written utterances of an author?
Dr. Mendenhall, by an enormous expenditure of labor, attempts to prove his second assumption. How? By taking for granted the converse of the very proposition which he wishes to establish. He actually constructs the word-curves for Mill, Jonson, Dickens, Bacon, Shakespeare and finding that they differ in contour, attributes these differences to personal peculiarities of the respective authors. Not once seems the question to have been raised, much less answered, whether these differences are not due wholly or in part to other determining conditions, such as the form of composition or other accidents.
Now not only can it be shown that the form of composition, at least, is a modifying factor of the word-curve and average word-length, but it appears, indeed, to be the predominating factor, overshadowing all others. Works agreeing in form of composition, though written by different authors, will be found to yield curves more nearly in agreement than different works of widely divergent forms of composition by the same author. Whether or not the author-component in the word-curve can be separated from the others is unknown; certain it is that nothing of the kind has as yet been attempted. With our present knowledge concerning word-curves, conclusions regarding the authorship of spurious or disputed writings based upon a comparison of the word-curves of works differing either in the form of composition or in other essential respects must be considered worthless.
It is not difficult to predict in a general way in what respects word-curves of different types of composition will differ. In the vernacular of a language so nearly devoid of inflection as our English, three- and four-letter words will naturally predominate. The development of oral speech, following the path of least resistance, will naturally be from the simple to the complex. Combinations of five, six or more letters, representing as many elements of sound, will not generally be resorted to so long as there are abundant simpler combinations, consistent with the possibilities of vocal articulation, to draw from. Now while the possible combinations of two and three letters into words are inadequate for a civilized language, the possible number of two-, three- and four-letter words, aggregating thousands, is sufficient to supply the majority of words needed for every-day speech. The word-curve of common conversation may therefore be expected to show a maximum ordinate for three or four letters. Words containing five, six or more letters will occur with diminishing frequencies. Few words of more than ten letters will occur. Now this is exactly what takes place. Swift's 'Polite Conversation,' which is a reproduction of the conversation of the uncultured, yields the word-curve shown in Fig. 1. This, after a correction for an excess of seven- and eight-letter words, due to the frequent occurrence of the words ladyship, lordship and certain proper names, is the typical word-curve of extreme light dialogue.
|Fig. 1. 5,000 Word-curves from Swift's 'Polite Conversation.' Corrected curve.|
What now will be the probable variations as we pass from this extreme type of composition to other forms of dramatic prose? As conversation becomes more sustained the relative frequency of the personal pronoun 'I' will naturally diminish, the use of prepositional phrases will cause two-letter words to increase, words of six and seven letters will become more numerous at the expense of the frequency of three- and four-letter words. The resulting word-curve must, therefore, cross the former in two places, once between the abscissas one and two, and again near the abscissa five. If we pass from the heavier forms of dramatic prose to narrative, in which dialogue alternates with description and still heavier composition, the personal pronoun will diminish still more in frequency, two-letter words will continue to increase as will also words of six, seven and more letters, and to compensate for this there must be a further decrease in the relative number of three- and four-letter words. This law of change will continue as we pass from fiction to pure description and from the essay style to the opposite extreme of scientific and philosophic discourse. Here the personal pronoun 'I' will have disappeared, leaving the indefinite article 'a' practically the sole representative of one-letter words; with the accumulation of phrases and clauses there is a corresponding accumulation of two-letter prepositions, and three- and four-letter words will have reached a minimum to make room for longer derivatives, compounds and technical terms.
Throughout these changes the five-letter word will probably vary least, since the variations on opposite sides of it are in contrary directions. We assume it constant. Taking furthermore Swift's 'Polite Conversation' and Mill's 'Political Economy' as representatives of the opposite extremes of the chain of forms of composition just described, we have the following schematic types of word-curves (Fig. 2) characteristic not of any particular author, but of the form of composition employed.
|Fig. 2. Schematic Word-curves representing, (A) 'Light Conversation,' (B) Classic Dramatic Prose, (C) Fiction, (D) Essay and Description, (E) Scientific and Philosophic Discourse.||Fig. 3. Actual Word-curves, (A) Swift's 'Polite Conversation,' (B) Beaumont and Fletcher's Dramatic Works, (C) Dickens's 'Christmas Carol,' (D) Bacon's 'Essays' and 'New Atlantis' and 'Henry VII.,' (E) Mill's 'Political Economy.'|
Of course no one would expect anything more than an approximate conformation to these types in any specific case, for we have already stated that the form of composition into which an author casts his thought is but one of several possible factors affecting the word-curve. But Dr. Mendenhall's diagrams seem to show that it is the predominating factor. In Fig. 3 I have superimposed on one the other four of Mendenhall's diagrams, and to complete the series I have added the word-curve of Swift's 'Polite Conversation.' A more striking corroboration of our hypothesis could scarcely be expected from data intended to establish the theory of characteristic curves.
It may be pointed out in passing that our hypothesis explains several puzzling phenomena brought out in Dr. Mendenhall's investigations. It is now clear why none of the thousand word-graphs from Dickens's 'Oliver Twist' 'could by any possibility be mistaken' for any one of ten similar graphs from Mill's 'Political Economy,' why the 10,000 word-curve from Mill's 'Political Economy' varies very strikingly from a similar curve from his 'Essay on Liberty' (Fig. 4). It explains why the two word-curves of 10,000 words each, one from 'Oliver Twist,' the other from 'Vanity Fair,' agree so closely, fully as closely in fact as two different curves of 10,000 words each from Dickens himself (Fig. 5), an occurrence which Mendenhall remarked, 'must be largely the result of accident, and it would not be likely to repeat itself in another analysis.' Finally our hypothesis removes all cause for surprise that Shaler's 'Armada Days,' composed 'in the spirit and style of the Elizabethan Age,' should yield a word-curve resembling that of Shakespeare's plays.
|Fig. 4. Two 5,000 Word-curves (after Mendenhall) from John Stuart Mill. (A) 'Political Economy,' (B) 'Essay on Liberty.'||Fig. 5. Three 10,000 Word-curves of Fiction (after Mendenhall). (A) Dickens's 'Oliver Twist,' (B) Thackeray's 'Vanity Fair,' (C) Dickens's 'Christmas Carol.'|
Seeing that the assumption that word-curves vary according to the composition employed accounts for nearly everything which had been attributed to personal characteristics of the authors, and that it also explains so much which is inexplicable on the opposite assumption, I sought for a way to test it. But how? According to Dr. Mendenhall, 'no one has written enough in two or three different styles, as prose, poetry, history, essay, drama, etc., to produce normal characteristic diagrams.' This, if true, would exclude any positive test of our hypothesis, but a moment's reflection convinced me that the assumption is entirely unwarranted. Goethe has among his prose works alone, volumes each of drama, biography, fiction, travel, science, criticism and correspondence. Schiller, too, has written far to exceed 100,000 words each of prose, drama and history. And what about Voltaire with his seven volumes of drama, eleven of history, seven of essays, ten of philosophy and eighteen of correspondence, besides several others of poetry, romance, science and commentaries; or George Sand or Lamartine with their libraries of books written in various forms of composition? Our own Dryden, also, has written of essays and prose dramas each more than sufficient to furnish a normal word-curve from each.
Here then was sufficient material to demonstrate the truth or falsity of our hypothesis, if only means could be found to carry out the work. Dr. Mendenhall convinced himself that no less than 100,000 words are necessary to yield an invariable curve, and it would evidently require several such curves to furnish any safe ground for induction. But the examination of several hundred thousand words, allowing but two hours for the tabulation and classification per thousand, would require a greater sacrifice of time than other duties would permit me to make. Indeed, Dr. Mendenhall found himself in the same predicament, from which he was rescued by the generosity of a private citizen, who supplied the salaries of two assistants for several months during which the necessary data were collected.
Then it occurred to me that though one hundred thousand words may be necessary to yield an invariable curve, a much smaller number might suffice to establish the existence of such a curve within certain limits. If these limits for the curves of different forms of composition from the same author turn out to be mutually exclusive, our hypothesis would be established, though we had not examined a sufficiently large number of words to determine the locus of the curves with accuracy. Thus, possibly, the work necessary to test our hypothesis might reduce itself to manageable proportions.
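The test proposed here, whether the limits of two sets of curves exclude one another, can be sketched as follows. This is a minimal illustration, not the procedure actually followed by hand: the function names are hypothetical, each curve is assumed to be a mapping from word-length to words per thousand, and a length absent from a curve is assumed to count as zero.

```python
def limits(curves):
    """Lowest and highest ordinate observed at each word-length."""
    lengths = set().union(*curves)
    return {n: (min(c.get(n, 0) for c in curves),
                max(c.get(n, 0) for c in curves))
            for n in sorted(lengths)}

def exclusive_lengths(curves_a, curves_b):
    """Word-lengths at which the two sets of limits do not overlap."""
    lim_a, lim_b = limits(curves_a), limits(curves_b)
    separated = []
    for n in sorted(set(lim_a) | set(lim_b)):
        lo_a, hi_a = lim_a.get(n, (0, 0))
        lo_b, hi_b = lim_b.get(n, (0, 0))
        if hi_a < lo_b or hi_b < lo_a:
            separated.append(n)
    return separated

# Made-up counts for two thousands of each form, for illustration only.
drama = [{3: 320, 5: 100}, {3: 335, 5: 110}]
essay = [{3: 255, 5: 105}, {3: 265, 5: 120}]
separated = exclusive_lengths(drama, essay)  # [3]: only the three-letter limits exclude each other
```

Word-lengths near the point where the two sets of curves cross will naturally not appear in the result, since there the limits overlap.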
The first author examined was Goethe. To eliminate as far as possible the disturbing effect of unconscious bias, I decided to count in word-groups of consecutive thousands, always beginning with the first of the work. Quotations, footnotes, headings and, in the case of dramas, stage-directions, etc., were uniformly omitted. These rules were strictly adhered to in all the data which follow. Five groups of one thousand words each were taken from each of Goethe's 'Bürgergeneral,' and 'Literatur Recensionen,' (B). The results were tabulated as follows:
Each thousand words was now plotted separately and the resulting two sets of five curves compared (Fig. 6 and Fig. 7). These results far exceeded my expectation. No curve of the one set could possibly be mistaken for any curve of the other set. Three-letter words, of which there were between 319 and 338 in each thousand of the first set, were reduced to 250 to 268 per thousand in the second set; nine-letter words, which did not exceed 26 in any thousand of one, rose to 73 in the other. A similar contrast prevailed in the relative frequencies of four-, seven-, eight- and ten-letter words in the two sets of data. In this case then, at least, five thousand words seemed sufficient to indicate the limits of the invariable curves which a larger number of words would yield, and these limits are actually exclusive except in the proximity of the intersection of the two sets of curves. The normal curve for each group of five thousand words is given in Fig. 8.
|Fig. 6. Five 1,000 Word-curves from Goethe's 'Der Bürgergeneral.' (See Table I.)||Fig. 7. Five 1,000 Word-curves from Goethe's 'Literatur: Recensionen.' (See Table I.)|
|Fig. 8. Two 5,000 Word-curves from Goethe. (Table I.) (A) Prose Drama ('Der Bürgergeneral'), (B) Criticism ('Literatur: Recensionen').|
Goethe may possibly be exceptional in manifesting such striking uniformity in the curves for successive thousands of the same work, and an equally striking divergence in any two curves belonging to different works. So I turned to Schiller. Ten thousand words were examined, five thousand from his 'Kabale und Liebe,' a prose drama, and five thousand from his 'History of the Thirty Years' War.'
A glance at the corresponding word-curves for each thousand words (Figs. 9 and 10) shows that here, too, five thousand words will determine the limits within which the unknown invariable word-curves will be confined with a sufficient definiteness to convince us that the curves can in no wise resemble each other. Four-letter words occur only about half as frequently in the 'History' as in the 'Play'; ten-, eleven-, twelve-letter and longer words are increased two and three-fold.
|Fig. 9. Five 1,000 Word-curves from Schiller's 'Kabale und Liebe.' (See Table II.)||Fig. 10. Five 1,000 Word-curves from Schiller's 'Thirty Years' War.' (See Table II.)|
Next I tabulated ten thousand words from Goldsmith, choosing the drama 'She Stoops to Conquer' and his essay on the 'Present State of Polite Learning in Europe.'
Goldsmith: 'Present State of Polite Learning in Europe.'
Here the results, graphically exhibited in Figs. 12, 13 and 14, are somewhat less satisfactory than in the case of Schiller or Goethe, yet even here any one-thousand word-curve of the one work is easily distinguished from all the curves of the other work. The most marked contrast is shown in the relative frequencies of two-, four-, eight-, nine- and ten-letter words.
|Fig. 12. Five 1,000 Word-curves from Goldsmith's 'She Stoops to Conquer.' (See Table III.)||Fig. 13. Five 1,000 Word-curves from Goldsmith's 'Present State of Polite Learning in Europe.' (See Table III.)|
|Fig. 14. Two 5,000 Word-curves from Goldsmith. (Table III.) (A) Drama 'She Stoops to Conquer,' (B) Essay 'Present State of Polite Learning in Europe.'|
Here not only are five thousand words sufficient to indicate that the invariable curves for the two kinds of writing differ essentially, but the number of four-letter words alone in any single thousand seems to distinguish the drama from the essay.
It seemed hardly necessary to augment these data, which may seem to the reader more than adequate to establish the multiplicity of the so-called characteristic curves of an author. Still I ventured another test. Suppose several five-thousand word-curves from different dramatic works of an author were constructed, and again several five-thousand word-curves of various other prose productions, as criticism or history, by the same author. Suppose it were found that each set of curves agrees in the main, but differs, in essential respects, from all the curves of the other set, could this be interpreted otherwise than that the nature of the composition is the determining factor of the curves? With this thought in mind, I tabulated four additional groups of five thousand words each from Goethe, two groups each taken from single works, the other two groups made up of single thousands from each of ten different productions. These, together with the five thousand averages previously obtained from the 'Bürgergeneral' and 'Literatur Recensionen' (B), are given in Table V., and the corresponding word-curves are given in Fig. 18. Fig. 19 shows the two curves which result if the entire fifteen thousand words are taken as the basis. These will approximately coincide with the invariable curves for the two kinds of composition in question.
|Fig. 15. Five 1,000 Word-curves from Dryden's 'Sir Martin Mar-all.' (See Table IV.)||Fig. 16. Five 1,000 Word-curves from Dryden's 'Essay on Satire.' (See Table IV.)|
Throughout our work we have used the word-curve as the basis of comparison. But the mere fact of divergence of such curves for different forms of composition could have been much more readily established by an inspection of the average word-lengths of various works. It will pay to compare carefully the numbers given in the last columns of our tables. None of the averages per thousand in Goethe's prose dramas exceeds 4.8 letters per word; in none of his other works examined do the averages fall below 5.4. The limits of the average word-lengths for the two forms of composition are thus seen to be not only exclusive, but they are separated by a wide gap. Goldsmith's averages, 4.0 and 4.9 letters per word, respectively, show a similar difference, and so do Schiller's and Dryden's averages. Doubtless this factor of average word-length alone, which can be determined with an expenditure of but a small fraction of the time required for the determination of the figures necessary to construct the word-curves, would in general be indicative of the nature of the curve, so that in critical cases only, the word-curve would need to be examined.
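The shortcut by way of average word-lengths is easily put into a sketch. The function names below are hypothetical, and a word-curve is again assumed to be a mapping from letters per word to frequency. The bounds 4.8 and 5.4 in the example are those quoted above for Goethe; the other two averages are invented for illustration.

```python
def average_word_length(curve):
    """Mean letters per word computed from a {letters: frequency} word-curve."""
    total = sum(curve.values())
    return sum(n * freq for n, freq in curve.items()) / total

def gap_between(averages_a, averages_b):
    """Width of the gap between two sets of averages.

    A positive value means the limits exclude each other;
    a negative value means the two ranges overlap.
    """
    return max(min(averages_b) - max(averages_a),
               min(averages_a) - max(averages_b))

# 4.8 and 5.4 come from the text; 4.5 and 5.9 are made-up companions.
goethe_gap = gap_between([4.5, 4.8], [5.4, 5.9])  # about 0.6 letters per word
```

Since only a single sum is needed for each thousand words, this check is far cheaper than tabulating the full frequencies at every word-length.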
The question still remains whether two word-curves of the same author may vary as much as the word-curves of different authors, that is, whether, so far as word-curves indicate anything, an author differs as much from himself as from other authors. This question can not be definitely answered until a large number of authors have been compared, that is, until we have obtained the maximum variation between authors, as well as the maximum variation between various forms of composition. But so far as the evidence at hand may be trusted, it is to the effect that the line of demarcation follows the form of composition rather than the author. Figs. 11, 11, 17 and 19 show variations that must be attributed to the form of composition; the difference in the curves of Fig. 20 may reasonably be ascribed to the same cause. Fig. 21 shows four five-thousand word-curves, representing two authors and two styles of writing. The curves representing the same style, not the same author, follow each other. Fig. 22 contains four word-curves of dramas (Shakespeare, Beaumont and Fletcher, Marlowe and Jonson), and four word-curves from the prose writings of Bacon, Dryden, Goldsmith and Mill. While the latter show considerable variations among each other, they are all clearly differentiated from each of the drama curves.
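The comparisons in this article are made by eye, from the plotted curves. Should one wish to put a number on how far two word-curves diverge, one possible measure, which is an assumption of this sketch and not a method used in the investigation, is the sum of the absolute differences of the ordinates. The function names and the sample figures below are hypothetical.

```python
def curve_distance(curve_a, curve_b):
    """Sum of absolute differences in words per thousand over all word-lengths."""
    lengths = set(curve_a) | set(curve_b)
    return sum(abs(curve_a.get(n, 0) - curve_b.get(n, 0)) for n in lengths)

def nearest(target, labelled_curves):
    """Label of the curve lying closest to the target under curve_distance."""
    return min(labelled_curves,
               key=lambda label: curve_distance(target, labelled_curves[label]))

# Invented ordinates: a drama by author X, compared with a drama by
# author Y and an essay by the same author X.
drama_x = {3: 330, 9: 20}
others = {"drama by Y": {3: 320, 9: 25}, "essay by X": {3: 255, 9: 70}}
closest = nearest(drama_x, others)
```

With these invented figures the drama by another hand lies nearer than the essay by the same hand, which is the pattern that Figs. 21 and 22 are said to exhibit.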
The theory of characteristic curves is exactly parallel to that of constant sentence proportions. Both rest upon the same fallacy—that personal peculiarities outweigh all other determining factors to such an extent as to make it unnecessary to consider them. Elsewhere I have shown that the average sentence length, instead of being invariable for a given author, varies between wider extremes for different styles employed by the same author than for different authors writing in the same style. Goethe alone shows average sentence lengths varying from 5 to 38 words per sentence. Is it not likewise probable that a more extended inquiry would reveal, in the case of versatile writers like Goethe, Voltaire and others, not two only, but a whole series of invariable word-curves, distributed something like the curves in Fig. 2?
It was the theory of spectrum analysis which first suggested to Dr. Mendenhall the analogous conception of word-spectra or characteristic curves. Just as the light-rays of various wave-lengths emitted by a substance combine to form its spectrum, so a combination of words of various lengths in proper definite ratios makes up an author's word-spectrum or characteristic curve. The analogy is imperfect, but we admit it. But is it true that each substance has a single spectrum? This was the supposition when the science of spectrum analysis was in its infancy, and upon this supposition Dr. Mendenhall bases his analogy. The fact is that over forty years ago it was demonstrated that some substances have several spectra, and to-day it is generally believed that all substances have several spectra, corresponding to the several stages of dissociation or molecular composition of their molecules. The analogy to spectrum analysis, therefore, demands the modification of the theory of characteristic curves, which I have tried to point out in the preceding pages.
- The Sherman principle in rhetoric and its restrictions. Popular Science Monthly, October, 1903. On the variation and functional relation of certain sentence constants in standard literature, University (of Nebraska) Studies, July, 1903.