Some common mistakes to avoid in scientific writing
As a referee for papers submitted to journals and
conferences in the area of computational linguistics,
I frequently encounter a few mistakes. Since conferences do not
usually have copy-editors to make sure these mistakes are eliminated before
the papers are printed as part of proceedings, the only opportunity to
eliminate them is for the referees to point them out to the
authors in the referee report. Many of the mistakes are however of a
trivial nature, and much of the referees' precious time can be
saved if the authors make sure the mistakes do not occur in the first place.
It is also very much in the interest of authors to avoid these mistakes, since
finding too many of them may give the referee a bad impression of the
authors and their work, and may eventually lead to a less favourable
recommendation on acceptance.
The following is intended as a first attempt at making a list of such
trivial, easily avoidable mistakes. If you have anything to
add to this list, please let me know
(markjan@let.rug.nl).
Latex
- It is a common mistake to misuse math mode to write a word in italics.
Thus, one may be tempted to write thingummy in Latex as:
$thingummy$
Especially for some letters,
this practice leads to strange-looking output, since Latex interprets the
individual letters ''t'', ''h'', ''i'', ... as separate mathematical objects, and
inserts spacing between these objects in accordance
with the meanings they commonly have in mathematics. Write instead:
{\it thingummy}
If the objective is to emphasize a word rather than to write it in italics
irrespective of context, write:
{\em thingummy}
In a formula write:
$2 * {\it thingummy}^5$
- A small but frequent mistake among beginning authors concerns
abbreviations.
If one writes e.g.:
John ate some piece of fruit, viz. an apple, i.e. he devoured it.
Latex will assume that one sentence ends after ''viz.'' and a second after ''i.e.'',
and will add extra inter-word spacing after each. To prevent this, insert ''\''
followed by white-space (e.g. a blank or newline):
John ate some piece of fruit, viz.\ an apple, i.e.\ he devoured it.
This will handle most abbreviations correctly, but in some special cases
one needs to consult the fine print of a Latex manual.
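The two points above can be tried out in a small test document. The
following is only a sketch under some assumptions: the commands \textit, \emph
and \mathit are not mentioned above, but they are the LaTeX2e counterparts of
{\it ...} and {\em ...}, and the use of \@ illustrates just one of the special
cases from the fine print. The example sentences are invented for illustration.

```latex
\documentclass{article}
\begin{document}

% Math mode treats each letter as a separate variable, so the spacing is wrong:
Wrong: $thingummy$

% Text italics, irrespective of context:
Right: \textit{thingummy} (or the older {\it thingummy})

% Emphasis, which adapts to the surrounding font:
Right: \emph{thingummy}

% A multi-letter identifier inside a formula:
Right: $2 * \mathit{thingummy}^5$

% "\ " after an abbreviation gives a normal inter-word space:
John ate some piece of fruit, viz.\ an apple, i.e.\ he devoured it.

% Special case: a sentence ending in a capital letter needs \@ before the
% period, otherwise LaTeX assumes an abbreviation and omits the extra space.
The parser was written in C\@. It runs on UNIX.

\end{document}
```

Compiling this and comparing the first line of output with the others shows
the difference in letter spacing directly.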
Language
If one compares the respective
styles of writing in, say, computer science and computational
linguistics, one finds that the latter seems to be bound by less strict rules.
This is not because one needs more `sophisticated' language in order
to describe the intricacies of how to process linguistic material by
computational means, but simply because the need for clear
and concise scientific
writing has not yet sunk in with a considerable part of our community.
If you feel computational linguistics is not a science such as mathematics,
biology or chemistry, you should write whatever way you like.
Otherwise, I offer the following simple guidelines.
- So-called contractions, such as ''don't'' and
''they've'',
should be avoided in scientific writing.
Such contractions occur once in a while in a textbook for use
in the classroom, but not in journals or proceedings,
or at least they ought not to.
Write in full ''do not'' and ''they have''.
- Colloquial expressions and witticisms
are not acceptable. Examples of objectionable phrases
are ''but this is no big deal'', ''method A beats method B'',
''this is the take-home lesson of this paper'',
''and then phenomenon XYZ rears its ugly head again''.
In all such cases, there
is an equivalent way of expressing oneself that
1) is clearer, and 2) does not show
off the authors' knowledge of street language.
A very serious kind of witticism involves a choice
of terminology based on a desire to be funny rather than to be clear.
A good example is an article discussing the ''three-dimensional
matching'' problem, which is well-known under this name in the literature
on NP-completeness. However, the author of the aforementioned article
prefers to call this the ''menage a trois'' problem, apparently because
he found this a good joke. Needless to say, the confusion caused by
introducing this new name for something that is already well-known under
another name far outweighs the few meagre laughs it may induce.
- Authors should always run the text through a spell checker
before submitting, and remove all typos found in this way.
This requires minimal effort, and avoids
making a sloppy impression on the referees.
For UNIX, all that is required is:
spell paper.tex
(assuming spell has been installed, of course).
This admittedly also outputs many strings that are not spelling errors.
To remove some of these strings, one may opt for:
detex paper.tex | spell
Experienced authors may use more sophisticated scripts for spell checking
of Latex documents; I offer some suggestions.
Note that spell checkers cannot find all kinds of typos. The use of a
spell checker does not exempt the authors from the obligation to
carefully check the complete text before submitting.
Descriptions of Algorithms
The computational linguistics literature has an abysmal tradition when it
comes to precise descriptions of algorithms. Even today, not enough
computational linguists
are aware of the possibility of using pseudo code, which has been
very common in the computer science literature for decades.
Such code contains e.g.
- self-explanatory keywords such as
IF, THEN and WHILE,
which directly relate to concepts from imperative programming languages
known to virtually everyone who has had any programming experience,
- high-level mathematical notation involving e.g. operations
on sets, strings and other well-known data-structures, and
- where this is possible without introducing ambiguity,
descriptions in natural language of steps of the algorithm.
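As an illustration of these three ingredients, the following is a minimal
sketch of how pseudo code might be typeset in Latex. It rests on assumptions
that are not part of the text above: the packages algorithm and algpseudocode
are only one of several options for typesetting pseudo code, and the algorithm
shown (tabular recognition for a grammar in Chomsky normal form) was chosen
merely as a familiar example.

```latex
\documentclass{article}
\usepackage{algorithm}
\usepackage{algpseudocode}
\begin{document}

\begin{algorithm}
\caption{Recognition for a grammar in Chomsky normal form (illustrative only)}
\begin{algorithmic}[1]
\Require grammar $G = (N, \Sigma, P, S)$ in Chomsky normal form,
         input $a_1 \cdots a_n$ with $n \geq 1$
\Ensure \textbf{true} if and only if $a_1 \cdots a_n \in L(G)$
\For{$i \gets 1$ \textbf{to} $n$}
  \State $T_{i-1,i} \gets \{ A \mid (A \rightarrow a_i) \in P \}$
\EndFor
\For{$d \gets 2$ \textbf{to} $n$} \Comment{$d$ is the length of the span}
  \For{$i \gets 0$ \textbf{to} $n - d$}
    \State $k \gets i + d$
    \State $T_{i,k} \gets \emptyset$
    \For{$j \gets i + 1$ \textbf{to} $k - 1$}
      \State $T_{i,k} \gets T_{i,k} \cup
             \{ A \mid (A \rightarrow B\,C) \in P,\ B \in T_{i,j},\ C \in T_{j,k} \}$
    \EndFor
  \EndFor
\EndFor
\State \Return $S \in T_{0,n}$
\end{algorithmic}
\end{algorithm}

\end{document}
```

The keywords are self-explanatory, the table entries are described by plain
set notation, and nothing in the description depends on a particular
programming language or dialect.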
It is not at all a good idea to use particular programming languages,
such as Lisp or Prolog, to describe an algorithm in a scientific paper,
for the following reasons:
- Despite the importance of some programming languages for
software engineering, there
is no basis for insisting that every computational linguist should know
Prolog, or that someone is computer-illiterate unless that person has a
working knowledge of Lisp. It is even more presumptuous to imagine
any reader will want to learn a programming language just to be able
to understand one particular ''scientific'' paper.
- Even relatively compact programming languages such as Prolog may not allow
a complete algorithm to be described within the page limits set by
journals or conference proceedings. Many authors in that case
provide merely what they regard as key fragments of the code,
expecting that the reader will be able to figure out the rest.
In reality however, readers will seldom be able to reconstruct a
complete algorithm from a few fragments and inadequate explanation
in running text, and even if they can,
they cannot be sure their reconstruction
of the algorithm is what the author had in mind, which
brings us back to the problem of providing
precise, unambiguous descriptions.
- That a computer may process a program in a programming
language does not make it unambiguous, unless one specifies exactly
what dialect of that language one is dealing with, but even then the
reader may misunderstand because of having a different dialect in mind.
Structure
A paper should be divided into sections
and paragraphs in a sensible way. Specifically:
- The abstract and the introduction should be
used in an appropriate way. It is often observed that the text from the abstract
is repeated verbatim in the introduction. This is plainly a waste of ink, and
means that either the abstract is too detailed, or the introduction is too vague.
An abstract should tell the reader in a few lines what the paper is about;
the introduction should introduce the reader to the subject matter
and to the material presented
in the sequel. Composing these two pieces of text in a proper way
obviously requires two different levels of abstraction.
- A section called ''Conclusions'' should only contain conclusions with
respect to material presented in the preceding text.
It should not introduce any new material.
- It is a mistake to assume the reader will study the complete paper from
start to finish. With some luck a few readers will, but the majority will
skim through the abstract and introduction, and possibly have a look at a few
paragraphs here and there.
To ensure that at least some of the material from the paper is open to such
casual readers, the text should contain a few passages at
well-chosen locations which
convey the main ideas of the paper, and which can be understood without
the need to read every single sentence or formula in the preceding text.
Typically, such passages will be found in the introduction, but they can
also be extremely helpful at the
beginnings of more technical sections further on.
Depending on the subject matter, illustrations may be a further help to
convey the main ideas to the casual, or attentive, reader.
An extreme example of a badly written article would be a text consisting of
program code, only occasionally interspersed with comments, requiring the
reader to study every single routine and subroutine in order to understand
what the paper is about.
Claims
A claim made in a paper should be supported by verifiable arguments or data
that give sufficient credibility to the claim. Science is about
dialogue and exchanging arguments, not about spreading propaganda.
Exactly how much support should accompany a claim is difficult to say,
but it is generally true that the less intuitive and the more surprising
a claim is, the more support it needs.
E.g., if a paper discusses a new grammar formalism accompanied by
only a few examples of language phenomena described in that formalism,
it would be outrageous if the authors included the statement
''our new formalism now makes all other grammar formalisms
redundant''. This pretentious statement
would make the paper unsuitable for publication, although the new formalism
could actually be a genuine contribution to the field.
Concerning claims of the novelty of material,
it is the job of the authors to include references to relevant
publications, and it is the job of the referees to check for any
missing references. However, literal claims made in papers that the authors are
the first ever to come up with some idea or to make an implementation of
something are often met with scorn, since the breadth and depth of the
existing literature on computational linguistics makes it very hard to verify
such claims, although actually refuting such a claim is often
not even that difficult.
It should further be noted that
(misplaced) claims of novelty do not benefit the chances
of acceptance of a paper, as some authors may naively expect.
It seems valid however to write ''In the existing method A, processing
works this way, whereas in
our new method B, processing works that way'', but the use of the word ''new'' here
should be seen as a figure of speech with the intention of creating
a contrast between the described existing literature and the material in the
present paper, not as an absolute guarantee that no one has ever before
published or implemented method B.
Concluding Remark
Avoiding the common pitfalls above is of course not a guarantee for obtaining
a well-written article.
The main difficulty is how to describe scientific work such
that someone other
than the authors understands what was meant.
Inexperienced authors may benefit from academic writing courses.
Bibliography
- R.R. Jordan.
Academic Writing Course. Collins ELT, 1990.
This book may be helpful especially to those with a poor knowledge of
English. For native speakers and advanced language learners however
it mainly contains truisms, and only an occasional helpful remark
pertaining to academic writing.
I have not tried to find more appropriate textbooks on
academic writing, but I am sure they exist.