LPedia:Text Conversion Guidelines

Conversion of the text of existing documents for presentation as wiki text can be a complicated process, and there are a number of issues that may need to be addressed for which the best approach may not be immediately obvious. The following are guidelines, developed based on experience in converting various kinds of documents, which may be useful in future work of this type.

General Guidelines

In general, the wiki text will not correspond exactly, in every respect, to the "original". Some features of a text arise only from its form of presentation, and some conventions apply or make sense only when text is formatted in a particular way. There may even be multiple "originals" -- different presentations of the text, dating from the time when the document was created or in active use -- that differ in these respects. What we want to preserve and present as wiki text is what might be thought of as "just the text itself", not the features or artifacts that arose simply because of the way it was presented at some time in the past.

Features that we do want to preserve:

  • all the words, including the original grammar and spelling
  • original punctuation, except where used only as part of "formatting" (see below)
  • original paragraph structure
  • original section/subsection structure
  • highlighting of particular words/phrases (for emphasis or other reasons)
  • numbering/lettering of lists

Features that we usually do not want to preserve include:

  • pagination (including text that exists only because of pagination, such as headers/footers)
  • right-justification
  • hyphens that were introduced as part of line-wrapping
  • use of specific fonts or character sets
  • use of "control characters" (e.g., tab, backspace) to create effects like indentation
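As an illustration of removing hyphens introduced by line-wrapping, the following is a minimal Python sketch, not an established LPedia tool. The function names and the `keep_hyphen` parameter are hypothetical. It assumes plain-text input with ASCII letters, and it merges the two affected lines, which is harmless where the wiki reflows paragraphs anyway. Because a hyphen at the end of a line may belong to a genuine compound, the candidates should be reviewed against the original before anything is joined.

```python
import re

# A run of letters, a hyphen at the very end of a line, then letters on the
# next line. ASCII letters only; this is a simplifying assumption.
WRAP = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)")

def wrap_hyphen_candidates(text):
    """List (first_half, second_half) pairs split by an end-of-line hyphen."""
    return [(m.group(1), m.group(2)) for m in WRAP.finditer(text)]

def join_wrap_hyphens(text, keep_hyphen=frozenset()):
    """Rejoin words split across lines.

    Compounds listed in keep_hyphen (lowercase, e.g. "well-known") keep
    their hyphen; every other candidate is joined without one.
    """
    def repl(match):
        first, second = match.group(1), match.group(2)
        compound = first + "-" + second
        if compound.lower() in keep_hyphen:
            return compound
        return first + second
    return WRAP.sub(repl, text)
```

A typical use would be to print `wrap_hyphen_candidates(text)` first, decide which pairs are real compounds, and pass those to `join_wrap_hyphens` in `keep_hyphen`.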

There are some aspects of "formatting" that we do usually want to preserve but for which the specific details are not important. In some cases the details varied even among different presentations at the time. For these features of the text, our general approach should be to use wiki text formatting features which convey the same general sense even though they may result in a different detailed appearance.

Specifically:

  • section headings may be represented using ordinary wiki heading style
  • indentation of paragraphs or lists can be done with <blockquote>
  • special characters that were used to create graphic effects (e.g., a line or border) can be replaced with something else that serves the same purpose
  • hyphens intended to represent a dash (double hyphens were standard for this in typewritten text) may be replaced with an actual dash character
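The double-hyphen replacement can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the choice of the em dash (U+2014) as the replacement character is an assumption, not something these guidelines prescribe. Runs of three or more hyphens are left alone, on the assumption that they were drawn lines rather than dashes.

```python
import re

# Exactly two hyphens, not part of a longer run of hyphens.
DOUBLE_HYPHEN = re.compile(r"(?<!-)--(?!-)")

def replace_double_hyphens(text, dash="\u2014"):
    """Replace typewriter-style double hyphens with a dash character.

    The default dash (U+2014) is an assumed choice; pass another
    character, such as "\u2013", if a different dash is preferred.
    """
    return DOUBLE_HYPHEN.sub(dash, text)
```

Single hyphens inside words are not touched, so ordinary hyphenated compounds survive unchanged.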

Dealing With Errors

These guidelines apply to the process of converting a document, i.e., to the work done to bring it into LPedia as wiki text. Once this job is done, it will only rarely be appropriate to make or even consider "corrections". Anyone who notices what appear to be errors in an existing page in the Document space should exercise great caution. Before making any changes, please review the Talk page for the document and the conventions that are being used for other documents of a similar type and, in case of any doubt, check with the person who did the original conversion when possible.

In general, to decide whether to correct what appears to be an error, we need to judge whether the error was in "the original" or arose in the course of conversion. In many cases this is easy to decide by reference to "an original": many types of errors will so obviously have been introduced by the conversion process that it is clear they should be corrected. In essence, we are then just correcting our own work, not the document itself.

There are cases, however, where the text from which we are working may not be entirely definitive. If the error was recognized and somehow corrected at the time, there is no good reason that we should perpetuate the error, just because the particular copy from which we did the conversion still had the error. Evidence that the "original" from which we are working may itself not be "correct" may include:

  • the existence of different copies from the time only some of which contain the error
  • explicit mention at the time that an error was noted/corrected (e.g., in minutes)
  • implicit correction at the time (e.g., error disappears in later version without explicit change)

In these cases, some judgement will be required to decide what is the "real" text, and in some cases it may even make sense to make a correction that results in wiki text for which no corresponding physical "original" can be found. To the extent that it is still possible to find people who were involved in the writing/editing/printing of the original document, it may be helpful to seek their advice. They may remember what happened, or even have private notes that could shed light on the issue.

Spelling Errors

Since we are trying to present the original text, in general we do not want to be "correcting" spelling errors. In some cases it may not even be completely clear that what appears to us to be an error really was an error -- the original author/editor may have preferred an alternative spelling, or created a new word to make some point. But even in cases where we can be pretty sure that it was an error, the goal of presenting the original text suggests that we should not make a change.

OCR Errors

Documents converted to wiki text by scanning a hardcopy and running OCR software will typically contain errors. If the original was in good condition and the software is good, there may be very few; in other cases there may be many, sometimes so many that the result is obviously defective even to the casual reader. We want to minimize these errors, and correcting them is always the right thing to do.

Common OCR conversion errors:

  • wrong letter/number: e.g., 1/I, 5/S, 0/O
  • wrong punctuation: hyphen vs. dash, comma vs. period
  • extraneous or missing punctuation (especially periods and commas)
  • extraneous or missing white space (words broken up or glued together, extra blank lines)
  • mis-interpretation of hyphens at the end of a line
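Some of the error types listed above leave recognizable traces in the text. The following Python sketch flags a few of them for human review. The patterns and labels are illustrative assumptions, not a complete or authoritative list, and they will produce false positives (for example, "5th" or "e.g. in"). Each hit is only a prompt to check the original, not a correction.

```python
import re

# Hypothetical patterns; extend or tune them for the document at hand.
SUSPECT_PATTERNS = {
    # A word mixing letters with 0, 1 or 5 (possible O, I/l or S misread).
    "digit inside word":
        re.compile(r"\b(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*[015])[A-Za-z0-9]+\b"),
    # Whitespace directly before a punctuation mark.
    "space before punctuation":
        re.compile(r"\s[,.;:](?=\s|$)"),
    # A period glued between a lowercase and an uppercase letter.
    "no space after period":
        re.compile(r"[a-z]\.[A-Z]"),
    # A period followed by a lowercase word (possibly a misread comma).
    "lowercase after period":
        re.compile(r"\.\s+[a-z]"),
}

def flag_suspects(text):
    """Return (line_number, label, matched_text) for each suspicious spot."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((number, label, match.group(0)))
    return hits
```

The line numbers make it easy to jump to each spot in a text editor and compare it with the best available original.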

Approaches for identifying these kinds of errors:

  • Proofreading -- read the text carefully, note what appear to be errors, and compare to the best available original.
  • Searching for problems -- use a text editor to look for characters or patterns that are often mis-converted.
  • Spell-checking -- automatic spell-checking may help identify lines/pages that got garbled in the scan/OCR process and warrant more careful proofreading. (But misspellings by themselves are not evidence of an error in the conversion and should not necessarily be corrected -- see above.)
  • Text comparison -- use software to find all differences between the suspect text and text converted another way or text which should be mostly the same (e.g., an earlier or later version of the same document), and then sort the differences into (1) errors in the subject text, (2) errors in the other text, (3) differences that exist for some other reason.
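The text comparison approach can be sketched with Python's standard `difflib` module. This is one possible way to do it, under the assumption that both texts are available as plain strings; the function name is hypothetical. Comparing word by word rather than line by line keeps differences in line-wrapping from showing up as differences in the text. Sorting the results into the three categories remains a matter of human judgement.

```python
import difflib

def word_differences(suspect, reference):
    """Return (kind, suspect_words, reference_words) for each difference.

    kind is one of difflib's opcode tags: "replace", "delete" or "insert".
    Whitespace and line breaks are ignored because both texts are split
    into words before comparison.
    """
    a = suspect.split()
    b = reference.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return differences
```

For example, comparing an OCR result with a later version of the same document lists every word that differs, after which each difference can be checked against the best available original.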