The Technologies on which the Project is Based
Two primary technologies form the basis of the computerization of texts, i.e., the conversion of printed material to digital files, known professionally as the digitization of texts:
- Scanning technology
- Writing identification technology (OCR—Optical Character Recognition)
Scanning means photographing printed material and converting it to a simple picture file. In this sense, scanning a drawing and scanning a text yield exactly the same product, since the computer treats both as images and does not distinguish between them based on content. Writing identification technology (OCR) was developed to overcome this limitation: it allows a picture file that includes text to be turned into a searchable text file by recognizing patterns of dots in the picture as letters within words.
These two technologies (scanning and OCR) are relatively long-standing: scanning technology was successfully implemented at the end of the 1950s, and writing identification technology was put to commercial use during the 1960s. Both have been greatly developed over the years, and the programs that implement them today are able to cope with a broad range of languages and fonts. Moreover, the rate of correct identification for every language has risen significantly in recent years, and, provided that the quality of both the original material and the scan is good, it is possible to achieve identification accuracy of over ninety percent.
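The accuracy figure above can be made concrete with a small sketch. The text does not say how accuracy is measured, so this example assumes a character-level measure: the share of characters in a hand-checked reference text that OCR reproduced correctly, with errors counted by edit distance. The function names and sample strings are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Assumed accuracy measure: 1 - (edit errors / reference length)."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# Two misread characters in a 16-character reference string.
print(character_accuracy("historical press", "hist0rical pres5"))
```

Under this assumed measure, a result above 0.9 corresponds to the "over ninety percent" figure mentioned in the text.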
In the transition from the digitization of simple texts (for example, letters or official documents) to newspapers, the importance of a third technology becomes apparent:
- Segmentation technology
Segmentation means the division of the scanned page into the distinct, logical sections from which it is assembled. For a newspaper page in particular, segmentation is the division of the page into the different articles it contains. Without this division, the newspaper page constitutes the smallest possible searchable unit, and the format of the search results will be based upon how many times the searched-for subject appears on the page. Clearly this method is highly problematic for the organization of search results, because within a newspaper the basic unit of information is not the page but rather the article, which is likely to take up only a small part of the page and may very well be continued on subsequent pages. Thanks to segmentation technology, a field in which Olive Software is one of the leaders, the user can obtain search results based on the original articles and the relevance of each article to the subject of the search.
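The difference between page-level and article-level search can be sketched as follows. This is a minimal illustration and not the way Olive Software implements segmentation: it assumes the page has already been divided into articles, and the `Article` and `Page` types and both search functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Article:
    title: str
    text: str


@dataclass
class Page:
    number: int
    articles: List[Article]


def search_pages(pages: List[Page], term: str) -> List[Tuple[int, int]]:
    """Without segmentation: the whole page is the smallest searchable unit."""
    term = term.lower()
    hits = []
    for page in pages:
        page_text = " ".join(a.title + " " + a.text for a in page.articles).lower()
        count = page_text.count(term)
        if count:
            hits.append((page.number, count))
    return hits


def search_articles(pages: List[Page], term: str) -> List[Tuple[int, str, int]]:
    """With segmentation: each article is searched and reported on its own."""
    term = term.lower()
    hits = []
    for page in pages:
        for article in page.articles:
            count = (article.title + " " + article.text).lower().count(term)
            if count:
                hits.append((page.number, article.title, count))
    return hits


pages = [Page(1, [Article("Harbor strike", "Workers at the harbor stopped work."),
                  Article("Weather", "Rain expected.")])]
print(search_pages(pages, "harbor"))     # page-level hit only
print(search_articles(pages, "harbor"))  # hit attributed to one article
```

The page-level search can only say that the term occurs somewhere on page 1, whereas the article-level search points to the specific article that mentions it.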
How the Technologies are Implemented by the 'Historical Jewish Press' Website
The scanning of a newspaper is done from one of three possible sources: paper, microfilm, or microfiche. Every effort is made to use the very best copy, which is determined by both quality and clarity of writing and the completeness of the inventory of the newspaper editions. This is no easy task, especially since newspapers undergo a constant process of wear and disintegration. In this sense, efforts to scan archival material, particularly historical newspapers, are part of a larger undertaking: the preservation of information and knowledge which might otherwise be lost forever.
The two additional technologies—writing identification and segmentation—operate when the ActivePaper software adapts the scanned pages into electronic versions of the newspaper. This stage, which is largely automatic, includes identification of all the different elements of each article, which—as stated above—is the fundamental building block of the newspaper:
- Main body of article
- Accompanying illustrations or photographs
Within each of these elements the searched-for word or phrase is identified, and to each result is assigned a corresponding level of relevance to the search. Thus, for instance, when we search for a certain keyword (e.g., a name) the system will give preference to articles in which that keyword appears in the title over other articles in which that same keyword merely appears in the main body of the article.
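The preference for title matches over body matches can be illustrated with a simple weighted score. The weights below are assumptions chosen for the example, and the scoring formula is a generic sketch rather than the actual ranking method used by the website.

```python
# Assumed weights: a match in the title counts three times as much as
# a match in the main body. The real system's weights are not documented here.
TITLE_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def relevance(title: str, body: str, keyword: str) -> float:
    """Weighted count of keyword occurrences in the title and the body."""
    kw = keyword.lower()
    return (TITLE_WEIGHT * title.lower().count(kw)
            + BODY_WEIGHT * body.lower().count(kw))


def rank(articles, keyword):
    """Return (score, title) pairs for matching articles, best match first."""
    scored = [(relevance(a["title"], a["body"], keyword), a["title"])
              for a in articles]
    scored = [s for s in scored if s[0] > 0]
    return sorted(scored, key=lambda s: s[0], reverse=True)


articles = [
    {"title": "Market prices", "body": "Herzl spoke briefly."},
    {"title": "Herzl arrives", "body": "Crowds gathered."},
]
for score, title in rank(articles, "Herzl"):
    print(score, title)
```

With these weights, the article that carries the keyword in its title is listed ahead of the one that mentions it only in the body.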
The final product of the processing stage is a vast collection of files which constitute the electronic version of a publication. Each article is composed of image files of the original document and text files of the content as identified by OCR. What the user sees when viewing an article is actually that article's image, whereas the identified text is stored "behind" that image and is what the search operates on. Presentation of the newspapers is achieved using Extensible Markup Language (XML) technology, which enables the future adaptation of the material to other platforms.
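The idea of an article image with identified text "behind" it can be sketched as a small XML record. The element and attribute names (`article`, `image`, `text`, `id`, `src`) are hypothetical and chosen for illustration; they are not the actual schema used by the ActivePaper software.

```python
import xml.etree.ElementTree as ET


def build_article_xml(article_id: str, image_file: str, ocr_text: str) -> str:
    """Build a hypothetical XML record pairing an article image with its OCR text."""
    article = ET.Element("article", attrib={"id": article_id})
    ET.SubElement(article, "image", attrib={"src": image_file})  # what the user sees
    text = ET.SubElement(article, "text")                        # what the search uses
    text.text = ocr_text
    return ET.tostring(article, encoding="unicode")


def extract_search_text(xml_string: str) -> str:
    """Read back the hidden OCR text from the record."""
    root = ET.fromstring(xml_string)
    return root.findtext("text", default="")


record = build_article_xml("a1", "a1.png", "Text as identified by OCR")
print(record)
print(extract_search_text(record))
```

Because the record is plain XML, any platform with an XML parser can read the same image reference and text, which is the kind of portability the paragraph describes.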
Although the three main technologies of which this website makes use (scanning, OCR, and segmentation) are mature and time-proven, they are still not perfect. Neither OCR nor segmentation identification has reached a 100% standard of accuracy, and the poorer the quality of the material, the lower the level of accuracy in identification. Because the Historical Jewish Press website works with newspapers from the past, and sometimes even the distant past, we are compelled to deal with many different phenomena that threaten to ruin the identification process. These phenomena include inferior quality of printing (which characterizes early publications), yellowing paper, marks or damage in the original printing, unique fonts, torn pages, scribbled-on pages, and even pages damaged by rodents.
Technological limitations combine with the limitations of the raw materials with which we work, and together they produce two primary problems that the user is likely to encounter: word identification errors and segmentation errors. Word identification errors take one of two forms: words that are present but not identified, or words that are identified incorrectly. In the first case, the user will see that a certain word is present in the article but fails to come up during a textual search. In the second case, the user will see that the identified word is not the same as the word searched for. Both occurrences are known phenomena that the user needs to take into account. Despite this limitation, the chance of finding entries is not significantly reduced, because for the most part the sought-for word or phrase appears more than once in the article. Even if an identification error occurs the first time the word appears, there is an excellent chance that identification will succeed on a later occurrence and the article will appear on the list of search results.
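The reasoning about repeated words can be put into numbers. The sketch below assumes that each occurrence of a word is identified correctly with the same probability and independently of the others, which is a simplification (damage on a page often affects neighboring words together). The 90% figure is illustrative only.

```python
def chance_of_match(word_accuracy: float, occurrences: int) -> float:
    """Probability that at least one of several occurrences of a word is
    identified correctly, assuming independent errors (a simplification)."""
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must not be negative")
    return 1.0 - (1.0 - word_accuracy) ** occurrences


# Illustrative per-word accuracy of 90%.
for n in (1, 2, 3):
    print(n, round(chance_of_match(0.9, n), 4))
```

Under these assumptions, a word that appears twice is missed only when both occurrences fail, so the chance that the article is found rises sharply with each repetition.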
The second problem the user may encounter is segmentation errors. Here, too, errors are likely to arise in one of two ways: either several articles are identified together as one unit, or one article is identified as several different articles. As a rule, segmentation errors are less critical than word identification errors, because they do not prevent the finding of articles matching the sought-for concept but are simply likely to disrupt the organization of the results. Segmentation errors may, however, cause a certain inconvenience, since the user may need to access the full page of the newspaper and identify the proper boundaries of the article. The Historical Jewish Press website makes every effort to minimize both word identification and segmentation errors.
In conclusion, it is important to remember that the newspapers are searched as free text. If a certain query yields no search results, or only a few, there is a good chance that the spelling of the query was not accurate. This is likely to occur because of a simple error in spelling or because in the past that word or phrase was spelled differently (for more on this topic, see Content-related Aspects of the Project).
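One way a user or a tool might cope with spelling differences is to look for indexed words that are close to the query. The sketch below uses Python's standard `difflib.get_close_matches` for approximate matching; it is an illustration of the general idea and not a feature the website is known to provide. The word list and the similarity cutoff are assumptions.

```python
from difflib import get_close_matches


def suggest_variants(query: str, indexed_words, cutoff: float = 0.8):
    """Return indexed words whose spelling is close to the query.

    indexed_words stands in for the vocabulary found in the OCR text;
    cutoff is an assumed similarity threshold between 0 and 1.
    """
    vocabulary = sorted({w.lower() for w in indexed_words})
    return get_close_matches(query.lower(), vocabulary, n=5, cutoff=cutoff)


# Hypothetical vocabulary containing an older or misprinted spelling.
words = ["Jerusalm", "Jaffa", "journal", "harbor"]
print(suggest_variants("Jerusalem", words))
```

A query that returns nothing could then be retried with each suggested variant. Lowering the cutoff finds more distant spellings at the cost of more unrelated suggestions.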