The task of this step is to convert the paper pages of the book into TIFF files at a resolution of 300 dpi. This resolution is enough for book text of a normal (“readable”) size. Fine print, or a wish to preserve fine details of illustrations, may require a higher resolution. Look through your scanner's settings. The output must be image files in TIFF format, one sheet per file. No multi-page TIFFs (several pages in one TIFF file), no PDFs, and no OCR (text recognition)!
At this stage you also need to decide whether to scan the book in color or in grayscale. Scanning in strictly black and white (b & w) is usually not recommended, because the scanner then has to decide which pixels become black and which become white. A fold in the page, for example, can be rendered as black and produce black stripes and spots, and, even worse, these spots can cover the black text. Such “black on black” cannot be cleaned up afterwards. If the spot (stripe or other defect) is gray, or some other color in a color scan, while the text is black, the defect can be removed at the cleaning stage by removing the color of the spot from the image. A strictly black-and-white scan can also thin and break the strokes of the letters, so that a “d”, say, looks like “cl”. For high-quality scanning, therefore, act as if the b & w option did not exist.
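The effect can be shown on a toy example. The sketch below is not any scanner's actual algorithm: the gray levels, the threshold of 128, and the function names are assumptions chosen only for illustration. It models one scan line in which dark text crosses a gray fold shadow.

```python
# Toy illustration: why a 1-bit scan can make a defect impossible to remove.
# All gray levels and thresholds are made-up assumptions (0 = black, 255 = white).

PAPER, SHADOW, TEXT = 240, 110, 20

# One scan line: clean paper, then text lying inside a gray fold shadow.
row = [PAPER, SHADOW, TEXT, SHADOW, TEXT, SHADOW, PAPER]


def to_black_and_white(pixels, threshold=128):
    """Mimic a 1-bit scan: every pixel darker than the threshold becomes black."""
    return [0 if p < threshold else 255 for p in pixels]


def clean_grayscale(pixels, text_max=60):
    """Clean a grayscale scan: keep only pixels dark enough to be text."""
    return [0 if p <= text_max else 255 for p in pixels]


bw = to_black_and_white(row)
cleaned = clean_grayscale(row)

print("1-bit scan:       ", bw)       # shadow and text are now the same value
print("cleaned grayscale:", cleaned)  # the text is still distinguishable
```

In the 1-bit result the shadow and the text share one value, so no later step can separate them; in the grayscale data they still differ and the shadow can be dropped.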
With my sheet-fed scanner, scanning begins with cutting off the cover. An ordinary kitchen knife with a short blade and a comfortable handle is quite suitable. For a soft cover, slip the knife between the closed cover and the first page and cut the cover off. If the book has a hard cover, open the cover and cut the book block out of it. The pages then either come off one at a time or are cut off. Torn edges can be removed in software at the cleaning stage. The main thing is that the tears do not reach into the text.
As I write these lines, a poem by Marshak runs through my head.
I have books from my childhood that I love and will never cut up. But often what you have to scan are manuals, frequently computer manuals and frequently thick ones, for which the recycling bin is the best place anyway, and it would be a pity to waste your time scanning them “on glass” (on a flatbed).
Once again, the basic scanner settings: resolution 300 dpi, color mode “grayscale” or “color”, file format TIFF.
By measuring a page of the book in millimeters, you can specify the length and width of the scan area. Of course, “on glass” this can only be done approximately, since it is impossible to place a book on the glass precisely. A sheet-fed scanner, on the other hand, pulls the sheets in by their straight edge (top or bottom, or the side, provided the sheets are stacked evenly), and everything comes out right to the millimeter. On my sheet-fed scanner I have lately, out of innate laziness, been enabling the “text enhancement” option, which makes text bolder and darker but spoils color illustrations (it thickens the colors), and the “align images” (deskew) option, since straight sheets are easier to process later. But you can also leave every option except dpi and color mode untouched and defer everything else to the cleaning stage.
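The relation between a page size in millimeters and the resulting image size in pixels is simple arithmetic (one inch is 25.4 mm). The helper names below are made up for illustration, and the A5 page is just a sample size, not one taken from any particular book.

```python
MM_PER_INCH = 25.4


def mm_to_pixels(length_mm, dpi=300):
    """Convert a length in millimeters to pixels at the given resolution."""
    return round(length_mm / MM_PER_INCH * dpi)


def page_size_in_pixels(width_mm, height_mm, dpi=300):
    """Return (width, height) in pixels for a page measured in millimeters."""
    return mm_to_pixels(width_mm, dpi), mm_to_pixels(height_mm, dpi)


# Sample page: A5, 148 x 210 mm, scanned at 300 dpi.
width_px, height_px = page_size_in_pixels(148, 210)
print(f"{width_px} x {height_px} pixels")
```

The same function gives a quick sanity check on the output files: if the pixel dimensions of a TIFF are far from the expected values, the scan area or the dpi setting was probably wrong.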
In the past, books were more commonly digitized by retyping them manually.
Today, the digitization process includes two approaches.
- Mandatory: obtaining copies of the pages as graphic (usually raster) images by scanning or photographing them, then processing the images and saving them in one of the graphic file formats. The original layout of the book is preserved completely and transcription errors are ruled out, but it is impossible to search the text or to extract fragments of it, for example for citation.
- Optional: text recognition (OCR), followed by saving the recognized text in one of the e-book formats. This makes full-text search within the book and indexing of large collections of e-books possible, but the original layout, images, diagrams and formulas are difficult to reproduce, and recognition errors are almost inevitable.
Recently, especially with the advent of the PDF and DjVu formats, a mixed approach has been used more and more: the text of the book is recognized automatically and embedded in the original raster images of the pages, which combines the advantages of both approaches.
Book scanners
When a library digitizes its books, then in addition to preserving the originals and ensuring the authenticity of the electronic copies, the structure of classification and information retrieval must be identical in the paper and electronic collections. In other words, scanning books also requires creating an electronic catalog and building an index and search database that is as complete as possible.
Library electronic resource projects are among the most difficult and labor-intensive in terms of the methodologies applied and the technical execution.
A natural question arises: why? If such projects are so difficult to carry out, why start digitizing library materials at all, when “books can be stored for centuries” and “no one even goes to the library”?
This is a misconception. In recent years libraries have been changing actively, introducing modern technologies and service standards to meet the needs of a new generation of readers raised on the free use of digital content. Re-equipment programs are being adopted, performance indicators are being put into practice, and union catalogs and regional and local-history electronic collections are being created. In 2015 the National Electronic Library (NEB) was launched, and the collections of Russian libraries are regularly digitized to develop it.
Do not forget about the preservation of invaluable knowledge and cultural values accumulated in book depositories throughout the country. For these purposes, digitization is the most effective way to preserve publications and ensure safe access to the information they contain.
The project in one large library lasted from 2003 to 2011. During the project, more than 2 million cards of the systematic catalog in Russian and foreign languages were scanned and indexed. Data from 17 fields on each card were transferred to the ABIS.
The basis of the automation of modern libraries is the creation of an electronic catalog and the population of an automated library information system (ABIS). The ABIS is needed to automate the accounting of holdings. A full-fledged electronic catalog makes information retrieval considerably faster and more efficient and significantly raises the overall quality of service to readers.
As a rule, a library keeps several types of catalogs: alphabetical, in which all the cards are arranged alphabetically, and systematic, in which the cards are arranged by branch of knowledge. Catalogs are also divided by the scope of the holdings they cover (the whole collection or individual parts of it), by purpose (for readers or for staff), and by many other criteria (local history, subject, etc.).
With large holdings, digitizing the entire catalog is a rather lengthy process, which is usually carried out in stages.
The basis of the catalog is the library card, which contains information about the publication, classification indexes, the book number (ISBN) and other data. Because of the large amount of specific information, the card is the most difficult type of document from which to extract index data. Text in foreign languages, handwriting, and diacritics (various superscript, subscript and, less often, inline marks) make the information even harder to process.
One bibliographic record can contain up to 24 different fields. Entering records into the system directly from paper is impractical because it is slow and risks losing or skipping key information, so creating an electronic catalog involves mandatory preliminary scanning of the library card catalog, followed by building and verifying the index database before it is loaded into the ABIS.
Even in a small library the cards number in the thousands. Under such circumstances it is almost impossible for a library to find the staff and equipment to create an electronic catalog on its own, so to save time and money it brings in professional contractors who specialize in processing library information and are ready to guarantee the final result.
Typical electronic catalog creation process
It is advisable to digitize on the library premises, so that the cards are not withdrawn from use and work with readers is not disrupted. The process is divided into several stages:
Expertise. The physical condition of the cards and card cabinets is assessed. The composition of the bibliographic description and the required format of the machine-readable records are determined. Based on the data obtained, the subsequent technological chain of work is drawn up. The scope of work and the data extraction methods depend on nuances in how the characters are written, on the card format, and even on the material (cardboard, paper). The following types of cards are distinguished:
- additional card. Feature: printed and handwritten characters,
- separator. Feature: differs from the standard card format,
- reference card. Feature: only handwritten characters,
- description. Feature: Old Russian text.
Modern equipment reaches a scanning speed of 170 cards per minute, and choosing a professional scanner avoids damage to the cards themselves.
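To put the quoted scanning speed of 170 cards per minute into perspective, here is a rough estimate of pure scanner time for a catalog of 2 million cards (the size of the project mentioned earlier). This is a back-of-the-envelope sketch: it assumes the scanner runs at that rate without pauses and ignores card preparation, loading, two-sided passes and downtime, so real projects take longer. The function name and the 8-hour shift are assumptions.

```python
import math


def scanning_hours(card_count, cards_per_minute=170):
    """Hours of uninterrupted scanning needed for the given number of cards."""
    return card_count / cards_per_minute / 60


hours = scanning_hours(2_000_000)
shifts = math.ceil(hours / 8)  # assumed 8-hour working shifts

print(f"about {hours:.0f} hours of scanner time, or {shifts} full shifts")
```

Even under these idealized assumptions the scanning itself is a matter of weeks, which suggests that most of a multi-year project goes into indexing and verification rather than into scanning.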
Scanning. Batch scanning of paper cards is carried out on high-speed document scanners. Standard requirements for digitization: 300 dpi resolution, black-and-white scanning mode, TIFF or JPEG file format. Most cards have the typical size of 130x80 mm, but sizes up to and including A6 (148x105 mm) occur. Sometimes cards that are stuck together have to be separated before scanning. Cards are often scanned on both sides when the back carries inventory numbers or a breakdown by branch. Separator cards that carry no significant information are not scanned.
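The card sizes and scan settings above allow a rough estimate of how much raw image data one card produces. The sketch below computes the uncompressed pixel data only; it assumes rows padded to whole bytes and ignores file headers and compression, so real TIFF or JPEG files are normally much smaller. The function name is made up for illustration.

```python
import math

MM_PER_INCH = 25.4


def raw_image_bytes(width_mm, height_mm, dpi=300, bits_per_pixel=1):
    """Uncompressed pixel data size of one scanned side, in bytes."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    # Each row is assumed to be padded to a whole number of bytes.
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px


typical_bw = raw_image_bytes(130, 80)                      # 1-bit black and white
a6_bw = raw_image_bytes(148, 105)                          # largest card size
typical_gray = raw_image_bytes(130, 80, bits_per_pixel=8)  # 8-bit grayscale

for label, size in [("130x80 b&w", typical_bw),
                    ("A6 b&w", a6_bw),
                    ("130x80 grayscale", typical_gray)]:
    print(f"{label}: {size / 1024:.0f} KiB uncompressed")
```

The eightfold difference between 1-bit and 8-bit data is one practical reason why catalog cards, unlike book pages, are commonly scanned in black and white.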
After digitization, the paper array of the file cabinet is returned to its original state.
All subsequent work is carried out on the resulting graphic images of the cards.
Skewed images must be straightened, the background removed, low-contrast characters brought out, and so on.
Electronic copies must not lose information or be less readable than the paper original. If the source material is in poor condition, software tools may be used to improve image quality.
All image processing is performed automatically. When only a small number of damaged documents need attention, the geometry of the images can be corrected manually and noise and fold marks cleaned up by hand.
Even typewritten text is not always recognized correctly. Automatic recognition of handwritten text, pencil marks, and cards created before the mid-20th century is almost impossible.
The number of graphic images must match the number of sheets in the paper array. Cards are scanned in their catalog order as a matter of course. A skipped page counts as a defect.
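A completeness check of this kind is easy to automate. The sketch below assumes a hypothetical naming scheme in which each image file name ends in the sheet number followed by `.tif` or `.tiff` (for example `card_0003.tif`); real projects may name files differently.

```python
import re


def missing_scan_numbers(filenames, expected_count):
    """Return the sheet numbers in 1..expected_count that have no image file."""
    # Assumed naming scheme: the sheet number sits right before the extension.
    pattern = re.compile(r"(\d+)\.tiff?$", re.IGNORECASE)
    found = set()
    for name in filenames:
        match = pattern.search(name)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(1, expected_count + 1) if n not in found]


# Made-up sample: five sheets were expected, but sheet 3 was skipped.
files = ["card_0001.tif", "card_0002.tif", "card_0004.tif", "card_0005.tif"]
missing = missing_scan_numbers(files, expected_count=5)
print("missing sheets:", missing)
```

An empty result means that the number of images matches the number of sheets and no number in the sequence was skipped.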
Retroconversion: entering information from scanned cards and creating a database. Cards may contain typewritten and handwritten text, pencil marks, fuzzy characters and have other filling characteristics.
In rare cases, when the quality of a document containing printed text is good, recognition tools can be used to automatically extract certain fields of the card.
Therefore, data from library cards are mainly entered manually and go through a multi-level quality control system.
Before retroconversion, images are separated (sorted) to group individual parts of the array by type of card and other indexing features (concatenation of composite cards, creation of data blocks for volumes, language separation, etc.). The blocks are marked for ease of data retrieval by the operator.
The output is a database in the format the library requires (RUSMARC, UNIMARC, MARC21, etc.). In some cases, when an electronic catalog is being created, graphic images of the books themselves serve as the source for processing. Operators who know the rules for compiling bibliographic descriptions are then brought into the work.
The permissible error rate in the database is very low, because errors directly affect the quality of searching in the electronic catalog. Therefore, after data entry, experienced verifiers check the records against various parameters.
To accelerate retroconversion, borrowing technology is used, which simplifies data entry by automatically filling in fields based on previously entered data.
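The idea of borrowing can be sketched as a lookup in records that were entered earlier. The text does not describe how real retroconversion software implements it, so everything below is an assumption made for illustration: the class name, the choice of author and title as the matching key, the field names, and the sample data.

```python
def normalize(value):
    """Lower-case a value and collapse whitespace so small typing differences match."""
    return " ".join(value.lower().split())


class BorrowingIndex:
    """Hypothetical store of already entered records, keyed by author and title."""

    def __init__(self, key_fields=("author", "title")):
        self.key_fields = key_fields
        self._records = {}

    def _key(self, record):
        return tuple(normalize(record.get(f, "")) for f in self.key_fields)

    def add(self, record):
        """Remember a verified record so later entries can borrow from it."""
        self._records[self._key(record)] = dict(record)

    def suggest(self, partial):
        """Fill the empty fields of a partial record from a matching earlier one."""
        earlier = self._records.get(self._key(partial))
        if earlier is None:
            return dict(partial)
        filled = dict(earlier)
        # Values the operator has already typed always take priority.
        filled.update({k: v for k, v in partial.items() if v})
        return filled


index = BorrowingIndex()
index.add({"author": "Ivanov I. I.", "title": "Library Science",
           "publisher": "Example Press", "place": "Moscow", "year": "1970"})

# The operator types author, title and year of another edition of the same book.
suggestion = index.suggest({"author": "Ivanov  I. I.", "title": "library science",
                            "year": "1975", "publisher": ""})
print(suggestion)
```

The operator still reviews every borrowed value; the lookup only saves retyping fields that repeat from card to card.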
Many libraries already have professional planetary scanners for day-to-day digitization of books. In-house, however, it is mainly newly acquired literature that is scanned; high-quality mass digitization is usually outsourced. In one large federal library, for example, a contractor digitized more than 16.5 million pages of library and archival collections from 2008 to 2014.
After creating an electronic catalog, or in parallel with it, libraries address the preservation and accessibility of their collections by digitizing the books. Digitization is carried out to contribute to national electronic projects and to create collections of rare books, full-text resources, collections of thematic illustrated materials and much more.
Libraries can digitize their holdings on their own. Large libraries, for example, have set up entire scanning departments with a fleet of professional equipment.
An important aspect is the characteristics of the digital copies. When a library is solving local problems, it can determine the requirements for the output electronic resources itself. But national projects that draw on the holdings of many libraries require a common standard that regulates the main characteristics of the work.
When the NEB was being created, the electronic resources produced by the technical contractor and by the libraries themselves had different digitization parameters, which complicated the processing and loading of digital content.
Therefore, the industry expert council prepared “Recommendations on the digitization of materials from library collections” *, which set out the principles for creating electronic library resources. The recommendations define three types of digital copies. A master copy is a reference copy of the original in print quality (resolution of at least 600 dpi). A user copy is used to create electronic collections and to serve readers (resolution of at least 300 dpi). A service copy is used for internal library tasks and for placement on websites (resolution of at least 150 dpi).
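The three resolution thresholds can be expressed as a small lookup. This sketch covers only the minimum resolutions quoted above; the recommendations may set further requirements that are not reproduced here, and the names `MIN_DPI` and `allowed_copy_types` are made up for illustration.

```python
# Minimum resolution for each type of digital copy, as quoted in the text.
MIN_DPI = {"master": 600, "user": 300, "service": 150}


def allowed_copy_types(dpi):
    """Return the copy types whose minimum resolution a scan at this dpi meets."""
    return [name for name, minimum in MIN_DPI.items() if dpi >= minimum]


for dpi in (100, 150, 300, 600):
    print(dpi, "dpi ->", allowed_copy_types(dpi) or "below every minimum")
```

A 300 dpi scan, for instance, can serve as a user or service copy but not as a master copy, which is why the target copy type has to be fixed before scanning starts.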
Book Scan Features
When books are digitized, the stages of the work repeat the process of creating an electronic catalog. Whether the library works on its own or hires a contractor, the first step is to define the purpose of the project and to examine the collection in order to understand the cost and complexity of the work. Next, the list of editions to be digitized is drawn up, the technical requirements are agreed, and the final estimate of the project is made.
Several features of book digitization affect the cost and timing of projects. The format and condition of the books and the size of the collection are of great importance. These features determine the type of scanning equipment and the digitization technology.
After scanning, the digital copies undergo software correction and are brought to the image quality that is most convenient for reading. With professional book scanners, the built-in processing tools are often sufficient for this. Once the array of digital copies is ready, bibliographic descriptions of the scanned publications are compiled if necessary.
High-quality scanning of collections of rare books, book monuments, folios and other valuable items deserves separate mention. It requires specialized scanning systems that provide exceptionally high optical resolution.
Features of creating full-text PDF books
*Articles 1274 and 1275 of the Civil Code of the Russian Federation (Part IV, as amended in 2006) permit copies of works that have lawfully entered civil circulation to be provided for temporary gratuitous use without the author's consent. However, digital copies of works may be provided only on library premises, and only on condition that users cannot make digital copies of these works. To give citizens remote access to restricted editions within the National Electronic Library, a special protected viewer was developed for working with works in electronic form.
But digital copies alone are sometimes not enough. Some tasks require turning the images into a full e-book. E-books in PDF format are built from the graphic images. This format is the most universal and allows full-text search and navigation via the table of contents and hyperlinks. E-books not restricted by copyright can be published on the Internet, while the others can be given “secure” access in the library reading room *.
To create such books, full-text recognition is carried out with further verification of the text and spelling. Professional proofreaders are involved in the final proofreading of a document.
The layout work produces an e-book that is completely identical to the paper original, with the exact page layout and illustrations and with the language and style preserved.
Digitizing library books and catalogs, especially when the volume is significant and the holdings vary in format and condition, is a complex production process that can be carried out only by specialized companies with the necessary infrastructure and extensive experience in creating electronic resources.