New in version 0.1 (apart from various bug fixes):

* Encode URI (using xmlURIEscape from libxml2) for the attribute content to avoid blocking XML well-formedness issues. From our experiments, this problem occurs on average in 2-3 scholarly PDFs out of one thousand.
* Output coordinates attributes for the BLOCK elements when the -block option is selected.
* Add a parameter -readingOrder which re-orders the blocks following the reading order when the -block option is selected. By default in pdf2xml, the elements followed the PDF content stream (the so-called raw order). In xpdf, several text flow orders are available, including the raw order and the reading order. Note that, with this modification and this new option, only the blocks are re-ordered.

From our experiments, the raw order can diverge quite significantly from the order of elements according to the visual/reading layout in 2-4% of scholarly PDFs (e.g. the title element is introduced at the end of the page element, while visually present at the top of the page), and minor changes can be present in up to 100% of PDFs for some scientific publishers (e.g. a headnote introduced at the end of the page content). This additional mode can thus be quite useful for information/structure extraction applications exploiting pdfalto output.

New in version 0.2 (apart from various bug fixes):

* Support Unicode composition of characters
* Detect subscript/superscript text font style attribute
* Generalize reading order to all blocks (it was limited to the blocks of the first page)
* Propagate unsolved character Unicode value (free Unicode range for embedded fonts) as encoded special character in ALTO (so-called "placeholder" approach)
* Generate metadata information in a separate XML file (as the ALTO schema does not support that)
* Use the latest version of xpdf, version 4.00
* ALTO output replaces the custom Xerox XML format
* Note: this released version was used for Grobid release 0.5.6

New in version 0.3 (apart from various bug fixes):

* Line number detection: line numbers (typically added for review in manuscripts/preprints) are specifically identified and no longer mixed with the rest of the text content. They are grouped in a separate block or, optionally, not output in the ALTO file (-noLineNumbers option)
* Fix possible incorrect XMax and YMax values at 0 on block coordinates having only one line
* Removal of the -blocks option: the block information is always returned to ensure ALTO validation

New in version 0.4 (apart from various bug fixes):

* Support for the xpdf language support packages for language-specific fonts like Arabic, Chinese-simplified, Japanese, etc. They are pre-installed locally and portable
* Fix issue with character spacing due to invalid rotation condition
* Refined line number detection and fixed a bug which could result in random missing numbers in the ALTO output
* Update dependencies and dependency install script

In addition to the ALTO file describing the PDF content, the following files are generated:

* _metadata.xml file containing the PDF file metadata (metadata information is generated in a separate XML file as the ALTO schema does not support that).
* _annot.xml file containing a description of the annotations in the PDF.
* _outline.xml file containing a possible PDF-embedded table of contents (aka outline), obtained with the -outline option.
* xml_data/ subdirectory containing the vectorial (.vec) and bitmap (.png) images embedded in the PDF. It is generated by default, when the option -noImage is not present. This extraction slows down the process very significantly, so if no image is required, use the option -noImage. When the images are not extracted, image elements with layout properties still appear in the ALTO file, but they reference no extracted image files.

Options:

* -noImage : do not extract images (bitmap and vectorial)
* -annotation : create an annotations XML file
* -noLineNumbers : do not output line numbers added in manuscript-style textual documents
* -readingOrder : blocks follow the reading order
* -noText : do not extract textual objects (might be useful, but non-valid ALTO)
* -charReadingOrderAttr : include a TYPE attribute on String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
* -fullFontName : font names are not normalized
* -opw : owner password (for encrypted files)
* -upw : user password (for encrypted files)
* -filesLimit : limit of asset files to be extracted
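
As an illustration of how the switches documented here might be combined from a script, the following is a minimal Python sketch that only assembles an argument list. It rests on assumptions not stated in this section: that the executable is named `pdfalto` and is on the PATH, and that it is invoked as `pdfalto [options] <PDF-file> <xml-file>`. The helper name `build_pdfalto_command` is hypothetical and is not part of pdfalto.

```python
from pathlib import Path

# Boolean switches taken from the option list in this document.
KNOWN_SWITCHES = {
    "noImage",
    "annotation",
    "noLineNumbers",
    "readingOrder",
    "noText",
    "charReadingOrderAttr",
    "fullFontName",
}


def build_pdfalto_command(pdf_path, xml_path, switches=(), executable="pdfalto"):
    """Return an argument list for a pdfalto run (hypothetical helper).

    Assumption: the tool is invoked as `pdfalto [options] <pdf> <xml>`.
    """
    # Reject anything that is not one of the switches listed above.
    unknown = [s for s in switches if s not in KNOWN_SWITCHES]
    if unknown:
        raise ValueError(f"unknown switch(es): {unknown}")
    args = [executable]
    args += [f"-{s}" for s in switches]
    args += [str(Path(pdf_path)), str(Path(xml_path))]
    return args


if __name__ == "__main__":
    cmd = build_pdfalto_command(
        "paper.pdf", "paper.xml", ["noImage", "readingOrder"]
    )
    print(" ".join(cmd))
    # To actually run it (requires pdfalto to be installed):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing `noImage` here reflects the remark in this document that image extraction slows the process very significantly, while `readingOrder` asks for blocks in reading order rather than raw order.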