StructuredText#

StructuredText objects hold text from a page that has been analyzed and grouped into blocks, lines and spans. To obtain a StructuredText instance use toStructuredText().

INSTANCE METHODS

Search the text for all instances of needle, and return an array with all matches found on the page.

Each match in the result is an array containing one or more QuadPoints that cover the matching text.

Parameters:

needlestring.

Returns:

Quad[][].

EXAMPLE

var result = sText.search("Hello World!");
highlight(p: Point, q: Point, max_hits: number = 100)#

Return an array with Quads needed to highlight a selection defined by the start and end points.

Parameters:
  • pPoint.

  • qPoint.

  • max_hitsnumber. The number of results to return as a maximum for the search.

Returns:

Quad[].

EXAMPLE

var result = sText.highlight([100,100], [200,100]);
copy(p: Point, q: Point)#

Return the text from the selection defined by the start and end points.

Parameters:
Returns:

string.

EXAMPLE

var result = sText.copy([100,100], [200,100]);
walk(walker: StructuredTextWalker)#
Parameters:

walkerStructuredTextWalker. Function with protocol methods, see example below for details.

Walk through the blocks (images or text blocks) of the structured text. For each text block walk over its lines of text, and for each line each of its characters. For each block, line or character the walker will have a method called.

EXAMPLE

var sText = pdfPage.toStructuredText();
sText.walk({
    beginLine: function (bbox, wmode, direction) {
        console.log("beginLine", bbox, wmode, direction);
    },
    beginTextBlock: function (bbox) {
        console.log("beginTextBlock", bbox);
    },
    endLine: function () {
        console.log("endLine");
    },
    endTextBlock: function () {
        console.log("endTextBlock");
    },
    onChar: function (utf, origin, font, size, quad) {
        console.log("onChar", utf, origin, font, size, quad);
    },
    onImageBlock: function (bbox, transform, image) {
        console.log("onImageBlock", bbox, transform, image);
    },
});

Note

On beginLine the direction parameter is a vector (e.g. [0, 1]) and can you can calculate the rotation as an angle with some trigonometry on the vector.

asJSON(scale: number = 1)#

Returns the instance in JSON format.

Parameters:

scalenumber. Default: 1. Multiply all the coordinates by this factor to get the coordinates at another resolution. The structured text has all coordinates in points (72 DPI), however you may want to use the coordinates in the StructuredText data at another resolution.

Returns:

string.

EXAMPLE

var json = sText.asJSON();

Note

If you want the coordinates to be 300 DPI then pass (300/72) as the scale parameter.

asHTML(id: number)#

Returns the instance in HTML format.

Parameters:

idnumber. Used to identify the page id of the main div, if omitted then html in the top node will be: <div id="page0">.

Returns:

string.

asText()#

Returns the instance in plain text format.

Returns:

string.


Code samples

Code samples are in TypeScript and assume that the following requirements are defined in your TypeScript file header as follows:

import * as fs from "fs"
import * as mupdfjs from "mupdf/mupdfjs"