StructuredText
StructuredText objects hold text from a page that has been analyzed and
grouped into blocks, lines and spans. To obtain a StructuredText
instance, use toStructuredText().
INSTANCE METHODS
- search(needle: string)
Search the text for all instances of needle, and return an array with all matches found on the page. Each match in the result is an array containing one or more QuadPoints that cover the matching text.
- Parameters:
needle – string.
- Returns:
Quad[][].
EXAMPLE
var result = sText.search("Hello World!");
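A search result is often reduced to one bounding rectangle per match, for example to draw a box around each hit. The following is a minimal sketch of that step, not part of the MuPDF.js API. It assumes that each Quad is an array of eight numbers holding four corner points as x, y pairs, and that a rectangle is written as [x0, y0, x1, y1]. The names quadsToRect and matchesToRects are hypothetical helpers.

```typescript
// Assumption: a Quad is eight numbers, i.e. four (x, y) corner points.
type Quad = number[];
// Assumption: a rectangle is [x0, y0, x1, y1].
type Rect = [number, number, number, number];

// Hypothetical helper: smallest axis-aligned rectangle covering all quads of one match.
function quadsToRect(quads: Quad[]): Rect {
    let x0 = Infinity, y0 = Infinity, x1 = -Infinity, y1 = -Infinity;
    for (const quad of quads) {
        // Visit the corner points pair by pair; the corner order does not matter.
        for (let i = 0; i + 1 < quad.length; i += 2) {
            x0 = Math.min(x0, quad[i]);
            y0 = Math.min(y0, quad[i + 1]);
            x1 = Math.max(x1, quad[i]);
            y1 = Math.max(y1, quad[i + 1]);
        }
    }
    return [x0, y0, x1, y1];
}

// Hypothetical helper: one rectangle per match in a search() result.
function matchesToRects(matches: Quad[][]): Rect[] {
    return matches.map(quadsToRect);
}
```

With the result from the example above, matchesToRects(result) would give one rectangle per match, provided the Quad layout assumption holds.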
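- highlight(p: Point, q: Point, max_hits: number = 100)
Return an array with the Quads needed to highlight a selection defined by the start and end points.
- Parameters:
p – Point. Start point of the selection.
q – Point. End point of the selection.
max_hits – number. Default: 100.
- Returns:
Quad[].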
EXAMPLE
var result = sText.highlight([100,100], [200,100]);
- copy(p: Point, q: Point)
Return the text from the selection defined by the start and end points.
EXAMPLE
var result = sText.copy([100,100], [200,100]);
- walk(walker: StructuredTextWalker)
Walk through the blocks (images or text blocks) of the structured text. For each text block, walk over its lines of text, and for each line, over its characters. The walker has a method called for each block, line or character.
- Parameters:
walker – StructuredTextWalker. An object with protocol methods; see the example below for details.
EXAMPLE
var sText = pdfPage.toStructuredText();
sText.walk({
    beginLine: function (bbox, wmode, direction) {
        console.log("beginLine", bbox, wmode, direction);
    },
    beginTextBlock: function (bbox) {
        console.log("beginTextBlock", bbox);
    },
    endLine: function () {
        console.log("endLine");
    },
    endTextBlock: function () {
        console.log("endTextBlock");
    },
    onChar: function (utf, origin, font, size, quad) {
        console.log("onChar", utf, origin, font, size, quad);
    },
    onImageBlock: function (bbox, transform, image) {
        console.log("onImageBlock", bbox, transform, image);
    },
});
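Beyond logging, a walker can accumulate state across callbacks. The following sketch collects the text of each line into an array. The function name makeLineCollector is a hypothetical helper, not part of the MuPDF.js API, and the sketch assumes that the first argument to onChar is the character as a string. The callbacks it does not need are given as no-ops, so it does not rely on walk() tolerating missing methods.

```typescript
// Hypothetical helper: builds a walker object that gathers one string per line.
function makeLineCollector() {
    const lines: string[] = [];
    let current = "";
    const walker = {
        beginTextBlock: function () {},
        endTextBlock: function () {},
        onImageBlock: function () {},
        // Start a fresh buffer for every line.
        beginLine: function () {
            current = "";
        },
        // Assumption: the first argument is the character as a string.
        onChar: function (utf: string) {
            current += utf;
        },
        // Store the finished line.
        endLine: function () {
            lines.push(current);
        },
    };
    return { walker, lines };
}
```

A possible use is to create a collector, pass collector.walker to sText.walk(), and then read collector.lines.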
Note
On beginLine the direction parameter is a vector (e.g. [0, 1]) and you can calculate the rotation as an angle with some trigonometry on the vector.
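The trigonometry mentioned in the note can be sketched with Math.atan2. The function name directionToDegrees is a hypothetical helper, and the sketch assumes the direction vector is passed as [x, y]. Whether a positive angle reads as clockwise or counter-clockwise depends on which way the y-axis points in your coordinate system.

```typescript
// Hypothetical helper: angle of a direction vector in degrees, between -180 and 180.
// Assumption: direction is [x, y], as received in beginLine.
function directionToDegrees(direction: [number, number]): number {
    const [x, y] = direction;
    // atan2 handles all four quadrants and a zero x component.
    return Math.atan2(y, x) * (180 / Math.PI);
}
```

For a direction of [1, 0] this gives an angle of 0, and for [0, 1] an angle of about 90 degrees.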
- asJSON(scale: number = 1)
Returns the instance in JSON format.
- Parameters:
scale – number. Default: 1. Multiply all the coordinates by this factor to get the coordinates at another resolution. The structured text has all coordinates in points (72 DPI); however, you may want to use the coordinates in the StructuredText data at another resolution.
- Returns:
string.
EXAMPLE
var json = sText.asJSON();
Note
If you want the coordinates to be at 300 DPI, then pass (300/72) as the scale parameter.
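The same arithmetic works for any target resolution: divide the target DPI by 72. A small sketch follows. The function name scaleForDpi and the constant POINTS_PER_INCH are hypothetical helpers, not part of the MuPDF.js API.

```typescript
// Structured text coordinates are in points, and there are 72 points per inch.
const POINTS_PER_INCH = 72;

// Hypothetical helper: scale factor that maps point coordinates to a target DPI.
function scaleForDpi(dpi: number): number {
    if (!(dpi > 0)) {
        throw new RangeError("dpi must be a positive number");
    }
    return dpi / POINTS_PER_INCH;
}
```

The result can then be passed as the scale parameter, for instance sText.asJSON(scaleForDpi(300)).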
- asHTML(id: number)
Returns the instance in HTML format.
- Parameters:
id – number. Used to identify the page id of the main div. If omitted, the top node of the HTML will be <div id="page0">.
- Returns:
string.
- asText()
Returns the instance in plain text format.
- Returns:
string.
Code samples
Code samples are in TypeScript and assume that the following imports appear at the top of your TypeScript file:
import * as fs from "fs"
import * as mupdfjs from "mupdf/mupdfjs"