Working with Documents

Passwords & Security

A protected document may require a password. To check whether it does, use the needsPassword() method as follows:

EXAMPLE

let needsPassword = document.needsPassword()

To provide a password, use the authenticate() method as follows:

EXAMPLE

let auth = document.authenticate("abracadabra")

The value returned by authenticate() indicates whether the password was accepted. See the authenticate password return values for the meaning of each value.
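As a rough illustration of handling that return value, the sketch below decodes it into a readable string. It assumes authenticate() returns the same bit field as MuPDF's password authentication call (0 for failure, 1 for no password needed, 2 for user password, 4 for owner password, 6 for both); confirm this against the return values reference before relying on it. describeAuthResult is a hypothetical helper, not part of the library.

```javascript
// Hypothetical helper: decodes the value returned by document.authenticate().
// ASSUMPTION: the result is a bit field (0 = failed, 1 = no password needed,
// 2 = user password accepted, 4 = owner password accepted, 6 = both).
function describeAuthResult(result) {
    if (result === 0) {
        return "failed"
    }
    const parts = []
    if (result & 1) parts.push("no password needed")
    if (result & 2) parts.push("user password accepted")
    if (result & 4) parts.push("owner password accepted")
    return parts.join(", ")
}

// Usage with a loaded document (not run here):
// if (document.needsPassword()) {
//     console.log(describeAuthResult(document.authenticate("abracadabra")))
// }
console.log(describeAuthResult(6))
```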

Document Metadata

Get Metadata

You can get metadata for a document using the getMetaData() method.

Common keys include format, encryption, info:ModDate, info:Title, and info:Author.

EXAMPLE

const format = document.getMetaData("format")
const modificationDate = document.getMetaData("info:ModDate")
const author = document.getMetaData("info:Author")
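When several keys are needed at once, the calls can be wrapped in a small loop. The following is a minimal sketch: collectMetadata and metadataStub are hypothetical names introduced for illustration, and the stub stands in for a real document so the snippet runs on its own. It also assumes that getMetaData() returns undefined for a key that is not set.

```javascript
// Hypothetical helper: reads several metadata keys into one plain object.
// Works with any object exposing getMetaData(key), as used in the example above.
function collectMetadata(doc, keys) {
    const result = {}
    for (const key of keys) {
        const value = doc.getMetaData(key)
        // ASSUMPTION: unset keys come back as undefined (or null); skip them.
        if (value !== undefined && value !== null) {
            result[key] = value
        }
    }
    return result
}

// Stand-in for a real document so the sketch is self-contained.
const metadataStub = {
    values: { "format": "PDF 1.7", "info:Title": "Test" },
    getMetaData(key) { return this.values[key] },
}

const metadata = collectMetadata(metadataStub, ["format", "encryption", "info:ModDate", "info:Title"])
console.log(metadata)
```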

Set Metadata

You can set metadata for a document using the setMetaData() method.

EXAMPLE

document.setMetaData("info:Author", "Jane Doe")

Get the Document Page Count

Use the countPages() method to get the number of pages in the document.

EXAMPLE

const numPages = document.countPages()

Load a Page of a Document

To load a page of a document use the loadPage() method, which takes a zero-based page index and returns a page instance.

EXAMPLE

// load the 1st page of the document
let page = document.loadPage(0)
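Because page indices are zero-based, countPages() and loadPage() combine naturally into a loop over the whole document. The sketch below wraps that loop in a generator. eachPage and pagesStub are hypothetical names; the stub only mimics the two methods so the snippet can run without a PDF.

```javascript
// Hypothetical helper: yields every page of a document in order.
// Relies only on countPages() and loadPage(index), both shown above.
function* eachPage(doc) {
    const count = doc.countPages()
    for (let index = 0; index < count; index++) {
        yield { index, page: doc.loadPage(index) }
    }
}

// Stand-in for a real document with three pages.
const pagesStub = {
    countPages() { return 3 },
    loadPage(index) { return { label: `page ${index + 1}` } },
}

for (const { index, page } of eachPage(pagesStub)) {
    console.log(index, page.label)
}
```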

Splitting a Document

To split a document’s pages into new documents use the split() method. Supply an array of the page indices at which to split; each index marks the first page of a new document.

EXAMPLE

let documents = document.split([0,2,3])

Applied to a 10-page PDF, the example above would return three new documents:

  • Document containing pages 1 & 2 (from index 0)

  • Document containing page 3 (from index 2)

  • Document containing pages 4-10 (from final index 3)
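The mapping from split indices to page ranges can be checked with a few lines of plain JavaScript. This is only a sketch of the arithmetic described above, not the library's implementation; splitRanges is a hypothetical helper and it assumes the indices are sorted in ascending order.

```javascript
// Hypothetical helper: computes the zero-based [first, last] page range of each
// document that a split at the given indices would produce.
// ASSUMPTION: indices are sorted ascending and each lies within the document.
function splitRanges(indices, pageCount) {
    return indices.map((start, position) => {
        const isLast = position === indices.length - 1
        // Each range ends just before the next split index; the last one runs
        // to the final page of the document.
        const end = isLast ? pageCount - 1 : indices[position + 1] - 1
        return [start, end]
    })
}

// For a 10 page document this logs the ranges [0, 1], [2, 2] and [3, 9],
// i.e. pages 1 & 2, page 3, and pages 4-10.
console.log(splitRanges([0, 2, 3], 10))
```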

Merging Documents

To merge documents we can use the merge() method.

See the script below for an example implementation.

EXAMPLE

// create a blank document and add some text
let sourcePDF = mupdfjs.PDFDocument.createBlankDocument()
let page = sourcePDF.loadPage(0)
page.insertText("HELLO WORLD",
                    [0,0],
                    "Times-Roman",
                    20,
                    {strokeColor:[0,0,0,1], fillColor:[1,0,0,0.75], strokeThickness:0.5})
// now merge all pages of sourcePDF (page 0 through -1, the last page) into our
// document, starting at page index 1 (the 2nd page), rotated by 90 degrees
document.merge(sourcePDF, 0, -1, 1, 90)

Extracting Document Text

To get the text for an entire document we can retrieve StructuredText objects as JSON for each page as follows:

EXAMPLE

let i = 0
while (i < document.countPages()) {
    const page = document.loadPage(i)
    const json = page.toStructuredText("preserve-whitespace").asJSON()
    console.log(`json=${json}`)
    i++
}

StructuredText contains objects from a page that have been analyzed and grouped into blocks, lines and spans. As such the JSON returned is structured and contains positional data and font data alongside text values, e.g.:

EXAMPLE

{
    "blocks": [
        {
            "type": "text",
            "bbox": {
                "x": 30,
                "y": 32,
                "w": 216,
                "h": 13
            },
            "lines": [
                {
                    "wmode": 0,
                    "bbox": {
                        "x": 30,
                        "y": 32,
                        "w": 216,
                        "h": 13
                    },
                    "font": {
                        "name": "FKGYDX+Arial",
                        "family": "sans-serif",
                        "weight": "normal",
                        "style": "normal",
                        "size": 12
                    },
                    "x": 30,
                    "y": 43,
                    "text": "Welcome to the Node server test.pdf file."
                }
            ]
        },
        {
            "type": "text",
            "bbox": {
                "x": 30,
                "y": 68,
                "w": 190,
                "h": 13
            },
            "lines": [
                {
                    "wmode": 0,
                    "bbox": {
                        "x": 30,
                        "y": 68,
                        "w": 190,
                        "h": 13
                    },
                    "font": {
                        "name": "FKGYDX+Arial",
                        "family": "sans-serif",
                        "weight": "normal",
                        "style": "normal",
                        "size": 12
                    },
                    "x": 30,
                    "y": 79,
                    "text": "Sorry there is not much to see here!"
                }
            ]
        },
        {
            "type": "text",
            "bbox": {
                "x": 568,
                "y": 31,
                "w": 6,
                "h": 13
            },
            "lines": [
                {
                    "wmode": 0,
                    "bbox": {
                        "x": 568,
                        "y": 31,
                        "w": 6,
                        "h": 13
                    },
                    "font": {
                        "name": "YDTIJL+Arial",
                        "family": "sans-serif",
                        "weight": "normal",
                        "style": "normal",
                        "size": 12
                    },
                    "x": 568,
                    "y": 42,
                    "text": "1"
                }
            ]
        },
        {
            "type": "text",
            "bbox": {
                "x": 28,
                "y": 744,
                "w": 84,
                "h": 19
            },
            "lines": [
                {
                    "wmode": 0,
                    "bbox": {
                        "x": 28,
                        "y": 744,
                        "w": 84,
                        "h": 19
                    },
                    "font": {
                        "name": "Arial",
                        "family": "sans-serif",
                        "weight": "normal",
                        "style": "normal",
                        "size": 14
                    },
                    "x": 28,
                    "y": 759,
                    "text": "Page 1 footer"
                }
            ]
        }
    ]
}
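If only the text is needed, the JSON can be reduced to a list of line strings. The following is a minimal sketch that assumes the shape shown above (a "blocks" array whose text blocks carry a "lines" array with a "text" field). extractLines and sampleJson are hypothetical names, and the sample is a trimmed-down stand-in for real output.

```javascript
// Hypothetical helper: flattens StructuredText JSON into an array of line strings.
// ASSUMPTION: the JSON has the shape shown above; blocks that are not of type
// "text" or that lack a "lines" array are skipped.
function extractLines(json) {
    const data = JSON.parse(json)
    return data.blocks
        .filter((block) => block.type === "text" && Array.isArray(block.lines))
        .flatMap((block) => block.lines.map((line) => line.text))
}

// Trimmed-down sample in the same shape as the output above.
const sampleJson = JSON.stringify({
    blocks: [
        { type: "text", lines: [{ text: "Welcome to the Node server test.pdf file." }] },
        { type: "text", lines: [{ text: "Sorry there is not much to see here!" }] },
        { type: "other" },
    ],
})

console.log(extractLines(sampleJson).join("\n"))
```

With a real document, the string passed in would be the result of page.toStructuredText("preserve-whitespace").asJSON().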

Extracting Document Images

To get the images for an entire document use the getImages() method on each page.

EXAMPLE

let i = 0
while (i < document.countPages()) {
    const page = document.loadPage(i)
    let imageStack = page.getImages()
    i++
}

The following example extracts all the images from a document and saves each one as an individual JPEG file:

EXAMPLE

// Node.js file system module, needed for writeFileSync
import * as fs from "fs"

let i = 0
while (i < document.countPages()) {
    const page = document.loadPage(i)
    let imageStack = page.getImages()

    for (var j in imageStack) {
        var image = imageStack[j].image;
        var pixmap = image.toPixmap();
        let raster = pixmap.asJPEG(80);
        fs.writeFileSync('page-'+i+'-image-'+j+'.jpg', raster);
    }

    i++
}

Extracting Document Annotations

We can retrieve Annotation objects from document pages by querying each page with getAnnotations().

EXAMPLE

let i = 0
while (i < document.countPages()) {
    // load the page at the current index (not always page 0)
    const page = document.loadPage(i)
    const annots = page.getAnnotations()
    console.log(`Page=${i}, Annotations=${annots}`)
    i++
}

“Baking” a Document

Flattening a document’s annotations and/or widgets is known as “baking”.

You can use the bake() method as follows:

EXAMPLE

document.bake()

Attaching a File to a Document

Use the attachFile() method on a document instance with a supplied name and Buffer for the data.

EXAMPLE

const content = "Test content";
const buffer = new mupdfjs.Buffer();
buffer.writeLine(content);
document.attachFile("test.txt", buffer);

Removing a File from a Document

Use the deleteEmbeddedFile() method on a document instance to remove an attached file.

EXAMPLE

document.deleteEmbeddedFile("test.txt")

Searching a Document

To search a document we can look at each page and use the search() method as follows:

EXAMPLE

let results = page.search("my search phrase")
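To cover the whole document, the same call can be repeated for every page. The sketch below shows one way to do that and to record which page each hit came from. searchDocument and searchStub are hypothetical names, and the stub only imitates countPages(), loadPage() and search() so the snippet runs without a PDF.

```javascript
// Hypothetical helper: runs page.search() on every page and records where hits occur.
function searchDocument(doc, needle) {
    const hits = []
    for (let index = 0; index < doc.countPages(); index++) {
        const results = doc.loadPage(index).search(needle)
        // Only keep pages that actually contain the phrase.
        if (results.length > 0) {
            hits.push({ pageIndex: index, results })
        }
    }
    return hits
}

// Stand-in document: only the second page reports a result.
const searchStub = {
    countPages() { return 2 },
    loadPage(index) {
        return { search: () => (index === 1 ? [[[0, 0, 10, 0, 0, 5, 10, 5]]] : []) }
    },
}

console.log(searchDocument(searchStub, "my search phrase"))
```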

Note

The resulting array holds one entry per search result. Each entry is an array of rectangles, and each rectangle is a sequence of eight numbers [ulx, uly, urx, ury, llx, lly, lrx, lry] giving its upper-left, upper-right, lower-left, and lower-right corners. Rectangles of this type are known as QuadPoints in the PDF specification.

For example, the following represents two search results, each described by a single “QuadPoint” (or “Quad”):

EXAMPLE

[
    [
        [
            97.44780731201172,
            32.626708984375,
            114.12963104248047,
            32.626708984375,
            97.44780731201172,
            46.032958984375,
            114.12963104248047,
            46.032958984375
        ]
    ],
    [
        [
            62.767799377441406,
            68.626708984375,
            79.44963073730469,
            68.626708984375,
            62.767799377441406,
            82.032958984375,
            79.44963073730469,
            82.032958984375
        ]
    ]
]
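For highlighting or cropping it is often easier to work with a plain bounding box than with eight corner coordinates. The sketch below converts one quad into an { x, y, w, h } box, using the corner order given in the note above. quadToRect is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: converts one quad
// [ulx, uly, urx, ury, llx, lly, lrx, lry] into an axis-aligned bounding box.
function quadToRect(quad) {
    const xs = [quad[0], quad[2], quad[4], quad[6]]
    const ys = [quad[1], quad[3], quad[5], quad[7]]
    const x = Math.min(...xs)
    const y = Math.min(...ys)
    return { x, y, w: Math.max(...xs) - x, h: Math.max(...ys) - y }
}

// First quad from the search result above.
const firstQuad = [
    97.44780731201172, 32.626708984375,
    114.12963104248047, 32.626708984375,
    97.44780731201172, 46.032958984375,
    114.12963104248047, 46.032958984375,
]
console.log(quadToRect(firstQuad))
```

Taking the minimum and maximum of all four corners keeps the box correct even when the quad is rotated, at the cost of the box being larger than the quad itself.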