Working with Documents#
Passwords & Security#
A document may require a password if it is protected. To check this, use the needsPassword() method as follows:
EXAMPLE
let needsPassword = document.needsPassword()
To provide a password, use the authenticate() method as follows:
EXAMPLE
let auth = document.authenticate("abracadabra")
The return value indicates the outcome of the authentication; see the authenticate password return values for its meaning.
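As a rough illustration of interpreting that return value, the sketch below assumes the bitfield documented for MuPDF password authentication (0 = failed, 1 = no password needed, 2 = user password accepted, 4 = owner password accepted, 6 = both). The helper name describeAuthResult is hypothetical and not part of the library; check the values against the API reference for your version.

```typescript
// Hypothetical helper (not part of mupdfjs): describes the numeric result of authenticate().
// Assumed bitfield: 1 = no password needed, 2 = user password accepted, 4 = owner password accepted.
function describeAuthResult(result: number): string {
    if (result === 0) return "authentication failed";
    const parts: string[] = [];
    if (result & 1) parts.push("no password needed");
    if (result & 2) parts.push("user password accepted");
    if (result & 4) parts.push("owner password accepted");
    // Any value outside the assumed bitfield is reported rather than guessed at.
    return parts.length > 0 ? parts.join(", ") : `unknown result ${result}`;
}
```

With this helper, a call such as describeAuthResult(document.authenticate("abracadabra")) would give a readable status for logging.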
Document Metadata#
Get Metadata#
You can get metadata for a document using the getMetaData() method.
The common keys are: format, encryption, info:ModDate, and info:Title.
EXAMPLE
const format = document.getMetaData("format")
const modificationDate = document.getMetaData("info:ModDate")
const author = document.getMetaData("info:Author")
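Date entries such as info:ModDate are normally stored as PDF date strings of the form D:YYYYMMDDHHmmSS followed by an optional time zone (Z, or an offset like +01'00'). The sketch below shows one way to convert such a string into a JavaScript Date. The helper name parsePdfDate is hypothetical, and treating a value without an offset as UTC is an assumption.

```typescript
// Hypothetical helper (not part of mupdfjs): parses a PDF date string
// such as "D:20240102030405+01'00'" into a Date, or returns null if it does not match.
function parsePdfDate(value: string): Date | null {
    const m = /^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$/.exec(value);
    if (!m) return null;
    // Missing fields fall back to the defaults given in the PDF date format.
    const num = (s: string | undefined, fallback: number): number =>
        s === undefined ? fallback : Number(s);
    const utc = Date.UTC(
        num(m[1], 0), num(m[2], 1) - 1, num(m[3], 1),
        num(m[4], 0), num(m[5], 0), num(m[6], 0));
    // Assumption: "Z" or no offset at all is treated as UTC.
    if (m[8] === undefined) return new Date(utc);
    const offsetMinutes = num(m[9], 0) * 60 + num(m[10], 0);
    const sign = m[8] === "+" ? 1 : -1;
    // Local time ahead of UTC means the UTC instant is earlier.
    return new Date(utc - sign * offsetMinutes * 60000);
}
```

For example, parsePdfDate(document.getMetaData("info:ModDate")) would yield a Date when the stored value follows this format, and null otherwise.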
Set Metadata#
You can set metadata for a document using the setMetaData() method.
EXAMPLE
document.setMetaData("info:Author", "Jane Doe")
Get the Document Page Count#
Count the number of pages in the document.
EXAMPLE
const numPages = document.countPages()
Load a Page of a Document#
To load a page of a document, call the loadPage() method with the zero-based page index; it returns a page instance.
EXAMPLE
// load the 1st page of the document
let page = document.loadPage(0)
Splitting a Document#
To split a document’s pages into new documents, use the split() method. Supply an array of the page indices at which each new document should start.
EXAMPLE
let documents = document.split([0,2,3])
The example above would return three new documents from a 10 page PDF as follows:
Document containing pages 1 & 2 (from index 0)
Document containing page 3 (from index 2)
Document containing pages 4-10 (from final index 3)
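To make the relationship between the split indices and the resulting documents concrete, the following sketch computes the zero-based page range that each new document would cover, based on the behaviour described above. The helper splitRanges is hypothetical and does not call the library.

```typescript
// Hypothetical helper (not part of mupdfjs): for each split index, returns the
// zero-based [firstPage, lastPage] range of the document that starts there.
function splitRanges(indices: number[], pageCount: number): Array<[number, number]> {
    const starts = [...indices].sort((a, b) => a - b);
    return starts.map((start, k): [number, number] => {
        const next = starts[k + 1];
        // Each range ends just before the next start; the last one runs to the final page.
        const end = next !== undefined ? next - 1 : pageCount - 1;
        return [start, end];
    });
}

// The 10 page example above: indices 0, 2 and 3.
const exampleRanges = splitRanges([0, 2, 3], 10);
// exampleRanges is [[0, 1], [2, 2], [3, 9]], i.e. pages 1-2, page 3 and pages 4-10
```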
Merging Documents#
To merge documents we can use the merge() method.
See the script below for an example implementation.
EXAMPLE
// create a blank document and add some text
let sourcePDF = mupdfjs.PDFDocument.createBlankDocument()
let page = sourcePDF.loadPage(0)
page.insertText("HELLO WORLD",
[0,0],
"Times-Roman",
20,
{strokeColor:[0,0,0,1], fillColor:[1,0,0,0.75], strokeThickness:0.5})
// now merge this document onto page 2 of our document and rotate it by 90 degrees
document.merge(sourcePDF, 0, -1, 1, 90);
Extracting Document Text#
To get the text for an entire document we can retrieve StructuredText objects as JSON
for each page as follows:
EXAMPLE
let i = 0
while (i < document.countPages()) {
    const page = document.loadPage(i)
    const json = page.toStructuredText("preserve-whitespace").asJSON()
    console.log(`json=${json}`)
    i++
}
StructuredText contains objects from a page that have been analyzed and grouped into blocks, lines and spans. As such the JSON
returned is structured and contains positional data and font data alongside text values, e.g.:
EXAMPLE
{
    "blocks": [
        {
            "type": "text",
            "bbox": { "x": 30, "y": 32, "w": 216, "h": 13 },
            "lines": [
                {
                    "wmode": 0,
                    "bbox": { "x": 30, "y": 32, "w": 216, "h": 13 },
                    "font": {
                        "name": "FKGYDX+Arial",
                        "family": "sans-serif",
                        "weight": "normal",
                        "style": "normal",
                        "size": 12
                    },
                    "x": 30,
                    "y": 43,
                    "text": "Welcome to the Node server test.pdf file."
                }
            ]
        },
        {
            "type": "text",
            "bbox": { "x": 30, "y": 68, "w": 190, "h": 13 },
            "lines": [
                {
                    "wmode": 0,
                    "bbox": { "x": 30, "y": 68, "w": 190, "h": 13 },
                    "font": {
                        "name": "FKGYDX+Arial",
                        "family": "sans-serif",
                        "weight": "normal",
                        "style": "normal",
                        "size": 12
                    },
                    "x": 30,
                    "y": 79,
                    "text": "Sorry there is not much to see here!"
                }
            ]
        },
        {
            "type": "text",
            "bbox": { "x": 568, "y": 31, "w": 6, "h": 13 },
            "lines": [
                {
                    "wmode": 0,
                    "bbox": { "x": 568, "y": 31, "w": 6, "h": 13 },
                    "font": {
                        "name": "YDTIJL+Arial",
                        "family": "sans-serif",
                        "weight": "normal",
                        "style": "normal",
                        "size": 12
                    },
                    "x": 568,
                    "y": 42,
                    "text": "1"
                }
            ]
        },
        {
            "type": "text",
            "bbox": { "x": 28, "y": 744, "w": 84, "h": 19 },
            "lines": [
                {
                    "wmode": 0,
                    "bbox": { "x": 28, "y": 744, "w": 84, "h": 19 },
                    "font": {
                        "name": "Arial",
                        "family": "sans-serif",
                        "weight": "normal",
                        "style": "normal",
                        "size": 14
                    },
                    "x": 28,
                    "y": 759,
                    "text": "Page 1 footer"
                }
            ]
        }
    ]
}
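If only the plain text is needed, the JSON can be reduced to a list of line strings. The sketch below is a minimal example that relies only on the blocks, type, lines and text fields visible in the output above. The helper extractLines and the interface names are hypothetical, and skipping blocks whose type is not "text" is an assumption about what other block types may appear.

```typescript
// Minimal shape of the JSON shown above; only the fields used here are declared.
interface StextLine { text: string }
interface StextBlock { type: string; lines?: StextLine[] }
interface StextPage { blocks: StextBlock[] }

// Hypothetical helper (not part of mupdfjs): returns the text of every line, in block order.
function extractLines(json: string): string[] {
    const page = JSON.parse(json) as StextPage;
    const out: string[] = [];
    for (const block of page.blocks) {
        // Assumption: blocks of other types carry no text lines and can be skipped.
        if (block.type !== "text") continue;
        for (const line of block.lines ?? []) out.push(line.text);
    }
    return out;
}

// A cut-down version of the output above, used as sample input.
const sampleJson = JSON.stringify({
    blocks: [
        { type: "text", lines: [{ text: "Welcome to the Node server test.pdf file." }] },
        { type: "text", lines: [{ text: "Page 1 footer" }] },
    ],
});
const sampleLines = extractLines(sampleJson);
```

In the per-page loop, the same helper could be applied to the string returned by asJSON() to collect the text of the whole document.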
Extracting Document Images#
To get the images for an entire document use the getImages() method on each page.
EXAMPLE
let i = 0
while (i < document.countPages()) {
    const page = document.loadPage(i)
    let imageStack = page.getImages()
    i++
}
The following example would extract all the images from a document and save them as individual files:
let i = 0
while (i < document.countPages()) {
    const page = document.loadPage(i)
    const imageStack = page.getImages()
    // write each image on the page to its own JPEG file
    for (let j = 0; j < imageStack.length; j++) {
        const image = imageStack[j].image
        const pixmap = image.toPixmap()
        const raster = pixmap.asJPEG(80)
        fs.writeFileSync('page-'+i+'-image-'+j+'.jpg', raster)
    }
    i++
}
Extracting Document Annotations#
We can retrieve Annotation objects from document pages by querying each page with getAnnotations().
EXAMPLE
let i = 0
while (i < document.countPages()) {
    // load page i (not always page 0) so that every page is visited
    const page = document.loadPage(i)
    const annots = page.getAnnotations()
    console.log(`Page=${i}, Annotations=${annots.length}`)
    i++
}
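The same per-page loop appears in the text, image and annotation examples, so it can be factored into a small helper. The sketch below is one possible shape, not part of the library: the PageSource interface and mapPages function are hypothetical and model only the countPages() and loadPage() calls used above. A stand-in object is used in place of a real document so that the block runs on its own.

```typescript
// Hypothetical interface: anything exposing the two document methods used in the loops above.
interface PageSource<P> {
    countPages(): number;
    loadPage(index: number): P;
}

// Hypothetical helper: calls fn once per page and collects the results in page order.
function mapPages<P, R>(doc: PageSource<P>, fn: (page: P, index: number) => R): R[] {
    const results: R[] = [];
    const count = doc.countPages();
    for (let i = 0; i < count; i++) {
        results.push(fn(doc.loadPage(i), i));
    }
    return results;
}

// Stand-in for a real document: two pages, the first carrying one annotation.
const fakeDocument: PageSource<{ getAnnotations(): string[] }> = {
    countPages: () => 2,
    loadPage: (index) => ({ getAnnotations: () => (index === 0 ? ["Highlight"] : []) }),
};
const annotationCounts = mapPages(fakeDocument, (page) => page.getAnnotations().length);
// annotationCounts is [1, 0]
```

Because the helper always passes the loop index to loadPage(), it avoids the easy mistake of loading the same page on every iteration.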
“Baking” a Document#
Flattening a document’s annotations and/or widgets into the page content is known as “baking”.
You can use the bake() method as follows:
EXAMPLE
document.bake()
Attaching a File to a Document#
Use the attachFile() method on a document instance with a supplied name and Buffer for the data.
EXAMPLE
const content = "Test content";
const buffer = new mupdfjs.Buffer();
buffer.writeLine(content);
document.attachFile("test.txt", buffer);
Removing a File from a Document#
Use the deleteEmbeddedFile() method on a document instance to remove an attached file.
EXAMPLE
document.deleteEmbeddedFile("test.txt")
Searching a Document#
To search a document we can look at each page and use the search() method as follows:
EXAMPLE
let results = page.search("my search phrase")
Note
Each result in the returned array is a list of quads, and each quad is a sequence of eight numbers, [ulx, uly, urx, ury, llx, lly, lrx, lry], giving the upper-left, upper-right, lower-left and lower-right corners of a rectangle. This type of rectangle is known as a QuadPoint in the PDF specification.
For example, the following represents a search with two results, each consisting of one “QuadPoint” (or “Quad”):
EXAMPLE
[
    [
        [
            97.44780731201172, 32.626708984375,
            114.12963104248047, 32.626708984375,
            97.44780731201172, 46.032958984375,
            114.12963104248047, 46.032958984375
        ]
    ],
    [
        [
            62.767799377441406, 68.626708984375,
            79.44963073730469, 68.626708984375,
            62.767799377441406, 82.032958984375,
            79.44963073730469, 82.032958984375
        ]
    ]
]
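For highlighting or cropping it is often easier to work with a bounding box than with eight corner coordinates. The sketch below converts each quad in a search result into an x, y, w, h rectangle, assuming the nested array layout shown above. The names Rect, quadToRect and searchResultBounds are hypothetical and not part of the library.

```typescript
// Rectangle in the same x/y/w/h form used by the bbox entries of the StructuredText JSON.
interface Rect { x: number; y: number; w: number; h: number }

// Hypothetical helper: bounding box of one quad [ulx, uly, urx, ury, llx, lly, lrx, lry].
function quadToRect(quad: number[]): Rect {
    const xs = quad.filter((_, i) => i % 2 === 0); // x coordinates sit at even positions
    const ys = quad.filter((_, i) => i % 2 === 1); // y coordinates sit at odd positions
    const x = Math.min(...xs);
    const y = Math.min(...ys);
    return { x, y, w: Math.max(...xs) - x, h: Math.max(...ys) - y };
}

// Hypothetical helper: maps every quad of every search result to its bounding box.
function searchResultBounds(results: number[][][]): Rect[][] {
    return results.map((quads) => quads.map((quad) => quadToRect(quad)));
}

// One result made of one axis-aligned quad.
const exampleBounds = searchResultBounds([[[10, 20, 30, 20, 10, 40, 30, 40]]]);
// exampleBounds is [[{ x: 10, y: 20, w: 20, h: 20 }]]
```

Taking the minimum and maximum of the corners also gives a usable box for rotated text, although the box is then larger than the quad itself.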
Getting Document Links#
To get document links (if any) we can look at each page and use the getLinks() method as follows:
EXAMPLE
let links = page.getLinks()
Note
The result is an array of Link objects, each of which has its own bounds and uri.
Code samples
Code samples are in TypeScript and assume that the following imports appear at the top of your TypeScript file:
import * as fs from "fs"
import * as mupdfjs from "mupdf/mupdfjs"