Starting with 3.2.1, x and y coordinates in pdfData.Pages[].Texts are incorrect #408

@dpmott

For example, when parsing a PDF such as test/pdf/misc/i28_line_break_210.pdf in this repo,

      import PDFParser from 'pdf2json';

      // PDFParserError and PDFData are type names used by the reporter;
      // their declarations are not shown here.
      const pdfParser = new PDFParser();

      pdfParser.on('pdfParser_dataError', (errData: PDFParserError) => {
        // handle error
      });

      pdfParser.on('pdfParser_dataReady', (pdfData: PDFData) => {
        try {
          // write pdfData.Pages[0].Texts to file
        } catch (error) {
          // handle error
        }
      });

      pdfParser.loadPDF('test/pdf/misc/i28_line_break_210.pdf');

The resulting output is something like this:

[ { x: -0.25,
    y: 48.75,
    w: 3,
    clr: 0,
    sw: 0.32553125,
    A: 'left',
    R: [ { T: '%20', S: -1, TS: [ 0, 15, 0, 0 ] } ] },
  { x: -0.25,
    y: 48.75,
    w: 110.016,
    clr: 0,
    sw: 0.32553125,
    A: 'left',
    R: [ { T: 'BY%20ORDER%20OF%20THE%20', S: -1, TS: [ 0, 16, 1, 1 ] } ] },
  { x: -0.25,
    y: 48.75,
    w: 3,
    clr: 0,
    sw: 0.32553125,
    A: 'left',
    R: [ { T: '%20', S: -1, TS: [ 0, 16, 1, 1 ] } ] },
  { x: -0.25,
    y: 48.75,
    w: 140.376,
    clr: 0,
    sw: 0.32553125,
    A: 'left',
    R: [ { T: 'SECRETARY%20OF%20THE%20AIR', S: -1, TS: [ 0, 16, 1, 1 ] } ] },
...

Sometimes the first item has its own x/y coordinates, but every element after it shares one identical x/y pair.
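One way to spot this symptom programmatically is to count how many distinct (x, y) positions appear on a page. The sketch below is only an illustration: `TextItem` and `countDistinctPositions` are hypothetical names, not part of pdf2json, and the sample data is the four entries from the output above reduced to the fields used here.

```typescript
// Minimal shape of an entry in pdfData.Pages[].Texts, based on the output above.
// (Hypothetical type name; only the fields needed here are declared.)
interface TextItem {
  x: number;
  y: number;
  R: { T: string }[];
}

// Hypothetical helper: counts the distinct (x, y) positions among the items.
function countDistinctPositions(texts: TextItem[]): number {
  const seen = new Set<string>();
  for (const t of texts) {
    seen.add(`${t.x},${t.y}`);
  }
  return seen.size;
}

// The four entries shown in the output above.
const sample: TextItem[] = [
  { x: -0.25, y: 48.75, R: [{ T: "%20" }] },
  { x: -0.25, y: 48.75, R: [{ T: "BY%20ORDER%20OF%20THE%20" }] },
  { x: -0.25, y: 48.75, R: [{ T: "%20" }] },
  { x: -0.25, y: 48.75, R: [{ T: "SECRETARY%20OF%20THE%20AIR" }] },
];

const distinct = countDistinctPositions(sample);
console.log(`${distinct} distinct position(s) for ${sample.length} text items`);
```

A page with many text items but only one or two distinct positions would be a strong hint that the coordinates are affected.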

This breaks spatially aware parsing and grouping algorithms, which depend on these coordinates.
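To illustrate the kind of algorithm that is affected, here is a minimal sketch of line grouping by y coordinate. It is not the code from any real project: `PositionedText`, `groupIntoLines`, the tolerance value, and both sample arrays are made up for illustration, with coordinates chosen by hand.

```typescript
// Hypothetical type: the subset of a Texts entry that the grouping needs.
interface PositionedText {
  x: number;
  y: number;
  R: { T: string }[];
}

// Hypothetical helper: groups items whose y values lie within yTolerance
// of the first item of a line, then orders each line left to right by x.
function groupIntoLines(texts: PositionedText[], yTolerance = 0.1): string[] {
  const sorted = [...texts].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: { y: number; items: PositionedText[] }[] = [];
  for (const t of sorted) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(current.y - t.y) <= yTolerance) {
      current.items.push(t);
    } else {
      lines.push({ y: t.y, items: [t] });
    }
  }
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.x - b.x)
      // T is URI-encoded in the output (e.g. "%20" for a space).
      .map((t) => t.R.map((r) => decodeURIComponent(r.T)).join(""))
      .join("")
  );
}

// Made-up coordinates that differ per item: two lines are recovered.
const healthy: PositionedText[] = [
  { x: 12, y: 3, R: [{ T: "WORLD" }] },
  { x: 2, y: 3, R: [{ T: "HELLO%20" }] },
  { x: 2, y: 5, R: [{ T: "NEXT%20LINE" }] },
];

// The same items with every coordinate collapsed to one point, as in the report.
const collapsed: PositionedText[] = healthy.map((t) => ({ ...t, x: -0.25, y: 48.75 }));

console.log(groupIntoLines(healthy));          // two lines
console.log(groupIntoLines(collapsed).length); // 1: everything merges into one line
```

With distinct coordinates the items are split into lines and ordered by x. Once every item reports the same point, the grouping has nothing to work with and falls back to whatever order the items arrived in.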

The output was correct in version 3.2.0 and is broken in versions 3.2.1 and 3.2.2.

The issue happens for both parseBuffer() and loadPDF().

The major refactor of version 3.2.1 added "Type3 glyph font support", but after reviewing the diff I was unable to identify where the above undesirable behavior was introduced.
