Playwright script to generate PDF of a page that requires scrolling (Khan Academy)

I currently use One Click Page to PDF to get the job done, but the lack of working open-source tools for this task has led me to write my own. The web page "that won't print" is this Khan Academy article.

It seems I have two options, both involving Playwright:

A. (Somehow) embed text on a page screenshot

pros:
- I can get the whole page all at once without worrying about bits being cut off
cons:
- positioning text layer in the right place is likely going to be a pain
- splitting one long 'scroll' into PDF pages will be a fiddly process

B. Generate a single-page PDF of what is currently visible, scroll just enough for the next page and repeat, combining all PDFs into one

pros:
- no awkward image-to-PDF-with-text conversion required (pagination and text layer come out of the box)
cons:
- figuring out just how much to scroll won't be trivial (easy to under/overshoot)
- page header and footer removal is required (otherwise they would be repeated for every page)
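
For option B, the scroll positions can at least be computed up front. The following is a minimal sketch, assuming the total scrollable height and the height of content that fits on one PDF page are already known; `scrollOffsets` and both of its parameters are hypothetical names, not part of Playwright.

```javascript
// Hypothetical helper for option B: list the scrollTop values at which each
// page would be captured, so consecutive captures neither overlap nor leave gaps.
// totalHeight: full scrollable height of the content, in CSS pixels (assumed known).
// pageHeight: height of content that fits on one PDF page, in CSS pixels.
function scrollOffsets(totalHeight, pageHeight) {
  if (!(totalHeight >= 0) || !(pageHeight > 0)) {
    throw new RangeError("totalHeight must be >= 0 and pageHeight must be > 0");
  }
  const offsets = [];
  for (let y = 0; y < totalHeight; y += pageHeight) {
    offsets.push(y);
  }
  return offsets;
}

console.log(scrollOffsets(2500, 1000)); // [ 0, 1000, 2000 ]
```

Note that the last offset can exceed the furthest position the container will actually scroll to (here 1500 for a 1000px viewport), so the final capture would still need clamping or cropping. This is where the over/undershoot concern above comes from.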

Before I embark on this quest, I am curious to hear opinions from more experienced people!

I will try to code this in JavaScript, as I understand it gives greater flexibility in Playwright, although I would generally find my way more easily in Python. How do you think I should tackle this problem, aiming for a simple and effective solution?

Web2PDF is likely the closest base I can work from: it runs directly in browser, so as a program it is simpler to work with and more straightforward to debug.

posted 10 days ago

CC BY-SA 4.0

Elefy

1 comment thread

clarifying the problem (2 comments)

1 answer

I'm not sure what you've tried or where you're stuck with your attempts exactly, but I can answer generally as this is a common scenario. It's a good illustration that there are rarely one-size-fits-all solutions in web scraping.

If a scroll container is cutting off content, as it is here, you'll need to "pop" the content out of the container so it scrolls at the top level of the page. I typically do this by removing or hiding all elements except those in the container's subtree.

But doing a hard strip here crashes the page, so this script:

- forces the scroll container element tree to be visible
- makes all other elements invisible
- sets general styles to make the visible elements look presentable
- finally, captures the PDF.

const {chromium} = require("playwright"); // ^1.58.0

const url = "https://www.khanacademy.org/math/ap-calculus-ab/ab-diff-analytical-applications-new/ab-5-6b/a/review-analyzing-the-second-derivative-to-find-inflection-points";

let browser;
(async () => {
  browser = await chromium.launch({headless: true});
  const width = 1200;
  const page = await browser.newPage({
    viewport: {width, height: 1200},
  });
  await page.goto(url, {waitUntil: "networkidle"});
  await page
    .locator('[data-testid="content-panel-wrapper"]')
    .evaluate(target => {
      // Walk from the target up to the root, undoing clipping and fixed heights.
      for (let el = target; el; el = el.parentElement) {
        el.style.overflow = "visible";
        el.style.overflowY = "visible";
        el.style.overflowX = "visible";
        el.style.height = "auto";
        el.style.maxHeight = "none";
        el.style.minHeight = "0";
        el.style.position = "static";
      }

      // Elements that must stay visible: the target, its descendants, and its ancestors.
      const keep = new Set();
      keep.add(target);
      target.querySelectorAll("*").forEach(el => keep.add(el));

      for (
        let el = target.parentElement;
        el;
        el = el.parentElement
      ) {
        keep.add(el);
      }

      // Hide everything else instead of removing it, since a hard strip crashes this page.
      document.querySelectorAll("body *").forEach(el => {
        if (!keep.has(el)) {
          el.style.display = "none";
        }
      });

      target.style.position = "absolute";
      target.style.left = "0";
      target.style.top = "0";
      target.style.width = "100%";
      target.style.maxWidth = "none";
      target.style.margin = "0";
      target.style.padding = "20px";
      document.body.style.margin = "0";
      document.body.style.padding = "0";
      document.body.style.overflow = "visible";
      document.documentElement.style.overflow = "visible";
    });

  // Give layout a moment to settle, then measure the full document height.
  await page.waitForTimeout(1000);
  const height = await page.evaluate(() =>
    Math.max(
      document.body.scrollHeight,
      document.documentElement.scrollHeight
    )
  );
  // Size both the viewport and the PDF page to the measured content height.
  await page.setViewportSize({width, height});
  await page.pdf({
    path: "khan.pdf",
    printBackground: true,
    width: `${width}px`,
    height: `${height}px`,
    margin: {
      top: "0",
      right: "0",
      bottom: "0",
      left: "0",
    },
  });
  console.log("Saved khan.pdf");
})()
  .finally(() => browser?.close());
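
The `[data-testid="content-panel-wrapper"]` selector is specific to this page. On other pages, a rough way to locate candidate scroll containers is to look for elements whose computed `overflow-y` allows scrolling and whose content is taller than their box. The sketch below is only a heuristic, and `isScrollContainer` is a hypothetical name; it would miss, for example, containers that clip with `overflow: hidden`.

```javascript
// Hypothetical predicate: does an element with these measurements hold
// more content than it shows, behind its own vertical scrollbar?
function isScrollContainer({overflowY, scrollHeight, clientHeight}) {
  const scrollable = overflowY === "auto" || overflowY === "scroll";
  return scrollable && scrollHeight > clientHeight;
}

// In the browser (for example inside page.evaluate, with the predicate
// defined in that same context), it could be applied like this:
//   [...document.querySelectorAll("*")].filter(el =>
//     isScrollContainer({
//       overflowY: getComputedStyle(el).overflowY,
//       scrollHeight: el.scrollHeight,
//       clientHeight: el.clientHeight,
//     })
//   );
```

The predicate takes plain numbers and strings rather than a DOM element so it can be checked outside a browser before being dropped into a script.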

posted 8 days ago

CC BY-SA 4.0

ggorlen

1 comment thread

@ggorlen thank you, it works great! My only concern is: can we tell Playwright when to page break... (5 comments)
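
On the page-break question, one possible direction (not taken from the comment thread, and untested against this page) is to switch `page.pdf` from a custom `width`/`height` to a paper `format` and inject print CSS that asks the browser not to split certain elements. In the sketch below, `buildPageBreakCss` and `savePaginatedPdf` are hypothetical helpers and the selectors are placeholders rather than anything from the Khan Academy markup; `page.addStyleTag` and the `format` option of `page.pdf` are existing Playwright APIs.

```javascript
// Hypothetical helper: build print CSS asking the browser not to split the
// given elements across pages. Selectors are placeholders chosen for illustration.
function buildPageBreakCss(avoidSelectors) {
  if (avoidSelectors.length === 0) return "";
  return [
    "@media print {",
    `  ${avoidSelectors.join(", ")} {`,
    "    break-inside: avoid;",
    "  }",
    "}",
  ].join("\n");
}

// Sketch of use with an already prepared Playwright page (assumed to have
// had its scroll container popped out as in the answer's script).
async function savePaginatedPdf(page, path) {
  const css = buildPageBreakCss(["img", "figure", "table"]);
  await page.addStyleTag({content: css});
  await page.pdf({path, format: "A4", printBackground: true});
}
```

`break-inside: avoid` is a request rather than a guarantee: an element taller than one page will still be split.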
