
I am exploring Scrapy with Splash and trying to scrape all the product (pants) data, with product ID, name and price, from the e-commerce site Gap. However, not all of the dynamically loaded product data appears when I render the page in the Splash web UI: only 16 items load for every request, and I have no clue why. I tried the following options, with no luck:

  • Increasing the wait time up to 20 seconds
  • Starting the Docker container with "--disable-private-mode"
  • Using a Lua script to scroll the page
  • Using the full-viewport option, splash:set_viewport_full()

from scrapy_splash import SplashRequest

# Lua script run by Splash: load the page, then scroll to the bottom repeatedly.
lua_script2 = """
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 2.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
"""

        # Inside the spider method that generates the requests:
        yield SplashRequest(
            url,
            self.parse_product_contents,
            endpoint='execute',
            args={
                'lua_source': lua_script2,
                'wait': 5,
            },
        )
 

Can anyone please shed some light on this behavior? P.S.: I am using the Scrapy framework, and I am able to parse the product information (item ID, name and price) from the render.html output, but that output contains only 16 items.

  • How about using the API to get the data directly instead of going to so much effort? "gap.com/resources/productSearch/v1/search?cid=80799" Commented Sep 5, 2017 at 19:18
  • Hi Tarun, thanks for the reply. My goal is to crawl into each product listed on the site (for example, each pair of pants here) and get all the available SKUs (for example, this pant has nearly 23 sizes (SKUs), which I can see in the view-source link but did not find through the API). I am new to this API approach. Can you please give some more information? Commented Sep 6, 2017 at 0:42
  • I have tried splash:set_viewport_full() but had no luck, and I also tried it from Scrapy with yield SplashRequest(url, self.parse_product_contents, args={'wait': 10, 'viewport': 'full', 'render_all': 1}, endpoint='render.html'). Still no luck. But when I set the viewport to large dimensions with splash:set_viewport_size(1980, 8020), I saw the content get loaded, although this still has a limit: when I try to increase the dimensions of the PNG, I get the error "Viewport is out of range (20000x20000, area=16000000)". Commented Sep 6, 2017 at 8:51
  • @TarunLalwani, if you have any thoughts, please share. Commented Sep 6, 2017 at 13:24
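
The API route suggested in the first comment could be tried with a few lines of Python. This is only a minimal sketch: the https scheme, the helper names build_search_url and fetch_search_json, and the assumption that the endpoint returns JSON are all mine and not confirmed by the thread.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the path quoted in the comment is reachable over https on gap.com.
SEARCH_ENDPOINT = "https://gap.com/resources/productSearch/v1/search"


def build_search_url(category_id):
    """Build the product search URL for one category id (hypothetical helper)."""
    query = urllib.parse.urlencode({"cid": category_id})
    return f"{SEARCH_ENDPOINT}?{query}"


def fetch_search_json(category_id, timeout=30):
    """Fetch the search result and parse it as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(build_search_url(category_id), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_search_json(80799) would request the category from the comment; the layout of the returned data is not described in the thread, so it needs to be inspected before any fields are extracted.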

1 Answer


I updated the script as follows:

function main(splash)
    local num_scrolls = 10
    local scroll_delay = 2.0
    -- Use a tall viewport so more of the product grid gets rendered
    splash:set_viewport_size(1980, 8020)
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
--    splash:set_viewport_full()
    splash:wait(10)
    -- Close the email subscribe popup, which otherwise blocks scrolling
    splash:runjs("jQuery('span.icon-x').click();")
    splash:wait(1)
    -- Scroll to the bottom repeatedly so the lazy-loaded products are fetched
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end

    splash:wait(30)

    return {
        png = splash:png(),
        html = splash:html(),
        har = splash:har()
    }
end

I ran it in my local Splash; the PNG does not come out right, but the HTML does contain the last product.

Last Image on page

Splash Rendered HTML

The only issue was that the page would not scroll while the email subscribe popup was open, so I added code to close it.
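
To check the script outside Scrapy, it can be posted directly to Splash's HTTP execute endpoint. The following is a minimal sketch, assuming Splash listens on http://localhost:8050 (the usual Docker port mapping); the helper names build_execute_request, decode_result and render are hypothetical. When a Lua script returns a table, Splash sends it back as JSON, with binary values such as the PNG base64-encoded. The script above waits for about a minute in total, which is longer than Splash's default 30-second render timeout as far as I know, so the sketch asks for a larger one.

```python
import base64
import json
import urllib.request

# Assumption: Splash is reachable on the usual Docker port mapping.
SPLASH_EXECUTE_URL = "http://localhost:8050/execute"


def build_execute_request(page_url, lua_source, timeout=90,
                          splash_url=SPLASH_EXECUTE_URL):
    """Build a POST request for Splash's execute endpoint (hypothetical helper)."""
    payload = {
        "lua_source": lua_source,
        "url": page_url,      # available in the script as splash.args.url
        "timeout": timeout,   # must not exceed the server's --max-timeout
    }
    return urllib.request.Request(
        splash_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_result(body):
    """Split the JSON table returned by the script into HTML text and PNG bytes."""
    result = json.loads(body)
    html = result.get("html", "")
    png = base64.b64decode(result["png"]) if result.get("png") else b""
    return html, png


def render(page_url, lua_source, timeout=90):
    """Send the script to a running Splash and return (html, png_bytes)."""
    request = build_execute_request(page_url, lua_source, timeout=timeout)
    # Give the HTTP client a little more time than the render itself.
    with urllib.request.urlopen(request, timeout=timeout + 10) as response:
        return decode_result(response.read())
```

In Scrapy the same table should be available in the callback as response.data when the request uses endpoint='execute', so the HTML would be read from response.data['html'] rather than from the response body.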


8 Comments

That's great! But when I tried to run the same script in my local Splash, I got a bad request error. Is there anything I need to do to get it to work? Please tell me. ` { "type": "ScriptError", "error": 400, "description": "Error happened while executing Lua script", "info": { "line_number": 9, "type": "LUA_ERROR", "source": "[string \"function main(splash)\r...\"]", "message": "Lua error: [string \"function main(splash)\r...\"]:9: network3", "error": "network3" } }`
I pulled the latest Docker image of Splash; maybe you are using an older one? Also try changing function main(splash) to function main(splash, args).
I pulled the Docker image 3 days ago (docker pull scrapinghub/splash), so I believe it is the latest one. I tried function main(splash, args) and faced the same network issue. Let me check Docker once again. Did you by any chance pull scrapinghub/splash:master?
No, I just pulled scrapinghub/splash and ran the script in the browser renderer.
I think Splash is giving a response after I make multiple requests, and intermittently giving the bad request error. Weird!
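
Since the error shows up only intermittently, one option is to recognise it on the client side and retry the request. The sketch below only inspects an error body shaped like the one quoted in the first comment; the helper name is_network_error is hypothetical, and treating every "network..." error as retryable is my assumption, not something confirmed in this thread.

```python
import json


def is_network_error(error_body):
    """Return True if a Splash ScriptError body reports a 'network...' failure."""
    try:
        payload = json.loads(error_body)
    except ValueError:
        # Not JSON at all, so not the error format shown in the comment.
        return False
    if not isinstance(payload, dict):
        return False
    info = payload.get("info") or {}
    error_name = str(info.get("error", ""))
    return payload.get("type") == "ScriptError" and error_name.startswith("network")
```

A caller could check the body of a 400 response with this helper and re-send the request a limited number of times before giving up.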