opaidios · 9mo ago

Buildship Scraping Help

I need to scrape websites by cycling through variations of the same URL (I provide the URL extension), but I can't figure out how to get the crawler to visit a root website (e.g. www.google.com) and then cycle through the URL extensions (www.google.com/1, /2, /3, /4, /5), extracting the data attached to the selector. Thank you in advance.
5 Replies
Gaurav Chadha · 9mo ago
Hi @opaidios, currently there is no option to directly cycle through different pages. Maybe we can implement something like this: https://github.com/puppeteer/puppeteer/blob/1e66d332b8faf6a15803c0ad36178e56d4dadf7b/docs/api.md#pagewaitforselectorselector-options. @Deepanshu, is it feasible to support this in our current Puppeteer runner?
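For reference, a minimal sketch of what that could look like in plain Puppeteer, outside of Buildship. The URL extensions and the `.result` selector are hypothetical placeholders, not part of the Buildship crawler:

```ts
import puppeteer from "puppeteer";

// Hypothetical list of URL extensions to cycle through.
const extensions = ["1", "2", "3", "4", "5"];
const base = "https://www.google.com/";

async function scrape(): Promise<void> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const ext of extensions) {
    await page.goto(`${base}${ext}`, { waitUntil: "networkidle2" });
    // Wait for the target selector before extracting, as suggested above.
    await page.waitForSelector(".result"); // ".result" is a placeholder selector
    const data = await page.$$eval(".result", (els) =>
      els.map((el) => el.textContent?.trim())
    );
    console.log(ext, data);
  }

  await browser.close();
}

scrape().catch(console.error);
```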
Deepanshu · 9mo ago
@opaidios, can you tell us which node you are using to crawl the website?
opaidios · 9mo ago
@Deepanshu I haven't settled on a node yet; the one I've attempted is 'Crawler'. I will try other options today. Thank you both for looking into this. I had the AI write a node to send data via webhook to Make.com, so hopefully there is a workaround. I tried running the Crawler in parallel, but it can't select the URL extension from a list of items.
Gaurav Chadha · 9mo ago
Try adding the list of URLs as items and looping through them: give the Loop node a try, and add the Crawler node inside it, as in the sketch below.
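In plain code, the shape of that Loop-plus-Crawler setup looks roughly like this. `crawlUrl` is a made-up stand-in for the Crawler node, not its real API:

```ts
// Hypothetical stand-in for Buildship's Crawler node: given a URL and a
// CSS selector, return whatever text the selector matches. Placeholder only;
// the real node handles fetching and extraction internally.
async function crawlUrl(url: string, selector: string): Promise<string[]> {
  return []; // actual extraction happens inside the Crawler node
}

// What the Loop node does conceptually: iterate the item list and
// run the Crawler once per URL.
async function main(): Promise<void> {
  const urls = [
    "https://www.google.com/1",
    "https://www.google.com/2",
    "https://www.google.com/3",
  ];
  for (const url of urls) {
    const rows = await crawlUrl(url, ".result"); // ".result" is a placeholder
    console.log(url, rows);
  }
}

main().catch(console.error);
```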
opaidios · 9mo ago
here's the catch; the base url is "https://www.google.com/maps/search/?api=1&query=gym&query_place_id=" & the data that I've extracted so far looks like this [ { "lat": 6.2476376, "lng": -75.56581530000001, "place_id": "ChIJBa0PuN8oRI4RVju1x_x8E0I" }, { "lat": 6.2073151, "lng": -75.57068579999999, "place_id": "ChIJA-ZKeCooRI4RsT_eKlHovT8" }, { "lat": 6.211902599999998, "lng": -75.5652279, "place_id": "ChIJOx9p1igoRI4RzNB2nThQF0M" }] I need to add the place_id to the end of the url and extract the data that way I sent the data to a replit script to return a list of urls instead with the place_id, I will try crawler again with new data