Newest 'stormcrawler' Questions

0 votes

1 answer

43 views

Is there an automatic way to switch between different protocols automatically [closed]

This is for pages which require a browser to render. I have a long list of URLs and I don't know which will work with httpProtocol and which require the selenium protocol. Is there a way to automatically ...

jaspreet chahal

9,149

asked Dec 3 at 7:41

0 votes

1 answer

87 views

Problem passing crawler configuration yaml files to stormcrawler

I have Storm running locally as a single-machine setup. I want to send a topology with an alternative yaml configuration for the crawler. I get an error when the topology cannot load an expected ...

Patricio Page

11

asked Jan 11, 2024 at 20:01

0 votes

0 answers

70 views

Unable to Inject URL seed file in stormcrawler

I am new to StormCrawler and am using SC with Elasticsearch. I am unable to inject URLs (seeds.txt). There are 10 URLs in the seed file. I am following the README file instructions. Here is the command that ...

Safeer Khan

1

asked Dec 12, 2023 at 3:37

-2 votes

1 answer

65 views

Unable to install Stormcrawler error with connection refusal port 7071 [closed]

I am installing StormCrawler on Ubuntu. Everything worked, but I am unable to inject the seeds.txt file. When I run the injector using this command "java -cp target/crawler-1.0-SNAPSHOT.jar ...

Ahmad Afzaal

23

asked Dec 9, 2023 at 4:49

0 votes

1 answer

54 views

How to store custom metatags in elasticsearch index from a website using stormcrawler

I am crawling intranet websites using StormCrawler (v 2.10) and storing data in Elasticsearch (v 7.8.0), using Kibana for visualization. Intranet pages have custom meta tags as below <meta name="...

MarioCB

1

asked Nov 22, 2023 at 16:24

3 votes

0 answers

120 views

KryoException: Buffer underflow error in Apache Storm and Storm-Crawler

I have been encountering a recurring issue during the deployment of a new version of my topology in Storm-Crawler, and I am seeking assistance in understanding and resolving the problem. Error: Upon ...

Hamide Ahadi

56

asked Jul 3, 2023 at 7:09

0 votes

1 answer

54 views

How do I set log level in stormcrawler/storm local?

I am running stormcrawler 2.8 in local mode (storm local...). I wish to wind down the amount of logging, changing the level to WARN or ERROR. I have tried editing the storm worker.XML but that does ...

user17022777

asked Jun 30, 2023 at 21:56

0 votes

1 answer

90 views

Storm Crawler to fetch urls with query string

I am new to StormCrawler. I could configure StormCrawler to fetch and parse the URL "https://pubmed.ncbi.nlm.nih.gov/18926286/". However, I need to crawl https://pubmed.ncbi.nlm.nih.gov/...

Biju George

1

asked Jun 23, 2023 at 6:40

0 votes

1 answer

107 views

StormCrawler: urlfrontier.StatusUpdaterBolt performance bottleneck

We are running a basic StormCrawler topology, with which we want to achieve high throughput. Topology: FrontierSpout, Fetcher, JSoupParser, DummyIndexer, WarcBolt, urlfrontier.StatusUpdaterBolt. However, ...

Michael Dinzinger

3

asked Jun 21, 2023 at 11:54

0 votes

1 answer

29 views

StormCrawler - Metadata fields not being persisted

I have a topology with a spout that emits a tuple to the status stream, which is picked up by the StatusUpdaterBolt, which in turn writes data to an Elasticsearch index. The spout emits a tuple with a ...

ndtreviv

3,654

asked Jun 20, 2023 at 15:28

0 votes

2 answers

54 views

Logging DEBUG messages in Stormcrawler

After looking at this previous question on logging in StormCrawler, it was not completely clear to me how to enable the logging of DEBUG messages when running StormCrawler. By default, I only ...

Michael Dinzinger

3

asked Feb 25, 2023 at 12:48

0 votes

1 answer

108 views

I started web crawling using Storm Crawler but I do not know where crawled results go to? Im not using Solr or Elastic Search

StormCrawler started crawling data, but I cannot find where the data is stored. I need to save this data to a database so I can connect the data to a remote server and have it indexed. ...

abls1

1

asked Jan 18, 2023 at 8:32

0 votes

1 answer

48 views

StormCrawler: setting "maxDepth": 0 prevents ES seed injection

With StormCrawler 2.3-SNAPSHOT, setting "maxDepth": 0 in the urlfilters.json prevents the seed injection into the ES index. Is that the expected behaviour? Or should it be injecting the ...

ejo

5

asked Mar 24, 2022 at 8:20

0 votes

1 answer

91 views

Problem running example topology with storm-crawler 2.3-SNAPSHOT

I'm building SC 2.3-SNAPSHOT from source and generating a project from the archetype. Then I try to run the example Flux topology. Seeds are injected properly. I can see all of them in the ES index ...

ejo

5

asked Feb 28, 2022 at 8:58

0 votes

1 answer

41 views

Replacement of ESSeedInjector in storm-crawler 2.2

I'm updating our crawler from storm-crawler 1.14 to 2.2. What is the replacement for the old ESSeedInjector?

ejo

5

asked Feb 15, 2022 at 7:41
