218 questions
0
votes
1
answer
43
views
Is there an automatic way to switch between different protocols? [closed]
This is for pages which require a browser to render. I have a long list of URLs and I don't know which will work with httpProtocol and which require the selenium protocol. Is there a way to automatically ...
0
votes
1
answer
87
views
Problem passing crawler configuration yaml files to stormcrawler
I have Storm running locally as a single-machine setup. I want to submit a topology with an alternative YAML configuration for the crawler.
I get an error when the topology cannot load an expected ...
0
votes
0
answers
70
views
Unable to Inject URL seed file in stormcrawler
I am new to stormcrawler and am using SC with Elasticsearch. I am unable to inject URLs (seeds.txt).
There are 10 URLs in the seed file. I am following the README file instructions.
Here is the command that ...
-2
votes
1
answer
65
views
Unable to install Stormcrawler: connection refused on port 7071 [closed]
I am installing Stormcrawler on Ubuntu. Everything worked, but I am unable to inject the seeds.txt file.
When I run the injector using this command "java -cp target/crawler-1.0-SNAPSHOT.jar ...
0
votes
1
answer
54
views
How to store custom metatags in elasticsearch index from a website using stormcrawler
I am crawling intranet websites using stormcrawler (v 2.10) and storing the data in Elasticsearch (v 7.8.0), with Kibana for visualization. The intranet pages have custom meta tags as below
<meta name=&...
3
votes
0
answers
120
views
KryoException: Buffer underflow error in Apache Storm and Storm-Crawler
I have been encountering a recurring issue during the deployment of a new version of my topology in Storm-Crawler, and I am seeking assistance in understanding and resolving the problem.
Error: Upon ...
0
votes
1
answer
54
views
How do I set log level in stormcrawler/storm local?
I am running stormcrawler 2.8 in local mode (storm local...). I want to reduce the amount of logging by changing the level to WARN or ERROR.
I have tried editing the Storm worker.xml but that does ...
0
votes
1
answer
90
views
Storm Crawler to fetch urls with query string
I am new to storm crawler. I was able to configure storm crawler to fetch and parse the URL "https://pubmed.ncbi.nlm.nih.gov/18926286/". However, I need to crawl https://pubmed.ncbi.nlm.nih.gov/...
0
votes
1
answer
107
views
StormCrawler: urlfrontier.StatusUpdaterBolt performance bottleneck
We are running a basic StormCrawler topology, with which we want to achieve high throughput.
Topology:
FrontierSpout
Fetcher
JSoupParser
DummyIndexer
WarcBolt
urlfrontier.StatusUpdaterBolt
However, ...
0
votes
1
answer
29
views
StormCrawler - Metadata fields not being persisted
I have a topology with a spout that emits a tuple to the status stream, which is picked up by the StatusUpdaterBolt, which in turn writes data to an Elasticsearch index.
The spout emits a tuple with a ...
0
votes
2
answers
54
views
Logging DEBUG messages in Stormcrawler
After looking at this previous question on logging in StormCrawler, it was not completely clear to me how to enable the logging of DEBUG messages when running StormCrawler.
By default, I only ...
0
votes
1
answer
108
views
I started web crawling using Storm Crawler but I do not know where the crawled results go. I'm not using Solr or Elasticsearch
Storm Crawler started crawling data, but I cannot find where the data is stored. I need to save this data to a database so I can connect the data to a remote server and have it indexed. ...
0
votes
1
answer
48
views
StormCrawler: setting "maxDepth": 0 prevents ES seed injection
With StormCrawler 2.3-SNAPSHOT, setting "maxDepth": 0 in the urlfilters.json prevents the seed injection into the ES index. Is that the expected behaviour? Or should it be injecting the ...
0
votes
1
answer
91
views
Problem running example topology with storm-crawler 2.3-SNAPSHOT
I'm building SC 2.3-SNAPSHOT from source and generating a project from the archetype. Then I try to run the example Flux topology. Seeds are injected properly. I can see all of them in the ES index ...
0
votes
1
answer
41
views
Replacement of ESSeedInjector in storm-crawler 2.2
I'm updating our crawler from storm-crawler 1.14 to 2.2. What is the replacement for the old ESSeedInjector?