Twin Peaks crawler
This crawler downloads text and metadata from the Twin Peaks Fandom Wiki and writes the output as JSON. It is built on a combination of Scrapy and fandom-py.
Several wiki pages are discarded because they are unrelated to the Twin Peaks plot and would add noise to the Question Answering index.
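The discarding step could look roughly like the sketch below. It is only an illustration: the crawler's actual rules are not described in this README, and the prefix list and the function name `is_plot_related` are assumptions made up for the example.

```python
# Hypothetical filter; the real crawler's discard rules are not shown in this README.
# The prefixes below are assumed examples of wiki namespaces unrelated to the plot.
EXCLUDED_PREFIXES = ("User:", "Category:", "Template:", "File:")


def is_plot_related(title: str) -> bool:
    """Return True if the page title does not start with an excluded prefix."""
    return not title.startswith(EXCLUDED_PREFIXES)


titles = ["Dale Cooper", "Category:Characters", "Laura Palmer", "File:Logo.png"]
kept = [t for t in titles if is_plot_related(t)]
print(kept)  # ['Dale Cooper', 'Laura Palmer']
```

Filtering on the title before fetching the page keeps unwanted pages out of the JSON output altogether, so nothing has to be cleaned up at indexing time.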
Installation
pip install -r requirements.txt
- copy this folder (if needed, see stackoverflow)
Usage
- (if needed, activate the virtual environment)
cd tpcrawler
scrapy crawl tpcrawler
- you can find the downloaded pages in the data subfolder
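Once the crawl has finished, the output can be read back with a few lines of Python. This is a minimal sketch that assumes the data subfolder contains one or more `.json` files. The file layout and the structure of each JSON document are not specified here, and `load_pages` is a hypothetical helper, not part of the crawler.

```python
import json
from pathlib import Path


def load_pages(data_dir):
    """Parse every *.json file found directly inside data_dir.

    Assumption: the crawler writes plain JSON files into this folder.
    Returns a list with one parsed object per file, in filename order.
    """
    pages = []
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            pages.append(json.load(f))
    return pages
```

For example, `pages = load_pages("data")` followed by `print(len(pages))` shows how many files were parsed. Inspect one element to see which fields the crawler actually emits.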