This folder is intended for crawlers specific to particular sources. Crawlers placed here should be able to parse the source's article-list URL and return the details of each article as a dictionary.
## Custom Crawler Configuration
After writing the crawler, place the program in this folder and register it in `scraper_map` in `__init__.py`, similar to:

`{'www.securityaffairs.com': securityaffairs_scraper}`

Here, the key is the source domain and the value is the crawler function name.
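A minimal sketch of that registration, assuming a hypothetical `securityaffairs_scraper` function defined in a sibling module, might look like this:

```python
# __init__.py -- illustrative sketch; the module and function names are assumptions
from .securityaffairs import securityaffairs_scraper  # hypothetical custom crawler

scraper_map = {
    # key: source domain, value: crawler function
    'www.securityaffairs.com': securityaffairs_scraper,
}
```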
The crawler should be written in the form of a function with the following input and output specifications:
Input:
- `expiration`: a `datetime.date` object; the crawler should only fetch articles published on or after this date.
- `existings`: `[str]`, a list of URLs of articles already in the database; the crawler should ignore the URLs in this list.
Output:
- `[dict]`, a list of result dictionaries, each representing an article, formatted as follows:

  `[{'url': str, 'title': str, 'author': str, 'publish_time': str, 'content': str, 'abstract': str, 'images': [Path]}, {...}, ...]`
Note: The format of `publish_time` should be `"%Y%m%d"`. If the crawler cannot fetch it, the current date can be used. Additionally, `title` and `content` are mandatory fields.
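A minimal sketch of a crawler function that follows this contract is shown below. The target site, its URL, and the CSS selectors are purely illustrative assumptions; only the function signature and the returned fields follow the specification above.

```python
# example_scraper.py -- illustrative sketch; the site URL, HTML structure, and
# CSS selectors below are assumptions, not a real integration.
from datetime import date, datetime

import httpx
from bs4 import BeautifulSoup


def example_scraper(expiration: date, existings: list[str]) -> list[dict]:
    results = []
    list_html = httpx.get('https://www.example.com/news').text  # hypothetical article-list URL
    for link in BeautifulSoup(list_html, 'html.parser').select('a.article-link'):
        url = link.get('href')
        if not url or url in existings:
            continue  # skip articles already in the database

        page = BeautifulSoup(httpx.get(url).text, 'html.parser')
        title_tag = page.select_one('h1')
        content_tag = page.select_one('article')
        if title_tag is None or content_tag is None:
            continue  # title and content are mandatory fields

        # Determine the publish time; fall back to the current date if it cannot be parsed.
        time_tag = page.select_one('time')
        try:
            publish_date = datetime.fromisoformat(time_tag['datetime']).date()
        except (TypeError, KeyError, ValueError):
            publish_date = date.today()
        if publish_date < expiration:
            continue  # only keep articles on or after the expiration date

        author_tag = page.select_one('.author')
        results.append({
            'url': url,
            'title': title_tag.get_text(strip=True),
            'author': author_tag.get_text(strip=True) if author_tag else '',
            'publish_time': publish_date.strftime('%Y%m%d'),  # required "%Y%m%d" format
            'content': content_tag.get_text(strip=True),
            'abstract': '',
            'images': [],
        })
    return results
```

Such a function would then be registered in `scraper_map` under its source domain, as described above.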
## Generic Page Parser

We provide a generic page parser here, which can intelligently fetch article lists from a source. For each article URL, it first attempts to parse the page with gne; if that fails, it falls back to parsing with an LLM.

Through this solution, it is possible to scan and extract information from most general news and portal sources.

However, we still strongly recommend that users write custom crawlers themselves, or subscribe directly to our data service, for more reliable and efficient scanning.
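As a rough illustration of that fallback strategy (the gne call matches that library's public API, while the LLM step is only a hypothetical placeholder, not the project's actual implementation):

```python
# Sketch of the gne-then-LLM fallback; llm_extract is a hypothetical placeholder.
from gne import GeneralNewsExtractor


def llm_extract(html: str) -> dict:
    """Hypothetical fallback that would ask an LLM to pull title/content from raw HTML."""
    raise NotImplementedError


def parse_article(html: str) -> dict:
    try:
        result = GeneralNewsExtractor().extract(html)
        if result.get('title') and result.get('content'):
            return result  # gne succeeded
    except Exception:
        pass
    return llm_extract(html)  # fall back to the LLM-based parser
```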