We provide a general page parser that can intelligently retrieve article lists from sources. For each article URL, it first attempts to use `gne` for parsing, and if that fails, it will try using `llm`.
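As a rough illustration of that fallback (not the actual `general_crawler.py` logic), the pattern looks something like the sketch below; the `llm_fallback` callable is a hypothetical stand-in for the LLM-based extraction step:

```python
from typing import Callable, Optional

from gne import GeneralNewsExtractor


def extract_with_fallback(html: str, llm_fallback: Optional[Callable[[str], dict]] = None) -> dict:
    """Try rule-based extraction with gne first; fall back to an LLM-based extractor."""
    try:
        # gne returns a dict with fields such as 'title', 'author',
        # 'publish_time', 'content' and 'images'
        result = GeneralNewsExtractor().extract(html)
        if result.get('title') and result.get('content'):
            return result
    except Exception:
        pass
    # Hand the raw HTML to an LLM-based extractor supplied by the caller
    return llm_fallback(html) if llm_fallback else {}
```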
This solution allows scanning and extracting information from most general news and portal sources.
However, we strongly recommend that users develop custom parsers for specific sources, tailored to their actual business scenarios, for more reliable and efficient scanning.
We also provide a parser specifically for WeChat official account articles (mp.weixin.qq.com).
If you are willing to contribute your custom source-specific parsers to this repository, we would greatly appreciate it!
## Custom Source Parser Development Specifications

### Specifications
Remember: the parser should be an asynchronous function. (A minimal skeleton that follows these specifications is sketched after the list below.)
- The parser should be able to intelligently distinguish between article list pages and article detail pages.
- The parser's input parameters should only include `url` and `logger`:
    - `url` is the complete address of the source (type `str`).
    - `logger` is the logging object (please do not configure a separate logger for your custom source parser).
- The parser's output should include `flag` and `result`, formatted as `tuple[int, Union[list, dict]]`:
    - If the `url` is an article list page, `flag` returns `1`, and `result` returns a list of all article page URLs (`list`).
    - If the `url` is an article page, `flag` returns `11`, and `result` returns all article details (`dict`), in the following format:

      `{'url': str, 'title': str, 'author': str, 'publish_time': str, 'content': str, 'abstract': str, 'images': [str]}`

      Note: `title` and `content` cannot be empty.

      Note: `publish_time` should be in the format `"%Y%m%d"` (date only, no `-`). If the scraper cannot fetch it, use the current date.
    - If parsing fails, `flag` returns `0`, and `result` returns an empty dictionary `{}`. The `pipeline` will try other parsing solutions (if any) upon receiving `flag` `0`.
    - If page retrieval fails (e.g., network issues), `flag` returns `-7`, and `result` returns an empty dictionary `{}`. The `pipeline` will not attempt to parse again in the same process upon receiving `flag` `-7`.
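For illustration, here is a minimal sketch of a custom parser that follows the specifications above. It is not taken from this repository: the function name `example_scraper`, the `/article/` link heuristic, and the use of `httpx` and `BeautifulSoup` are all assumptions to be replaced with logic suited to your source.

```python
from datetime import datetime
from typing import Union
from urllib.parse import urljoin

import httpx
from bs4 import BeautifulSoup


async def example_scraper(url: str, logger) -> tuple[int, Union[list, dict]]:
    """Hypothetical custom parser: distinguishes list pages from article pages."""
    try:
        async with httpx.AsyncClient(timeout=30) as client:
            response = await client.get(url)
            response.raise_for_status()
    except Exception as e:
        logger.warning(f"fetch failed for {url}: {e}")
        return -7, {}  # page retrieval failed; pipeline will not retry in this process

    soup = BeautifulSoup(response.text, 'html.parser')

    # Placeholder heuristic: pages with many links matching the source's
    # article URL pattern are treated as article list pages.
    article_links = [urljoin(url, a['href'])
                     for a in soup.select('a[href*="/article/"]')]
    if len(article_links) > 3:
        return 1, article_links  # article list page: return article URLs

    title = soup.title.string.strip() if soup.title and soup.title.string else ''
    content = soup.get_text(separator='\n', strip=True)
    if not title or not content:
        return 0, {}  # parsing failed; the pipeline may try other solutions

    return 11, {
        'url': url,
        'title': title,
        'author': '',
        'publish_time': datetime.now().strftime('%Y%m%d'),  # fall back to today's date
        'content': content,
        'abstract': content[:200],
        'images': [],
    }
```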
### Registration
After writing your scraper, place the scraper program in this folder and register the scraper in `scraper_map` under `__init__.py`, similar to:

```python
{'domain': 'crawler def name'}
```
It is recommended to use `urllib.parse` to get the domain:

```python
from urllib.parse import urlparse

parsed_url = urlparse("site's url")
domain = parsed_url.netloc
```
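Putting it together, a registration in `__init__.py` might look like the sketch below; `example_scraper` and `news.example.com` are placeholders, and the exact shape of `scraper_map` (domain mapped to the function object vs. its name) should follow the existing entries in this folder's `__init__.py`:

```python
from urllib.parse import urlparse

# hypothetical import of a parser placed in this folder
from .example_scraper import example_scraper

parsed_url = urlparse("https://news.example.com/politics/latest")
domain = parsed_url.netloc  # 'news.example.com'

# route all URLs from this domain to the custom parser
scraper_map = {
    domain: example_scraper,
}
```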