c309cf7afe
* feat 解析微信文章目录 * fix mp_crawler should return https url |
||
---|---|---|
asset | ||
core | ||
dashboard | ||
.dockerignore | ||
.gitignore | ||
compose.yaml | ||
Dockerfile | ||
env_sample | ||
LICENSE | ||
README_CN.md | ||
README_DE.md | ||
README_FR.md | ||
README_JP.md | ||
README.md | ||
version |
WiseFlow
Wiseflow is an agile information extraction tool that can refine information from various sources such as websites, WeChat Public Accounts, and social media platforms based on predefined focus points, automatically categorize tags, and upload to the database.
SiliconFlow has officially announced that several LLM online inference services, such as Qwen2-7B-Instruct and glm-4-9b-chat, will be free starting from June 25, 2024. This means you can perform information mining with wiseflow at “zero cost”!
We are not short of information; what we need is to filter out the noise from the vast amount of information so that valuable information stands out!
See how WiseFlow helps you save time, filter out irrelevant information, and organize key points of interest!
https://github.com/TeamWiseFlow/wiseflow/assets/96130569/bd4b2091-c02d-4457-9ec6-c072d8ddfb16
🔥 Major Update V0.3.0
-
✅ Completely rewritten general web content parser, using a combination of statistical learning (relying on the open-source project GNE) and LLM, adapted to over 90% of news pages;
-
✅ Brand new asynchronous task architecture;
-
✅ New information extraction and labeling strategy, more accurate, more refined, and can perform tasks perfectly with only a 9B LLM!
🌟 Key Features
-
🚀 Native LLM Application
We carefully selected the most suitable 7B~9B open-source models to minimize usage costs and allow data-sensitive users to switch to local deployment at any time. -
🌱 Lightweight Design
Without using any vector models, the system has minimal overhead and does not require a GPU, making it suitable for any hardware environment. -
🗃️ Intelligent Information Extraction and Classification
Automatically extracts information from various sources and tags and classifies it according to user interests.😄 Wiseflow is particularly good at extracting information from WeChat official account articles; for this, we have configured a dedicated mp article parser!
-
🌍 Can be Integrated into Any Agent Project
Can serve as a dynamic knowledge base for any Agent project, without needing to understand the code of Wiseflow, just operate through database reads! -
📦 Popular Pocketbase Database
The database and interface use PocketBase. Besides the web interface, SDK for Go/Javascript/Python languages are available.
🔄 What are the Differences and Connections between Wiseflow and Common Crawlers, LLM-Agent Projects?
Feature | Wiseflow | Crawler / Scraper | LLM-Agent |
---|---|---|---|
Main Problem Solved | Data processing (filtering, extraction, labeling) | Raw data acquisition | Downstream applications |
Connection | Can be integrated into Wiseflow for more powerful raw data acquisition | Can integrate Wiseflow as a dynamic knowledge base |
📥 Installation and Usage
WiseFlow has virtually no hardware requirements, with minimal system overhead, and does not need GPU or CUDA (when using online LLM services).
-
Clone the Repository
😄 Starring and forking are good habits
git clone https://github.com/TeamWiseFlow/wiseflow.git cd wiseflow
-
Highly Recommended: Use Docker
For users in China, please configure your network properly or specify a Docker Hub mirror
docker compose up
You may modify
compose.yaml
as needed.Note:
- Run the above command in the root directory of the wiseflow repository.
- Before running, create and edit a
.env
file in the same directory as the Dockerfile (root directory of the wiseflow repository). Refer toenv_sample
for the.env
file. - The first time you run the Docker container, an error might occur because you haven't created an admin account for the pb repository.
At this point, keep the container running, open
http://127.0.0.1:8090/_/
in your browser, and follow the instructions to create an admin account (make sure to use an email). Then enter the created admin email (again, make sure it's an email) and password into the.env
file, and restart the container.If you want to change the container's timezone and language [which will determine the prompt language, but has little effect on the results], run the image with the following command
docker run -e LANG=zh_CN.UTF-8 -e LC_CTYPE=zh_CN.UTF-8 your_image
-
[Alternative] Run Directly with Python
conda create -n wiseflow python=3.10 conda activate wiseflow cd core pip install -r requirements.txt
Afterward, you can refer to the scripts in core/scripts to start pb, task, and backend respectively (move the script files to the core directory).
Note:
- Start pb first; task and backend are independent processes, and the order doesn't matter. You can start any one of them as needed.
- Download the pocketbase client suitable for your device from https://pocketbase.io/docs/ and place it in the /core/pb directory.
- For issues with pb (including first-run errors), refer to core/pb/README.md.
- Before use, create and edit a
.env
file and place it in the root directory of the wiseflow repository (the directory above core). Refer toenv_sample
for the.env
file, and see below for detailed configuration.
📚 For developers, see /core/README.md for more information.
Access data via pocketbase:
- http://127.0.0.1:8090/_/ - Admin dashboard UI
- http://127.0.0.1:8090/api/ - REST API
-
Configuration
Windows users can set the following items directly in "Start - Settings - System - About - Advanced System Settings - Environment Variables". After setting, a terminal restart is required for the changes to take effect.
Copy
env_sample
from the directory and rename it to.env
, then fill in your configuration information (such as LLM service tokens) as follows:- LLM_API_KEY # API key for large language model inference services
- LLM_API_BASE # This project relies on the OpenAI SDK. Configure this if your model service supports OpenAI's API. If using OpenAI's service, you can omit this.
- WS_LOG="verbose" # Set to enable debug observation. Delete if not needed.
- GET_INFO_MODEL # Model for information extraction and tag matching tasks, default is gpt-3.5-turbo
- REWRITE_MODEL # Model for approximate information merging and rewriting tasks, default is gpt-3.5-turbo
- HTML_PARSE_MODEL # Model for web page parsing (intelligently enabled if the GNE algorithm performs poorly), default is gpt-3.5-turbo
- PROJECT_DIR # Storage location for data, cache, and log files, relative to the repository. Default is within the repository.
- PB_API_AUTH='email|password' # Email and password for the pb database admin (must be an email, can be a fictitious one)
- PB_API_BASE # Typically unnecessary. Only configure if you're not using the default local pocketbase interface (8090).
-
Model Recommendations
Based on extensive testing (for both Chinese and English tasks), we recommend "zhipuai/glm4-9B-chat" for GET_INFO_MODEL, "alibaba/Qwen2-7B-Instruct" for REWRITE_MODEL, and "alibaba/Qwen2-7B-Instruct" for HTML_PARSE_MODEL.
These models are well-suited for this project, with stable adherence to instructions and excellent generation quality. The project's prompts have been optimized for these three models. (HTML_PARSE_MODEL can also use "01-ai/Yi-1.5-9B-Chat", which has been tested to perform excellently.)
⚠️ We highly recommend using SiliconFlow's online inference service for lower costs, faster speeds, and higher free quotas! ⚠️
SiliconFlow's online inference service is compatible with the OpenAI SDK and provides open-source services for the above three models. Simply configure
LLM_API_BASE
to "https://api.siliconflow.cn/v1" and setLLM_API_KEY
to use it.😄 Alternatively, you can use my invitation link, which also rewards me with more tokens 😄
-
Focus Points and Scheduled Source Scanning
After starting the program, open the pocketbase Admin dashboard UI (http://127.0.0.1:8090/_/)
6.1 Open the **tags form** Use this form to specify your focus points. The LLM will extract, filter, and classify information based on these. Tags field description: - name, Description of the focus point. **Note: Be specific.** Good example: `Trends in US-China competition`. Bad example: `International situation`. - activated, Whether activated. If deactivated, the focus point will be ignored. It can be reactivated later. Activation and deactivation don't require a Docker container restart and will update in the next scheduled task. 6.2 Open the **sites form** Use this form to specify custom sources. The system will start background tasks to scan, parse, and analyze these sources locally. Sites field description: - url, URL of the source. Provide a URL to the list page rather than a specific article page. - per_hours, Scan frequency in hours, as an integer (range 1-24; we recommend no more than once a day, i.e., set to 24). - activated, Whether activated. If deactivated, the source will be ignored. It can be reactivated later. Activation and deactivation don't require a Docker container restart and will update in the next scheduled task.
-
Local Deployment
As you can see, this project uses 7B/9B LLMs and does not require any vector models, which means you only need a single RTX 3090 (24GB VRAM) to fully deploy this project locally.
Ensure your local LLM service is compatible with the OpenAI SDK and configure
LLM_API_BASE
accordingly.
🛡️ License
This project is open-source under the Apache 2.0 license.
For commercial use and customization cooperation, please contact Email: 35252986@qq.com.
- Commercial customers, please register with us. The product promises to be free forever.
- For customized customers, we provide the following services according to your sources and business needs:
- Dedicated crawler and parser for customer business scenario sources
- Customized information extraction and classification strategies
- Targeted LLM recommendations or even fine-tuning services
- Private deployment services
- UI interface customization
📬 Contact Information
If you have any questions or suggestions, feel free to contact us through issue.
🤝 This Project is Based on the Following Excellent Open-source Projects:
- GeneralNewsExtractor (General Extractor of News Web Page Body Based on Statistical Learning) https://github.com/GeneralNewsExtractor/GeneralNewsExtractor
- json_repair (Repair invalid JSON documents) https://github.com/josdejong/jsonrepair/tree/main
- python-pocketbase (PocketBase client SDK for Python) https://github.com/vaphes/pocketbase
Citation
If you refer to or cite part or all of this project in related work, please indicate the following information:
Author: Wiseflow Team
https://openi.pcl.ac.cn/wiseflow/wiseflow
https://github.com/TeamWiseFlow/wiseflow
Licensed under Apache2.0