**Wiseflow** is an agile information mining tool that extracts concise messages from various sources such as websites, WeChat official accounts, and social platforms, then automatically categorizes them and uploads them to the database.
- ✅ Completely rewritten general web content parser, using a combination of statistical learning (relying on the open-source project GNE) and LLM, adapted to over 90% of news pages;
We carefully selected the most suitable 7B~9B open-source models to minimize usage costs and allow data-sensitive users to switch to local deployment at any time.
- 🌱 **Lightweight Design**
Without using any vector models, the system has minimal overhead and does not require a GPU, making it suitable for any hardware environment.
- 🗃️ **Intelligent Information Extraction and Classification**
Automatically extracts information from various sources, then tags and classifies it according to user interests.
😄 **Wiseflow is particularly good at extracting information from WeChat official account articles**; for this, we have configured a dedicated mp article parser!
- 🌍 **Can be Integrated into Any RAG Project**
Wiseflow can serve as a dynamic knowledge base for any RAG project; you don't need to understand its code, just read from the database!
- 📦 **Popular Pocketbase Database**
The database and interface use PocketBase. Besides the web interface, APIs for Go/Javascript/Python languages are available.
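Since all output lives in PocketBase, a downstream RAG project can consume it with a plain HTTP read. Below is a minimal sketch using only the standard library and PocketBase's documented list-records endpoint; the collection name `infos` is an assumption — check the PocketBase admin UI for the collection that actually stores extracted items.

```python
# Sketch: reading Wiseflow output from PocketBase over its REST API.
# The collection name "infos" is an ASSUMPTION -- inspect your PocketBase
# admin UI for the real collection name.
import json
import urllib.parse
import urllib.request


def records_url(base: str, collection: str, page: int = 1, per_page: int = 20) -> str:
    """Build the PocketBase list-records URL for a collection."""
    query = urllib.parse.urlencode({"page": page, "perPage": per_page})
    return f"{base}/api/collections/{collection}/records?{query}"


def fetch_records(base: str, collection: str) -> list[dict]:
    """Fetch one page of records (requires a running PocketBase instance)."""
    with urllib.request.urlopen(records_url(base, collection)) as resp:
        return json.load(resp)["items"]


# Building the URL needs no network; fetching does:
url = records_url("http://127.0.0.1:8090", "infos")
```

The same read can of course be done with the official Go/JavaScript/Python SDKs mentioned above; the raw REST call is shown here only to make the contract explicit.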
| | **Wiseflow** | **Crawler / Scraper** | **RAG Projects** |
|---|---|---|---|
| **Main Problem Solved** | Data processing (filtering, extraction, labeling) | Raw data acquisition | Downstream applications |
| **Connection** | | Can be integrated into Wiseflow for more powerful raw data acquisition | Can integrate Wiseflow as a dynamic knowledge base |
## 📥 Installation and Usage
Wiseflow has virtually no hardware requirements, with minimal system overhead, and does not need a discrete GPU or CUDA (when using online LLM services).
You can start `pb`, `task`, and `backend` using the scripts in the `core/scripts` directory (move the script files to the `core` directory).
**Note:**
- Always start `pb` first. `task` and `backend` are independent processes; you can start them in any order, or start only one of them as needed.
- First, download the PocketBase client corresponding to your device from [here](https://pocketbase.io/docs/) and place it in the `/core/pb` directory.
- For issues with running `pb` (including errors on the first run, etc.), refer to [`core/pb/README.md`](/core/pb/README.md).
- Before using, create and edit the `.env` file and place it in the root directory of the wiseflow code repository (one level above the `core` directory). The `.env` file can reference `env_sample`. Detailed configuration instructions are below.
- It is highly recommended to use the Docker approach, see the fifth point below.
📚 For developers, see [/core/README.md](/core/README.md) for more.
- PROJECT_DIR # Location for storing data, cache and log files, relative to the code repository; default is the code repository itself if not specified
- PB_API_AUTH='email|password' # Admin email and password for the pb database (**it can be a fictitious one but must be an email**)
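Putting these together, a minimal `.env` might look like the following. Variable names follow `env_sample`; all values are illustrative placeholders.

```ini
# Illustrative .env -- values are placeholders, see env_sample for the full list
PROJECT_DIR="work_dir"
PB_API_AUTH="admin@example.com|your_password"
LLM_API_BASE="https://api.siliconflow.cn/v1"
LLM_API_KEY="sk-xxxx"
GET_INFO_MODEL="zhipuai/glm4-9B-chat"
REWRITE_MODEL="alibaba/Qwen2-7B-Instruct"
HTML_PARSE_MODEL="alibaba/Qwen2-7B-Instruct"
```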
After extensive testing (on both Chinese and English tasks), balancing overall effect and cost, we recommend the following for **GET_INFO_MODEL**, **REWRITE_MODEL**, and **HTML_PARSE_MODEL** respectively: **"zhipuai/glm4-9B-chat"**, **"alibaba/Qwen2-7B-Instruct"**, and **"alibaba/Qwen2-7B-Instruct"**.
These models fit the project well, with stable instruction following and excellent generation quality. The prompts in this project are also optimized for these three models. (**HTML_PARSE_MODEL** can also be **"01-ai/Yi-1.5-9B-Chat"**, which performs excellently in tests.)
The SiliconFlow online inference service is compatible with the OpenAI SDK and hosts all three of the open-source models above. Just set LLM_API_BASE to "https://api.siliconflow.cn/v1" and set LLM_API_KEY to use it.
😄 Or you may prefer to use my [invitation link](https://cloud.siliconflow.cn?referrer=clx6wrtca00045766ahvexw92), so I can also get more token rewards 😄
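Because the endpoint is OpenAI-compatible, any OpenAI-style client works. A minimal stdlib-only sketch of a chat-completions request is shown below; the model name comes from the recommendations above, and the API key is read from the environment (placeholder until you set it).

```python
# Sketch: calling an OpenAI-compatible endpoint (e.g. SiliconFlow) with the
# standard library only. LLM_API_KEY must be set for a real request.
import json
import os
import urllib.request

LLM_API_BASE = "https://api.siliconflow.cn/v1"


def build_chat_request(model: str, prompt: str) -> tuple[str, bytes, dict]:
    """Return (url, body, headers) for a chat-completions call."""
    url = f"{LLM_API_BASE}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('LLM_API_KEY', '')}",
    }
    return url, body, headers


url, body, headers = build_chat_request("zhipuai/glm4-9B-chat", "Hello")
# To actually send the request (needs a valid LLM_API_KEY):
# req = urllib.request.Request(url, data=body, headers=headers)
# reply = json.load(urllib.request.urlopen(req))
```

Equivalently, the official `openai` Python package can be pointed at the same base URL via its `base_url` parameter.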
As you can see, this project uses 7B/9B LLMs and does not require any vector models, which means you can fully deploy this project locally with just an RTX 3090 (24GB VRAM).
- Run the above commands in the root directory of the wiseflow code repository.
- Before running, create and edit the `.env` file in the same directory as the Dockerfile (root directory of the wiseflow code repository). The `.env` file can reference `env_sample`.
- You may encounter errors when running the Docker container for the first time. This is normal, because you have not yet created an admin account for the `pb` database.
At this point, keep the container running, open `http://127.0.0.1:8090/_/` in your browser, and follow the instructions to create an admin account (make sure to use an email address). Then fill in the admin email and password in the `.env` file and restart the container.
- `name`: Description of the point of interest. **Note: Be specific.** A good example is `Trends in US-China competition`; a poor example is `International situation`.
- `activated`: Whether the tag is activated. If deactivated, this point of interest will be ignored; it can be toggled on and off without restarting the Docker container, and updates take effect at the next scheduled task.
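For illustration, a focus-point entry using the two fields above could look like this (shown as JSON; the actual record is edited through the PocketBase web interface):

```json
{
  "name": "Trends in US-China competition",
  "activated": true
}
```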
This form allows you to specify custom information sources. The system will start background scheduled tasks to scan, parse, and analyze these sources locally.
- `per_hours`: Scanning frequency, in hours, as an integer (range 1–24; we recommend scanning no more than once per day, i.e., set it to 24).
- `activated`: Whether the source is activated. If turned off, the source will be ignored; it can be turned on again later. Toggling does not require restarting the Docker container, and updates take effect at the next scheduled task.
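A custom-source entry could then look like the following (shown as JSON for illustration; the `url` field name is an assumption — the section above only names `per_hours` and `activated`, so check the form in the PocketBase web interface for the exact field):

```json
{
  "url": "https://example.com/news",
  "per_hours": 24,
  "activated": true
}
```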
- Customized information extraction and classification strategies
- Targeted LLM recommendations or even fine-tuning services
- Private deployment services
- UI interface customization
## 📬 Contact Information
If you have any questions or suggestions, feel free to reach out by opening an [issue](https://github.com/TeamWiseFlow/wiseflow/issues).
## 🤝 This Project is Based on the Following Excellent Open-source Projects:
- GeneralNewsExtractor (General Extractor of News Web Page Body Based on Statistical Learning) https://github.com/GeneralNewsExtractor/GeneralNewsExtractor