🚀 **Chief Intelligence Officer** (Wiseflow) is an agile information mining tool that can precisely extract specific information from various given sources by leveraging the thinking and analytical capabilities of large models, requiring no human intervention throughout the process.
We have horizontally tested and compared the performance of deepseekV2.5, Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, and Qwen2.5-coder-7B-Instruct models provided by siliconflow across four real-case tasks and a total of ten real webpage samples.
Please refer to the [report](./test/reports/wiseflow_report_20241223_bigbrother666/README.md) for test results.
We have also open-sourced the test scripts and welcome everyone to actively submit more test results. Wiseflow is an open-source project, and we hope to create an "information crawling tool that everyone can use" through our collective contributions!
At this stage, **submitting test results is equivalent to submitting project code**, and you will similarly be accepted as a contributor and may even be invited to participate in commercialization projects!
Additionally, we have improved the download and username/password configuration solution for pocketbase. Thanks to @ourines for contributing the install_pocketbase.sh script.
(The docker deployment solution has been temporarily removed as we felt it wasn't very convenient for users...)
🌟 **V0.3.6 Version Preview**
Version V0.3.6 is planned for release before December 30, 2024. This version is a performance optimization of v0.3.5, with significant improvements in information extraction quality. It will also introduce visual large models to extract page image information as supplementary when webpage information is insufficient.
**V0.3.x Plan**
- Attempt to support WeChat official account subscription without wxbot (V0.3.7)
- Introduce support for RSS information sources (V0.3.8)
- Attempt to introduce LLM-driven lightweight knowledge graphs to help users build insights from infos (V0.3.9)
Starting from version V0.3.5, wiseflow uses a completely new architecture and introduces [Crawlee](https://github.com/apify/crawlee-python) as the basic crawler and task management framework, significantly improving page acquisition capabilities. We will continue to enhance wiseflow's page acquisition capabilities. If you encounter pages that cannot be properly acquired, please provide feedback in [issue #136](https://github.com/TeamWiseFlow/wiseflow/issues/136).
Since the release of version V0.3.0 in late June 2024, wiseflow has received widespread attention from the open-source community, attracting even some self-media reports. First of all, we would like to express our gratitude!
However, we have also noticed some misunderstandings about the functional positioning of wiseflow among some followers. The following table, through comparison with traditional crawler tools, AI search, and knowledge base (RAG) projects, represents our current thinking on the latest product positioning of wiseflow.
| | Comparison with **Chief Intelligence Officer (Wiseflow)** |
|-------------|-----------------|
| **Crawler Tools** | First, wiseflow is a project based on crawler tools (in the current version, we use the crawler framework Crawlee). However, traditional crawler tools require manual intervention in information extraction, providing explicit Xpath, etc. This not only blocks ordinary users but also lacks generality. For different websites (including upgraded websites), manual reanalysis and updating of extraction code are required. Wiseflow is committed to automating web analysis and extraction using LLM. Users only need to tell the program their focus points. From this perspective, wiseflow can be simply understood as an "AI agent that can automatically use crawler tools." |
| **AI Search** | AI search is mainly used for **instant question-and-answer** scenarios, such as "Who is the founder of XX company?" or "Where can I buy the xx product under the xx brand?" Users want **a single answer**; wiseflow is mainly used for **continuous information collection** in certain areas, such as tracking related information of XX company, continuously tracking market behavior of XX brand, etc. In these scenarios, users can provide focus points (a company, a brand) or even information sources (site URLs, etc.), but cannot pose specific search questions. Users want **a series of related information**.|
| **Knowledge Base (RAG) Projects** | Knowledge base (RAG) projects are generally based on downstream tasks of existing information and usually face private knowledge (such as operation manuals, product manuals, government documents within enterprises, etc.); wiseflow currently does not integrate downstream tasks and faces public information on the internet. From the perspective of "agents," the two belong to agents built for different purposes. RAG projects are "internal knowledge assistant agents," while wiseflow is an "external information collection agent."|
Wiseflow 0.3.x uses pocketbase as its database. You can also manually download the pocketbase client (remember to download version 0.23.4 and place it in the [pb](./pb) directory) and manually create the superuser (remember to save it in the .env file).
🌟 **Wiseflow does not restrict the source of model services - as long as the service is compatible with the openAI SDK, including locally deployed services like ollama, Xinference, etc.**
Siliconflow provides online MaaS services for most mainstream open-source models. With its accumulated acceleration inference technology, its service has great advantages in both speed and price. When using siliconflow's service, the .env configuration can refer to the following:
😄 If you'd like, you can use my [siliconflow referral link](https://cloud.siliconflow.cn?referrer=clx6wrtca00045766ahvexw92), which will help me earn more token rewards 🌹
If your information sources are mostly non-Chinese pages and you don't require the extracted info to be in Chinese, then it's recommended to use closed-source commercial models like OpenAI, Claude, Gemini. You can try the third-party proxy **AiHubMix**, seamlessly access a wide range of leading AI models like OpenAI, Claude, Google, Llama, and more with just one API.
This is where you set the superuser username and password for the pocketbase database, remember to separate them with | (if the install_pocketbase.sh script executed successfully, this should already exist)
Project runtime data directory. If not configured, defaults to `core/work_dir`. Note: Currently the entire core directory is mounted under the container, meaning you can access it directly.
-#PB_API_BASE=""
Only needs to be configured if your pocketbase is not running on the default IP or port. Under default circumstances, you can ignore this.
✋ The V0.3.5 version architecture and dependencies are significantly different from previous versions. Please make sure to re-pull the code, delete (or rebuild) pb_data
It is recommended to use conda to build a virtual environment (of course you can skip this step, or use other Python virtual environment solutions)
🌟 This script will automatically determine if pocketbase is already running. If not, it will automatically start. However, please note that when you terminate the process with ctrl+c or ctrl+z, the pocketbase process will not be terminated until you close the terminal.
run_task.sh will periodically execute crawling-extraction tasks (it will execute immediately at startup, then every hour after that). If you only need to execute once, you can use the run.sh script.
Through this form, you can specify your focus points, and LLM will refine, filter, and categorize information accordingly.
Field description:
- focuspoint, focus point description (required), such as "Shanghai elementary to junior high school information," "cryptocurrency prices"
- explanation, detailed explanation or specific conventions of the focus point, such as "Only official junior high school admission information released by Shanghai" or "Current price, price change data of BTC, ETH"
- activated, whether to activate. If closed, this focus point will be ignored, and it can be re-enabled later.
Through this form, you can specify custom information sources. The system will start background scheduled tasks to scan, parse, and analyze the information sources locally.
Note that the core part of wiseflow does not require a dashboard, and the current product does not integrate a dashboard. If you need a dashboard, please download [V0.2.1 version](https://github.com/TeamWiseFlow/wiseflow/releases/tag/V0.2.1)
- crawlee-python (A web scraping and browser automation library for Python to build reliable crawlers. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.) https://github.com/apify/crawlee-python
- SeeAct (a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large multimodal models (LMMs) such as GPT-4Vision.) https://github.com/OSU-NLP-Group/SeeAct
Also inspired by [GNE](https://github.com/GeneralNewsExtractor/GeneralNewsExtractor) and [AutoCrawler](https://github.com/kingname/AutoCrawler).