diff --git a/client/.dockerignore b/.dockerignore
similarity index 100%
rename from client/.dockerignore
rename to .dockerignore
diff --git a/.gitignore b/.gitignore
index 012e462..5d0d7a6 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,7 @@
.DS_Store
.idea/
__pycache__
+.env
+.venv/
+core/pb/pb_data/
+core/WStest/
diff --git a/asset/wiseflow_arch.png b/asset/wiseflow_arch.png
deleted file mode 100644
index 3f917b5..0000000
Binary files a/asset/wiseflow_arch.png and /dev/null differ
diff --git a/client/.gitignore b/client/.gitignore
deleted file mode 100644
index 7394c6a..0000000
--- a/client/.gitignore
+++ /dev/null
@@ -1,4 +0,0 @@
-.env
-.venv/
-pb/pb_data/
-backend/WStest/
\ No newline at end of file
diff --git a/client/Dockerfile.api b/client/Dockerfile.api
deleted file mode 100644
index 6ad45b1..0000000
--- a/client/Dockerfile.api
+++ /dev/null
@@ -1,19 +0,0 @@
-FROM python:3.10-slim
-
-RUN apt-get update && \
- apt-get install -yq tzdata build-essential
-
-RUN ln -fs /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
-
-WORKDIR /app
-
-COPY backend/requirements.txt requirements.txt
-RUN pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
-
-COPY backend .
-
-EXPOSE 7777
-
-CMD tail -f /dev/null
-
-# ENTRYPOINT ["bash", "docker_entrypoint.sh"]
\ No newline at end of file
diff --git a/client/Dockerfile.web b/client/Dockerfile.web
deleted file mode 100644
index d6012d2..0000000
--- a/client/Dockerfile.web
+++ /dev/null
@@ -1,35 +0,0 @@
-FROM node:20-slim as builder
-
-WORKDIR /app
-
-COPY web ./
-RUN npm install -g pnpm
-RUN pnpm install
-RUN pnpm build
-
-
-FROM alpine:latest
-
-ARG PB_VERSION=0.21.1
-
-RUN apk add --no-cache unzip ca-certificates tzdata && \
- ln -fs /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
-
-
-# download and unzip PocketBase
-ADD https://github.com/pocketbase/pocketbase/releases/download/v${PB_VERSION}/pocketbase_${PB_VERSION}_linux_amd64.zip /tmp/pb.zip
-RUN unzip /tmp/pb.zip -d /pb/
-
-RUN mkdir -p /pb
-
-COPY ./pb/pb_migrations /pb/pb_migrations
-COPY ./pb/pb_hooks /pb/pb_hooks
-COPY --from=builder /app/dist /pb/pb_public
-
-WORKDIR /pb
-
-EXPOSE 8090
-
-CMD tail -f /dev/null
-
-# CMD ["/pb/pocketbase", "serve", "--http=0.0.0.0:8090"]
diff --git a/client/backend/README.md b/client/backend/README.md
deleted file mode 100644
index 9556afb..0000000
--- a/client/backend/README.md
+++ /dev/null
@@ -1,86 +0,0 @@
-# WiseFlow Client Backend
-
-# for developers
-
-## Deployment
-
-1. It is recommended to create a fresh environment, **Python version 3.10**
-
-2. Install requirements.txt
-
-## Starting the database standalone (first download the PocketBase binary for your platform, or build and start the web docker separately)
-
-PocketBase [download page](https://pocketbase.io/docs/)
-
-Place the binary in the backend/pb directory
-
-```bash
-chmod +x pocketbase
-./pocketbase serve
-```
-
-Then set the PocketBase service address as the environment variable PB_API_BASE
-
-(If you choose docker, see the docker files under the client folder)
-
-Note: keep the pb_migrations folder under the pb directory in sync with the repo; the database then auto-creates the collections this project needs. If they are out of sync, later runs may fail.
-
-pb_data is the database data directory; if you change the admin password, remember to update .env
-
-## Script files
-
-- tasks.sh # starts the scheduled background tasks (optional when only debugging the backend locally)
-- backend.sh # starts the backend service (defaults to localhost:7777; see the API details at http://localhost:7777/docs/)
-
-Note: the backend service always responds with a dict: `{"flag": int, "result": [{"type": "text", "content": "xxx"}]}`
-
-Unified flag convention
-
-| flag code | meaning |
-|--------|-----------------|
-| -11 | LLM error/exception |
-| -7 | network request failed |
-| -6 | translation API failed |
-| -5 | invalid input format |
-| -4 | embedding model error |
-| -3 | (reserved) |
-| -2 | pb database API failed |
-| -1 | unknown error |
-| 0 | success |
-| 1 | (reserved) |
-| 2 | (reserved) |
-| 3 | (reserved) |
-| 11 | the user's flow finished normally |
-| 21 | a new file was generated |
-
-Notes: 1. An HTTP 200 on the request only means the submission succeeded, not that the backend processed everything successfully. **Only flag 11 means the flow finished normally** with all computation succeeding.
-
-2. Flag 0 usually means that although all computation ran, there is no result: no new file was produced and no new data was submitted to the database.
-
-3. The translation API works in batches, so partial success is possible (some translations stored in the database and linked, some failed). Even without flag 11, it is therefore recommended to re-read the data from pb.
-
-
-## Directory structure
-
-```
-backend
-├── llms # LLM wrappers
-├── scrapers # crawler library
-│   ├── __init__.py # to add a site-specific crawler, put the crawler script next to this file and edit the scraper_map dict inside it
-│   ├── general_scraper.py # general web page parser
-│   └── simple_crawler.py # fast single-page parser based on gne
-├── __init__.py # backend main entry
-├── background_task.py # background task main program; edit this file to define more background tasks
-├── main.py # backend service main program (FastAPI framework)
-├── tasks.sh # background task startup script
-├── backend.sh # backend service startup script
-├── embedding.py # embedding model service
-├── pb_api.py # pb database interface
-├── general_utils.py # utility library
-├── get_insight.py # clue analysis and refinement module
-├── get_logger.py # logger configuration
-├── get_report.py # report generation module
-├── get_search.py # search implementation based on Sogou
-├── work_process.py # background service main flow (scraping and refining)
-├── tranlsation_volcengine.py # translation module based on the Volcengine API
-```
diff --git a/client/backend/background_task.py b/client/backend/background_task.py
deleted file mode 100644
index b944fdb..0000000
--- a/client/backend/background_task.py
+++ /dev/null
@@ -1,41 +0,0 @@
-"""
-通过编辑这个脚本,可以自定义需要的后台任务
-"""
-import schedule
-import time
-from get_insight import pb, logger
-from work_process import ServiceProcesser
-
-
-sp = ServiceProcesser()
-counter = 0
-
-
-# Wakes up once per hour. If the pb "sites" collection has sources, the ones whose interval is due are run; if there are none, the dedicated crawlers run once every 24 hours.
-def task():
- global counter
- sites = pb.read('sites', filter='activated=True')
- urls = []
- for site in sites:
- if not site['per_hours'] or not site['url']:
- continue
- if counter % site['per_hours'] == 0:
- urls.append(site['url'])
- logger.info(f'\033[0;32m task execute loop {counter}\033[0m')
- logger.info(urls)
- if urls:
- sp(sites=urls)
- else:
- if counter % 24 == 0:
- sp()
- else:
- print('\033[0;33mno work for this loop\033[0m')
- counter += 1
-
-
-schedule.every().hour.at(":38").do(task)
-
-task()
-while True:
- schedule.run_pending()
- time.sleep(60)
diff --git a/client/backend/docker_entrypoint.sh b/client/backend/docker_entrypoint.sh
deleted file mode 100644
index c5675ca..0000000
--- a/client/backend/docker_entrypoint.sh
+++ /dev/null
@@ -1,5 +0,0 @@
-#!/bin/bash
-set -o allexport
-set +o allexport
-exec uvicorn main:app --reload --host 0.0.0.0 --port 7777 &
-exec python background_task.py
\ No newline at end of file
diff --git a/client/backend/embeddings.py b/client/backend/embeddings.py
deleted file mode 100644
index 66daa9e..0000000
--- a/client/backend/embeddings.py
+++ /dev/null
@@ -1,17 +0,0 @@
-from BCEmbedding.tools.langchain import BCERerank
-from langchain_community.embeddings import HuggingFaceEmbeddings
-import os
-
-
-embedding_model_name = os.environ.get('EMBEDDING_MODEL_PATH', "")
-rerank_model_name = os.environ.get('RERANKER_MODEL_PATH', "")
-
-if not embedding_model_name or not rerank_model_name:
- raise Exception("请设置 EMBEDDING_MODEL_PATH 和 RERANKER_MODEL_PATH")
-
-device = os.environ.get('DEVICE', 'cpu')
-embedding_model_kwargs = {'device': device}
-embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True, 'show_progress_bar': False}
-reranker_args = {'model': rerank_model_name, 'top_n': 5, 'device': device}
-embed_model = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs=embedding_model_kwargs, encode_kwargs=embedding_encode_kwargs)
-reranker = BCERerank(**reranker_args)
diff --git a/client/backend/get_insight.py b/client/backend/get_insight.py
deleted file mode 100644
index 61e7136..0000000
--- a/client/backend/get_insight.py
+++ /dev/null
@@ -1,286 +0,0 @@
-from embeddings import embed_model, reranker
-from langchain_community.vectorstores import FAISS
-from langchain_core.documents import Document
-from langchain_community.vectorstores.utils import DistanceStrategy
-from langchain.retrievers import ContextualCompressionRetriever
-from llms.dashscope_wrapper import dashscope_llm
-from general_utils import isChinesePunctuation, is_chinese
-from tranlsation_volcengine import text_translate
-import time
-import re
-import os
-from general_utils import get_logger_level
-from loguru import logger
-from pb_api import PbTalker
-
-project_dir = os.environ.get("PROJECT_DIR", "")
-os.makedirs(project_dir, exist_ok=True)
-logger_file = os.path.join(project_dir, 'scanning_task.log')
-dsw_log = get_logger_level()
-
-logger.add(
- logger_file,
- level=dsw_log,
- backtrace=True,
- diagnose=True,
- rotation="50 MB"
-)
-pb = PbTalker(logger)
-
-
-max_tokens = 4000
-relation_theshold = 0.525
-
-role_config = pb.read(collection_name='roleplays', filter=f'activated=True')
-_role_config_id = ''
-if role_config:
- character = role_config[0]['character']
- focus = role_config[0]['focus']
- focus_type = role_config[0]['focus_type']
- good_sample1 = role_config[0]['good_sample1']
- good_sample2 = role_config[0]['good_sample2']
- bad_sample = role_config[0]['bad_sample']
- _role_config_id = role_config[0]['id']
-else:
- character, good_sample1, focus, focus_type, good_sample2, bad_sample = '', '', '', '', '', ''
-
-if not character:
- character = input('\033[0;32m 请为首席情报官指定角色设定(eg. 来自中国的网络安全情报专家):\033[0m\n')
- _role_config_id = pb.add(collection_name='roleplays', body={'character': character, 'activated': True})
-
-if not _role_config_id:
-    raise Exception('pls check pb data, could not get the role config')
-
-if not (focus and focus_type and good_sample1 and good_sample2 and bad_sample):
- focus = input('\033[0;32m 请为首席情报官指定关注点(eg. 中国关注的网络安全新闻):\033[0m\n')
- focus_type = input('\033[0;32m 请为首席情报官指定关注点类型(eg. 网络安全新闻):\033[0m\n')
- good_sample1 = input('\033[0;32m 请给出一个你期望的情报描述示例(eg. 黑客组织Rhysida声称已入侵中国国有能源公司): \033[0m\n')
- good_sample2 = input('\033[0;32m 请再给出一个理想示例(eg. 差不多一百万份包含未成年人数据(包括家庭地址和照片)的文件对互联网上的任何人都开放,对孩子构成威胁): \033[0m\n')
- bad_sample = input('\033[0;32m 请给出一个你不期望的情报描述示例(eg. 黑客组织活动最近频发): \033[0m\n')
- _ = pb.update(collection_name='roleplays', id=_role_config_id, body={'focus': focus, 'focus_type': focus_type, 'good_sample1': good_sample1, 'good_sample2': good_sample2, 'bad_sample': bad_sample})
-
-# Practice shows that asking the llm to mine clues "worth our country's attention" works poorly (it is easily misled by the news content and mistakes another country for ours, probably because the news itself contains phrases like "our country").
-# A step-by-step inner-monologue approach needs two output formats at once, which raises the difficulty; qwen-max does not handle it well. It could be split into two steps: first output the clue list, then look up the corresponding news numbers.
-# In practice, though, the cost/benefit of that is low and it introduces new uncertainty.
-_first_stage_prompt = f'''你是一名{character},你将被给到一个新闻列表,新闻文章用XML标签分隔。请对此进行分析,挖掘出特别值得{focus}线索。你给出的线索应该足够具体,而不是同类型新闻的归类描述,好的例子如:
-"""{good_sample1}"""
-不好的例子如:
-"""{bad_sample}"""
-
-请从头到尾仔细阅读每一条新闻的内容,不要遗漏,然后列出值得关注的线索,每条线索都用一句话进行描述,最终按一条一行的格式输出,并整体用三引号包裹,如下所示:
-"""
-{good_sample1}
-{good_sample2}
-"""
-
-不管新闻列表是何种语言,请仅用中文输出分析结果。'''
-
-_rewrite_insight_prompt = f'''你是一名{character},你将被给到一个新闻列表,新闻文章用XML标签分隔。请对此进行分析,从中挖掘出一条最值得关注的{focus_type}线索。你给出的线索应该足够具体,而不是同类型新闻的归类描述,好的例子如:
-"""{good_sample1}"""
-不好的例子如:
-"""{bad_sample}"""
-
-请保证只输出一条最值得关注的线索,线索请用一句话描述,并用三引号包裹输出,如下所示:
-"""{good_sample1}"""
-
-不管新闻列表是何种语言,请仅用中文输出分析结果。'''
-
-
-def _parse_insight(article_text: str, cache: dict) -> (bool, dict):
- input_length = len(cache)
- result = dashscope_llm([{'role': 'system', 'content': _first_stage_prompt}, {'role': 'user', 'content': article_text}],
- 'qwen1.5-72b-chat', logger=logger)
- if result:
- pattern = re.compile(r'\"\"\"(.*?)\"\"\"', re.DOTALL)
- result = pattern.findall(result)
- else:
- logger.warning('1st-stage llm generate failed: no result')
-
- if result:
- try:
- results = result[0].split('\n')
- results = [_.strip() for _ in results if _.strip()]
- to_del = []
- to_add = []
- for element in results:
- if ";" in element:
- to_del.append(element)
- to_add.extend(element.split(';'))
- for element in to_del:
- results.remove(element)
- results.extend(to_add)
- results = list(set(results))
- for text in results:
- logger.debug(f'parse result: {text}')
-            # special case for qwen-72b-chat
- # potential_insight = re.sub(r'编号[^:]*:', '', text)
- potential_insight = text.strip()
- if len(potential_insight) < 2:
- logger.debug(f'parse failed: not enough potential_insight: {potential_insight}')
- continue
- if isChinesePunctuation(potential_insight[-1]):
- potential_insight = potential_insight[:-1]
- if potential_insight in cache:
- continue
- else:
- cache[potential_insight] = []
- except Exception as e:
- logger.debug(f'parse failed: {e}')
-
- output_length = len(cache)
- if input_length == output_length:
- return True, cache
- return False, cache
-
-
-def _rewrite_insight(context: str) -> (bool, str):
- result = dashscope_llm([{'role': 'system', 'content': _rewrite_insight_prompt}, {'role': 'user', 'content': context}],
- 'qwen1.5-72b-chat', logger=logger)
- if result:
- pattern = re.compile(r'\"\"\"(.*?)\"\"\"', re.DOTALL)
- result = pattern.findall(result)
- else:
- logger.warning(f'insight rewrite process llm generate failed: no result')
-
- if not result:
- return True, ''
- try:
- results = result[0].split('\n')
- text = results[0].strip()
- logger.debug(f'parse result: {text}')
- if len(text) < 2:
- logger.debug(f'parse failed: not enough potential_insight: {text}')
- return True, ''
- if isChinesePunctuation(text[-1]):
- text = text[:-1]
- except Exception as e:
- logger.debug(f'parse failed: {e}')
- return True, ''
- return False, text
-
-
-def get_insight(articles: dict, titles: dict) -> list:
- context = ''
- cache = {}
- for value in articles.values():
- if value['abstract']:
- text = value['abstract']
- else:
- if value['title']:
- text = value['title']
- else:
- if value['content']:
- text = value['content']
- else:
- continue
-        # Long context is not used here because Alibaba DashScope often flags inputs for sensitive words without saying which, forcing the whole batch to be dropped; with long context that risk is too large.
-        # Also, an llm may miss content in the middle of a long context.
- context += f"{text}\n"
- if len(context) < max_tokens:
- continue
-
- flag, cache = _parse_insight(context, cache)
- if flag:
- logger.warning(f'following articles may not be completely analyzed: \n{context}')
-
- context = ''
-        # Frequent calls reportedly degrade performance, so rest 1s after each call. Now that qwen-72b and max are called in rotation, this is no longer necessary.
- time.sleep(1)
- if context:
- flag, cache = _parse_insight(context, cache)
- if flag:
- logger.warning(f'following articles may not be completely analyzed: \n{context}')
-
- if not cache:
- logger.warning('no insights found')
- return []
-
-    # second stage: match insights against article_titles
- title_list = [Document(page_content=key, metadata={}) for key, value in titles.items()]
- retriever = FAISS.from_documents(title_list, embed_model,
- distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT).as_retriever(search_type="similarity",
- search_kwargs={"score_threshold": relation_theshold, "k": 10})
- compression = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)
-
- for key in cache.keys():
- logger.debug(f'searching related articles for insight: {key}')
- rerank_results = compression.get_relevant_documents(key)
- for i in range(len(rerank_results)):
- if rerank_results[i].metadata['relevance_score'] < relation_theshold:
- break
- cache[key].append(titles[rerank_results[i].page_content])
- if titles[rerank_results[i].page_content] not in articles:
- articles[titles[rerank_results[i].page_content]] = {'title': rerank_results[i].page_content}
- logger.info(f'{key} - {cache[key]}')
-
-    # third stage: merge insights whose associated articles overlap by 75% or more (see calculate_overlap below); for merged insights with multiple articles, use the llm again to regenerate the insight
-    # because in practice the article titles recalled for an insight may each be relevant, yet together they point to an insight from a different angle
- def calculate_overlap(list1, list2):
-        # length of the intersection of the two lists
- intersection_length = len(set(list1).intersection(set(list2)))
-        # overlap rate
- overlap_rate = intersection_length / min(len(list1), len(list2))
- return overlap_rate >= 0.75
-
- merged_dict = {}
- for key, value in cache.items():
- if not value:
- continue
- merged = False
- for existing_key, existing_value in merged_dict.items():
- if calculate_overlap(value, existing_value):
- merged_dict[existing_key].extend(value)
- merged = True
- break
- if not merged:
- merged_dict[key] = value
-
- cache = {}
- for key, value in merged_dict.items():
- value = list(set(value))
- if len(value) > 1:
- context = ''
- for _id in value:
- context += f"{articles[_id]['title']}\n"
- if len(context) >= max_tokens:
- break
- if not context:
- continue
-
- flag, new_insight = _rewrite_insight(context)
- if flag:
-            logger.warning(f'insight {key} may be wrong')
- cache[key] = value
- else:
- if cache:
- title_list = [Document(page_content=key, metadata={}) for key, value in cache.items()]
- retriever = FAISS.from_documents(title_list, embed_model,
- distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT).as_retriever(
- search_type="similarity",
- search_kwargs={"score_threshold": 0.85, "k": 1})
- compression = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)
- rerank_results = compression.get_relevant_documents(new_insight)
- if rerank_results and rerank_results[0].metadata['relevance_score'] > 0.85:
- logger.debug(f"{new_insight} is too similar to {rerank_results[0].page_content}, merging")
- cache[rerank_results[0].page_content].extend(value)
- cache[rerank_results[0].page_content] = list(set(cache[rerank_results[0].page_content]))
- else:
- cache[new_insight] = value
- else:
- cache[new_insight] = value
- else:
- cache[key] = value
-
-    # sort: insights with more related articles come first
-    # sorted_cache = sorted(cache.items(), key=lambda x: len(x[1]), reverse=True)
-    logger.info('re-ranking result:')
- new_cache = []
- for key, value in cache.items():
- if not is_chinese(key):
- translate_text = text_translate([key], target_language='zh', logger=logger)
- if translate_text:
- key = translate_text[0]
- logger.info(f'{key} - {value}')
- new_cache.append({'content': key, 'articles': value})
-
- return new_cache
diff --git a/client/backend/llms/dashscope_wrapper.py b/client/backend/llms/dashscope_wrapper.py
deleted file mode 100644
index 61ccd81..0000000
--- a/client/backend/llms/dashscope_wrapper.py
+++ /dev/null
@@ -1,93 +0,0 @@
-# API wrapper for Aliyun DashScope
-# non-streaming interface
-# for compatibility, input and output both use the message format (same as the openai SDK)
-import time
-from http import HTTPStatus
-import dashscope
-import random
-import os
-
-
-DASHSCOPE_KEY = os.getenv("LLM_API_KEY")
-if not DASHSCOPE_KEY:
- raise ValueError("请指定LLM_API_KEY的环境变量")
-dashscope.api_key = DASHSCOPE_KEY
-
-
-def dashscope_llm(messages: list, model: str, logger=None, **kwargs) -> str:
-
- if logger:
- logger.debug(f'messages:\n {messages}')
- logger.debug(f'model: {model}')
- logger.debug(f'kwargs:\n {kwargs}')
-
- response = dashscope.Generation.call(
- messages=messages,
- model=model,
- result_format='message', # set the result to be "message" format.
- **kwargs
- )
-
- for i in range(2):
- if response.status_code == HTTPStatus.OK:
- break
- if response.message == "Input data may contain inappropriate content.":
- break
-
- if logger:
- logger.warning(f"request failed. code: {response.code}, message:{response.message}\nretrying...")
- else:
- print(f"request failed. code: {response.code}, message:{response.message}\nretrying...")
-
- time.sleep(1 + i*30)
- kwargs['seed'] = random.randint(1, 10000)
- response = dashscope.Generation.call(
- messages=messages,
- model=model,
- result_format='message', # set the result to be "message" format.
- **kwargs
- )
-
- if response.status_code != HTTPStatus.OK:
- if logger:
- logger.warning(f"request failed. code: {response.code}, message:{response.message}\nabort after multiple retries...")
- else:
- print(f"request failed. code: {response.code}, message:{response.message}\naborted after multiple retries...")
- return ''
-
- if logger:
- logger.debug(f'result:\n {response.output.choices[0]}')
- logger.debug(f'usage:\n {response.usage}')
-
- return response.output.choices[0]['message']['content']
-
-
-if __name__ == '__main__':
- from pprint import pprint
-
- # logging.basicConfig(level=logging.DEBUG)
- system_content = ''
- user_content = '''你是一名优秀的翻译,请帮我把如下新闻标题逐条(一行为一条)翻译为中文,你的输出也必须为一条一行的格式。
-
-The New York Times reported on 2021-01-01 that the COVID-19 cases in China are increasing.
-Cyber ops linked to Israel-Hamas conflict largely improvised, researchers say
-Russian hackers disrupted Ukrainian electrical grid last year
-Reform bill would overhaul controversial surveillance law
-GitHub disables pro-Russian hacktivist DDoS pages
-Notorious Russian hacking group appears to resurface with fresh cyberattacks on Ukraine
-Russian hackers attempted to breach petroleum refining company in NATO country, researchers say
-As agencies move towards multi-cloud networks, proactive security is key
-Keeping a competitive edge in the cybersecurity ‘game’
-Mud, sweat and data: The hard work of democratizing data at scale
-SEC sues SolarWinds and CISO for fraud
-Cyber workforce demand is outpacing supply, survey finds
-Four dozen countries declare they won
-White House executive order on AI seeks to address security risks
-malware resembling NSA code
-CISA budget cuts would be
-Hackers that breached Las Vegas casinos rely on violent threats, research shows'''
-
- data = [{'role': 'user', 'content': user_content}]
- start_time = time.time()
- pprint(dashscope_llm(data, 'qwen-72b-chat'))
- print(f'time cost: {time.time() - start_time}')
diff --git a/client/backend/llms/lmdeploy_wrapper.py b/client/backend/llms/lmdeploy_wrapper.py
deleted file mode 100644
index 38b537b..0000000
--- a/client/backend/llms/lmdeploy_wrapper.py
+++ /dev/null
@@ -1,79 +0,0 @@
-# API wrapper around lagent's lmdepoly_wrapper
-# non-streaming interface
-# for compatibility, input and output both use the message format (same as the openai SDK)
-from lagent.llms.meta_template import INTERNLM2_META as META
-from lagent.llms.lmdepoly_wrapper import LMDeployClient
-from requests import ConnectionError
-import os
-
-
-def lmdeploy_llm(messages: list[dict],
- model: str = "qwen-7b",
- seed: int = 1234,
- max_tokens: int = 2000,
- temperature: float = 1,
- stop: list = None,
- enable_search: bool = False,
- logger=None) -> str:
-
- if logger:
- logger.debug(f'messages:\n {messages}')
- logger.debug(f'params:\n model: {model}, max_tokens: {max_tokens}, temperature: {temperature}, stop: {stop},'
- f'enable_search: {enable_search}, seed: {seed}')
-
- top_p = 0.7
- url = os.environ.get('LLM_API_BASE', "http://127.0.0.1:6003")
- api_client = LMDeployClient(model_name=model,
- url=url,
- meta_template=META,
- max_new_tokens=max_tokens,
- top_p=top_p,
- top_k=100,
- temperature=temperature,
- repetition_penalty=1.0,
- stop_words=['<|im_end|>'])
- response = ""
- for i in range(3):
- try:
- response = api_client.chat(messages)
- break
- except ConnectionError:
- if logger:
- logger.warning(f'ConnectionError, url:{url}')
- else:
- print(f"ConnectionError, url:{url}")
- return ""
-
- return response
-
-
-if __name__ == '__main__':
- import time
- from pprint import pprint
-
- # logging.basicConfig(level=logging.DEBUG)
- system_content = ''
- user_content = '''你是一名优秀的翻译,请帮我把如下新闻标题逐条(一行为一条)翻译为中文,你的输出也必须为一条一行的格式。
-
-The New York Times reported on 2021-01-01 that the COVID-19 cases in China are increasing.
-Cyber ops linked to Israel-Hamas conflict largely improvised, researchers say
-Russian hackers disrupted Ukrainian electrical grid last year
-Reform bill would overhaul controversial surveillance law
-GitHub disables pro-Russian hacktivist DDoS pages
-Notorious Russian hacking group appears to resurface with fresh cyberattacks on Ukraine
-Russian hackers attempted to breach petroleum refining company in NATO country, researchers say
-As agencies move towards multi-cloud networks, proactive security is key
-Keeping a competitive edge in the cybersecurity ‘game’
-Mud, sweat and data: The hard work of democratizing data at scale
-SEC sues SolarWinds and CISO for fraud
-Cyber workforce demand is outpacing supply, survey finds
-Four dozen countries declare they won
-White House executive order on AI seeks to address security risks
-malware resembling NSA code
-CISA budget cuts would be
-Hackers that breached Las Vegas casinos rely on violent threats, research shows'''
-
- data = [{'role': 'user', 'content': user_content}]
- start_time = time.time()
- pprint(lmdeploy_llm(data, 'qwen-7b'))
- print(f'time cost: {time.time() - start_time}')
diff --git a/client/backend/llms/qwen1.5-7b-deploy.sh b/client/backend/llms/qwen1.5-7b-deploy.sh
deleted file mode 100644
index 22c762e..0000000
--- a/client/backend/llms/qwen1.5-7b-deploy.sh
+++ /dev/null
@@ -1,15 +0,0 @@
-#!/bin/sh
-
-docker run -d --runtime nvidia --gpus all \
- -v ~/.cache/huggingface:/root/.cache/huggingface \
- --env "HUGGING_FACE_HUB_TOKEN=" \
- --env "LMDEPLOY_USE_MODELSCOPE=True" \
- --env "TOKENIZERS_PARALLELISM=False" \
- --name qwen1.5-7b-service \
- -p 6003:6003 \
- --restart=always \
- --ipc=host \
- openmmlab/lmdeploy:v0.2.5 \
- pip install modelscope & \
- lmdeploy serve api_server Qwen/Qwen1.5-7B-Chat \
- --server-name 0.0.0.0 --server-port 6003 --tp 1 --rope-scaling-factor 1 --backend pytorch
\ No newline at end of file
diff --git a/client/backend/requirements.txt b/client/backend/requirements.txt
deleted file mode 100644
index f6ad15f..0000000
--- a/client/backend/requirements.txt
+++ /dev/null
@@ -1,19 +0,0 @@
-fastapi
-pydantic
-uvicorn
-dashscope #optional (install when using Alibaba DashScope)
-openai #optional (install when using an llm service compatible with the openai sdk)
-volcengine #optional (install when using Volcengine translation)
-python-docx
-BCEmbedding==0.1.3
-langchain==0.1.0
-langchain-community==0.0.9
-langchain-core==0.1.7
-langsmith==0.0.77
-# faiss-gpu for gpu environment
-faiss-cpu # for cpu-only environment
-pocketbase==0.10.0
-gne
-chardet
-schedule
-loguru
\ No newline at end of file
diff --git a/client/backend/scrapers/__init__.py b/client/backend/scrapers/__init__.py
deleted file mode 100644
index 4d17519..0000000
--- a/client/backend/scrapers/__init__.py
+++ /dev/null
@@ -1,4 +0,0 @@
-
-
-scraper_map = {
-}
diff --git a/client/backend/tasks.sh b/client/backend/tasks.sh
deleted file mode 100755
index f3ea41d..0000000
--- a/client/backend/tasks.sh
+++ /dev/null
@@ -1,4 +0,0 @@
-set -o allexport
-source ../.env
-set +o allexport
-python background_task.py
\ No newline at end of file
diff --git a/client/backend/work_process.py b/client/backend/work_process.py
deleted file mode 100644
index 6b8f0eb..0000000
--- a/client/backend/work_process.py
+++ /dev/null
@@ -1,158 +0,0 @@
-import os
-import json
-import requests
-from datetime import datetime, timedelta, date
-from scrapers import scraper_map
-from scrapers.general_scraper import general_scraper
-from urllib.parse import urlparse
-from get_insight import get_insight, pb, logger
-from general_utils import is_chinese
-from tranlsation_volcengine import text_translate
-import concurrent.futures
-
-
-# Generally for the first crawl, to avoid fetching too many old articles; database articles older than this are also no longer used to match insights
-expiration_date = datetime.now() - timedelta(days=90)
-expiration_date = expiration_date.date()
-expiration_str = expiration_date.strftime("%Y%m%d")
-
-
-class ServiceProcesser:
- def __init__(self, record_snapshot: bool = False):
- self.project_dir = os.environ.get("PROJECT_DIR", "")
- # 1. base initialization
- self.cache_url = os.path.join(self.project_dir, 'scanning_task')
- os.makedirs(self.cache_url, exist_ok=True)
- # 2. load the llm
- # self.llm = LocalLlmWrapper() # if you use the local-llm
-
- if record_snapshot:
- snap_short_server = os.environ.get('SNAPSHOTS_SERVER', '')
- if not snap_short_server:
- raise Exception('SNAPSHOTS_SERVER is not set.')
- self.snap_short_server = f"http://{snap_short_server}"
- else:
- self.snap_short_server = None
-
- logger.info('scanning task init success.')
-
- def __call__(self, expiration: date = expiration_date, sites: list[str] = None):
-        # clear the cache first
- logger.info(f'wake, prepare to work, now is {datetime.now()}')
- cache = {}
- logger.debug(f'clear cache -- {cache}')
-        # read all article urls from the pb database
-        # publish_time is stored as an int; all things considered that is the easiest format to work with, if a bit crude
- existing_articles = pb.read(collection_name='articles', fields=['id', 'title', 'url'], filter=f'publish_time>{expiration_str}')
- all_title = {}
- existings = []
- for article in existing_articles:
- all_title[article['title']] = article['id']
- existings.append(article['url'])
-
-        # build the list of sources to scan; default to iterating scraper_map if none are given. A given source may also be absent from scraper_map, in which case the general crawler is used
- sources = sites if sites else list(scraper_map.keys())
- new_articles = []
- with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
- futures = []
- for site in sources:
- if site in scraper_map:
- futures.append(executor.submit(scraper_map[site], expiration, existings))
- else:
- futures.append(executor.submit(general_scraper, site, expiration, existings, logger))
- concurrent.futures.wait(futures)
- for future in futures:
- try:
- new_articles.extend(future.result())
- except Exception as e:
- logger.error(f'error when scraping-- {e}')
-
- for value in new_articles:
- if not value:
- continue
- from_site = urlparse(value['url']).netloc
- from_site = from_site.replace('www.', '')
- from_site = from_site.split('.')[0]
- if value['abstract']:
- value['abstract'] = f"({from_site} 报道){value['abstract']}"
- value['content'] = f"({from_site} 报道){value['content']}"
- value['images'] = json.dumps(value['images'])
-
- article_id = pb.add(collection_name='articles', body=value)
-
- if article_id:
- cache[article_id] = value
- all_title[value['title']] = article_id
- else:
- logger.warning(f'add article {value} failed, writing to cache_file')
- with open(os.path.join(self.cache_url, 'cache_articles.json'), 'a', encoding='utf-8') as f:
- json.dump(value, f, ensure_ascii=False, indent=4)
-
- if not cache:
- logger.warning(f'no new articles. now is {datetime.now()}')
- return
-
-        # insight flow
- new_insights = get_insight(cache, all_title)
- if new_insights:
- for insight in new_insights:
- if not insight['content']:
- continue
- insight_id = pb.add(collection_name='insights', body=insight)
- if not insight_id:
- logger.warning(f'write insight {insight} to pb failed, writing to cache_file')
- with open(os.path.join(self.cache_url, 'cache_insights.json'), 'a', encoding='utf-8') as f:
- json.dump(insight, f, ensure_ascii=False, indent=4)
- for article_id in insight['articles']:
- raw_article = pb.read(collection_name='articles', fields=['abstract', 'title', 'translation_result'], filter=f'id="{article_id}"')
- if not raw_article or not raw_article[0]:
- logger.warning(f'get article {article_id} failed, skipping')
- continue
- if raw_article[0]['translation_result']:
- continue
- if is_chinese(raw_article[0]['title']):
- continue
- translate_text = text_translate([raw_article[0]['title'], raw_article[0]['abstract']], target_language='zh', logger=logger)
- if translate_text:
- related_id = pb.add(collection_name='article_translation', body={'title': translate_text[0], 'abstract': translate_text[1], 'raw': article_id})
- if not related_id:
- logger.warning(f'write article_translation {article_id} failed')
- else:
- _ = pb.update(collection_name='articles', id=article_id, body={'translation_result': related_id})
- if not _:
- logger.warning(f'update article {article_id} failed')
- else:
- logger.warning(f'translate article {article_id} failed')
- else:
-            # fallback: try using every article title as an insight
- if len(cache) < 25:
- logger.info('generate_insights-warning: no insights and no more than 25 articles so use article title as insights')
- for key, value in cache.items():
- if value['title']:
- if is_chinese(value['title']):
- text_for_insight = value['title']
- else:
- text_for_insight = text_translate([value['title']], logger=logger)
- if text_for_insight:
- insight_id = pb.add(collection_name='insights', body={'content': text_for_insight[0], 'articles': [key]})
- if not insight_id:
- logger.warning(f'write insight {text_for_insight[0]} to pb failed, writing to cache_file')
- with open(os.path.join(self.cache_url, 'cache_insights.json'), 'a',
- encoding='utf-8') as f:
- json.dump({'content': text_for_insight[0], 'articles': [key]}, f, ensure_ascii=False, indent=4)
- else:
- logger.warning('generate_insights-error: can not generate insights, pls re-try')
- logger.info(f'work done, now is {datetime.now()}')
-
- if self.snap_short_server:
- logger.info(f'now starting article snapshot with {self.snap_short_server}')
- for key, value in cache.items():
- if value['url']:
- try:
- snapshot = requests.get(f"{self.snap_short_server}/zip", {'url': value['url']}, timeout=60)
- file = open(snapshot.text, 'rb')
- _ = pb.upload('articles', key, 'snapshot', key, file)
- file.close()
- except Exception as e:
- logger.warning(f'error when snapshot {value["url"]}, {e}')
- logger.info(f'now snapshot done, now is {datetime.now()}')
diff --git a/client/compose.yaml b/client/compose.yaml
deleted file mode 100755
index 0b95674..0000000
--- a/client/compose.yaml
+++ /dev/null
@@ -1,31 +0,0 @@
-services:
- web:
- build:
- dockerfile: Dockerfile.web
- image: wiseflow/web
- ports:
- - 8090:8090
- # env_file:
- # - .env
- volumes:
- - ./pb/pb_data:/pb/pb_data
- # - ./${PROJECT_DIR}:/pb/${PROJECT_DIR}
- entrypoint: /pb/pocketbase serve --http=0.0.0.0:8090
-
- api:
- build:
- dockerfile: Dockerfile.api
- image: wiseflow/api
- tty: true
- stdin_open: true
- entrypoint: bash docker_entrypoint.sh
- env_file:
- - .env
- ports:
- - 7777:7777
- volumes:
- - ./${PROJECT_DIR}:/app/${PROJECT_DIR}
- - ${EMBEDDING_MODEL_PATH}:${EMBEDDING_MODEL_PATH}
- - ${RERANKER_MODEL_PATH}:${RERANKER_MODEL_PATH}
- depends_on:
- - web
\ No newline at end of file
diff --git a/client/env_sample b/client/env_sample
deleted file mode 100755
index 8c937d9..0000000
--- a/client/env_sample
+++ /dev/null
@@ -1,14 +0,0 @@
-export LLM_API_KEY=""
-export LLM_API_BASE="" ## for a local model service, or when calling a non-openai service through openai_wrapper
-export VOLC_KEY="AK|SK"
-
-#**for embedding model**
-export EMBEDDING_MODEL_PATH="" ## full absolute path
-export RERANKER_MODEL_PATH="" ## full absolute path
-export DEVICE="cpu" ## cuda users: "cuda:0"
-
-#**for processer**
-export PROJECT_DIR="work_dir"
-export PB_API_AUTH="test@example.com|123467890"
-export PB_API_BASE="web:8090" ## see https://stackoverflow.com/questions/70151702/how-to-network-2-separate-docker-containers-to-communicate-with-eachother
-export WS_LOG="verbose" ## set for verbose logs to watch every step the system takes; not needed for normal use
\ No newline at end of file
diff --git a/client/version b/client/version
deleted file mode 100644
index 22c08f7..0000000
--- a/client/version
+++ /dev/null
@@ -1 +0,0 @@
-v0.2.1
diff --git a/client/README.md b/core/README.md
similarity index 100%
rename from client/README.md
rename to core/README.md
diff --git a/core/dm.py b/core/dm.py
new file mode 100644
index 0000000..fa8bf1b
--- /dev/null
+++ b/core/dm.py
@@ -0,0 +1,47 @@
+import asyncio
+import websockets
+import concurrent.futures
+import json
+from insights import pipeline
+
+
+async def get_public_msg():
+ uri = "ws://127.0.0.1:8066/ws/publicMsg"
+ reconnect_attempts = 0
+    max_reconnect_attempts = 3  # set the maximum number of reconnect attempts as needed
+
+ while True:
+ try:
+ async with websockets.connect(uri, max_size=10 * 1024 * 1024) as websocket:
+ loop = asyncio.get_running_loop()
+ with concurrent.futures.ThreadPoolExecutor() as pool:
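+                    # pipeline() is a synchronous, potentially long-running call (crawling and
+                    # llm requests), so each message is handed to the thread pool below to keep
+                    # this websocket receive loop responsive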
+ while True:
+ response = await websocket.recv()
+ datas = json.loads(response)
+ for data in datas["data"]:
+ if data["IsSender"] != "0":
+                        print('self-sent message, skip')
+ print(data)
+ continue
+ input_data = {
+ "user_id": data["StrTalker"],
+ "type": "publicMsg",
+ "content": data["Content"],
+ "addition": data["MsgSvrID"]
+ }
+ await loop.run_in_executor(pool, pipeline, input_data)
+ except websockets.exceptions.ConnectionClosedError as e:
+ print(f"Connection closed with exception: {e}")
+ reconnect_attempts += 1
+ if reconnect_attempts <= max_reconnect_attempts:
+ print(f"Reconnecting attempt {reconnect_attempts}...")
+                await asyncio.sleep(5)  # wait a while before retrying
+ else:
+ print("Max reconnect attempts reached. Exiting.")
+ break
+ except Exception as e:
+ print(f"An unexpected error occurred: {e}")
+ break
+
+# run the get_public_msg coroutine on the asyncio event loop
+asyncio.run(get_public_msg())
diff --git a/core/insights/__init__.py b/core/insights/__init__.py
new file mode 100644
index 0000000..3bc0c78
--- /dev/null
+++ b/core/insights/__init__.py
@@ -0,0 +1,164 @@
+from scrapers import *
+from utils.general_utils import extract_urls, compare_phrase_with_list
+from insights.get_info import get_info, pb, project_dir, logger
+from insights.rewrite import info_rewrite
+import os
+import json
+from datetime import datetime, timedelta
+from urllib.parse import urlparse
+import re
+import time
+
+
+# Regex instead of an xml parser, because the xml extracted from official-account messages contains abnormal characters
+item_pattern = re.compile(r'<item>(.*?)</item>', re.DOTALL)
+url_pattern = re.compile(r'<url><!\[CDATA\[(.*?)]]></url>')
+summary_pattern = re.compile(r'<summary><!\[CDATA\[(.*?)]]></summary>', re.DOTALL)
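+# A publicMsg <item> is assumed to look roughly like this (illustrative only):
+#   <item>
+#     <url><![CDATA[http://mp.weixin.qq.com/s?__biz=...&chksm=...]]></url>
+#     <summary><![CDATA[digest text of the article]]></summary>
+#   </item>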
+
+expiration_days = 3
+existing_urls = [url['url'] for url in pb.read(collection_name='articles', fields=['url']) if url['url']]
+
+
+def pipeline(_input: dict):
+ cache = {}
+ source = _input['user_id'].split('@')[-1]
+ logger.debug(f"received new task, user: {source}, MsgSvrID: {_input['addition']}")
+
+ if _input['type'] == 'publicMsg':
+ items = item_pattern.findall(_input["content"])
+        # iterate over every <item>, extracting <url> and <summary>
+ for item in items:
+ url_match = url_pattern.search(item)
+ url = url_match.group(1) if url_match else None
+ if not url:
+ logger.warning(f"can not find url in \n{item}")
+ continue
+            # url cleanup: replace http with https and drop everything from chksm= on
+ url = url.replace('http://', 'https://')
+ cut_off_point = url.find('chksm=')
+ if cut_off_point != -1:
+ url = url[:cut_off_point-1]
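+            # e.g. 'http://mp.weixin.qq.com/s?__biz=x&mid=y&chksm=z' (hypothetical values)
+            # normalizes to 'https://mp.weixin.qq.com/s?__biz=x&mid=y'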
+ if url in cache:
+ logger.debug(f"{url} already find in item")
+ continue
+ summary_match = summary_pattern.search(item)
+ summary = summary_match.group(1) if summary_match else None
+ cache[url] = summary
+ urls = list(cache.keys())
+
+ elif _input['type'] == 'text':
+ urls = extract_urls(_input['content'])
+ if not urls:
+ logger.debug(f"can not find any url in\n{_input['content']}\npass...")
+ return
+ elif _input['type'] == 'url':
+ urls = []
+ pass
+ else:
+ return
+
+ global existing_urls
+
+ for url in urls:
+        # 0. skip urls that have already been crawled
+ if url in existing_urls:
+ logger.debug(f"{url} has been crawled, skip")
+ continue
+
+ logger.debug(f"fetching {url}")
+        # 1. pick a suitable crawler to fetch the article
+ if url.startswith('https://mp.weixin.qq.com') or url.startswith('http://mp.weixin.qq.com'):
+ flag, article = mp_crawler(url, logger)
+ if flag == -7:
+                # for the mp crawler, -7 most likely means WeChat throttled us; waiting 1 min is enough
+ logger.info(f"fetch {url} failed, try to wait 1min and try again")
+ time.sleep(60)
+ flag, article = mp_crawler(url, logger)
+ else:
+ parsed_url = urlparse(url)
+ domain = parsed_url.netloc
+ if domain in scraper_map:
+ flag, article = scraper_map[domain](url, logger)
+ else:
+ flag, article = simple_crawler(url, logger)
+
+ if flag == -7:
+            # -7 means the network is unreachable; other crawlers will not help either
+ logger.info(f"cannot fetch {url}")
+ continue
+
+ if flag != 11:
+ logger.info(f"{url} failed with mp_crawler and simple_crawler")
+ flag, article = llm_crawler(url, logger)
+ if flag != 11:
+ logger.info(f"{url} failed with llm_crawler")
+ continue
+
+        # 2. discard the article if it is older than today - expiration_days
+ expiration_date = datetime.now() - timedelta(days=expiration_days)
+ expiration_date = expiration_date.strftime('%Y-%m-%d')
+ article_date = int(article['publish_time'])
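+        # publish_time is assumed to be stored as an int like 20240521, which is why the
+        # comparison below strips the dashes from the ISO-formatted expiration_date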
+ if article_date < int(expiration_date.replace('-', '')):
+ logger.info(f"publish date is {article_date}, too old, skip")
+ continue
+
+ article['source'] = source
+        if cache.get(url):
+ article['abstract'] = cache[url]
+
+        # 3. extract info from the content
+ insights = get_info(f"标题:{article['title']}\n\n内容:{article['content']}")
+        # Articles whose info extraction failed are not stored (otherwise they would sit in existing_urls and never be processed again), but articles where extraction succeeded yet yielded no insight must be stored so they are not analyzed again.
+
+        # 4. store the article
+ try:
+ article_id = pb.add(collection_name='articles', body=article)
+ except Exception as e:
+ logger.error(f'add article failed, writing to cache_file - {e}')
+ with open(os.path.join(project_dir, 'cache_articles.json'), 'a', encoding='utf-8') as f:
+ json.dump(article, f, ensure_ascii=False, indent=4)
+ continue
+
+ existing_urls.append(url)
+
+ if not insights:
+ continue
+        # insight dedup and merge, tag the article, store the insights
+ article_tags = set()
+        # read the insights from the past expiration_days from the database to avoid duplicates
+ old_insights = pb.read(collection_name='insights', filter=f"updated>'{expiration_date}'", fields=['id', 'tag', 'content', 'articles'])
+ for insight in insights:
+ article_tags.add(insight['tag'])
+ insight['articles'] = [article_id]
+            # pick the old insights with the same tag and build a content -> record reverse lookup dict
+ old_insight_dict = {i['content']: i for i in old_insights if i['tag'] == insight['tag']}
+            # We are comparing whether two extracted phrases describe the same event; a vector model may not fit that and is too heavy anyway
+            # so use a simplified approach: tokenize both phrases with jieba and check whether the token overlap exceeds the threshold (0.65 here)
+ similar_insights = compare_phrase_with_list(insight['content'], list(old_insight_dict.keys()), 0.65)
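+            # compare_phrase_with_list (from utils.general_utils) is assumed to work roughly like:
+            #   tokens_a, tokens_b = set(jieba.lcut(phrase_a)), set(jieba.lcut(phrase_b))
+            #   similar when len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b)) >= threshold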
+ if similar_insights:
+ to_rewrite = similar_insights + [insight['content']]
+ new_info_content = info_rewrite(to_rewrite, logger)
+ if not new_info_content:
+ continue
+ insight['content'] = new_info_content
+                # merge the associated articles and delete the old insights
+ for old_insight in similar_insights:
+ insight['articles'].extend(old_insight_dict[old_insight]['articles'])
+ pb.delete(collection_name='insights', id=old_insight_dict[old_insight]['id'])
+ old_insights.remove(old_insight_dict[old_insight])
+
+ try:
+ insight['id'] = pb.add(collection_name='insights', body=insight)
+ # old_insights.append(insight)
+ except Exception as e:
+ logger.error(f'add insight failed, writing to cache_file - {e}')
+ with open(os.path.join(project_dir, 'cache_insights.json'), 'a', encoding='utf-8') as f:
+ json.dump(insight, f, ensure_ascii=False, indent=4)
+
+ try:
+ pb.update(collection_name='articles', id=article_id, body={'tag': list(article_tags)})
+ except Exception as e:
+ logger.error(f'update article failed - article_id: {article_id}\n{e}')
+ article['tag'] = list(article_tags)
+ with open(os.path.join(project_dir, 'cache_articles.json'), 'a', encoding='utf-8') as f:
+ json.dump(article, f, ensure_ascii=False, indent=4)
diff --git a/core/insights/get_info.py b/core/insights/get_info.py
new file mode 100644
index 0000000..f74a64c
--- /dev/null
+++ b/core/insights/get_info.py
@@ -0,0 +1,94 @@
+# from llms.dashscope_wrapper import dashscope_llm
+from llms.openai_wrapper import openai_llm
+# from llms.siliconflow_wrapper import sfa_llm
+import re
+from utils.general_utils import get_logger_level
+from loguru import logger
+from utils.pb_api import PbTalker
+import os
+
+
+project_dir = os.environ.get("PROJECT_DIR", "")
+if project_dir:
+ os.makedirs(project_dir, exist_ok=True)
+logger_file = os.path.join(project_dir, 'insights.log')
+dsw_log = get_logger_level()
+logger.add(
+ logger_file,
+ level=dsw_log,
+ backtrace=True,
+ diagnose=True,
+ rotation="50 MB"
+)
+
+pb = PbTalker(logger)
+
+model = "glm-4-flash"
+focus_data = pb.read(collection_name='tags', filter=f'activated=True')
+focus_list = [item["name"] for item in focus_data if item["name"]]
+focus_dict = {item["name"]: item["id"] for item in focus_data if item["name"]}
+
+system_prompt = f'''请仔细阅读用户输入的新闻内容,并根据所提供的类型列表进行分析。类型列表如下:
+{focus_list}
+
+如果新闻中包含上述任何类型的信息,请使用以下格式标记信息的类型,并提供仅包含时间、地点、人物和事件的一句话信息摘要:
+<tag>类型名称</tag>仅包含时间、地点、人物和事件的一句话信息摘要
+
+如果新闻中包含多个信息,请逐一分析并按一条一行的格式输出,如果新闻不涉及任何类型的信息,则直接输出:无。
+务必注意:1、严格忠于新闻原文,不得提供原文中不包含的信息;2、对于同一事件,仅选择一个最贴合的tag,不要重复输出;3、仅用一句话做信息摘要,且仅包含时间、地点、人物和事件;4、严格遵循给定的格式输出。'''
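+
+# Illustrative model output for the prompt above, one line per extracted item
+# (the tag names come from the user's "tags" collection, so these are hypothetical):
+#   <tag>网络安全</tag>2024年5月,某黑客组织声称入侵了一家能源公司
+#   <tag>数据泄露</tag>近日,约百万条用户记录在某论坛上被公开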
+
+# pattern = re.compile(r'\"\"\"(.*?)\"\"\"', re.DOTALL)
+
+
+def get_info(article_content: str) -> list[dict]:
+ # logger.debug(f'receive new article_content:\n{article_content}')
+ result = openai_llm([{'role': 'system', 'content': system_prompt}, {'role': 'user', 'content': article_content}],
+ model=model, logger=logger)
+
+ # results = pattern.findall(result)
+    texts = result.split('<tag>')
+    texts = [_.strip() for _ in texts if '</tag>' in _.strip()]
+ if not texts:
+ logger.info(f'can not find info, llm result:\n{result}')
+ return []
+
+ cache = []
+ for text in texts:
+ try:
+            strings = text.split('</tag>')
+ tag = strings[0]
+ tag = tag.strip()
+ if tag not in focus_list:
+ logger.info(f'tag not in focus_list: {tag}, aborting')
+ continue
+ info = ''.join(strings[1:])
+ info = info.strip()
+ except Exception as e:
+ logger.info(f'parse error: {e}')
+ tag = ''
+ info = ''
+
+ if not info or not tag:
+ logger.info(f'parse failed-{text}')
+ continue
+
+ if len(info) < 7:
+ logger.info(f'info too short, possible invalid: {info}')
+ continue
+
+ if info.startswith('无相关信息') or info.startswith('该新闻未提及') or info.startswith('未提及'):
+ logger.info(f'no relevant info: {text}')
+ continue
+
+ while info.endswith('"'):
+ info = info[:-1]
+ info = info.strip()
+
+        # append the source info
+ sources = re.findall(r'内容:\((.*?) 文章\)', article_content)
+ if sources and sources[0]:
+ info = f"【{sources[0]} 公众号】 {info}"
+
+ cache.append({'content': info, 'tag': focus_dict[tag]})
+
+ return cache
diff --git a/core/insights/rewrite.py b/core/insights/rewrite.py
new file mode 100644
index 0000000..00ed9d9
--- /dev/null
+++ b/core/insights/rewrite.py
@@ -0,0 +1,24 @@
+# from llms.openai_wrapper import openai_llm
+from llms.dashscope_wrapper import dashscope_llm
+# from llms.siliconflow_wrapper import sfa_llm
+
+
+rewrite_prompt = '''请综合给到的内容,提炼总结为一个新闻摘要。
+给到的内容会用XML标签分隔。
+请仅输出总结出的摘要,不要输出其他的信息。'''
+
+model = "qwen2-7b-instruct"
+
+
+def info_rewrite(contents: list[str], logger=None) -> str:
+ context = f"{''.join(contents)}"
+ try:
+ result = dashscope_llm([{'role': 'system', 'content': rewrite_prompt}, {'role': 'user', 'content': context}],
+ model=model, temperature=0.1, logger=logger)
+ return result.strip()
+ except Exception as e:
+ if logger:
+ logger.warning(f'rewrite process llm generate failed: {e}')
+ else:
+ print(f'rewrite process llm generate failed: {e}')
+ return ''
diff --git a/client/backend/llms/README.md b/core/llms/README.md
similarity index 100%
rename from client/backend/llms/README.md
rename to core/llms/README.md
diff --git a/client/backend/llms/__init__.py b/core/llms/__init__.py
similarity index 100%
rename from client/backend/llms/__init__.py
rename to core/llms/__init__.py
diff --git a/client/backend/llms/openai_wrapper.py b/core/llms/openai_wrapper.py
similarity index 95%
rename from client/backend/llms/openai_wrapper.py
rename to core/llms/openai_wrapper.py
index d924b4b..5fe46f0 100644
--- a/client/backend/llms/openai_wrapper.py
+++ b/core/llms/openai_wrapper.py
@@ -7,13 +7,13 @@ import os
from openai import OpenAI
-token = os.environ.get('LLM_API_KEY', "")
-if not token:
- raise ValueError('请设置环境变量LLM_API_KEY')
-
base_url = os.environ.get('LLM_API_BASE', "")
+token = os.environ.get('LLM_API_KEY', "")
-client = OpenAI(api_key=token, base_url=base_url)
+if token:
+ client = OpenAI(api_key=token, base_url=base_url)
+else:
+ client = OpenAI(base_url=base_url)
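+# note: with no api_key argument the OpenAI client falls back to the standard
+# OPENAI_API_KEY environment variable (and raises if neither is set)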
def openai_llm(messages: list, model: str, logger=None, **kwargs) -> str:
diff --git a/core/llms/siliconflow_wrapper.py b/core/llms/siliconflow_wrapper.py
new file mode 100644
index 0000000..c86d886
--- /dev/null
+++ b/core/llms/siliconflow_wrapper.py
@@ -0,0 +1,70 @@
+"""
+siliconflow api wrapper
+https://siliconflow.readme.io/reference/chat-completions-1
+"""
+import os
+import requests
+
+
+token = os.environ.get('LLM_API_KEY', "")
+if not token:
+ raise ValueError('请设置环境变量LLM_API_KEY')
+
+url = "https://api.siliconflow.cn/v1/chat/completions"
+
+
+def sfa_llm(messages: list, model: str, logger=None, **kwargs) -> str:
+
+ if logger:
+ logger.debug(f'messages:\n {messages}')
+ logger.debug(f'model: {model}')
+ logger.debug(f'kwargs:\n {kwargs}')
+
+ payload = {
+ "model": model,
+ "messages": messages
+ }
+
+ payload.update(kwargs)
+
+ headers = {
+ "accept": "application/json",
+ "content-type": "application/json",
+ "authorization": f"Bearer {token}"
+ }
+
+ for i in range(2):
+ try:
+ response = requests.post(url, json=payload, headers=headers)
+ if response.status_code == 200:
+ try:
+ body = response.json()
+ usage = body.get('usage', 'Field "usage" not found')
+ choices = body.get('choices', 'Field "choices" not found')
+ if logger:
+ logger.debug(choices)
+ logger.debug(usage)
+ return choices[0]['message']['content']
+ except ValueError:
+                # the response body is not valid JSON
+ if logger:
+ logger.warning("Response body is not in JSON format.")
+ else:
+ print("Response body is not in JSON format.")
+ except requests.exceptions.RequestException as e:
+ if logger:
+ logger.warning(f"A request error occurred: {e}")
+ else:
+ print(f"A request error occurred: {e}")
+
+ if logger:
+ logger.info("retrying...")
+ else:
+ print("retrying...")
+
+ if logger:
+ logger.error("After many time, finally failed to get response from API.")
+ else:
+ print("After many time, finally failed to get response from API.")
+
+ return ''
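+
+
+if __name__ == '__main__':
+    # minimal smoke test, mirroring the other llm wrappers; the model name below is
+    # only an example and must be replaced with one actually served by siliconflow
+    from pprint import pprint
+    test_messages = [{'role': 'user', 'content': 'Hello, please introduce yourself.'}]
+    pprint(sfa_llm(test_messages, 'Qwen/Qwen2-7B-Instruct'))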
diff --git a/core/pb/CHANGELOG.md b/core/pb/CHANGELOG.md
new file mode 100644
index 0000000..ab2136a
--- /dev/null
+++ b/core/pb/CHANGELOG.md
@@ -0,0 +1,1016 @@
+## v0.22.12
+
+- Fixed calendar picker grid layout misalignment on Firefox ([#4865](https://github.com/pocketbase/pocketbase/issues/4865)).
+
+- Updated Go deps and bumped the min Go version in the GitHub release action to Go 1.22.3 since it comes with [some minor security fixes](https://github.com/golang/go/issues?q=milestone%3AGo1.22.3).
+
+
+## v0.22.11
+
+- Load the full record in the relation picker edit panel ([#4857](https://github.com/pocketbase/pocketbase/issues/4857)).
+
+
+## v0.22.10
+
+- Updated the uploaded filename normalization to take double extensions in consideration ([#4824](https://github.com/pocketbase/pocketbase/issues/4824))
+
+- Added Collection models cache to help speed up the common List and View requests execution with ~25%.
+ _This was extracted from the ongoing work on [#4355](https://github.com/pocketbase/pocketbase/discussions/4355) and there are many other small optimizations already implemented but they will have to wait for the refactoring to be finalized._
+
+
+## v0.22.9
+
+- Fixed Admin UI OAuth2 "Clear all fields" btn action to properly unset all form fields ([#4737](https://github.com/pocketbase/pocketbase/issues/4737)).
+
+
+## v0.22.8
+
+- Fixed '~' auto wildcard wrapping when the param has escaped `%` character ([#4704](https://github.com/pocketbase/pocketbase/discussions/4704)).
+
+- Other minor UI improvements (added `aria-expanded=true/false` to the dropdown triggers, added contrasting border around the default mail template btn style, etc.).
+
+- Updated Go deps and bumped the min Go version in the GitHub release action to Go 1.22.2 since it comes with [some `net/http` security and bug fixes](https://github.com/golang/go/issues?q=milestone%3AGo1.22.2).
+
+
+## v0.22.7
+
+- Replaced the default `s3blob` driver with a trimmed vendored version to reduce the binary size with ~10MB.
+ _It can be further reduced with another ~10MB once we replace entirely the `aws-sdk-go-v2` dependency but I stumbled on some edge cases related to the headers signing and for now is on hold._
+
+- Other minor improvements (updated GitLab OAuth2 provider logo [#4650](https://github.com/pocketbase/pocketbase/pull/4650), normalized error messages, updated npm dependencies, etc.)
+
+
+## v0.22.6
+
+- Admin UI accessibility improvements:
+ - Fixed the dropdowns tab/enter/space keyboard navigation ([#4607](https://github.com/pocketbase/pocketbase/issues/4607)).
+ - Added `role`, `aria-label`, `aria-hidden` attributes to some of the elements in attempt to better assist screen readers.
+
+
+## v0.22.5
+
+- Minor test helpers fixes ([#4600](https://github.com/pocketbase/pocketbase/issues/4600)):
+ - Call the `OnTerminate` hook on `TestApp.Cleanup()`.
+ - Automatically run the DB migrations on initializing the test app with `tests.NewTestApp()`.
+
+- Added more elaborate warning message when restoring a backup explaining how the operation works.
+
+- Skip irregular files (symbolic links, sockets, etc.) when restoring a backup zip from the Admin UI or calling `archive.Extract(src, dst)` because they come with too many edge cases and ambiguities.
+
+ More details
+
+ This was initially reported as security issue (_thanks Harvey Spec_) but in the PocketBase context it is not something that can be exploited without an admin intervention and since the general expectations are that the PocketBase admins can do anything and they are the one who manage their server, this should be treated with the same diligence when using `scp`/`rsync`/`rclone`/etc. with untrusted file sources.
+
+ It is not possible (_or at least I'm not aware how to do that easily_) to perform virus/malicious content scanning on the uploaded backup archive files and some caution is always required when using the Admin UI or running shell commands, hence the backup-restore warning text.
+
+ **Or in other words, if someone sends you a file and tell you to upload it to your server (either as backup zip or manually via scp) obviously you shouldn't do that unless you really trust them.**
+
+ PocketBase is like any other regular application that you run on your server and there is no builtin "sandbox" for what the PocketBase process can execute. This is left to the developers to restrict on application or OS level depending on their needs. If you are self-hosting PocketBase you usually don't have to do that, but if you are offering PocketBase as a service and allow strangers to run their own PocketBase instances on your server then you'll need to implement the isolation mechanisms on your own.
+
+
+
+## v0.22.4
+
+- Removed conflicting styles causing the detailed codeblock log data preview to not visualize properly ([#4505](https://github.com/pocketbase/pocketbase/pull/4505)).
+
+- Minor JSVM improvements:
+ - Added `$filesystem.fileFromUrl(url, optSecTimeout)` helper.
+ - Implemented the `FormData` interface and added support for sending `multipart/form-data` requests with `$http.send()` ([#4544](https://github.com/pocketbase/pocketbase/discussions/4544)).
+
+
+## v0.22.3
+
+- Fixed the z-index of the current admin dropdown on Safari ([#4492](https://github.com/pocketbase/pocketbase/issues/4492)).
+
+- Fixed `OnAfterApiError` debug log `nil` error reference ([#4498](https://github.com/pocketbase/pocketbase/issues/4498)).
+
+- Added the field name as part of the `@request.data.someRelField.*` join to handle the case when a collection has 2 or more relation fields pointing to the same place ([#4500](https://github.com/pocketbase/pocketbase/issues/4500)).
+
+- Updated Go deps and bumped the min Go version in the GitHub release action to Go 1.22.1 since it comes with [some security fixes](https://github.com/golang/go/issues?q=milestone%3AGo1.22.1).
+
+
+## v0.22.2
+
+- Fixed a small regression introduced with v0.22.0 that was causing some missing unknown fields to always return an error instead of applying the specific `nullifyMisingField` resolver option to the query.
+
+
+## v0.22.1
+
+- Fixed Admin UI record and collection panels not reinitializing properly on browser back/forward navigation ([#4462](https://github.com/pocketbase/pocketbase/issues/4462)).
+
+- Initialize `RecordAuthWithOAuth2Event.IsNewRecord` for the `OnRecordBeforeAuthWithOAuth2Request` hook ([#4437](https://github.com/pocketbase/pocketbase/discussions/4437)).
+
+- Added error checks to the autogenerated Go migrations ([#4448](https://github.com/pocketbase/pocketbase/issues/4448)).
+
+
+## v0.22.0
+
+- Added Planning Center OAuth2 provider ([#4393](https://github.com/pocketbase/pocketbase/pull/4393); thanks @alxjsn).
+
+- Admin UI improvements:
+ - Autosync collection changes across multiple open browser tabs.
+ - Fixed vertical image popup preview scrolling.
+ - Added options to export a subset of collections.
+ - Added option to import a subset of collections without deleting the others ([#3403](https://github.com/pocketbase/pocketbase/issues/3403)).
+
+- Added support for back/indirect relation `filter`/`sort` (single and multiple).
+ The syntax to reference back relation fields is `yourCollection_via_yourRelField.*`.
+ ⚠️ To avoid excessive joins, the nested relations resolver is now limited to max 6 level depth (the same as `expand`).
+ _Note that in the future there will be also more advanced and granular options to specify a subset of the fields that are filterable/sortable._
+
+- Added support for multiple back/indirect relation `expand` and updated the keys to use the `_via_` reference syntax (`yourCollection_via_yourRelField`).
+ _To minimize the breaking changes, the old parenthesis reference syntax (`yourCollection(yourRelField)`) will still continue to work but it is soft-deprecated and there will be a console log reminding you to change it to the new one._
+
+- ⚠️ Collections and fields are no longer allowed to have `_via_` in their name to avoid collisions with the back/indirect relation reference syntax.
+
+- Added `jsvm.Config.OnInit` optional config function to allow registering custom Go bindings to the JSVM.
+
+- Added `@request.context` rule field that can be used to apply a different set of constraints based on the API rule execution context.
+ For example, to disallow user creation by an OAuth2 auth, you could set for the users Create API rule `@request.context != "oauth2"`.
+ The currently supported `@request.context` values are:
+ ```
+ default
+ realtime
+ protectedFile
+ oauth2
+ ```
+
+- Adjusted the `cron.Start()` to start the ticker at the `00` second of the cron interval ([#4394](https://github.com/pocketbase/pocketbase/discussions/4394)).
+ _Note that the cron format has only minute granularity and there is still no guarantee that the scheduled job will be always executed at the `00` second._
+
+- Fixed auto backups cron not reloading properly after app settings change ([#4431](https://github.com/pocketbase/pocketbase/discussions/4431)).
+
+- Upgraded to `aws-sdk-go-v2` and added special handling for GCS to workaround the previous [GCS headers signature issue](https://github.com/pocketbase/pocketbase/issues/2231) that we had with v2.
+ _This should also fix the SVG/JSON zero response when using Cloudflare R2 ([#4287](https://github.com/pocketbase/pocketbase/issues/4287#issuecomment-1925168142), [#2068](https://github.com/pocketbase/pocketbase/discussions/2068), [#2952](https://github.com/pocketbase/pocketbase/discussions/2952))._
+ _⚠️ If you are using S3 for uploaded files or backups, please verify that you have a green check in the Admin UI for your S3 configuration (I've tested the new version with GCS, MinIO, Cloudflare R2 and Wasabi)._
+
+- Added `:each` modifier support for `file` and `relation` type fields (_previously it was supported only for `select` type fields_).
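+  _For example, a hypothetical create rule ensuring that every uploaded file in a multiple `documents` file field is a pdf:_
+  ```
+  @request.data.documents:each ~ ".pdf"
+  ```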
+
+- Other minor improvements (updated the `ghupdate` plugin to use the configured executable name when printing to the console, fixed the error reporting of `admin update/delete` commands, etc.).
+
+
+## v0.21.3
+
+- Ignore the JS required validations for disabled OIDC providers ([#4322](https://github.com/pocketbase/pocketbase/issues/4322)).
+
+- Allow `HEAD` requests to the `/api/health` endpoint ([#4310](https://github.com/pocketbase/pocketbase/issues/4310)).
+
+- Fixed the `editor` field value when visualized inside the View collection preview panel.
+
+- Manually clear all TinyMCE events on editor removal (_workaround for [tinymce#9377](https://github.com/tinymce/tinymce/issues/9377)_).
+
+
+## v0.21.2
+
+- Fixed an `@request.auth.*` initialization side-effect which caused the current authenticated user email to not be returned in the user auth response ([#2173](https://github.com/pocketbase/pocketbase/issues/2173#issuecomment-1932332038)).
+  _The current authenticated user email should always be accessible regardless of the `emailVisibility` state._
+
+- Fixed `RecordUpsert.RemoveFiles` godoc example.
+
+- Bumped the `thumbGenSem` limit to `NumCPU()+2` as some users reported that the previous value was too restrictive.
+
+
+## v0.21.1
+
+- Small fix for the Admin UI related to the _Settings > Sync_ menu not being visible even when the "Hide controls" toggle is off.
+
+
+## v0.21.0
+
+- Added Bitbucket OAuth2 provider ([#3948](https://github.com/pocketbase/pocketbase/pull/3948); thanks @aabajyan).
+
+- Mark user as verified on confirm password reset ([#4066](https://github.com/pocketbase/pocketbase/issues/4066)).
+ _If the user email has changed after issuing the reset token (eg. updated by an admin), then the `verified` user state remains unchanged._
+
+- Added support for loading a serialized json payload for `multipart/form-data` requests using the special `@jsonPayload` key.
+ _This is intended to be used primarily by the SDKs to resolve [js-sdk#274](https://github.com/pocketbase/js-sdk/issues/274)._
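+  _A minimal Go sketch of such a request (hypothetical endpoint and field values; uses only the `bytes`, `mime/multipart` and `net/http` std packages; error handling omitted):_
+  ```go
+  body := new(bytes.Buffer)
+  w := multipart.NewWriter(body)
+
+  // file part(s) as usual
+  part, _ := w.CreateFormFile("document", "report.pdf")
+  part.Write(pdfBytes) // pdfBytes is your []byte file content
+
+  // all regular record fields serialized as a single json payload
+  w.WriteField("@jsonPayload", `{"title":"hello","status":"draft"}`)
+  w.Close()
+
+  req, _ := http.NewRequest("POST", "http://127.0.0.1:8090/api/collections/example/records", body)
+  req.Header.Set("Content-Type", w.FormDataContentType())
+  http.DefaultClient.Do(req)
+  ```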
+
+- Added graceful OAuth2 redirect error handling ([#4177](https://github.com/pocketbase/pocketbase/issues/4177)).
+  _Previously, on redirect error we returned a standard json error response directly. Now on redirect error we'll redirect to a generic OAuth2 failure screen (similar to the success one) and will attempt to auto close the OAuth2 popup._
+  _The SDKs are also updated to handle the OAuth2 redirect error, and it will be returned as a Promise rejection of the `authWithOAuth2()` call._
+
+- Exposed `$apis.gzip()` and `$apis.bodyLimit(bytes)` middlewares to the JSVM.
+
+- Added `TestMailer.SentMessages` field that holds all sent test app emails until cleanup.
+
+- Optimized the cascade delete of records with multiple `relation` fields.
+
+- Updated the `serve` and `admin` commands error reporting.
+
+- Minor Admin UI improvements (reduced the min table row height, added an option to duplicate fields, added new TinyMCE codesample plugin languages, hide the collection sync settings when `Settings.Meta.HideControls` is enabled, etc.).
+
+
+## v0.20.7
+
+- Fixed the Admin UI auto indexes update when renaming fields with a common prefix ([#4160](https://github.com/pocketbase/pocketbase/issues/4160)).
+
+
+## v0.20.6
+
+- Fixed JSVM types generation for functions with omitted arg types ([#4145](https://github.com/pocketbase/pocketbase/issues/4145)).
+
+- Updated Go deps.
+
+
+## v0.20.5
+
+- Minor CSS fix for the Admin UI to prevent the searchbar within a popup from expanding too much and pushing the controls out of the visible area ([#4079](https://github.com/pocketbase/pocketbase/issues/4079#issuecomment-1876994116)).
+
+
+## v0.20.4
+
+- Small fix for a regression introduced with the recent `json` field changes that was causing View collection column expressions recognized as `json` to fail to resolve ([#4072](https://github.com/pocketbase/pocketbase/issues/4072)).
+
+
+## v0.20.3
+
+- Fixed the `json` field query comparisons to work correctly with plain JSON values like `null`, `bool` `number`, etc. ([#4068](https://github.com/pocketbase/pocketbase/issues/4068)).
+  Since there are plans to allow custom SQLite builds in the future, and since in some situations it may be useful to be able to distinguish `NULL` from `''`,
+  we no longer apply `COALESCE` by default for the `json` fields (and for any other future non-standard field), aka.:
+ ```
+ Dataset:
+ 1) data: json(null)
+ 2) data: json('')
+
+ For the filter "data = null" only 1) will resolve to TRUE.
+ For the filter "data = ''" only 2) will resolve to TRUE.
+ ```
+
+- Minor Go tests improvements
+ - Sorted the record cascade delete references to ensure that the delete operation will preserve the order of the fired events when running the tests.
+  - Marked some of the tests as safe for parallel execution to speed up the GitHub action build times a little.
+
+
+## v0.20.2
+
+- Added `sleep(milliseconds)` JSVM binding.
+ _It works the same way as Go `time.Sleep()`, aka. it pauses the goroutine where the JSVM code is running._
+
+- Fixed multi-line text paste in the Admin UI search bar ([#4022](https://github.com/pocketbase/pocketbase/discussions/4022)).
+
+- Fixed the monospace font loading in the Admin UI.
+
+- Fixed various reported docs and code comment typos.
+
+
+## v0.20.1
+
+- Added `--dev` flag and its accompanying `app.IsDev()` method (_in place of the previously removed `--debug`_) to assist during development ([#3918](https://github.com/pocketbase/pocketbase/discussions/3918)).
+  The `--dev` flag prints "everything" in the console, more specifically:
+ - the data DB SQL statements
+  - all `app.Logger().*` logs (debug, info, warning, error, etc.), regardless of the log persistence settings in the Admin UI
+
+- Minor Admin UI fixes:
+ - Fixed the log `error` label text wrapping.
+ - Added the log `referer` (_when it is from a different source_) and `details` labels in the logs listing.
+ - Removed the blank current time entry from the logs chart because it was causing confusion when used with custom time ranges.
+ - Updated the SQL syntax highlighter and keywords autocompletion in the Admin UI to recognize `CAST(x as bool)` expressions.
+
+- Replaced the default API tests timeout with a new `ApiScenario.Timeout` option ([#3930](https://github.com/pocketbase/pocketbase/issues/3930)).
+ A negative or zero value means no tests timeout.
+  If a single API test takes more than 3s to complete, a log message will be visible when the test fails or when the `go test -v` flag is used.
+
+- Added a timestamp at the beginning of the generated JSVM types file to avoid recreating it on every app startup.
+
+
+## v0.20.0
+
+- Added `expand`, `filter`, `fields`, custom query and headers parameters support for the realtime subscriptions.
+ _Requires JS SDK v0.20.0+ or Dart SDK v0.17.0+._
+
+ ```js
+ // JS SDK v0.20.0
+ pb.collection("example").subscribe("*", (e) => {
+ ...
+ }, {
+ expand: "someRelField",
+ filter: "status = 'active'",
+ fields: "id,expand.someRelField.*:excerpt(100)",
+ })
+ ```
+
+ ```dart
+ // Dart SDK v0.17.0
+ pb.collection("example").subscribe("*", (e) {
+ ...
+ },
+ expand: "someRelField",
+ filter: "status = 'active'",
+ fields: "id,expand.someRelField.*:excerpt(100)",
+ )
+ ```
+
+- Generalized the logs to allow any kind of application logs, not just requests.
+
+ The new `app.Logger()` implements the standard [`log/slog` interfaces](https://pkg.go.dev/log/slog) available with Go 1.21.
+ ```
+ // Go: https://pocketbase.io/docs/go-logging/
+ app.Logger().Info("Example message", "total", 123, "details", "lorem ipsum...")
+
+ // JS: https://pocketbase.io/docs/js-logging/
+ $app.logger().info("Example message", "total", 123, "details", "lorem ipsum...")
+ ```
+
+  For better performance and to minimize blocking on hot paths, logs are currently debounced and written in batches:
+ - 3 seconds after the last debounced log write
+ - when the batch threshold is reached (currently 200)
+ - right before app termination to attempt saving everything from the existing logs queue
+
+ Some notable log related changes:
+
+ - ⚠️ Bumped the minimum required Go version to 1.21.
+
+ - ⚠️ Removed `_requests` table in favor of the generalized `_logs`.
+ _Note that existing logs will be deleted!_
+
+ - ⚠️ Renamed the following `Dao` log methods:
+ ```go
+ Dao.RequestQuery(...) -> Dao.LogQuery(...)
+ Dao.FindRequestById(...) -> Dao.FindLogById(...)
+ Dao.RequestsStats(...) -> Dao.LogsStats(...)
+ Dao.DeleteOldRequests(...) -> Dao.DeleteOldLogs(...)
+ Dao.SaveRequest(...) -> Dao.SaveLog(...)
+ ```
+ - ⚠️ Removed `app.IsDebug()` and the `--debug` flag.
+ This was done to avoid the confusion with the new logger and its debug severity level.
+ If you want to store debug logs you can set `-4` as min log level from the Admin UI.
+
+ - Refactored Admin UI Logs:
+ - Added new logs table listing.
+ - Added log settings option to toggle the IP logging for the activity logger.
+ - Added log settings option to specify a minimum log level.
+ - Added controls to export individual or bulk selected logs as json.
+ - Other minor improvements and fixes.
+
+- Added new `filesystem/System.Copy(src, dest)` method to copy existing files from one location to another.
+ _This is usually useful when duplicating records with `file` field(s) programmatically._
+
+- Added `filesystem.NewFileFromUrl(ctx, url)` helper method to construct a `*filesystem.BytesReader` file from the specified url.
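+  _A minimal combined sketch of the two helpers above (hypothetical file paths and url; error handling omitted):_
+  ```go
+  fsys, _ := app.NewFilesystem()
+  defer fsys.Close()
+
+  // duplicate an already uploaded record file
+  fsys.Copy("demo123/record456/avatar.png", "demo123/record789/avatar.png")
+
+  // build a file value from a remote url, eg. to assign it later to a record
+  f, _ := filesystem.NewFileFromUrl(context.Background(), "https://example.com/image.png")
+  _ = f
+  ```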
+
+- OAuth2 related additions:
+
+ - Added new `PKCE()` and `SetPKCE(enable)` OAuth2 methods to indicate whether the PKCE flow is supported or not.
+ _The PKCE value is currently configurable from the UI only for the OIDC providers._
+ _This was added to accommodate OIDC providers that may throw an error if unsupported PKCE params are submitted with the auth request (eg. LinkedIn; see [#3799](https://github.com/pocketbase/pocketbase/discussions/3799#discussioncomment-7640312))._
+
+ - Added new `displayName` field for each `listAuthMethods()` OAuth2 provider item.
+ _The value of the `displayName` property is currently configurable from the UI only for the OIDC providers._
+
+ - Added `expiry` field to the OAuth2 user response containing the _optional_ expiration time of the OAuth2 access token ([#3617](https://github.com/pocketbase/pocketbase/discussions/3617)).
+
+  - Allow a single OAuth2 user to be used for authentication in multiple auth collections.
+    _⚠️ Because there can now be more than one external auth entry with the same `provider` and `providerId` (uniqueness is now per `collectionId-provider-providerId` pair), the `Dao.FindExternalAuthByProvider(provider, providerId)` method was removed in favour of the more generic `Dao.FindFirstExternalAuthByExpr(expr)`._
+
+- Added `onlyVerified` auth collection option to globally disallow authentication requests for unverified users.
+
+- Added support for single line comments (ex. `// your comment`) in the API rules and filter expressions.
+
+- Added support for specifying a collection alias in `@collection.someCollection:alias.*`.
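+  _For example, a hypothetical rule that joins the same `reviews` collection twice under different aliases:_
+  ```
+  @collection.reviews.product ?= id && @collection.reviews:myReviews.user ?= @request.auth.id
+  ```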
+
+- Soft-deprecated and renamed `app.Cache()` with `app.Store()`.
+
+- Minor JSVM updates and fixes:
+
+ - Updated `$security.parseUnverifiedJWT(token)` and `$security.parseJWT(token, key)` to return the token payload result as plain object.
+
+ - Added `$apis.requireGuestOnly()` middleware JSVM binding ([#3896](https://github.com/pocketbase/pocketbase/issues/3896)).
+
+- Use `IS NOT` instead of `!=` as not-equal SQL query operator to handle the cases when comparing with nullable columns or expressions (eg. `json_extract` over `json` field).
+ _Based on my local dataset I wasn't able to find a significant difference in the performance between the 2 operators, but if you stumble on a query that you think may be affected negatively by this, please report it and I'll test it further._
+
+- Added `MaxSize` `json` field option to prevent storing large json data in the db ([#3790](https://github.com/pocketbase/pocketbase/issues/3790)).
+ _Existing `json` fields are updated with a system migration to have a ~2MB size limit (it can be adjusted from the Admin UI)._
+
+- Fixed negative string number normalization support for the `json` field type.
+
+- Trigger the `app.OnTerminate()` hook on `app.Restart()` call.
+ _A new bool `IsRestart` field was also added to the `core.TerminateEvent` event._
+
+- Fixed graceful shutdown handling and speed up a little the app termination time.
+
+- Limit the concurrent thumbs generation to avoid high CPU and memory usage in spiky scenarios ([#3794](https://github.com/pocketbase/pocketbase/pull/3794); thanks @t-muehlberger).
+ _Currently the max concurrent thumbs generation processes are limited to "total of logical process CPUs + 1"._
+  _This is arbitrarily chosen and may change in the future depending on user feedback and usage patterns._
+ _If you are experiencing OOM errors during large image thumb generations, especially in container environment, you can try defining the `GOMEMLIMIT=500MiB` env variable before starting the executable._
+
+- Slightly speed up (~10%) the thumbs generation by changing from cubic (`CatmullRom`) to bilinear (`Linear`) resampling filter (_the quality difference is very little_).
+
+- Added a default red colored Stderr output in case of a console command error.
+ _You can now also silence individually custom commands errors using the `cobra.Command.SilenceErrors` field._
+
+- Fixed links formatting in the autogenerated html->text mail body.
+
+- Removed incorrectly imported empty `local('')` font-face declarations.
+
+
+## v0.19.4
+
+- Fixed TinyMCE source code viewer textarea styles ([#3715](https://github.com/pocketbase/pocketbase/issues/3715)).
+
+- Fixed `text` field min/max validators to properly count multi-byte characters ([#3735](https://github.com/pocketbase/pocketbase/issues/3735)).
+
+- Allowed hyphens in `username` ([#3697](https://github.com/pocketbase/pocketbase/issues/3697)).
+ _More control over the system fields settings will be available in the future._
+
+- Updated the JSVM generated types to use directly the value type instead of `* | undefined` union in functions/methods return declarations.
+
+
+## v0.19.3
+
+- Added the release notes to the console output of `./pocketbase update` ([#3685](https://github.com/pocketbase/pocketbase/discussions/3685)).
+
+- Added missing documentation for the JSVM `$mails.*` bindings.
+
+- Relaxed the OAuth2 redirect url validation to allow any string value ([#3689](https://github.com/pocketbase/pocketbase/pull/3689); thanks @sergeypdev).
+ _Note that the redirect url format is still bound to the accepted values by the specific OAuth2 provider._
+
+
+## v0.19.2
+
+- Updated the JSVM generated types ([#3627](https://github.com/pocketbase/pocketbase/issues/3627), [#3662](https://github.com/pocketbase/pocketbase/issues/3662)).
+
+
+## v0.19.1
+
+- Fixed `tokenizer.Scan()/ScanAll()` to ignore the separators from the default trim cutset.
+ An option to return also the empty found tokens was also added via `Tokenizer.KeepEmptyTokens(true)`.
+ _This should fix the parsing of whitespace characters around view query column names when no quotes are used ([#3616](https://github.com/pocketbase/pocketbase/discussions/3616#discussioncomment-7398564))._
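+  _A minimal sketch with the `tools/tokenizer` subpackage (the example input/output values are assumptions):_
+  ```go
+  t := tokenizer.NewFromString("a, b, , c")
+  t.KeepEmptyTokens(true)
+
+  tokens, _ := t.ScanAll() // eg. ["a", "b", "", "c"]
+  _ = tokens
+  ```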
+
+- Fixed the `:excerpt(max, withEllipsis?)` `fields` query param modifier to properly add space to the generated text fragment after block tags.
+
+
+## v0.19.0
+
+- Added Patreon OAuth2 provider ([#3323](https://github.com/pocketbase/pocketbase/pull/3323); thanks @ghostdevv).
+
+- Added mailcow OAuth2 provider ([#3364](https://github.com/pocketbase/pocketbase/pull/3364); thanks @thisni1s).
+
+- Added support for `:excerpt(max, withEllipsis?)` `fields` modifier that will return a short plain text version of any string value (html tags are stripped).
+ This could be used to minimize the downloaded json data when listing records with large `editor` html values.
+ ```js
+ await pb.collection("example").getList(1, 20, {
+ "fields": "*,description:excerpt(100)"
+ })
+ ```
+
+- Several Admin UI improvements:
+ - Count the total records separately to speed up the query execution for large datasets ([#3344](https://github.com/pocketbase/pocketbase/issues/3344)).
+ - Enclosed the listing scrolling area within the table so that the horizontal scrollbar and table header are always reachable ([#2505](https://github.com/pocketbase/pocketbase/issues/2505)).
+ - Allowed opening the record preview/update form via direct URL ([#2682](https://github.com/pocketbase/pocketbase/discussions/2682)).
+ - Reintroduced the local `date` field tooltip on hover.
+ - Speed up the listing loading times for records with large `editor` field values by initially fetching only a partial of the records data (the complete record data is loaded on record preview/update).
+ - Added "Media library" (collection images picker) support for the TinyMCE `editor` field.
+ - Added support to "pin" collections in the sidebar.
+ - Added support to manually resize the collections sidebar.
+ - More clear "Nonempty" field label style.
+ - Removed the legacy `.woff` and `.ttf` fonts and keep only `.woff2`.
+
+- Removed the explicit `Content-Type` charset from the realtime response due to compatibility issues with IIS ([#3461](https://github.com/pocketbase/pocketbase/issues/3461)).
+ _The `Connection:keep-alive` realtime response header was also removed as it is not really used with HTTP2 anyway._
+
+- Added new JSVM bindings:
+ - `new Cookie({ ... })` constructor for creating `*http.Cookie` equivalent value.
+ - `new SubscriptionMessage({ ... })` constructor for creating a custom realtime subscription payload.
+ - Soft-deprecated `$os.exec()` in favour of `$os.cmd()` to make it more clear that the call only prepares the command and doesn't execute it.
+
+- ⚠️ Bumped the min required Go version to 1.19.
+
+
+## v0.18.10
+
+- Added global `raw` template function to allow outputting raw/verbatim HTML content in the JSVM templates ([#3476](https://github.com/pocketbase/pocketbase/discussions/3476)).
+ ```
+ {{.description|raw}}
+ ```
+
+- Trimmed view query semicolon and allowed single quotes for column aliases ([#3450](https://github.com/pocketbase/pocketbase/issues/3450#issuecomment-1748044641)).
+ _Single quotes are usually [not a valid identifier quote characters](https://www.sqlite.org/lang_keywords.html), but for resilience and compatibility reasons SQLite allows them in some contexts where only an identifier is expected._
+
+- Bumped the GitHub action to use [min Go 1.21.2](https://github.com/golang/go/issues?q=milestone%3AGo1.21.2) (_the fixed issues are not critical as they are mostly related to the compiler/build tools_).
+
+
+## v0.18.9
+
+- Fixed empty thumbs directories not getting deleted on Windows after deleting a record img file ([#3382](https://github.com/pocketbase/pocketbase/issues/3382)).
+
+- Updated the generated JSVM typings to silent the TS warnings when trying to access a field/method in a Go->TS interface.
+
+
+## v0.18.8
+
+- Minor fix for the View collections API Preview and Admin UI listings incorrectly showing the `created` and `updated` fields as `N/A` when the view query doesn't have them.
+
+
+## v0.18.7
+
+- Fixed JS error in the Admin UI when listing records with invalid `relation` field value ([#3372](https://github.com/pocketbase/pocketbase/issues/3372)).
+  _This could usually happen only during custom SQL import scripts or when directly modifying the record field value without data validations._
+
+- Updated Go deps and the generated JSVM types.
+
+
+## v0.18.6
+
+- Return the response headers and cookies in the `$http.send()` result ([#3310](https://github.com/pocketbase/pocketbase/discussions/3310)).
+
+- Added more descriptive internal error message for missing user/admin email on password reset requests.
+
+- Updated Go deps.
+
+
+## v0.18.5
+
+- Fixed minor Admin UI JS error in the auth collection options panel introduced with the change from v0.18.4.
+
+
+## v0.18.4
+
+- Added escape character (`\`) support in the Admin UI to allow using `select` field values with comma ([#2197](https://github.com/pocketbase/pocketbase/discussions/2197)).
+
+
+## v0.18.3
+
+- Exposed a global JSVM `readerToString(reader)` helper function to allow reading Go `io.Reader` values ([#3273](https://github.com/pocketbase/pocketbase/discussions/3273)).
+
+- Bumped the GitHub action to use [min Go 1.21.1](https://github.com/golang/go/issues?q=milestone%3AGo1.21.1+label%3ACherryPickApproved) for the prebuilt executable since it contains some minor `html/template` and `net/http` security fixes.
+
+
+## v0.18.2
+
+- Prevent breaking the record form in the Admin UI in case the browser's localStorage quota has been exceeded when uploading or storing large `editor` values ([#3265](https://github.com/pocketbase/pocketbase/issues/3265)).
+
+- Updated docs and missing JSVM typings.
+
+- Exposed additional crypto primitives under the `$security.*` JSVM namespace ([#3273](https://github.com/pocketbase/pocketbase/discussions/3273)):
+ ```js
+ // HMAC with SHA256
+ $security.hs256("hello", "secret")
+
+ // HMAC with SHA512
+ $security.hs512("hello", "secret")
+
+ // compare 2 strings with a constant time
+ $security.equal(hash1, hash2)
+ ```
+
+
+## v0.18.1
+
+- Excluded the local temp dir from the backups ([#3261](https://github.com/pocketbase/pocketbase/issues/3261)).
+
+
+## v0.18.0
+
+- Simplified the `serve` command to accept domain name(s) as argument to reduce any additional manual hosts setup that sometimes previously was needed when deploying on production ([#3190](https://github.com/pocketbase/pocketbase/discussions/3190)).
+ ```sh
+ ./pocketbase serve yourdomain.com
+ ```
+
+- Added `fields` wildcard (`*`) support.
+
+- Added option to upload a backup file from the Admin UI ([#2599](https://github.com/pocketbase/pocketbase/issues/2599)).
+
+- Registered a custom Deflate compressor to speed up (_nearly 2-3x_) the backups generation at the cost of a small zip size increase.
+  _Based on several local tests, a `pb_data` of ~500MB (of which ~350MB+ are several hundred small files) results in a ~280MB zip generated in ~11s (previously it resulted in a ~250MB zip but took ~35s)._
+
+- Added the application name as part of the autogenerated backup name for easier identification ([#3066](https://github.com/pocketbase/pocketbase/issues/3066)).
+
+- Added new `SmtpConfig.LocalName` option to specify a custom domain name (or IP address) for the initial EHLO/HELO exchange ([#3097](https://github.com/pocketbase/pocketbase/discussions/3097)).
+ _This is usually required for verification purposes only by some SMTP providers, such as on-premise [Gmail SMTP-relay](https://support.google.com/a/answer/2956491)._
+
+- Added `NoDecimal` `number` field option.
+
+- `editor` field improvements:
+ - Added new "Strip urls domain" option to allow controlling the default TinyMCE urls behavior (_default to `false` for new content_).
+ - Normalized pasted text while still preserving links, lists, tables, etc. formatting ([#3257](https://github.com/pocketbase/pocketbase/issues/3257)).
+
+- Added option to auto generate admin and auth record passwords from the Admin UI.
+
+- Added JSON validation and syntax highlight for the `json` field in the Admin UI ([#3191](https://github.com/pocketbase/pocketbase/issues/3191)).
+
+- Added datetime filter macros:
+ ```
+ // all macros are UTC based
+ @second - @now second number (0-59)
+ @minute - @now minute number (0-59)
+ @hour - @now hour number (0-23)
+ @weekday - @now weekday number (0-6)
+ @day - @now day number
+ @month - @now month number
+ @year - @now year number
+ @todayStart - beginning of the current day as datetime string
+ @todayEnd - end of the current day as datetime string
+ @monthStart - beginning of the current month as datetime string
+ @monthEnd - end of the current month as datetime string
+ @yearStart - beginning of the current year as datetime string
+ @yearEnd - end of the current year as datetime string
+ ```
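+  _For example, a filter matching records created today:_
+  ```
+  created >= @todayStart && created <= @todayEnd
+  ```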
+
+- Added cron expression macros ([#3132](https://github.com/pocketbase/pocketbase/issues/3132)):
+ ```
+ @yearly - "0 0 1 1 *"
+ @annually - "0 0 1 1 *"
+ @monthly - "0 0 1 * *"
+ @weekly - "0 0 * * 0"
+ @daily - "0 0 * * *"
+ @midnight - "0 0 * * *"
+ @hourly - "0 * * * *"
+ ```
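+  _A minimal sketch using the `tools/cron` subpackage (hypothetical job id):_
+  ```go
+  c := cron.New()
+
+  // "@daily" expands to "0 0 * * *"
+  c.MustAdd("dailyReport", "@daily", func() {
+      log.Println("generating the daily report...")
+  })
+
+  c.Start()
+  ```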
+
+- ⚠️ Added offset argument `Dao.FindRecordsByFilter(collection, filter, sort, limit, offset, [params...])`.
+ _If you don't need an offset, you can set it to `0`._
+
+- To minimize the footguns with `Dao.FindFirstRecordByFilter()` and `Dao.FindRecordsByFilter()`, the functions now support an optional placeholder params argument that is safe to be populated with untrusted user input.
+ The placeholders are in the same format as when binding regular SQL parameters.
+ ```go
+ // unsanitized and untrusted filter variables
+ status := "..."
+ author := "..."
+
+ app.Dao().FindFirstRecordByFilter("articles", "status={:status} && author={:author}", dbx.Params{
+ "status": status,
+ "author": author,
+ })
+
+ app.Dao().FindRecordsByFilter("articles", "status={:status} && author={:author}", "-created", 10, 0, dbx.Params{
+ "status": status,
+ "author": author,
+ })
+ ```
+
+- Added JSVM `$mails.*` binds for the corresponding Go [mails package](https://pkg.go.dev/github.com/pocketbase/pocketbase/mails) functions.
+
+- Added JSVM helper crypto primitives under the `$security.*` namespace:
+ ```js
+ $security.md5(text)
+ $security.sha256(text)
+ $security.sha512(text)
+ ```
+
+- ⚠️ Deprecated `RelationOptions.DisplayFields` in favor of the new `SchemaField.Presentable` option to avoid the duplication when a single collection is referenced more than once and/or by multiple other collections.
+
+- ⚠️ Fill the `LastVerificationSentAt` and `LastResetSentAt` fields only after a successful email send ([#3121](https://github.com/pocketbase/pocketbase/issues/3121)).
+
+- ⚠️ Skip API `fields` json transformations for non 20x responses ([#3176](https://github.com/pocketbase/pocketbase/issues/3176)).
+
+- ⚠️ Changes to `tests.ApiScenario` struct:
+
+  - The `ApiScenario.AfterTestFunc` now receives a `*http.Response` pointer as its 3rd argument instead of `*echo.Echo`, as the latter is not really useful in this context.
+ ```go
+ // old
+ AfterTestFunc: func(t *testing.T, app *tests.TestApp, e *echo.Echo)
+
+ // new
+ AfterTestFunc: func(t *testing.T, app *tests.TestApp, res *http.Response)
+ ```
+
+  - The `ApiScenario.TestAppFactory` now accepts the test instance as an argument and no longer expects an error as a return result ([#3025](https://github.com/pocketbase/pocketbase/discussions/3025#discussioncomment-6592272)).
+ ```go
+ // old
+ TestAppFactory: func() (*tests.TestApp, error)
+
+ // new
+ TestAppFactory: func(t *testing.T) *tests.TestApp
+ ```
+ _Returning a `nil` app instance from the factory results in test failure. You can enforce a custom test failure by calling `t.Fatal(err)` inside the factory._
+
+- Bumped the min required TLS version to 1.2 in order to improve the cert reputation score.
+
+- Reduced the default JSVM prewarmed pool size to 25 to reduce the initial memory consumption (_you can manually adjust the pool size with `--hooksPool=50` if you need to, but the default should suffice for most cases_).
+
+- Updated the `gocloud.dev` dependency to v0.34 and explicitly set the new `NoTempDir` fileblob option to prevent the cross-device link error introduced with v0.33.
+
+- Other minor Admin UI and docs improvements.
+
+
+## v0.17.7
+
+- Fixed the autogenerated `down` migrations to properly revert the old collection rules in case a change was made in `up` ([#3192](https://github.com/pocketbase/pocketbase/pull/3192); thanks @impact-merlinmarek).
+  _Existing `down` migrations can't be fixed, but that should be ok as `down` migrations are rarely run against prod environments since they can cause data loss. And while not ideal, the previous behavior of always setting the rules to `null/nil` is safer than not updating the rules at all._
+
+- Updated some Go deps.
+
+
+## v0.17.6
+
+- Fixed JSVM `require()` file path error when using Windows-style path delimiters ([#3163](https://github.com/pocketbase/pocketbase/issues/3163#issuecomment-1685034438)).
+
+
+## v0.17.5
+
+- Added quotes around the wrapped view query columns introduced with v0.17.4.
+
+
+## v0.17.4
+
+- Fixed Views record retrieval when numeric id is used ([#3110](https://github.com/pocketbase/pocketbase/issues/3110)).
+ _With this fix we also now properly recognize `CAST(... as TEXT)` and `CAST(... as BOOLEAN)` as `text` and `bool` fields._
+
+- Fixed `relation` "Cascade delete" tooltip message ([#3098](https://github.com/pocketbase/pocketbase/issues/3098)).
+
+- Fixed jsvm error message prefix on failed migrations ([#3103](https://github.com/pocketbase/pocketbase/pull/3103); thanks @nzhenev).
+
+- Disabled the initial Admin UI admins counter cache when there are no initial admins to allow detecting externally created accounts (eg. with the `admin` command) ([#3106](https://github.com/pocketbase/pocketbase/issues/3106)).
+
+- Downgraded `google/go-cloud` dependency to v0.32.0 until v0.34.0 is released to prevent the `os.TempDir` `cross-device link` errors as too many users complained about it.
+
+
+## v0.17.3
+
+- Fixed Docker `cross-device link` error when creating `pb_data` backups on a local mounted volume ([#3089](https://github.com/pocketbase/pocketbase/issues/3089)).
+
+- Fixed the error messages for relation to views ([#3090](https://github.com/pocketbase/pocketbase/issues/3090)).
+
+- Always reserve space for the scrollbar to reduce the layout shifts in the Admin UI records listing due to the deprecated `overflow: overlay`.
+
+- Enabled lazy loading for the Admin UI thumb images.
+
+
+## v0.17.2
+
+- Soft-deprecated `$http.send({ data: object, ... })` in favour of `$http.send({ body: rawString, ... })`
+ to allow sending non-JSON body with the request ([#3058](https://github.com/pocketbase/pocketbase/discussions/3058)).
+ The existing `data` prop will still work, but it is recommended to use `body` instead (_to send JSON you can use `JSON.stringify(...)` as body value_).
+
+- Added `core.RealtimeConnectEvent.IdleTimeout` field to allow specifying a different realtime idle timeout duration per client basis ([#3054](https://github.com/pocketbase/pocketbase/discussions/3054)).
+
+- Fixed `apis.RequestData` deprecation log note ([#3068](https://github.com/pocketbase/pocketbase/pull/3068); thanks @gungjodi).
+
+
+## v0.17.1
+
+- Use relative path when redirecting to the OAuth2 providers page in the Admin UI to support subpath deployments ([#3026](https://github.com/pocketbase/pocketbase/pull/3026); thanks @sonyarianto).
+
+- Manually trigger the `OnBeforeServe` hook for `tests.ApiScenario` ([#3025](https://github.com/pocketbase/pocketbase/discussions/3025)).
+
+- Trigger the JSVM `cronAdd()` handler only on app `serve` to prevent unexpected (and eventually duplicated) cron handler calls when custom console commands are used ([#3024](https://github.com/pocketbase/pocketbase/discussions/3024#discussioncomment-6592703)).
+
+- The `console.log()` messages are now written to the `stdout` instead of `stderr`.
+
+
+## v0.17.0
+
+- New more detailed guides for using PocketBase as framework (both Go and JS).
+ _If you find any typos or issues with the docs please report them in https://github.com/pocketbase/site._
+
+- Added new experimental JavaScript app hooks binding via [goja](https://github.com/dop251/goja).
+ They are available by default with the prebuilt executable if you create `*.pb.js` file(s) in the `pb_hooks` directory.
+ Lower your expectations because the integration comes with some limitations. For more details please check the [Extend with JavaScript](https://pocketbase.io/docs/js-overview/) guide.
+ Optionally, you can also enable the JS app hooks as part of a custom Go build for dynamic scripting but you need to register the `jsvm` plugin manually:
+ ```go
+  jsvm.MustRegister(app, jsvm.Config{})
+ ```
+
+- Added Instagram OAuth2 provider ([#2534](https://github.com/pocketbase/pocketbase/pull/2534); thanks @pnmcosta).
+
+- Added VK OAuth2 provider ([#2533](https://github.com/pocketbase/pocketbase/pull/2533); thanks @imperatrona).
+
+- Added Yandex OAuth2 provider ([#2762](https://github.com/pocketbase/pocketbase/pull/2762); thanks @imperatrona).
+
+- Added new fields to `core.ServeEvent`:
+ ```go
+ type ServeEvent struct {
+ App App
+ Router *echo.Echo
+ // new fields
+ Server *http.Server // allows adjusting the HTTP server config (global timeouts, TLS options, etc.)
+ CertManager *autocert.Manager // allows adjusting the autocert options (cache dir, host policy, etc.)
+ }
+ ```
+
+- Added `record.ExpandedOne(rel)` and `record.ExpandedAll(rel)` helpers to retrieve casted single or multiple expand relations from the already loaded "expand" Record data.
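+  _A minimal sketch (hypothetical relation and field names):_
+  ```go
+  // assuming the record "expand" data was already loaded, eg. via ?expand=author,comments
+  author := record.ExpandedOne("author")     // single relation -> *models.Record (or nil)
+  comments := record.ExpandedAll("comments") // multiple relation -> []*models.Record
+
+  if author != nil {
+      log.Println(author.GetString("name"), len(comments))
+  }
+  ```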
+
+- Added rule and filter record `Dao` helpers:
+ ```go
+ app.Dao().FindRecordsByFilter("posts", "title ~ 'lorem ipsum' && visible = true", "-created", 10)
+ app.Dao().FindFirstRecordByFilter("posts", "slug='test' && active=true")
+ app.Dao().CanAccessRecord(record, requestInfo, rule)
+ ```
+
+- Added `Dao.WithoutHooks()` helper to create a new `Dao` from the current one but without the create/update/delete hooks.
+
+- Use a default fetch function that will return all relations in case the `fetchFunc` argument of `Dao.ExpandRecord(record, expands, fetchFunc)` and `Dao.ExpandRecords(records, expands, fetchFunc)` is `nil`.
+
+- For convenience it is now possible to call `Dao.RecordQuery(collectionModelOrIdentifier)` with just the collection id or name.
+  In case an invalid collection id/name string is passed, the query resolves with a cancelled context error.
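+  _A minimal sketch combining this and the `nil` `fetchFunc` default from the earlier point (hypothetical collection, field and relation names):_
+  ```go
+  records := []*models.Record{}
+
+  err := app.Dao().RecordQuery("articles").
+      AndWhere(dbx.HashExp{"active": true}).
+      Limit(10).
+      All(&records)
+  if err != nil {
+      log.Fatal(err)
+  }
+
+  // a nil fetchFunc now defaults to fetching all related records
+  app.Dao().ExpandRecords(records, []string{"category"}, nil)
+  ```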
+
+- Refactored `apis.ApiError` validation errors serialization to allow `map[string]error` and `map[string]any` when generating the public safe formatted `ApiError.Data`.
+
+- Added support for wrapped API errors (_in case Go 1.20+ is used with multiple wrapped errors, the first `apis.ApiError` takes precedence_).
+
+- Added `?download=1` file query parameter to the file serving endpoint to force the browser to always download the file and not show its preview.
+
+- Added new utility `github.com/pocketbase/pocketbase/tools/template` subpackage to assist with rendering HTML templates using the standard Go `html/template` and `text/template` syntax.
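+  _A minimal sketch (hypothetical template files):_
+  ```go
+  registry := template.NewRegistry()
+
+  html, err := registry.LoadFiles(
+      "views/layout.html",
+      "views/hello.html",
+  ).Render(map[string]any{"name": "world"})
+  if err != nil {
+      log.Fatal(err)
+  }
+  log.Println(html)
+  ```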
+
+- Added `types.JsonMap.Get(k)` and `types.JsonMap.Set(k, v)` helpers for the cases where the type aliased direct map access is not allowed (eg. in [goja](https://pkg.go.dev/github.com/dop251/goja#hdr-Maps_with_methods)).
+
+- Soft-deprecated `security.NewToken()` in favor of `security.NewJWT()`.
+
+- `Hook.Add()` and `Hook.PreAdd()` now return a unique string identifier that can be used to remove the registered hook handler via `Hook.Remove(handlerId)`.
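+  _A minimal sketch:_
+  ```go
+  handlerId := app.OnTerminate().Add(func(e *core.TerminateEvent) error {
+      log.Println("terminating...")
+      return nil
+  })
+
+  // later, deregister the handler by its id
+  app.OnTerminate().Remove(handlerId)
+  ```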
+
+- Changed the after* hooks to be called right before writing the user response, allowing users to return response errors from the after hooks.
+  There is also no longer a need to explicitly return `hook.StopPropagation` when writing a custom response body in a hook, because the finalizer response body write is skipped if a response was already "committed".
+
+- ⚠️ Renamed `*Options{}` to `Config{}` for consistency and replaced the unnecessary pointers with their value equivalent to keep the applied configuration defaults isolated within their function calls:
+ ```go
+ old: pocketbase.NewWithConfig(config *pocketbase.Config) *pocketbase.PocketBase
+ new: pocketbase.NewWithConfig(config pocketbase.Config) *pocketbase.PocketBase
+
+ old: core.NewBaseApp(config *core.BaseAppConfig) *core.BaseApp
+ new: core.NewBaseApp(config core.BaseAppConfig) *core.BaseApp
+
+ old: apis.Serve(app core.App, options *apis.ServeOptions) error
+ new: apis.Serve(app core.App, config apis.ServeConfig) (*http.Server, error)
+
+ old: jsvm.MustRegisterMigrations(app core.App, options *jsvm.MigrationsOptions)
+ new: jsvm.MustRegister(app core.App, config jsvm.Config)
+
+ old: ghupdate.MustRegister(app core.App, rootCmd *cobra.Command, options *ghupdate.Options)
+ new: ghupdate.MustRegister(app core.App, rootCmd *cobra.Command, config ghupdate.Config)
+
+ old: migratecmd.MustRegister(app core.App, rootCmd *cobra.Command, options *migratecmd.Options)
+ new: migratecmd.MustRegister(app core.App, rootCmd *cobra.Command, config migratecmd.Config)
+ ```
+
+- ⚠️ Changed the type of `subscriptions.Message.Data` from `string` to `[]byte` because `Data` usually is a json bytes slice anyway.
+
+- ⚠️ Renamed `models.RequestData` to `models.RequestInfo` and soft-deprecated `apis.RequestData(c)` in favor of `apis.RequestInfo(c)` to avoid the stuttering with the `Data` field.
+ _The old `apis.RequestData()` method still works to minimize the breaking changes but it is recommended to replace it with `apis.RequestInfo(c)`._
+
+- ⚠️ Changes to the List/Search APIs
+ - Added new query parameter `?skipTotal=1` to skip the `COUNT` query performed with the list/search actions ([#2965](https://github.com/pocketbase/pocketbase/discussions/2965)).
+ If `?skipTotal=1` is set, the response fields `totalItems` and `totalPages` will have `-1` value (this is to avoid having different JSON responses and to differentiate from the zero default).
+ With the latest JS SDK 0.16+ and Dart SDK v0.11+ versions `skipTotal=1` is set by default for the `getFirstListItem()` and `getFullList()` requests.
+
+  - The count and regular select statements also now execute concurrently, meaning that we no longer normalize the `page` parameter and, in case the user requests a page that doesn't exist (eg. `?page=99999999`), we'll return an empty `items` array.
+
+ - Reverted the default `COUNT` column to `id` as there are some common situations where it can negatively impact the query performance.
+    Additionally, from this version we also set `PRAGMA temp_store = MEMORY`, which also helps with the temp B-TREE creation when `id` is used.
+    _There are still scenarios where `COUNT` queries with `rowid` execute faster, but the majority of the time when nested relations lookups are used it seems to have the opposite effect (at least based on the benchmarks dataset)._
+
+- ⚠️ Disallowed relations to views **from non-view** collections ([#3000](https://github.com/pocketbase/pocketbase/issues/3000)).
+ The change was necessary because I wasn't able to find an efficient way to track view changes and the previous behavior could have too many unexpected side-effects (eg. view with computed ids).
+ There is a system migration that will convert the existing view `relation` fields to `json` (multiple) and `text` (single) fields.
+ This could be a breaking change if you have `relation` to view and use `expand` or some of the `relation` view fields as part of a collection rule.
+
+- ⚠️ Added an extra `action` argument to the `Dao` hooks to allow skipping the default persist behavior.
+ In preparation for the logs generalization, the `Dao.After*Func` methods now also allow returning an error.
+
+- Allowed `0` as `RelationOptions.MinSelect` value to avoid the ambiguity between 0 and non-filled input value ([#2817](https://github.com/pocketbase/pocketbase/discussions/2817)).
+
+- Fixed zero-default value not being used if the field is not explicitly set when manually creating records ([#2992](https://github.com/pocketbase/pocketbase/issues/2992)).
+  Additionally, `record.Get(field)` will now always return a normalized value (the same as in the json serialization) for consistency and to avoid ambiguities with what is stored in the related DB table.
+ The schema fields columns `DEFAULT` definition was also updated for new collections to ensure that `NULL` values can't be accidentally inserted.
+
+- Fixed `migrate down` not returning the correct `lastAppliedMigrations()` when the stored migration applied time is in seconds.
+
+- Fixed realtime delete event to be called after the record was deleted from the DB (_including transactions and cascade delete operations_).
+
+- Other minor fixes and improvements (typos and grammar fixes, updated dependencies, removed unnecessary 404 error check in the Admin UI, etc.).
+
+
+## v0.16.10
+
+- Added multiple valued fields (`relation`, `select`, `file`) normalizations to ensure that the zero-default value of a newly created multiple field is applied for already existing data ([#2930](https://github.com/pocketbase/pocketbase/issues/2930)).
+
+
+## v0.16.9
+
+- Register the `eagerRequestInfoCache` middleware only for the internal `api` group routes to avoid conflicts with custom route handlers ([#2914](https://github.com/pocketbase/pocketbase/issues/2914)).
+
+
+## v0.16.8
+
+- Fixed unique validator detailed error message not being returned when camelCase field name is used ([#2868](https://github.com/pocketbase/pocketbase/issues/2868)).
+
+- Updated the index parser to allow no space between the table name and the columns list ([#2864](https://github.com/pocketbase/pocketbase/discussions/2864#discussioncomment-6373736)).
+
+- Updated go deps.
+
+
+## v0.16.7
+
+- Minor optimization for the list/search queries to use `rowid` with the `COUNT` statement when available.
+ _This eliminates the temp B-TREE step when executing the query and for large datasets (eg. 150k) it could have 10x improvement (from ~580ms to ~60ms)._
+
+
+## v0.16.6
+
+- Fixed collection index column sort normalization in the Admin UI ([#2681](https://github.com/pocketbase/pocketbase/pull/2681); thanks @SimonLoir).
+
+- Removed unnecessary admins count in `apis.RequireAdminAuthOnlyIfAny()` middleware ([#2726](https://github.com/pocketbase/pocketbase/pull/2726); thanks @svekko).
+
+- Fixed `multipart/form-data` request bind not populating map array values ([#2763](https://github.com/pocketbase/pocketbase/discussions/2763#discussioncomment-6278902)).
+
+- Upgraded npm and Go dependencies.
+
+
+## v0.16.5
+
+- Fixed the Admin UI serialization of implicit relation display fields ([#2675](https://github.com/pocketbase/pocketbase/issues/2675)).
+
+- Reset the Admin UI sort in case the active sort collection field is renamed or deleted.
+
+
+## v0.16.4
+
+- Fixed the selfupdate command not working on Windows due to missing `.exe` in the extracted binary path ([#2589](https://github.com/pocketbase/pocketbase/discussions/2589)).
+  _Note that the command on Windows will work from v0.16.4+ onwards, meaning that you will still have to update manually one more time to v0.16.4._
+
+- Added `int64`, `int32`, `uint`, `uint64` and `uint32` support when scanning `types.DateTime` ([#2602](https://github.com/pocketbase/pocketbase/discussions/2602)).
+
+- Updated dependencies.
+
+
+## v0.16.3
+
+- Fixed schema fields sort not working on Safari/Gnome Web ([#2567](https://github.com/pocketbase/pocketbase/issues/2567)).
+
+- Fixed default `PRAGMA`s not being applied for new connections ([#2570](https://github.com/pocketbase/pocketbase/discussions/2570)).
+
+
+## v0.16.2
+
+- Fixed backups archive not excluding the local `backups` directory on Windows ([#2548](https://github.com/pocketbase/pocketbase/discussions/2548#discussioncomment-5979712)).
+
+- Changed file field to not use `dataTransfer.effectAllowed` when dropping files since it is not reliable and consistent across different OS and browsers ([#2541](https://github.com/pocketbase/pocketbase/issues/2541)).
+
+- Auto register the initial generated snapshot migration to prevent incorrectly reapplying the snapshot on Docker restart ([#2551](https://github.com/pocketbase/pocketbase/discussions/2551)).
+
+- Fixed missing view id field error message typo.
+
+
+## v0.16.1
+
+- Fixed backup restore not working in a container environment when `pb_data` is mounted as volume ([#2519](https://github.com/pocketbase/pocketbase/issues/2519)).
+
+- Fixed Dart SDK realtime API preview example ([#2523](https://github.com/pocketbase/pocketbase/pull/2523); thanks @xFrann).
+
+- Fixed typo in the backups create panel ([#2526](https://github.com/pocketbase/pocketbase/pull/2526); thanks @dschissler).
+
+- Removed unnecessary slice length check in `list.ExistInSlice` ([#2527](https://github.com/pocketbase/pocketbase/pull/2527); thanks @KunalSin9h).
+
+- Avoid mutating the cached request data on OAuth2 user create ([#2535](https://github.com/pocketbase/pocketbase/discussions/2535)).
+
+- Fixed Export Collections "Download as JSON" ([#2540](https://github.com/pocketbase/pocketbase/issues/2540)).
+
+- Fixed file field drag and drop not working in Firefox and Safari ([#2541](https://github.com/pocketbase/pocketbase/issues/2541)).
+
+
+## v0.16.0
+
+- Added automated backups (_+ cron rotation_) APIs and UI for the `pb_data` directory.
+  The backups can also be initialized programmatically using `app.CreateBackup("backup.zip")`.
+  There is also an experimental restore method - `app.RestoreBackup("backup.zip")` (_currently works only on UNIX systems as it relies on execve_).
+ The backups can be stored locally or in external S3 storage (_it has its own configuration, separate from the file uploads storage filesystem_).
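+  _A minimal Go sketch (assuming the Go method also accepts a context as its first argument):_
+  ```go
+  if err := app.CreateBackup(context.Background(), "backup.zip"); err != nil {
+      log.Fatal(err)
+  }
+  ```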
+
+- Added option to limit the returned API fields using the `?fields` query parameter.
+ The "fields picker" is applied for `SearchResult.Items` and every other JSON response. For example:
+ ```js
+ // original: {"id": "RECORD_ID", "name": "abc", "description": "...something very big...", "items": ["id1", "id2"], "expand": {"items": [{"id": "id1", "name": "test1"}, {"id": "id2", "name": "test2"}]}}
+ // output: {"name": "abc", "expand": {"items": [{"name": "test1"}, {"name": "test2"}]}}
+ const result = await pb.collection("example").getOne("RECORD_ID", {
+ expand: "items",
+ fields: "name,expand.items.name",
+ })
+ ```
+
+- Added new `./pocketbase update` command to selfupdate the prebuilt executable (with option to generate a backup of your `pb_data`).
+
+- Added new `./pocketbase admin` console command:
+ ```sh
+  # creates a new admin account
+  ./pocketbase admin create test@example.com 123456890
+
+  # changes the password of an existing admin account
+  ./pocketbase admin update test@example.com 0987654321
+
+  # deletes a single admin account (if it exists)
+  ./pocketbase admin delete test@example.com
+ ```
+
+- Added `apis.Serve(app, options)` helper to allow starting the API server programmatically.
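+  _A minimal sketch (the exact option field names here are assumptions):_
+  ```go
+  err := apis.Serve(app, &apis.ServeOptions{
+      HttpAddr:        "127.0.0.1:8090",
+      ShowStartBanner: true,
+  })
+  if err != nil {
+      log.Fatal(err)
+  }
+  ```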
+
+- Updated the schema fields Admin UI for "tidier" fields visualization.
+
+- Updated the logs "real" user IP to check for the `Fly-Client-IP` header and changed the `X-Forwarded-For` header handling to use the first non-empty leftmost-ish IP, as it is the closest to the "real IP".
+
+- Added new `tools/archive` helper subpackage for managing archives (_currently works only with zip_).
+
+- Added new `tools/cron` helper subpackage for scheduling tasks using a cron-like syntax (_this eventually may get exported in the future in a separate repo_).
+
+- Added new `Filesystem.List(prefix)` helper to retrieve a flat list with all files under the provided prefix.
+
+- Added new `App.NewBackupsFilesystem()` helper to create a dedicated filesystem abstraction for managing app data backups.
+
+- Added new `App.OnTerminate()` hook (_executed right before app termination, eg. on `SIGTERM` signal_).
+
+- Added `accept` file field attribute with the field MIME types ([#2466](https://github.com/pocketbase/pocketbase/pull/2466); thanks @Nikhil1920).
+
+- Added support for multiple files sort in the Admin UI ([#2445](https://github.com/pocketbase/pocketbase/issues/2445)).
+
+- Added support for multiple relations sort in the Admin UI.
+
+- Added `meta.isNew` to the OAuth2 auth JSON response to indicate a newly created (via OAuth2) PocketBase user.
diff --git a/core/pb/LICENSE.md b/core/pb/LICENSE.md
new file mode 100644
index 0000000..e3b8465
--- /dev/null
+++ b/core/pb/LICENSE.md
@@ -0,0 +1,17 @@
+The MIT License (MIT)
+Copyright (c) 2022 - present, Gani Georgiev
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software
+and associated documentation files (the "Software"), to deal in the Software without restriction,
+including without limitation the rights to use, copy, modify, merge, publish, distribute,
+sublicense, and/or sell copies of the Software, and to permit persons to whom the Software
+is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or
+substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
+BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
+DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/client/pb/README.md b/core/pb/README.md
similarity index 100%
rename from client/pb/README.md
rename to core/pb/README.md
diff --git a/client/pb/pb_hooks/main.pb.js b/core/pb/pb_hooks/main.pb.js
similarity index 100%
rename from client/pb/pb_hooks/main.pb.js
rename to core/pb/pb_hooks/main.pb.js
diff --git a/client/pb/pb_migrations/1712449900_created_article_translation.js b/core/pb/pb_migrations/1712449900_created_article_translation.js
similarity index 100%
rename from client/pb/pb_migrations/1712449900_created_article_translation.js
rename to core/pb/pb_migrations/1712449900_created_article_translation.js
diff --git a/client/pb/pb_migrations/1712450012_created_articles.js b/core/pb/pb_migrations/1712450012_created_articles.js
similarity index 100%
rename from client/pb/pb_migrations/1712450012_created_articles.js
rename to core/pb/pb_migrations/1712450012_created_articles.js
diff --git a/client/pb/pb_migrations/1712450207_updated_article_translation.js b/core/pb/pb_migrations/1712450207_updated_article_translation.js
similarity index 100%
rename from client/pb/pb_migrations/1712450207_updated_article_translation.js
rename to core/pb/pb_migrations/1712450207_updated_article_translation.js
diff --git a/client/pb/pb_migrations/1712450442_created_insights.js b/core/pb/pb_migrations/1712450442_created_insights.js
similarity index 100%
rename from client/pb/pb_migrations/1712450442_created_insights.js
rename to core/pb/pb_migrations/1712450442_created_insights.js
diff --git a/client/pb/pb_migrations/1713321985_created_roleplays.js b/core/pb/pb_migrations/1713321985_created_roleplays.js
similarity index 100%
rename from client/pb/pb_migrations/1713321985_created_roleplays.js
rename to core/pb/pb_migrations/1713321985_created_roleplays.js
diff --git a/client/pb/pb_migrations/1713322324_created_sites.js b/core/pb/pb_migrations/1713322324_created_sites.js
similarity index 100%
rename from client/pb/pb_migrations/1713322324_created_sites.js
rename to core/pb/pb_migrations/1713322324_created_sites.js
diff --git a/client/pb/pb_migrations/1713328405_updated_sites.js b/core/pb/pb_migrations/1713328405_updated_sites.js
similarity index 100%
rename from client/pb/pb_migrations/1713328405_updated_sites.js
rename to core/pb/pb_migrations/1713328405_updated_sites.js
diff --git a/client/pb/pb_migrations/1713329959_updated_sites.js b/core/pb/pb_migrations/1713329959_updated_sites.js
similarity index 100%
rename from client/pb/pb_migrations/1713329959_updated_sites.js
rename to core/pb/pb_migrations/1713329959_updated_sites.js
diff --git a/core/pb/pb_migrations/1714803585_updated_articles.js b/core/pb/pb_migrations/1714803585_updated_articles.js
new file mode 100644
index 0000000..453e21f
--- /dev/null
+++ b/core/pb/pb_migrations/1714803585_updated_articles.js
@@ -0,0 +1,44 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("lft7642skuqmry7")
+
+ // update
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "iorna912",
+ "name": "content",
+ "type": "text",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "min": null,
+ "max": null,
+ "pattern": ""
+ }
+ }))
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("lft7642skuqmry7")
+
+ // update
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "iorna912",
+ "name": "content",
+ "type": "text",
+ "required": true,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "min": null,
+ "max": null,
+ "pattern": ""
+ }
+ }))
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1714835361_updated_insights.js b/core/pb/pb_migrations/1714835361_updated_insights.js
new file mode 100644
index 0000000..eb29b5b
--- /dev/null
+++ b/core/pb/pb_migrations/1714835361_updated_insights.js
@@ -0,0 +1,31 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ // add
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "d13734ez",
+ "name": "tag",
+ "type": "text",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "min": null,
+ "max": null,
+ "pattern": ""
+ }
+ }))
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ // remove
+ collection.schema.removeField("d13734ez")
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1714955881_updated_articles.js b/core/pb/pb_migrations/1714955881_updated_articles.js
new file mode 100644
index 0000000..1989cb4
--- /dev/null
+++ b/core/pb/pb_migrations/1714955881_updated_articles.js
@@ -0,0 +1,31 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("lft7642skuqmry7")
+
+ // add
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "pwy2iz0b",
+ "name": "source",
+ "type": "text",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "min": null,
+ "max": null,
+ "pattern": ""
+ }
+ }))
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("lft7642skuqmry7")
+
+ // remove
+ collection.schema.removeField("pwy2iz0b")
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1715823361_created_tags.js b/core/pb/pb_migrations/1715823361_created_tags.js
new file mode 100644
index 0000000..d252a58
--- /dev/null
+++ b/core/pb/pb_migrations/1715823361_created_tags.js
@@ -0,0 +1,51 @@
+///
+migrate((db) => {
+ const collection = new Collection({
+ "id": "nvf6k0yoiclmytu",
+ "created": "2024-05-16 01:36:01.108Z",
+ "updated": "2024-05-16 01:36:01.108Z",
+ "name": "tags",
+ "type": "base",
+ "system": false,
+ "schema": [
+ {
+ "system": false,
+ "id": "0th8uax4",
+ "name": "name",
+ "type": "text",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "min": null,
+ "max": null,
+ "pattern": ""
+ }
+ },
+ {
+ "system": false,
+ "id": "l6mm7m90",
+ "name": "activated",
+ "type": "bool",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {}
+ }
+ ],
+ "indexes": [],
+ "listRule": null,
+ "viewRule": null,
+ "createRule": null,
+ "updateRule": null,
+ "deleteRule": null,
+ "options": {}
+ });
+
+ return Dao(db).saveCollection(collection);
+}, (db) => {
+ const dao = new Dao(db);
+ const collection = dao.findCollectionByNameOrId("nvf6k0yoiclmytu");
+
+ return dao.deleteCollection(collection);
+})
diff --git a/core/pb/pb_migrations/1715824265_updated_insights.js b/core/pb/pb_migrations/1715824265_updated_insights.js
new file mode 100644
index 0000000..dd7d152
--- /dev/null
+++ b/core/pb/pb_migrations/1715824265_updated_insights.js
@@ -0,0 +1,52 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ // remove
+ collection.schema.removeField("d13734ez")
+
+ // add
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "j65p3jji",
+ "name": "tag",
+ "type": "relation",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "collectionId": "nvf6k0yoiclmytu",
+ "cascadeDelete": false,
+ "minSelect": null,
+ "maxSelect": null,
+ "displayFields": null
+ }
+ }))
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ // add
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "d13734ez",
+ "name": "tag",
+ "type": "text",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "min": null,
+ "max": null,
+ "pattern": ""
+ }
+ }))
+
+ // remove
+ collection.schema.removeField("j65p3jji")
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1715852342_updated_insights.js b/core/pb/pb_migrations/1715852342_updated_insights.js
new file mode 100644
index 0000000..6a6f8c2
--- /dev/null
+++ b/core/pb/pb_migrations/1715852342_updated_insights.js
@@ -0,0 +1,16 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ collection.listRule = "@request.auth.id != \"\" && @request.auth.tag:each ?~ tag:each"
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ collection.listRule = null
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1715852638_updated_insights.js b/core/pb/pb_migrations/1715852638_updated_insights.js
new file mode 100644
index 0000000..42efa86
--- /dev/null
+++ b/core/pb/pb_migrations/1715852638_updated_insights.js
@@ -0,0 +1,16 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ collection.viewRule = "@request.auth.id != \"\" && @request.auth.tag:each ?~ tag:each"
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ collection.viewRule = null
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1715852847_updated_users.js b/core/pb/pb_migrations/1715852847_updated_users.js
new file mode 100644
index 0000000..bfe64a3
--- /dev/null
+++ b/core/pb/pb_migrations/1715852847_updated_users.js
@@ -0,0 +1,33 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("_pb_users_auth_")
+
+ // add
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "8d9woe75",
+ "name": "tag",
+ "type": "relation",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "collectionId": "nvf6k0yoiclmytu",
+ "cascadeDelete": false,
+ "minSelect": null,
+ "maxSelect": null,
+ "displayFields": null
+ }
+ }))
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("_pb_users_auth_")
+
+ // remove
+ collection.schema.removeField("8d9woe75")
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1715852924_updated_articles.js b/core/pb/pb_migrations/1715852924_updated_articles.js
new file mode 100644
index 0000000..ff0501c
--- /dev/null
+++ b/core/pb/pb_migrations/1715852924_updated_articles.js
@@ -0,0 +1,33 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("lft7642skuqmry7")
+
+ // add
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "famdh2fv",
+ "name": "tag",
+ "type": "relation",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "collectionId": "nvf6k0yoiclmytu",
+ "cascadeDelete": false,
+ "minSelect": null,
+ "maxSelect": null,
+ "displayFields": null
+ }
+ }))
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("lft7642skuqmry7")
+
+ // remove
+ collection.schema.removeField("famdh2fv")
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1715852932_updated_articles.js b/core/pb/pb_migrations/1715852932_updated_articles.js
new file mode 100644
index 0000000..29b0cca
--- /dev/null
+++ b/core/pb/pb_migrations/1715852932_updated_articles.js
@@ -0,0 +1,18 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("lft7642skuqmry7")
+
+ collection.listRule = "@request.auth.id != \"\" && @request.auth.tag:each ?~ tag:each"
+ collection.viewRule = "@request.auth.id != \"\" && @request.auth.tag:each ?~ tag:each"
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("lft7642skuqmry7")
+
+ collection.listRule = null
+ collection.viewRule = null
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1715852952_updated_article_translation.js b/core/pb/pb_migrations/1715852952_updated_article_translation.js
new file mode 100644
index 0000000..f960931
--- /dev/null
+++ b/core/pb/pb_migrations/1715852952_updated_article_translation.js
@@ -0,0 +1,33 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("bc3g5s66bcq1qjp")
+
+ // add
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "lbxw5pra",
+ "name": "tag",
+ "type": "relation",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "collectionId": "nvf6k0yoiclmytu",
+ "cascadeDelete": false,
+ "minSelect": null,
+ "maxSelect": null,
+ "displayFields": null
+ }
+ }))
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("bc3g5s66bcq1qjp")
+
+ // remove
+ collection.schema.removeField("lbxw5pra")
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1715852974_updated_article_translation.js b/core/pb/pb_migrations/1715852974_updated_article_translation.js
new file mode 100644
index 0000000..b597bea
--- /dev/null
+++ b/core/pb/pb_migrations/1715852974_updated_article_translation.js
@@ -0,0 +1,18 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("bc3g5s66bcq1qjp")
+
+ collection.listRule = "@request.auth.id != \"\" && @request.auth.tag:each ?~ tag:each"
+ collection.viewRule = "@request.auth.id != \"\" && @request.auth.tag:each ?~ tag:each"
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("bc3g5s66bcq1qjp")
+
+ collection.listRule = null
+ collection.viewRule = null
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1716165809_updated_tags.js b/core/pb/pb_migrations/1716165809_updated_tags.js
new file mode 100644
index 0000000..7a9baf6
--- /dev/null
+++ b/core/pb/pb_migrations/1716165809_updated_tags.js
@@ -0,0 +1,44 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("nvf6k0yoiclmytu")
+
+ // update
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "0th8uax4",
+ "name": "name",
+ "type": "text",
+ "required": true,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "min": null,
+ "max": null,
+ "pattern": ""
+ }
+ }))
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("nvf6k0yoiclmytu")
+
+ // update
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "0th8uax4",
+ "name": "name",
+ "type": "text",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "min": null,
+ "max": null,
+ "pattern": ""
+ }
+ }))
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1716168332_updated_insights.js b/core/pb/pb_migrations/1716168332_updated_insights.js
new file mode 100644
index 0000000..aa03a18
--- /dev/null
+++ b/core/pb/pb_migrations/1716168332_updated_insights.js
@@ -0,0 +1,48 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ // update
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "j65p3jji",
+ "name": "tag",
+ "type": "relation",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "collectionId": "nvf6k0yoiclmytu",
+ "cascadeDelete": false,
+ "minSelect": null,
+ "maxSelect": 1,
+ "displayFields": null
+ }
+ }))
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("h3c6pqhnrfo4oyf")
+
+ // update
+ collection.schema.addField(new SchemaField({
+ "system": false,
+ "id": "j65p3jji",
+ "name": "tag",
+ "type": "relation",
+ "required": false,
+ "presentable": false,
+ "unique": false,
+ "options": {
+ "collectionId": "nvf6k0yoiclmytu",
+ "cascadeDelete": false,
+ "minSelect": null,
+ "maxSelect": null,
+ "displayFields": null
+ }
+ }))
+
+ return dao.saveCollection(collection)
+})
diff --git a/core/pb/pb_migrations/1717321896_updated_tags.js b/core/pb/pb_migrations/1717321896_updated_tags.js
new file mode 100644
index 0000000..9ddbbf8
--- /dev/null
+++ b/core/pb/pb_migrations/1717321896_updated_tags.js
@@ -0,0 +1,18 @@
+///
+migrate((db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("nvf6k0yoiclmytu")
+
+ collection.listRule = "@request.auth.id != \"\""
+ collection.viewRule = "@request.auth.id != \"\""
+
+ return dao.saveCollection(collection)
+}, (db) => {
+ const dao = new Dao(db)
+ const collection = dao.findCollectionByNameOrId("nvf6k0yoiclmytu")
+
+ collection.listRule = null
+ collection.viewRule = null
+
+ return dao.saveCollection(collection)
+})
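
Taken together, these migrations gate insights/articles/article_translation behind per-user tags. A minimal sketch (not part of the patch) of what that means for a client, using the same pocketbase Python SDK that `core/utils/pb_api.py` builds on; the url and credentials are placeholders:

```python
# Sketch only: url and credentials are placeholders, not values from this repo.
from pocketbase import PocketBase

client = PocketBase("http://127.0.0.1:8090")
# An unauthenticated request has @request.auth.id == "", so the new
# list/view rules reject it outright.
client.collection("users").auth_with_password("user@example.com", "secret")

# The server evaluates @request.auth.tag:each ?~ tag:each itself:
# only records whose tag relation overlaps the user's tags come back.
res = client.collection("insights").get_list(1, 50, {"skiptotal": True})
for item in res.items:
    print(item.id)
```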
diff --git a/client/backend/scrapers/README.md b/core/scrapers/README.md
similarity index 100%
rename from client/backend/scrapers/README.md
rename to core/scrapers/README.md
diff --git a/core/scrapers/__init__.py b/core/scrapers/__init__.py
new file mode 100644
index 0000000..008a714
--- /dev/null
+++ b/core/scrapers/__init__.py
@@ -0,0 +1,6 @@
+from .mp_crawler import mp_crawler
+from .simple_crawler import simple_crawler
+from .general_scraper import llm_crawler
+
+
+scraper_map = {}
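
`scraper_map` ships empty; presumably it is meant to map a site domain to a dedicated scraper so the fetch loop can prefer it over the general crawlers. A hypothetical registration (the domain-key scheme is an assumption, not something this diff defines):

```python
# Hypothetical: assumes scraper_map is keyed by domain; the diff leaves it empty.
from scrapers import scraper_map, mp_crawler

scraper_map["mp.weixin.qq.com"] = mp_crawler
```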
diff --git a/client/backend/scrapers/general_scraper.py b/core/scrapers/general_scraper.py
similarity index 93%
rename from client/backend/scrapers/general_scraper.py
rename to core/scrapers/general_scraper.py
index e753929..40030a5 100644
--- a/client/backend/scrapers/general_scraper.py
+++ b/core/scrapers/general_scraper.py
@@ -6,12 +6,16 @@ import httpx
from bs4 import BeautifulSoup
from bs4.element import Comment
from llms.dashscope_wrapper import dashscope_llm
+# from llms.openai_wrapper import openai_llm
+# from llms.siliconflow_wrapper import sfa_llm
from datetime import datetime, date
from requests.compat import urljoin
import chardet
-from general_utils import extract_and_convert_dates
+from utils.general_utils import extract_and_convert_dates
+model = "qwen-long"
+# model = "deepseek-chat"
header = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/604.1 Edg/112.0.100.0'}
@@ -62,7 +66,7 @@ def parse_html_content(out: str) -> dict:
# qwen1.5-72b解析json格式太容易出错,网页上的情况太多,比如经常直接使用英文的",这样后面json.loads就容易出错……
-sys_info = '''你是一个html网页解析器,你将接收一段用户从网页html文件中提取的文本,请解析出其标题、摘要、内容和发布日期,发布日期格式为YYYY-MM-DD。
+sys_info = '''你是一个html解析器,你将接收一段html代码,请解析出其标题、摘要、内容和发布日期,发布日期格式为YYYY-MM-DD。
结果请按照以下格式返回(整体用三引号包裹):
"""
标题||摘要||内容||发布日期XXXX-XX-XX
@@ -105,7 +109,8 @@ def llm_crawler(url: str | Path, logger) -> (int, dict):
{"role": "system", "content": sys_info},
{"role": "user", "content": html_text}
]
- llm_output = dashscope_llm(messages, "qwen1.5-72b-chat", logger=logger)
+ llm_output = dashscope_llm(messages, model=model, logger=logger)
+ # llm_output = openai_llm(messages, model=model, logger=logger)
try:
info = parse_html_content(llm_output)
except Exception:
diff --git a/core/scrapers/mp_crawler.py b/core/scrapers/mp_crawler.py
new file mode 100644
index 0000000..9f50f35
--- /dev/null
+++ b/core/scrapers/mp_crawler.py
@@ -0,0 +1,109 @@
+import httpx
+from bs4 import BeautifulSoup
+from datetime import datetime
+import re
+
+
+header = {
+ 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/604.1 Edg/112.0.100.0'}
+
+
+def mp_crawler(url: str, logger) -> (int, dict):
+ if not url.startswith('https://mp.weixin.qq.com') and not url.startswith('http://mp.weixin.qq.com'):
+ logger.warning(f'{url} is not a mp url, you should not use this function')
+ return -5, {}
+
+ url = url.replace("http://", "https://", 1)
+
+ try:
+ with httpx.Client() as client:
+ response = client.get(url, headers=header, timeout=30)
+ except Exception as e:
+ logger.warning(f"cannot get content from {url}\n{e}")
+ return -7, {}
+
+ soup = BeautifulSoup(response.text, 'html.parser')
+
+ # Get the original release date first
+ pattern = r"var createTime = '(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}'"
+ match = re.search(pattern, response.text)
+
+ if match:
+ date_only = match.group(1)
+ publish_time = date_only.replace('-', '')
+ else:
+ publish_time = datetime.strftime(datetime.today(), "%Y%m%d")
+
+ # Get the description content from the <meta> tag
+ try:
+ meta_description = soup.find('meta', attrs={'name': 'description'})
+ summary = meta_description['content'].strip() if meta_description else ''
+ card_info = soup.find('div', id='img-content')
+ # Parse the required content from the <div> tag
+ rich_media_title = soup.find('h1', id='activity-name').text.strip() \
+ if soup.find('h1', id='activity-name') \
+ else soup.find('h1', class_='rich_media_title').text.strip()
+ profile_nickname = card_info.find('strong', class_='profile_nickname').text.strip() \
+ if card_info \
+ else soup.find('div', class_='wx_follow_nickname').text.strip()
+ except Exception as e:
+ logger.warning(f"not mp format: {url}\n{e}")
+ return -7, {}
+
+ if not rich_media_title or not profile_nickname:
+ logger.warning(f"failed to analysis {url}, no title or profile_nickname")
+ # For mp.weixin.qq.com types, mp_crawler won't work, and most likely neither will the other two
+ return -7, {}
+
+ # Parse text and image links within the content interval
+ # TODO: this scheme also covers picture-sharing MP articles, but their inline images cannot be captured,
+ # because that layout is completely different and would need a separate parsing scheme
+ # (articles of this type are only a small share, though).
+ texts = []
+ images = set()
+ content_area = soup.find('div', id='js_content')
+ if content_area:
+ # extract the text
+ for section in content_area.find_all(['section', 'p'], recursive=False):  # walk the top-level sections
+ text = section.get_text(separator=' ', strip=True)
+ if text and text not in texts:
+ texts.append(text)
+
+ for img in content_area.find_all('img', class_='rich_pages wxw-img'):
+ img_src = img.get('data-src') or img.get('src')
+ if img_src:
+ images.add(img_src)
+ cleaned_texts = [t for t in texts if t.strip()]
+ content = '\n'.join(cleaned_texts)
+ else:
+ logger.warning(f"failed to analysis contents {url}")
+ return 0, {}
+ if content:
+ content = f"({profile_nickname} 文章){content}"
+ else:
+ # If there is no content but the summary exists, this is a picture-sharing MP article;
+ # in that case, use the summary as the content.
+ content = f"({profile_nickname} 文章){summary}"
+
+ # Get image links from the og:image and twitter:image meta properties
+ og_image = soup.find('meta', property='og:image')
+ twitter_image = soup.find('meta', property='twitter:image')
+ if og_image:
+ images.add(og_image['content'])
+ if twitter_image:
+ images.add(twitter_image['content'])
+
+ if rich_media_title == summary or not summary:
+ abstract = ''
+ else:
+ abstract = f"({profile_nickname} 文章){rich_media_title}——{summary}"
+
+ return 11, {
+ 'title': rich_media_title,
+ 'author': profile_nickname,
+ 'publish_time': publish_time,
+ 'abstract': abstract,
+ 'content': content,
+ 'images': list(images),
+ 'url': url,
+ }
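
A quick usage sketch for `mp_crawler` (not part of the patch); it assumes `core/` is on the path with its dependencies installed, and the article url is a placeholder:

```python
from loguru import logger

from scrapers.mp_crawler import mp_crawler

# placeholder url; any non-mp.weixin.qq.com url returns flag -5
flag, article = mp_crawler("https://mp.weixin.qq.com/s/placeholder", logger)
if flag == 11:
    print(article["title"], article["publish_time"])
else:
    print(f"crawl failed, flag {flag}")
```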
diff --git a/client/backend/scrapers/simple_crawler.py b/core/scrapers/simple_crawler.py
similarity index 97%
rename from client/backend/scrapers/simple_crawler.py
rename to core/scrapers/simple_crawler.py
index 3f6593d..26e70c5 100644
--- a/client/backend/scrapers/simple_crawler.py
+++ b/core/scrapers/simple_crawler.py
@@ -3,7 +3,7 @@ import httpx
from bs4 import BeautifulSoup
from datetime import datetime
from pathlib import Path
-from general_utils import extract_and_convert_dates
+from utils.general_utils import extract_and_convert_dates
import chardet
diff --git a/core/tasks.py b/core/tasks.py
new file mode 100644
index 0000000..df87646
--- /dev/null
+++ b/core/tasks.py
@@ -0,0 +1,152 @@
+"""
+Edit this script to customize the background tasks you need.
+"""
+import schedule
+import time
+from topnews import pipeline
+from loguru import logger
+from utils.pb_api import PbTalker
+import os
+from utils.general_utils import get_logger_level
+from datetime import datetime, timedelta
+import pytz
+import requests
+
+
+project_dir = os.environ.get("PROJECT_DIR", "")
+if project_dir:
+ os.makedirs(project_dir, exist_ok=True)
+logger_file = os.path.join(project_dir, 'tasks.log')
+dsw_log = get_logger_level()
+logger.add(
+ logger_file,
+ level=dsw_log,
+ backtrace=True,
+ diagnose=True,
+ rotation="50 MB"
+)
+
+pb = PbTalker(logger)
+utc_now = datetime.now(pytz.utc)
+# subtract one day to get yesterday's UTC time
+utc_yesterday = utc_now - timedelta(days=1)
+utc_last = utc_yesterday.strftime("%Y-%m-%d %H:%M:%S")
+
+
+def task():
+ """
+ global counter
+ sites = pb.read('sites', filter='activated=True')
+ urls = []
+ for site in sites:
+ if not site['per_hours'] or not site['url']:
+ continue
+ if counter % site['per_hours'] == 0:
+ urls.append(site['url'])
+ logger.info(f'\033[0;32m task execute loop {counter}\033[0m')
+ logger.info(urls)
+ if urls:
+ sp(sites=urls)
+ else:
+ if counter % 24 == 0:
+ sp()
+ else:
+ print('\033[0;33mno work for this loop\033[0m')
+ counter += 1
+ """
+ global utc_last
+ logger.debug(f'last_collect_time: {utc_last}')
+ datas = pb.read(collection_name='insights', filter=f'updated>="{utc_last}"', fields=['id', 'content', 'tag', 'articles'])
+ logger.debug(f"got {len(datas)} items")
+ utc_last = datetime.now(pytz.utc).strftime("%Y-%m-%d %H:%M:%S")
+ logger.debug(f'now_utc_time: {utc_last}')
+
+ tags = pb.read(collection_name='tags', filter='activated=True')
+ tags_dict = {item["id"]: item["name"] for item in tags if item["name"]}
+ top_news = {}
+ for id, name in tags_dict.items():
+ logger.debug(f'tag: {name}')
+ data = [item for item in datas if item['tag'] == id]
+ topnew = pipeline(data, logger)
+ if not topnew:
+ logger.debug(f'no top news for {name}')
+ continue
+
+ top_news[id] = {}
+ for content, articles in topnew.items():
+ content_urls = [pb.read('articles', filter=f'id="{a}"', fields=['url'])[0]['url'] for a in articles]
+ # remove overlapping content:
+ # if two entries overlap, drop the one that belongs to the tag with the longer story list
+ to_skip = False
+ for k, v in top_news.items():
+ to_del_key = None
+ for c, u in v.items():
+ if not set(content_urls).isdisjoint(set(u)):
+ if len(topnew) > len(v):
+ to_skip = True
+ else:
+ to_del_key = c
+ break
+ if to_del_key:
+ del top_news[k][to_del_key]
+ if to_skip:
+ break
+ if not to_skip:
+ top_news[id][content] = content_urls
+
+ if not top_news[id]:
+ del top_news[id]
+
+ if not top_news:
+ logger.info("no top news today")
+ return
+
+ # serialize to strings
+ top_news_text = {"#党建引领基层治理": [],
+ "#数字社区": [],
+ "#优秀活动案例": []}
+
+ for id, v in top_news.items():
+ # top_news[id] = {content: '\n\n'.join(urls) for content, urls in v.items()}
+ top_news[id] = {content: urls[0] for content, urls in v.items()}
+ if id == 's3kqj9ek8nvtthr':
+ top_news_text["#数字社区"].append("\n".join(f"{content}\n{urls}" for content, urls in top_news[id].items()))
+ elif id == 'qpcgotbqyz3a617':
+ top_news_text["#优秀活动案例"].append("\n".join(f"{content}\n{urls}" for content, urls in top_news[id].items()))
+ else:
+ top_news_text["#党建引领基层治理"].append("\n".join(f"{content}\n{urls}" for content, urls in top_news[id].items()))
+
+ top_news_text = {k: "\n".join(v) for k, v in top_news_text.items()}
+ top_news_text = "\n\n".join(f"{k}\n{v}" for k, v in top_news_text.items())
+ logger.info(top_news_text)
+
+ data = {
+ "wxid": "R:10860349446619856",
+ "content": top_news_text
+ }
+ try:
+ response = requests.post("http://localhost:8088/api/sendtxtmsg", json=data)
+ if response.status_code == 200:
+ logger.info("send message to wechat success")
+ time.sleep(1)
+ data = {
+ "wxid": "R:10860349446619856",
+ "content": "[太阳] 今日份的临小助内参来啦!",
+ "atlist": ["@all"]
+ }
+ try:
+ response = requests.post("http://localhost:8088/api/sendtxtmsg", json=data)
+ if response.status_code == 200:
+ logger.info("send notify to wechat success")
+ except Exception as e:
+ logger.error(f"send notify to wechat failed: {e}")
+ except Exception as e:
+ logger.error(f"send message to wechat failed: {e}")
+
+
+schedule.every().day.at("07:38").do(task)
+
+task()
+while True:
+ schedule.run_pending()
+ time.sleep(60)
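
The overlap-removal loop in `task()` is the densest part of the script: when two tags surface stories whose source urls intersect, the entry survives under the tag with the shorter story list and is dropped from the longer one. A standalone restatement with toy data (not the real pb records):

```python
# Toy data only: one accepted story under tag_a, plus a new candidate batch.
top_news = {"tag_a": {"story one": ["http://x/1", "http://x/2"]}}
candidate = {"story two": ["http://x/2", "http://x/3"]}  # shares http://x/2

accepted = {}
for content, urls in candidate.items():
    to_skip = False
    for tag, stories in top_news.items():
        to_del = None
        for c, u in stories.items():
            if set(urls) & set(u):
                # the longer list yields: skip the candidate batch if it is
                # bigger, otherwise drop the colliding entry already accepted
                if len(candidate) > len(stories):
                    to_skip = True
                else:
                    to_del = c
                break
        if to_del:
            del top_news[tag][to_del]
        if to_skip:
            break
    if not to_skip:
        accepted[content] = urls

# both batches hold one story, so tag_a's colliding entry is dropped
print(accepted)   # {'story two': ['http://x/2', 'http://x/3']}
print(top_news)   # {'tag_a': {}}
```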
diff --git a/core/utils/__init__.py b/core/utils/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/client/backend/general_utils.py b/core/utils/general_utils.py
similarity index 82%
rename from client/backend/general_utils.py
rename to core/utils/general_utils.py
index e5e03aa..6fc3a2e 100644
--- a/client/backend/general_utils.py
+++ b/core/utils/general_utils.py
@@ -6,6 +6,7 @@ from urllib.parse import urlparse
import time
import os
import re
+import jieba
def isURL(string):
@@ -13,6 +14,15 @@ def isURL(string):
return result.scheme != '' and result.netloc != ''
+def extract_urls(text):
+ url_pattern = re.compile(r'https?://[-A-Za-z0-9+&@#/%?=~_|!:.;]+[-A-Za-z0-9+&@#/%=~_|]')
+ urls = re.findall(url_pattern, text)
+
+ # filter out fragments that only matched 'www.' with nothing after it; keep only strings that parse as URLs
+ cleaned_urls = [url for url in urls if isURL(url)]
+ return cleaned_urls
+
+
def isChinesePunctuation(char):
# 定义中文标点符号的Unicode编码范围
chinese_punctuations = set(range(0x3000, 0x303F)) | set(range(0xFF00, 0xFFEF))
@@ -162,6 +172,29 @@ def get_logger_level() -> str:
return level_map.get(level, 'info')
+def compare_phrase_with_list(target_phrase, phrase_list, threshold):
+ """
+ 比较一个目标短语与短语列表中每个短语的相似度。
+
+ :param target_phrase: 目标短语 (str)
+ :param phrase_list: 短语列表 (list of str)
+ :param threshold: 相似度阈值 (float)
+ :return: 满足相似度条件的短语列表 (list of str)
+ """
+ # 检查目标短语是否为空
+ if not target_phrase:
+ return [] # 目标短语为空,直接返回空列表
+
+ # 预处理:对目标短语和短语列表中的每个短语进行分词
+ target_tokens = set(jieba.lcut(target_phrase))
+ tokenized_phrases = {phrase: set(jieba.lcut(phrase)) for phrase in phrase_list}
+
+ # 比较并筛选
+ similar_phrases = [phrase for phrase, tokens in tokenized_phrases.items()
+ if len(target_tokens & tokens) / min(len(target_tokens), len(tokens)) > threshold]
+
+ return similar_phrases
+
"""
# from InternLM/huixiangdou
# another awsome work
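
A usage sketch for `compare_phrase_with_list` (not part of the patch); it assumes `core/` is on the path and `jieba` is installed, and the phrases and threshold are arbitrary examples:

```python
from utils.general_utils import compare_phrase_with_list

# overlap ratio = |shared tokens| / min(token counts); a hit must exceed the threshold
hits = compare_phrase_with_list("数字社区治理", ["数字社区", "党建工作"], 0.5)
print(hits)  # likely ["数字社区"], depending on jieba's segmentation
```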
diff --git a/client/backend/pb_api.py b/core/utils/pb_api.py
similarity index 69%
rename from client/backend/pb_api.py
rename to core/utils/pb_api.py
index 2c6fac8..7f73f6e 100644
--- a/client/backend/pb_api.py
+++ b/core/utils/pb_api.py
@@ -13,28 +13,28 @@ class PbTalker:
self.client = PocketBase(url)
auth = os.environ.get('PB_API_AUTH', '')
if not auth or "|" not in auth:
- self.logger.warning(f"invalid email|password found, will handle with not auth, make sure you have set the collection rule by anyone")
+ self.logger.warning("invalid email|password found, will proceed without auth; make sure the collection rules allow anyone access")
else:
email, password = auth.split('|')
- _ = self.client.admins.auth_with_password(email, password)
- if _:
- self.logger.info(f"pocketbase ready authenticated as admin - {email}")
- else:
- raise Exception(f"pocketbase auth failed")
+ try:
+ admin_data = self.client.admins.auth_with_password(email, password)
+ if admin_data:
+ self.logger.info(f"pocketbase ready authenticated as admin - {email}")
+ except Exception:
+ user_data = self.client.collection("users").auth_with_password(email, password)
+ if user_data:
+ self.logger.info(f"pocketbase ready authenticated as user - {email}")
+ else:
+ raise Exception("pocketbase auth failed")
def read(self, collection_name: str, fields: list[str] = None, filter: str = '', skiptotal: bool = True) -> list:
results = []
for i in range(1, 10):
try:
- if fields:
- res = self.client.collection(collection_name).get_list(i, 500,
- {"filter": filter,
- "fields": ','.join(fields),
- "skiptotal": skiptotal})
- else:
- res = self.client.collection(collection_name).get_list(i, 500,
- {"filter": filter,
- "skiptotal": skiptotal})
+ res = self.client.collection(collection_name).get_list(i, 500,
+ {"filter": filter,
+ "fields": ','.join(fields) if fields else '',
+ "skiptotal": skiptotal})
except Exception as e:
self.logger.error(f"pocketbase get list failed: {e}")
@@ -79,3 +79,11 @@ class PbTalker:
self.logger.error(f"pocketbase update failed: {e}")
return ''
return res.id
+
+ def view(self, collection_name: str, item_id: str, fields: list[str] = None) -> dict:
+ try:
+ res = self.client.collection(collection_name).get_one(item_id,{"fields": ','.join(fields) if fields else ''})
+ return vars(res)
+ except Exception as e:
+ self.logger.error(f"pocketbase view item failed: {e}")
+ return {}
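
A minimal sketch of the reworked `PbTalker`, including the new `view()` helper (not part of the patch); it assumes `PB_API_BASE` and `PB_API_AUTH` are already set in the environment, and the collection contents are whatever your instance holds:

```python
from loguru import logger

from utils.pb_api import PbTalker

pb = PbTalker(logger)
tags = pb.read('tags', fields=['id', 'name'], filter='activated=True')
if tags:
    # view() fetches a single record by id and returns it as a plain dict
    print(pb.view('tags', tags[0]['id'], fields=['name']))
```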
diff --git a/dashboard/README.md b/dashboard/README.md
new file mode 100644
index 0000000..644c128
--- /dev/null
+++ b/dashboard/README.md
@@ -0,0 +1,71 @@
+**Included Web Dashboard Example**: This is optional. If you only use the data processing functions or have your own downstream task program, you can ignore everything in this folder!
+
+## Main Features
+
+1. Daily Insights Display
+2. Daily Article Display
+3. Appending Search for Specific Hot Topics (using the Sogou engine)
+4. Generating Word Reports for Specific Hot Topics
+
+**Note: The code here cannot be used directly. It is adapted to an older version of the backend. You need to study the latest backend code in the `core` folder and make changes, especially in parts related to database integration!**
+
+-----------------------------------------------------------------
+
+附带的web Dashboard 示例,并非必须,如果你只是使用数据处理功能,或者你有自己的下游任务程序,可以忽略这个文件夹内的一切!
+
+## 主要功能
+
+1. 每日insights展示
+2. 每日文章展示
+3. 指定热点追加搜索(使用sogou引擎)
+4. 指定热点生成word报告
+
+**注意:这里的代码并不能直接使用,它适配的是旧版本的后端程序,你需要研究core文件夹下的最新后端代码,进行更改,尤其是跟数据库对接的部分!**
+
+-----------------------------------------------------------------
+
+**付属のWebダッシュボードのサンプル**:これは必須ではありません。データ処理機能のみを使用する場合、または独自の下流タスクプログラムを持っている場合は、このフォルダ内のすべてを無視できます!
+
+## 主な機能
+
+1. 毎日のインサイト表示
+
+2. 毎日の記事表示
+
+3. 特定のホットトピックの追加検索(Sogouエンジンを使用)
+
+4. 特定のホットトピックのWordレポートの生成
+
+**注意:ここにあるコードは直接使用できません。古いバージョンのバックエンドに適合しています。`core`フォルダ内の最新のバックエンドコードを調べ、特にデータベースとの連携部分について変更を行う必要があります!**
+
+-----------------------------------------------------------------
+
+**Exemple de tableau de bord Web inclus** : Ceci est facultatif. Si vous n'utilisez que les fonctions de traitement des données ou si vous avez votre propre programme de tâches en aval, vous pouvez ignorer tout ce qui se trouve dans ce dossier !
+
+## Fonctions principales
+
+1. Affichage des insights quotidiens
+
+2. Affichage des articles quotidiens
+
+3. Recherche supplémentaire pour des sujets populaires spécifiques (en utilisant le moteur Sogou)
+
+4. Génération de rapports Word pour des sujets populaires spécifiques
+
+**Remarque : Le code ici ne peut pas être utilisé directement. Il est adapté à une version plus ancienne du backend. Vous devez étudier le code backend le plus récent dans le dossier `core` et apporter des modifications, en particulier dans les parties relatives à l'intégration de la base de données !**
+
+-----------------------------------------------------------------
+
+**Beispiel eines enthaltenen Web-Dashboards**: Dies ist optional. Wenn Sie nur die Datenverarbeitungsfunktionen verwenden oder Ihr eigenes Downstream-Aufgabenprogramm haben, können Sie alles in diesem Ordner ignorieren!
+
+## Hauptfunktionen
+
+1. Tägliche Einblicke anzeigen
+
+2. Tägliche Artikel anzeigen
+
+3. Angehängte Suche nach spezifischen Hot Topics (unter Verwendung der Sogou-Suchmaschine)
+
+4. Erstellen von Word-Berichten für spezifische Hot Topics
+
+**Hinweis: Der Code hier kann nicht direkt verwendet werden. Er ist an eine ältere Version des Backends angepasst. Sie müssen den neuesten Backend-Code im `core`-Ordner studieren und Änderungen vornehmen, insbesondere in den Teilen, die die Datenbankintegration betreffen!**
diff --git a/client/backend/__init__.py b/dashboard/__init__.py
similarity index 85%
rename from client/backend/__init__.py
rename to dashboard/__init__.py
index 0f5f128..ced14f9 100644
--- a/client/backend/__init__.py
+++ b/dashboard/__init__.py
@@ -2,7 +2,6 @@ import os
import time
import json
import uuid
-from pb_api import PbTalker
from get_report import get_report, logger, pb
from get_search import search_insight
from tranlsation_volcengine import text_translate
@@ -22,12 +21,6 @@ class BackendService:
logger.info('backend service init success.')
def report(self, insight_id: str, topics: list[str], comment: str) -> dict:
- """
- :param insight_id: insight在pb中的id
- :param topics: 书写报告的主题和大纲,必传,第一个值是标题,后面是段落标题,可以传空列表,AI就自由发挥
- :param comment: 修改意见,可以传‘’
- :return: 成功的话返回更新后的insight_id(其实跟原id一样), 不成功返回空字符
- """
logger.debug(f'got new report request insight_id {insight_id}')
insight = pb.read('insights', filter=f'id="{insight_id}"')
if not insight:
@@ -48,8 +41,6 @@ class BackendService:
return self.build_out(-2, f'{insight_id} has no valid articles')
content = insight[0]['content']
- # 这里把所有相关文章的content都要翻译成中文了,分析使用中文,因为涉及到部分专有词汇维护在火山的账户词典上,大模型并不了解
- # 发现翻译为中文后,引发灵积模型敏感词检测概率增加了,暂时放弃……
if insight_id in self.memory:
memory = self.memory[insight_id]
else:
@@ -78,11 +69,7 @@ class BackendService:
def translate(self, article_ids: list[str]) -> dict:
"""
- :param article_ids: 待翻译的文章id列表
- :return: 成功的话flag 11。负数为报错,但依然可能部分任务完成,可以稍后再次调用
- 返回中的msg记录了可能的错误
- 这个函数的作用是遍历列表中的id, 如果对应article——id中没有translation_result,则触发翻译,并更新article——id记录
- 执行本函数后,如果收到flag 11,则可以再次从pb中请求article-id对应的translation_result
+ just for Chinese users
"""
logger.debug(f'got new translate task {article_ids}')
flag = 11
@@ -155,10 +142,6 @@ class BackendService:
return self.build_out(flag, msg)
def more_search(self, insight_id: str) -> dict:
- """
- :param insight_id: insight在pb中的id
- :return: 成功的话返回更新后的insight_id(其实跟原id一样), 不成功返回空字符
- """
logger.debug(f'got search request for insight: {insight_id}')
insight = pb.read('insights', filter=f'id="{insight_id}"')
if not insight:
diff --git a/client/backend/backend.sh b/dashboard/backend.sh
similarity index 100%
rename from client/backend/backend.sh
rename to dashboard/backend.sh
diff --git a/dashboard/general_utils.py b/dashboard/general_utils.py
new file mode 100644
index 0000000..6e909b5
--- /dev/null
+++ b/dashboard/general_utils.py
@@ -0,0 +1,65 @@
+from urllib.parse import urlparse
+import os
+import re
+
+
+def isURL(string):
+ result = urlparse(string)
+ return result.scheme != '' and result.netloc != ''
+
+
+def isChinesePunctuation(char):
+ # Unicode ranges that cover Chinese punctuation marks
+ chinese_punctuations = set(range(0x3000, 0x303F)) | set(range(0xFF00, 0xFFEF))
+ # check whether the character falls within those ranges
+ return ord(char) in chinese_punctuations
+
+
+def is_chinese(string):
+ """
+ Volcano Engine actually supports much broader language detection; see https://www.volcengine.com/docs/4640/65066
+ Determine whether a string is mostly Chinese
+ :param string: {str} string to check
+ :return: {bool} True if the string is mostly Chinese, False otherwise
+ """
+ pattern = re.compile(r'[^\u4e00-\u9fa5]')
+ non_chinese_count = len(pattern.findall(string))
+ # Judging strictly by "less than half the bytes" misfires easily: English words take many bytes, and punctuation adds more
+ return (non_chinese_count/len(string)) < 0.68
+
+
+def extract_and_convert_dates(input_string):
+ # Define regular expressions that match different date formats
+ patterns = [
+ r'(\d{4})-(\d{2})-(\d{2})', # YYYY-MM-DD
+ r'(\d{4})/(\d{2})/(\d{2})', # YYYY/MM/DD
+ r'(\d{4})\.(\d{2})\.(\d{2})', # YYYY.MM.DD
+ r'(\d{4})\\(\d{2})\\(\d{2})', # YYYY\MM\DD
+ r'(\d{4})(\d{2})(\d{2})' # YYYYMMDD
+ ]
+
+ matches = []
+ for pattern in patterns:
+ matches = re.findall(pattern, input_string)
+ if matches:
+ break
+ if matches:
+ return ''.join(matches[0])
+ return None
+
+
+def get_logger_level() -> str:
+ level_map = {
+ 'silly': 'CRITICAL',
+ 'verbose': 'DEBUG',
+ 'info': 'INFO',
+ 'warn': 'WARNING',
+ 'error': 'ERROR',
+ }
+ level: str = os.environ.get('WS_LOG', 'info').lower()
+ if level not in level_map:
+ raise ValueError(
+ 'WiseFlow LOG should support the values of `silly`, '
+ '`verbose`, `info`, `warn`, `error`'
+ )
+ return level_map.get(level, 'info')
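
For reference, how these helpers behave (not part of the patch); the input string is an arbitrary example:

```python
from general_utils import extract_and_convert_dates, get_logger_level

print(extract_and_convert_dates("published 2024/05/16 10:00"))  # -> '20240516'
print(get_logger_level())  # maps WS_LOG (default 'info') to a loguru level name
```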
diff --git a/client/backend/get_report.py b/dashboard/get_report.py
similarity index 99%
rename from client/backend/get_report.py
rename to dashboard/get_report.py
index d3e6e17..c405ac6 100644
--- a/client/backend/get_report.py
+++ b/dashboard/get_report.py
@@ -1,7 +1,7 @@
import random
import re
import os
-from llms.dashscope_wrapper import dashscope_llm
+from backend.llms.dashscope_wrapper import dashscope_llm
from docx import Document
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor
diff --git a/client/backend/get_search.py b/dashboard/get_search.py
similarity index 66%
rename from client/backend/get_search.py
rename to dashboard/get_search.py
index 7b2c19b..12454ac 100644
--- a/client/backend/get_search.py
+++ b/dashboard/get_search.py
@@ -1,31 +1,21 @@
-from scrapers.simple_crawler import simple_crawler
+from .simple_crawler import simple_crawler
+from .mp_crawler import mp_crawler
from typing import Union
from pathlib import Path
import requests
import re
-import json
-from urllib.parse import quote, urlparse
+from urllib.parse import quote
from bs4 import BeautifulSoup
+import time
-# 国内的应用场景,sogou搜索应该不错了,还支持weixin、百科搜索
-# 海外的应用场景可以考虑使用duckduckgo或者google_search的sdk
-# 尽量还是不要自己host一个搜索引擎吧,虽然有类似https://github.com/StractOrg/stract/tree/main的开源方案,但毕竟这是两套工程
def search_insight(keyword: str, logger, exist_urls: list[Union[str, Path]], knowledge: bool = False) -> (int, list):
- """
- 搜索网页
- :param keyword: 要搜索的主题
- :param exist_urls: 已经存在的url列表,即这些网页已经存在,搜索结果中如果出现则跳过
- :param knowledge: 是否搜索知识
- :param logger: 日志
- :return: 返回文章信息列表list[dict]和flag,负数为报错,0为没有结果,11为成功
- """
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.44",
}
- # 如果knowledge参数为真,则意味着搜索概念知识,这时只会搜索sogou百科
- # 默认是搜索新闻资讯,同时搜索sogou网页和资讯
+ # If the knowledge parameter is true, the search targets conceptual knowledge, so only sogou baike (encyclopedia) is queried
+ # The default is news: sogou web pages and news results are searched together
if knowledge:
url = f"https://www.sogou.com/sogou?query={keyword}&insite=baike.sogou.com"
else:
@@ -74,24 +64,24 @@ def search_insight(keyword: str, logger, exist_urls: list[Union[str, Path]], kno
if not relist:
return -7, []
- # 这里仅使用simple_crawler, 因为search行为要快
results = []
for url in relist:
if url in exist_urls:
continue
exist_urls.append(url)
- flag, value = simple_crawler(url, logger)
+ if url.startswith('https://mp.weixin.qq.com') or url.startswith('http://mp.weixin.qq.com'):
+ flag, article = mp_crawler(url, logger)
+ if flag == -7:
+ logger.info(f"fetch {url} failed, try to wait 1min and try again")
+ time.sleep(60)
+ flag, article = mp_crawler(url, logger)
+ else:
+ flag, article = simple_crawler(url, logger)
+
if flag != 11:
continue
- from_site = urlparse(url).netloc
- if from_site.startswith('www.'):
- from_site = from_site.replace('www.', '')
- from_site = from_site.split('.')[0]
- if value['abstract']:
- value['abstract'] = f"({from_site} 报道){value['abstract']}"
- value['content'] = f"({from_site} 报道){value['content']}"
- value['images'] = json.dumps(value['images'])
- results.append(value)
+
+ results.append(article)
if results:
return 11, results
@@ -102,7 +92,7 @@ def redirect_url(url):
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
}
- r = requests.get(url, headers=headers, allow_redirects=False) # 不允许重定向
+ r = requests.get(url, headers=headers, allow_redirects=False)
if r.status_code == 302:
real_url = r.headers.get('Location')
else:
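
A hedged usage sketch of the reworked search flow (not part of the patch): weixin results go through `mp_crawler` (with a one-minute retry on flag -7), everything else through `simple_crawler`. The keyword is an arbitrary example:

```python
from loguru import logger

from get_search import search_insight

flag, articles = search_insight("数字社区", logger, exist_urls=[])
if flag == 11:
    for a in articles:
        print(a['url'], a['title'])
else:
    print(f"search failed, flag {flag}")
```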
diff --git a/client/backend/main.py b/dashboard/main.py
similarity index 73%
rename from client/backend/main.py
rename to dashboard/main.py
index 161ae0a..377bc0f 100644
--- a/client/backend/main.py
+++ b/dashboard/main.py
@@ -1,4 +1,3 @@
-# 这是后端服务的fastapi框架程序
from fastapi import FastAPI
from pydantic import BaseModel
from __init__ import BackendService
@@ -17,13 +16,13 @@ class TranslateRequest(BaseModel):
class ReportRequest(BaseModel):
insight_id: str
- toc: list[str] = [""] # 第一个元素为大标题,其余为段落标题。第一个元素必须存在,可以是空字符,llm会自动拟标题。
+ toc: list[str] = [""] # The first element is a headline, and the rest are paragraph headings. The first element must exist, can be a null character, and llm will automatically make headings.
comment: str = ""
app = FastAPI(
- title="首席情报官 Backend Server",
- description="From DSW Team.",
+ title="wiseflow Backend Server",
+ description="From WiseFlow Team.",
version="0.2",
openapi_url="/openapi.json"
)
@@ -36,13 +35,12 @@ app.add_middleware(
allow_headers=["*"],
)
-# 如果有多个后端服务,可以在这里定义多个后端服务的实例
bs = BackendService()
@app.get("/")
def read_root():
- msg = "Hello, 欢迎使用首席情报官 Backend."
+ msg = "Hello, This is WiseFlow Backend."
return {"msg": msg}
diff --git a/dashboard/mp_crawler.py b/dashboard/mp_crawler.py
new file mode 100644
index 0000000..9f50f35
--- /dev/null
+++ b/dashboard/mp_crawler.py
@@ -0,0 +1,109 @@
+import httpx
+from bs4 import BeautifulSoup
+from datetime import datetime
+import re
+
+
+header = {
+ 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/604.1 Edg/112.0.100.0'}
+
+
+def mp_crawler(url: str, logger) -> (int, dict):
+ if not url.startswith('https://mp.weixin.qq.com') and not url.startswith('http://mp.weixin.qq.com'):
+ logger.warning(f'{url} is not a mp url, you should not use this function')
+ return -5, {}
+
+ url = url.replace("http://", "https://", 1)
+
+ try:
+ with httpx.Client() as client:
+ response = client.get(url, headers=header, timeout=30)
+ except Exception as e:
+ logger.warning(f"cannot get content from {url}\n{e}")
+ return -7, {}
+
+ soup = BeautifulSoup(response.text, 'html.parser')
+
+ # Get the original release date first
+ pattern = r"var createTime = '(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}'"
+ match = re.search(pattern, response.text)
+
+ if match:
+ date_only = match.group(1)
+ publish_time = date_only.replace('-', '')
+ else:
+ publish_time = datetime.strftime(datetime.today(), "%Y%m%d")
+
+ # Get the description content from the <meta> tag
+ try:
+ meta_description = soup.find('meta', attrs={'name': 'description'})
+ summary = meta_description['content'].strip() if meta_description else ''
+ card_info = soup.find('div', id='img-content')
+ # Parse the required content from the <div> tag
+ rich_media_title = soup.find('h1', id='activity-name').text.strip() \
+ if soup.find('h1', id='activity-name') \
+ else soup.find('h1', class_='rich_media_title').text.strip()
+ profile_nickname = card_info.find('strong', class_='profile_nickname').text.strip() \
+ if card_info \
+ else soup.find('div', class_='wx_follow_nickname').text.strip()
+ except Exception as e:
+ logger.warning(f"not mp format: {url}\n{e}")
+ return -7, {}
+
+ if not rich_media_title or not profile_nickname:
+ logger.warning(f"failed to analysis {url}, no title or profile_nickname")
+ # For mp.weixin.qq.com types, mp_crawler won't work, and most likely neither will the other two
+ return -7, {}
+
+ # Parse text and image links within the content interval
+ # TODO: this scheme also covers picture-sharing MP articles, but their inline images cannot be captured,
+ # because that layout is completely different and would need a separate parsing scheme
+ # (articles of this type are only a small share, though).
+ texts = []
+ images = set()
+ content_area = soup.find('div', id='js_content')
+ if content_area:
+ # extract the text
+ for section in content_area.find_all(['section', 'p'], recursive=False):  # walk the top-level sections
+ text = section.get_text(separator=' ', strip=True)
+ if text and text not in texts:
+ texts.append(text)
+
+ for img in content_area.find_all('img', class_='rich_pages wxw-img'):
+ img_src = img.get('data-src') or img.get('src')
+ if img_src:
+ images.add(img_src)
+ cleaned_texts = [t for t in texts if t.strip()]
+ content = '\n'.join(cleaned_texts)
+ else:
+ logger.warning(f"failed to analysis contents {url}")
+ return 0, {}
+ if content:
+ content = f"({profile_nickname} 文章){content}"
+ else:
+ # If there is no content but the summary exists, this is a picture-sharing MP article;
+ # in that case, use the summary as the content.
+ content = f"({profile_nickname} 文章){summary}"
+
+ # Get image links from the og:image and twitter:image meta properties
+ og_image = soup.find('meta', property='og:image')
+ twitter_image = soup.find('meta', property='twitter:image')
+ if og_image:
+ images.add(og_image['content'])
+ if twitter_image:
+ images.add(twitter_image['content'])
+
+ if rich_media_title == summary or not summary:
+ abstract = ''
+ else:
+ abstract = f"({profile_nickname} 文章){rich_media_title}——{summary}"
+
+ return 11, {
+ 'title': rich_media_title,
+ 'author': profile_nickname,
+ 'publish_time': publish_time,
+ 'abstract': abstract,
+ 'content': content,
+ 'images': list(images),
+ 'url': url,
+ }
diff --git a/dashboard/simple_crawler.py b/dashboard/simple_crawler.py
new file mode 100644
index 0000000..21e2900
--- /dev/null
+++ b/dashboard/simple_crawler.py
@@ -0,0 +1,60 @@
+from gne import GeneralNewsExtractor
+import httpx
+from bs4 import BeautifulSoup
+from datetime import datetime
+from pathlib import Path
+from general_utils import extract_and_convert_dates
+import chardet
+
+
+extractor = GeneralNewsExtractor()
+header = {
+ 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/604.1 Edg/112.0.100.0'}
+
+
+def simple_crawler(url: str | Path, logger) -> (int, dict):
+ """
+ Return article information dict and flag, negative number is error, 0 is no result, 11 is success
+ """
+ try:
+ with httpx.Client() as client:
+ response = client.get(url, headers=header, timeout=30)
+ rawdata = response.content
+ encoding = chardet.detect(rawdata)['encoding']
+ text = rawdata.decode(encoding)
+ result = extractor.extract(text)
+ except Exception as e:
+ logger.warning(f"cannot get content from {url}\n{e}")
+ return -7, {}
+
+ if not result:
+ logger.error(f"gne cannot extract {url}")
+ return 0, {}
+
+ if len(result['title']) < 4 or len(result['content']) < 24:
+ logger.info(f"{result} not valid")
+ return 0, {}
+
+ if result['title'].startswith('服务器错误') or result['title'].startswith('您访问的页面') or result['title'].startswith('403')\
+ or result['content'].startswith('This website uses cookies') or result['title'].startswith('出错了'):
+ logger.warning(f"can not get {url} from the Internet")
+ return -7, {}
+
+ date_str = extract_and_convert_dates(result['publish_time'])
+ if date_str:
+ result['publish_time'] = date_str
+ else:
+ result['publish_time'] = datetime.strftime(datetime.today(), "%Y%m%d")
+
+ soup = BeautifulSoup(text, "html.parser")
+ try:
+ meta_description = soup.find("meta", {"name": "description"})
+ if meta_description:
+ result['abstract'] = meta_description["content"].strip()
+ else:
+ result['abstract'] = ''
+ except Exception:
+ result['abstract'] = ''
+
+ result['url'] = str(url)
+ return 11, result
diff --git a/client/backend/tranlsation_volcengine.py b/dashboard/tranlsation_volcengine.py
similarity index 81%
rename from client/backend/tranlsation_volcengine.py
rename to dashboard/tranlsation_volcengine.py
index c66de2b..7ea7bbe 100644
--- a/client/backend/tranlsation_volcengine.py
+++ b/dashboard/tranlsation_volcengine.py
@@ -1,10 +1,11 @@
-# 使用火山引擎进行翻译的接口封装
-# 通过环境变量设置VOLC_KEY,格式为AK|SK
-# AK-SK 需要手机号注册并实名认证,具体见这里https://console.volcengine.com/iam/keymanage/ (自助接入)
-# 费用:每月免费额度200万字符(1个汉字、1个外语字母、1个数字、1个符号或空格都计为一个字符),超出后49元/每百万字符
-# 图片翻译:每月免费100张,超出后0.04元/张
-# 文本翻译并发限制,每个batch最多16个,总文本长度不超过5000字符,max QPS为10
-# 术语库管理:https://console.volcengine.com/translate
+# Interface encapsulation for translation using Volcano Engine
+# Set VOLC_KEY via an environment variable in the format AK|SK
+# AK-SK requires mobile-number registration and real-name authentication, see https://console.volcengine.com/iam/keymanage/ (self-service access)
+# Cost: a free quota of 2 million characters per month (each Chinese character, foreign letter, digit, symbol or space counts as one character),
+# then 49 yuan per million characters beyond that
+# Image translation: 100 images per month free, then 0.04 yuan per image
+# Text translation concurrency limits: at most 16 texts per batch, total length no more than 5000 characters, max QPS 10
+# Terminology database management: https://console.volcengine.com/translate
import json
@@ -18,7 +19,7 @@ from volcengine.base.Service import Service
VOLC_KEY = os.environ.get('VOLC_KEY', None)
if not VOLC_KEY:
- raise Exception('请设置环境变量 VOLC_KEY,格式为AK|SK')
+ raise Exception('Please set the environment variable VOLC_KEY in the format AK|SK')
k_access_key, k_secret_key = VOLC_KEY.split('|')
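
The expected `VOLC_KEY` layout, for reference (placeholder values, not real credentials):

```python
import os

# placeholder keys; real AK/SK come from https://console.volcengine.com/iam/keymanage/
os.environ['VOLC_KEY'] = 'AK_xxxx|SK_yyyy'
k_access_key, k_secret_key = os.environ['VOLC_KEY'].split('|')
```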
diff --git a/client/web/.env.development b/dashboard/web/.env.development
similarity index 100%
rename from client/web/.env.development
rename to dashboard/web/.env.development
diff --git a/client/web/.env.production b/dashboard/web/.env.production
similarity index 100%
rename from client/web/.env.production
rename to dashboard/web/.env.production
diff --git a/client/web/.eslintrc.cjs b/dashboard/web/.eslintrc.cjs
similarity index 100%
rename from client/web/.eslintrc.cjs
rename to dashboard/web/.eslintrc.cjs
diff --git a/client/web/.gitignore b/dashboard/web/.gitignore
similarity index 100%
rename from client/web/.gitignore
rename to dashboard/web/.gitignore
diff --git a/client/web/README.md b/dashboard/web/README.md
similarity index 100%
rename from client/web/README.md
rename to dashboard/web/README.md
diff --git a/client/web/components.json b/dashboard/web/components.json
similarity index 100%
rename from client/web/components.json
rename to dashboard/web/components.json
diff --git a/client/web/index.html b/dashboard/web/index.html
similarity index 100%
rename from client/web/index.html
rename to dashboard/web/index.html
diff --git a/client/web/package.json b/dashboard/web/package.json
similarity index 100%
rename from client/web/package.json
rename to dashboard/web/package.json
diff --git a/client/web/pnpm-lock.yaml b/dashboard/web/pnpm-lock.yaml
similarity index 100%
rename from client/web/pnpm-lock.yaml
rename to dashboard/web/pnpm-lock.yaml
diff --git a/client/web/postcss.config.js b/dashboard/web/postcss.config.js
similarity index 100%
rename from client/web/postcss.config.js
rename to dashboard/web/postcss.config.js
diff --git a/client/web/public/vite.svg b/dashboard/web/public/vite.svg
similarity index 100%
rename from client/web/public/vite.svg
rename to dashboard/web/public/vite.svg
diff --git a/client/web/src/App.css b/dashboard/web/src/App.css
similarity index 100%
rename from client/web/src/App.css
rename to dashboard/web/src/App.css
diff --git a/client/web/src/App.jsx b/dashboard/web/src/App.jsx
similarity index 100%
rename from client/web/src/App.jsx
rename to dashboard/web/src/App.jsx
diff --git a/client/web/src/assets/react.svg b/dashboard/web/src/assets/react.svg
similarity index 100%
rename from client/web/src/assets/react.svg
rename to dashboard/web/src/assets/react.svg
diff --git a/client/web/src/components/article-list.jsx b/dashboard/web/src/components/article-list.jsx
similarity index 100%
rename from client/web/src/components/article-list.jsx
rename to dashboard/web/src/components/article-list.jsx
diff --git a/client/web/src/components/layout/step.jsx b/dashboard/web/src/components/layout/step.jsx
similarity index 100%
rename from client/web/src/components/layout/step.jsx
rename to dashboard/web/src/components/layout/step.jsx
diff --git a/client/web/src/components/screen/articles.jsx b/dashboard/web/src/components/screen/articles.jsx
similarity index 100%
rename from client/web/src/components/screen/articles.jsx
rename to dashboard/web/src/components/screen/articles.jsx
diff --git a/client/web/src/components/screen/insights.jsx b/dashboard/web/src/components/screen/insights.jsx
similarity index 100%
rename from client/web/src/components/screen/insights.jsx
rename to dashboard/web/src/components/screen/insights.jsx
diff --git a/client/web/src/components/screen/login.jsx b/dashboard/web/src/components/screen/login.jsx
similarity index 100%
rename from client/web/src/components/screen/login.jsx
rename to dashboard/web/src/components/screen/login.jsx
diff --git a/client/web/src/components/screen/report.jsx b/dashboard/web/src/components/screen/report.jsx
similarity index 100%
rename from client/web/src/components/screen/report.jsx
rename to dashboard/web/src/components/screen/report.jsx
diff --git a/client/web/src/components/screen/start.jsx b/dashboard/web/src/components/screen/start.jsx
similarity index 100%
rename from client/web/src/components/screen/start.jsx
rename to dashboard/web/src/components/screen/start.jsx
diff --git a/client/web/src/components/screen/steps.jsx b/dashboard/web/src/components/screen/steps.jsx
similarity index 100%
rename from client/web/src/components/screen/steps.jsx
rename to dashboard/web/src/components/screen/steps.jsx
diff --git a/client/web/src/components/ui/accordion.jsx b/dashboard/web/src/components/ui/accordion.jsx
similarity index 100%
rename from client/web/src/components/ui/accordion.jsx
rename to dashboard/web/src/components/ui/accordion.jsx
diff --git a/client/web/src/components/ui/banner.jsx b/dashboard/web/src/components/ui/banner.jsx
similarity index 100%
rename from client/web/src/components/ui/banner.jsx
rename to dashboard/web/src/components/ui/banner.jsx
diff --git a/client/web/src/components/ui/button-loading.jsx b/dashboard/web/src/components/ui/button-loading.jsx
similarity index 100%
rename from client/web/src/components/ui/button-loading.jsx
rename to dashboard/web/src/components/ui/button-loading.jsx
diff --git a/client/web/src/components/ui/button.jsx b/dashboard/web/src/components/ui/button.jsx
similarity index 100%
rename from client/web/src/components/ui/button.jsx
rename to dashboard/web/src/components/ui/button.jsx
diff --git a/client/web/src/components/ui/form.jsx b/dashboard/web/src/components/ui/form.jsx
similarity index 100%
rename from client/web/src/components/ui/form.jsx
rename to dashboard/web/src/components/ui/form.jsx
diff --git a/client/web/src/components/ui/input.jsx b/dashboard/web/src/components/ui/input.jsx
similarity index 100%
rename from client/web/src/components/ui/input.jsx
rename to dashboard/web/src/components/ui/input.jsx
diff --git a/client/web/src/components/ui/label.jsx b/dashboard/web/src/components/ui/label.jsx
similarity index 100%
rename from client/web/src/components/ui/label.jsx
rename to dashboard/web/src/components/ui/label.jsx
diff --git a/client/web/src/components/ui/textarea.jsx b/dashboard/web/src/components/ui/textarea.jsx
similarity index 100%
rename from client/web/src/components/ui/textarea.jsx
rename to dashboard/web/src/components/ui/textarea.jsx
diff --git a/client/web/src/components/ui/toast.jsx b/dashboard/web/src/components/ui/toast.jsx
similarity index 100%
rename from client/web/src/components/ui/toast.jsx
rename to dashboard/web/src/components/ui/toast.jsx
diff --git a/client/web/src/components/ui/toaster.jsx b/dashboard/web/src/components/ui/toaster.jsx
similarity index 100%
rename from client/web/src/components/ui/toaster.jsx
rename to dashboard/web/src/components/ui/toaster.jsx
diff --git a/client/web/src/components/ui/use-toast.js b/dashboard/web/src/components/ui/use-toast.js
similarity index 100%
rename from client/web/src/components/ui/use-toast.js
rename to dashboard/web/src/components/ui/use-toast.js
diff --git a/client/web/src/index.css b/dashboard/web/src/index.css
similarity index 100%
rename from client/web/src/index.css
rename to dashboard/web/src/index.css
diff --git a/client/web/src/lib/utils.js b/dashboard/web/src/lib/utils.js
similarity index 100%
rename from client/web/src/lib/utils.js
rename to dashboard/web/src/lib/utils.js
diff --git a/client/web/src/main.jsx b/dashboard/web/src/main.jsx
similarity index 100%
rename from client/web/src/main.jsx
rename to dashboard/web/src/main.jsx
diff --git a/client/web/src/store.js b/dashboard/web/src/store.js
similarity index 100%
rename from client/web/src/store.js
rename to dashboard/web/src/store.js
diff --git a/client/web/tailwind.config.js b/dashboard/web/tailwind.config.js
similarity index 100%
rename from client/web/tailwind.config.js
rename to dashboard/web/tailwind.config.js
diff --git a/client/web/tsconfig.json b/dashboard/web/tsconfig.json
similarity index 100%
rename from client/web/tsconfig.json
rename to dashboard/web/tsconfig.json
diff --git a/client/web/vite.config.js b/dashboard/web/vite.config.js
similarity index 100%
rename from client/web/vite.config.js
rename to dashboard/web/vite.config.js