Ethan's Blog

RAG Series: Simple RAG

作者 — Wed, 18 Jun 2025 00:00:00 +0800

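01. Simple RAG

Overview

Simple RAG (Retrieval-Augmented Generation) is the most basic form of retrieval-augmented generation. It embeds the user's query, retrieves the document chunks whose vectors are most similar to it, and passes those chunks as context to a large language model, which generates the answer.

Core code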

import fitz
import os
import numpy as np
import json
from openai import OpenAI

def extract_text_from_pdf(pdf_path):
    """
    Extracts all text from a PDF file, page by page.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text

def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
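    # Guard the step size: a step of 0 makes range() raise, a negative step yields no chunks
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("n must be positive and overlap must satisfy 0 <= overlap < n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):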
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks

# Initialize the OpenAI client with the base URL and API key
client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
)

def create_embeddings(text, model="BAAI/bge-en-icl"):
    """
    Creates embeddings for the given text using the specified OpenAI model.

    Args:
    text (str): The input text for which embeddings are to be created.
    model (str): The model to be used for creating embeddings. Default is "BAAI/bge-en-icl".

    Returns:
    CreateEmbeddingResponse: The API response object; each vector is in `response.data[i].embedding`.
    """
    # Create embeddings for the input text using the specified model
    response = client.embeddings.create(
        model=model,
        input=text
    )

    return response  # Return the response containing the embeddings

def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Performs semantic search on the text chunks using the given query and embeddings.

    Args:
    query (str): The query for the semantic search.
    text_chunks (List[str]): A list of text chunks to search through.
    embeddings (List[dict]): A list of embeddings for the text chunks.
    k (int): The number of top relevant text chunks to return. Default is 5.

    Returns:
    List[str]: A list of the top k most relevant text chunks based on the query.
    """
    # Create an embedding for the query
    query_embedding = create_embeddings(query).data[0].embedding
    similarity_scores = []  # Initialize a list to store similarity scores

    # Calculate similarity scores between the query embedding and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))  # Append the index and similarity score

    # Sort the similarity scores in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Get the indices of the top k most similar text chunks
    top_indices = [index for index, _ in similarity_scores[:k]]
    # Return the top k most relevant text chunks
    return [text_chunks[index] for index in top_indices]

def generate_response(system_prompt, user_message, model="meta-llama/Llama-3.2-3B-Instruct"):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The model to be used for generating the response. Default is "meta-llama/Llama-3.2-3B-Instruct".

    Returns:
    dict: The response from the AI model.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Full pipeline
def simple_rag_pipeline(pdf_path, query):
    # 1. Extract the PDF text
    extracted_text = extract_text_from_pdf(pdf_path)

    # 2. Split into 1000-character chunks with a 200-character overlap
    text_chunks = chunk_text(extracted_text, 1000, 200)

    # 3. Create embeddings for all chunks in a single request
    response = create_embeddings(text_chunks)

    # 4. Semantic search: keep the 2 chunks most similar to the query
    top_chunks = semantic_search(query, text_chunks, response.data, k=2)

    # 5. Generate the answer from the retrieved context
    system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"
    user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
    user_prompt = f"{user_prompt}\nQuestion: {query}"

    ai_response = generate_response(system_prompt, user_prompt)
    return ai_response.choices[0].message.content

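Code walkthrough

Document processing: extract the PDF text with PyMuPDF and split it into chunks by character count
Embedding generation: generate text embeddings with the BAAI/bge-en-icl model
Semantic search: compute the cosine similarity between the query and each chunk, then return the k most relevant chunks
Answer generation: pass the retrieved context and the user's question to the LLM to generate the answer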
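
To make the character-based chunking step concrete, here is a minimal sketch on a 26-character string. It assumes a chunk size of 10 and an overlap of 4, both chosen only for illustration; the input validation is an addition of this sketch.

```python
def chunk_text(text: str, n: int, overlap: int) -> list[str]:
    """Split text into n-character chunks; consecutive chunks share `overlap` characters."""
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("need n > 0 and 0 <= overlap < n")
    step = n - overlap  # each new chunk starts `step` characters after the previous one
    return [text[i:i + n] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, n=10, overlap=4)
for chunk in chunks:
    print(repr(chunk))
# 'abcdefghij'
# 'ghijklmnop'
# 'mnopqrstuv'
# 'stuvwxyz'
# 'yz'
```

Note that the final chunk `'yz'` is already fully contained in the one before it. With this stepping scheme, short trailing chunks like this are expected, and a real pipeline may want to drop or merge them.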

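Key features

Simple to implement, easy to understand and extend
Semantic retrieval based on cosine similarity
Supports PDF documents
Configurable number of retrieved chunks, k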
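
The cosine-similarity retrieval with a configurable k can be illustrated without calling any embedding API. The sketch below uses hand-made 2-D vectors in place of real embeddings; `top_k_by_cosine` is a hypothetical helper name, and it vectorizes the per-chunk loop with NumPy.

```python
import numpy as np


def top_k_by_cosine(query_vec, chunk_vecs, k):
    """Return (index, similarity) pairs for the k chunks most similar to the query."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    # Cosine similarity of every row of m against q, computed in one shot
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]


# Toy "embeddings": same direction, orthogonal, 45 degrees off, opposite
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]

results = top_k_by_cosine(query, chunk_vecs, k=2)
print(results)  # chunk 0 (similarity 1.0) first, then chunk 2 (about 0.707)
```

Because cosine similarity ignores vector length, the query `[2.0, 0.0]` matches chunk 0 perfectly even though their magnitudes differ.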

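Use cases

FAQ question answering
Small-business knowledge bases
Retrieval augmentation over structured documents
Basic document Q&A systems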

RAG Series: Semantic Chunking RAG

作者 — Wed, 18 Jun 2025 00:00:00 +0800

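02. Semantic Chunking RAG

Overview

Semantic chunking RAG decides where to split text by measuring the semantic similarity between adjacent sentences, instead of cutting at a fixed length. It uses a percentile, standard-deviation, or interquartile-range rule to find semantic breakpoints and splits the text into semantically coherent chunks, which improves retrieval precision.

Core code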

import fitz
import os
import numpy as np
import json
from openai import OpenAI

def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page in mypdf:
        # Extract text from the current page and add spacing
        all_text += page.get_text("text") + " "

    # Return the extracted text, stripped of leading/trailing whitespace
    return all_text.strip()

# Initialize the OpenAI client with the base URL and API key
client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
)

def get_embedding(text, model="BAAI/bge-en-icl"):
    """
    Creates an embedding for the given text using OpenAI.

    Args:
    text (str): Input text.
    model (str): Embedding model name.

    Returns:
    np.ndarray: The embedding vector.
    """
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(vec1, vec2):
    """
    Computes cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): First vector.
    vec2 (np.ndarray): Second vector.

    Returns:
    float: Cosine similarity.
    """
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def compute_breakpoints(similarities, method="percentile", threshold=90):
    """
    Computes chunking breakpoints based on similarity drops.

    Args:
    similarities (List[float]): List of similarity scores between sentences.
    method (str): 'percentile', 'standard_deviation', or 'interquartile'.
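    threshold (float): For 'percentile', splits occur at the lowest (100 - threshold)% of
        similarities; for 'standard_deviation', the number of std devs below the mean.
        Ignored by 'interquartile'.

    Returns:
    List[int]: Indices where chunk splits should occur.
    """
    # Determine the threshold value based on the selected method
    if method == "percentile":
        # A breakpoint is a similarity *drop*, so cut at the (100 - X)th percentile:
        # with threshold=90, only the lowest 10% of similarities become breakpoints
        threshold_value = np.percentile(similarities, 100 - threshold)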
    elif method == "standard_deviation":
        # Calculate the mean and standard deviation of the similarity scores
        mean = np.mean(similarities)
        std_dev = np.std(similarities)
        # Set the threshold value to mean minus X standard deviations
        threshold_value = mean - (threshold * std_dev)
    elif method == "interquartile":
        # Calculate the first and third quartiles (Q1 and Q3)
        q1, q3 = np.percentile(similarities, [25, 75])
        # Set the threshold value using the IQR rule for outliers
        threshold_value = q1 - 1.5 * (q3 - q1)
    else:
        # Raise an error if an invalid method is provided
        raise ValueError("Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.")

    # Identify indices where similarity drops below the threshold value
    return [i for i, sim in enumerate(similarities) if sim < threshold_value]

def split_into_chunks(sentences, breakpoints):
    """
    Splits sentences into semantic chunks.

    Args:
    sentences (List[str]): List of sentences.
    breakpoints (List[int]): Indices where chunking should occur.

    Returns:
    List[str]: List of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks
    start = 0  # Initialize the start index

    # Iterate through each breakpoint to create chunks
    for bp in breakpoints:
        # Append the chunk of sentences from start to the current breakpoint
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1  # Update the start index to the next sentence after the breakpoint

    # Append the remaining sentences as the last chunk
    chunks.append(". ".join(sentences[start:]))
    return chunks  # Return the list of chunks

def create_embeddings(text_chunks):
    """
    Creates embeddings for each text chunk.

    Args:
    text_chunks (List[str]): List of text chunks.

    Returns:
    List[np.ndarray]: List of embedding vectors.
    """
    # Generate embeddings for each text chunk using the get_embedding function
    return [get_embedding(chunk) for chunk in text_chunks]

def semantic_search(query, text_chunks, chunk_embeddings, k=5):
    """
    Finds the most relevant text chunks for a query.

    Args:
    query (str): Search query.
    text_chunks (List[str]): List of text chunks.
    chunk_embeddings (List[np.ndarray]): List of chunk embeddings.
    k (int): Number of top results to return.

    Returns:
    List[str]: Top-k relevant chunks.
    """
    # Generate an embedding for the query
    query_embedding = get_embedding(query)

    # Calculate cosine similarity between the query embedding and each chunk embedding
    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]

    # Get the indices of the top-k most similar chunks
    top_indices = np.argsort(similarities)[-k:][::-1]

    # Return the top-k most relevant text chunks
    return [text_chunks[i] for i in top_indices]

def generate_response(system_prompt, user_message, model="meta-llama/Llama-3.2-3B-Instruct"):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The model to be used for generating the response. Default is "meta-llama/Llama-3.2-3B-Instruct".

    Returns:
    dict: The response from the AI model.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Full pipeline
def semantic_chunking_rag_pipeline(pdf_path, query):
    # 1. Extract the PDF text
    extracted_text = extract_text_from_pdf(pdf_path)

    # 2. Split into sentences, dropping empty fragments that would waste embedding calls
    sentences = [s.strip() for s in extracted_text.split(". ") if s.strip()]

    # 3. Generate one embedding per sentence
    embeddings = [get_embedding(sentence) for sentence in sentences]

    # 4. Compute the similarity between each pair of adjacent sentences
    similarities = [cosine_similarity(embeddings[i], embeddings[i + 1]) for i in range(len(embeddings) - 1)]

    # 5. Compute the breakpoints (percentile method)
    breakpoints = compute_breakpoints(similarities, method="percentile", threshold=90)

    # 6. Split into semantic chunks
    text_chunks = split_into_chunks(sentences, breakpoints)

    # 7. Create the chunk embeddings
    chunk_embeddings = create_embeddings(text_chunks)

    # 8. Semantic search: keep the 2 chunks most similar to the query
    top_chunks = semantic_search(query, text_chunks, chunk_embeddings, k=2)

    # 9. Generate the answer from the retrieved context
    system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"
    user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
    user_prompt = f"{user_prompt}\nQuestion: {query}"

    ai_response = generate_response(system_prompt, user_prompt)
    return ai_response.choices[0].message.content

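Code walkthrough

Sentence splitting: split the text into sentences at periods
Embedding generation: generate a vector representation for each sentence
Similarity computation: compute the cosine similarity between adjacent sentences
Breakpoint detection: find semantic breakpoints with the percentile method
Semantic chunking: group sentences into semantic chunks according to the breakpoints
Retrieval and generation: retrieve over the semantic chunks and generate the answer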
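
The middle steps (adjacent similarity, breakpoint detection, grouping) can be traced on toy data. The sketch below substitutes hand-made 2-D vectors for real sentence embeddings and splits wherever the adjacent similarity falls below a low percentile. The function names and the `drop_percentile=25` value are assumptions made for this illustration.

```python
import numpy as np


def adjacent_similarities(vectors):
    """Cosine similarity between each vector and the one that follows it."""
    v = np.asarray(vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)  # normalize each row
    return np.sum(unit[:-1] * unit[1:], axis=1)


def split_at_drops(sentences, similarities, drop_percentile=25.0):
    """Start a new chunk after sentence i when similarity(i, i+1) is unusually low."""
    cutoff = np.percentile(similarities, drop_percentile)
    chunks, start = [], 0
    for i, sim in enumerate(similarities):
        if sim < cutoff:
            chunks.append(sentences[start:i + 1])
            start = i + 1
    chunks.append(sentences[start:])  # whatever remains forms the last chunk
    return chunks


sentences = ["Cats purr.", "Cats nap a lot.", "Kittens play.",
             "GPUs run matrix math.", "GPUs need cooling."]
# Toy embeddings: the first three point one way, the last two another way
vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]

sims = adjacent_similarities(vectors)
chunks = split_at_drops(sentences, sims)
print(chunks)  # the cat sentences form one chunk, the GPU sentences another
```

Only the similarity between "Kittens play." and "GPUs run matrix math." falls below the cutoff, so the text is split exactly at the topic change.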

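Key features

Chunking driven by semantic similarity
Supports several breakpoint detection methods (percentile, standard deviation, interquartile range)
Preserves semantic coherence
More precise than fixed-length chunking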
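
To see how the three breakpoint rules differ, the sketch below computes the cutoff each one produces for the same list of adjacent-sentence similarities. The data and the threshold values (10th percentile, 1 standard deviation) are illustrative assumptions, and `similarity_cutoff` is a hypothetical helper.

```python
import numpy as np


def similarity_cutoff(similarities, method, threshold=None):
    """Return the similarity value below which a gap counts as a breakpoint."""
    sims = np.asarray(similarities, dtype=float)
    if method == "percentile":
        return float(np.percentile(sims, threshold))
    if method == "standard_deviation":
        return float(sims.mean() - threshold * sims.std())
    if method == "interquartile":
        q1, q3 = np.percentile(sims, [25, 75])
        return float(q1 - 1.5 * (q3 - q1))  # standard IQR rule for low outliers
    raise ValueError(f"unknown method: {method}")


similarities = [0.90, 0.88, 0.92, 0.30, 0.89, 0.91, 0.87, 0.90]

settings = [("percentile", 10), ("standard_deviation", 1), ("interquartile", None)]
found = {}
for method, threshold in settings:
    cutoff = similarity_cutoff(similarities, method, threshold)
    found[method] = [i for i, s in enumerate(similarities) if s < cutoff]
    print(f"{method}: cutoff={cutoff:.3f}, breakpoints={found[method]}")
```

On this data all three rules flag only index 3, the single sharp drop. They diverge on noisier data: the percentile rule always marks a fixed share of gaps, while the standard-deviation and IQR rules can return no breakpoints at all when the similarities are uniform.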

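Use cases

Long-document processing
Scenarios that need semantic completeness preserved
Complex question-answering systems
Structured text such as academic papers and technical documentation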

Looking Ahead to 2025

作者 — Wed, 01 Jan 2025 00:00:00 +0800

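2024 in Review

January-February: Settling Down

We finally completed one of life's big milestones: buying a home. The moment we received the property deed, my wife and I were overjoyed. This new home is more than a place to live; it carries our hopes for the life ahead.

March-April: Steady Progress

This period was mostly day-to-day work and mortgage payments. We could not save much each month, but life was still full of joy. We learned to find pleasure in living within a limited budget.

May-June: Turbulence at Work

The company went through a round of layoffs. I was spared, but the episode left me with doubts about the company's future. It was a time of confusion and uncertainty; I tried many things but made little progress.

July-October: Renovating the New Home

The new home entered renovation. Money was tight, but the weekly trips back to Changsha to supervise the work were full of anticipation and joy. Watching the home take shape bit by bit made all the effort worthwhile.

November-December: Health and Exploring AI

A health check-up found that I have Hashimoto's thyroiditis, which made me pay much closer attention to my health. At the same time, the rapid progress of AI gave me a strong sense of crisis. After a lot of thought and hands-on practice, I decided to embrace AI rather than fear it.

During this period I explored a number of AI tools in depth:

Cursor
Windsurf (I subscribed to the Pro plan)
Cline
Aider
Zed AI

Through practice I found that each tool has its own strengths, so I started combining them. In mid-December I began taking on freelance projects with the help of AI, for two main reasons:

To ease financial pressure
To explore in depth what AI can do and improve my productivity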

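Looking Ahead to 2025

For the new year, I have set myself the following goals:

AI SOP optimization: distill a personal AI workflow and standard operating procedure
Product development: use AI tools to finish the MVP (minimum viable product) of my first product
Full-stack development: start exploring full-stack development to broaden my technical range
AI-assisted learning: build an efficient AI-assisted learning system to acquire knowledge faster
Career foundation: lay a solid foundation for my future career

2025 will be a year full of challenges and opportunities. I believe that by using AI tools sensibly and continuing to learn and improve, I can reach these goals and create more possibilities for the future.

"The future belongs to those who believe in the beauty of their dreams." - Eleanor Roosevelt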

Coding with Windsurf

作者 — Fri, 06 Dec 2024 17:32:26 +0800

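Recently I had the chance to try the AI-assisted programming features of the Windsurf editor. The experience was impressive, and in some respects it even surpassed the other mainstream editors I have used. After a full day of intensive use, however, I found that it still has room for improvement.

At first I expected the AI to take over code writing entirely, or at least to sharply cut the time I spend on low-level coding so that I could focus on higher-level architecture. Reality fell short of that expectation. Over continued use, I noticed that the AI's problem-solving ability gradually declined as the session grew longer.

After some analysis, I think this may be related to Windsurf's architecture. The editor's AI assistance is built on an agent model. While using it I frequently ran into response errors, and they became more frequent the longer I used it. More frustrating still, the AI sometimes answered a question correctly but could not actually modify the code to fix the problem. I suspect this is a bug in Windsurf that prevents the AI from carrying on with code-modification tasks.

It is worth noting that the core strength of AI-assisted programming comes less from its "intelligence" than from its ability to analyze and understand code in depth. The AI solves problems quickly because it can efficiently analyze how pieces of code relate to each other and understand the code's logical structure and lifecycle. As a session goes on, though, this ability seems to fade, possibly because the software's performance has not been sufficiently optimized.

Truly elegant AI-assisted programming still needs optimization and improvement on several fronts. These problems may not be fully solved in the short term, but AI-assisted programming is undeniably a powerful tool. It helps us learn and pick up new programming languages quickly and markedly improves both learning efficiency and productivity. I believe that as the technology advances, these problems will eventually be solved, giving developers a better programming experience.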

Reading Fu Peng and Gao Shanwen on the Current Economy

作者 — Tue, 03 Dec 2024 18:07:13 +0800

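Today I read the transcript of Fu Peng's internal talk at HSBC, and then listened to Gao Shanwen's speech. After taking in their analyses of China's economic situation, I have a fuller understanding of current and future economic trends.

From a close reading of Gao Shanwen's analysis, I came to see that economic transition and cyclical pressure are the key forces shaping the future of China's economy. The structural changes brought by the transition are deep and lasting; they require us to keep adapting to new development models, such as the shift from labor-intensive to technology-intensive industry and from investment-driven to consumption-driven growth. Cyclical pressure, meanwhile, has a marked short-term effect on the economy, for example through swings in demand and changes in market confidence, and these can affect our careers and finances directly or indirectly.

This made me realize that different stages of the economic cycle call for different strategies. For a long time to come, our careers will unfold in the late stage of this economic cycle, which means that the overall economy, the job market, and the potential for income growth cannot be compared with those of the high-growth years. To adapt, we need to assess carefully where our own industry stands in the transition and where it is likely to go. If its prospects are bleak, it may be time to consider changing fields or upgrading skills to meet the needs of emerging industries. With growth slowing, we also need to manage personal finances more cautiously: reduce unnecessary debt, increase savings, and invest in relatively stable assets. Spending habits need adjusting too; avoiding overconsumption and spending rationally matter even more when economic uncertainty is high. It is also time to consider starting a side business to earn income beyond the main job, as a buffer against economic swings. I hope this helps me cope better with the challenges of economic transition and cyclical pressure, while preparing for new opportunities that may arise.

As ordinary people, we need a broader and deeper understanding of the economy so that we can make sound decisions in a changing environment. These two pieces have given me valuable guidance for my future financial planning and career development.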

2024, Week 46: Diagnosed with Hashimoto's

作者 — Thu, 14 Nov 2024 21:20:46 +0800

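Tomorrow there is a team-building outing, and early the day after I have to catch a train back to Changsha, so I am finishing this weekly report today. From now on I will write down my reflections on life every week.

This week in brief:

On Saturday I took the company's annual health check-up. This year I was especially concerned about the thyroid nodule found last year, so I added the three-item thyroid function panel (甲功三项B). During the check-up the doctor also suggested two more markers, anti-thyroglobulin antibody (TG-Ab) and thyroid peroxidase antibody (TPO-Ab), which are used to diagnose Hashimoto's thyroiditis. The blood results were available in the mini program that same afternoon. They showed my TG-Ab at 78.14 IU/ml (normal range 0-4.11) and my TPO-Ab at 28.6 IU/ml (normal range 0-5.63).

This hit me hard; it all but confirmed that I have Hashimoto's. Thinking it over later, I had slept badly the night before the check-up, and for a while before that I had been either eating barbecue or going out for milk tea, on top of being unable to put down the chili sauce we brought from my wife's hometown. The spike in these two markers may be related to my own habits. After reading up on Hashimoto's, I feel I will have to part with chili peppers from now on. I am a true Hunan native; can I live without chili? My wife kept teasing me from the side. But she only talks that way; in her heart she cares about my health more than anyone. I am only 30 and my body is already going downhill, which made me reflect. Until now I had always thought of eating as just a chore, something to get through with whatever was at hand. I assumed I was young and that many things mattered more than eating and sleeping. Looking back, I was badly mistaken. For people like us, the most important thing is to look after our bodies. The body is our capital; without it, how can we realize our own value?

Growth and learning:

~~Finished reading 《真需求》 ("Real Needs") by Liang Ning~~
Reading "Intimate Relationships" (《亲密关系》) by Rowland Miller, 20%
Reading 《桥本甲状腺炎90天治疗方案》 (a 90-day treatment plan for Hashimoto's thyroiditis), 20%

Health and self-care:

Activity rings:

Since the Hashimoto's result, I have been exercising for about half an hour every morning, half an hour at noon, and half an hour after dinner, and I have adjusted my diet. I will get retested in a month to see whether the markers have dropped or improved.

Plan for next week:

Next week I am going back to Changsha for the post-renovation deep clean, after which the furnishings move in. The renovation is finally coming to a close. Since buying the home I have been living paycheck to paycheck for a whole year. I hope that phase ends soon, so that I can save as much as I can.

Joy and gratitude:

It has been a full week since we saw the result. Every day my wife has stuck firmly to light meals, pushed me to go to bed and get up early, and had me out exercising for 30 minutes each morning. I am very grateful for her support behind the scenes.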

What I'm Doing Now

作者 — Thu, 14 Nov 2024 08:10:11 +0800

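Now

This page records what I need to focus on right now. Updated 2024/12/02 in Wuhan, China.

Life

Daily work: practice focus, find a sense of purpose
- Keep the project moving steadily
- Stay calm when QA colleagues provoke me; it's just work
Free time: a stable routine and healthy living
- Keep cooking at home
- Keep a regular schedule, no late nights
Exercise: raise my basal metabolism
- Start running, at least twice a week
- Keep playing badminton

Learning

Reading:
- Reading "Build a Large Language Model (From Scratch)", 60%
- ~~Finished reading 《真需求》 ("Real Needs") by Liang Ning~~
- Reading "Intimate Relationships" (《亲密关系》) by Rowland Miller, 80%
- Reading 《桥本甲状腺炎90天治疗方案》 (a 90-day treatment plan for Hashimoto's thyroiditis), 20%
- Reading "Learning eBPF", 5%
Technology:
- Learn deep packet inspection
- Learn the TCP protocol
- Learn Rust
Writing:
- Improve my writing skills

Projects

Traffic collector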

Feed Ownership Verification

作者 — Thu, 24 Oct 2024 00:00:00 +0800

This message is used to verify that this feed (feedId:71916462721158158) belongs to me (userId:45764741539537920). Join me in enjoying the next generation information browser https://follow.is.

Written at the Midpoint of 28

作者 — Sun, 08 Jan 2023 00:00:00 +0800

Dear Ethan:

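Another new year full of anticipation is here. How was the past year for you? In previous year-end reviews, you would fill the coming year with expectations. Clearly, last year you once again ended up deep in debt to those expectations. I chose to write you a letter at the midpoint of your 28th year; it is more private, but also closer to your heart.

Ill-Fated, Yet Unwilling to Give In

Another year has passed. You went through different things, met different people, and learned different stories; now it is your turn to tell your own. You have probably heard the story of Sisyphus: his whole life was spent pushing a boulder to the top of a mountain, enduring its fall, and pushing it up again, an absurd cycle that never ends. Perhaps this is also the life each of us has to live.

From March to May you lived among revisions to your short paper and your thesis, and experiments. You were young and hot-headed then, and argued with your advisor until you were red in the face over one small point. In June came the celebration of graduation and the farewell to the teachers and friends you had studied with. The graduation trip you planned with friends was cancelled, because money was short and because of the pandemic lockdowns. You hurried into the workplace, made new friends, and threw yourself back into self-improvement. Overall, you went through two major life events last year, graduation and starting work, completing once more the shift from student to professional. Everything else was trivia.

So much was lost this year that any small death or collapse became unbearable. That is probably a snapshot of last year: the boulder rolls down again and again, and we set off again and again. You truly wanted to work hard, but felt utterly powerless. This sense of powerlessness has returned year after year; thinking back carefully, it can be traced to 2015, when I first felt it deeply. Seven years later, you are still wrestling with it.

Every earlier stage of life (middle school, high school, university) seemed to be laid out for you, and the general direction was always that each year was better than the last. The passion that refused to accept reality has survived, perhaps precisely because you have never faced a major setback. Looking back, you did keep something of yourself. Back then you only needed to think about yourself; your family was always the side that kept giving, which is probably why you did everything so naively. That sense of self came from your youthful spirit, from your family's support, and from an era that kept surging forward. But life, or fate, has never promised anyone that things will always get better. The boulder will always roll down, and when we open our eyes tomorrow morning, we will still have to push it up.

Shouldering Your Responsibilities

Last year you spent every day in a flurry. You did not even spend proper time with your family, nor did you fully realize that your parents have reached the age when they should be enjoying their later years, and that we need to look after them and keep them company at all times. You struggled with anxiety every day, yet could not summon the courage to change your situation, because deep down you were afraid: afraid that the cost of trial and error would be too high, afraid of failing, afraid of being laughed at. But as said above, life or fate has never promised anyone that things will always get better, so this year you must find the courage to make some changes!

I know that over the past year you opened Bilibili countless times, as if looking for some answer, but after scrolling for a long time the anxiety had not eased at all. The facts keep telling you: since no one else can give you a clear answer, you have to learn to dance with life in shackles, don't you? When Maugham wrote "The Moon and Sixpence", he probably forgot that between ideals and reality there is also responsibility. He did not tell you which way to go when you stand at the crossroads, with the moon overhead, the coins at your feet, and responsibility pressing on your shoulders. The only thing you are sure of is that you want to take on that responsibility. Your family's support was once your source of confidence; today you want to be theirs.

Growing Up Means Accepting Yourself

Only now do I truly realize that growing up is a continual upgrade of one's understanding. Only when you grasp this does the world really begin to open up before you: things you once thought were wrong can also be right (like the time you argued with your advisor until red in the face). You no longer argue over every point you disagree with; you are slowly learning to understand others, respect them, listen to everyone's voice, and be less self-centered. That is already your greatest growth. Talking about growing up only at this age may be something of a luxury. Many, many people tasted the world's hardships far earlier. But for me, newly arrived in society, the test has only just begun. Growing up does not mean being polished by society, as the years pass, into the same worldliness and slickness; it means keeping, as life matures, a childlike innocence and a kind, loving heart. You want the moon. Even if you are ordinary and cannot fly, you should keep walking, keep running, and reach out to grasp it. Even after all kinds of setbacks, good things will still happen; there will be new connections, new roles, new challenges. I will not give up, and you must not either, all right?

A Message to the Future

In 2023, when you feel unsettled and anxious, may you remember your original intentions and dreams, and then stride on toward tomorrow!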

Reading "The Pragmatic Programmer"

作者 — Mon, 02 Jan 2023 00:00:00 +0800

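A Pragmatic Philosophy

Team trust is essential to creativity and collaboration; once trust is broken at a critical moment, it is almost impossible to repair
Provide options, don't make excuses: rubber duck debugging
Broken window theory: don't cause collateral damage just because something is urgent; keep software entropy under control as far as possible
People find it easier to join a success that is already under way (the stone soup story)
Keep reviewing the project and don't be the boiled frog: first get into the habit of observing your surroundings closely, then do the same in your project
Knowledge and experience are your most important assets, but they are expiring assets; the ability to learn new things is your most important strategic asset. Knowledge portfolio:
1. Invest regularly: set aside a fixed time and place to learn
  - Learn a new language every year
  - Read a technical book every month
  - Read nontechnical books
  - Take classes: find out what people outside your company are doing
  - Try different environments
  - Stay current: follow the latest technology
  Cross-pollination of ideas is important. Critical thinking: think critically about what you read and hear
2. Diversify: the more skills you are familiar with, the better
3. Manage risk: spread your technologies evenly from high-risk, high-reward to low-risk, low-reward; don't put all your technical eggs in one basket
4. Buy low, sell high: start learning an emerging technology before it becomes popular, though this is a bet
5. Review and rebalance: keep refreshing your knowledge base
Critical thinking
1. Ask the "five whys"
2. Who benefits?
3. What is the context?
4. When and where would this work?
5. Why is this a problem?
Write an outline and ask yourself: does this express what I want to say in the right way, and is now a good time to say it?

A Pragmatic Approach

ETC (easy to change)

The core guiding principle

DRY (Don't Repeat Yourself)
Orthogonality: in a well-designed system, the database code is orthogonal to the user interface. When the components of a system are highly interdependent, there is no such thing as a local fix.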
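
As a small illustration of orthogonality between storage and user interface, the sketch below keeps the presentation function ignorant of how data is stored. All names here (`UserStore`, `InMemoryUserStore`, `render_greeting`) are hypothetical and invented for this example; they do not come from the book.

```python
from typing import Protocol


class UserStore(Protocol):
    """The only thing the UI layer is allowed to know about storage."""

    def get_name(self, user_id: int) -> str: ...


class InMemoryUserStore:
    """One possible backend; a SQL-backed class could replace it without touching the UI."""

    def __init__(self, names: dict[int, str]) -> None:
        self._names = dict(names)

    def get_name(self, user_id: int) -> str:
        return self._names[user_id]


def render_greeting(store: UserStore, user_id: int) -> str:
    """Presentation code: depends on the UserStore interface, not on any database."""
    return f"Hello, {store.get_name(user_id)}!"


greeting = render_greeting(InMemoryUserStore({1: "Ethan"}), 1)
print(greeting)  # Hello, Ethan!
```

Because `render_greeting` depends only on the narrow `UserStore` interface, swapping the backend or changing the greeting format are independent, local changes. That independence is also what makes the code easy to change (ETC).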
