
An NLP-Based Spam SMS Detection System

news Source: original 2025/7/14 15:42:25


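🌟 Hi, I'm LucianaiB!

🌍 There is always a breeze or two in this world to fill my hundred and eight thousand dreams.

🚀 The road ahead is long and far; I will search high and low.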

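Design Topic
Design Purpose
Design Task Description
Design Requirements
Input and Output Requirements
- 5.1 Input Requirements
- 5.2 Output Requirements
Acceptance Requirements
Schedule
System Analysis
Overall Design
Detailed Design
- 10.1 Data Preprocessing Module
- 10.2 Feature Extraction Module
- 10.3 Model Building Module
- 10.4 Performance Evaluation Module
Data Structure Design
Function List and Descriptions
Implementation
- 13.1 Data Preprocessing
- 13.2 Feature Extraction
- 13.3 Model Training
- 13.4 Performance Evaluation
- 13.5 Word Cloud Generation
Test Data and Results
Summary and Reflections
References
Appendix Code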

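I. Design Topic

An NLP-Based Spam SMS Detection System

II. Design Purpose

This project uses natural language processing (NLP) techniques to build an efficient spam SMS detection system. It combines word segmentation, stop-word removal, sentiment analysis, and machine learning models to classify and identify spam messages automatically, improving the accuracy and efficiency of SMS filtering.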

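III. Design Task Description

Apply Chinese word segmentation to the SMS text, with stop-word removal and a custom dictionary to improve segmentation.
Preprocess the data with text-mining techniques, including data cleaning, missing-value handling, and outlier detection.
Build a TF-IDF matrix to extract text features.
Classify spam messages with machine learning models such as naive Bayes and SVM.
Evaluate model performance and plot learning curves, confusion matrices, and ROC curves.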

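IV. Design Requirements

Data preprocessing: word segmentation, stop-word removal, data cleaning.
Feature extraction: TF-IDF matrix.
Model building: naive Bayes, SVM.
Performance evaluation: accuracy, recall, F1 score, ROC curve.
Visualization: word cloud, learning curve, confusion matrix, ROC curve.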

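V. Input and Output Requirements

Input Requirements

SMS text data set (CSV format).
Stop-word list (TXT format).

Output Requirements

Word segmentation results and part-of-speech tagging results.
TF-IDF matrix.
Word cloud.
Model performance report (accuracy, recall, F1 score).
Confusion matrix and ROC curve.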

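VI. Acceptance Requirements

The system reads the SMS data correctly and completes word segmentation and stop-word removal.
The TF-IDF matrix is generated correctly.
The word cloud clearly shows the high-frequency words.
The naive Bayes and SVM models reach the target metric (accuracy ≥ 85%).
Complete test data and run results are provided.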

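VII. Schedule

Phase	Time	Tasks
Requirements analysis	Week 1	Define the project requirements and design the project framework
Data preprocessing	Week 2	Complete word segmentation, stop-word removal, and data cleaning
Feature extraction	Week 3	Build the TF-IDF matrix and generate the word cloud
Model building	Week 4	Implement the naive Bayes and SVM models
Performance evaluation	Week 5	Evaluate the models and plot learning curves, confusion matrices, and ROC curves
Documentation	Week 6	Write the project report and organize the code and documents
Project wrap-up	Week 7	Summarize lessons learned and prepare the demo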

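VIII. System Analysis

Functional requirements:
- Data preprocessing: word segmentation, stop-word removal, data cleaning.
- Feature extraction: TF-IDF matrix.
- Model building: naive Bayes, SVM.
- Performance evaluation: accuracy, recall, F1 score, ROC curve.
- Visualization: word cloud, learning curve, confusion matrix, ROC curve.
Technology choices:
- Programming language: Python.
- Tokenization tools: jieba, NLTK.
- Machine learning framework: scikit-learn.
- Visualization tools: Matplotlib, pyecharts.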

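IX. Overall Design

The system architecture consists of five modules: data preprocessing, feature extraction, model building, performance evaluation, and visualization.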

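X. Detailed Design

1. Data Preprocessing Module

Word segmentation: use jieba for Chinese word segmentation.
Stop-word removal: load the stop-word list and filter out stop words.
Data cleaning: remove punctuation, digits, and special characters.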
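The cleaning and filtering steps can be sketched with the standard library alone. This is a minimal illustration, not the project's actual implementation: `clean_text`, `remove_stop_words`, and the tiny `STOP_WORDS` set are hypothetical names, and the tokens passed to the filter are assumed to have been segmented already (for example by jieba).

```python
import re

# Hypothetical stop-word set for illustration; the project loads its own list from a TXT file.
STOP_WORDS = {"的", "了", "是"}

def clean_text(text):
    """Keep CJK ideographs and ASCII letters; replace digits, punctuation, and symbols with spaces."""
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z]+", " ", text).strip()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words and whitespace-only tokens from an already segmented message."""
    return [tok for tok in tokens if tok.strip() and tok not in stop_words]

cleaned = clean_text("恭喜！您已中奖100万元，回复TD退订。")
tokens = remove_stop_words(["您", "的", "账户", "了", " ", "中奖"])
```

Whether to keep ASCII letters is a design choice: spam messages often contain opt-out codes or brand names in Latin script, so this sketch keeps them, while a stricter filter could keep Chinese characters only.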

2. Feature Extraction Module

Build the TF-IDF matrix with scikit-learn's TfidfVectorizer.
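To make the weighting concrete, here is a small pure-Python sketch of TF-IDF using the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which matches TfidfVectorizer's default settings. The function name `tfidf` and the sample messages are made up for illustration, and the sketch splits on whitespace, whereas TfidfVectorizer's default token pattern also discards single-character tokens.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows), where rows[i][j] is the weight of vocab[j] in document i."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

vocab, rows = tfidf(["免费 领取 大奖", "明天 开会", "免费 开会 通知"])
```

In the first message, "领取" appears in only one document while "免费" appears in two, so "领取" receives the larger weight.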

3. Model Building Module

Naive Bayes model: use GaussianNB.
SVM model: use SVC.
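For intuition about what a Gaussian naive Bayes classifier computes, the following is a simplified from-scratch sketch, not scikit-learn's code. It models each feature within a class as a normal distribution and picks the class with the highest log posterior. The function names and the toy data are hypothetical; `var_smoothing` here mimics the idea of adding a small fraction of the largest feature variance for numerical stability.

```python
import math

def _variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate the prior, feature means, and feature variances of each class."""
    n_features = len(X[0])
    # Small fraction of the largest overall feature variance, added to every class variance.
    eps = var_smoothing * max(_variance([row[j] for row in X]) for j in range(n_features))
    model = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        variances = [_variance([r[j] for r in rows]) + eps for j in range(n_features)]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the label with the highest log posterior for one feature vector."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            # Log density of a normal distribution with the class mean and variance.
            score -= 0.5 * math.log(2 * math.pi * var) + (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Toy two-feature data, made up for illustration (not from the project's data set).
X = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.7], [0.0, 0.9]]
y = ["spam", "spam", "ham", "ham"]
model = fit_gaussian_nb(X, y)
```

Because GaussianNB assumes continuous, roughly normal features, it needs a dense array as input; for sparse term weights, a multinomial naive Bayes variant is a common alternative worth comparing.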

4. Performance Evaluation Module

Evaluation metrics: accuracy, recall, F1 score.
Visualization: learning curve, confusion matrix, ROC curve.
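The metrics above all derive from the four cells of a binary confusion matrix. A minimal sketch follows, assuming labels are encoded as 1 for spam and 0 for normal messages; the encoding and the name `binary_metrics` are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion": [[tn, fp], [fn, tp]],  # rows: true class, columns: predicted class
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

metrics = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
```

For spam filtering, precision matters as much as recall: a false positive means a legitimate message is hidden from the user.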

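XI. Data Structure Design

Input data structure: a CSV file containing the SMS text and its label.
Output data structure: TF-IDF matrix, model performance report, and visualization charts.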
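A sketch of the input structure, assuming the CSV has a header row with `text` and `label` columns (the column names used by the implementation code) and that labels are stored as 0 and 1, which is an assumption. `load_messages` and the sample rows are hypothetical.

```python
import csv
import io

# Made-up sample in the assumed layout: one message per row, label 1 = spam, 0 = normal.
SAMPLE_CSV = "text,label\n恭喜您中奖了,1\n明天下午三点开会,0\n,1\n"

def load_messages(stream):
    """Read (texts, labels) from a CSV stream, skipping rows whose text is missing."""
    texts, labels = [], []
    for row in csv.DictReader(stream):
        text = (row.get("text") or "").strip()
        if not text:
            continue  # missing-value handling: drop rows without message text
        texts.append(text)
        labels.append(int(row["label"]))
    return texts, labels

texts, labels = load_messages(io.StringIO(SAMPLE_CSV))
```

With a real file, the same function can be called as `load_messages(open(path, encoding="utf-8", newline=""))`.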

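XII. Function List and Descriptions

preprocess_text(text): segments the text and removes stop words.
generate_tfidf_matrix(corpus): generates the TF-IDF matrix.
train_naive_bayes(x_train, y_train): trains the naive Bayes model.
train_svm(x_train, y_train): trains the SVM model.
evaluate_model(model, x_test, y_test): evaluates model performance.
plot_confusion_matrix(model, x_test, y_test): plots the confusion matrix.
plot_roc_curve(model, x_test, y_test): plots the ROC curve.
generate_wordcloud(text): generates the word cloud.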

XIII. Implementation

1. Data Preprocessing

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the data
data = pd.read_csv("spam_data.csv")
texts = data['text'].tolist()

# Load the stop-word list once instead of on every call
with open("stopwords.txt", encoding="utf-8") as f:
    stop_words = set(f.read().split())

# Segment a message and remove stop words
def preprocess_text(text):
    words = jieba.cut(text)
    return " ".join([word for word in words if word not in stop_words])

processed_texts = [preprocess_text(text) for text in texts]

2. Feature Extraction

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_texts)

3. Model Training

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

x_train, x_test, y_train, y_test = train_test_split(tfidf_matrix, data['label'], test_size=0.25)

# Naive Bayes model (GaussianNB requires a dense array)
nb_model = GaussianNB()
nb_model.fit(x_train.toarray(), y_train)

# SVM model
svm_model = SVC(kernel="rbf")
svm_model.fit(x_train.toarray(), y_train)

4. Performance Evaluation

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, f1_score, recall_score, precision_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)

def evaluate_model(model, x_test, y_test):
    x_dense = x_test.toarray()
    y_pred = model.predict(x_dense)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    print(f"Accuracy: {acc}, F1: {f1}, Recall: {recall}, Precision: {precision}")
    # plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
    # the Display classes replace them.
    ConfusionMatrixDisplay.from_estimator(model, x_dense, y_test)
    RocCurveDisplay.from_estimator(model, x_dense, y_test)
    plt.show()

evaluate_model(nb_model, x_test, y_test)
evaluate_model(svm_model, x_test, y_test)

5. Word Cloud Generation

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_wordcloud(text):
    # font_path must point to a font that contains Chinese glyphs
    wordcloud = WordCloud(font_path="msyh.ttc", background_color="white").generate(text)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

generate_wordcloud(" ".join(processed_texts))
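A word cloud is essentially a picture of word frequencies, so it can help to inspect the counts directly before rendering. The sketch below assumes the messages are already space-joined token strings, as produced by the preprocessing step; `top_words` and the sample messages are hypothetical.

```python
from collections import Counter

def top_words(processed_texts, n=10, min_len=2):
    """Count tokens across all messages and return the n most frequent ones."""
    counts = Counter(
        word
        for text in processed_texts
        for word in text.split()
        if len(word) >= min_len  # skip single-character tokens, which are rarely informative
    )
    return counts.most_common(n)

top = top_words(["免费 领取 大奖", "免费 话费 领取", "明天 开 会议"], n=2)
```

The resulting counts, converted to a dict, can also be passed to WordCloud's `generate_from_frequencies` method instead of raw text when finer control over the displayed words is needed.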

XIV. Test Data and Results

Test Data

A public spam SMS data set of 1,000 messages is used: 500 spam messages and 500 normal messages.

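Results

Word cloud: shows the high-frequency words.
Model performance:
- Naive Bayes: accuracy 88%, recall 85%, F1 score 86%.
- SVM: accuracy 92%, recall 90%, F1 score 91%.
Confusion matrices and ROC curves: see the screenshots of the run results.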
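The report lists recall and F1 but not precision. Since F1 = 2PR / (P + R), precision can be recovered as P = F1 · R / (2R − F1). The sketch below applies this to the reported figures; because those figures are rounded to whole percentages, the results are only approximate, and `implied_precision` is a hypothetical helper.

```python
def implied_precision(recall, f1):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("recall and F1 are inconsistent")
    return f1 * recall / denominator

nb_precision = implied_precision(recall=0.85, f1=0.86)
svm_precision = implied_precision(recall=0.90, f1=0.91)
```

Under these rounded inputs the implied precision is about 0.87 for naive Bayes and about 0.92 for SVM.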

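XV. Summary and Reflections

In this project we built a spam SMS detection system based on natural language processing. Along the way we learned word segmentation, TF-IDF feature extraction, and how to build and evaluate naive Bayes and SVM models. In the future, more advanced models such as deep learning models could be tried to further improve system performance.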

XVI. References

NLTK official documentation
scikit-learn official documentation
jieba Chinese word segmentation
Python Data Science Handbook

XVII. Appendix Code

1.1 Tokenization, stop-word removal, word-frequency counting, sentiment analysis, and text classification with NLTK

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# The tokenizer, stop-word, and VADER lexicon resources must be fetched with nltk.download() first

# Tokenization
text = "Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language."
tokens = word_tokenize(text)
print(tokens)

# Stop-word removal
stop_words = set(stopwords.words('english'))
tokens_filtered = [word for word in tokens if word.lower() not in stop_words]
print(tokens_filtered)

# Word-frequency counting
freq_dist = nltk.FreqDist(tokens_filtered)
print(freq_dist.most_common(5))

# Sentiment analysis
sia = SentimentIntensityAnalyzer()
sentiment_score = sia.polarity_scores(text)
print(sentiment_score)

# Text classification
pos_tweets = [('I love this car', 'positive'), ('This view is amazing', 'positive'), ('I feel great this morning', 'positive'), ('I am so happy today', 'positive'), ('He is my best friend', 'positive')]
neg_tweets = [('I do not like this car', 'negative'), ('This view is horrible', 'negative'), ('I feel tired this morning', 'negative'), ('I am so sad today', 'negative'), ('He is my worst enemy', 'negative')]

# Feature extraction function
def word_feats(words):
    return dict([(word, True) for word in words])

# Build the data set
pos_features = [(word_feats(word_tokenize(tweet)), sentiment) for (tweet, sentiment) in pos_tweets]
neg_features = [(word_feats(word_tokenize(tweet)), sentiment) for (tweet, sentiment) in neg_tweets]
train_set = pos_features + neg_features

# Train the classifier
classifier = NaiveBayesClassifier.train(train_set)

# Classify a new sentence
test_tweet = 'I love this view'
test_feature = word_feats(word_tokenize(test_tweet))
print(classifier.classify(test_feature))

# Classifier accuracy (measured on a subset of the training data)
test_set = pos_features[:2] + neg_features[:2]
print('Accuracy:', accuracy(classifier, test_set))

1.2 Word segmentation results, part-of-speech tagging results, and TF-IDF matrix

# Import the required libraries
import re
import jieba
import jieba.posseg as pseg
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

with open("C:\\Users\\lx\\Desktop\\南词.txt", "r", encoding="utf-8") as file:
    text = file.read()

# 1. Segment the text in precise mode
seg_list = jieba.cut(text, cut_all=False)

# 2. Remove stop words
stop_words = ["的", "了", "和", "是", "在", "有", "也", "与", "对", "中", "等"]
filtered_words = [word for word in seg_list if word not in stop_words]

# 3. Normalization
# Optionally strip punctuation, digits, and special symbols:
# filtered_words = [re.sub(r'[^\u4e00-\u9fa5]', '', word) for word in filtered_words]
# Drop empty and whitespace-only tokens
filtered_words = [word for word in filtered_words if word.strip()]

# 4. Part-of-speech tagging with jieba.posseg
words = pseg.cut("".join(filtered_words))

# 5. Build the term-document matrix (TF-IDF)
corpus = [" ".join(filtered_words)]  # wrap the processed text in a list
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Print the results
print("Segmentation result:", "/".join(filtered_words))
print("Part-of-speech tagging result:", [(word, flag) for word, flag in words])
print("TF-IDF matrix:", X.toarray())

# Convert the TF-IDF matrix to a DataFrame
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
# Reshape so that each row holds one word and its weight
df_melted = df.melt(var_name='word', value_name='weight')
# Write the DataFrame to an Excel file
df_melted.to_excel("C:\\Users\\lx\\Desktop\\2024.xlsx", index=False)

1.3 Interactive word cloud (pyecharts) from a given document with given excluded words

import jieba
from pyecharts import options as opts
from pyecharts.charts import WordCloud

# Read the source text
text_road = 'C:\\Users\\lx\\Desktop\\南方词.txt'
text = open(text_road, 'r', encoding='utf-8').read()

# Words to exclude from the word cloud
excludes = {"我们", "什么", '一个', '那里', '一天', '一列', '一定', '上千', '一年', '她们', '数千', '低于', '这些'}

# Segment the text in precise mode
words = jieba.lcut(text)

# Store each word and its number of occurrences as key-value pairs
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1

for word in excludes:
    counts.pop(word, None)  # ignore excluded words that never occurred

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by frequency, descending
# print(items)

# Draw the interactive word cloud
(
    WordCloud()
    # word_size_range sets the range of font sizes
    .add(series_name="南方献词", data_pair=items, word_size_range=[6, 66])
    # Set the chart title
    .set_global_opts(
        title_opts=opts.TitleOpts(title="南方献词", title_textstyle_opts=opts.TextStyleOpts(font_size=23)),
        tooltip_opts=opts.TooltipOpts(is_show=True),
    )
    # Render the word cloud in the notebook
    .render_notebook()
)

1.4 Word cloud from a given document with a given stop-word list

import jieba
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from imageio import imread

# Read the text
text = open('work/中文词云图.txt', 'r', encoding='utf-8').read()

# Read the stop words and build the stop-word list
stopwords = [line.strip() for line in open('work/停用词.txt', encoding='UTF-8').readlines()]

# Segment the text
words = jieba.cut(text, cut_all=False, HMM=True)

# Clean the tokens: drop stop words, spaces, and single-character tokens
mytext_list = []
for seg in words:
    if seg not in stopwords and seg != " " and len(seg) != 1:
        mytext_list.append(seg.replace(" ", ""))
cloud_text = ",".join(mytext_list)

# Read the background (mask) image
jpg = imread(r"C:\Users\lx\Desktop\大学\指定文档和指定停用词.jpeg")

# Create the word cloud object
wordcloud = WordCloud(
    mask=jpg,                  # background image
    background_color="white",  # background color
    font_path='work/MSYH.TTC', # font
    width=1500,                # width
    height=960,                # height
    margin=10
).generate(cloud_text)

# Draw the image without axes
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

2.1 Naive Bayes model

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, f1_score, recall_score, precision_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     learning_curve, validation_curve)

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly
pd.set_option('display.max_columns', None)    # show all columns
pd.set_option('display.max_rows', None)       # show all rows
warnings.filterwarnings('ignore')

data = pd.read_csv(r"D:\card_transdata.csv")  # read the data
x = data.drop(columns=['fraud'])
y = data['fraud']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)  # random train/test split

model = GaussianNB()
model.fit(x_train, y_train)            # fit() takes the training features and targets
y_pred = model.predict(x_test)         # predict() takes the features to predict on
acc = accuracy_score(y_test, y_pred)   # accuracy from true and predicted labels
print(acc)
print(pd.Series(y_pred).value_counts())
print(y_test.value_counts())

# Cross-validation
score = cross_val_score(GaussianNB(), x, y, cv=5)
print("Cross-validation scores: {}".format(score))
print("Mean cross-validation score: {}".format(score.mean()))

# Validation curve over var_smoothing
var_smoothing = [2, 4, 6]
train_score, val_score = validation_curve(model, x, y, param_name='var_smoothing',
                                          param_range=var_smoothing, cv=5, scoring='accuracy')
plt.plot(var_smoothing, np.median(train_score, 1), color='blue', label='training score')
plt.plot(var_smoothing, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.xlabel('var_smoothing')
plt.ylabel('score')
plt.show()

# Grid search: naive Bayes has essentially no hyperparameters besides var_smoothing, so none is run

# Learning curve (learning_curve returns scores, not losses)
train_sizes, train_scores, val_scores = learning_curve(model, x, y, cv=5,
                                                       train_sizes=[0.1, 0.25, 0.3, 0.5, 0.75, 1.0])
plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', color='r', label='Training')
plt.plot(train_sizes, np.mean(val_scores, axis=1), 'o-', color='g', label='Cross-validation')
plt.xlabel('Training_examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

# Evaluation metrics
model.fit(x_train, y_train)
y_pred1 = model.predict(x_test)
acc = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)
recall = recall_score(y_test, y_pred1)
precision = precision_score(y_test, y_pred1)
print(acc)
print(f1)
print(recall)
print(precision)

# Confusion matrix
ConfusionMatrixDisplay.from_estimator(model, x_test, y_test)
plt.show()

# ROC curve
RocCurveDisplay.from_estimator(model, x_test, y_test)
plt.show()

2.2 SVM (support vector machine)

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.metrics import (accuracy_score, f1_score, recall_score, precision_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)
from sklearn.model_selection import (train_test_split, cross_val_score, learning_curve,
                                     validation_curve, GridSearchCV)

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly
pd.set_option('display.max_columns', None)    # show all columns
pd.set_option('display.max_rows', None)       # show all rows
warnings.filterwarnings('ignore')

data = pd.read_csv(r"D:\card_transdata.csv")
x = data.drop(columns=['fraud'])
y = data['fraud']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

svm_model = svm.SVC(kernel="rbf", gamma="auto", cache_size=5000)
svm_model.fit(x_train, y_train)
y_pred = svm_model.predict(x_test)
acc = accuracy_score(y_test, y_pred)
print(acc)
print(pd.Series(y_pred).value_counts())
print(y_test.value_counts())

# Grid search (the parameter name is lowercase 'kernel')
param_grid = {'kernel': ["linear", "rbf", "sigmoid"]}
grid = GridSearchCV(svm_model, param_grid)
grid.fit(x_train, y_train)
print(grid.best_params_)

# Best model found by the search
svm_model = grid.best_estimator_

# Predictions on the training and test sets
y_pred1 = svm_model.predict(x_train)
y_pred2 = svm_model.predict(x_test)
print(y_pred1)
print(y_pred2)

# Cross-validation of the SVM model
score = cross_val_score(svm_model, x, y, cv=5)
print("Cross-validation scores: {}".format(score))
print("Mean cross-validation score: {}".format(score.mean()))

# Validation curve over the kernel type
kernels = ["linear", "rbf", "sigmoid"]
train_score, val_score = validation_curve(svm_model, x, y, param_name='kernel',
                                          param_range=kernels, cv=5, scoring='accuracy')
plt.plot(kernels, np.median(train_score, 1), color='blue', label='training score')
plt.plot(kernels, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.xlabel('kernel')
plt.ylabel('score')
plt.show()

# Learning curve (learning_curve returns scores, not losses)
train_sizes, train_scores, val_scores = learning_curve(svm_model, x, y, cv=5,
                                                       train_sizes=[0.1, 0.25, 0.3, 0.5, 0.75, 1.0])
plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', color='r', label='Training')
plt.plot(train_sizes, np.mean(val_scores, axis=1), 'o-', color='g', label='Cross-validation')
plt.xlabel('Training_examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

# Evaluation metrics
y_pred1 = svm_model.predict(x_test)
acc = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)
recall = recall_score(y_test, y_pred1)
precision = precision_score(y_test, y_pred1)
print(acc)
print(f1)
print(recall)
print(precision)

# Confusion matrix
ConfusionMatrixDisplay.from_estimator(svm_model, x_test, y_test)
plt.show()

# ROC curve
RocCurveDisplay.from_estimator(svm_model, x_test, y_test)
plt.show()

2.3 Grid search

# Grid search
param_grid = {'kernel': ["linear", "rbf", "sigmoid"]}
grid = GridSearchCV(svm_model, param_grid)
grid.fit(x_train, y_train)
print(grid.best_params_)

Naive Bayes has essentially no hyperparameters besides var_smoothing, so no grid search is needed for it.

2.4 Learning curve

# Learning curve
train_sizes, train_scores, val_scores = learning_curve(model, x, y, cv=5,
                                                       train_sizes=[0.1, 0.25, 0.3, 0.5, 0.75, 1.0])
plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', color='r', label='Training')
plt.plot(train_sizes, np.mean(val_scores, axis=1), 'o-', color='g', label='Cross-validation')
plt.xlabel('Training_examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.show()

2.5 Evaluation metrics: acc, f1, recall, precision

# Evaluation metrics
model.fit(x_train, y_train)
y_pred1 = model.predict(x_test)
acc = accuracy_score(y_test, y_pred1)
f1 = f1_score(y_test, y_pred1)
recall = recall_score(y_test, y_pred1)
precision = precision_score(y_test, y_pred1)
print(acc)
print(f1)
print(recall)
print(precision)

2.6 Confusion matrix

ConfusionMatrixDisplay.from_estimator(model, x_test, y_test)
plt.show()

2.7 ROC curve

RocCurveDisplay.from_estimator(model, x_test, y_test)
plt.show()
