INT303 Big Data Analytics Notes
Lecture 1 Introduction
Not examined!
“Data Mining is the study of collecting, processing, analyzing, and gaining useful insights from data”
EXPLORATORY ANALYSIS
Make measurements to understand what the data looks like; these are the first steps when collecting data.
– Metrics: deciding what to measure is important
– How do we compute similarity?
– How do we group similar users? Clustering
-- Similar users like items similarly: Recommendation systems
Triadic closure principle: Links are created in a way that usually closes a triangle
– If both Bob and Charlie know Alice, then they are likely to meet at some point.
Predictions
Types:
– Predicting a real value (e.g. number of clicks): Regression
– Predicting a YES/NO value (e.g., will the user click?): Binary classification
– Predicting over multiple classes (e.g., what is the topic of a post): Classification
Lecture 2 What is data and big data
Data
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
• Examples: name, date of birth, height, occupation.
• Attribute is also known as variable, field, characteristic, or feature
For each object the attributes take some values.
The collection of attribute-value pairs describes a specific object
• Object is also known as record, point, case, sample, entity, or instance
Attribute type
Numeric
• Examples: dates, temperature, time, length, value, count.
• Discrete (counts) vs Continuous (temperature)
• Special case: Binary/Boolean attributes (yes/no, exists/not exists)
Categorical
• Examples: eye color, zip codes, strings, rankings (e.g., good, fair, bad), height in {tall, medium, short}
• Nominal (no order or comparison) vs Ordinal (ordered, but differences between values are not meaningful)
Relational data
• The term comes from DataBases, where we assume data is stored in a relational table with a fixed schema (fixed set of attributes)
In Databases, it is usually assumed that the table is dense (few null values)
• There are a lot of data in this form
• There are also a lot of data which do not fit well in this form
Sparse data: Many missing values
Not easy to define a fixed schema
Numeric Relational Data
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points/vectors in a multi-dimensional space, where each dimension represents a distinct attribute
• Such data set can be represented by an n-by-d data matrix, where there are n rows, one for each object, and d columns, one for each attribute
Numeric Data
• Thinking of numeric data as points or vectors is very convenient
- For small dimensions we can plot the data
- We can use geometric analogues to define concepts like distance or similarity
- We can use linear algebra to process the data matrix
Categorical Relational Data
consists of a fixed set of categorical attributes
Mixed Relational Data
consists of a fixed set of both numeric and categorical attributes
Note:
zip code is actually a categorical attribute here
Boolean attributes can be treated as either numerical or categorical. Sometimes it is convenient to represent categorical attributes as Boolean.
Sometimes it is convenient to represent numerical attributes as categorical.
Binning/Bucketization (discretization)**
• Idea: split the range of the domain of the numerical attribute into bins (intervals).
• Every bucket defines a categorical value
How do we decide the number of bins?
• Depends on the granularity of the data that we want
• Sometimes domain knowledge is also used
How do we decide the size of the buckets?
Depends on the data and our application
• Equi-width bins: all bins have the same width – the data range is split into intervals of equal size
Example: split time into decades
Problem: some bins may be very sparse or empty
• Equi-size/depth/frequency bins: select the bins so that they all contain the same number of elements
This splits data into quantiles: top-10%, second 10%, etc.
Problem: some bins may be very small
• Optimized bins: use a 1-dimensional clustering algorithm to create bins that fit the data distribution better, reducing empty or sparse buckets
e.g.
Data: [2, 3, 5, 6, 7, 9, 10, 12, 15, 18], split into 3 bins
Equal-width bins: [2, 7), [7, 12), [12, 17)
Resulting groups: [2, 3, 5, 6], [7, 9, 10], [12, 15, 18]
Equal-size bins:
Resulting groups: [2, 3, 5], [6, 7, 9], [10, 12, 15, 18]
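A minimal sketch of the three binning options in pandas (the data is the small example above; `pd.cut` gives equi-width bins, `pd.qcut` gives equi-depth bins, and a 1-D k-means stands in for the optimized binning):

```python
import pandas as pd
from sklearn.cluster import KMeans

data = pd.Series([2, 3, 5, 6, 7, 9, 10, 12, 15, 18])

# Equi-width: 3 intervals of equal length over the data range
equi_width = pd.cut(data, bins=3)

# Equi-depth/frequency: 3 bins with (roughly) the same number of points
equi_depth = pd.qcut(data, q=3)

# "Optimized" bins: 1-dimensional clustering (here k-means with k=3)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(data.to_frame())

print(equi_width.value_counts().sort_index())
print(equi_depth.value_counts().sort_index())
print(pd.Series(labels).value_counts())
```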
Set data
• Each record is a set of items from a space of possible items
• Sets can also be represented as binary vectors, or vectors of counts
Vector representation of Transaction data(market-basket data)
Vector representation of Document data(bag-of-words)
Dependent data
• In some cases, there are explicit dependencies between the data
• Ordered/Temporal data: We know the time order of the data
• Spatial data: Data that is placed on specific locations
• Spatiotemporal data: data with location and time
• Networked/Graph data: data with pairwise relationships between entities
Genomic sequence data
long ordered string
e.g. GGTTCCGCCTTCAG... (a nucleotide sequence)
Time series Data
Sequence of ordered (over “time”) numeric values.
e.g. stock prices
Sequence data
Sequence data: Similar to the time series but in this case, we have
categorical values rather than numerical ones.
• Example: Event logs
Spatial data
Attribute values that can be arranged with geographic co-ordinates
Such data can be nicely visualized
e.g. map coordinates
Spatiotemporal data
Data that have both spatial and temporal aspects
• Measurements in different locations over time
Pressure, Temperature, Humidity
• Measurements that move in space over time
Traffic, Trajectories of moving objects
Graph Data
Graph data: a collection of entities and their pairwise relationships
In this case the data consists of pairs: who links to whom (the links may also be directed)
Representation: adjacency matrix – very sparse and wasteful, but useful conceptually
e.g
• Web pages and hyperlinks
• Facebook users and friendships
• The connections between brain neurons
• Genes that regulate each other
The data matrix
In almost all types of data we can find a way to transform the data into a matrix,
where the rows correspond to different records, and the columns to numeric attributes
Data Mining Pipeline**
Data Quality
Examples of data quality problems:
• Noise and outliers
• Missing values
• Duplicate data
• Units may be different in different parts of the data (centimeters vs meters, or meters vs feet)
• Time measurements should be brought into the same time zone.
• Normalize names:
• Panayiotis Tsaparas, P. Tsaparas, and P. N. Tsaparas are all the same person
• Financial units should be normalized
• Different currencies
• Prices over time
Preprocessing
steps:
• Reducing the data: Sampling, Dimensionality Reduction
• Data cleaning: deal with missing or inconsistent information
• Feature extraction and selection: create a useful representation of the data by extracting useful features
Sampling**
• main technique employed for data selection often used for both the preliminary investigation and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Types
Simple Random Sampling: every item has an equal probability of being selected.
Sampling without replacement: once an item is selected it is removed from consideration, so the probability of selecting a particular item depends on the number of selections made and on the dataset size.
Sampling with replacement: items are not removed from the pool, so the same item can be selected more than once; this makes the probability calculations simpler.
Stratified sampling: split the data into several groups (strata), then draw random samples from each group.
Biased sampling: the sampling probability is deliberately non-uniform, so some items are more likely to be selected than others.
• Example: When sampling temporal data, we want to increase the probability of sampling recent data
• Introduce recency bias
• Make the sampling probability to be a function of time, or the age of an item
• Typical: Probability decreases exponentially with time
• For item 𝑥𝑖 with age 𝑡, select it with probability 𝑝(𝑥𝑖) ∝ 𝑒^(−𝜆𝑡)
Reservoir Sampling: an algorithm for random sampling over a data stream, useful when the length of the stream is not known in advance and memory is limited.
Claim: after N items have been read, every item is in the sample with the same probability (k/N for a reservoir of size k).
Algorithm:
- Initial phase: keep all of the first k items.
- After that: when the i-th item (i > k) arrives, keep it with probability k/i; if it is kept, replace one of the k items currently in the reservoir chosen uniformly at random (each with probability 1/k).
This guarantees that, even when the stream ends, every item (including the last one) is in the sample with the same probability, giving a uniform sample.
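A minimal sketch of reservoir sampling in Python (the stream and the reservoir size k are illustrative):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)                 # keep the first k items
        elif random.random() < k / i:              # keep the i-th item with probability k/i
            reservoir[random.randrange(k)] = item  # replace a random reservoir slot
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```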
Dimensionality Reduction
• Reduce the amount of data
• Extract the useful information.
e.g
Data Cleaning**
Missing Values
• Ignore the data
• Replace with random value
• Not a good idea, but you can understand how the missing value affects the output
• Replace with nearest neighbor value
• Relatively common practice.
• Should be careful for cases where this does not make sense (e.g., year of birth/death)
• Replace with cluster mean
Outliers
• Remove them
• Try to correct them using common sense
• Transform the data
When using the data, we should be careful of cases where our results are too good to be true, or too bad to be true
- Too good to be true: you train a machine learning model and it achieves 99% accuracy on the test set. Upon further inspection, you realize the test set contains samples that were also present in the training set.
- Too bad to be true: a model trained on customer purchase data performs poorly, and upon investigation you find that 30% of the purchase records have missing or incorrect customer IDs, leading to mismatched features and labels.
Lecture 3 Data exploration and statistical analysis
Feature Extraction
We need to extract the features/attributes from the data and build the data matrix
Text Data
Text normalization
First cut
Second cut:Remove stop words
Term Frequency (TF)
captures how often a word appears in a document
𝑇𝐹(t, 𝑑): term frequency of word t in document 𝑑
= (Number of times term t appears in document d) / (Total number of terms in document d)
• Number of times that the word appears in the document
• Natural measure of importance of the word for the document
Inverse Document Frequency (IDF)
Natural measure of the uniqueness of the word w
Important words are the frequent words that are unique to the document (differentiating) compared to the rest of the collection
• All reviews use the word “like”. This is not interesting
• We want the words that characterize the specific restaurant
Document Frequency 𝐷𝐹(𝑤)
fraction of documents that contain word 𝑤.
Inverse Document Frequency 𝐼𝐷𝐹(𝑤):
• Maximum when unique to one document : 𝐼𝐷𝐹(𝑤) = log(𝐷)
• Minimum when the word is common to all documents: 𝐼𝐷𝐹(𝑤) = 0
TF-IDF
The words that are best for describing a document are the ones that are important for the document, but also unique to the document.
𝑇𝐹-𝐼𝐷𝐹(𝑤, 𝑑) = 𝑇𝐹(𝑤, 𝑑) X 𝐼𝐷𝐹(𝑤)
e.g
Document 1: "the cat in the hat"
Document 2: "the quick brown fox"
We want to calculate the TF-IDF for the words "cat" and "quick" in these documents.
1. compute TF
TF(cat,d1)=1/5=0.2
TF(quick,d2)=1/4=0.25
2. compute IDF
IDF(cat): The word "cat" appears in 1 document (Document 1).
IDF(cat)=log(2/1)=log(2)≈0.3010
IDF(quick): The word "quick" appears in 1 document (Document 2).
IDF(quick)=log(2/1)=log(2)≈0.3010
3. compute TF-IDF
TF-IDF(cat,d1)=0.2×0.3010=0.0602
TF-IDF(quick,d2)=0.25×0.3010=0.0753
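A small sketch reproducing the TF-IDF example above in plain Python (log base 10, matching the 0.3010 value used in the notes):

```python
import math

docs = ["the cat in the hat", "the quick brown fox"]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    df = sum(term in doc for doc in tokenized)   # number of documents containing the term
    return math.log10(N / df)

for term, doc_id in [("cat", 0), ("quick", 1)]:
    score = tf(term, tokenized[doc_id]) * idf(term)
    print(term, round(score, 4))   # cat -> 0.0602, quick -> 0.0753
```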
Third cut
• TF-IDF takes care of stop words as well
• We do not need to remove the stopwords since they will get 𝐼𝐷𝐹(𝑤) = 0
Data Normalization**
Column normalization
Column normalization scales each column (feature) of the dataset to a similar range or distribution so that different features carry comparable weight in later analysis or modeling. Normalizing each column separately reduces the impact of differing value ranges across features, makes models more stable, and speeds up convergence. Several normalization methods are listed below.
1. Divide (the values of a column) by the maximum value for each attribute
• Brings everything in the [0,1] range, maximum is 1
new value = old value / max value in the column
2. Subtract the minimum value and divide by the difference of the maximum value and minimum value for each attribute
• Brings everything in the [0,1] range, maximum is one, minimum is zero
new value = (old value – min column value) / (max col. value –min col. value)
3. Subtract the mean value for each column – centering of features
• All columns have mean zero
new value = (old value – avg column value)
4. Divide with the length of the centered column vector
• All columns are unit vectors
5. Divide with the standard deviation of the column vector
• Computes the z-score
• Number of standard deviations away from the mean
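A minimal NumPy sketch of the five column-normalization options above (the matrix X is made up):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

div_by_max  = X / X.max(axis=0)                                        # 1. divide by column max
min_max     = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))    # 2. min-max to [0,1]
centered    = X - X.mean(axis=0)                                       # 3. subtract column mean
unit_length = centered / np.linalg.norm(centered, axis=0)              # 4. unit-length columns
z_scores    = centered / X.std(axis=0)                                 # 5. z-scores
print(z_scores)
```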
Row normalization
As explained in the slides:
1. Making documents comparable
Divide by the sum of values for each document (row in the matrix)
• Transforms a vector into a distribution
new value = old value / Σ old values in the row
2. Making users who rate in a similar way comparable
Subtract the mean value for each user (row) – centering of data
• Captures the deviation from the average behavior
new value = (old value – mean row value) [/ (max row value – min row value)]
3. z-score per row: measures the number of standard deviations away from the mean
Softmax
If we want to turn the scores into a probability distribution (the probabilities over all classes sum to 1), we
transform the scores into a probability distribution.
Logistic Function
If we want to turn a score into the probability that the user will return to this restaurant (the probability of the positive class). This is different from "the probability that the user will pick one of the three restaurants": it is not a distribution over the restaurants, but a distribution over the events "will return" / "will not return".
• Maps reals to the [0,1] range
• Mimics the step function
• In the class of sigmoid functions
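A small NumPy sketch of the two transformations (the score vector is made up):

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])   # e.g. scores for three restaurants

def softmax(s):
    e = np.exp(s - s.max())          # subtract the max for numerical stability
    return e / e.sum()               # probabilities over the classes, summing to 1

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))  # maps a real score to a probability in (0,1)

print(softmax(scores))               # a distribution over the three restaurants
print(logistic(scores))              # independent "will return" probabilities per score
```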
Logarithm function
Sometimes a data row/column may have a very wide range of values. Normalizing in this case would wash out the small values. We can bring the values to the same scale by applying a logarithm to the column values.
The logarithm rescales exponentially distributed data; its values are real numbers and the curve keeps rising, but without bound.
Normalization vs Standardization
(Not in the slides, but the two concepts are easy to confuse – and I'm certainly not the only one who mixes them up.)
Similarity & Distance**
• For many different problems we need to quantify how close two objects are.
• To solve these problems we need a definition of similarity, or distance.
• The definition depends on the type of data that we have
Similarity
Numerical measure of how alike two data objects are.
• A function that maps pairs of objects to real values
• Higher when objects are more alike.
• Often falls in the range [0,1], sometimes in [-1,1]
Desirable properties for similarity
1. 𝑠(𝑝, 𝑞) = 1 (or maximum similarity) only if 𝑝 = 𝑞. (Identity)
2. 𝑠(𝑝, 𝑞) = 𝑠(𝑞, 𝑝) for all 𝑝 and 𝑞. (Symmetry)
Intersection
Jaccard Similarity(Similarity between sets)
e.g
1. set
A={1,2,3,4}
B={3,4,5,6}
A∩B = {3, 4}, ∣A∩B∣=2
A∪B = {1, 2, 3, 4, 5, 6}, ∣A∪B∣=6
JSim(A,B) = ∣A∩B∣/ ∣A∪B∣ = 2/6 = 0.33
2. Documents
sentence 1: "I love programming"
sentence 2: "I enjoy programming"
A = {I, love, programming}
B = {I, enjoy, programming}
A∩B = {I, programming}, ∣A∩B∣ = 2
A∪B = {I, love, enjoy, programming}, ∣A∪B∣ = 4
JSim(A,B) = 2/4 = 0.5
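A minimal Python sketch of Jaccard similarity over sets, reproducing the two examples above:

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3, 4}, {3, 4, 5, 6}))             # 0.333...
print(jaccard(set("I love programming".split()),
              set("I enjoy programming".split())))     # 0.5
```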
Cosine Similarity(Similarity between vectors)
e.g.
1. Cos(D1,D2) = 1
D1 = [10,20,0,0]
D2 = [30,60,0,0]
(the two vectors are parallel, so their cosine similarity is 1)
2. The other pairs are computed in the same way:
Cos(D3,D1) = Cos(D3,D2) = 4/5
Cos(D4,D1) = Cos(D4,D2) = Cos(D4,D3) = 0
Applications:
- Text similarity: in natural language processing, documents or sentences can be represented as vectors (using methods like TF-IDF or word embeddings); cosine similarity measures the similarity between them.
- Recommendation systems: cosine similarity is used in collaborative filtering, where items are represented by user ratings and similar items can be recommended.
- Clustering: cosine similarity can help group similar data points together, especially when comparing feature vectors.
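A small NumPy sketch of cosine similarity, checking the D1/D2 case above:

```python
import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

d1 = np.array([10, 20, 0, 0])
d2 = np.array([30, 60, 0, 0])
print(cosine(d1, d2))   # 1.0 (up to floating-point rounding) -- parallel vectors
```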
Correlation Coefficient
The correlation coefficient measures correlation between two random variables
Interpretation of the result:
- 1: perfect positive correlation.
- 0: no correlation.
- -1: perfect negative correlation.
e.g.
CorrCoeff(D1,D2) = 1
D1 = [-5,5,0,0], D2 = [-15,15,0,0]
mean(D1) = 0 = mean(D2)
Numerator: ∑(xi−μX)(yi−μY) = (−5−0)(−15−0)+(5−0)(15−0)+(0−0)(0−0)+(0−0)(0−0) = 75+75+0+0 = 150
Denominator: √∑(xi−μX)² · √∑(yi−μY)² = √50 · √450 = 150
CorrCoeff(D1,D2) = 150/150 = 1
(The remaining calculations are omitted.)
CorrCoeff(D1,D3) = CorrCoeff(D2,D3) = -1
CorrCoeff(D1,D4) = CorrCoeff(D2,D4) = CorrCoeff(D3,D4) = 0
Distance
• Numerical measure of how different two data objects are
• A function that maps pairs of objects to real values
• Lower when objects are more alike
• Higher when two objects are different
• Minimum distance is 0, when comparing an object with itself.
• Upper limit varies
(I only just realized that L1/L2 regularization uses the same L1/L2 norms as these distance measures....)
Jaccard distance
𝐽𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 – 𝐽𝑆𝑖𝑚(𝑋, 𝑌)
Cosine distance
𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 − cos(𝑋, 𝑌)
Hamming Distance
Hamming distance is the number of positions in which bit-vectors differ.
e.g
p1 = 10101
p2 = 10011.
d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th positions.
The L1 norm for the binary vectors
Hamming distance between two vectors of categorical attributes is the number of positions in which they differ.
Example:
x = (married, low income, cheat)
y = (single, low income, not cheat)
d(x,y) = 2
Distances between distributions ↓↓↓
Variational distance
The 𝐿1 distance between the distribution vectors
This is applied to probabilities – probability distributions, not raw counts!
Dist(D1,D2) = |0.35- 0.4| + |0.5-0.4| + |0.1-0.1| + |0.05-0.1| = 0.05+0.1+0.05 =0.2
Dist(D2,D3) = 0.35+0.35+0.5+ 0.2 = 1.4
Dist(D1,D3) = 0.3+0.45+0.5+ 0.25 = 1.5
Information theoretic distance
Ranking distance
The input in this case is two rankings/orderings of the same N items.
𝑅1 = <𝑥, 𝑦, 𝑧, 𝑤>
𝑅2 = <𝑦, 𝑤, 𝑧, 𝑥>
How to read this: x has rank 1 in R1, y has rank 2 in R1, and so on;
in R2, x has rank 4, y has rank 1, and so on.
Spearman rank distance: the L1 distance between the ranks
It assesses the monotonic relationship between two rankings, i.e. whether they tend to increase or decrease together, regardless of any specific functional form.
(For reference, the related Spearman rank correlation coefficient ranges from -1 to +1:
- +1 means a perfectly monotonically increasing relationship – an increase in one variable is always accompanied by an increase in the other.
- -1 means a perfectly monotonically decreasing relationship.
- 0 means no monotonic relationship.)
Compute the difference in rank position element by element:
SR(R1,R2) = |1-4| + |2-1| + |3-3| + |4-2| = 6
So what is the difference between Spearman rank distance and variational distance?
Spearman rank distance is for comparing rankings, while variational distance compares probability distributions directly.
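A minimal Python sketch of the Spearman rank distance (the L1 distance between ranks) for the example above:

```python
def spearman_rank_distance(r1, r2):
    # map each item to its 1-based rank in each ordering
    rank1 = {item: pos for pos, item in enumerate(r1, start=1)}
    rank2 = {item: pos for pos, item in enumerate(r2, start=1)}
    return sum(abs(rank1[item] - rank2[item]) for item in r1)

print(spearman_rank_distance(["x", "y", "z", "w"], ["y", "w", "z", "x"]))  # 6
```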
Euclidean distance
For two points P = (x1, y1) and Q = (x2, y2), the Euclidean distance d between them is calculated as
d(P, Q) = √((x1 − x2)² + (y1 − y2)²)
Conclusion
- Jaccard distance: suited to set data or binary features.
- Cosine distance: suited to high-dimensional sparse data, such as text vectors.
- Hamming distance: suited to binary data or fixed-length strings.
- Variational distance: suited to comparing probability distributions.
- Information-theoretic distances (e.g. KL divergence): also for comparing probability distributions.
- Ranking distance: suited to rankings and other ordinal data.
- Euclidean distance: suited to points in geometric space or similarity of numeric data.
Exploratory Data Analysis (EDA)
Some pandas operations – the exam will include code for these.
High-level viewing
• head() – first N observations
• tail() – last N observations
• .columns – names of the columns
• describe() – statistics of the quantitative data
• .dtypes – the data types of the columns
Accessing/processing:
• df[“column_name”]
• df.column_name
• .max(), .min(), .idxmax(), .idxmin()
• <dataframe> <conditional statement>
• .loc[] – label-based accessing
• .iloc[row_indices, column_indices] – index-based accessing
• .sort_values()
• .isnull(), .notnull()
• .isna(), .notna()
• .any() – returns True if the object contains at least one True value; returns False only if there is no True at all
• .fillna()- Fill NA/NaN values using the specified method.
• .drop_duplicates()
Grouping/Splitting/Aggregating:
• groupby()-
• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure
• get_group() – retrieves one of the groups produced by groupby()
• .merge() – merging combines two datasets into one, e.g. joining attributes from different tables on a shared id
• .concat() – concatenates DataFrames directly along rows or columns, without needing a join key
• .aggregate() – performs aggregation operations (such as sum or mean) on grouped data or on an entire DataFrame
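A small sketch tying several of these operations together (the DataFrame and column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Suzhou", "Suzhou", "Shanghai", "Shanghai", "Shanghai"],
    "price": [3.2, 4.1, None, 5.0, 4.4],
})

print(df.head())       # first rows
print(df.describe())   # summary statistics of numeric columns
print(df.dtypes)       # column types

df["price"] = df["price"].fillna(df["price"].mean())              # fill missing values
cheap = df[df["price"] < 4.5]                                      # boolean filtering
by_city = df.groupby("city")["price"].aggregate(["mean", "max"])  # split-apply-combine
print(by_city.sort_values("mean"))
</code>
```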
Lecture 4 Data collection and Data scraping (not examined)
Not examined directly – it is assessed through CW1 instead, so I only skim it here.
Web Service
• A server is a long running process (also called daemon) which listens on a pre-specified port
• and responds to a request, which is sent using a protocol called HTTP
• A browser parses the url.
Data Scraping
Challenges
Which data?
The internet is dynamic
Data is volatile
Privacy
Netiquette
STEP 1: INSPECT YOUR DATA SOURCE
STEP 2: SCRAPE HTML CONTENT FROM A PAGE
STEP 3: PARSE HTML CODE WITH BEAUTIFUL SOUP
FIND ELEMENTS BY ID
FINDALL VS. FIND
FIND ELEMENTS BY HTML CLASS NAME
EXTRACT TEXT FROM HTML ELEMENTS
FIND ELEMENTS BY CLASS NAME AND TEXT CONTENT
PASS A FUNCTION TO A BEAUTIFUL SOUP METHOD
ACCESS PARENT ELEMENTS
EXTRACT ATTRIBUTES FROM HTML ELEMENTS
Gathering data from APIs
AnyAPI
Lecture 5 Data Visualization
Visualization motivation
• Identify hidden patterns and trends
• Formulate/test hypotheses
• Communicate any modeling results
–Present information and ideas succinctly
–Provide evidence and support
–Influence and persuade
• Determine the next step in analysis/modeling
Principle of Visualization
1. Maximize data to ink ratio: show the data
2. Don’t lie with scale (Lie Factor)
3. Minimize chart-junk: show data variation, not design
variation
4. Clear, detailed and thorough labeling
Types of Visualization
• Distribution: how a variable or variables in the dataset
distribute over a range of possible values.
• Relationship: how the values of multiple variables in the
dataset relate
• Composition: how a part of your data compares to the whole.
• Comparison: how trends in multiple variable or datasets
compare
Distribution
• When studying how quantitative values are located along an
axis, distribution charts are the way to go.
• By looking at the shape of the data, the user can identify
features such as value range, central tendency and outliers.
Histogram
A histogram is a way to visualize how 1-dimensional data
is distributed across certain values.
Trends in histograms are sensitive to number of bins.
Relationship
• They are used to find correlations, outliers, and clusters in your data.
• While the human eye can only appreciate three dimensions together, you can visualize additional variables by mapping them to the size, color or shape of your data points.
When we are trying to show the relationship between 2 variables, scatter plots or charts are used. When we are trying to show “relationship” between three variables, bubble charts
are used.
Scatter Plot
A scatter plot is a chart used to plot a correlation between two or more variables at the same time. It’s usually used for numeric data.
It is a way to visualize how multi-dimensional data are distributed across certain values. A scatter plot is also a way to visualize the relationship between two different attributes of multi-dimensional data.
3D data
For 3D data, a quantitative attribute can be encoded by size in a bubble chart
Relationships may be easier to spot by producing multiple
plots of lower dimensionality
Comparison
• These are used to compare the magnitude of values to each other and to easily identify the lowest or highest values in the data.
• If you want to compare values over time, line or bar charts are often the best option.
– Bar or column charts → comparisons among items.
– Line charts → a sense of continuity.
– Pie charts can be used for comparison as well.
MULTIPLE HISTOGRAMS
Plotting multiple histograms (and kernel density estimates of the distribution, here) on the same axes is a way to visualize how different variables compare (or how a variable differs over specific
groups).
BOXPLOTS
A boxplot is a simplified visualization to compare a quantitative variable across groups. It highlights the range, quartiles, median and any outliers present in a data set.
Composition
• Composition charts are used to see how a part of your data compares to the whole.
• Show relative and absolute values.
• They can be used to accurately represent both static and time-series data.
PIE CHART
A pie chart is a way to visualize the static composition (aka, distribution) of a variable (or single group).
STACKED AREA GRAPH
A stacked area graph is a way to visualize the composition of a group as it changes over time (or some other quantitative variable). This shows the relationship of a categorical variable (AgeGroup) to a quantitative variable (year).
Lecture 6 Infrastructure that supports Big Data processing
What is cluster architecture?
Distributed file system
– Data kept in “chunks” spread across machines
– Each chunk replicated on different machines
– Seamless recovery from disk or machine failure
MapReduce**
use for:
– Easy parallel programming
– Invisible management of hardware and software failures
– Easy management of very-large-scale data
The core idea of MapReduce is "divide and conquer": a complex task is broken into many small tasks, which are computed separately and then combined. The main workflow consists of the following steps:
1. Map: Transform input data into key-value pairs. Each mapper works on a subset of data in parallel.
Input: a line from the input list file.
Output: the grocery item as the key and a count of 1 as the value.
(This phase splits the work apart.)
2. Group by Key: shuffle and sort – the system automatically sorts and groups key-value pairs by their key.
3. Reduce: Aggregate or process the grouped data to produce the final output.
Input: key-value pairs from the mappers, where the key is a grocery item and the value is a list of counts.
Output: key-value pairs with the grocery item as the key and the sum of the values as the value.
(This phase recombines the shuffled data according to what the user needs.)
Structure
The master worker acts on behalf of the user and coordinates the whole process.
Some true/false questions:
1) Each mapper/reducer must generate the same number of output key/value pairs as it receives on the input. (Wrong)
2) The output type of keys/values of mappers/reducers must be of the same type as their input. (Wrong)
In many data-processing tasks, the job of the mapper and reducer is to transform the input data. For example, a mapper may receive lines of text and output words with counts (String, Integer). The input and output types differ because the core goal of MapReduce is to transform data, not merely to pass it through.
This flexibility is a key feature of the MapReduce programming model: users are free to change the format and type of the data as needed.
3) The inputs to reducers are grouped by key. (True)
4) It is possible to start reducers while some mappers are still running . (Wrong)
5) The correct sequence of data flow in MapReduce is InputFormat, Partitioning Function, Reducer, Sort and Group, OutputFormat. (Wrong)
The correct sequence is InputFormat, Mapper, Partitioning Function, Shuffle and Sort, Reducer, OutputFormat
6) The Mapper maps input key/value pairs to a set of intermediate key/value pairs. (True)
e.g. Word Count
- Mapper input: (line number, line of text), of type (Long, String).
- Mapper output: (word, 1), of type (String, Integer).
- Reducer output: (word, number of occurrences), of type (String, Integer).
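A minimal pure-Python simulation of the three phases for word count (no Hadoop involved; it only mirrors the map / group-by-key / reduce flow described above):

```python
from collections import defaultdict

lines = ["apple banana apple", "banana cherry"]

# Map: each line -> list of (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Group by key: the "shuffle and sort" phase
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'apple': 2, 'banana': 2, 'cherry': 1}
```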
Extends MapReduce
Spark
Apache Spark is an open-source distributed data-processing framework designed for fast computation over large datasets. Compared with traditional big-data frameworks such as Hadoop MapReduce, Spark offers higher speed, richer functionality, and a friendlier programming interface.
Key construct/idea: Resilient Distributed Dataset (RDD)
Higher-level APIs:
· Different APIs for aggregated data, which made it possible to introduce SQL support
Hadoop
Hadoop is an open-source framework that implements the MapReduce model. Its MapReduce component handles distributed computation, paired with HDFS (Hadoop Distributed File System) for data storage and access.
Disk-based computation where the results of each step are written to disk.
RDD
- Distributed computing in memory
- Greatly improving the speed of data processing
Key construct/idea: Resilient Distributed Dataset (RDD)
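For comparison, a sketch of the same word count on Spark RDDs (assuming a local PySpark installation; the input file path is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
counts = (sc.textFile("input.txt")                 # RDD of lines
            .flatMap(lambda line: line.split())    # RDD of words
            .map(lambda word: (word, 1))           # (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))      # sum counts per word
print(counts.collect())
sc.stop()
```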
Conclusion
Lecture 7 Dimensionality Reduction
Interaction Terms and Unique Parameterizations
Interaction Terms are combinations of predictor variables used to capture joint effects on the outcome variable. They help the model account for relationships that are not just additive but also dependent on the interaction between variables.
Parameterization in statistical models refers to how you define and estimate the model's parameters. In the context of regression models (or other models), parameters are the coefficients that quantify the effect of each predictor variable (or interaction term) on the outcome.
Unique Parameterizations refer to the separate parameters that need to be estimated for each variable and interaction term in the model. A model is "unidentifiable" when there aren't enough observations to estimate all these parameters uniquely.
A model is unidentifiable when there aren't enough observations to estimate all of its parameters. This happens if you try to estimate more parameters than the data can support. For example:
- If you have only 100,000 observations but you have 8.3 million parameters in the model (as mentioned in the passage), the model might not be identifiable, meaning you won't be able to estimate the unique values of all 8.3 million parameters reliably.
- Unidentifiable models are problematic because they can lead to overfitting, poor generalization, and invalid conclusions.
How Interaction Terms Affect Parameterization:
When you add interaction terms to your model, the number of parameters increases because you're essentially introducing new combinations of features to be estimated. This can quickly lead to a high-dimensional model (many parameters) that requires a large amount of data to estimate them accurately.
For example:
- A model with just two variables (Education and Work Experience) will have parameters for each individual variable, but adding an interaction term adds a third parameter.
- As you keep adding more variables and their interactions, the number of parameters grows exponentially, leading to models that may become unidentifiable with limited data.
Big Data and High Dimensionality
A rectangular data set has two dimensions: number of observations (n) and the number of predictors (p). Both can play a part in defining a problem as a Big Data problem.
In a dataset, an observation refers to a single data point or a row in the dataset. Each observation typically represents an individual entity or instance that is being studied (one row of a CSV file).
A predictor is a variable or feature in your dataset that is used to predict the outcome or dependent variable. These are the columns in your dataset that provide information to help build a model (one column of a CSV file).
• n is big (and p is small to moderate)?
Keep in mind, big n doesn't solve everything – the data still has to be representative to be useful.
• p is big (and n is small to moderate)?
Matrices may not be invertible (issue in OLS).
Multicollinearity is likely to be present
Models are susceptible to overfitting
This situation is called High Dimensionality, and needs to be accounted for when performing data analysis and modeling.
solution↓
Framework For Dimensionality Reduction
One way to reduce the dimensions of the feature space is to create a new, smaller set of predictors by taking linear combinations of the original predictors.
The original p features are combined linearly into a new, smaller set of features Z1, Z2, ..., Zm (m < p), where each Zi = φ1i X1 + ... + φpi Xp.
- φji are fixed coefficients (weights) that determine the contribution of the original feature Xj to Zi.
- m is the number of features after dimensionality reduction.
A new linear regression model can then be built on the reduced features Z1, Z2, ..., Zm:
Y = β0 + β1 Z1 + ... + βm Zm + ε
- Y is the target variable (the outcome to be predicted).
- β0, β1, ..., βm are the model coefficients.
- ε is the error term.
Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a method to identify a new set of predictors, as linear combinations of the original ones, that captures the 'maximum amount' of variance in the observed data.
That is, the observed data shows more variance in the direction of Z1 than in the direction of Z2.
To perform dimensionality reduction we select the top m principal components of PCA as our new predictors and express our observed data in terms of these predictors.
The top PCA components capture the largest amount of variation (the interesting features) of the data. Each component is a linear combination of the original predictors – we visualize them as vectors in the feature space. Transforming our observed data means projecting our dataset onto the space defined by the top m PCA components; these components are our new predictors.
The Math behind PCA
Key points of PCA
1. Capturing variation
2. From p to m < p: PCA produces a list of p principal components. The first principal component (Z1) corresponds to the direction along which the data exhibits the highest variance. Subsequent components (Z2, Z3, etc.) capture progressively less variance but are still important sources of variation. To perform dimensionality reduction we select the top m principal components of PCA as our new predictors and express our observed data in terms of these predictors.
3. Linear combinations: each principal component (Zi) is a linear combination of the original predictors, indicating how each predictor contributes to that component's direction. The principal components (Z's) are pairwise orthogonal, meaning they are uncorrelated with each other.
PCA procedure
The slides do not go into much detail, but past exams have covered it (well, apparently not this year).
Reference video: 用最直观的方式告诉你:什么是主成分分析PCA_哔哩哔哩_bilibili
In short: find a new set of coordinate axes via a linear transformation that preserves the maximum variance.
(Ha, my linear algebra has gotten rusty – time to go over it again.)
Covariance: do two variables move in the same direction or in opposite directions as they change, and how strongly?
Singular Value Decomposition (SVD)
Computing the covariance matrix directly can be expensive, so we can first decompose the data matrix instead (SVD).
pros and cons
PCA is an unsupervised algorithm. It is done independent of the outcome variable.
• Note: the component vectors as predictors might not be ordered from best to worst!
Disadvantage:
1. Direct interpretation of the coefficients in PCR is completely lost, so do not use it if interpretation is important.
2. Will often not improve predictive ability of a model.
Advantage:
1. Reducing dimensionality in high dimensional settings.
2. Visualizing how predictive your features can be of your response, especially in the classification setting.
3. Reducing multicollinearity, and thus may improve the computational time of fitting models.
PCA for Regression (PCR)
If we use all p of the new Zj, then we have not improved the dimensionality. Instead, we select the first M PCA variables, Z1, ..., ZM, to use as predictors in a regression model, and we use cross-validation to check performance on the specific problem.
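A minimal scikit-learn sketch of PCR: standardize, keep the first M principal components, regress, and use cross-validation to choose M (the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))          # 200 observations, 20 predictors
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

for m in [1, 2, 5, 10]:                 # candidate numbers of components M
    pcr = Pipeline([("scale", StandardScaler()),
                    ("pca", PCA(n_components=m)),
                    ("reg", LinearRegression())])
    score = cross_val_score(pcr, X, y, cv=5).mean()    # R^2 via cross-validation
    print(m, round(score, 3))
```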
PCA for Imputation
* Self-study material
Imputation
Imputation refers to the process of filling in missing data (or "missing values") in a dataset.
we want our imputations to take into account (1) links between variables and (2) global similarity between individual observations. This is the idea behind the iterative PCA algorithm for imputation.
Iterative PCA
Notice that the 1st PCA component changes less and less with each iteration. We simply iterate this process until we converge.
Algorithm
1. Initialize imputation with reasonable value (e.g., mean)
2. Iterate until convergence:
a. perform PCA on the complete data
b. retain the first M components of PCA (in the example, M = 1)
c. project imputation into PCA space
d. update imputation using projection value
we use Cross-validation to select the number of components to use for our projections
In practice this method can overfit, especially in a sparse matrix with many missing values. So often a regularized PCA approach is used which shrinks the singular values of the PCA , making updates less aggressive.
Alternative Interpretation of PCA
An alternative interpretation is that PCA finds a low-dimensional linear surface which is closest to the data points.
Matrix Completion
* Self-study material
Matrix Completion is another application of PCA and is suitable for imputing data which are missing at random. This approach is commonly used in recommender systems which deal in very large sparse matrices.
Consider an n x p matrix of movie ratings by Netflix customers where n is the number of customers and p is the number of movies. Such a matrix will certainly contain many missing entries.
Imputing these missing values well is equivalent to predicting what customers will think of movies they haven’t seen yet.
Lecture 8-9 Large-scale Machine Learning
Supervised Learning
In supervised learning, in addition to the feature variables that describe the
data, we also have a target variable. The goal is to learn a function (model) that can estimate/predict the value of the target variable given the features
Task:
Regression: The target variable (but also the features) is numerical and continuous
Classification: The target variable is discrete
Descriptive modeling: Explanatory tool to understand the data
Regression: How does the change in the value of different factors affect our target variable?
• What factors contribute to the price of a stock?
• What factors contribute to the GDP of a country?
Classification: Understand what attributes distinguish between objects of different classes
• Why people cheat on their taxes?
• What words make a post offensive?
(The course really does emphasize interpretability.)
Predictive modeling: Predict a class of a previously unseen record
Regression: What will the life-expectancy of a patient be?
Classification: Is this a cheater or not? Will the stock go up or not. Is this an offensive post?
Regression
• We assume that we have 𝑘 feature variables (numeric):
Also known as covariates, or independent variables
• The target variable is also known as dependent variable.
• We are given a dataset of the form {(𝒙1, 𝑦1) , … , (𝒙𝑛, 𝑦𝑛) } where, 𝒙𝒊 is a 𝑘-dimensional feature vector, and 𝑦𝑖 a real value
• We want to learn a function 𝑓 which, given a feature vector 𝒙𝒊, predicts a value 𝑦𝑖′ = 𝑓(𝒙𝒊) that is as close as possible to the value 𝑦𝑖
• Minimize sum of squares:
Linear regression
- Best suited for:
- Predicting a continuous numerical target (e.g., house prices, sales revenue).
- Understanding the effect of the independent variables on the value of the dependent variable.
Example: Predicting a company's monthly revenue based on advertising spend.
The simplest form of 𝑓 is a linear function
One-dimensional linear regression
Multiple linear regression:Matrix inversion may be too expensive. Other optimization techniques are often used to find the optimal vector (e.g., Gradient Descent)
Problems
1) Regression is sensitive to outliers – remove the outliers.
2) Features may have very different scales – normalize the features by replacing the values with their z-scores (subtract the mean and divide by the standard deviation).
3) The model is linear with respect to the parameters 𝒘, but the features we consider may be non-linear functions of the 𝒙𝒊 values – to capture more complex relationships, we can transform the input (e.g., take a logarithm log xij) or add polynomial terms (e.g., x²ij).
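A small sketch of fitting a linear regression on synthetic data, both with the closed-form least-squares solution and with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)

# Closed form: w = (X^T X)^{-1} X^T y, with a column of ones for the intercept
Xb = np.column_stack([np.ones(len(X)), X])
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w)                                   # roughly [1, 3, -2]

# The same fit via scikit-learn (larger problems often use iterative solvers instead)
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)
```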
Logistic Regression
- Best suited for:
• Produces a probability estimate for the class membership which is often very useful.
• The weights can be useful for understanding the feature importance.
• Works for relatively large datasets.
• Fast to apply.
• Instead of predicting the class of a record we want to predict the probability of the class given the record
• Transform the classification problem into a regression problem.
Estimating the coefficients
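A minimal scikit-learn sketch (it estimates the coefficients by maximum likelihood; the data is synthetic and the slides' derivation is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)   # estimated weights (hints at feature importance)
print(clf.predict_proba(X[:3]))    # class-membership probabilities per record
```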
Classification
Classification is the task of learning a target function f that maps attribute set x to one of
the predefined class labels y
Evaluation
Classification Techniques ↓
Decision Trees
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
- Best suited for:
- Problems requiring interpretability.
- Both classification and regression tasks.
- Situations with nonlinear relationships or interactions between variables.
Example: Classifying loan applicants as likely to default or not based on financial history.
Induction
Goal: Find the tree that has low classification error in the training data (training error)
Finding the best decision tree (lowest training error) is NP-hard
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion (e.g., Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT)
Hunt’s Algorithm
• How to Classify a leaf node
• Assign the majority class
• If leaf is empty, assign the default class – the class that has the highest popularity (overall or in the parent node).
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Specify Test Condition
number of ways to split
2-way split
Note: the second grouping is not appropriate because it breaks the ordering – Small and Large should not be combined while skipping over Medium.
Multi-way split
attribute types
Nominal
A categorical variable with no natural order or ranking; it only denotes different categories or labels, which are all on an equal footing and cannot be compared. E.g. colour.
Ordinal
An attribute that has an order but no well-defined numeric spacing between values. E.g. small/medium/large, grades (excellent/good/pass/fail).
Continuous
A numeric attribute whose values are continuous and support arithmetic operations such as addition, subtraction, multiplication and division.
Different ways of handling
• Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
• Binary Decision: (A < v) or (A ≥ v)
• consider all possible splits and finds the best cut
• can be more computationally intensive
Tree evaluation
determine the Best Split:
• Greedy approach:
• Creation of nodes with homogeneous class distribution is preferred
• Need a measure of node impurity:
Random Forest
- Best suited for:
- Problems with complex, nonlinear relationships.
- Reducing overfitting compared to a single decision tree.
- Both classification and regression tasks.
Example: Predicting customer churn in a subscription service based on user activity.
* self study
A collection of decision trees, where each tree is trained on a random subset of the data and features (bagging).
Nearest Neighbor Classifier
Uses k “closest” points (nearest neighbors) for performing classification
- Best suited for:
- Small datasets with few features where instances are similar to each other.
- Problems where making predictions based on similarity to labeled examples works well.
Example: Classifying hand-written digits based on similarity to labeled digit images.
requirement
– The set of stored records
– Distance Metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other classes
The value of k determines the complexity of the model: Lower k produces more complex models
steps
1. Compute distance to other training records
2. Identify k nearest neighbors
Determine the class from the nearest-neighbor list: take the majority vote of the class labels among the k nearest neighbors (in short, the record gets whatever label dominates among its closest neighbors)
Weigh the vote according to distance: weight factor w = 1/d^2
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking the majority vote) – we infer the unknown from the known, which is why this is supervised learning
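A minimal scikit-learn sketch of a k-nearest-neighbor classifier (toy data; the scaling issues discussed below still apply):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]])
y_train = np.array([0, 0, 1, 1])

# weights="distance" implements a 1/d-style weighted vote
knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)
print(knn.predict([[1.1, 1.0], [5.1, 5.0]]))   # -> [0, 1]
```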
issues
1) Scaling issues – attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
2) Euclidean measure – issues with high-dimensional data
• Curse of dimensionality: as the number of dimensions of the space grows, all kinds of problems arise when analyzing and organizing the high-dimensional data
• Can produce counter-intuitive results
Solution: normalize the vectors to unit length
limitation
• k-NN classifiers are lazy learners
• It does not build models explicitly. Unlike eager learners such as decision trees
• Classifying unknown records is relatively expensive
• Naïve algorithm: O(n)
• Need for structures to retrieve nearest neighbors fast.
• The Nearest Neighbor Search problem.
• Also, Approximate Nearest Neighbor Search
• Issues with distance in very high-dimensional spaces
Support Vector Machines
• SVMs are part of a family of classifiers that assumes that the classes are linearly separable
• That is, there is a hyperplane that separates (approximately, or exactly) the instances of the two classes and the goal is to find this hyperplane
- Best suited for:
- Classification tasks with clear margins between classes.
- Datasets with a high-dimensional feature space (e.g., text classification).
- Problems requiring kernel methods to handle nonlinear data.
Example: Spam email detection.
AdaBoost
* self study
Neural Networks
* self study
Linear classification: a simple model for classification is to take a linear combination of
the feature values and compute a score.
Non-linear classification:
Deep learning
Networks with multiple layers
Multiple layers: Each layer applies a logistic function to the input linear combination:
• The result is a more complex function
Training Process:
• Start with some initial weights
• Forward pass: Compute the outputs of all internal nodes
• Backward pass: Perform backpropagation to estimate the gradients
• Change the weights to move towards the direction of the gradient
• Repeat
Gradient Descent
Stochastic gradient descent
• Stochastic gradient descent: consider one input point at a time; each point is considered only once. In ordinary (batch) gradient descent the gradient over all training data is computed before the parameters are updated. Stochastic gradient descent (SGD) instead estimates the gradient from a single sample (or a small batch) in each iteration and updates the model parameters immediately.
• Intermediate solution: use mini-batches of data points (e.g. 32 or 64 samples per iteration), combining the advantages of batch and stochastic gradient descent.
Backpropagation
Main idea:
• Start from the final layer: compute the gradients for the weights of the final layer.
• Use these gradients to compute the gradients of previous layers using the chain
rule
• Propagate the error backwards
Backpropagation essentially is an application of the chain rule for differentiation.
Word Embeddings (probably not examined)
Reference video: 06 Word2Vec模型(第一个专门做词向量的模型,CBOW和Skip-gram)_哔哩哔哩_bilibili
(Why are there paper-churning programmers everywhere? lol)
Define a model that aims to predict between a center word 𝑤𝑐 and context words in some window of length 𝑚 in terms of word vectors
Word2Vec
Predict between every word and its context words
Arithmetic on word vectors: vector subtraction nicely captures dimensions of similarity (such as syntactic variation and semantic analogies). This is exactly the key result of the Word2Vec model.
CBOW
Predict the center word from a bag-of-words context
Skip-gram
Predict the context words given the center word
Lecture 10 Clustering
Overview of Clustering
Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that
– Members of a cluster are close/similar to each other
– Members of different clusters are dissimilar
distance measures
Document = set of words
– Jaccard distance
Document = point in space of words
– (x1, x2,..., xN), where xi = 1 iff word i appears in doc
– Euclidean distance
Document = vector in space of words
–Vector from origin to (x1, x2,..., xN)
–Cosine distance
Hierarchical Clustering
– Agglomerative (bottom up) (the main focus of this lecture):
• Initially, each point is a cluster
• Repeatedly combine the two “nearest” clusters into one
– Divisive (top down):
• Start with one cluster and recursively split it
Agglomerative
operation: Repeatedly combine two nearest clusters
Euclidean case
each cluster has a centroid = average of its (data) points
1) How do you represent a cluster of more than one point?
– Key problem: As you merge clusters, how do you represent the “location” of each cluster, to tell which pair of clusters is closest? centroid
2) How do you determine the “nearness” of clusters?
Measure cluster distances by distances of centroids
NON-Euclidean case
The only “locations” we can talk about are the points themselves,there is no “average” of two points
1) How do you represent a cluster of more than one point?
clustroid = the (data) point "closest" to the other points
"closest" means:
– Smallest maximum distance to other points
– Smallest average distance to other points
– Smallest sum of squares of distances to other points
(2) How do you determine the “nearness” of clusters?
Treat clustroid as if it were centroid, when computing inter-cluster distances
Centroid vs Clustroid
Centroid is the avg. of all (data)points in the cluster. This means centroid is an “artificial” point.
Clustroid is an existing (data)point that is “closest” to all other points in the cluster.
3) When to stop combining clusters?
– Approach 1: Pick a number k upfront, and stop when we have k clusters
• Makes sense when we know that the data naturally falls into k classes
– Approach 2: Stop when the next merge would create a cluster with low “cohesion”
• i.e, a “bad” cluster
cohesion based approaches:
– Approach 3.1: Use the diameter of the merged cluster = maximum distance between points in the cluster
– Approach 3.2: Use the radius = maximum distance of a point from centroid (or clustroid)
– Approach 3.3: Use a density-based approach
• Density = number of points per unit volume
• E.g., divide number of points in cluster by diameter or radius of the cluster
• Perhaps use a power of the radius (e.g., square or cube)
Implementation
• Naïve implementation of hierarchical clustering:
– At each step, compute pairwise distances between all pairs of clusters, then merge
– O(N³)
• Careful implementation using a priority queue can reduce the time to O(N² log N)
– Still too expensive for really big datasets that do not fit in memory
The k Means Algorithm
Best suited for:
Clustering data into groups based on feature similarity.
Exploratory data analysis to identify hidden patterns or groupings.
Situations where the number of clusters is known or can be estimated.
Example: Customer segmentation for targeted marketing campaigns.
• Assumes Euclidean space/distance
• Start by picking k, the number of clusters
• Initialize clusters by picking one point per cluster
steps
• 1) For each point, place it in the cluster whose current centroid it is nearest
• 2) After all points are assigned, update the locations of centroids of the k clusters
• 3) Reassign all points to their closest centroid
– Sometimes moves points between clusters
• Repeat 2 and 3 until convergence
– Convergence: Points don’t move between clusters and centroids stabilize
– Average falls rapidly until right k, then changes little
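A minimal NumPy sketch of the loop described above (random toy data; scikit-learn's KMeans wraps the same idea with smarter initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
k = 2
centroids = points[rng.choice(len(points), k, replace=False)]   # pick k initial points

for _ in range(100):
    # 1) assign each point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 2) recompute each centroid as the mean of its assigned points
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):   # 3) stop when centroids stabilize
        break
    centroids = new_centroids

print(centroids)
```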
picking initial k points
• Approach 1: Sampling
– Cluster a sample of the data using hierarchical clustering, to obtain k clusters
– Pick a point from each cluster (e.g. point closest to centroid)
– Sample fits in main memory
• Approach 2: Pick “dispersed” set of points
– Pick first point at random
– Pick the next point to be the one whose minimum distance from the selected points is as large as possible
– Repeat until we have k points
Complexity
• In each round, we have to examine each input point exactly once to find closest centroid
• Each round is O(kN) for N points, k clusters
• But the number of rounds to convergence can be very large!
BFR ALGORITHM
- BFR is an incremental clustering algorithm for high-dimensional data; it extends k-means and is suited to very large datasets.
- It is mainly intended for clusters whose points are approximately normally (Gaussian) distributed.
BFR divides the data points into three sets:
- Discard Set (DS): points close enough to a cluster centroid are assigned to that cluster and then discarded, represented only by summary statistics (count, mean, variance).
- Compression Set (CS): points that are close to each other but not yet close to any cluster are compressed into mini-clusters, also represented by summary statistics (count, mean, variance).
- Retained Set (RS): outliers, or points for which no suitable cluster has been found yet; they are kept for later processing.
limitation:
– Clusters are normally distributed in each dimension
– Axes are fixed – ellipses at an angle are not OK
CURE (Clustering Using REpresentatives) ALGORITHM
- CURE is a clustering algorithm based on representative points.
- It represents each cluster by several representative points rather than a single centroid, so it can handle clusters with non-spherical shapes.
– Assumes a Euclidean distance
– Allows clusters to assume any shape
– Uses a collection of representative points to representclusters
K-means vs hierarchical
Hierarchical clustering algorithms (such as agglomerative clustering) build a tree structure by repeatedly merging (or splitting) clusters. Merges are based on the similarity of the points within the clusters, so the most cohesive clusters get merged together.
- Cohesion measure: the minimum distance between clusters (single link), the maximum distance (complete link), or the average distance (average link); the choice affects the resulting clustering.
K-means clusters the data by minimizing the squared error within each cluster (the distance of each point from its cluster centroid); minimizing this error directly reflects within-cluster cohesion. K-means assigns each data point to its closest cluster centroid, keeping the points within a cluster as tight as possible.
- Cohesion measure: K-means minimizes the sum of squared distances from each point to its cluster centroid.
Lecture11 Recommendation Systems
Overview
types of recommendations
• Editorial and hand curated
– List of favorites
– Lists of “essential” items
• Simple aggregates
– Top 10, Most Popular, Recent Uploads
• Tailored to individual users
– Amazon, Netflix, TikTok…
Formal Model**
• X = set of Customers
• S = set of Items
Utility function u: X × S → R
– R = set of ratings
– R is a totally ordered set
– e.g., 0-5 stars, real number in [0,1]
problems:
• (1) Gathering “known” ratings for matrix
– How to collect the data in the utility matrix
• (2) Extrapolate unknown ratings from the known ones
– Mainly interested in high unknown ratings
• We are not interested in knowing what you don’t like but what you like
• (3) Evaluating extrapolation methods
– How to measure success/performance of recommendation methods
Three approaches to recommender systems↓
Content-based Recommender Systems
Content-based filtering focuses on the attributes of the items themselves (e.g., genres, tags) and the user's past preferences for specific item types.
Main idea: Recommend items to customer x similar to previous items rated highly by x
e.g
• Movie recommendations
– Recommend movies with same actor(s), director, genre, …
• Websites, blogs, news
– Recommend other sites with “similar” content
Item Profiles
For each item, create an item profile
• Profile is a set (vector) of features
– Movies: author, title, actor, director,…
– Text: Set of “important” words in document
• How to pick important features?
– Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Doc Frequency)
• Term … Feature
• Document … Item
TF-IDF identifies words that are important to a specific document but not frequent across all documents.
(See the TF-IDF section earlier in these notes.)
User Profiles
User profile possibilities:
– Weighted average of rated item profiles
– Variation: Normalize weights using average rating of user
– More sophisticated aggregations possible…
e.g. 1: Boolean Utility Matrix
• Items are movies, only feature is "Actor”
– Item profile: vector with 0 or 1 for each Actor
• Suppose user x has watched 5 movies
– 2 movies featuring actor A
– 3 movies featuring actor B
• User profile = mean of item profiles
– Feature A's weight = 2/5 = 0.4
– Feature B's weight =3/5=0.6
e.g. 2: Star Ratings
• Same example, 1-5 star ratings
– Actor A's movies rated 3 and 5
– Actor B's movies rated 1, 2 and 4
• Useful step: Normalize ratings by subtracting user's mean rating (3)
– Actor A's normalized ratings = 0, +2
Profile weight = (0 + 2)/2 = 1
– Actor B's normalized ratings = -2, -1, +1
Profile weight = (-2-1+1)/3 = -2/3
Predictions
Prediction heuristic
Pro & Con
Pro:
• +: No need for data on other users
– No cold-start or sparsity problems
• +: Able to recommend to users with unique tastes
• +: Able to recommend new & unpopular items
– No first-rater problem
• +: Able to provide explanations
– Can provide explanations of recommended items by listing content-features that caused an item to be recommended
Con:
• –: Finding the appropriate features is hard
– E.g., images, movies, music
• –: Overspecialization
– Never recommends items outside user’s content profile
– People might have multiple interests
– Unable to exploit quality judgments of other users
• –: Recommendations for new users
– How to build a user profile?
Collaborative Filtering
Collaborative filtering focuses on the relationships and patterns in user interactions (e.g., who liked what) and doesn’t require information about the content of the items.
Recommend based on similar users.
User-user collaborative filtering
• Consider user x
• Find set N of other users whose ratings are “similar” to x’s ratings
• Estimate x’s ratings based on ratings of users in N
FINDING “SIMILAR” USERS
• Consider users x and y with rating vectors rx and ry
• We need a similarity metric sim(x, y)
• Capture intuition that sim(A,B) > sim(A,C)
Jaccard Similarity
sim(A,B) = |rA ∩ rB| / |rA ∪ rB|
sim(A,B) = 1/5
sim(A,C) = 2/4
* Problem: ignores the rating values!
(It simply cannot take the actual scores into account.)
cosine similarity
• sim(A, B) = cos(rA, rB)
sim(A, B) = 4*5/√[(4^2+5^2+1^2)*(5^2+5^2+4^2) ]=0.38
sim(A, C) = 0.32
sim(A, B) > sim (A, C), but not by much
Problem: treats missing ratings as negative
centered cosine similarity
2/3 = 4 − 10/3, 5/3 = 5 − 10/3, ... (i.e., subtract the user's mean rating from each raw rating)
sim(A, B) = cos(rA, rB) = 0.09; sim(A, C) = -0.56
sim(A,B) > sim(A,C)
Captures intuition better
Missing ratings treated as "average"
Handles "tough raters" and "easy raters"
Also known as Pearson Correlation Coefficient
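A small NumPy sketch of centered cosine (Pearson) similarity between users, where 0 marks a missing rating; the toy vectors are chosen to reproduce the 0.09 and -0.56 values quoted above:

```python
import numpy as np

def centered_cosine(ra, rb):
    a = np.where(ra > 0, ra - ra[ra > 0].mean(), 0.0)   # center only the observed ratings
    b = np.where(rb > 0, rb - rb[rb > 0].mean(), 0.0)   # missing entries stay at 0 ("average")
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

user_a = np.array([4, 0, 0, 5, 1, 0, 0])
user_b = np.array([5, 5, 4, 0, 0, 0, 0])
user_c = np.array([0, 0, 0, 2, 4, 5, 0])
print(centered_cosine(user_a, user_b))   # ~ 0.09
print(centered_cosine(user_a, user_c))   # ~ -0.56
```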
Rating Predictions
Item-item collaborative filtering
– For item i, find other similar items
– Estimate rating for item i based on ratings for similar items
– Can use same similarity metrics and prediction functions as in user-user model
In practice, it has been observed that item-item often works better than user-user because Items are simpler, users have multiple tastes
e.g item-item
Here we use Pearson correlation as similarity:
Idea: fix the users and compute similarities between the items.
1) Subtract the mean rating mi from each movie i
Item-Item: subtract each user's mean rating from that user's ratings (removes user rating bias). User-User: subtract each item's mean rating from its ratings (removes the item's baseline bias).
m1 = (1+3+5+5+4)/5 = 3.6
Row 1: subtract 3.6 from every entry = [-2.6, NaN, -0.6, NaN, NaN, 1.4, NaN, NaN, 1.4, NaN, 0.4, NaN]
2) Compute cosine similarities between the rows (items)
S(r1,r3) = 5+20+20/√(76*84)= 0.41
S(r1,r6) = 0.59
3) Predict by taking weighted average:
(i.e., weight the related ratings by the similarities computed above)
rating (1,5) = (0.41*2 + 0.59*3) / (0.41+0.59) = 2.6
(The 2 and 3 above are the same user's ratings of the other two items.)
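A tiny sketch of the weighted-average prediction step, using the similarities and ratings from the example above:

```python
# Predict the missing rating of movie 1 from the two most similar movies
sims    = [0.41, 0.59]   # similarity of movie 1 to the two neighbor movies
ratings = [2, 3]         # the same user's ratings of those neighbor movies

prediction = sum(s * r for s, r in zip(sims, ratings)) / sum(sims)
print(prediction)        # 2.59, i.e. about 2.6
```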
Pro & Con
• + Works for any kind of item
– No feature selection needed
• - Cold Start:
– Need enough users in the system to find a match
• - Sparsity:
– The user/ratings matrix is sparse
– Hard to find users that have rated the same items
• - First rater:
– Cannot recommend an item that has not been previously rated
– New items, Esoteric items
• - Popularity bias:
– Cannot recommend items to someone with unique taste
– Tends to recommend popular items
Evaluating Recommendation Systems
Evaluating predictions
PROBLEMS WITH ERROR MEASURES
• Narrow focus on accuracy sometimes
misses the point
– Prediction Diversity
– Prediction Context
– Order of predictions
• In practice, we care only to predict high ratings:
– RMSE might penalize a method that does well for high ratings and badly for others
– Alternative: precision at top k – the percentage of predictions that fall in the user's top k withheld ratings