Skip to main content

KDD & Data Characterization

一些散的知识点

pre-processing

  • Data Cleaning
  • Data Integration
  • Data Selection
  • Data Transformation

Characterizing Data

  • Data type: quantitative, qualitative, or a complex mix
  • Size of dataset
  • Data dimensionality
  • Quality of data: presence of missing, erroneous, or redundant values

Data type

Quantitative / Numerical (定量的) [一级形容]

  • Number
  • Distance between values is meaningful
    1. Discrete: Integers (离散的)
    2. Continuous (连续的) [二级形容]

Qualitative / Categorical

  • Names, labels, or categories
  • Distance between values is not meaningful
    1. Nominal: cannot be sorted
    2. Ordinal: ordering is meaningful

eg. Height is a Quantitative Continuous variable.

Numerical Encoding of Categorical Features (定性数据的定量编码)

  • One-hot:
    Create a binary column for each unique category/value. Place a 1 in the column for the relevant category, and a 0 in all others.
  • Dummy:
    Same as one-hot, but drop one of the columns. 【避免了线性依赖线性模型中带来的影响】

//------------------

  • Binary:
    Only True/False, 1/0.

//------------------

  • Ordinal:
    Assign numerical values based on natural ordering, starting with 1 (sometimes starting with 0)
  • Label:
    Each unique category/value is assigned a unique integer that does not reflect any ordered relationship.
    NOTE: Uppon two are both sometimes referred to as "Integer encoding"

//------------------

  • Target:
    Average value.

Curse of Dimensionality

What is "BIG" data?

  • A data point is a single object or sample in your set.

  • The dimensionality is the number of features.

  • Complex mix of data and data types.

    简单来说:
    1. 数据点代表的单位小 -> 数据量大
    2. 特征多 -> 维度高
    3. 复杂
    => 大数据

Curse of Dimensionality

维度诅咒 (Curse of Dimensionality) 是指随着数据特征(维度)数量的增加,数据分析和模型训练变得越来越复杂甚至失效的一种现象。
主要体现在:

  1. 数据稀疏性:高维下,距离↑
  2. 计算复杂度
  3. 距离意义失效: d(x,y)=i=1n(xiyi)2d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} 随着维度增加(n变大),无关特征的占比增大。
  4. 过拟合

Distance Computations

Definition of Norm

In mathematics, a norm is a function f: V → R that satisfies the following properties:

  1. Positivity:
    f(x)>0f(x) > 0, for all x in V, and f(x) = 0 if and only if x = 0.
  2. Triangle Inequality:
    f(x+y)f(x)+f(y)f(x + y) ≤ f(x) + f(y), for all x, y in V.
  3. Positive Homogeneity:
    f(αx)=αf(x)f(αx) = |α|f(x), for all x in V and α in R.
    Which means that once the vector is scaled, the norm is scaled by the same factor.

Distance Metrics

(1-Norm) Manhattan Distance:

d(x,y)=i=1nxiyid(x, y) = \sum_{i=1}^{n} |x_i - y_i|

其中,x, y是两个点,n则是维度。
例如A(1, 2), B(3, 4)的曼哈顿距离为:

d(A,B)=13+24=4d(A, B) = |1-3| + |2-4| = 4

(2-Norm) Euclidean Distance:

d(x,y)=i=1n(xiyi)2d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}

=> 两者都是Minkowski Distance的特例

Minkowski Distance (p-Norm / Lp Norm):

d(x,y)=(i=1nxiyip)1pd(x, y) = (\sum_{i=1}^{n} |x_i - y_i|^p)^{\frac{1}{p}}

可知,无穷范数(∞-Norm):

d(x,y)=maxixiyid(x, y) = \max_{i} |x_i - y_i|

这是因为遵守p-Norm的定义,当p趋近于无穷时,范数逐渐接近向量中最大分量的绝对值(指数爆炸嘛)