### Pure Mathematics Foundation

**Basic Objects**
- Scalar: Single number (5, -2.3)
- Vector: Ordered list of numbers [1, 3, -2]
- Matrix: Rectangular array of numbers with rows and columns

**Strict Mathematical Rules**
- Addition: Only between same-size objects
  - Matrix + Matrix: Must have identical dimensions (2×3 + 2×3 ✓, 2×3 + 3×2 ✗)
  - Vector + Vector: Must have same length ([1,2] + [3,4] ✓)
  - Matrix + Vector: UNDEFINED in pure math

- Multiplication:
  - Scalar × Anything: Multiply every element
  - Vector × Vector (dot product): Same length/dimension required, produces scalar
    - `a · b = b · a` (commutative)
  - Matrix × Matrix: Inner dimensions must match (m×n) × (n×p) → (m×p)
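    - Entry `C[i][j]` of `C = AxB` is the dot product of row `i` of `A` with column `j` of `B`
    - In general `AxB != BxA` (not commutative); the reversed product may not even be defined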
    - `AxB` in pure math is only defined for 2D matrices.
  - Matrix × Vector: Number of columns in the matrix = number of elements in the vector, (2×3) × (3×1) → (2×1)

- Division:
  - Vector ÷ Vector: NOT DEFINED in pure mathematics
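
The shape rules above can be checked with a short sketch. It assumes NumPy is available; the arrays `A`, `B`, `C`, `a`, `b` are made-up illustration values, and `@` is NumPy's operator for both the dot product and the matrix product.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[10, 20, 30],
              [40, 50, 60]])     # shape (2, 3)
C = np.array([[1, 2],
              [3, 4],
              [5, 6]])           # shape (3, 2)

# Addition: identical shapes add element by element.
same_shape_sum = A + B

# Addition: (2, 3) + (3, 2) is rejected with a ValueError.
try:
    A + C
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

# Dot product: same length in, one scalar out, and the order does not matter.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot_ab = a @ b                   # 1*4 + 2*5 + 3*6
dot_ba = b @ a

# Matrix x Matrix: (2, 3) @ (3, 2) -> (2, 2).
# Entry [i, j] is row i of A dotted with column j of C.
AC = A @ C
CA = C @ A                       # (3, 2) @ (2, 3) -> (3, 3): even the shape differs

# Matrix x Vector: (2, 3) @ (3,) -> (2,)
Ab = A @ b

print(same_shape_sum, mismatch_rejected, dot_ab, AC.shape, CA.shape, Ab)
```

Note that NumPy (like PyTorch) will accept `A + b` for a (2, 3) matrix and a length-3 vector through broadcasting, which is a library convenience rather than a pure-math rule.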


### Fundamental Meaning
- A vector is an arrow in space that tells you both which way to go and how far to travel.
- It has direction and magnitude: anything that can be described as `how much` and `what direction` can be a vector.
- [3, 4] means go 3 steps in the `x` direction and 4 steps in the `y` direction.
- Operations with vectors
  - `a + b` combines the movement of both vectors: if I move along `a`, then along `b`, I end up at `c = a + b`.
  - `a · b` is the `dot product`, a measure of similarity based on the direction the vectors point in, weighted by their magnitudes.
- A matrix is a way to write transformation rules for vectors.
- `A x b` means we use the transformation rules in matrix `A` to move vector `b`.
  - How are we moving `b`?
    - It depends on what `A` represents.
    - If (n, n) x (n,) = (n,), the number of dimensions is preserved, and one of two things happened:
      - We kept the vector in its coordinate system and just moved it around (rotated, scaled, reflected).
      - Or, depending on what `A` represents, we kept the dimension count but moved the vector to a completely new coordinate system (`projected`, in ML terms).
    - If (m, n) x (n,) = (m,) with m ≠ n, the vector lands in a space with a different number of dimensions, so it is definitely in a new coordinate system.
    - **Example**
      > Imagine you have data about students with three features: hours studied, hours slept, and number of practice problems completed. This is a vector in 3-dimensional space, like [8, 7, 15]. Now suppose you want to predict two outcomes: their test score and their stress level. You might use a transformation matrix that takes your 3-dimensional input and produces a 2-dimensional output.
- But wait, what about a matrix as a collection of vectors instead?
  - Yes, that is also true: a matrix is just a collection of numbers, and you decide what they represent.
  - If the rows are treated individually, then it is a collection of vectors.
- In `Q @ K^T`, both matrices are collections of vectors: each row vector of `Q` computes its dot-product similarity with every row vector of `K`.
- But `W_Q @ X` is a transformation, where `W_Q` moves `X` into a new space to produce `Q`.
  - Fun fact: `W_Q @ X` is the column-vector convention common in math, but ML usually stores one sample per row and writes `X @ W_Q` instead (with `W_Q` shaped accordingly, i.e. transposed).
- So what if `A` and `B` are matrices that don't represent lists of vectors (as they do in attention) but transformations?
  - `A @ B = C`, where `C` means: apply transformation `B`, then apply `A`.
  - This is what basic neural networks (FFN/MLP) do all the time.
    - input → `W1` → intermediate → `W2` → output
    - Each `W` matrix learns to project its input into a space where some pattern becomes more visible.
    - In `W2 @ W1`, `W1` learns to create representations that are useful for `W2`, and `W2` learns how to use those representations.
    - In practice a non-linear activation sits between `W1` and `W2`; without it, `W2 @ W1` would collapse into a single linear transformation.
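
The ideas in this section can be sketched in a few lines of NumPy. All matrices here are illustrative assumptions: `R` is a standard 90-degree rotation, the weights in `W` for the student example are invented (not fitted to any data), and `Q`/`K` are tiny hand-written stand-ins for attention inputs.

```python
import numpy as np

# Square matrix: (2, 2) @ (2,) -> (2,). Rotate 90 degrees counter-clockwise.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
v = np.array([3.0, 4.0])
rotated = R @ v                          # same number of dimensions, same length

# Rectangular matrix: (2, 3) @ (3,) -> (2,). Three features in, two outcomes out.
student = np.array([8.0, 7.0, 15.0])     # hours studied, hours slept, practice problems
W = np.array([[5.0,  2.0, 1.0],          # row 0 -> hypothetical "test score" weights
              [1.0, -2.0, 0.5]])         # row 1 -> hypothetical "stress level" weights
outcomes = W @ student                   # column-vector convention

# Row-vector (ML) convention: the same numbers, written as x @ W.T.
row_convention = student @ W.T

# Composition: (A @ B) @ v applies B first, then A.
A = np.array([[2.0, 0.0],
              [0.0, 2.0]])               # scale by 2
B = R                                    # rotate by 90 degrees
step_by_step = A @ (B @ v)
composed = (A @ B) @ v

# Matrices as collections of row vectors: every pairwise dot product at once.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0]])
scores = Q @ K.T                         # shape (2, 3); scores[i, j] = Q[i] dot K[j]

print(rotated, outcomes, composed, scores, sep="\n")
```

The same matrix product plays two roles here: `W @ student` is read as a transformation, while `Q @ K.T` is read as a table of similarities between rows.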

### ML Changes Linear Algebra a Little

- Pure-math matrix multiplication is written as `@`
  - Same meaning as in pure math: it represents linear transformations.
- Element-wise multiplication is `*` (in math this is the Hadamard product, which standard linear algebra rarely uses)
  - Used to scale values or to mask them.
- Element-wise `+` and `-` remain the same, except that we have broadcasting.
  - In ML, `/` is also an element-wise operation (matrix or vector division is not defined in pure math).
  - Broadcasting is a practical convenience in ML libraries: if the shapes don't match, the library (e.g. PyTorch) follows defined rules to stretch size-1 or missing dimensions until they do.
- Math matrices are strictly 2D; ML libraries call 3D, 4D, and higher-dimensional arrays tensors instead.
- Matmul `@` on tensors still follows the (m, n) x (n, p) = (m, p) rule, so the inner dimensions must match.
  - What does this mean in 3D, 4D?
  - The last 2 dimensions do the magic here.
  - (..., m, n) x (..., n, p) = `(..., m, p)`
  - What about (...)? These are batch dimensions and they are broadcast: at each dim `i`, the sizes must be equal or one of them must be 1, and the result takes max(Ai_size, Bi_size).
  - PyTorch can broadcast here as well, as long as the rule for the last 2 dimensions holds.
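
A minimal sketch of these differences, written with NumPy as an assumption for portability; PyTorch tensors follow the same shape rules for `*`, `@`, and broadcasting. The array values and shapes are arbitrary illustrations.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

elementwise = A * B          # multiply matching entries
matmul = A @ B               # rows of A dotted with columns of B

# Broadcasting: (2, 2) + (2,) stretches the vector across every row.
row = np.array([100, 200])
broadcast_sum = A + row

# Batched matmul: only the last two dimensions follow (m, n) @ (n, p) -> (m, p).
# The leading batch dimensions (4, 1) and (5,) broadcast to (4, 5).
X = np.ones((4, 1, 2, 3))
Y = np.ones((5, 3, 6))
batched = X @ Y              # shape (4, 5, 2, 6)

# Batch sizes that differ and are not 1 cannot be broadcast.
try:
    np.ones((4, 2, 3)) @ np.ones((5, 3, 6))
    batch_mismatch_rejected = False
except ValueError:
    batch_mismatch_rejected = True

print(elementwise, matmul, broadcast_sum, batched.shape, batch_mismatch_rejected, sep="\n")
```

The last case shows why "take the max" is only half of the batch rule: sizes 4 and 5 have a max, but neither is 1, so the operation fails.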

### Similarity Operations

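**Dot product:** "how much do these vectors point in the same direction, weighted by how large they are?"
- Same direction: high (positive) dot product; perpendicular: zero; opposite directions: negative.
- Both direction and magnitude matter: `a · b = |a| |b| cos(θ)`, so a much longer vector inflates the absolute value of the dot product even when the angle is unchanged.
- Ranges from -inf to +inf.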

**Cosine similarity:** removes the magnitude bias and focuses only on the direction (angle) of the vectors. Ranges over [-1, 1].
- For example, in retrieval we care whether two texts are about the same topic; the magnitude of their vectors is much less relevant.
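
To make the contrast between the two measures concrete, here is a small sketch assuming NumPy. The helper `cosine_similarity` is written here for illustration (it is not a library function), and the vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by both lengths, so only the angle remains."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 4.0])
same_direction = 10 * a                  # same angle, ten times longer
perpendicular = np.array([-4.0, 3.0])
opposite = -a

dot_small = a @ a                        # grows with magnitude
dot_large = a @ same_direction           # ten times larger, same direction
cos_small = cosine_similarity(a, a)
cos_large = cosine_similarity(a, same_direction)   # unchanged by the scaling

dot_perp = a @ perpendicular             # perpendicular vectors give zero
cos_opposite = cosine_similarity(a, opposite)      # opposite directions give -1

print(dot_small, dot_large, cos_small, cos_large, dot_perp, cos_opposite)
```

Scaling one vector by 10 multiplies the dot product by 10 but leaves the cosine similarity untouched, which is the "magnitude bias" being removed.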

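**Scaled dot product:** raw dot product divided by `sqrt(d)`, where `d` is the vector dimension
- In high-dimensional spaces the dot product tends to become very large and to have high variance (for independent zero-mean, unit-variance components the variance is `d`).
- The scaling also matters because of the subsequent `softmax` in attention: large dot products saturate `softmax`, pushing almost all weight onto one entry and shrinking gradients.
- This is mostly an `attention` thing.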
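
The effect of the `sqrt(d)` scaling can be observed empirically. The sketch below assumes NumPy and random standard-normal vectors as stand-ins for queries and keys; `d`, `n`, the seed, and the local `softmax` helper are illustrative choices, not values from any real model.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
d = 256                                  # vector dimension (assumed)
n = 5000                                 # number of random query/key pairs (assumed)
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))

raw = np.sum(q * k, axis=1)              # one dot product per row pair
scaled = raw / np.sqrt(d)

raw_var = raw.var()                      # lands near d
scaled_var = scaled.var()                # lands near 1

# Softmax over a handful of scores: raw scores are far more peaked than scaled ones.
weights_raw = softmax(raw[:8])
weights_scaled = softmax(scaled[:8])

print(raw_var, scaled_var)
print(weights_raw.max(), weights_scaled.max())
```

Dividing by `sqrt(d)` divides the variance by `d`, which keeps the scores in a range where `softmax` still spreads weight across several entries.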

- [ ] Prove all of this in code with pytorch and numpy
- [ ] Difference between `*` and `@`
- [ ] Does `nn.Linear` apply `.T` to the input (or to its weight)? Why?
  - [ ] Explore further operations with pytorch layers
- [ ] Find someone who can explain the geometric meaning of linear algebra, ideally in relation to ML
- [ ] Check/fix projection exercise based on updated understanding
