Internals of Neural Networks

Introduction

Neural networks power modern AI systems, transforming input data into meaningful predictions. In this chapter, we explore their internal structure, how they process information, and how they are trained and used for inference. We will focus specifically on Transformers, one of the most widely used architectures in natural language processing (NLP).

1. Input Representation and Context Length

Neural networks take inputs in the form of sequences of tokens, which may be words, subwords, or even characters. The number of tokens processed in one pass is limited by a predefined context length, which determines how much information the model can consider at a time. In principle a model could be given arbitrarily long sequences, but memory and compute costs (self-attention scales quadratically with sequence length) force the context length into a feasible range, such as 8,000 tokens in many modern Transformer models.
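
To make this concrete, the toy snippet below illustrates the idea of a context window. The whitespace "tokenizer" and the context length of 8 are deliberately simplistic stand-ins; real models use learned subword tokenizers and context lengths in the thousands.

    CONTEXT_LENGTH = 8  # deliberately tiny; real models allow thousands of tokens

    def tokenize(text: str) -> list[str]:
        # Simplistic whitespace "tokenizer" standing in for a learned subword tokenizer.
        return text.lower().split()

    tokens = tokenize("The quick brown fox jumps over the lazy dog near the old mill")
    print(len(tokens), tokens)

    # Only the most recent CONTEXT_LENGTH tokens fit into a single pass;
    # anything earlier falls outside the model's context window.
    context = tokens[-CONTEXT_LENGTH:]
    print(context)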

2. The Role of Parameters in Neural Networks

A neural network consists of a large set of parameters (also called weights) that define how input data is transformed into outputs. These parameters are initially set randomly, leading the model to make arbitrary predictions. However, through a process called training, the parameters are iteratively adjusted to align with meaningful patterns in data.
Think of parameters as the knobs on a DJ mixing console: adjusting them changes how the model processes and interprets its inputs.
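
As a rough illustration (the sizes and values below are arbitrary, not taken from any real model), a single layer is nothing more than a set of parameters applied to an input, and changing the parameters changes the output:

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(size=(3, 2))   # randomly initialized parameters ("knobs")
    bias = np.zeros(2)

    x = np.array([1.0, 0.5, -0.2])      # some input vector

    # With random parameters the output is arbitrary.
    print("before adjustment:", x @ weights + bias)

    # Training nudges the parameters; any change to them changes the output.
    weights += 0.1
    print("after adjustment: ", x @ weights + bias)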

3. Mathematical Operations in Neural Networks

At its core, a neural network is a mathematical function that transforms inputs into outputs using a series of operations, such as:

  • Multiplication (scaling inputs with weights)
  • Addition (combining weighted inputs)
  • Exponentiation (e.g., in activation functions)
  • Normalization (adjusting values for stability)
  • Softmax (converting outputs into probabilities)

The architecture of the neural network dictates how these operations are arranged, ensuring efficiency, expressiveness, and optimizability.
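
The snippet below walks through these operations on small vectors using NumPy; the numbers are arbitrary and serve only to show each operation in isolation.

    import numpy as np

    x = np.array([1.0, 2.0, -0.5])   # inputs
    w = np.array([0.4, -0.3, 0.8])   # weights

    scaled = x * w                    # multiplication: scale inputs by weights
    combined = scaled.sum() + 0.1     # addition: combine weighted inputs (plus a bias)
    print(combined)

    logits = np.array([2.0, 1.0, 0.1])
    normalized = (logits - logits.mean()) / logits.std()   # normalization for stability
    print(normalized)

    exp = np.exp(logits - logits.max())   # exponentiation (shifted for numerical stability)
    softmax = exp / exp.sum()             # softmax: outputs now sum to 1, like probabilities
    print(softmax, softmax.sum())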

4. The Transformer Architecture

One of the most advanced and widely used neural network architectures is the Transformer, which excels at handling sequential data, such as text.

A. Structure of a Transformer

A Transformer consists of several key components:

  • Token Embeddings: Convert input tokens into numerical vectors.
  • Self-Attention Mechanism: Determines how different tokens relate to each other.
  • Layer Normalization: Ensures stable training by normalizing activations.
  • Feedforward Layers (MLP Block): Apply transformations that refine each token's representation.
  • Output Layer (Softmax): Predicts the next token in a sequence based on learned patterns.
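
Self-attention is the component most specific to the Transformer, so here is a minimal single-head sketch in NumPy. The dimensions and the random projection matrices are purely illustrative, not taken from any real model.

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model = 4, 8                      # 4 tokens, 8-dimensional embeddings

    x = rng.normal(size=(seq_len, d_model))      # token embeddings (from the embedding table)

    W_q = rng.normal(size=(d_model, d_model))    # learned projection matrices
    W_k = rng.normal(size=(d_model, d_model))
    W_v = rng.normal(size=(d_model, d_model))

    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # queries, keys, values

    scores = Q @ K.T / np.sqrt(d_model)          # how strongly each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row

    attended = weights @ V                       # each token becomes a weighted mix of all values
    print(attended.shape)                        # (4, 8): same shape, now context-aware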

B. Information Flow in a Transformer

  • Input tokens are first embedded into numerical vectors.
  • These embeddings are passed through self-attention layers, allowing the model to focus on relevant parts of the input.
  • The output of self-attention is processed through feedforward layers, refining the representation.
  • The final layer produces logits, which are converted into probabilities to determine the next token prediction.
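
Putting the flow together, the sketch below pushes a few token ids through one heavily simplified block: an embedding lookup, a single-head attention mix without separate projections, a feedforward layer, and a final softmax over the vocabulary. All sizes and parameter values are made up, and a real Transformer stacks many such blocks with layer normalization and residual connections.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, d_model, d_ff = 50, 16, 32

    embedding = rng.normal(size=(vocab_size, d_model))   # token embedding table
    W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
    W_out = rng.normal(size=(d_model, vocab_size))       # maps back to vocabulary logits

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    token_ids = np.array([3, 17, 42])          # 1. input tokens (ids into the vocabulary)
    h = embedding[token_ids]                   #    embedded into numerical vectors

    attn = softmax(h @ h.T / np.sqrt(d_model)) # 2. self-attention weights (simplified single head)
    h = attn @ h                               #    each position mixes in relevant context

    h = np.maximum(0, h @ W1) @ W2             # 3. feedforward (MLP) block refines each position

    logits = h[-1] @ W_out                     # 4. logits for the next token, from the last position
    probs = softmax(logits)
    print("predicted next token id:", probs.argmax())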

5. Training a Neural Network

Training a neural network involves finding the best parameter values to minimize prediction errors. The key steps include:

  1. Forward Pass: Compute predictions based on current parameter values.
  2. Loss Calculation: Compare predictions with actual outputs to measure error.
  3. Backward Pass and Update: Compute gradients of the loss with respect to each parameter (backpropagation), then adjust the parameters with an optimizer such as Stochastic Gradient Descent (SGD) or Adam to reduce the error.
  4. Iteration: Repeat these steps over millions of examples until the model generalizes well to unseen data.
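
The loop below shows these steps on a deliberately tiny problem (fitting y = 2x + 1 with two parameters). Real networks have millions or billions of parameters and rely on automatic differentiation, but each training step has the same shape: forward pass, loss, gradients, update, repeat.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + 1.0                      # "ground truth" data the model should learn

    w, b = 0.0, 0.0                        # initial parameter values
    lr = 0.1                               # learning rate

    for step in range(200):                # 4. iterate over many steps
        pred = w * x + b                   # 1. forward pass: compute predictions
        loss = ((pred - y) ** 2).mean()    # 2. loss: how wrong the predictions are

        grad_w = 2 * ((pred - y) * x).mean()   # 3. backward pass: gradients of the loss
        grad_b = 2 * (pred - y).mean()

        w -= lr * grad_w                   #    gradient descent update
        b -= lr * grad_b

    print(f"learned w={w:.2f}, b={b:.2f}, loss={loss:.5f}")   # approaches w=2, b=1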

6. Inference: Generating New Data

Once a neural network is trained, it can be used for inference, meaning it generates new data based on learned patterns.

How Inference Works:

  1. Input Processing:
    A sequence of tokens is provided to the trained model.
  2. Prediction:
    The model computes probabilities for the next token.
  3. Sampling:
    The next token is selected based on probabilities (e.g., greedy decoding, beam search, or temperature-controlled sampling).
  4. Iteration:
    The chosen token is appended to the input, and steps 2-3 are repeated until the desired sequence length is reached or a stop token is produced.

For example, given the input "Once upon a time," the model might predict the next word as "there" and continue generating a coherent story.
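
The sketch below shows that generation loop in code. The toy_next_token_probs function is a stand-in for a trained model's forward pass (it returns made-up probabilities over a made-up vocabulary), so only the sampling loop itself should be taken literally.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["once", "upon", "a", "time", "there", "was", "princess", "dragon", "."]

    def toy_next_token_probs(tokens: list[str]) -> np.ndarray:
        # Pretend forward pass: random logits turned into a probability distribution.
        logits = rng.normal(size=len(vocab))
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def generate(prompt: list[str], max_new_tokens: int = 5, temperature: float = 0.8) -> list[str]:
        tokens = list(prompt)
        for _ in range(max_new_tokens):
            probs = toy_next_token_probs(tokens)        # model predicts next-token probabilities
            logits = np.log(probs) / temperature        # temperature reshapes the distribution
            e = np.exp(logits - logits.max())
            probs = e / e.sum()
            next_token = rng.choice(vocab, p=probs)     # sample (greedy decoding would take argmax)
            tokens.append(next_token)
        return tokens

    print(generate(["once", "upon", "a", "time"]))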