Tuesday, January 6, 2026
Hugging Face
PyTorch
- A toolbox for building AI models
- For building and training neural networks
- Tensors: Like arrays but with GPU acceleration
- Autograd: Automatic differentiation for backpropagation
- torch.nn: Tools for building deep learning models
### Text Generation
- You provide a <b>starting prompt</b> and the model continues writing in a similar style
- max_new_tokens: controls the length of the generated text
- temperature: how <b>creative</b> you want it to be, from ~0.2 (focused) to 1+ (experimental)
- num_return_sequences: Generate multiple different continuations from the same prompt
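A minimal sketch of these parameters with the `pipeline` API; `distilgpt2` is just a small illustrative model choice, not a recommendation:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

outputs = generator(
    "Once upon a time",
    max_new_tokens=30,       # length of the generated continuation
    temperature=0.7,         # lower = more focused, higher = more creative
    num_return_sequences=2,  # two different continuations of the same prompt
    do_sample=True,          # sampling must be on for temperature to matter
)

for out in outputs:
    print(out["generated_text"])
```

Note that `do_sample=True` is required: with greedy decoding there is only one possible continuation, so `temperature` and `num_return_sequences` would have no effect.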
### Zero-Shot Classification
The magic of not needing any examples
- Classifying text into categories you define on the fly, <b>without</b> needing to train a model on those specific labels
- The model uses <b>Natural Language Inference (NLI)</b> to see which user-provided label is the <b>most plausible hypothesis</b> for the given text
- `multi_label=True` allows the model to assign multiple categories to a single text
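A sketch of zero-shot classification; the model ID below is one small NLI-tuned checkpoint chosen for illustration:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="typeform/distilbert-base-uncased-mnli")

result = classifier(
    "The new graphics card renders 4K games at 120 fps.",
    candidate_labels=["technology", "cooking", "politics"],
    multi_label=False,  # set True to allow several labels per text
)
print(result["labels"][0])  # the label the model finds most plausible
```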
### Fill-Mask
A window into the model's mind
- Predicting the most likely words to fill a <b>mask</b> token in a sentence
- The fundamental training method for models like <b>BERT</b> (Masked Language Modeling)
- It teaches the model about word relationships and context
- Unlike <b>text generation</b> models, which only look left, these models look at the context on both the left and right of the mask to make a prediction
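A fill-mask sketch; `distilbert-base-uncased` is an illustrative choice, and its mask token is `[MASK]`:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="distilbert-base-uncased")

predictions = unmasker("The capital of France is [MASK].", top_k=3)
for p in predictions:
    print(p["token_str"], round(p["score"], 3))  # candidate word and its probability
```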
### Summarization
The TL;DR generator
- The model reads a <b>long piece of text</b> and generates a <b>new, shorter version</b> that captures the main points
- It <b>rewrites</b> the content rather than just extracting sentences
- `min_length` & `max_length` set the desired length boundaries for your summary
- <b>Key trade-off</b>: brevity versus information retention. A shorter summary is faster to read but may lose more nuance
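A summarization sketch; `t5-small` is a lightweight illustrative model choice:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")

article = (
    "The James Webb Space Telescope has captured detailed images of distant "
    "galaxies, giving astronomers new insight into how the earliest stars "
    "formed. Researchers say the data will take years to fully analyze."
)
# min_length / max_length bound the summary length (in tokens)
summary = summarizer(article, min_length=10, max_length=30)
print(summary[0]["summary_text"])
```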
### Named Entity Recognition (NER)
Finding people, places and things
- Automatically identifying and classifying named entities in text
- PER: Person
- ORG: Organization
- LOC: Location
- <b>Crucial Parameter</b>: `aggregation_strategy="simple"` ensures that multi-token entities (like "Barack Obama") are grouped together as a single entity
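An NER sketch; `dslim/bert-base-NER` is one commonly used checkpoint, picked here for illustration:

```python
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER",
               aggregation_strategy="simple")  # groups multi-token entities

entities = ner("Barack Obama visited Microsoft headquarters in Redmond.")
for e in entities:
    # entity_group is one of PER / ORG / LOC / MISC for this model
    print(e["word"], e["entity_group"], round(e["score"], 3))
```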
### Translation
Translating text from a source language to a target language
- The model name itself often specifies the language pair. The <b>Helsinki-NLP group</b> has published many high-quality models for this
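A translation sketch; the opus-mt model name encodes the language pair (here English → French):

```python
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

result = translator("Hello, how are you?")
print(result[0]["translation_text"])
```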
---
Transformers pipelines bundle together all the necessary steps to go from raw text to a structured, understandable output:
- Preprocessing
- Model inference
- Post-processing
"A pipeline is an abstraction that handles multiple steps in processing text, including tokenization, turning text into embeddings, feeding it to a model, and converting results back to text. It simplifies the complex process of working with machine learning models"
"Key considerations when selecting a machine learning model include model size, speed, complexity, language support, accuracy, and specific task requirements. Trade-offs exist between more sophisticated models with advanced capabilities and simpler, faster models."
---
"Zero shot classification is a technique where the model can classify into categories without being specifically trained on those categories, using the model's existing language understanding to match text with potential labels"
---
"BERT models look both left and right in a text sequence to understand context, while GPT models primarily look backward when generating text"
"Fill Mask attempts to statistically guess the most likely word to fill a blank in a sentence by examining the context of words before and after the masked token"
---
Named entity recognition, in the context of natural language processing, is a technique that extracts and categorizes nouns from text, identifying entities like people, locations, and organizations and labeling them with their specific type
---
### Tokenization
Tokenization is the process of converting human-readable text into a sequence of numbers that a neural network can understand
- Split, break the text down into smaller pieces
- Map, each token gets a unique numerical ID
- Add, insert special markers ([CLS], [SEP]) that provide structural context
- Create attention masks, generate a guide so the model can distinguish real tokens from tokens it should ignore
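The four steps can be observed directly with a tokenizer; `bert-base-uncased` is an illustrative model choice:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer("Tokenization is fun!")

# Split + Map: subword tokens and their numerical IDs
print(tokenizer.tokenize("Tokenization is fun!"))
print(encoding["input_ids"])

# Add: special markers were inserted automatically
assert encoding["input_ids"][0] == tokenizer.cls_token_id   # [CLS]
assert encoding["input_ids"][-1] == tokenizer.sep_token_id  # [SEP]

# Attention mask: all ones here, since nothing is padding
print(encoding["attention_mask"])
```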
Not all tokenizers are equal; vocabulary size determines how many unique tokens the model knows
- BERT, 30522
- GPT-2, 50257
- RoBERTa, 50265
- T5, 32128
A general approximation is that one token corresponds to about 4 characters of English text (equivalently, the token count is roughly 0.75 times the word count), though this can vary depending on the specific tokenization method used
---
Q: What is the purpose of using a pad token when working with different length strings in neural networks?
A: To make strings the same length by filling shorter strings with empty tokens, enabling mathematical operations and comparisons between vectors in neural networks.
An attention mask is a binary representation where ones represent actual content tokens and zeros represent padding tokens, helping to ignore irrelevant tokens during processing
Neural networks need consistent input lengths to perform mathematical operations and comparisons between different token representations
Special tokens are included in the tokenization process and contribute to the total token count, which affects the padding and attention mask
Different string lengths make mathematical comparisons difficult, necessitating normalization techniques like padding to create equal-length vectors
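A sketch of padding and attention masks in practice, again using `bert-base-uncased` purely for illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(["Hi", "A much longer example sentence"],
                  padding=True)  # pad the shorter string to equal length

# Both rows now have the same length, so they can sit in one tensor
print({len(ids) for ids in batch["input_ids"]})  # a single common length

# Ones mark real tokens, zeros mark padding the model should ignore
for mask in batch["attention_mask"]:
    print(mask)
```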
---
Each model has a unique tokenization approach, breaking down text into tokens differently. For example, the tokens for the same input sentence will vary between models, and the vocabulary size ranges from around 30k to 50k tokens, reflecting each model's specific tokenization strategy.
When a word is not in the model's vocabulary, it may be broken down into smaller tokens, potentially using individual letters or common letter combinations. Some models have a special 'unknown word' token to handle out-of-vocabulary words.
Tokenizers first split text into tokens, then convert these tokens into numerical IDs. These numerical representations are what machine learning models can process, allowing text to be transformed into a format suitable for computational analysis.
Token count can indicate model complexity, with larger vocabulary representing more nuanced language understanding. However, a larger token count does not automatically mean a better model, as factors like model architecture and training data also play crucial roles.
Tokenization is model-specific, meaning tokens created by one model's tokenizer cannot be directly used with another.
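This model-specificity is easy to verify by tokenizing the same sentence with two different tokenizers:

```python
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2 = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization differs between models"
print(bert.tokenize(text))  # WordPiece subwords
print(gpt2.tokenize(text))  # byte-level BPE pieces (with Ġ space markers)

# Vocabulary sizes differ too (~30k vs ~50k)
print(bert.vocab_size, gpt2.vocab_size)
```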
---
### Transformers
A transformer is a neural network architecture that powers text-generative models and excels at understanding long-range dependencies within sentences.
The key components are:
- Embeddings
- Transformer Block
- Output Probabilities
Self-attention mechanism allows each token to communicate with and be aware of other tokens in order to capture contextual meaning, helping tokens understand their meaning based on surrounding words
Word ambiguity is resolved using the context of surrounding words: the model shifts the token's vector representation toward its most likely meaning based on the other words in the sentence
The two-step process of token understanding in a transformer model:
- Tokens engage in a 'group discussion' where they share information with each other
- then they have individual 'quiet thinking time' where each token adjusts its own meaning based on the context learned from other tokens
Transformer models differ from earlier language models by looking at both forward and backward context
Token embeddings during training: initially, they are random and have no meaningful relationships. Through training, the neural network adjusts its weights until the token representations develop contextual meaning based on their usage and relationships.
Current models remember conversation context by sending the entire conversation history with each new prompt, effectively making it a stateless protocol where the full context is transmitted each time a new request is made
Vectorizing, in the context of LLMs, involves converting text into numerical representations (vectors) that can be stored in a vector database. These vectors capture contextual relationships between words and can be used to find relevant information when processing a prompt.
Different meanings of the same word are pulled closer together or further apart based on context, so vector representations are adjusted according to the surrounding context.
The core training mechanism is a process of repeatedly guessing the next word in a sequence, with the neural network adjusting its internal weights until errors are minimized, effectively discovering its own algorithm through large-scale trial and error
Retrieval-augmented generation is a technique where personal data is converted into vectors, stored in a vector database, and then used to augment prompts by finding relevant text chunks that match the query and appending them to provide additional [[Context]]
Vector database find relevant content by tokenizing and converting both prompt and stored content into numerical vectors, then finding the most semantically similar chunks using vector similarity matching
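A toy sketch of the similarity-matching step, with made-up 3-dimensional "embeddings" standing in for a real embedding model:

```python
import numpy as np

# Hypothetical stored chunks and their (invented) embedding vectors
chunks = {
    "cats purr when content":       np.array([0.9, 0.1, 0.0]),
    "stock prices fell on Tuesday": np.array([0.0, 0.2, 0.9]),
    "kittens love to play":         np.array([0.7, 0.5, 0.1]),
}
query = np.array([0.85, 0.2, 0.05])  # pretend embedding of "tell me about cats"

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, 0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

best = max(chunks, key=lambda text: cosine(query, chunks[text]))
print(best)  # the chunk most semantically similar to the query
```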
The BertViz library: it visualizes semantic relationships between words in different sentences by showing how tokens are connected and related through attention mechanisms
Transformer models understand word relationships by creating complex numerical representations of words and analyzing their connections through multiple layers of attention, which are developed through training on large datasets
The primary function of an encoder in a language model is to transform words into vectors (lists of numbers) that capture the meaning of the words, converting text into a format that can be processed by the neural network.
The key characteristic of a decoder in a language model is to generate the next token one at a time, only looking backwards and predicting probabilities for the next word in a sequence
Two primary decoding strategies are
- Top-k sampling (sampling from the k most likely next words, e.g. the top 50)
- Top-p sampling (sampling from the smallest set of words whose cumulative probability exceeds a certain threshold)
LLM generates text progressively by predicting probabilities for the next token, selecting a token based on a decoding strategy, adding that token to the input, and then moving to the next token
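The loop described here can be sketched with a made-up bigram probability table standing in for a real model:

```python
# Hypothetical next-token probability table (a real model predicts these)
bigram = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

tokens = ["<s>"]
while tokens[-1] != "</s>":
    probs = bigram[tokens[-1]]          # predict probabilities for next token
    next_token = max(probs, key=probs.get)  # greedy decoding: pick most likely
    tokens.append(next_token)           # add the token, then repeat

print(" ".join(tokens[1:-1]))  # → "the cat sat"
```

The `</s>` entry plays the role of the end-of-sequence token: generation stops when it is selected, independent of any maximum-length limit.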
Temperature determines the 'spiciness' or randomness of word selection, influencing how predictable or varied the generated text will be
Two main decoding strategies:
- Greedy decoding (picking the single most likely token)
- Temperature adjusted to token selection creativity
At low temperatures, the model selects very few, highly likely tokens. As temperature increases, more tokens become statistically available, potentially creating more creative but less predictable outputs
The end of sequence token signals when the model should stop generating text, providing a mechanism to limit output length beyond just setting a maximum token count
For long sequences, as the input grows beyond the context window, the model begins dropping earlier tokens, which can lead to hallucinations or losing the original context
Top-k selects from the top k most likely words, while top-p (nucleus sampling) selects from a dynamic set of tokens whose collective probability mass reaches a certain threshold
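A numeric sketch of temperature, top-k, and top-p over a toy next-token distribution (the vocabulary and logits are invented for illustration):

```python
import numpy as np

vocab  = ["the", "a", "cat", "dog", "pizza"]
logits = np.array([2.0, 1.5, 1.0, 0.5, -1.0])  # made-up model scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Temperature: divide logits before softmax; low T sharpens, high T flattens
sharp = softmax(logits / 0.2)
flat  = softmax(logits / 2.0)
print(sharp.max(), flat.max())  # the low-temperature peak is much higher

# Top-k: keep only the k most likely tokens
k = 2
top_k = [vocab[i] for i in np.argsort(logits)[::-1][:k]]
print(top_k)  # → ['the', 'a']

# Top-p (nucleus): keep the smallest set whose cumulative probability >= p
p = 0.8
probs = softmax(logits)
order = np.argsort(probs)[::-1]
cum = np.cumsum(probs[order])
nucleus = [vocab[i] for i in order[: int(np.searchsorted(cum, p)) + 1]]
print(nucleus)  # → ['the', 'a', 'cat']
```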
---
### Fine-Tuning
What is LoRA? (typically achieves 90 to 99% of full fine-tuning performance)
It's Low-Rank Adaptation of Large Language Models
- Freeze the big model and don't touch most of it
- Add a few tiny trainable layers on top of all of the existing ones
- Train just those small pieces
Effectiveness conditions:
- Task similar to what the base model already knows
- Dataset is relatively small or task-specific (e.g., summarization, classification)
Edge Case: if the task is very different from the pre-trained model's knowledge, full fine-tuning can outperform LoRA
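The arithmetic behind LoRA can be sketched with plain NumPy; the hidden size and rank here are made up for illustration (this is the low-rank idea, not the PEFT library):

```python
import numpy as np

d, r = 1024, 8                      # hidden size, LoRA rank (both hypothetical)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pretrained weights: never updated
A = rng.normal(size=(r, d)) * 0.01  # small trainable matrix
B = np.zeros((d, r))                # B starts at zero, so the update starts at 0

W_adapted = W + B @ A               # effective weights during fine-tuning

full_params = W.size                # parameters full fine-tuning would touch
lora_params = A.size + B.size       # parameters LoRA actually trains
print(lora_params / full_params)    # ~1.6% of the full parameter count
```

Only `A` and `B` need to be stored and shipped as the adapter, which is why LoRA checkpoints are tiny compared with the base model.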
Fine-tuning typically uses around 16k examples, and you only need to store the additional layers/adapters, making storage and deployment efficient; full training can outperform the lighter approach but often requires roughly 28× the computational resources
Quantization is a process of reducing the precision of model parameters to fit larger models into memory with some loss of fidelity, allowing models to be used on devices with limited computational resources
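A toy sketch of symmetric int8 quantization, the simplest variant of the idea (real schemes are more elaborate):

```python
import numpy as np

weights = np.array([0.42, -1.30, 0.07, 2.10, -0.88], dtype=np.float32)

scale = np.abs(weights).max() / 127            # map the largest value to ±127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale     # approximate reconstruction

print(q)                                       # 1 byte per weight instead of 4
print(np.abs(weights - dequantized).max())     # the "loss of fidelity"
```

The reconstruction error is bounded by the scale, which is the precision/memory trade-off the note describes.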
By formatting data set entries with a consistent structure like 'Quote by (author) end of statement', the model is primed to learn a specific pattern and generate text in a more predictable manner during fine-tuning
A tokenizer in machine learning converts text into numerical tokens that the model can understand, and helps identify important tokens like the end-of-statement or end-of-text token, which is crucial for processing and generating text
By feeding a small, consistently formatted dataset to the model, you can adjust the model's parameters to better understand and generate text in the desired style or format
The data collator groups the training data into batches, allowing the model to load and process the data more efficiently during fine-tuning
Tokenization in this model training process pads each example to the same length, ensuring all input texts are transformed into token sequences of equal length by adding padding tokens
One efficient fine-tuning technique is parameter-efficient fine-tuning, where only a few layers on top of the base model are trained, leaving the underlying model unchanged
The fine-tuned model generates more structured and concise outputs, such as consistently formatting quotes with an author name and stopping at an end-of-sequence token
The scale of training dataset used for fine-tuning was 2.5k lines of an open source dataset, which was used to train the model in approximately two minutes
---
### Image Generation
Stable Diffusion
The Four Pillars
- Text Encoder, converts prompt into numerical embeddings
- U-Net, predicts the noise in an image at each step
- Scheduler, manages the denoising process over a set number of steps, controlling how much noise is removed
- Variational Autoencoder (VAE), compresses the image into a lower-dimensional 'latent space' for efficient processing and then decodes the final latent representation back into a visible image
The knobs and how to turn them:
- num_inference_steps, denoising steps (~25-50) lead to higher quality but are slower
- guidance_scale, how strictly the model should adhere to your prompt. Higher values (~7-12) mean stronger adherence but can lead to less creative or 'stiff' images; lower values allow more artistic freedom
- width/height, The dimensions of the output image. Larger images require significantly more VRAM
The core concept behind Stable Diffusion image generation is that it starts with chaotic noise/random static and iteratively removes the parts that aren't the desired image, gradually transforming the random pixels into a recognizable image, guided by what the model learned from labeled training images
The trade-off when generating AI images: more generation steps lead to better image quality but take longer, while smaller images process faster and consume fewer computational resources; a more capable GPU reduces the cost of each iteration
Through iterative passes, noise is turned into an image: the model starts with random static and progressively removes noise, getting closer to the target by drawing on what it learned from labeled training images, ultimately producing a recognizable picture
The Art of the Prompt
A place where Prompt Engineering actually matters
- Subject, The main focus of the image
- Style, The artistic style
- Details & Environment, Specific attributes and setting
- Composition & Lighting, Camera details and lighting effects
- Quality Modifiers, Keywords that encourage higher quality
The negative prompt in image generation lists concepts or elements you do not want included in the generated image
An image generation scheduler manages the denoising steps; separately, memory options like attention slicing reduce VRAM usage by computing attention in smaller chunks
Hugging Face SDKs provide an abstraction layer for image generation model selection, allowing easy switching between different models by simply changing a model ID string, simplifying the model selection process
Factors that might influence the choice of an image generation model are:
- Speed
- Computational resources
- Waiting time
- Desired image quality
Two techniques for optimizing GPU memory usage when running Stable Diffusion models are:
- Attention Slicing (breaking large models into smaller chunks)
- CPU offloading: loading weights from `safetensors` with low CPU memory usage and swapping idle layers to the CPU
The Parameters:
- Number of inference steps
- determines how many times the model attempts to remove noise from the image (more steps = more detailed)
- guidance_scale
- determines how closely the model follows the text prompt (higher values mean stricter adherence to the prompt)
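The mechanism behind guidance_scale is classifier-free guidance; a toy numeric sketch with invented noise predictions:

```python
import numpy as np

# Made-up noise predictions from one denoising step: one without the
# prompt (unconditional) and one conditioned on the prompt.
noise_uncond = np.array([0.10, 0.30])
noise_cond   = np.array([0.20, 0.10])

def guided(scale):
    # Push the prediction away from unconditional, toward the prompt
    return noise_uncond + scale * (noise_cond - noise_uncond)

print(guided(1.0))  # scale 1 → just the conditional prediction
print(guided(7.5))  # higher scale → stronger pull toward the prompt
```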
AI models like Stable Diffusion transform an initial image during generation: the model starts with random, chaotic noise and progressively removes it through a reverse (denoising) process, gradually transforming the noise into a coherent image that matches the text prompt
Using negative prompts specify elements or characteristics that should NOT appear in the generated image, such as 'blurry', 'low resolution', 'watermarks', or 'extra limbs'
The model will automatically select CUDA (Nvidia GPU) if available, otherwise defaulting to CPU processing. This can be controlled by checking device availability and setting the appropriate computational device
Two types of image generation pipelines are:
- text to image pipelines
- image to image pipeline
- The input image is encoded into latent space, noise is added based on the strength parameter, and then the noise is removed to generate a transformed image
The degree of transformation is controlled by the strength parameter, which ranges from 0 to 1.0; lower values stay close to the original image, while higher values take greater liberties with the transformation
As the strength parameter approaches 1.0 in image-to-image generation, the model adds more noise to the original image, moving the result further away from the original image's characteristics
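The mapping from strength to denoising steps can be sketched like this, mirroring the arithmetic commonly used in image-to-image pipelines (a simplified sketch, not the full implementation):

```python
def steps_to_run(num_inference_steps: int, strength: float) -> int:
    # Higher strength → more noise added → more denoising steps run,
    # so the result drifts further from the input image.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    return init_timestep

print(steps_to_run(50, 0.3))  # 15 steps → stays close to the original
print(steps_to_run(50, 0.9))  # 45 steps → takes greater liberties
```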
Processing images in 'latent space' means the image is represented as numerical data in a GPU-friendly format, allowing manipulation and transformation without directly editing the original image pixels
DreamBooth is a lightweight technique to fine-tune Stable Diffusion models on a specific subject using 3-10 images, by associating a made-up term with those images; the key requirement is that the term must be a word that does not exist in the model's vocabulary, ensuring it is unique and not associated with any pre-existing concept
By training the model with 3-10 images of a specific subject under a unique term, DreamBooth allows the model to generate images of that subject in various contexts and scenarios; prior preservation in this context involves generating additional images of the subject's general class, similar to the training set, to keep the model from forgetting what that class looks like