Text Preprocessing: Cleaning, Tokenization, and Normalization

Intermediate Preprocessing

Prerequisites:

Encoding Categorical Variables: Label, One-Hot, Target, and Ordinal Encoding

Definition

Text preprocessing transforms unstructured natural language into structured, machine-readable formats suitable for analysis and modeling. Raw text contains noise, inconsistencies, and irrelevant information that can hinder model performance. The preprocessing pipeline typically includes cleaning (removing noise, handling encoding issues), normalization (standardizing text form through lowercasing, stemming, lemmatization), tokenization (splitting text into words or subwords), and vectorization (converting tokens to numerical representations). Effective text preprocessing balances the need to reduce dimensionality and noise against preserving semantic meaning. The specific steps depend on the downstream task—sentiment analysis might preserve emoticons while information extraction might remove them. Modern NLP increasingly uses subword tokenization (BPE, WordPiece) and pre-trained embeddings that handle raw text with minimal preprocessing, but fundamental cleaning and normalization remain essential for most applications.
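To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting, the lookup strategy that WordPiece-style tokenizers apply at inference time. The vocabulary is a toy set invented for this example, and the function name `greedy_subword_tokenize` is hypothetical. Real tokenizers learn their vocabulary from a corpus and usually mark word-internal pieces (for example with a `##` prefix), which this sketch omits.

```python
def greedy_subword_tokenize(word, vocab, unk="<UNK>"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary piece starts here: treat the whole word as unknown
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Toy vocabulary, invented for illustration only
toy_vocab = {"un", "happi", "happiness", "ness", "token", "ization"}

print(greedy_subword_tokenize("unhappiness", toy_vocab))   # ['un', 'happiness']
print(greedy_subword_tokenize("tokenization", toy_vocab))  # ['token', 'ization']
print(greedy_subword_tokenize("xyz", toy_vocab))           # ['<UNK>']
```

Because unseen words fall back to smaller known pieces, subword vocabularies rarely need an unknown token, which is one reason they tolerate lighter preprocessing.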

Intuition

Think of text preprocessing like preparing ingredients for cooking. Raw text is like vegetables straight from the garden: dirt, leaves, stems, and all. Cleaning is washing and removing dirt (HTML tags, special characters, encoding errors). Tokenization is chopping the text into pieces, and normalization is trimming those pieces to a uniform shape (stemming: 'running' → 'run'). Some recipes need fine dicing (word-level), others need bigger chunks (sentence-level), and modern methods use pre-cut portions (subword tokens: 'unhappiness' → ['un', 'happiness']). Stop word removal is like removing water from vegetables: it reduces volume, but some recipes need that moisture. The goal is preparing text so models can 'digest' it: too much processing loses flavor (meaning), too little leaves it unpalatable (noisy).

Mathematical Formula

\text{TF-IDF:} \quad \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)

\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \quad \text{IDF}(t) = \log \frac{N}{|\{d \in D: t \in d\}|}

\text{Cosine Similarity:} \quad \cos(\theta) = \frac{A \cdot B}{||A|| \times ||B||} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum A_i^2} \sqrt{\sum B_i^2}}

\text{Jaccard Similarity:} \quad J(A, B) = \frac{|A \cap B|}{|A \cup B|}

Step-by-Step Explanation:

TF-IDF: Term Frequency-Inverse Document Frequency weights words by importance in document and rarity in corpus
Cosine Similarity: Measures the cosine of the angle between document vectors; 1 = same direction (identical term proportions), 0 = orthogonal (no shared terms)
Jaccard Similarity: Ratio of intersection to union of word sets; measures overlap between documents
Bag of Words: Document representation as word frequency vector, ignoring grammar and order
N-grams: Contiguous sequences of n items; captures local word order (bigrams, trigrams)
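The formulas above can be checked by hand on a tiny corpus. The following is a minimal pure-Python sketch that follows the definitions exactly as written (relative term frequency, unsmoothed natural-log IDF); the three documents are made up for illustration. Library implementations often differ in detail: scikit-learn's TfidfVectorizer, for instance, uses a smoothed IDF and L2-normalizes each row by default, so its numbers will not match these.

```python
import math
from collections import Counter

# Tiny pre-tokenized corpus, invented for illustration
docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the quick dog jumps".split(),
]

def tf(term, doc):
    """Term frequency: count of the term divided by document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf_vector(doc, corpus, vocab):
    return [tf(t, doc) * idf(t, corpus) for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

vocab = sorted({t for doc in docs for t in doc})
vectors = [tfidf_vector(doc, docs, vocab) for doc in docs]

print(f"IDF('the')   = {idf('the', docs):.3f}")    # in every document, so log(3/3) = 0
print(f"IDF('quick') = {idf('quick', docs):.3f}")  # in 2 of 3 documents, so log(3/2)
print(f"cosine(doc1, doc2)  = {cosine(vectors[0], vectors[1]):.3f}")
print(f"cosine(doc1, doc3)  = {cosine(vectors[0], vectors[2]):.3f}")
print(f"jaccard(doc1, doc3) = {jaccard(docs[0], docs[2]):.3f}")
```

Note that 'the' receives zero weight because it appears in every document, so documents 1 and 2, which share only 'the', end up with a cosine similarity of 0 even though their Jaccard overlap is nonzero.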

Real-World Use Cases

Healthcare

Medical notes preprocessing for diagnosis prediction: de-identification (remove PII), standardize medical terms (lemmatization), extract symptoms and medications (NER). Preserving negation ('no chest pain') is critical.

Finance

News sentiment analysis for trading: remove boilerplate, normalize company names, handle financial jargon, preserve numbers and percentages. Real-time processing requires efficient tokenization.

Retail

Product review analysis: handle emojis and emoticons (sentiment), normalize misspellings, remove spam indicators, extract product aspects. Tokenization for multi-language reviews.

Tech

Customer support ticket classification: remove email headers and signatures, standardize technical terms, extract error codes, handle code snippets. Preserving technical accuracy vs normalization tradeoff.

Legal

Contract analysis: preserve legal terminology precisely, handle cross-references, extract entities (parties, dates, amounts). Minimal stemming to maintain legal meaning.

Implementation

Manual Implementation (No Libraries)

import re
import string
import unicodedata
from collections import Counter

import numpy as np   # used by the Bag of Words section at the end
import pandas as pd  # used to display the document-term matrix

# Sample text data
sample_texts = [
    "Check out this AMAZING deal!!! Visit https://example.com or call 1-800-555-1234. #shopping #deals",
    "I LOVED the product, but shipping was slow... :(  Rating: 4/5 stars!!!",
    "   Contact us at support@company.io   ",
    "UNBELIEVABLE!!! You won't believe what happened next!!! ???",
    "The quick brown fox jumps over the lazy dog. The dog was very lazy indeed."
]

print("Original Texts:")
for i, text in enumerate(sample_texts):
    print(f"{i+1}. {text}")

# 1. BASIC CLEANING (Manual)
def clean_text_manual(text, lowercase=True, remove_urls=True, remove_emails=True, remove_numbers=True):
    """
    Manual text cleaning: lowercase, strip URLs, emails and phone numbers,
    optionally strip digits, then collapse whitespace.
    """
    # Lowercase
    if lowercase:
        text = text.lower()

    # Remove URLs
    if remove_urls:
        text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove email addresses
    if remove_emails:
        text = re.sub(r'\S+@\S+', '', text)

    # Remove phone numbers, with an optional leading country code (e.g. 1-800-555-1234)
    text = re.sub(r'\b(?:\d{1,2}[-.])?\d{3}[-.]?\d{3}[-.]?\d{4}\b', '', text)

    # Remove numbers (optional)
    if remove_numbers:
        text = re.sub(r'\d+', '', text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text

print("\n=== 1. BASIC CLEANING ===")
for text in sample_texts[:3]:
    cleaned = clean_text_manual(text)
    print(f"Original: {text}")
    print(f"Cleaned:  {cleaned}\n")

# 2. PUNCTUATION AND SPECIAL CHARACTERS
def remove_punctuation_manual(text, keep_emoticons=False):
    """
    Remove punctuation marks, optionally protecting simple emoticons.
    """
    if keep_emoticons:
        # Keep emoticons like :) :( :D
        emoticons = r'[:;=]-?[)(\[\]{}|\\/DPp@]'
        # Temporarily replace emoticons with punctuation-free placeholders
        emoticon_list = re.findall(emoticons, text)
        for i, emoticon in enumerate(emoticon_list):
            text = text.replace(emoticon, f'EMOTICON{i}', 1)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    if keep_emoticons:
        # Restore emoticons, highest index first so that 'EMOTICON1'
        # is not substituted inside 'EMOTICON10'
        for i in reversed(range(len(emoticon_list))):
            text = text.replace(f'EMOTICON{i}', emoticon_list[i])

    return text

print("=== 2. PUNCTUATION REMOVAL ===")
text = "Hello, world! How are you? :)"
print(f"Original: {text}")
print(f"Without punctuation: {remove_punctuation_manual(text)}")
print(f"Keeping emoticons: {remove_punctuation_manual(text, keep_emoticons=True)}")

# 3. TOKENIZATION (Manual)
def tokenize_manual(text, min_length=2):
    """
    Simple word tokenization.
    """
    # Split on whitespace and punctuation
    tokens = re.findall(r'\b\w+\b', text.lower())

    # Filter short tokens
    tokens = [t for t in tokens if len(t) >= min_length]

    return tokens

def tokenize_sentences_manual(text):
    """
    Simple sentence tokenization.
    """
    # Split on sentence boundaries
    sentences = re.split(r'[.!?]+', text)
    # Clean and filter empty
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

print("\n=== 3. TOKENIZATION ===")
text = "The quick brown fox jumps. It jumps over the lazy dog! Does it?"
print(f"Original: {text}")
print(f"Word tokens: {tokenize_manual(text)}")
print(f"Sentence tokens: {tokenize_sentences_manual(text)}")

# 4. STOP WORDS REMOVAL
def remove_stopwords_manual(tokens, stopwords=None):
    """
    Remove common stop words.
    """
    if stopwords is None:
        # Note: this default list contains 'no' and 'not'; pass a custom set
        # for tasks such as sentiment analysis where negation matters
        stopwords = {
            'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been',
            'being', 'have', 'has', 'had', 'do', 'does', 'did', 'will',
            'would', 'could', 'should', 'may', 'might', 'must', 'shall',
            'can', 'need', 'dare', 'ought', 'used', 'to', 'of', 'in',
            'for', 'on', 'with', 'at', 'by', 'from', 'as', 'into',
            'through', 'during', 'before', 'after', 'above', 'below',
            'between', 'under', 'and', 'but', 'or', 'yet', 'so',
            'if', 'because', 'although', 'though', 'while', 'where',
            'when', 'that', 'which', 'who', 'whom', 'whose', 'what',
            'this', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
            'we', 'they', 'me', 'him', 'her', 'us', 'them', 'my',
            'your', 'his', 'its', 'our', 'their', 'mine', 'yours',
            'hers', 'ours', 'theirs', 'am', 's', 't', 'just',
            'don', 'now', 'll', 'm', 're', 've', 'y', 'ma', 'd',
            'o', 'no', 'not', 'only', 'own', 'same',
            'than', 'too', 'very', 'also', 'back', 'still'
        }

    return [token for token in tokens if token.lower() not in stopwords]

print("\n=== 4. STOP WORDS REMOVAL ===")
text = "The quick brown fox jumps over the lazy dog in the garden"
tokens = tokenize_manual(text)
filtered = remove_stopwords_manual(tokens)
print(f"Original: {text}")
print(f"Tokens: {tokens}")
print(f"Without stopwords: {filtered}")

# 5. STEMMING (Porter Stemmer - Simplified)
def porter_stemmer_simple(word):
    """
    Simplified Porter-style stemmer.

    Omits the measure conditions of the full algorithm, so some stems
    are cruder than those of a real Porter stemmer.
    """
    word = word.lower()

    # Step 1a
    if word.endswith('ies') and len(word) > 4:
        word = word[:-3] + 'y'
    elif word.endswith('ied') and len(word) > 4:
        word = word[:-3] + 'y'
    elif word.endswith('s') and not word.endswith('ss') and not word.endswith('us'):
        word = word[:-1]

    # Step 1b
    suffix_removed = False
    if word.endswith('eed') and len(word) > 4:
        word = word[:-3] + 'ee'
    elif word.endswith('ed') and len(word) > 3:
        stem = word[:-2]
        if any(v in stem for v in 'aeiou'):
            word = stem
            suffix_removed = True
    elif word.endswith('ing') and len(word) > 4:
        stem = word[:-3]
        if any(v in stem for v in 'aeiou'):
            word = stem
            suffix_removed = True

    # Undouble a trailing consonant left by -ed/-ing removal ('runn' → 'run'),
    # except l, s, z as in the original Porter rules
    if (suffix_removed and len(word) > 2 and word[-1] == word[-2]
            and word[-1] not in 'aeioulsz'):
        word = word[:-1]

    # Step 2
    if word.endswith('ational'):
        word = word[:-7] + 'ate'
    elif word.endswith('tional'):
        word = word[:-6] + 'tion'
    elif word.endswith('izer'):
        word = word[:-4] + 'ize'
    elif word.endswith('li') and len(word) > 3:
        word = word[:-2]

    # Step 3
    if word.endswith('icate'):
        word = word[:-5] + 'ic'
    elif word.endswith('ative'):
        word = word[:-5]
    elif word.endswith('alize'):
        word = word[:-5] + 'al'
    elif word.endswith('iciti'):
        word = word[:-5] + 'ic'
    elif word.endswith('ness'):
        word = word[:-4]

    # Step 4
    if word.endswith('ement'):
        word = word[:-5]
    elif word.endswith('ment'):
        word = word[:-4]
    elif word.endswith('ent'):
        word = word[:-3]
    elif word.endswith('ion') and len(word) > 3 and word[-4] in 'st':
        word = word[:-3]
    elif word.endswith('ous'):
        word = word[:-3]
    elif word.endswith('ive'):
        word = word[:-3]
    elif word.endswith('ize'):
        word = word[:-3]

    return word

print("\n=== 5. STEMMING ===")
stemming_examples = [
    'running', 'flies', 'died', 'agreed', 'happiness',
    'national', 'rationalization', 'cats', 'troubled'
]
for word in stemming_examples:
    print(f"{word} → {porter_stemmer_simple(word)}")

# 6. N-GRAM GENERATION
def generate_ngrams_manual(tokens, n=2):
    """
    Generate n-grams from tokens.
    """
    ngrams = []
    for i in range(len(tokens) - n + 1):
        ngram = ' '.join(tokens[i:i+n])
        ngrams.append(ngram)
    return ngrams

print("\n=== 6. N-GRAMS ===")
text = "the quick brown fox jumps"
tokens = tokenize_manual(text)
print(f"Tokens: {tokens}")
print(f"Bigrams: {generate_ngrams_manual(tokens, n=2)}")
print(f"Trigrams: {generate_ngrams_manual(tokens, n=3)}")

# 7. TEXT NORMALIZATION (Unicode)
def normalize_unicode_manual(text):
    """
    Normalize unicode characters and strip accents.
    """
    # NFKD decomposition separates base characters from combining marks
    text = unicodedata.normalize('NFKD', text)
    # Remove combining characters (accents)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    return text

print("\n=== 7. UNICODE NORMALIZATION ===")
text = "Café résumé naïve"
print(f"Original: {text}")
print(f"Normalized: {normalize_unicode_manual(text)}")

# 8. SPELLING NORMALIZATION (Simple)
def normalize_repeated_chars_manual(text, max_repeats=2):
    """
    Collapse runs of a repeated character to at most max_repeats
    (e.g., 'sooooo' → 'soo' with max_repeats=2).
    """
    # Match a character followed by max_repeats or more copies of itself;
    # the trailing ',' makes the quantifier open-ended so the whole run is replaced
    pattern = r'(.)\1{' + str(max_repeats) + r',}'
    return re.sub(pattern, r'\1' * max_repeats, text)

print("\n=== 8. REPEATED CHARACTERS ===")
text = "I am sooooo happy!!!! This is coooool!!!"
print(f"Original: {text}")
print(f"Normalized: {normalize_repeated_chars_manual(text)}")

# 9. COMPLETE PREPROCESSING PIPELINE
def preprocess_text_complete(text,
                              lowercase=True,
                              remove_urls=True,
                              remove_emails=True,
                              remove_numbers=False,
                              remove_punctuation=True,
                              remove_stopwords=True,
                              stem=False,
                              min_token_length=2):
    """
    Complete text preprocessing pipeline.
    """
    # Clean
    text = clean_text_manual(text, lowercase, remove_urls, remove_emails, remove_numbers)

    # Remove punctuation
    if remove_punctuation:
        text = remove_punctuation_manual(text)

    # Tokenize
    tokens = tokenize_manual(text, min_length=min_token_length)

    # Remove stopwords
    if remove_stopwords:
        tokens = remove_stopwords_manual(tokens)

    # Stem
    if stem:
        tokens = [porter_stemmer_simple(t) for t in tokens]

    return tokens

print("\n=== 9. COMPLETE PIPELINE ===")
text = sample_texts[1]
print(f"Original: {text}")
print(f"\nTokenized: {preprocess_text_complete(text, stem=False)}")
print(f"Stemmed: {preprocess_text_complete(text, stem=True)}")

# 10. BAG OF WORDS (Manual)
def create_bow_manual(documents):
    """
    Create Bag of Words representation.
    """
    # Preprocess all documents
    processed_docs = [preprocess_text_complete(doc) for doc in documents]

    # Build vocabulary
    vocab = sorted(set(token for doc in processed_docs for token in doc))
    vocab_dict = {word: idx for idx, word in enumerate(vocab)}

    # Create document-term matrix
    doc_term_matrix = []
    for doc in processed_docs:
        counts = Counter(doc)
        vector = [counts.get(word, 0) for word in vocab]
        doc_term_matrix.append(vector)

    return np.array(doc_term_matrix), vocab

print("\n=== 10. BAG OF WORDS ===")
docs = [
    "The quick brown fox",
    "The lazy dog sleeps",
    "The quick dog jumps"
]
bow_matrix, vocab = create_bow_manual(docs)
print(f"Vocabulary ({len(vocab)} terms): {vocab}")
print(f"Document-Term Matrix:")
print(bow_matrix)

# Show as DataFrame
bow_df = pd.DataFrame(bow_matrix, columns=vocab, index=[f'Doc{i+1}' for i in range(len(docs))])
print(f"\nAs DataFrame:")
print(bow_df)

Using Libraries

import pandas as pd
import numpy as np
import re
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# Optional dependencies: guarded so a missing package does not stop the script here
try:
    import nltk
except ImportError:
    nltk = None
try:
    from textblob import TextBlob
except ImportError:
    TextBlob = None

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog. Amazing!!!",
    "A quick brown dog outpaces a lazy fox. Great deal at https://shop.com",
    "Machine learning is fascinating. Check out AI research papers.",
    "Natural language processing enables text understanding. #NLP #AI",
    "The lazy dog sleeps all day while the fox runs. Contact: info@example.com",
    "Deep learning revolutionizes image recognition and NLP tasks!!!"
]

print("Sample Documents:")
for i, doc in enumerate(documents):
    print(f"{i+1}. {doc}")

# 1. NLTK TOKENIZATION AND PROCESSING
print("\n" + "="*60)
print("1. NLTK TOKENIZATION")
print("="*60)

try:
    from nltk.tokenize import word_tokenize, sent_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Download if needed (commented for production)
    # nltk.download('punkt')  # recent NLTK releases use 'punkt_tab' instead
    # nltk.download('stopwords')
    # nltk.download('wordnet')
    # nltk.download('averaged_perceptron_tagger')

    # Word tokenization
    sample = documents[0]
    tokens = word_tokenize(sample)
    print(f"Word tokens: {tokens}")

    # Sentence tokenization
    text = "First sentence. Second sentence! Third sentence?"
    sentences = sent_tokenize(text)
    print(f"Sentence tokens: {sentences}")

except (ImportError, LookupError):
    # ImportError: NLTK is not installed; LookupError: tokenizer data not downloaded
    print("NLTK or its data not available, using basic tokenization")

# 2. SCIKIT-LEARN COUNT VECTORIZER
print("\n" + "="*60)
print("2. COUNT VECTORIZER (Bag of Words)")
print("="*60)

# Basic CountVectorizer
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(documents)

print(f"Document-term matrix shape: {count_matrix.shape}")
print(f"Vocabulary size: {len(count_vec.vocabulary_)}")
print(f"Feature names: {count_vec.get_feature_names_out()[:20]}...")

# With parameters
count_vec_params = CountVectorizer(
    lowercase=True,
    stop_words='english',
    max_features=50,
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=1,  # minimum document frequency
    max_df=1.0  # maximum document frequency
)
count_matrix_params = count_vec_params.fit_transform(documents)
print(f"\nWith parameters - Shape: {count_matrix_params.shape}")
print(f"Features: {count_vec_params.get_feature_names_out()}")

# 3. TF-IDF VECTORIZER
print("\n" + "="*60)
print("3. TF-IDF VECTORIZER")
print("="*60)

tfidf_vec = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    max_features=100,
    ngram_range=(1, 2),
    min_df=1,
    norm='l2'  # L2 normalization
)
tfidf_matrix = tfidf_vec.fit_transform(documents)

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Vocabulary: {tfidf_vec.get_feature_names_out()[:15]}...")

# Show TF-IDF values for first document
feature_names = tfidf_vec.get_feature_names_out()
doc1_tfidf = tfidf_matrix[0].toarray()[0]
top_indices = doc1_tfidf.argsort()[-10:][::-1]
print(f"\nTop TF-IDF terms in Doc1:")
for idx in top_indices:
    if doc1_tfidf[idx] > 0:
        print(f"  {feature_names[idx]}: {doc1_tfidf[idx]:.4f}")

# 4. COSINE SIMILARITY
print("\n" + "="*60)
print("4. DOCUMENT SIMILARITY")
print("="*60)

# Calculate cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)
print("Cosine similarity matrix:")
sim_df = pd.DataFrame(
    similarity_matrix,
    columns=[f'Doc{i+1}' for i in range(len(documents))],
    index=[f'Doc{i+1}' for i in range(len(documents))]
)
print(sim_df.round(3))

# Most similar pair
n = len(documents)
max_sim = -1.0  # below any possible cosine value, so a pair is always selected
max_pair = None
for i in range(n):
    for j in range(i+1, n):
        if similarity_matrix[i, j] > max_sim:
            max_sim = similarity_matrix[i, j]
            max_pair = (i, j)

print(f"\nMost similar: Doc{max_pair[0]+1} and Doc{max_pair[1]+1} (similarity: {max_sim:.3f})")

# 5. CUSTOM PREPROCESSING PIPELINE
print("\n" + "="*60)
print("5. CUSTOM PREPROCESSING PIPELINE")
print("="*60)

def custom_preprocessor(text):
    """Custom preprocessing function for vectorizer."""
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove emails
    text = re.sub(r'\S+@\S+', '', text)
    # Remove hashtags but keep text
    text = re.sub(r'#', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

def custom_tokenizer(text):
    """Custom tokenizer."""
    # Simple word tokenization
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

custom_vec = TfidfVectorizer(
    preprocessor=custom_preprocessor,
    tokenizer=custom_tokenizer,
    stop_words='english',
    ngram_range=(1, 2)
)
custom_matrix = custom_vec.fit_transform(documents)
print(f"Custom pipeline - Shape: {custom_matrix.shape}")
print(f"Features: {custom_vec.get_feature_names_out()[:15]}")

# 6. DIMENSIONALITY REDUCTION
print("\n" + "="*60)
print("6. DIMENSIONALITY REDUCTION")
print("="*60)

# LSA (Latent Semantic Analysis)
lsa = TruncatedSVD(n_components=3, random_state=42)
lsa_matrix = lsa.fit_transform(tfidf_matrix)

print(f"LSA reduced shape: {lsa_matrix.shape}")
print(f"Explained variance ratio: {lsa.explained_variance_ratio_}")
print(f"Total variance explained: {lsa.explained_variance_ratio_.sum():.3f}")

# Show document representations
print(f"\nDocument representations in 3D LSA space:")
lsa_df = pd.DataFrame(
    lsa_matrix,
    columns=['Dim1', 'Dim2', 'Dim3'],
    index=[f'Doc{i+1}' for i in range(len(documents))]
)
print(lsa_df.round(3))

# LDA Topic Modeling
lda = LatentDirichletAllocation(
    n_components=2,
    random_state=42,
    max_iter=10
)
lda_matrix = lda.fit_transform(count_matrix)

print(f"\nLDA topic distribution (Doc1): {lda_matrix[0].round(3)}")

# 7. TEXT LENGTH FEATURES
print("\n" + "="*60)
print("7. TEXT STATISTICAL FEATURES")
print("="*60)

def extract_text_features(texts):
    """Extract statistical features from text."""
    features = []

    for text in texts:
        # Basic counts
        char_count = len(text)
        word_count = len(text.split())
        # Drop empty fragments so trailing punctuation does not add a phantom sentence.
        # Still approximate: periods inside URLs or emails also split.
        sentence_count = len([s for s in re.split(r'[.!?]+', text) if s.strip()])

        # Averages
        avg_word_length = np.mean([len(w) for w in text.split()]) if word_count > 0 else 0
        avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0

        # Special characters
        exclamation_count = text.count('!')
        question_count = text.count('?')
        uppercase_ratio = sum(1 for c in text if c.isupper()) / len(text) if text else 0

        # Digital features
        number_count = len(re.findall(r'\d+', text))
        url_count = len(re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\)]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text))
        email_count = len(re.findall(r'\S+@\S+', text))

        features.append({
            'char_count': char_count,
            'word_count': word_count,
            'sentence_count': sentence_count,
            'avg_word_length': avg_word_length,
            'avg_sentence_length': avg_sentence_length,
            'exclamation_count': exclamation_count,
            'question_count': question_count,
            'uppercase_ratio': uppercase_ratio,
            'number_count': number_count,
            'url_count': url_count,
            'email_count': email_count
        })

    return pd.DataFrame(features)

text_features = extract_text_features(documents)
print("Text statistical features:")
print(text_features)

# 8. SENTIMENT ANALYSIS WITH TEXTBLOB
print("\n" + "="*60)
print("8. SENTIMENT ANALYSIS")
print("="*60)

try:
    # Imported here so that a missing installation is caught by the except below
    from textblob import TextBlob

    sentiment_data = []
    for doc in documents:
        blob = TextBlob(doc)
        sentiment_data.append({
            'text': doc[:40] + '...',
            'polarity': blob.sentiment.polarity,  # -1 (negative) to 1 (positive)
            'subjectivity': blob.sentiment.subjectivity  # 0 (objective) to 1 (subjective)
        })

    sentiment_df = pd.DataFrame(sentiment_data)
    print("Sentiment analysis results:")
    print(sentiment_df.round(3).to_string(index=False))

except ImportError:
    print("TextBlob not available. Install with: pip install textblob")

# 9. N-GRAM ANALYSIS
print("\n" + "="*60)
print("9. N-GRAM ANALYSIS")
print("="*60)

# Bigrams
bigram_vec = CountVectorizer(ngram_range=(2, 2), stop_words='english')
bigram_matrix = bigram_vec.fit_transform(documents)
bigram_names = bigram_vec.get_feature_names_out()
bigram_counts = bigram_matrix.sum(axis=0).A1

top_bigrams = sorted(zip(bigram_names, bigram_counts), key=lambda x: x[1], reverse=True)[:10]
print("Top bigrams:")
for bigram, count in top_bigrams:
    print(f"  {bigram}: {count}")

# Trigrams
trigram_vec = CountVectorizer(ngram_range=(3, 3), stop_words='english')
trigram_matrix = trigram_vec.fit_transform(documents)
trigram_names = trigram_vec.get_feature_names_out()
trigram_counts = trigram_matrix.sum(axis=0).A1

top_trigrams = sorted(zip(trigram_names, trigram_counts), key=lambda x: x[1], reverse=True)[:5]
print("\nTop trigrams:")
for trigram, count in top_trigrams:
    print(f"  {trigram}: {count}")

# 10. COMPLETE TEXT PIPELINE
print("\n" + "="*60)
print("10. COMPLETE TEXT PIPELINE")
print("="*60)

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier

class TextStatsExtractor(BaseEstimator, TransformerMixin):
    """Custom transformer for text statistics."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return extract_text_features(X).values

# Create pipeline
text_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(max_features=100, stop_words='english')),
        ('stats', TextStatsExtractor())
    ])),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=42))
])

# Dummy labels for demonstration
y = [0, 0, 1, 1, 0, 1]

# Fit pipeline
text_pipeline.fit(documents, y)
print("Pipeline fitted successfully!")
print(f"Feature union shape: {text_pipeline.named_steps['features'].transform(documents).shape}")

# 11. BEST PRACTICES SUMMARY
print("\n" + "="*60)
print("11. BEST PRACTICES FOR TEXT PREPROCESSING")
print("="*60)

best_practices = {
    'Practice': [
        'Always clean text',
        'Handle encoding',
        'Consider domain',
        'Use n-grams',
        'Try TF-IDF over Count',
        'Limit vocabulary',
        'Handle OOV',
        'Normalize case',
        'Consider stemming/lemmatization',
        'Validate with embeddings'
    ],
    'Description': [
        'Remove URLs, emails, HTML, special characters before vectorization',
        'Use UTF-8; normalize unicode to prevent encoding errors',
        'Medical text needs different preprocessing than social media',
        'Bigrams/trigrams capture phrases (not just bag of words)',
        'TF-IDF down-weights terms common across the corpus; often a stronger baseline than raw counts',
        'Limit vocab size (10k-50k) to prevent overfitting and speed up training',
        'Handle out-of-vocabulary words with <UNK> token or subword tokenization',
        'Lowercase unless case matters (proper nouns, sentiment)',
        'Lemmatization preserves meaning better than stemming; both are often skipped for subword-based deep models',
        'Compare TF-IDF vs word embeddings (Word2Vec, BERT) for your task'
    ]
}

print(pd.DataFrame(best_practices).to_string(index=False))

When to Use

✅ Appropriate Use Cases:

TF-IDF: Use for document classification, information retrieval, similarity search, topic modeling
Count Vectorizer: Use for topic modeling (LDA), simple baseline, when frequency matters more than rarity
N-grams: Use for capturing phrases and multi-word expressions, sentiment analysis, named entities
Stop word removal: Use for high-dimensional data, topic modeling; skip for sentiment or when stop words carry meaning
Stemming: Use for information retrieval, reducing dimensionality; skip when word form matters
Lemmatization: Use when you need valid dictionary forms (e.g. 'better' → 'good') for semantic analysis; results depend on correct part-of-speech information

❌ Avoid When:

Don't remove stop words for sentiment analysis—'not good' vs 'good' is critical
Avoid lowercasing and aggressive stemming for named entity recognition: 'US' and 'us' are different, and a stemmed name may no longer match its entity
Don't use bag-of-words when word order matters—use n-grams or embeddings instead
Avoid aggressive cleaning that removes semantic markers—emoticons carry sentiment
Don't ignore out-of-vocabulary words in production—have a fallback strategy
Avoid TF-IDF alone for semantic similarity—word embeddings capture meaning better
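The first two points can be seen in a few lines. This is a minimal sketch with a toy stop word list invented for the example (the names `STOPWORDS`, `NEGATIONS`, and the `keep` parameter are illustrative, not from any library). It shows that default stop word removal makes a positive and a negated sentence indistinguishable, and that whitelisting negations plus bigrams restores the difference.

```python
import re

# Toy lists for illustration only; real stop word lists are much longer
STOPWORDS = {"the", "is", "was", "a", "this", "it", "but", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def tokens(text):
    return re.findall(r"\b\w+\b", text.lower())

def remove_stopwords(toks, keep=frozenset()):
    """Drop stop words, except any token listed in `keep`."""
    return [t for t in toks if t not in STOPWORDS or t in keep]

def bigrams(toks):
    return [" ".join(toks[i:i + 2]) for i in range(len(toks) - 1)]

positive = tokens("This movie is good")
negative = tokens("This movie is not good")

# With the default list, both sentences collapse to the same tokens
print(remove_stopwords(positive))  # ['movie', 'good']
print(remove_stopwords(negative))  # ['movie', 'good']

# Keeping negations preserves the distinction, and bigrams capture 'not good'
kept = remove_stopwords(negative, keep=NEGATIONS)
print(kept)           # ['movie', 'not', 'good']
print(bigrams(kept))  # ['movie not', 'not good']
```

Whether to whitelist negations or skip stop word removal entirely depends on the task; the point is only that the default list should be checked against what the model needs to see.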

Common Pitfalls

Data leakage from preprocessing—fitting vectorizers on full data before split
Case sensitivity issues—'Apple' (company) vs 'apple' (fruit) need different handling
Not handling OOV in production—new words cause errors or silent failures
Over-reliance on English stop words—domain-specific stop words often needed
Ignoring document length—normalize for length bias in classification
Not validating tokenization—custom tokenizers may break on edge cases
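The leakage and OOV pitfalls are closely related: a vocabulary should be learned from the training split only, and anything unseen at prediction time needs a defined fallback. Below is a minimal pure-Python sketch under those assumptions; `fit_vocab` and `vectorize` are hypothetical helper names and the documents are invented. With scikit-learn, the equivalent discipline is calling `fit` (or `fit_transform`) on training data and only `transform` on test data.

```python
from collections import Counter

def fit_vocab(train_docs, min_count=1):
    """Build a token -> index map from training documents only."""
    counts = Counter(tok for doc in train_docs for tok in doc.lower().split())
    vocab = {"<UNK>": 0}  # index 0 is reserved for out-of-vocabulary tokens
    for tok in sorted(counts):
        if counts[tok] >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def vectorize(doc, vocab):
    """Count vector for one document; unseen tokens are counted under <UNK>."""
    vec = [0] * len(vocab)
    for tok in doc.lower().split():
        vec[vocab.get(tok, vocab["<UNK>"])] += 1
    return vec

train_docs = ["the fox jumps", "the dog sleeps"]
test_docs = ["the cat jumps"]  # 'cat' never appears in training

vocab = fit_vocab(train_docs)  # fitted on the training split only
X_train = [vectorize(d, vocab) for d in train_docs]
X_test = [vectorize(d, vocab) for d in test_docs]

print(vocab)
print(X_test[0])  # 'cat' is counted in the <UNK> slot instead of raising an error
```

Fitting the vocabulary on train and test together would let test-set words and frequencies shape the features, which tends to inflate evaluation scores relative to real deployment.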
