In this article, you will learn practical, advanced ways to use large language models (LLMs) to engineer features that fuse structured (tabular) data with text for stronger downstream models. Topics we will cover include:

- Generating semantic features from tabular contexts and combining them with numeric data.
- Using LLMs for context-aware imputation, enrichment, and domain-driven feature construction.
- Building hybrid embedding spaces and guiding feature selection with model-informed reasoning.

Let's get right to it.

5 Advanced Feature Engineering Techniques with LLMs for Tabular Data

Introduction

In the era of LLMs, it may seem like classical machine learning concepts, methods, and techniques such as feature engineering are no longer in the spotlight. In fact, feature engineering still matters, significantly. It can be extremely valuable on raw text data used as input to LLMs: not only can it help preprocess and structure unstructured data like text, but it can also enhance how state-of-the-art LLMs extract, generate, and transform information when combined with tabular (structured) data sources. Integrating tabular data into LLM workflows has multiple benefits, such as enriching the feature space underlying the main text inputs, driving semantic augmentation, and automating model pipelines by bridging the otherwise notable gap between structured and unstructured data. This article presents five advanced feature engineering techniques through which LLMs can incorporate valuable information from (and into) fully structured, tabular data in their workflows.

1. Semantic Feature Generation via Textual Contexts

LLMs can be used to describe or summarize rows, columns, or values of categorical attributes in a tabular dataset, generating text-based embeddings as a result.
Based on the extensive knowledge gained during training on a vast dataset, an LLM could, for instance, receive a value of a "postal code" attribute in a customer dataset and output context-enriched information like "this customer lives in a rural postal region." These contextually aware text representations can notably enrich the original dataset. We can then use a Sentence Transformers model (hosted on Hugging Face) to turn the LLM-generated text into meaningful embeddings that combine seamlessly with the rest of the tabular data, building a much more informative input for downstream predictive machine learning models like ensemble classifiers and regressors (e.g., with scikit-learn). Here's an example of this procedure:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# LLM-generated description (mocked in this example for the sake of simplicity)
llm_description = "A32 refers to a rural postal region in the northwest."

# Create text embeddings using a Sentence Transformers model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embedding = model.encode(llm_description)  # shape e.g. (384,)

numeric_features = np.array([0.42, 1.07])
hybrid_features = np.concatenate([numeric_features, embedding])
print("Hybrid feature vector shape:", hybrid_features.shape)
```

2. Intelligent Missing-Value Imputation and Data Enrichment

Why not use LLMs to push past conventional missing-value imputation techniques, which are often based on simple column-level summary statistics? When trained for tasks like text completion, LLMs can infer missing values or "gaps" in categorical or text attributes through pattern analysis and inference, or even by reasoning over columns related to the one containing the missing value. One possible strategy is to craft few-shot prompts, with examples that guide the LLM toward the precise kind of desired output. For example, missing information about a customer called Alice could be completed by attending to relational cues from other columns:

```python
prompt = """Customer data:
Name: Alice
City: Paris
Occupation: [MISSING]
Infer occupation."""
# Example response: "Likely 'Tourism professional' or 'Hospitality worker'"
```

The potential benefit of using LLMs for imputing missing information is contextual, explainable imputation beyond what traditional statistical methods provide.

3. Domain-Specific Feature Construction Through Prompt Templates

This technique entails constructing new features with the aid of LLMs. Instead of implementing hardcoded logic based on static rules or operations, the key is to encode domain knowledge in prompt templates from which new, engineered, interpretable features are derived. Combining concise rationale generation with regular expressions (or keyword post-processing) is an effective strategy, as shown in the example below from the financial domain:

```python
prompt = """
Transaction: 'ATM withdrawal downtown'
Task: Classify spending category and risk level.
Provide a short rationale, then give the final answer in JSON.
"""
```

The text "ATM withdrawal" hints at a cash-related transaction, whereas "downtown" may indicate little to no risk. Hence, we directly ask the LLM for new structured attributes, like the transaction's category and risk level, using the above prompt template:

```python
import json, re

# Mocked LLM response to the prompt above
response = """
Rationale: 'ATM withdrawal' indicates a cash-related transaction.
Location 'downtown' does not add risk.
Final answer: {"category": "Cash withdrawal", "risk": "Low"}
"""

result = json.loads(re.search(r"\{.*\}", response).group())
print(result)  # {'category': 'Cash withdrawal', 'risk': 'Low'}
```

4. Hybrid Embedding Spaces for Structured–Unstructured Data Fusion

This strategy merges numeric embeddings, e.g., those obtained by applying PCA or autoencoders to a high-dimensional dataset, with semantic embeddings produced by models like sentence transformers. The result: hybrid, joint feature spaces.
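The fusion described in this section can be sketched end to end. The snippet below is a minimal illustration, with mocked random arrays standing in for real tabular features and real sentence-transformer embeddings (the 384-dimensional embedding size matches all-MiniLM-L6-v2; the row count and PCA dimensionality are arbitrary choices for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 100 rows of high-dimensional numeric/tabular features (mocked)
numeric = rng.normal(size=(100, 50))

# Mocked semantic embeddings; in practice these come from encoding
# LLM-generated row descriptions with a sentence-transformer model
semantic = rng.normal(size=(100, 384))

# Compress the numeric block with PCA, then fuse it with the semantic block
numeric_compact = PCA(n_components=10, random_state=0).fit_transform(numeric)
hybrid = np.hstack([numeric_compact, semantic])

print(hybrid.shape)  # (100, 394)
```

A downstream scikit-learn classifier or regressor can then be trained directly on the `hybrid` matrix, seeing both views of the data at once.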
7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings
Introduction

Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings. These embeddings are useful for incorporating additional information into traditional predictive machine learning models, such as those used in scikit-learn, to improve downstream performance. This article presents seven advanced Python examples of feature engineering tricks that add extra value to text data by leveraging LLM-generated embeddings, thereby enhancing the accuracy and robustness of downstream machine learning models that rely on text, in applications such as sentiment analysis, topic classification, document clustering, and semantic similarity detection.

Common Setup for All Examples

Unless stated otherwise, the seven example tricks below make use of this common setup. We rely on Sentence Transformers for embeddings and scikit-learn for modeling utilities.

```python
!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight LLM embedding model; builds 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
```

1. Combining TF-IDF and Embedding Features

The first example shows how to jointly extract, from a source text dataset like fetch_20newsgroups, both TF-IDF and LLM-generated sentence-embedding features. We then combine these feature types to train a logistic regression model that classifies news texts based on the combined features, often boosting accuracy by capturing both lexical and semantic information.
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Loading data
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
texts, y = data.data[:500], data.target[:500]

# Extracting features of two broad types
tfidf = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()
emb = model.encode(texts, show_progress_bar=False)

# Combining features and training ML model
X = np.hstack([tfidf, StandardScaler().fit_transform(emb)])
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Accuracy:", clf.score(X, y))
```

2. Topic-Aware Embedding Clusters

This trick takes a few sample text sequences, generates embeddings using the preloaded language model, applies K-Means clustering on these embeddings to assign topics, and then combines the embeddings with a one-hot encoding of each example's cluster identifier (its "topic class") to build a new feature representation. It is a useful strategy for creating compact topic meta-features.
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

texts = ["Tokyo Tower is a popular landmark.",
         "Sushi is a traditional Japanese dish.",
         "Mount Fuji is a famous volcano in Japan.",
         "Cherry blossoms bloom in the spring in Japan."]

emb = model.encode(texts)
topics = KMeans(n_clusters=2, n_init="auto", random_state=42).fit_predict(emb)
topic_ohe = OneHotEncoder(sparse_output=False).fit_transform(topics.reshape(-1, 1))

X = np.hstack([emb, topic_ohe])
print(X.shape)
```

3. Semantic Anchor Similarity Features

This simple strategy computes similarity to a small set of fixed "anchor" (or reference) sentences used as compact semantic descriptors, essentially semantic landmarks. Each column in the similarity-feature matrix contains the similarity of the text to one anchor. The main value lies in allowing the model to learn relationships between the text's similarity to key concepts and a target variable, which is useful for text classification models.
```python
from sklearn.metrics.pairwise import cosine_similarity

anchors = ["space mission", "car performance", "politics"]
anchor_emb = model.encode(anchors)

texts = ["The rocket launch was successful.",
         "The car handled well on the track."]
emb = model.encode(texts)

sim_features = cosine_similarity(emb, anchor_emb)
print(sim_features)
```

4. Meta-Feature Stacking via an Auxiliary Sentiment Classifier

For text associated with labels such as sentiments, the following feature engineering technique adds extra value. A meta-feature is built from the prediction probability returned by an auxiliary classifier trained on the embeddings. This meta-feature is stacked with the original embeddings, resulting in an augmented feature set that can improve downstream performance by exposing potentially more discriminative information than raw embeddings alone.
A slight additional setup is needed for this example:

```python
!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim

# Small dataset containing texts and sentiment labels
texts = ["I love this!", "This is terrible.", "Amazing quality.", "Not good at all."]
y = np.array([1, 0, 1, 0])

# Obtain embeddings from the embedder LLM
emb = embedder.encode(texts, show_progress_bar=False)

# Train an auxiliary classifier on embeddings
X_train, X_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.5, random_state=42, stratify=y
)
meta_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Leverage the auxiliary model's predicted probability as a meta-feature
meta_feature = meta_clf.predict_proba(emb)[:, 1].reshape(-1, 1)  # prob of positive class

# Augment original embeddings with the meta-feature
# Do not forget to scale again for consistency
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(emb)
X_aug = np.hstack([emb_scaled, meta_feature])  # stack features together

print("emb shape:", emb.shape)
print("meta_feature shape:", meta_feature.shape)
print("augmented shape:", X_aug.shape)
print("meta clf accuracy on test slice:", meta_clf.score(X_test, y_test))
```
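The augmented matrix built in trick 4 is meant to feed a downstream model. The following is a self-contained sketch of that final step; synthetic random arrays stand in for the embeddings and meta-feature built above (the names mirror the snippet, but the data here is mocked so the example runs without a model download):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Mocked stand-ins: 4 texts, 384-dim embeddings, and one column of
# auxiliary-classifier probabilities acting as the meta-feature
emb = rng.normal(size=(4, 384))
meta_feature = rng.uniform(size=(4, 1))
y = np.array([1, 0, 1, 0])

# Same stacking recipe as above: scaled embeddings + meta-feature
X_aug = np.hstack([StandardScaler().fit_transform(emb), meta_feature])

# The downstream model consumes embeddings and meta-feature together
final_clf = LogisticRegression(max_iter=1000).fit(X_aug, y)
print("augmented shape:", X_aug.shape)  # (4, 385)
```

In a real pipeline, the auxiliary classifier's probabilities should come from held-out (e.g. cross-validated) predictions so the meta-feature does not leak training labels into the downstream model.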
7 Machine Learning Projects to Land Your Dream Job in 2026
Introduction

Machine learning continues to evolve faster than most can keep up with. New frameworks, datasets, and applications emerge every month, making it hard to know which skills will actually matter to employers. But one thing never changes: projects speak louder than certificates. When hiring managers scan portfolios, they want to see real-world applications that solve meaningful problems, not just notebook exercises. The right projects don't just show that you can code; they prove that you can think like a data scientist and build like an engineer. So if you want to stand out in 2026, these seven projects will help you do exactly that.

1. Predictive Maintenance for IoT Devices

Manufacturers, energy providers, and logistics companies all want to predict equipment failure before it happens. Building a predictive maintenance model teaches you how to handle time-series data, feature engineering, and anomaly detection. You'll work with sensor data, which is messy and often incomplete, so it's a great way to practice real-world data wrangling. A good approach is to use Long Short-Term Memory (LSTM) networks or tree-based models like XGBoost to predict when a machine is likely to fail. Combine that with data visualization to show insights over time. This kind of project signals that you can bridge hardware and AI, an increasingly desirable skill as more devices become connected. If you want to take it further, create an interactive dashboard that shows predicted failures and maintenance schedules. This demonstrates not just your machine learning skills but also your ability to communicate results effectively.

Dataset to get started: NASA C-MAPSS Turbofan Engine Degradation

2. AI-Powered Resume Screener

Every company wants to save time on recruiting, and AI-based screening tools are already becoming standard.
By building one yourself, you'll explore natural language processing (NLP) techniques like tokenization, named entity recognition, and semantic search. This project combines text classification and information extraction, two critical subfields in modern machine learning. Start by collecting anonymized resumes or job postings from public datasets. Then train a model to match candidates with roles based on skill keywords, project relevance, and even sentiment cues from descriptions. It's an excellent demonstration of how AI can streamline workflows. Add a bias detection feature if you want to stand out even more, and perhaps establish a legitimate side hustle, just like 36% of Americans already have. With machine learning, your opportunities for scaling are practically infinite.

Dataset to get started: Updated Resume Dataset

3. Personalized Learning Recommender

Education technology (EdTech) is one of the fastest-growing industries, and recommendation systems drive much of that innovation. A personalized learning recommender uses a combination of user profiling, content-based filtering, and collaborative filtering to suggest courses or learning materials tailored to individual preferences. Building this kind of system forces you to work with sparse matrices and similarity metrics, which deepens your understanding of recommendation algorithms. You can use public education datasets, like those from Coursera or Khan Academy, to start. To make it portfolio-ready, include user interaction tracking and explainability features, such as why a course was recommended. Recruiters love seeing interpretable AI, especially in human-centered applications like education.

Dataset to get started: KDD Cup 2015

4. Real-Time Traffic Flow Prediction

Urban AI is one of the hottest emerging fields, and traffic prediction sits right at its core. This project challenges you to process live or historical data to forecast congestion levels.
It's ideal for showing off your data streaming and time-series modeling skills. You can experiment with architectures like Graph Neural Networks (GNNs), which model city roads as interconnected nodes. Alternatively, CNN–LSTM hybrids perform well when you need to capture both spatial and temporal patterns. Make sure to highlight your deployment pipeline if you host your model in a cloud environment or stream data from APIs like Google Maps. That level of technical maturity separates beginners from engineers who can deliver end-to-end solutions.

Dataset to get started: METR-LA (traffic sensor time series)

5. Deepfake Detection System

As AI-generated media becomes more sophisticated, deepfake detection has turned into an urgent global concern. Building a classifier that distinguishes between authentic and manipulated images or videos not only strengthens your computer vision skills but also shows that you're aware of AI's ethical dimensions. You can start with publicly available datasets like FaceForensics++ and experiment with convolutional neural networks (CNNs) or transformer-based models. The biggest challenge will be generalization: training a model that works across unseen data and different manipulation techniques. This project shines because it combines technical and moral responsibility. A well-documented notebook that discusses false positives and potential misuse makes you stand out as someone who doesn't just build AI but understands its implications.

Dataset to get started: Deepfake Detection Challenge (DFDC)

6. Multimodal Sentiment Analysis

Most sentiment analysis projects focus on text, but modern applications demand more. Think of a model that can analyze speech tone, facial expressions, and text simultaneously. That's where multimodal learning comes in. It's complex, fascinating, and instantly eye-catching on a resume.
You'll likely combine CNNs for visual data, recurrent neural networks (RNNs) or transformers for textual data, and maybe even spectrogram analysis for audio. The integration challenge, making all these modalities talk to each other, is what really showcases your skill. If you want to polish the project for recruiters, create a simple web interface where users can upload a short video and see the detected sentiment in real time. That demonstrates deployment skills, user experience awareness, and creativity all at once.

Dataset to get started: CMU-MOSEI

7. AI Agent for Financial Forecasting

Finance has always been fertile ground for machine learning, and 2026 will be no different. Building an AI agent that learns to predict stock movements or cryptocurrency trends allows you to combine reinforcement learning with traditional forecasting techniques. You can start simple, training an agent using historical data and a reward system based on return rates. Then expand by incorporating real-time
Two AIs – Artificial Intelligence And Aspirational Indian Powering India Today: Bansuri Swaraj At TiEcon Delhi 2025
With the Narendra Modi government focusing on entrepreneurship, the country already has an ecosystem in place that fosters innovation. Lok Sabha MP Bansuri Swaraj on Thursday said that India today is powered by two AIs, and that when the two meet, they accelerate the progress of the country.

Speaking during TiEcon Delhi 2025, the BJP MP affirmed her faith in women-led development, saying that under Digital India, technology has become a tool for public good. "India today is powered by two AIs - Artificial Intelligence and the Aspirational Indian. When the two meet, they accelerate progress. As we enter the decade of deeptech, women must be at the forefront, because if we leave out half of our population, we are not building artificial intelligence, we are risking artificial ignorance. Women who were once silent engines of progress are now becoming focal visionaries in technology, and that shift is transforming India's story. Under the Digital India vision of Prime Minister Narendra Modi, technology has become a tool for public good, empowering talent across the nation and ensuring equitable access for women," said Swaraj, after unveiling the 'Wired for Impact: Women in AI' report by Kalaari. The report recognizes and applauds the achievements of women leaders shaping India's AI landscape.

With over 2,000 delegates, TiEcon Delhi 2025 affirmed its position as one of the country's leading deeptech summits while shining a powerful spotlight on women-led innovation, AI inclusion, and financial leadership. The Wired for Impact report reveals that while women currently make up only one in five professionals in India's technology workforce, this number is projected to grow nearly fourfold by 2027, with over 3.3 lakh women expected to hold AI roles. The report also found that AI/ML has emerged as the most preferred career track for women in technology, with 41% choosing it over other domains, a figure that even surpasses their male counterparts at 37%.
TiEcon Delhi 2025 brought together policymakers, investors, and founders on one platform, creating a powerful collective voice in support of India's entrepreneurial growth. "We are gratified about the participation from corporates and, in particular, key decision makers across the government departments. Our startup pitching sessions highlighted breakthrough ideas and the investor community's enthusiasm reaffirmed the immense potential that lies ahead for India's innovation economy," said Geetika Dayal, Director General, TiE Delhi-NCR.

Speaking at the conference, Vani Kola, MD, Kalaari Capital, said, "Innovation reaches its full potential only when it reflects the diversity of those it serves. In India, women continue to be underrepresented in technology, especially in roles that require advanced technical skills or leadership. With AI specifically, underrepresentation doesn't just limit participation; it limits perspective and, ultimately, impact. When the systems we build learn and reason from a narrow or biased worldview, they risk encoding those same limitations into the intelligence that shapes our future." Experts noted that if India is to build better and more trustworthy AI for the world, diversity must be treated as a mission-critical KPI.
The Machine Learning Practitioner’s Guide to Fine-Tuning Language Models
In this article, you will learn when fine-tuning large language models is warranted, which 2025-ready methods and tools to choose, and how to avoid the most common mistakes that derail projects. Topics we will cover include:

- A practical decision framework: prompt engineering, retrieval-augmented generation (RAG), and when fine-tuning truly adds value.
- Today's essential methods (LoRA/QLoRA, Spectrum) and alignment with DPO, plus when to pick each.
- Data preparation, evaluation, and proven configurations that keep you out of trouble.

Let's not waste any more time.

Introduction

Fine-tuning has become much more accessible in 2024–2025, with parameter-efficient methods letting even 70B+ parameter models run on consumer GPUs. But should you fine-tune at all? And if so, how do you choose between the dozens of emerging techniques? This guide is for practitioners who want results, not just theory. You'll learn when fine-tuning makes sense, which methods to use, and how to avoid common pitfalls.

Fine-tuning is different from traditional machine learning. Instead of training models from scratch, you're adapting pretrained models to specialized tasks using far less data and compute. This makes sophisticated natural language processing (NLP) capabilities accessible without billion-dollar budgets. For machine learning practitioners, this builds on skills you already have. Data preparation, evaluation frameworks, and hyperparameter tuning remain central. You'll need to learn new architectural patterns and efficiency techniques, but your existing foundation gives you a major advantage.
You'll learn:

- When fine-tuning provides value versus simpler alternatives like prompt engineering or retrieval-augmented generation (RAG)
- The core parameter-efficient methods (LoRA, QLoRA, Spectrum) and when to use each
- Modern alignment techniques (DPO, RLHF) that make models follow instructions reliably
- Data preparation strategies that determine most of your fine-tuning success
- Critical pitfalls in overfitting and catastrophic forgetting, and how to avoid them

If you're already working with LLMs, you have what you need. If you need a refresher, check out our guides on prompt engineering and LLM applications. Before getting into fine-tuning mechanics, you need to understand whether fine-tuning is the right approach.

When to Fine-Tune Versus Alternative Approaches

Fine-tuning should be your last resort, not your first choice. The recommended progression starts with prompt engineering, escalates to RAG when external knowledge is needed, and only proceeds to fine-tuning when deep specialization is required. Google Cloud's decision framework and Meta AI's practical guide identify clear criteria: use prompt engineering for basic task adaptation; use RAG when you need source citations, must ground responses in documents, or information changes frequently. Meta AI identifies five scenarios where fine-tuning provides genuine value: customizing tone and style for specific audiences, maintaining data privacy for sensitive information, supporting low-resource languages, reducing inference costs by distilling larger models, and adding entirely new capabilities not present in base models.

The data availability test: with fewer than 100 examples, stick to prompt engineering. With 100–1,000 examples and static knowledge, consider parameter-efficient methods. Only with 1,000–100,000 examples and a clear task definition should you attempt fine-tuning. For news summarization or general question answering, RAG excels.
For customer support requiring a specific brand voice, or code generation following particular patterns, fine-tuning proves essential. The optimal solution often combines both: fine-tune for specialized reasoning patterns while using RAG for current information.

Essential Parameter-Efficient Fine-Tuning Methods

Full fine-tuning updates all model parameters, requiring massive compute and memory. Parameter-efficient fine-tuning (PEFT) revolutionized this by enabling training with just ~0.1% to 3% of parameters updated, achieving comparable performance while dramatically reducing requirements.

LoRA (Low-Rank Adaptation) emerged as the dominant technique. LoRA freezes pretrained weights and injects trainable rank-decomposition matrices in parallel. Instead of updating entire weight matrices, LoRA represents updates as low-rank decompositions. Weight updates during adaptation often have low intrinsic rank, with rank 8 typically sufficient for many tasks. Memory reductions reach 2× to 3× versus full fine-tuning, with checkpoint sizes decreasing 1,000× to 10,000×: a 350 GB model can require only a ~35 MB adapter file. Training can be ~25% faster on large models. Critically, the learned matrices merge with the frozen weights at deployment, introducing zero inference latency.

QLoRA extends LoRA through aggressive quantization while maintaining accuracy. Base weights are stored in 4-bit format, with computation happening in 16-bit bfloat16. The results can be dramatic: 65B models on 48 GB GPUs, 33B on 24 GB, 13B on consumer 16 GB hardware, while matching many 16-bit full fine-tuning results.

Spectrum, a 2024 innovation, takes a different approach. Rather than adding adapters, Spectrum identifies the most informative layers using signal-to-noise ratio analysis and selectively fine-tunes only the top ~30%. Reports show higher accuracy than QLoRA on mathematical reasoning with comparable resources.
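To make the low-rank idea concrete, here is a toy NumPy sketch of the update LoRA learns. This is an illustration, not a training loop: the dimensions are arbitrary, and the alpha/r scaling and zero-initialized B follow the standard LoRA formulation, so the merged weight initially equals the pretrained weight.

```python
import numpy as np

d, k, r = 1024, 1024, 8   # weight shape and low rank
alpha = 16                # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

# Effective weight: the adapters merge into W at deployment,
# which is why LoRA adds no inference latency
W_eff = W + (alpha / r) * (B @ A)

full_params = d * k
lora_params = r * (d + k)
print(f"trainable fraction: {lora_params / full_params:.2%}")  # 1.56%
```

During training, only A and B (here 16,384 values versus roughly a million in W) receive gradient updates, which is where the memory and checkpoint-size savings come from.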
Decision framework: use LoRA when you need zero inference latency and moderate GPU resources (16–24 GB). Use QLoRA for extreme memory constraints (consumer GPUs, Google Colab) or very large models (30B+). Use Spectrum when working with large models in distributed settings. Ready to implement LoRA and QLoRA? "How to fine-tune open LLMs in 2025" by Phil Schmid provides complete code examples with current best practices. For hands-on practice, try Unsloth's free Colab notebooks.

Modern Alignment and Instruction Tuning

Instruction tuning transforms completion-focused base models into instruction-following assistants, establishing basic capabilities before alignment. The method trains on diverse instruction-response pairs covering question answering, summarization, translation, and reasoning. Quality matters far more than quantity, with ~1,000 high-quality examples often sufficient.

Direct Preference Optimization (DPO) has rapidly become the preferred alignment method by dramatically simplifying reinforcement learning from human feedback (RLHF). The key idea: re-parameterize the reward as implicit in the policy itself, solving the RLHF objective through supervised learning rather than complex reinforcement learning. Research from Stanford and others reports that DPO can achieve comparable or superior performance to PPO-based RLHF with single-stage training, ~50% less compute, and greater stability. DPO requires only preference data (prompt, chosen response, rejected response), a reference policy, and standard supervised learning infrastructure. The method has become common for training open-source LLMs in 2024–2025, including Zephyr-7B and various Mistral-based models. RLHF remains the foundational alignment technique but brings high complexity: managing four model copies during training (policy,
WhatsApp New Feature: Soon, Individual Storage Management Per Chat At Your Fingertips
If you’ve ever run out of phone storage and wondered which WhatsApp chat is the culprit, there’s some good news coming your way. WhatsApp is reportedly working on a handy new feature that will let users see — and manage — how much storage space each individual chat is using. The update was spotted in a recent beta version of the app and, according to WABetaInfo, the feature is already being tested by a few users through Apple’s TestFlight program. So, what’s new here? Basically, WhatsApp is adding a “Manage Storage” option right inside the chat info screen. This means you’ll be able to open a specific chat — whether it’s with a friend or a group — and see exactly how much space it’s taking up. You’ll even get a neat gallery-style breakdown of all the photos, videos, documents, and other files shared in that conversation. Up until now, users had to dig through the app’s general settings under Storage and Data > Manage Storage to find this kind of information. That method shows overall storage usage but mixes up files from all chats. The new feature, on the other hand, zooms in on each conversation, making it way easier to spot which chats are hoarding the most space. If you often share memes, long videos, or hundreds of photos in group chats, this could be a real lifesaver. Instead of guessing which chat to clear out, you’ll be able to see exactly where your gigabytes are going — and clean up accordingly. WhatsApp hasn’t said when this feature will officially roll out, but since it’s already showing up in beta, it’s safe to assume it’ll make its way to all Android and iOS users soon. When it does, managing storage on WhatsApp will get a lot more intuitive — no more blind deleting or surprise “storage full” pop-ups. Just clear the clutter and keep the chats that actually matter.
Elon Musk’s Starlink To Run Technical, Security Demos In Mumbai From Oct 30
New Delhi: Elon Musk’s SpaceX-owned Starlink is scheduled to conduct demonstration runs in Mumbai on October 30 and 31 to demonstrate compliance with India’s security and technical requirements for satellite broadband services, according to people familiar with the developments. The demonstrations, to be conducted before law enforcement agencies, will be based on the provisional spectrum assigned to Starlink, which would mark a significant step ahead of its planned entry into the Indian satellite broadband market, they said. This step is necessary for the company to obtain clearances to commence commercial operations in the country. Starlink will run a demo to show compliance with the security and technical conditions of its Global Mobile Personal Communication by Satellite (GMPCS) authorisation. Over 10 satellite operators, including the licensed Starlink, have entered India, with the private sector permitted to hold up to 100 per cent FDI. Elon Musk’s Starlink is the world’s dominant satcom operator with a constellation of 7,578 satellites. India has so far granted the necessary approvals to Starlink, the Reliance Jio-SES JV, and Bharti Group-backed Eutelsat OneWeb to offer satcom services in the country. The opening up of direct-to-cell communications service, which refers to a signal from a satellite directly to a mobile phone, has strengthened the growing satcom market in India. Internet penetration remains limited in certain regions of the country, underscoring the need for satellite internet to complement existing networks. Satellite internet refers to the internet service provided through satellites placed in Geostationary Orbits (GSO) or Non-Geostationary Orbits (NGSO). The government had said in August that the data, traffic, and other details accumulated by Elon Musk’s Starlink will be stored in India, and domestic user traffic is not to be mirrored to any system or server located abroad.
10 Python One-Liners for Generating Time Series Features
Introduction Time series data normally requires an in-depth understanding in order to build effective and insightful forecasting models. Two key properties are critical in time series forecasting: representation and granularity. Representation entails using meaningful approaches to transform raw temporal data — e.g. daily or hourly measurements — into informative patterns. Granularity is about analyzing how precisely such patterns capture variations across time. As two sides of the same coin, their difference is subtle, but one thing is certain: both are achieved through feature engineering. This article presents 10 simple Python one-liners for generating time series features based on different characteristics and properties underlying raw time series data. These one-liners can be used in isolation or in combination to help you create more informative datasets that reveal much about your data’s temporal behavior — how it evolves, how it fluctuates, and which trends it exhibits over time. Note that our examples make use of Pandas and NumPy. 1. Lag Feature (Autoregressive Representation) The idea behind using autoregressive representation or lag features is simpler than it sounds: it consists of adding the previous observation as a new predictor feature in the current observation. In essence, this is arguably the simplest method to represent temporal dependency, e.g. between the current time instant and previous ones. As the first one-liner in this list of 10, let’s look at this one more closely. This example one-liner assumes you have stored a raw time series dataset in a DataFrame called df, one of whose existing attributes is named 'value'.
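To experiment with the one-liners that follow, a minimal synthetic DataFrame with the expected 'value' column can be built like this (the series itself is made up purely for demonstration), with the lag feature applied:

```python
import numpy as np
import pandas as pd

# Synthetic daily series with the 'value' column the one-liners expect
df = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": np.arange(10, dtype=float),
})

# Lag feature: the previous observation becomes a new predictor column
df["lag_1"] = df["value"].shift(1)

print(df[["Date", "value", "lag_1"]].head())
```

The first row of lag_1 is NaN, since there is no observation before it; downstream models typically need those initial rows dropped or imputed.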
Note that the argument in the shift() function can be adjusted to fetch the value registered n time instants or observations before the current one: df['lag_1'] = df['value'].shift(1) For daily time series data, if you wanted to capture the previous value for a given day of the week, e.g. the preceding Monday, it would make sense to use shift(7). 2. Rolling Mean (Short-Term Smoothing) To capture local trends or smooth short-term fluctuations in the data, it is usually handy to use rolling means across the n past observations leading up to the current one: this is a simple but very useful way to smooth sometimes chaotic raw time series values over a given feature. This example creates a new feature containing, for each observation, the rolling mean over the current value and the two preceding ones: df['rolling_mean_3'] = df['value'].rolling(3).mean() (Figure: smoothed time series feature with rolling mean) 3. Rolling Standard Deviation (Local Volatility) Similar to rolling means, there is also the possibility of creating new features based on the rolling standard deviation, which is effective for modeling how volatile consecutive observations are. This example introduces a feature to model the variability of the latest values over a moving window of a week, assuming daily observations: df['rolling_std_7'] = df['value'].rolling(7).std() 4. Expanding Mean (Cumulative Memory) The expanding mean calculates the mean of all data points up to (and including) the current observation in the temporal sequence. Hence, it is like a rolling mean with a constantly increasing window size. It is useful for analyzing how the mean of a time series attribute evolves over time, thereby capturing upward or downward trends more reliably in the long term.
df['expanding_mean'] = df['value'].expanding().mean() 5. Differencing (Trend Removal) This technique is used to remove long-term trends, highlighting rates of change — important in non-stationary time series to stabilize them. It calculates the difference between consecutive observations (current and previous) of a target attribute: df['diff_1'] = df['value'].diff() 6. Time-Based Features (Temporal Component Extraction) Simple but very useful in real-world applications, this one-liner can be used to decompose and extract relevant information from the full date-time feature or index your time series revolves around: df['month'], df['dayofweek'] = df['Date'].dt.month, df['Date'].dt.dayofweek Important: Be careful and check whether the date-time information in your time series is contained in a regular attribute or in the index of the data structure. If it is the index, you may need to use this instead: df['hour'], df['dayofweek'] = df.index.hour, df.index.dayofweek 7. Rolling Correlation (Temporal Relationship) This approach takes a step beyond rolling statistics over a time window to measure how recent values correlate with their lagged counterparts, thereby helping discover evolving autocorrelation. This is useful, for example, in detecting regime shifts, i.e. abrupt and persistent behavioral changes in the data over time, which take place when rolling correlations start to weaken or reverse at some point. df['rolling_corr'] = df['value'].rolling(30).corr(df['value'].shift(1)) 8. Fourier Features (Seasonality) Sinusoidal Fourier transformations can be applied to raw time series attributes to capture cyclic or seasonal patterns.
For example, applying the sine (or cosine) function transforms cyclical day-of-year information underlying date-time features into continuous features useful for learning and modeling yearly patterns. df['fourier_sin'] = np.sin(2 * np.pi * df['Date'].dt.dayofyear / 365) df['fourier_cos'] = np.cos(2 * np.pi * df['Date'].dt.dayofyear / 365) Allow me to use a two-liner instead of a one-liner in this example, for a reason: sine and cosine together are better at capturing the big picture of possible cyclic seasonality patterns. 9. Exponentially Weighted Mean (Adaptive Smoothing) The exponentially weighted mean — or EWM for short — is applied to obtain exponentially decaying weights that give higher importance to recent observations while still retaining long-term memory. It is a more adaptive and somewhat “smarter” approach that prioritizes recent observations over the distant past. df['ewm_mean'] = df['value'].ewm(span=5).mean() 10. Rolling Entropy (Information Complexity) A bit more math for the last one! The rolling entropy of a given feature over a time window calculates how random or spread out the values in that window are, thereby revealing the quantity and complexity of information in it. Lower values of the resulting rolling entropy indicate a sense of order and predictability, whereas the
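The rolling-entropy feature this last section describes can be sketched with plain NumPy and Pandas; the 30-observation window, the 10 histogram bins, and the synthetic data are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def window_entropy(window, bins=10):
    """Shannon entropy (in nats) of the histogram of a window's values."""
    counts, _ = np.histogram(window, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                    # drop empty bins to avoid log(0)
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(42)
df = pd.DataFrame({"value": rng.standard_normal(200)})

# Rolling entropy over a 30-observation window; the first 29 rows are NaN
df["rolling_entropy"] = df["value"].rolling(30).apply(window_entropy, raw=True)

print(df["rolling_entropy"].describe())
```

Lower values mean the window's observations cluster into few bins (order and predictability), while the theoretical maximum for 10 bins is ln(10) ≈ 2.30, reached when values spread evenly across all bins.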
The Complete Guide to Model Context Protocol
In this article, you will learn what the Model Context Protocol (MCP) is, why it exists, and how it standardizes connecting language models to external data and tools. Topics we will cover include: The integration problem MCP is designed to solve. MCP’s client–server architecture and communication model. The core primitives (resources, prompts, and tools) and how they work together. Let’s not waste any more time. Introducing Model Context Protocol Language models can generate text and reason impressively, yet they remain isolated by default. Out of the box, they can’t access your files, query databases, or call APIs without additional integration work. Each new data source means more custom code, more maintenance burden, and more fragmentation. Model Context Protocol (MCP) solves this by providing an open-source standard for connecting language models to external systems. Instead of building one-off integrations for every data source, MCP provides a shared protocol that lets models communicate with tools, APIs, and data. This article takes a closer look at what MCP is, why it matters, and how it changes the way we connect language models to real-world systems. Here’s what we’ll cover: The core problem MCP is designed to solve An overview of MCP’s architecture The three core primitives: tools, prompts, and resources How the protocol flow works in practice When to use MCP (and when not to) By the end, you’ll have a solid understanding of how MCP fits into the modern AI stack and how to decide if it’s right for your projects. The Problem That Model Context Protocol Solves Before MCP, integrating AI into enterprise systems was messy and inefficient because tying language models to real systems quickly runs into a scalability problem. Each new model and each new data source need custom integration code — connectors, adapters, and API bridges — that don’t generalize.
If you have M models and N data sources, you end up maintaining M × N unique integrations. Every new model or data source multiplies the complexity, adding more maintenance overhead. MCP solves this by introducing a shared standard for communication between models and external resources. Instead of each model integrating directly with every data source, both models and resources speak a common protocol. This turns an M × N problem into an M + N one. Each model implements MCP once, each resource implements MCP once, and everything can interoperate smoothly. From M × N integrations to M + N with MCP (Image by Author) In short, MCP decouples language models from the specifics of external integrations. In doing so, it enables scalable, maintainable, and reusable connections that link AI systems to real-world data and functionality. Understanding MCP’s Architecture MCP implements a client-server architecture with specific terminology that’s important to understand. The Three Key Components MCP Hosts are applications that want to use MCP capabilities. These are typically LLM applications like Claude Desktop, IDEs with AI features, or custom applications you’ve built. Hosts contain or interface with language models and initiate connections to MCP servers. MCP Clients are the protocol clients created and managed by the host application. When a host wants to connect to an MCP server, it creates a client instance to handle that specific connection. A single host application can maintain multiple clients, each connecting to different servers. The client handles the protocol-level communication, managing requests and responses according to the MCP specification. MCP Servers expose specific capabilities to clients: database access, filesystem operations, API integrations, or computational tools. Servers implement the server side of the protocol, responding to client requests and providing resources, tools, and prompts.
MCP Architecture (Image by Author) This architecture provides a clean separation of concerns: Hosts focus on orchestrating AI workflows without concerning themselves with data source specifics Servers expose capabilities without knowing how models will use them The protocol handles communication details transparently A single host can connect to multiple servers simultaneously through separate clients. For example, an AI assistant might maintain connections to filesystem, database, GitHub, and Slack servers concurrently. The host presents the model with a unified capability set, abstracting away whether data comes from local files or remote APIs. Communication Protocol MCP uses JSON-RPC 2.0 for message exchange. This lightweight remote procedure call protocol provides a structured request/response format and is simple to inspect and debug. MCP supports two transport mechanisms: stdio (Standard Input/Output): For local server processes running on the same machine. The host spawns the server process and communicates through its standard streams. HTTP: For networked communication. Uses HTTP POST for requests and, optionally, Server-Sent Events for streaming. This flexibility lets MCP servers run locally or remotely while keeping communication consistent. The Three Core Primitives MCP relies on three core primitives that servers expose. They provide enough structure to enable complex interactions without limiting flexibility. Resources Resources represent any data a model can read. This includes file contents, database records, API responses, live sensor data, or cached computations. Each resource uses a URI scheme, which makes it easy to identify and access different types of data. Here are some examples: Filesystem: file:///home/user/projects/api/README.md Database: postgres://localhost/customers/table/users Weather API: weather://current/san-francisco The URI scheme identifies the resource type. The rest of the path points to the specific data.
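Concretely, reading a resource is a JSON-RPC 2.0 request/response pair. The method name resources/read follows the MCP specification, while the file URI and the response payload below are illustrative assumptions:

```python
import json

# A JSON-RPC 2.0 request a client sends to read one resource.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "resources/read",
    "params": {"uri": "file:///home/user/projects/api/README.md"},
}

# An illustrative server response carrying the resource contents.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "contents": [{
            "uri": "file:///home/user/projects/api/README.md",
            "mimeType": "text/markdown",
            "text": "# API Project\n...",
        }]
    },
}

print(json.dumps(request, indent=2))
```

The matching id field ties the response to its request, which is what lets a client keep many calls in flight over a single connection, and the mimeType metadata tells the host how to process the content.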
Resources can be static, such as files with fixed URIs, or dynamic, like the latest entries in a continuously updating log. Servers list available resources through the resources/list endpoint, and hosts retrieve them via resources/read. Each resource includes metadata, such as MIME type, which helps hosts handle content correctly — text/markdown is processed differently than application/json — and descriptions provide context that helps both users and models understand the resource. Prompts Prompts provide reusable templates for common tasks. They encode expert knowledge and simplify complex instructions. For example, a database MCP server can offer prompts like analyze-schema, debug-slow-query, or generate-migration. Each prompt includes the context necessary for the task. Prompts accept arguments. An analyze-table prompt can take a table name and include schema details, indexes, foreign key relationships, and recent
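A parameterized prompt like the one described above maps to a prompts/get call. The method name follows the MCP specification, while the prompt name analyze-table and its argument reuse the hypothetical example from the text:

```python
import json

# JSON-RPC 2.0 request for a parameterized prompt from an MCP server.
# "prompts/get" is the spec-defined method; the prompt name and argument
# mirror the hypothetical analyze-table example above.
request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "prompts/get",
    "params": {
        "name": "analyze-table",
        "arguments": {"table": "users"},
    },
}

print(json.dumps(request, indent=2))
```

The server resolves the template with the given arguments and returns fully expanded messages, so the host never needs to know how the prompt assembles schema details or other context.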
India’s IT Sector Expected To Reach $400 Billion By 2030 Amidst AI-Related Disruptions
New Delhi: India’s information technology (IT) sector is projected to reach $400 billion by 2030, led by firms delivering domain-specific automation that outperforms traditional service models on speed, quality, and cost, a report said on Tuesday. The country’s strong talent pool, global client trust, and cost efficiency will enable it to leverage the increased global demand for AI-driven solutions, a report by venture firm Bessemer Venture Partners indicated. AI is automating tasks previously performed by humans and disrupting the billable-hour model that supports traditional Indian IT services, which makes deep strategic pivots crucial to stay competitive, the report noted. The venture firm mentioned that agile, AI-native challengers are adapting more quickly to such changes than incumbent companies. Three types of fast-moving AI-first challengers that will disrupt existing models are AI-enabled services, services built for AI, and pure software-led platforms, the report said. The venture firm forecast that India’s IT services industry will grow with margins intact despite challenges from AI-related disruptions. It noted that three years after the launch of ChatGPT, India’s IT revenues continue to climb, and margins remain surprisingly resilient because uptake of general-purpose large language models is concentrated in only two sectors: technology and media or advertising. Incumbent IT firms continue to play a crucial role in solving complex business problems that are nuanced rather than providing one-size-fits-all SaaS deployments. The strong balance sheets of these companies further strengthen client confidence, Bessemer Venture Partners said. Fortune 500 companies still trust that IT services vendors can manage multi-year projects, absorb macro shocks, and deliver consistent execution, the report said.
The market capitalisation of India’s top ten IT firms has more than doubled from $166 billion to $354 billion in the past decade, driven by annual revenue growth exceeding 7 per cent.