Injecting Domain Knowledge
Injecting domain knowledge is a critical step in turning a general-purpose LLM into a powerful, expert-level agent. The method chosen for this task depends on the scale of the knowledge, how frequently it changes, and the desired behavior of the agent. There are three primary techniques: In-Context Learning, Retrieval-Augmented Generation (RAG), and Fine-Tuning.
In-Context Learning
The most straightforward way to inject domain knowledge is to place the relevant information directly within the LLM's context window. This method leverages the model's in-context learning ability, allowing it to use the provided text as a direct source of truth when generating a response. Because the user can insert any piece of information on the fly without additional infrastructure, in-context learning is highly agile and especially effective when the knowledge is limited in scope, such as a set of trading rules for the stock market or the clinical summary for a patient. Below we provide an example from a medical care scenario.
Example of Injecting Domain Knowledge through In-Context Learning
Task: Triage a patient's chief complaint to recommend an appropriate care setting.
Prompt: You are an experienced triage nurse. Use the following medical knowledge: chest pain with exertional onset, radiating to the jaw/arm, and associated with diaphoresis (sweating) is a classic presentation of cardiac ischemia and is an emergency requiring immediate ER evaluation. Non-radiating, pleuritic pain (sharp, worse on inspiration) is more suggestive of musculoskeletal or pulmonary issues like costochondritis or pneumonia, which may be handled by urgent care. Now triage this complaint: "I have a crushing feeling in my chest that started after I mowed the lawn, and I'm sweating but feel cold."
Output: This is a high-risk presentation. The exertional onset (mowing the lawn), crushing quality, and associated diaphoresis (sweating) are hallmark red flags for acute coronary syndrome (e.g., a heart attack). This patient requires immediate emergency room evaluation. Do not delay.
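To make this concrete, below is a minimal Python sketch of how such a prompt could be assembled and sent to a model, with the domain knowledge pasted directly into the system message. It assumes the OpenAI Python client and an illustrative model name; any chat-completion API would work the same way.

```python
# Minimal in-context knowledge injection: the domain knowledge is simply
# concatenated into the prompt, so no extra infrastructure is required.
from openai import OpenAI

DOMAIN_KNOWLEDGE = (
    "Chest pain with exertional onset, radiating to the jaw/arm, and associated "
    "with diaphoresis (sweating) is a classic presentation of cardiac ischemia "
    "and is an emergency requiring immediate ER evaluation. Non-radiating, "
    "pleuritic pain (sharp, worse on inspiration) is more suggestive of "
    "musculoskeletal or pulmonary issues like costochondritis or pneumonia, "
    "which may be handled by urgent care."
)

def triage(complaint: str) -> str:
    """Triage a chief complaint using knowledge placed directly in the context."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are an experienced triage nurse. "
                        f"Use the following medical knowledge: {DOMAIN_KNOWLEDGE}"},
            {"role": "user",
             "content": f'Now triage this complaint: "{complaint}"'},
        ],
    )
    return response.choices[0].message.content

print(triage("I have a crushing feeling in my chest that started after I mowed "
             "the lawn, and I'm sweating but feel cold."))
```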
However, this method is not well suited to extensive knowledge bases. Directly including large volumes of text in every prompt can be expensive (API costs are usually calculated from the number of input and output tokens), especially in multi-turn dialogues. Moreover, an overly long context can "crowd out" the space needed for the user's query, the conversation history, and the model's generated output, potentially leading to truncated or less coherent responses. We discuss how to handle long context windows in the section Managing Long Context.
Retrieval-Augmented Generation
When dealing with a large corpus of documents, RAG offers a more scalable and efficient solution. Instead of attempting to fit all possible information into the context window, RAG operates on a "just-in-time" basis, retrieving only the most relevant snippets of knowledge in response to the user's query. In the medical scenario, for example, this allows an agent to pull specific symptom descriptions for cardiac ischemia from a vast medical database and provide precise, expert-level advice. Standard RAG often treats knowledge chunks as independent, parallel pieces of information, but the user can also structure the knowledge base to suit their needs. For example, by incorporating a knowledge graph, the system can represent the intricate relationships between different concepts and entities, and thus fetch not only directly relevant text but also interconnected information. Finally, RAG is not limited to a static set of documents: the retrieval mechanism can be extended to pull information from a wide variety of sources, including structured databases (e.g., via text-to-SQL queries) or live web searches, ensuring the agent has access to the most current information available.
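The sketch below illustrates the core retrieve-then-generate loop under simple assumptions: a handful of hand-written knowledge chunks, the sentence-transformers library for embeddings, and cosine similarity for ranking. The chunk texts, model name, and top-k value are illustrative; a production system would replace the in-memory list with a vector database.

```python
# Minimal RAG sketch: embed knowledge chunks once, retrieve the chunks closest
# to the query, and splice only those into the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

knowledge_chunks = [
    "Exertional chest pain radiating to the jaw or arm with diaphoresis is a "
    "classic presentation of cardiac ischemia and warrants immediate ER evaluation.",
    "Sharp, pleuritic chest pain that worsens on inspiration suggests "
    "musculoskeletal or pulmonary causes such as costochondritis or pneumonia.",
    "Sudden unilateral leg swelling with calf tenderness raises concern for "
    "deep vein thrombosis.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
chunk_embeddings = model.encode(knowledge_chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (cosine similarity)."""
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_embeddings @ query_embedding
    top_indices = np.argsort(scores)[::-1][:k]
    return [knowledge_chunks[i] for i in top_indices]

query = "I have a crushing feeling in my chest that started after I mowed the lawn."
context = "\n".join(retrieve(query))
prompt = (
    "You are an experienced triage nurse. Use the following medical knowledge:\n"
    f"{context}\n\nNow triage this complaint: \"{query}\""
)
print(prompt)  # pass this prompt to the LLM as in the previous example
```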
Fine-Tuning
A third approach is to embed domain knowledge directly into the LLM's parameter weights through fine-tuning. This process further trains the pre-trained model on a custom dataset specific to the target domain. Full-parameter fine-tuning modifies all of the model's weights, which offers the highest potential performance but is extremely resource-intensive. In contrast, methods such as LoRA (Low-Rank Adaptation) or adapters freeze most of the model's parameters and train only a small set of new ones, an approach known as Parameter-Efficient Fine-Tuning (PEFT). This dramatically reduces computational costs and makes fine-tuning more accessible. Fine-tuning does more than teach the model facts; it changes the model's behavior and makes the knowledge an intrinsic part of its reasoning process. Moreover, since the knowledge is stored in the model's weights, the user does not need to include large amounts of text in the prompt at inference time. This leads to faster response times and lower operational costs, which is especially beneficial for high-volume applications.
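As a rough illustration, the sketch below attaches a LoRA adapter to a causal language model using the Hugging Face transformers and peft libraries. The base model name, target modules, and hyperparameters are placeholder choices rather than recommendations, and the dataset preparation and training loop are omitted.

```python
# PEFT sketch with LoRA: the base model's weights stay frozen while small
# low-rank update matrices are trained on the domain-specific data.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-3.1-8B"  # placeholder base model

tokenizer = AutoTokenizer.from_pretrained(base_model_name)  # used later to tokenize the training set
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA attaches trainable low-rank matrices to the chosen attention projections.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # illustrative choice of modules
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full model

# From here, train as usual (e.g., with transformers.Trainer) on a
# domain-specific dataset; only the LoRA adapter weights are updated.
```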
If you find this work helpful, please consider citing our paper:
@article{hu2025hands,
  title={Hands-on LLM-based Agents: A Tutorial for General Audiences},
  author={Hu, Shuyue and Ren, Siyue and Chen, Yang and Mu, Chunjiang and Liu, Jinyi and Cui, Zhiyao and Zhang, Yiqun and Li, Hao and Zhou, Dongzhan and Xu, Jia and others},
  journal={Hands-on},
  volume={21},
  pages={6},
  year={2025}
}