In this post, we will cover the basics of LangChain and guide you through its core components.
This post is based on Greg Kamradt’s LangChain Cookbook and the video walkthrough that accompanies it.
What is LangChain?
LangChain is a framework for developing applications powered by language models. It provides the building blocks (schemas, models, prompts, indexes, memory, chains, and agents) that make language models easier to integrate into your own software.
Schemas
The first major component of LangChain is Schemas.
Text
Text schemas represent natural language input that you provide to a language model.
# You'll be working with simple strings (that'll soon grow in complexity!)
my_text = "What day comes after Friday?"
Chat Messages
Chat messages are typed messages (system, human, and AI) that provide the conversational context a chat model uses to generate its response.
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage, AIMessage
chat = ChatOpenAI(temperature=0.7, openai_api_key="YourAPIKey")
chat([
    SystemMessage(content="You are a nice AI bot that helps a user figure out what to eat in one short sentence"),
    HumanMessage(content="I like tomatoes, what should I eat?"),
])
Output:
AIMessage(content='You could try making a tomato salad with fresh basil and mozzarella cheese.', additional_kwargs={})
Documents
Documents represent pieces of text along with associated metadata.
from langchain.schema import Document
Document(
page_content="This is my document. It is full of text that I've gathered from other places",
metadata={"my_document_id": 234234, "my_document_source": "The LangChain Papers", "my_document_create_time": 1680013019},
)
Output:
Document(page_content="This is my document. It is full of text that I've gathered from other places", lookup_str='', metadata={'my_document_id': 234234, 'my_document_source': 'The LangChain Papers', 'my_document_create_time': 1680013019}, lookup_index=0)
Models
The next major component of LangChain is Models. We will examine three types of models: Language Models, Chat Models, and Text Embedding Models.
Language Model
Language models take text as input and generate text as output.
from langchain.llms import OpenAI
llm = OpenAI(model_name="text-ada-001", openai_api_key="YourAPIKey")
llm("What day comes after Friday?")
The model would generate the output “Saturday.”
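LLM wrappers can also take several prompts in one call. Here is a minimal sketch of a batched request using the generate method (the second prompt is made up for illustration, and the exact fields on the result object can vary across LangChain versions):
# Send a batch of prompts in one call; returns an LLMResult
result = llm.generate(["What day comes after Friday?", "What day comes after Monday?"])
# Each prompt gets its own list of generations
print(result.generations[0][0].text)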
Chat Model
Chat models interact with chat messages and can be more creative and dynamic based on their configuration.
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage, AIMessage
chat = ChatOpenAI(temperature=1, openai_api_key="YourAPIKey")
chat([
    SystemMessage(content="You are an unhelpful AI bot that makes a joke at whatever the user says"),
    HumanMessage(content="I would like to go to New York, how should I do this?"),
])
In this example, the model responds humorously and unhelpfully to the user’s query about traveling to New York, as instructed by the system message.
AIMessage(content="You could try walking, but I don't recommend it unless you have a lot of time on your hands. Maybe try flapping your arms really hard and see if you can fly there?", additional_kwargs={})
Text Embeddings
Text embeddings convert text into numerical representations called embeddings. These embeddings can be used to compare and analyze text more efficiently.
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(openai_api_key="YourAPIKey")
text = "Hi! It's time for the beach"
text_embedding = embeddings.embed_query(text)
print(f"Your embedding is length {len(text_embedding)}")
print(f"Here's a sample: {text_embedding[:5]}...")
The generated embeddings are a one-dimensional array (or a list) of numbers that semantically represent the text’s meaning.
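To see how embeddings enable comparison, here is a minimal sketch (it assumes numpy, which the original example does not use) that computes the cosine similarity between two embedded sentences; values closer to 1 indicate more similar meaning:
import numpy as np

beach = embeddings.embed_query("Hi! It's time for the beach")
report = embeddings.embed_query("The quarterly report is due on Monday")
# Cosine similarity: dot product divided by the product of the vector norms
similarity = np.dot(beach, report) / (np.linalg.norm(beach) * np.linalg.norm(report))
print(f"Cosine similarity: {similarity:.3f}")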
Prompts
The next major component of LangChain is Prompts. Prompts are the text that you send to your language model.
Basic Prompt
Here is an example of a basic prompt:
from langchain.llms import OpenAI
llm = OpenAI(model_name="text-davinci-003", openai_api_key="YourAPIKey")
# I like to use three double quotation marks for my prompts because it's easier to read
prompt = """
Today is Monday, tomorrow is Wednesday.

What is wrong with that statement?
"""
llm(prompt)
In this example, the AI would respond with: “The statement is incorrect. Tomorrow’s Tuesday, not Wednesday.”
Prompt Templates
Prompt templates are useful when you need to dynamically generate prompts based on different inputs.
from langchain.llms import OpenAI
from langchain import PromptTemplate
llm = OpenAI(model_name="text-davinci-003", openai_api_key="YourAPIKey")
# Notice "location" below, that is a placeholder for another value later
template = """
I really want to travel to {location}. What should I do there?

Respond in one short sentence
"""
prompt = PromptTemplate(
input_variables=["location"],
template=template,
)
final_prompt = prompt.format(location="Rome")
print(f"Final Prompt: {final_prompt}")
print("-----------")
print(f"LLM Output: {llm(final_prompt)}")
In this example, the final prompt is generated by replacing the {location} placeholder with “Rome,” and the AI responds with a short suggestion:
Final Prompt:
I really want to travel to Rome. What should I do there?
Respond in one short sentence
-----------
LLM Output:
Visit the Colosseum, the Pantheon, and the Trevi Fountain for a taste of Rome's ancient and modern culture.
Example Selectors
Example selectors choose which few-shot examples to include in a prompt based on the user’s input, so the model sees only the most relevant demonstrations.
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.llms import OpenAI
llm = OpenAI(model_name="text-davinci-003", openai_api_key="YourAPIKey")
example_prompt = PromptTemplate(
input_variables=["input", "output"],
template="Example Input: {input}\nExample Output: {output}",
)
# Examples of locations where nouns are found
examples = [
{"input": "pirate", "output": "ship"},
{"input": "pilot", "output": "plane"},
{"input": "driver", "output": "car"},
{"input": "tree", "output": "ground"},
{"input": "bird", "output": "nest"},
]
# SemanticSimilarityExampleSelector will select examples that are similar to your input by semantic meaning
example_selector = SemanticSimilarityExampleSelector.from_examples(
# This is the list of examples available to select from.
examples,
# This is the embedding class used to produce embeddings which are used to measure semantic similarity.
OpenAIEmbeddings(openai_api_key="YourAPIKey"),
# This is the VectorStore class that is used to store the embeddings and do a similarity search over.
FAISS,
# This is the number of examples to produce.
k=2,
)
similar_prompt = FewShotPromptTemplate(
# The object that will help select examples
example_selector=example_selector,
# Your prompt
example_prompt=example_prompt,
# Customizations that will be added to the top and bottom of your prompt
prefix="Give the location an item is usually found in",
suffix="Input: {noun}\nOutput:",
# What inputs your prompt will receive
input_variables=["noun"],
)
For instance, given the noun “student,” the example selector might find “driver” and “pilot” to be the most similar examples.
# Select a noun!
my_noun = "student"
print(similar_prompt.format(noun=my_noun))
Output:
Give the location an item is usually found in
Example Input: driver
Example Output: car
Example Input: pilot
Example Output: plane
Input: student
Output:
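To complete the loop, you can pass the formatted few-shot prompt to the llm defined earlier in this section (a minimal usage sketch; the exact completion will vary from run to run):
# The model should continue the pattern set by the selected examples
print(llm(similar_prompt.format(noun=my_noun)))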
Output Parsers
Output parsers enable you to obtain structured output, such as JSON objects, from the language model’s responses.
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.llms import OpenAI
llm = OpenAI(model_name="text-davinci-003", openai_api_key="YourAPIKey")
# How you would like your response structured. This is basically a fancy prompt template
response_schemas = [
ResponseSchema(name="bad_string", description="This a poorly formatted user input string"),
ResponseSchema(name="good_string", description="This is your response, a reformatted response"),
]
# How you would like to parse your output
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
After defining the response schema, create an output parser to read the schema and parse it. Next, generate the formatting instructions with the output parser’s get_format_instructions() method.
# See the prompt template you created for formatting
format_instructions = output_parser.get_format_instructions()
print(format_instructions)
Output:
json
{
  "bad_string": string  // This is a poorly formatted user input string
  "good_string": string  // This is your response, a reformatted response
}
Now, create a prompt template with placeholder variables for the formatting instructions and user input. This template will also include a section for the AI-generated response.
Once you’ve prepared the prompt template, send it to the language model and obtain the response. You can then parse the response using the output parser to get a structured JSON object (or a dictionary in Python).
template = """
You will be given a poorly formatted string from a user.
Reformat it and make sure all the words are spelled correctly

{format_instructions}

% USER INPUT:
{user_input}

YOUR RESPONSE:
"""
prompt = PromptTemplate(
    input_variables=["user_input"],
    partial_variables={"format_instructions": format_instructions},
    template=template,
)
promptValue = prompt.format(user_input="welcom to califonya!")
# Send the formatted prompt to the model
llm_output = llm(promptValue)
print(llm_output)
Output:
json\n{\n\t"bad_string": "welcom to califonya!",\n\t"good_string": "Welcome to California!"\n}\n
Finally, output_parser.parse(llm_output) converts this raw string into a Python dictionary: {'bad_string': 'welcom to califonya!', 'good_string': 'Welcome to California!'}.
Indexes
The next major component of LangChain is Indexes. Indexes structure documents for language models.
Document Loaders
Document loaders enable you to load data from various sources in a structured format.
from langchain.document_loaders import HNLoader
loader = HNLoader("https://news.ycombinator.com/item?id=34422627")
data = loader.load()
print(f"Found {len(data)} comments")
print(f"Here's a sample:\n\n{''.join([x.page_content[:150] for x in data[:2]])}")
By using document loaders, you can quickly and easily load data from various sources and make it available for use in your language models.
Output:
Found 76 comments
Here's a sample:
dang 69 days ago
| next [–]
Related ongoing thread:GPT-3.5 and Wolfram Alpha via LangChain - https://news.ycombinator.com/item?id=344Ozzie_osman 69 days ago
| prev | next [–]
LangChain is awesome. For people not sure what it's doing, large language models (LLMs) are
Text Splitters
Text splitters split a document into smaller chunks, allowing the model to process the content more effectively.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# This is a long document we can split up.
with open("data/PaulGrahamEssays/worked.txt") as f:
pg_work = f.read()
print(f"You have {len([pg_work])} document")
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size=150,
chunk_overlap=20,
)
texts = text_splitter.create_documents([pg_work])
print(f"You have {len(texts)} documents")
print("Preview:")
print(texts[0].page_content, "\n")
print(texts[1].page_content)
Output:
Preview:
February 2021Before college the two main things I worked on, outside of school,
were writing and programming. I didn't write essays. I wrote what
beginning writers were supposed to write then, and probably still
are: short stories. My stories were awful. They had hardly any plot,
Retrievers
Retrievers fetch the documents most relevant to a query so they can be combined with a language model.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
loader = TextLoader("data/PaulGrahamEssays/worked.txt")
documents = loader.load()
# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
# Split your docs into texts
texts = text_splitter.split_documents(documents)
# Get embedding engine ready
embeddings = OpenAIEmbeddings(openai_api_key="YourAPIKey")
# Embed your texts
db = FAISS.from_documents(texts, embeddings)
# Init your retriever
retriever = db.as_retriever()
By initializing the retriever with a document store, you can easily find relevant documents based on your query. In the example below, we search for documents related to building things.
Output:
VectorStoreRetriever(vectorstore=<langchain.vectorstores.faiss.FAISS object at 0x7fb81007a9d0>, search_type='similarity', search_kwargs={})
The retriever will convert the query into a vector and compare it to the vectors in the document store. It will then return the most similar documents.
docs = retriever.get_relevant_documents("what types of things did the author want to build?")
print("\n\n".join([x.page_content[:200] for x in docs[:2]]))
Output:
standards; what was the point? No one else wanted one either, so
off they went. That was what happened to systems work.I wanted not just to build things, but to build things that would
last.In this di
much of it in grad school.Computer Science is an uneasy alliance between two halves, theory
and systems. The theory people prove things, and the systems people
build things. I wanted to build things.
VectorStores
VectorStores store embeddings, which are numerical representations of the semantic meaning of documents, and let you search through them. Two of the main players in the vector storage space are Pinecone and Weaviate; LangChain’s documentation lists many other supported options.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
loader = TextLoader("data/PaulGrahamEssays/worked.txt")
documents = loader.load()
# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
# Split your docs into texts
texts = text_splitter.split_documents(documents)
# Get embedding engine ready
embeddings = OpenAIEmbeddings(openai_api_key="YourAPIKey")
print(f"You have {len(texts)} documents")
In this example, we create embeddings based on the split documents. The number of embeddings should match the number of documents created.
You have 78 documents
The VectorStore will store these embeddings and make them easily searchable. It acts as a database for storing the semantic meaning of your documents, allowing for quick and efficient searching based on semantic similarity.
embedding_list = embeddings.embed_documents([text.page_content for text in texts])
print(f"You have {len(embedding_list)} embeddings")
print(f"Here's a sample of one: {embedding_list[0][:3]}...")
Output:
You have 78 embeddings
Here's a sample of one: [-0.0011257503647357225, -0.01111479103565216, -0.012860921211540699]...
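In practice, you rarely handle the raw embedding list yourself. Instead, you load the texts into a vector store and query it directly. Here is a minimal sketch using the FAISS class already imported above (the query string is an arbitrary example):
# Build the store from the split documents and run a semantic search over it
db = FAISS.from_documents(texts, embeddings)
results = db.similarity_search("What did the author want to build?", k=2)
print(results[0].page_content[:200])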
Memory
The next major component of LangChain is Memory. Memory helps language models remember prior interactions and provide more context-aware responses.
Chat Message History
from langchain.memory import ChatMessageHistory
from langchain.chat_models import ChatOpenAI
chat = ChatOpenAI(temperature=0, openai_api_key="YourAPIKey")
history = ChatMessageHistory()
history.add_ai_message("hi!")
history.add_user_message("what is the capital of france?")
After adding messages to the history, you can pass this history to the language model to generate context-aware responses:
ai_response = chat(history.messages)
Output:
AIMessage(content='The capital of France is Paris.', additional_kwargs={})
You can then add the AI-generated response to the history:
history.add_ai_message(ai_response.content)
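If you want LangChain to handle this bookkeeping for you, it also offers higher-level memory classes. Here is a minimal sketch with ConversationChain and ConversationBufferMemory, which go beyond the cookbook excerpt above:
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.llms import OpenAI

llm = OpenAI(temperature=0, openai_api_key="YourAPIKey")
# The memory object records each turn and injects the transcript into the next prompt
conversation = ConversationChain(llm=llm, memory=ConversationBufferMemory())
conversation.run("What is the capital of France?")
conversation.run("What is its population?")  # "its" resolves through the stored history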
Chains
The next major component of LangChain is Chains. Chains allow you to combine different language model (LLM) calls and actions automatically.
Simple Sequential Chain
A Simple Sequential Chain breaks a job into a sequence of smaller steps, which helps keep the language model from getting distracted, confused, or hallucinating when asked to perform too many tasks at once.
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains import SimpleSequentialChain
llm = OpenAI(temperature=1, openai_api_key="YourAPIKey")
template = """Your job is to come up with a classic dish from the area that the users suggests. % USER LOCATION {user_location} YOUR RESPONSE: """
prompt_template = PromptTemplate(input_variables=["user_location"], template=template)
# Holds my 'location' chain
location_chain = LLMChain(llm=llm, prompt=prompt_template)
template = """Given a meal, give a short and simple recipe on how to make that dish at home. % MEAL {user_meal} YOUR RESPONSE: """
prompt_template = PromptTemplate(input_variables=["user_meal"], template=template)
# Holds my 'meal' chain
meal_chain = LLMChain(llm=llm, prompt=prompt_template)
overall_chain = SimpleSequentialChain(chains=[location_chain, meal_chain], verbose=True)
review = overall_chain.run("Rome")
In this example, the chain first receives the user location (Rome) and outputs a classic dish from Rome. Then, it provides a simple recipe for that dish. The verbose=True parameter makes the chain print statements as it runs, which helps with debugging and following its progress.
> Entering new SimpleSequentialChain chain...
A classic dish from Rome is Spaghetti alla Carbonara, a pasta dish made with egg, cheese, guanciale (cured pork cheek), and black pepper.
Ingredients:
-1/2 lb. spaghetti
-4 oz. guanciale, diced
-2 cloves garlic, minced
-2 eggs
-2/3 cup Parmigiano Reggiano cheese, divided
-1/4 tsp. freshly cracked black pepper
-1/4 cup reserved pasta water
-2 tablespoons olive oil
-Parsley for garnish (optional)
Instructions:
1. Boil spaghetti in a large pot of salted boiling water until al dente, about 8 minutes. Reserve 1/4 cup of cooking water and drain the spaghetti.
2. In a large skillet, heat oil over medium-high heat, then add guanciale and sauté until lightly brown, about 5 minutes.
3. Add garlic and sauté for an additional 1-2 minutes.
4. In a medium bowl, whisk together eggs and 1/3 cup Parmigiano Reggiano cheese.
5. Add cooked spaghetti to the large skillet, toss to combine, then reduce the heat to medium-low.
6. Pour in the egg and cheese mixture, then add pepper and reserved pasta water.
7. Toss pasta
> Finished chain.
Summarization Chain
The Summarization Chain breaks the text into smaller chunks, summarizes each chunk, and then creates a final summary from the individual summaries.
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
# Reuse an LLM like the one from the previous section
llm = OpenAI(temperature=1, openai_api_key="YourAPIKey")
loader = TextLoader("data/PaulGrahamEssays/disc.txt")
documents = loader.load()
# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=50)
# Split your docs into texts
texts = text_splitter.split_documents(documents)
# There is a lot of complexity hidden in this one line. I encourage you to check out the video above for more detail
chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)
chain.run(texts)
In this example, the chain first splits the essay into chunks of 700 characters. It then generates summaries for each chunk and creates a final concise summary based on these individual summaries.
> Entering new MapReduceDocumentsChain chain...
Prompt after formatting:
Write a concise summary of the following:
"January 2017Because biographies of famous scientists tend to
edit out their mistakes, we underestimate the
degree of risk they were willing to take.
And because anything a famous scientist did that
wasn't a mistake has probably now become the
conventional wisdom, those choices don't
seem risky either.Biographies of Newton, for example, understandably focus
more on physics than alchemy or theology.
The impression we get is that his unerring judgment
led him straight to truths no one else had noticed.
How to explain all the time he spent on alchemy
and theology? Well, smart people are often kind of
crazy.But maybe there is a simpler explanation. Maybe"
CONCISE SUMMARY:
Prompt after formatting:
Write a concise summary of the following:
"the smartness and the craziness were not as separate
as we think. Physics seems to us a promising thing
to work on, and alchemy and theology obvious wastes
of time. But that's because we know how things
turned out. In Newton's day the three problems
seemed roughly equally promising. No one knew yet
what the payoff would be for inventing what we
now call physics; if they had, more people would
have been working on it. And alchemy and theology
were still then in the category Marc Andreessen would
describe as "huge, if true."Newton made three bets. One of them worked. But
they were all risky."
CONCISE SUMMARY:
> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:
" Biographies of famous scientists often omit the risks they took and the mistakes they made during their lifetime. This gives us an impression that these scientists had a perfect judgement, when in fact they made unwise decisions like Newton's dabblings in alchemy and theology. Perhaps these scientists were just taking risks and making mistakes like anyone else.
This passage discusses how, in the time of Sir Isaac Newton, the three areas of study – physics, alchemy, and theology – were all considered equally valuable and worthy of exploration. Newton's success in the area of physics has since made the others seem like a waste of time, however at the point of Newton's exploration, all three were seen as high-risk but high-reward propositions."
CONCISE SUMMARY:
> Finished chain.
> Finished chain.
" Biographies of famous scientists often omit the risks they took and mistakes they made, creating an impression of perfect judgement. Sir Isaac Newton's exploration of physics, alchemy, and theology was seen as all high-risk but high-reward propositions at the time, and should not be overlooked."
Agents
The last major component of LangChain is Agents. Agents enable the language model to dynamically decide which tools to use in order to best respond to a given query.
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.llms import OpenAI
import json
llm = OpenAI(temperature=0, openai_api_key="YourAPIKey")
serpapi_api_key = "..."
toolkit = load_tools(["serpapi"], llm=llm, serpapi_api_key=serpapi_api_key)
agent = initialize_agent(toolkit, llm, agent="zero-shot-react-description", verbose=True, return_intermediate_steps=True)
response = agent({"input": "what was the first album of the band that Natalie Bergman is a part of?"})
The final response provides a clear and accurate answer to the query.
> Entering new AgentExecutor chain...
I should try to find out what band Natalie Bergman is a part of.
Action: Search
Action Input: "Natalie Bergman band"
Observation: Natalie Bergman is an American singer-songwriter. She is one half of the duo Wild Belle, along with her brother Elliot Bergman. Her debut solo album, Mercy, was released on Third Man Records on May 7, 2021. She is based in Los Angeles.
Thought: I should search for the debut album of Wild Belle.
Action: Search
Action Input: "Wild Belle debut album"
Observation: Isles
Thought: I now know the final answer.
Final Answer: Isles is the debut album of Wild Belle, the band that Natalie Bergman is a part of.
> Finished chain.
Summary
Now you understand the core components of LangChain. LangChain simplifies the process of integrating AI models into applications, making it easier for developers to build powerful applications with language models like ChatGPT.