LangChain CSV Agents and Pandas for Data Querying [Tutorial]

Introduction

Let us explore the simplest way to interact with your CSV files and retrieve the necessary information with CSV Agents of LangChain. By passing data from CSV files to large foundational models like GPT-3, we may quickly understand the data using straight Questions to the language model.

This blog will assist you to start utilizing Langchain agents to work with CSV files. LangChain provides a dedicated CSV Agent which is optimized for Q&A tasks. Remember this data is safe. OpenAI won’t track the data passed through API requests or use this data to train their model.

Important Components of LangChain

Chains in LangChain

Chains are fundamental building blocks LangChain Library. A chain is a pipeline that processes an input by using a specific connection of primitives(These primitives can be LargeLanguageModels(LLMs) or other entities take functions as an example). Assume chains are assembling other components of langChain in particular ways to accomplish specific use cases.  

Agents in Langchain

An Agent is a system that decides what action is taken by LLM, it observes and repeats until it reaches the correct answer. It allows us to combine LLMs capabilities with external sources of computation(tools) or knowledge. To use agents, we require three things: 1) A base LLM, 2)A tool to take on action, and 3) An agent to control and initiate actions and interactions.

Prompt Templates

A prompt contains a text string (“the template”) that is passed to the model, which can take in a set of parameters from the end user and generates a prompt.

A prompt template can contain:

  • Instructions to the language model
  • A set of a few shot examples to help the language model generate a better response,
  • A question about the language model.

Good prompts are essential for effective task implementation in the new age of LLMs. A prompt may contain instructions, user input or query, and external information for LLM which help it perform better. The LangChain library recognizes the power of prompts and has built an entire set of objects for them.

Tools and Toolkits

Tools are functions that agents can use to interact with the world. These tools can be generic utilities (e.g. search), other chains, or even other agents. There are pre-built tools for popular tasks and you can always build custom tools for your use case. Toolkits are collections of tools that are designed to be used together for specific tasks and have convenience loading methods. To dive deep, have a read of Custom tools in LangChain.

Agent Types

Agents use an LLM to determine which actions to take and in what order. An action can either be using a tool and observing its output, or returning a response to the user. Each agent type has a specific set of tools and is suitable for performing a specific set of tasks. Check out various types of agents in LangChain.

Implementation of CSV Agents

CSV Agent of LangChain uses CSV (Comma-Separated Values) format, which is a simple file format for storing tabular data. It can read and write data from CSV files and perform primary operations on the data.

This agent is more focused on working with CSV files specifically. This agent calls the Pandas DataFrame agent under the hood, which in turn calls the Python agent, which executes LLM-generated Python code.

The pandas_dataframe_agent and create_csv_agent are both agents that can interact with data frames, but their underlying purpose is different.

The pandas_agent

The pandas_dataframe_agent uses the Pandas library, which is a powerful data manipulation and analysis library in Python. It allows you to perform various operations on data frames, such as filtering, sorting, and calculating statistics. This agent specifically optimizes working with data frames. It can handle complex data analysis tasks.

We will use the Titanic dataset, a well-known dataset perfect for testing implementation. We form train.csv and test.csv from splitting the Titanic dataset.

Import all the required modules from packages:

import os
import pandas as pd
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_csv_agent
from langchain.agents.agent_types import AgentType
import pandas as pd
os.environ["OPENAI_API_KEY"] = "Enter_Your_OpenAI_API_Key"

Working on pandas_dataframe_agent

df = pd.read_csv('train.csv')
df[0:10]

Output:

Create pandas agent:

pd_agent = create_pandas_dataframe_agent(OpenAI(temperature=0), 
                         df, 
                         verbose=True)

The code creates a pandas data frame agent using create_pandas_dataframe_agent, it takes in 3 arguments, information about the language model OpenAI(temparature=0), the data frame that we want to pass to the model, and setting verbose to true gives detailed outputs.

Run the agent now to observe the steps involved in processing natural language input.

pd_agent.run("What is survival rate of onboarded passenger?")

Output:

> Entering new  chain...
Thought: I need to calculate the ratio of survived passengers to total passengers
Action: python_repl_ast
Action Input: df['Survived'].sum() / df['PassengerId'].count()
Observation: 0.3838383838383838
Thought: I now know the final answer
Final Answer: The survival rate of onboarded passengers is 0.38.
> Finished chain.
'The survival rate of onboarded passengers is 0.38.'

From the output, the agent receives the task as input, and it initiates thought on knowing what is the task about. It moves on to the next action i.e. to execute a Python REPL command (which is to work interactively with the Python interpreter) that calculates the ratio of survived passengers to total passengers.

The pandas_dataframe_agent is more versatile and suitable for advanced data analysis tasks, while the csv_agent is more specialized for working with CSV files.

OPENAI Functions Agent

OpenAI models (like gpt-3.5-turbo-0613 and gpt-4-0613) have been fine-tuned to detect when a function should be called and respond with the inputs that should be passed to these functions.

In an API call, you can describe functions and have the model intelligently choose to output a JSON object containing arguments to call those functions. The goal of the OpenAI Function APIs is to more reliably return valid and useful function calls than a generic text completion or chat API. To explore more, read OpenAI’s New Function Calling Feature blog.

To simplify choosing an agent_type can provide an effective iteration of the next action considering a range of tools at its disposal for some series of multiple connected tasks. There are many agent types for various use cases or you can create one for your use case.

Woking with CSV Agents

The csv_agent utilizes ChatOpenAI as the language model, and it initializes the agent with the OPENAI_FUNCTIONS agent type before running to interact with the CSV file.

csv_agent = create_csv_agent(ChatOpenAI(temperature=0,
   model="gpt-3.5-turbo-0613"), 
  'train.csv',
   verbose=True,
   agent_type=AgentType.OPENAI_FUNCTIONS,)
csv_agent.run('What are mortality rates considering passenger class and gender? which catrgory had best motality?')

Output:

> Entering new  chain...
Thought: I need to group the data by passenger class and gender and then calculate the mortality rate for each group.
Action: python_repl_ast
Action Input: df.groupby(['Pclass', 'Sex'])['Survived'].mean()
Observation: Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64
Thought: I now know the mortality rates for each group.
Final Answer: The highest mortality rate was for female passengers in the first class, with a rate of 0.968085.
> Finished chain.
The highest mortality rate was for female passengers in the first class, with a rate of 0.968085.

As we discussed in Pandas agent, you can understand the thought and actions agent initialized towards getting the answer from the output. It initiates a thought on what is the task about.

It moves on to the next action i.e. to execute a Python REPL command that gets code to group passengers by ticket class and sex, and then calculates the mean of Survived column which in a logical sense is the mortality rate. The next action is to execute the code and get the final answer.

CSV Agent with Multiple CSV files

Similarly, you can upload multiple CSV files by passing them in as a list and interacting with the agent for insights. See the example implementation that follows,

multi_agent = create_csv_agent(
    ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613"),
    ["test.csv", "train.csv"],
    verbose=True,
    agent_type=AgentType.OPENAI_FUNCTIONS,
)
multi_agent.run('Are there any common rows in both datasets?')
Output:
> Entering new  chain...
Invoking: `python_repl_ast` with `common_rows = df1.merge(df2, on=['PassengerId'], how='inner')
common_rows.head()`
Empty DataFrame
Columns: [PassengerId, Pclass_x, Name_x, Sex_x, Age_x, SibSp_x, Parch_x, Ticket_x, Fare_x, Cabin_x, Embarked_x, Survived, Pclass_y, Name_y, Sex_y, Age_y, SibSp_y, Parch_y, Ticket_y, Fare_y, Cabin_y, Embarked_y]
Index: []
[0 rows x 22 columns]There are no common rows in both datasets based on the 'PassengerId' column.
> Finished chain.
There are no common rows in both datasets based on the 'PassengerId' column
multi_agent.run('What is the combined mean ticket fare for each class?')
Output:
Invoking: `python_repl_ast` with `df = pd.concat([df1, df2])
combined_mean_fare = df.groupby('Pclass')['Fare'].mean()
combined_mean_fare`
 
The combined mean ticket fare for each class is as follows:
- Class 1: $87.51
- Class 2: $21.18
- Class 3: $13.30

The reason to use ChatOpenAI as the language model is based on its support for OpenAI functions and its suitability for chat-type tasks. Use the code to use OpenAi as the agent’s language model.

csv_agent = create_csv_agent(OpenAI(temperature=0),'train.csv',verbose=True,)

Why use agent_type for CSV agent?

Different agent types for the CSV agents determine how the agent interacts with the CSV file. The two main agent types for the CSV agent are ZERO_SHOT_REACT_DESCRIPTION and OPENAI_FUNCTIONS.

Using the ZERO_SHOT_REACT_DESCRIPTION agent type allows the agent to interact with the CSV file using natural language queries. The agent can leverage natural language descriptions to determine and utilize appropriate functions. OPENAI_FUNCTIONS agent type allows you to interact with the CSV file using Python code.

The main difference between using different agent types for the CSV agent is how you interact with the CSV file. ZERO_SHOT_REACT_DESCRIPTION agent type allows for natural language queries, while the OPENAI_FUNCTIONS agent type allows for executing Python code on the CSV data.

Conclusion

As mentioned above pandas_dataframe_agent is more versatile and suitable for advanced data analysis tasks, while the csv_agent is more specialized for working with CSV files. Now you know how to query structured data using CSV Agents of LangChain and Pandas.

You may even notice that the prompt templates of both agents are identical when you check them. Run “pd_agent.agent.llm_chain.prompt.template” to check it out. An agent can carry out limited tasks using tools at its disposal.

CSV agent is a helpful tool for the quick processing of data. By connecting these agents, we can achieve the ability to handle complex tasks on our data. A wide range of data-driven applications can be built by leveraging LangChain agents.

Langchain can be used to build tools around core LLMs with very few lines of code. Here are some captivating reads on Langchain that’ll get you started with the LangChain library:

LangChain: Introduction and Getting Started

The Next Frontier: Empowering Language Models with Custom Tools in LangChain

Understanding Conversational Memory in LangChain 🦜🔗: Enhancing Chatbot Interactions

Document Summarization with LangChain



0 Shares:
Leave a Reply

Your email address will not be published. Required fields are marked *