Data Extractor
April 21, 2023 No Comments

Introduction

In the world of data analysis, extracting useful information from tabular data can be a difficult task. Conventional approaches typically require manual exploration and analysis of data, which can be requires a significant amount of effort, time, or workforce to complete. 

The emergence of advanced language models such as ChatGPT has introduced a promising and innovative approach to extracting useful information from tabular data. OpenAI announced an official API for ChatGPT which is powered by gpt-3.5-turbo, OpenAI’s most advanced language model.

This blog describes you the process of extracting useful information from tabular data using ChatGPT API.

Dataset

For all illustrations in this post, We will be utilizing the following data. These data are taken from the following: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams

Note that we have only taken into account 30 records from the dataset. 

Please see the data provided below, which will be used for the purpose of this blog.

gender race parentallevelofeducation lunch testpreparationcourse mathscore readingscore writingscore
female group B bachelor standard none 72 72 74
female group C some college standard completed 69 90 88
female group B master standard none 92 95 93
male group A associate free/reduced none 47 57 44
male group C some college standard none 76 78 75
female group B associate standard none 71 83 78
female group B some college standard completed 88 95 92
male group B some college free/reduced none 40 43 39
male group D high school free/reduced completed 64 64 67
female group B high school free/reduced none 38 60 50
male group C associate standard none 58 54 52
male group D associate standard none 40 52 43
female group B high school standard none 65 81 73
male group A some college standard completed 78 72 70
female group A master standard none 50 53 58
female group C some high school standard none 69 75 78
male group C high school standard none 88 89 86
female group B some high school free/reduced none 18 32 28
male group C master free/reduced completed 46 42 46
female group C associate free/reduced none 54 58 61
male group D high school standard none 66 69 63
female group B some college free/reduced completed 65 75 70
male group D some college standard none 44 54 53
female group C some high school standard none 69 73 73
male group D bachelor free/reduced completed 74 71 80
male group
A
master free/reduced none 73 74 72
male group B some college standard none 69 54 55
female group C bachelor standard none 67 69 75
male group C high school standard none 70 70 65

 

Data Extraction using ChatGPT API

ChatGPT relies solely on natural language processing (NLP) techniques to understand and extract information from tabular data. It can analyze the text-based input provided by the user, interpret the query, and generate a response based on the content of the tabular data.

Here’s an example of how you can utilize the ChatGPT API to extract information from tabular data:

Step 1: Prepare Input

We have stored our tabular data in a CSV file, you can read the CSV file using “Pandas” Python library and pass the data to the ChatGPT API for information extraction.

				
					import pandas as pd

read_csv=pd.read_csv("Student.csv")
				
			
Step 2: Use the ChatGPT API

Before we begin utilizing the ChatGPT API, please make sure that you have installed OpenAI Python library in your system.

				
					pip install openai
				
			

You can extract information by providing the tabular data and input text to the ChatGPT API. This can be done by reading the tabular data from a CSV file, preparing the input for the API, and passing it along with the input text. The API will then extract the relevant information from the data and provide it in the response.

				
					import openai
import pandas as pd

openai.api_key = '<YOUR OPENAI API KEY>'

read_csv = pd.read_csv("Student.csv")


input_text='''What is the average math score for male students?'''

prompt = """Please regard the following data:\n {}. Answer the following question and please return only value: {}""".format(read_csv, input_text)

request = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[
        {"role": "user", "content": prompt},
    ]
)
result = request['choices'][0]['message']['content']

print("ChatGPT Response=>",result) 
				
			

The following is the response that was received from the ChatGPT API:

Now let’s review the responses to a few questions based on the extracted information.

QuestionChatGPT ResponseAnalysis
What is the gender of the student who scored 72 in math?femaleCorrect
Which race/ethnicity group does the student belong to who scored the highest in reading?group CIncorrect
It should be “group B”
Did the student who scored 47 in math complete the test preparation course?NoCorrect



What is the average math score for male students?
65.4Incorrect
It should be “62.2”
What is the writing score for the student who belongs to group C race/ethnicity and has a master’s degree?The writing score for the specified student is not given in the data.Incorrect
It should be “46”

We observed that ChatGPT is incapable of performing aggregations, such as summing or averaging the table entries, and occasionally struggles to respond to basic questions.

You can add your table and test it right away using ChatGPT playground without using the API. Please visit the below link
https://chat.openai.com/

SQL-based Data Extraction from Database using ChatGPT API

We can now utilize ChatGPT again, but this time not for direct table analysis. Instead, we can use ChatGPT to generate SQL statements for a database that contains the data. SQL is capable of handling filtering, sorting, aggregation, and summation logic, which can help in performing the required calculations accurately.
We utilized SQLite as the database engine and employed the sqlite3 Python library to interact with it.

Step 1: Create SQLite database and table

Here’s a Python code that creates a database and a table in SQLite:

				
					import sqlite3

# Connect to SQLite database (this will create a new database file if it doesn't exist)
conn = sqlite3.connect("chatgpt.db")
cursor = conn.cursor()

# Create a table
cursor.execute("""
    CREATE TABLE IF NOT EXISTS student (
        gender TEXT,
        race TEXT,
        parentallevelofeducation TEXT,
        lunch TEXT,
        testpreparationcourse TEXT,
        mathscore INTEGER,
        readingscore INTEGER,
        writingscore INTEGER
    )
""")

# Commit the transaction and close the connection
conn.commit()
conn.close()
				
			
Step 2: Adding Data to a Database

Here’s a Python code that read data from CSV file using pandas and insert it into a SQLite database:

				
					import sqlite3
import pandas as pd

df=pd.read_csv("Student.csv")
# Connect to SQLite database
conn = sqlite3.connect('chatgpt.db')

# Insert DataFrame into SQLite database
df.to_sql('student', conn, if_exists='replace', index=False)

# Close database connection
conn.close()
				
			
Step 3: Use ChatGPT API

You can extract information by providing the database table name, its corresponding columns, and input text to the ChatGPT API.  ChatGPT API will generate an SQL query from the given input text and then use that query to retrieve data from the database.

Here’s an example of how you can extract information using the ChatGPT API by providing the database table name, its corresponding columns, and input text:

				
					import sqlite3
import openai

# Connect to SQLite database
conn = sqlite3.connect('chatgpt.db')
cursor = conn.cursor()


openai.api_key = '<YOUR OPENAI API KEY>'

# Function to get table columns from SQLite database
def get_table_columns(table_name):
    cursor.execute("PRAGMA table_info({})".format(table_name))
    columns = cursor.fetchall()
    print(columns)
    return [column[1] for column in columns]


# Function to generate SQL query from input text using ChatGPT
def generate_sql_query(table_name,text,columns):
    prompt = """You are a ChatGPT language model that can generate SQL queries. Please provide a natural language input text, and I will generate the corresponding SQL query for you.The table name is {} and corresponding columns are {}.\nInput: {}\nSQL Query:""".format(table_name,columns,text)
    print(prompt)
    request = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[
            {"role": "user", "content": prompt},
        ]
    )
    sql_query = request['choices'][0]['message']['content']
    return sql_query    

# Function to execute SQL query on SQLite database
def execute_sql_query(query):
    cursor.execute(query)
    result = cursor.fetchall()
    return result


text="What is the average math score for male students?"

table_name = 'student'
columns = get_table_columns(table_name)
sql_query=generate_sql_query(table_name,text,columns)
print("Generated SQL query: ",sql_query)
if sql_query:
    result=execute_sql_query(sql_query)
    print("ChatGPT Response=>",result)
    
# Close database connection
cursor.close()
conn.close()
				
			

After sending the input text to the ChatGPT API, you will receive a response containing the generated SQL query. You can then use this query to retrieve the desired data from the database.

The response of this approach looks as below:

QuestionSQL query generated by ChatGPTAnswerAnalysis
What is the gender of the student who scored 72 in math?SELECT gender FROM student WHERE mathscore=72;femaleCorrect
Which race/ethnicity group does the student belong to who scored the highest in reading?SELECT race AS “Race/Ethnicity”, MAX(readingscore) AS “Highest Reading Score”
FROM student
GROUP BY race
ORDER BY “Highest Reading Score” DESC
LIMIT 1;
group BCorrect
Did the student who scored 47 in math complete the test preparation course?SELECT testpreparationcourse
FROM student
WHERE mathscore = 47
NoneCorrect
What is the average math score for male students?SELECT AVG(mathscore) FROM student WHERE gender = ‘male’62.2Correct
What is the writing score for the student who belongs to group C race/ethnicity and has a masters degree?SELECT writingscore FROM student WHERE race = ‘group C’ AND parentallevelofeducation = ‘master’46Correct

ChatGPT without SQL relies solely on NLP techniques to understand and generate responses based on tabular data, which may result in incorrect responses. However, when ChatGPT is combined with SQL capabilities, it can perform more advanced and flexible interactions with tabular data by directly executing SQL queries. This allows for more accurate and precise retrieval of information from databases, enhancing the overall capabilities of ChatGPT in dealing with tabular data.

Ready to harness the power of ChatGPT for your chatbot or NLP application? Partner with our expert developers to bring your dream product to life. Connect with us at letstalk@pragnakalp.com or share your requirements here to get started on your journey to success!

Write a comment

Your email address will not be published. Required fields are marked *