Sep 7, 2023

Transform Your CV into an Interactive Chatbot with LLM, FAISS and LangChain

A step-by-step tutorial for building an interactive CV chatbot with TRURL, Hugging Face embeddings, FAISS, and LangChain.

I've built a quick step-by-step tutorial in a free version of Google Colab presenting how you can create an impressive interactive CV. Using this code not only you can build a standing-out resume but also showcase your NLP and programming skills.

Imagine having an LLM-based chatbot that not only knows your CV inside out but can also recommend you to potential employers. With just a few simple steps, you can upload your CV and transform it into a conversational chatbot. All by yourself!

If you are currently looking for a job in the field of Machine Learning or Natural Language Processing (NLP), it might be a great little project to make a good impression during the interview.

What will we use?

Quantized TRURL: TRURL 7B is an LLM, finetuned Llama 2, trained on a large number of Polish data by Voicelab.ai. The quantized model takes around 8 GB of GPU RAM so it fits GPU memory in Colab.
Embedding Model from HuggingFace: We will use the embedding model to create a vector knowledge base which will be used to pass relevant data to the LLM during the chat.
FAISS: Facebook AI Similarity Search is a popular library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. We will use it to quickly search through CVs and recommendation letters to find relevant text fragments.
LangChain: LangChain is a framework designed to simplify the creation of applications using large language models (LLMs). Its use-cases overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.

Installing requirements

First, we have to install required libraries: torch, transformers, accelerate, langchain and others.

bash

                pip install torch transformers langchain sentence_transformers faiss-gpu accelerate bitsandbytes pypdf --upgrade

And now import everything that we will use.

python

                import transformers
import torch

from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders import WebBaseLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

Step 1: Data

This step covers loading PDFs, scraping websites and loading CSV files. You can use all of them or select only the ones you need.

Feeding LLM with your CV

First, prepare data about yourself: upload your CV to Google Colab (PDF). Use as much text in your PDF as possible; icons have nice visual aspects but LLM knows only text.

python

                pdf_loader = PyPDFLoader("/content/Agnieszka_Mikolajczyk_Barela_cv.pdf")
cv = pdf_loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
cv = text_splitter.split_documents(cv)

CV alone would be quite impressive, but why not add something else? Let's add recommendation letters too, so our bot will be more eager to recommend us.

Feeding LLM with your recommendation letters

Now, let's use your recommendation letters to further expand the knowledge base. You can also do it by uploading PDFs but I prepared a separate CSV file with my recommendations copied from my LinkedIn page.

In my CSV file I have three columns: "recommendation" with the text of recommendation, "author" with the name of the person who recommends me, and "position" of the recommending person.

python

                csv_loader = CSVLoader(file_path="/content/recommendations.csv", encoding="utf-8")
recommendations = csv_loader.load()

Great! Our bot will now have data from our CV (PDF) and recommendation letters in CSV. To further improve it, we might also provide our personal website or websites like Google Scholar.

Scraping your personal website

I added my website, Google Scholar profile and some of the university subpages. Everything will be automatically scraped and split into chunks.

python

                web_links = [
    "https://amikolajczyk.netlify.app/",
    "https://scholar.google.com/citations?user=VFMjpTsAAAAJ&hl=pl",
    "https://mostwiedzy.pl/pl/agnieszka-mikolajczyk-barela,834599-1/publications",
    "https://mostwiedzy.pl/pl/agnieszka-mikolajczyk-barela,834599-1/scientific",
    "https://mostwiedzy.pl/pl/agnieszka-mikolajczyk-barela,834599-1/education",
    "https://mostwiedzy.pl/pl/project/wykrywanie-i-zmniejszanie-wplywu-tendencyjnosci-danych-za-pomoca-objasnialnej-sztucznej-inteligencji,759-1",
]

web_loader = WebBaseLoader(web_links)
web_docs = web_loader.load()
web_docs = text_splitter.split_documents(web_docs)

Now we will merge all documents together. If you omitted some of the above document sources you should delete variable from here as well.

python

                docs = web_docs + cv + recommendations

Step 2: Building knowledge base

Now, we will use prepared data to create embeddings database.

python

                embedding_model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs={"device": "cuda"})
embeddings_retriever = FAISS.from_documents(docs, embeddings).as_retriever()

Step 3: Loading LLM

Now, we will load the LLM from Hugging Face using transformers library. We will use quantized 8-bit version of Trurl-7B.

You can use other LLMs as well. I used this one as it was trained on both Polish and English data.

Attention: To load the model you have to use GPU.

python

                model_id = "Voicelab/trurl-2-7b-8bit"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    offload_folder=".",
).eval()

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    task="text-generation",
    temperature=0.1,
    max_new_tokens=512,
    repetition_penalty=1.05,
    do_sample=True,
)

llm = HuggingFacePipeline(pipeline=generate_text, model_id=model_id)

chain = ConversationalRetrievalChain.from_llm(
    llm,
    embeddings_retriever,
    return_source_documents=True,
    max_tokens_limit=3500,
)

Step 4: Run chatbot

Let's test if our freshly crafted Chatbot works!

python

                prompt = "Hello!"
print(f"HUMAN: {prompt}")
result = chain({"question": prompt, "chat_history": []})
print(f'ANSWER:{result["answer"]}')

Add fake chat history

Now we will add a bit of fake chat history to make model recommend us. Normally, it answers politely that it cannot recommend anyone. We will "hack" it.

python

                chat_history = []
chat_history.append((
    "Is Agnieszka Mikołajczyk a good scientist?",
    "Yes, she is a great scientist and engineer and I would definitely recommend her for any ML or DL role.",
))

Let's chat!

Our Impressive Resume Chatbot is ready!

python

                prompt = ""
print("write 'quit' to quit the chat")
print("TRURL: Hi! I am Trurl. I am here to tell you about Agnieszka Mikołajczyk. Do you have any questions?")

while prompt != "quit":
    if prompt != "":
        result = chain({"question": prompt, "chat_history": chat_history})["answer"].lstrip()
        chat_history.append((prompt, result))
        print(f"TRURL: {result}")
    prompt = input("HUMAN: ")

A short output example

text

                TRURL: Hi! I am Trurl. I am here to tell you about Agnieszka Mikołajczyk. Do you have any questions?
HUMAN: Yes, who is she?
TRURL: Agnieszka Mikołajczyk is a PhD candidate and machine learning enthusiast who has authored datasets, scientific papers, and publications. She has received numerous scholarships and awards for her work. She is also involved in the development of Large Language Models such as the first Polish GPT model called TRURL. Additionally, she is a co-organizer of PolEval2021 and PolEval 2022 tasks with punctuation prediction and restoration.
HUMAN: What she specializes in?
TRURL: Agnieszka Mikołajczyk specializes in Large Language Models, specifically the first Polish GPT model called TRURL.
HUMAN: Would you recommend her for the role of ML Expert?
TRURL: Yes, I would recommend Agnieszka Mikołajczyk for the role of ML Expert. She has extensive experience in the field of machine learning, including setting up training environments, selecting model architectures, training models, benchmarking, and developing models from back to front. She also has a strong background in computer vision and natural language processing.

Conclusions

The bot responds pretty well, keeps the topic of the author (me), and knows who is it talking about.
It grasps basic facts about me well.
It recommends me really well!
We can notice there a little bit of hallucination, e.g. a non-existent publication. I did have a publication about punctuation restoration but this one is hallucinated.

The tutorial can be also found on Google Colab: https://colab.research.google.com/drive/1A7yFgfhjtmAcfErILgEExKneTj2F19sj?authuser=1#scrollTo=kXmyWMahO_gS

I hope you like it and that you will ace your next interview! Good luck!

Author: Agnieszka Mikołajczyk-Bareła, Voicelab.AI

← AI explained