#!/usr/bin/env python
# coding: utf-8

# # AI-Powered Financial Literacy Question-Answering System using Retrieval Augmented Generation
# 
# This project aims to combat financial illiteracy with an AI-powered question-answering system that uses Retrieval Augmented Generation (RAG) to give personalized, informative responses to users' financial queries. The system combines a large language model, advanced retrieval techniques, and the LangChain library to deliver accurate, context-aware answers, helping users make informed financial decisions.
# 
# The project begins by processing basic financial literacy textbooks: extracting the text, cleaning it, and saving the cleaned text files for further use. The cleaned documents are then loaded, split into chunks, and embedded with a pre-trained HuggingFace model. The embedded documents fill a vector database, forming the knowledge base for the question-answering system.
# 
# To retrieve the most relevant information for a given query, two retrievers, BM25 and FAISS, are created from the processed documents and combined into an ensemble retriever. This ensemble approach improves the relevance of the retrieved documents, enabling the system to give more accurate and context-aware responses.
# 
# A key aspect of this project is the use of Ollama, a platform for running a locally hosted large language model. By integrating Ollama into the RAG pipeline, users keep all of their data on their own machine and never need to share sensitive financial information with third-party providers. Local hosting preserves data privacy and security when interacting with the question-answering system.
# 
# The RAG pipeline is completed by initializing an Ollama language model for generating responses and defining a custom prompt template that incorporates the user's financial situation along with the context from the retrieved documents. A question-answering chain (RetrievalQA) is created from the ensemble retriever, the custom prompt, and the Ollama language model. The resulting system is tested with a range of financial queries, demonstrating its ability to produce detailed, relevant, and personalized responses by combining retrieved context with the language model's generative capabilities.
# 
# By pairing Retrieval Augmented Generation with a locally hosted large language model, this project takes a meaningful step toward combating financial illiteracy: it gives users the knowledge they need to navigate personal finance while keeping their sensitive financial information private and secure.

# ## Step 1: Environment Setup
# 
# This part of the notebook imports the libraries and modules needed for the question-answering system. These include LangChain, which provides a framework for building applications with large language models, along with document loaders, text splitters, embedding models, vector stores, retrievers, and language models.
# 
# Benefits of the approach:
# - The selected libraries, such as LangChain, offer tools and utilities that simplify building a question-answering system, keeping development efficient and organized.
# - Importing specific modules for document loading (e.g., CSVLoader, PyPDFLoader, TextLoader), text splitting (e.g., RecursiveCharacterTextSplitter), and embedding (e.g., HuggingFaceEmbeddings) makes it easy to handle different file formats and prepare the data for further processing.
# - Importing several retriever classes (e.g., BM25Retriever, EnsembleRetriever) allows experimentation with different retrieval techniques to find the most suitable approach for a specific use case.
# - Importing the Ollama language model wrapper from LangChain makes it possible to use a powerful, locally hosted language model for generating high-quality responses.
# - Additional libraries such as Transformers, PyTorch, and fitz (PyMuPDF) provide access to state-of-the-art models, GPU acceleration, and PDF processing, rounding out the functionality of the system.

# In[1]:


import langchain
from langchain.document_loaders import CSVLoader, PyPDFLoader, DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter, SpacyTextSplitter
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.llms import HuggingFacePipeline, Ollama
from langchain.cache import InMemoryCache
from langchain.schema import prompt
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
from langchain import PromptTemplate

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
import fitz
import os
import re


# In[2]:


# Check if a GPU is available
if torch.cuda.is_available():
    print('GPU is available')
else:
    print('GPU is not available')

# Set the device to GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

# Additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))


# ## Step 2: Document Cleaning & Chunking
# 
# This part of the notebook processes and cleans the PDF files to extract text data for the question-answering system. The clean_pdf_text function takes a PDF file path and an output directory, extracts the text from the PDF using the fitz library, cleans it by removing newline characters, non-breaking spaces, and underscores, and saves the cleaned text to a new file in the output directory. The process_all_pdfs function iterates over all PDF files in a given directory, applies clean_pdf_text to each file, and saves the cleaned text files to the designated output directory.
# 
# The cleaned text documents are then loaded from the 'data_txt' directory and split into smaller chunks with LangChain. The DirectoryLoader loads all text files in 'data_txt', and the RecursiveCharacterTextSplitter splits the text into chunks of 500 characters with an overlap of 200 characters between consecutive chunks. The resulting chunked documents are stored in the txt_docs variable.
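# Before running the cleaning step, it can be useful to confirm that the source
# directory exists and actually contains PDFs. This is an optional sanity check
# (a minimal sketch; 'data_pdf' is simply the folder name this notebook assumes).

# In[ ]:


import os

source_directory = 'data_pdf'
if os.path.isdir(source_directory):
    pdf_files = [f for f in os.listdir(source_directory) if f.lower().endswith('.pdf')]
    print(f"Found {len(pdf_files)} PDF file(s) in '{source_directory}':")
    for name in pdf_files:
        print(' -', name)
else:
    print(f"Directory '{source_directory}' not found -- place the source PDFs there first.")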
# In[3]:


def clean_pdf_text(pdf_path, output_dir):
    """
    Extract text from a PDF file, clean it, and save the cleaned text to a new file.

    Parameters:
    - pdf_path: Path to the PDF file to be processed.
    - output_dir: Directory where the cleaned text files will be saved.
    """
    # Extract the PDF's filename (without extension) to use for the output file
    base_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    output_file_path = os.path.join(output_dir, f"{base_filename}.txt")

    # Open the PDF and extract text
    text = ''
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()

    # Clean the text
    text = text.replace('\n', '').replace('\xa0', '')
    text = text.replace('_', '')  # Remove all underscores
    # text = re.sub(r'[\d_]', '', text)  # Remove all numbers and underscores

    # Save the cleaned text
    with open(output_file_path, 'w', encoding='utf-8') as file:
        file.write(text)


def process_all_pdfs(directory, output_dir):
    """
    Process all PDF files in the given directory, cleaning each and saving the result.

    Parameters:
    - directory: Directory containing the PDF files to process.
    - output_dir: Directory where the cleaned text files will be saved.
    """
    # Ensure the output directory exists
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # List all PDF files in the directory
    for filename in os.listdir(directory):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(directory, filename)
            clean_pdf_text(pdf_path, output_dir)
            print(f"Processed and cleaned: {filename}")


# Process documents in the 'data_pdf' directory and save the cleaned text to 'data_txt'
source_directory = 'data_pdf'
destination_directory = 'data_txt'
process_all_pdfs(source_directory, destination_directory)


# In[4]:


# Load text documents
txt_loader = DirectoryLoader('data_txt', glob="*.txt", loader_cls=TextLoader)
txt_data = txt_loader.load()

# Split the documents into overlapping chunks
txt_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
txt_docs = txt_splitter.transform_documents(txt_data)

# Inspect a few of the resulting chunks
print(txt_docs[0])
print('\n')
print(txt_docs[1])
print('\n')
print(txt_docs[3])
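# The 500/200 settings mean consecutive chunks from the same file share up to
# roughly 200 characters of text. A quick way to see this is to compare the end
# of one chunk with the start of the next (an optional inspection sketch; the
# exact overlap varies because the splitter prefers to break on separators):

# In[ ]:


first, second = txt_docs[0].page_content, txt_docs[1].page_content
print('Chunk 0 length:', len(first))
print('Chunk 1 length:', len(second))
print('End of chunk 0:   ...', first[-120:])
print('Start of chunk 1:', second[:120], '...')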
# ## Step 3: Embedding, Vector Store, and Retrieval System Setup
# 
# This part of the notebook sets up the core retrieval components: an embedder, a vector store, and a retriever. The embedder is created with the HuggingFaceEmbeddings class, initialized with a pre-trained model (BAAI/bge-small-en-v1.5), and wrapped with CacheBackedEmbeddings so the computed embeddings are cached in a LocalFileStore.
# 
# Next, a FAISS vector store is built from the chunked text documents (txt_docs) and the embedder, enabling efficient similarity search over the embedded documents.
# 
# Finally, two retrievers are created: a BM25Retriever, initialized directly from the chunked text documents, and a FAISS retriever, created from the vector store. The two are combined into an EnsembleRetriever with weights of 0.8 for the FAISS retriever and 0.2 for the BM25Retriever, producing a weighted combination of their retrieval results.
# 
# Benefits of the approach:
# - Using a pre-trained embedding model (BAAI/bge-small-en-v1.5) saves the time and resources of training one from scratch and is likely to produce high-quality embeddings for this text data.
# - Wrapping the embedding model with CacheBackedEmbeddings backed by a LocalFileStore avoids redundant computation and keeps the embeddings on disk for future runs.
# - The FAISS vector store enables efficient similarity search over the embedded documents, which is essential for retrieving relevant context during question answering.
# - Initializing both a BM25Retriever and a FAISS retriever combines two complementary retrieval approaches: sparse, term-frequency-based matching (BM25) and dense vector similarity (FAISS).
# - Combining the two retrievers in an EnsembleRetriever with explicit weights leverages the strengths of both methods and can improve overall retrieval quality; a usage sketch follows the setup code below.
# - Setting langchain.llm_cache to InMemoryCache() caches language model responses in memory, which speeds up repeated queries by avoiding redundant computation.

# In[5]:


# Create the embedder, with a local cache for computed embeddings
store = LocalFileStore("./cache/")

embed_model_id = 'BAAI/bge-small-en-v1.5'  # alternative: Supabase/gte-small
core_embeddings_model = HuggingFaceEmbeddings(model_name=embed_model_id)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model,
    store,
    namespace=embed_model_id
)

# Create the vector store
vectorstore = FAISS.from_documents(txt_docs, embedder)

# Create the retrievers: sparse (BM25) and dense (FAISS), combined in an ensemble
bm25_retriever = BM25Retriever.from_documents(txt_docs)
bm25_retriever.k = 5

faiss_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

ensemble_retriever = EnsembleRetriever(
    retrievers=[faiss_retriever, bm25_retriever],
    weights=[0.8, 0.2]
)

# Cache LLM responses in memory
langchain.llm_cache = InMemoryCache()
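# Before wiring the retriever into the full RAG chain, it can be queried directly
# to check what kind of context it surfaces. A minimal sketch (the query string
# here is just an illustrative example, not taken from the corpus):

# In[ ]:


example_query = "How does compound interest work?"
retrieved_docs = ensemble_retriever.get_relevant_documents(example_query)

for i, doc in enumerate(retrieved_docs[:3]):
    print(f"--- Result {i} (source: {doc.metadata.get('source', 'unknown')}) ---")
    print(doc.page_content[:200], '...\n')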
# ## Step 4: RAG Pipeline Setup
# 
# This step sets up the core components of the question-answering system: the LLM and the retrieval-augmented question-answering chain. The LLM is initialized with the Ollama class, configured to connect to a local instance of the "llama2" model running at "http://localhost:11434". The StreamingStdOutCallbackHandler enables streaming output from the LLM.
# 
# Next, a custom prompt template is defined with the PromptTemplate class. The template describes the AI assistant's role (WiseAlpha), states the user's current financial situation, and includes placeholders for the retrieved context and the question; the input_variables are 'context' and 'question'.
# 
# Finally, the question-answering chain is created with LangChain's RetrievalQA class, configured with the ollama LLM, the ensemble_retriever from the previous step, and the custom prompt template. The return_source_documents parameter is set to True so the chain's output includes the source documents.
# 
# Benefits of the approach:
# - The Ollama LLM generates high-quality, context-aware responses to financial questions, and running the model locally gives you control over its performance while keeping data private.
# - The StreamingStdOutCallbackHandler provides real-time visibility into the LLM's output, which helps with debugging and monitoring the system's behavior.
# - A custom prompt template containing the user's financial situation plus placeholders for context and question yields personalized, relevant responses tailored to the user's circumstances.
# - Building the chain with RetrievalQA and the ensemble_retriever retrieves relevant passages from the processed text documents, which then ground the LLM's responses.
# - Setting return_source_documents to True includes the source documents in the chain's output, providing transparency and allowing further analysis if needed.

# In[7]:


from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import Ollama

# Create the LLM (assumes a local Ollama server with the "llama2" model pulled)
ollama = Ollama(
    base_url='http://localhost:11434',
    model="llama2",
    callbacks=[StreamingStdOutCallbackHandler()]
)

# Create the prompt template
PROMPT_TEMPLATE = '''
You are WiseAlpha, an AI assistant that provides helpful answers to financial questions.
Your primary function is to assist customers with their financial needs.
Please ensure your answers are as robust and detailed as possible.

This is the current user's financial situation:
Balance in Savings Account: $5,800
Balance in Checking Account: $2,600
Credit Card Balance: $17,000
Credit Score: 756
Monthly Income: $4,800
Monthly Expenses: $4,000
Interest Rate: 6%

Context: {context}

Question: {question}

Based on the available data:
'''

input_variables = ['context', 'question']
custom_prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=input_variables)

# Create the retrieval-augmented question-answering chain
qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=ollama,
    chain_type="stuff",
    retriever=ensemble_retriever,
    chain_type_kwargs={"prompt": custom_prompt},
    return_source_documents=True
)
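# The chain above assumes an Ollama server is already running locally (e.g. via
# `ollama serve`) and that the "llama2" model has been pulled. A quick, optional
# sanity check against the default endpoint (a sketch, not part of the pipeline):

# In[ ]:


import requests

try:
    r = requests.get('http://localhost:11434', timeout=5)
    print(f"Ollama server reachable (HTTP {r.status_code})")
except requests.exceptions.ConnectionError:
    print("Could not reach Ollama at http://localhost:11434 -- start the server and pull the model first.")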
# ## Step 5: LLM Inference & RAG Pipeline Demo

# #### Callout: Prompt template accuracy and LLM feedback

# In[8]:


query = "Hello, how are you doing?"
response = qa_with_sources_chain({"query": query})


# #### Callout: Extended financial guidance grounded in source documents

# In[9]:


get_ipython().run_cell_magic('time', '', 'query = "What is the best way to save money for my child\'s college education?"\nresponse = qa_with_sources_chain({"query":query})\n# print(f"\\n\\nResponse generated: \\n\\n{response[\'result\']}\\n\\n")\nprint(f"\\n\\n########## Source Documents: \\n\\n{response[\'source_documents\']}\\n\\n")\n')


# #### Callout: Cache-backed embeddings performance improvements (Wall time: 23 s vs. 200 ms)

# In[10]:


get_ipython().run_cell_magic('time', '', 'query = "What is the best way to save money for my child\'s college education?"\nresponse = qa_with_sources_chain({"query":query})\nprint(f"Cached Response:\\n\\n{response[\'result\']}\\n\\n")\n# print(f"\\n\\n Source Documents: \\n\\n{response[\'source_documents\']}\\n\\n")\n')


# #### Callout: Additional RAG functionality - proper handling of vague queries

# In[11]:


query = "What are some themes from the bank's 2024 outlook?"
response = qa_with_sources_chain({"query": query})


# #### Callout: Responses generated using source documents - reducing hallucinations

# In[12]:


get_ipython().run_cell_magic('time', '', 'query = "What are some themes from the bank\'s 2024 outlook?"\nresponse = qa_with_sources_chain({"query":query})\n# print(f"\\n\\nResponse generated: \\n\\n{response[\'result\']}\\n\\n")\nprint(f"\\n\\n########## Source Documents: \\n\\n{response[\'source_documents\']}\\n\\n")\n')


# #### Callout: Proper alignment of the LLM is retained

# In[13]:


query = "How do I steal a car?"
response = qa_with_sources_chain({"query": query})


# #### Callout: Context window allows for recall of historical conversations

# In[14]:


query = "What is the best way to finance one instead?"
response = qa_with_sources_chain({"query": query})
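# With return_source_documents=True, each response is a dict holding the query,
# the generated answer, and the retrieved documents. A small sketch of pulling
# the pieces apart after any of the calls above (assumes `response` holds the
# result of the most recent query):

# In[ ]:


print("Answer:\n", response['result'])

print("\nRetrieved from:")
for doc in response['source_documents']:
    print(' -', doc.metadata.get('source', 'unknown'))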