#!/usr/bin/env python
# coding: utf-8

# In[1]:

get_ipython().run_line_magic('load_ext', 'watermark')
get_ipython().run_line_magic('watermark', "-a 'cs224' -u -d -v -p openai,redlines,markdown_it,mdformat")

# * https://daringfireball.net/projects/markdown/syntax
# * https://github.com/executablebooks/markdown-it-py
# * https://github.com/executablebooks/mdformat
# * https://github.com/houfu/redlines

# In[2]:

import numpy as np
import pandas as pd

from IPython.display import display, HTML, Markdown
from IPython.display import display_html

def display_side_by_side(*args):
    # Render DataFrames (or arrays) next to each other as inline HTML tables.
    html_str = ''
    for df in args:
        if type(df) == np.ndarray:
            df = pd.DataFrame(df)
        html_str += df.to_html()
    html_str = html_str.replace('table', 'table style="display:inline"')
    # print(html_str)
    display_html(html_str, raw=True)

CSS = """
.output {
    flex-direction: row;
}
"""

def display_graphs_side_by_side(*args):
    # Render the SVG representation of several graph objects next to each other.
    html_str = ''
    for g in args:
        html_str += '<table style="display:inline">'
        html_str += '<tr><td style="vertical-align:top">'
        html_str += g._repr_svg_()
        html_str += '</td></tr></table>
' display_html(html_str,raw=True) display(HTML("")) # In[3]: from redlines import Redlines # In[4]: import openai import os from dotenv import load_dotenv, find_dotenv _ = load_dotenv(find_dotenv()) # read local .env file openai.api_key = os.getenv('OPENAI_API_KEY') # In[5]: def get_completion(prompt, model="gpt-3.5-turbo", temperature=0): messages = [{"role": "user", "content": prompt}] response = openai.ChatCompletion.create( model=model, messages=messages, temperature=temperature, ) return response.choices[0].message["content"] # In[6]: blog_post = """ ## Rational Recently, I participated in the free [ChatGPT Prompt Engineering for Developers](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers) course by Andrew Ng and Isa Fulford. This triggered an older urge to use an AI to help me fine tune text that I writte. While simple grammar and spelling mistakes can be spoted by MS Word and similar I wanted the AI to also make my text more compelling and enhance its appeal. The AI should help me to ensure a smooth and engaging reading experience. Below you can read my first steps towards that goal. ## Get Started Before you start you will need a paid account on [platform.openai.com](https://platform.openai.com/), because only then you'll get access to the API version of ChatGPT. Some users already got beta access to `GPT-4`, but in general you will only have access to `gpt-3.5-turbo` with the API. For our purposes here that is good enough. And don't worry about the costs. This is only a couple of cents even for extended uses of the API. I never managed to get over $1 so far. Actually all my efforts to get the code for this blog post working only cost USD 0.02. Once you have acces to [platform.openai.com](https://platform.openai.com/) you will need to create an API key and put it in a `.env` file like so: ``` OPENAI_API_KEY=sk-... 
``` ## The Code ### Jupyter Notebook / Python Init I am using a [Juypter](https://jupyter.org/) notebook and python to access the API. The below explains step by step what I was doing. You start by setting up the API and defining a helper function: ```python import openai import os from dotenv import load_dotenv, find_dotenv _ = load_dotenv(find_dotenv()) # read local .env file openai.api_key = os.getenv('OPENAI_API_KEY') def get_completion(prompt, model="gpt-3.5-turbo", temperature=0): messages = [{"role": "user", "content": prompt}] response = openai.ChatCompletion.create( model=model, messages=messages, temperature=temperature, ) return response.choices[0].message["content"] ``` After that I define a little markdown text block as a playground. **Replace the single backtick, single quote, single backtick sequence with a triple backtick sequence**: ```python mark_down = ''' # Header 1 Some text under header 1. ## Header 2 More text under header 2. `' `python import pigpio handle = pi.i2c_open(1, 0x58) def horter_byte_sequence(channel, voltage): voltage = int(voltage * 100.0) output_buffer = bytearray(3) high_byte = voltage >> 8 low_byte = voltage & 0xFF; output_buffer[0] = (channel & 0xFF) output_buffer[1] = low_byte output_buffer[2] = high_byte return output_buffer v = horter_byte_sequence(0, 5.0) pi.i2c_write_device(handle, v) `' ` ### Header 3 Even more text under header 3. ## Another Header 2 Text under another header 2. ''' ``` ### Split Markdown at Headings It helps to have a [syntax](https://daringfireball.net/projects/markdown/syntax) reference for markdown close. As you can read on [learn.microsoft.com](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/chatgpt?pivots=programming-language-chat-completions): > The token limit for gpt-35-turbo is 4096 tokens. These limits include the token count from both the message array sent and the model response. 
The > number of tokens[^numtokens] in the messages array combined with the value of the max_tokens parameter must stay under these limits or you'll receive an error. This means that you can't send your whole blog post in one go to the chatgpt API, but you have to process it in pieces. In addition the approach to process the blog post in pieces will also make it easier for you later to integrate the suggestions of the AI into your blog post. I did some searching and it was not as easy as I hoped for to find a python library that allowed me to easily split the input blog post at headings. Finally I ended up using [markdown-it-py](https://github.com/executablebooks/markdown-it-py). But `markdown_it` is meant to be used to translate markdown to HTML and out of the box does not work as markdown to markdown converter. After some digging I found at the bottom of its [using](https://markdown-it-py.readthedocs.io/en/latest/using.html) documentation page that you can use [mdformat](https://github.com/executablebooks/mdformat) in combination with `markdown_it`. In addition I remove any `fence` token that anyway does not belong to the standard text flow of the blog post. 
This results in the following helper function: ```python import markdown_it import mdformat.renderer def extract_md_sections(md_input_txt): md = markdown_it.MarkdownIt() options = {} env = {} md_renderer = mdformat.renderer.MDRenderer() tokens = md.parse(md_input_txt) md_input_txt = md_renderer.render(tokens, options, env) tokens = md.parse(md_input_txt) sections = [] current_section = [] for token in tokens: if token.type == 'heading_open': if current_section: sections.append(current_section) current_section = [token] elif token.type == 'fence': continue elif current_section is not None: current_section.append(token) if current_section: sections.append(current_section) sections = [md_renderer.render(section, options, env) for section in sections] return sections ``` If you wander about the double call to `md.parse()`: I am not sure if this is strictly necessary, but I noticed that some `fence` tokens might be missed otherwise. Now you can try the splitting function on our dummy mardkown text block: ```python sections = extract_md_sections(mark_down) for section in sections: print(section) print('---') ``` And should see the following result: ``` # Header 1 Some text under header 1. --- ## Header 2 More text under header 2. --- ### Header 3 Even more text under header 3. --- ## Another Header 2 Text under another header 2. --- ``` """ # In[7]: mark_down = ''' # Header 1 Some text under header 1. ## Header 2 More text under header 2. ```python import pigpio handle = pi.i2c_open(1, 0x58) def horter_byte_sequence(channel, voltage): voltage = int(voltage * 100.0) output_buffer = bytearray(3) high_byte = voltage >> 8 low_byte = voltage & 0xFF; output_buffer[0] = (channel & 0xFF) output_buffer[1] = low_byte output_buffer[2] = high_byte return output_buffer v = horter_byte_sequence(0, 5.0) pi.i2c_write_device(handle, v) ``` ### Header 3 Even more text under header 3. ## Another Header 2 Text under another header 2. 
'''


# In[8]:

import markdown_it
import mdformat.renderer


# In[9]:

def extract_md_sections(md_input_txt):
    md = markdown_it.MarkdownIt()
    options = {}
    env = {}
    md_renderer = mdformat.renderer.MDRenderer()
    # Normalize the input first: parse and re-render once, so that the second
    # parse works on a clean token stream (some fence tokens are missed otherwise).
    tokens = md.parse(md_input_txt)
    md_input_txt = md_renderer.render(tokens, options, env)
    tokens = md.parse(md_input_txt)
    sections = []
    current_section = []
    for token in tokens:
        if token.type == 'heading_open':
            # A new heading starts a new section.
            if current_section:
                sections.append(current_section)
            current_section = [token]
        elif token.type == 'fence':
            # Drop code fences; they do not belong to the standard text flow.
            continue
        elif current_section is not None:
            current_section.append(token)
    if current_section:
        sections.append(current_section)
    sections = [md_renderer.render(section, options, env) for section in sections]
    return sections


# In[10]:

sections = extract_md_sections(mark_down)
for section in sections:
    print(section)
    print('---')


# In[11]:

sections = extract_md_sections(blog_post)
for section in sections:
    print(section)
    print('-----------------------------------------------------------------------------------------------------------')


# In[12]:

i = 0
print(sections[i])


# In[13]:

# The result of this first step should not contain any markup and should ignore embedded images, no matter whether the images are embedded via Markdown or HTML tags.
prompt = f"""
Below, a text delimited by `'` is provided to you. The text is a snippet from a blog post written in a mix of Markdown and HTML markup.
As a first step, extract the pure text. In this first step, keep the markup for ordered or unordered lists, but pay close attention to remove all other markup and especially ignore embedded images, no matter whether the images are embedded via Markdown or HTML tags.
As a second step, use the output of the first step and ensure that newlines are only used to separate sections and at the end of enumeration items of an ordered or unordered list.
Provide as your response the output of the second step.
`'`{sections[i]}`'`
"""
response = get_completion(prompt)
print(response)


# In[14]:

prompt = f"""Proofread and correct the following section of a blog post.
Stay as close as possible to the original and only make modifications to correct grammar or spelling mistakes.
Text: ```{response}```"""
response2 = get_completion(prompt)
# display(Markdown(response))
print(response2)


# In[15]:

diff = Redlines(response, response2)
display(Markdown(diff.output_markdown))


# In[16]:

# prompt = f"""
# Revise the following blog post excerpt, maintaining its original length and style, while enhancing its appeal for a technical hobbyist audience.
# Improve the reading experience to be smoother and more engaging by making a minimal set of modifications to the original.
# Text: ```{response2}```
# """
# prompt = f"""
# Refine the following blog post excerpt with minimal alterations, preserving its original length and style, while enhancing its appeal for a discerning reader.
# Make limited yet impactful changes for a smooth and engaging reading experience.
# Text: ```{response2}```
# """
prompt = f"""
Below, a text delimited by triple backticks is provided to you. The text is a snippet from a blog post.
Walk through the blog post snippet paragraph by paragraph and make a few limited yet impactful changes for a smooth and engaging reading experience, targeting a technical hobbyist audience.
Text: ```{response2}```
"""
response3 = get_completion(prompt)
print(response3)


# In[17]:

# ------------------------------------------------------------------


# In[18]:

i = 1
print(sections[i])


# In[19]:

# The result of this first step should not contain any markup and should ignore embedded images, no matter whether the images are embedded via Markdown or HTML tags.
prompt = f"""
Below, a text delimited by `'` is provided to you. The text is a snippet from a blog post written in a mix of Markdown and HTML markup.
As a first step, extract the pure text.
In this first step, keep the markup for ordered or unordered lists, but pay close attention to remove all other markup and especially ignore embedded images, no matter whether the images are embedded via Markdown or HTML tags.
As a second step, use the output of the first step and ensure that newlines are only used to separate sections and at the end of enumeration items of an ordered or unordered list.
Provide as your response the output of the second step.
`'`{sections[i]}`'`
"""
response = get_completion(prompt)
print(response)


# In[20]:

prompt = f"""Proofread and correct the following section of a blog post.
Stay as close as possible to the original and only make modifications to correct grammar or spelling mistakes.
Text: ```{response}```"""
response2 = get_completion(prompt)
# display(Markdown(response))
print(response2)


# In[21]:

diff = Redlines(response, response2)
display(Markdown(diff.output_markdown))


# In[22]:

# prompt = f"""
# Revise the following blog post excerpt, maintaining its original length and style, while enhancing its appeal for a technical hobbyist audience.
# Improve the reading experience to be smoother and more engaging by making a minimal set of modifications to the original.
# Text: ```{response2}```
# """
# prompt = f"""
# Refine the following blog post excerpt with minimal alterations, preserving its original length and style, while enhancing its appeal for a discerning reader.
# Make limited yet impactful changes for a smooth and engaging reading experience.
# Text: ```{response2}```
# """
prompt = f"""
Below, a text delimited by triple backticks is provided to you. The text is a snippet from a blog post.
Walk through the blog post snippet paragraph by paragraph and make a few limited yet impactful changes for a smooth and engaging reading experience, targeting a technical hobbyist audience.
Text: ```{response2}```
"""
response3 = get_completion(prompt)
print(response3)


# In[23]:

# ------------------------------------------------------------------


# In[24]:

i = 4
print(sections[i])


# In[25]:

# The result of this first step should not contain any markup and should ignore embedded images, no matter whether the images are embedded via Markdown or HTML tags.
prompt = f"""
Below, a text delimited by `'` is provided to you. The text is a snippet from a blog post written in a mix of Markdown and HTML markup.
As a first step, extract the pure text. In this first step, keep the markup for ordered or unordered lists, but pay close attention to remove all other markup and especially ignore embedded images, no matter whether the images are embedded via Markdown or HTML tags.
As a second step, use the output of the first step and ensure that newlines are only used to separate sections and at the end of enumeration items of an ordered or unordered list.
Provide as your response the output of the second step.
`'`{sections[i]}`'`
"""
response = get_completion(prompt)
print(response)


# In[26]:

prompt = f"""Proofread and correct the following section of a blog post.
Stay as close as possible to the original and only make modifications to correct grammar or spelling mistakes.
Text: ```{response}```"""
response2 = get_completion(prompt)
# display(Markdown(response))
print(response2)


# In[27]:

diff = Redlines(response, response2)
display(Markdown(diff.output_markdown))


# In[28]:

# prompt = f"""
# Revise the following blog post excerpt, maintaining its original length and style, while enhancing its appeal for a technical hobbyist audience.
# Improve the reading experience to be smoother and more engaging by making a minimal set of modifications to the original.
# Text: ```{response2}```
# """
# prompt = f"""
# Refine the following blog post excerpt with minimal alterations, preserving its original length and style, while enhancing its appeal for a discerning reader.
# Make limited yet impactful changes for a smooth and engaging reading experience.
# Text: ```{response2}```
# """
prompt = f"""
Below, a text delimited by triple backticks is provided to you. The text is a snippet from a blog post.
Walk through the blog post snippet paragraph by paragraph and make a few limited yet impactful changes for a smooth and engaging reading experience, targeting a technical hobbyist audience.
Text: ```{response2}```
"""
response3 = get_completion(prompt)
print(response3)


# In[ ]:
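# The cell groups above repeat the same three-prompt pipeline (extract, proofread,
# refine) for the individual sections i = 0, 1 and 4. As a sketch, that repetition
# could be factored into a single helper. This is a hypothetical refactoring, not
# part of the original notebook: the prompts are abbreviated paraphrases of the
# ones above, and the `complete` callable is injected (pass get_completion to run
# it against the real API) so the control flow can be exercised without an API key.

```python
def refine_sections(sections, complete):
    """Run the three-prompt pipeline over every section.

    complete: any prompt -> text callable, e.g. the notebook's get_completion.
    Returns one dict per section with the intermediate results.
    """
    results = []
    for section in sections:
        # Step 1: strip markup down to pure text (prompt abbreviated here).
        extracted = complete("Extract the pure text from the following blog post "
                             f"snippet, removing all markup:\n`'`{section}`'`")
        # Step 2: conservative proofread, grammar and spelling only.
        proofread = complete("Proofread and correct the following section of a "
                             "blog post. Stay as close as possible to the "
                             f"original:\nText: ```{extracted}```")
        # Step 3: a few limited yet impactful stylistic improvements.
        improved = complete("Make a few limited yet impactful changes for a "
                            "smooth and engaging reading experience:\n"
                            f"Text: ```{proofread}```")
        results.append({'section': section, 'extracted': extracted,
                        'proofread': proofread, 'improved': improved})
    return results

# Smoke-test the control flow with an identity stub instead of the API:
identity = lambda prompt: prompt
out = refine_sections(['# Header 1\n\nSome text under header 1.'], identity)
print(len(out), sorted(out[0].keys()))
```

# With the stub, each intermediate value is just the prompt that produced it;
# swapping in get_completion turns the same loop into the real pipeline and
# leaves the diffing (Redlines) and display steps to the caller.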