In this Jupyter Notebook, we build a Kangas DataGrid containing the data and the embeddings projected into 2-dimensional space.
To get started, we install kangas with pip and import it.
%pip install kangas --quiet
import kangas as kg
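We create a Kangas DataGrid from the original data and the embeddings. The data consists of rows of reviews, and each embedding consists of 1536 floating-point values. In this example, we fetch the data directly from GitHub, in case you aren't running this notebook inside OpenAI's repository.
We use Kangas to read the CSV file into a DataGrid for further processing.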
data = kg.read_csv("https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv")
Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...
1001it [00:00, 2412.90it/s]
100%|██████████| 1000/1000 [00:00<00:00, 2899.16it/s]
We can review the fields of the CSV file:
data.info()
DataGrid (in memory)
    Name   : fine_food_reviews_with_embeddings_1k
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER
2   ProductId                      1,000 TEXT
3   UserId                         1,000 TEXT
4   Score                          1,000 INTEGER
5   Summary                        1,000 TEXT
6   Text                           1,000 TEXT
7   combined                       1,000 TEXT
8   n_tokens                       1,000 INTEGER
9   embedding                      1,000 TEXT
And take a look at the first and last rows:
data
row-id | Column 1 | ProductId | UserId | Score | Summary | Text | combined | n_tokens | embedding |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one | Wanted to save | Title: where do | 52 | [0.007018072064 |
2 | 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not W | Honestly, I hav | Title: Good, bu | 178 | [-0.00314055196 |
3 | 296 | B008JKTTUA | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
4 | 295 | B000LKTTTW | A14MQ40CCU8B13 | 5 | Best tomato sou | I have a hard t | Title: Best tom | 111 | [-0.00139322795 |
5 | 294 | B001D09KAM | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
... |
996 | 623 | B0000CFXYA | A3GS4GWPIBV0NT | 1 | Strange inflamm | Truthfully wasn | Title: Strange | 110 | [0.000110913533 |
997 | 624 | B0001BH5YM | A1BZ3HMAKK0NC | 5 | My favorite and | You've just got | Title: My favor | 80 | [-0.02086931467 |
998 | 625 | B0009ET7TC | A2FSDQY5AI6TNX | 5 | My furbabies LO | Shake the conta | Title: My furba | 47 | [-0.00974910240 |
999 | 619 | B007PA32L2 | A15FF2P7RPKH6G | 5 | got this for th | all i have hear | Title: got this | 50 | [-0.00521062919 |
1000 | 999 | B001EQ5GEO | A3VYU0VO6DYV6I | 5 | I love Maui Cof | My first experi | Title: I love M | 118 | [-0.00605782261 |
[1000 rows x 9 columns]
* Use DataGrid.save() to save to disk
** Use DataGrid.show() to start user interface
Now, we create a new DataGrid, converting the stringified lists of numbers into Embeddings:
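import ast  # to convert the string form of a list of numbers into a list of numbers

dg = kg.DataGrid(
    name="openai_embeddings",
    columns=data.get_columns(),
    converters={"Score": str},  # store Score as text rather than integer
)
for row in data:
    # Column 8 holds the embedding as a string; parse it into a list of floats
    embedding = ast.literal_eval(row[8])
    row[8] = kg.Embedding(
        embedding,
        name=str(row[3]),  # column 3 is Score
        text="%s - %.10s" % (row[3], row[4]),  # Score plus first 10 chars of Summary
        projection="umap",
    )
    dg.append(row)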
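The loop above leans on two pieces of plain Python that can be checked without Kangas: `ast.literal_eval` turns the stringified list into real floats, and the `%.10s` format keeps only the first ten characters of the Summary. The following is a minimal sketch using a made-up row; the IDs and text are hypothetical, and the embedding is shortened to three values (a real one has 1536). Indices 3, 4, and 8 correspond to Score, Summary, and embedding in the column listing.

```python
import ast

# Hypothetical row laid out like the CSV columns (values are made up).
sample_row = [
    0,                          # Column 1
    "P0001",                    # ProductId
    "U0001",                    # UserId
    5,                          # Score
    "where does one start",     # Summary
    "Some review text",         # Text
    "Title: where does one start; Content: Some review text",  # combined
    52,                         # n_tokens
    "[0.0070, -0.0031, 0.0176]",  # embedding, stored as a string
]

# Parse the string form of the list into a list of floats.
embedding = ast.literal_eval(sample_row[8])

# Build the same label text that the loop passes to kg.Embedding.
label = "%s - %.10s" % (sample_row[3], sample_row[4])

print(type(embedding).__name__, len(embedding))
print(label)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for text read from a file.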
The new DataGrid now has an Embedding column with the proper datatype.
dg.info()
DataGrid (in memory)
    Name   : openai_embeddings
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER
2   ProductId                      1,000 TEXT
3   UserId                         1,000 TEXT
4   Score                          1,000 TEXT
5   Summary                        1,000 TEXT
6   Text                           1,000 TEXT
7   combined                       1,000 TEXT
8   n_tokens                       1,000 INTEGER
9   embedding                      1,000 EMBEDDING-ASSET
We simply save the DataGrid, and we're done.
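Note that Score is listed as TEXT here rather than INTEGER, which follows from `converters={"Score": str}`. As a rough mental model only (this is not Kangas's implementation, and `apply_converters` is a hypothetical helper), a converter mapping applies a function to the value of one named column as each row is added:

```python
def apply_converters(columns, row, converters):
    """Return a copy of row with each converter applied to its named column."""
    converted = []
    for name, value in zip(columns, row):
        # Columns without a converter pass through unchanged.
        func = converters.get(name)
        converted.append(func(value) if func is not None else value)
    return converted


columns = ["Score", "n_tokens"]
converted = apply_converters(columns, [5, 52], {"Score": str})
print(converted)
```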
dg.save()
dg.show()
Group by "Score" to see the rows for each group.
dg.show(group="Score", sort="Score", rows=5, select="Score,embedding")
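Conceptually, grouping by Score collects the rows that share a score value and orders the groups by that value. The plain-Python sketch below illustrates the idea on made-up rows; it is not how Kangas implements `group`, and the sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; Score is text, as in the converted DataGrid.
rows = [
    {"Score": "5", "Summary": "great"},
    {"Score": "1", "Summary": "bad"},
    {"Score": "5", "Summary": "tasty"},
]

# Collect the summaries that share each score value.
groups = defaultdict(list)
for row in rows:
    groups[row["Score"]].append(row["Summary"])

# Walk the groups in sorted order of the grouping column.
for score in sorted(groups):
    print(score, groups[score])
```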