Visualizing embeddings in Kangas

In this Jupyter Notebook, we construct a Kangas DataGrid containing the data and the embeddings projected into 2 dimensions.

What is Kangas?

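Kangas is an open source, mixed-media, dataframe-like tool for data scientists. It was developed by Comet, a company that aims to reduce the friction of moving models into production.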

1. Setup

To get started, we install kangas with pip and import it.

In [1]:

%pip install kangas --quiet

In [2]:

import kangas as kg

2. Constructing a Kangas DataGrid

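We create a Kangas DataGrid from the original data and the embeddings. The data consists of rows of reviews, and each embedding consists of 1536 floating-point values. In this example, we fetch the data directly from GitHub, in case you are not running this notebook inside OpenAI's repository.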

We use Kangas to read the CSV file into a DataGrid for further processing.

In [3]:

data = kg.read_csv("https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv")

Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...

1001it [00:00, 2412.90it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]

We can review the fields of the CSV file:

In [4]:

data.info()

DataGrid (in memory)
    Name   : fine_food_reviews_with_embeddings_1k
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type       
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER             
2   ProductId                      1,000 TEXT                
3   UserId                         1,000 TEXT                
4   Score                          1,000 INTEGER             
5   Summary                        1,000 TEXT                
6   Text                           1,000 TEXT                
7   combined                       1,000 TEXT                
8   n_tokens                       1,000 INTEGER             
9   embedding                      1,000 TEXT

And take a look at the first and last rows:

In [5]:

data

Out[5]:

row-id	Column 1	ProductId	UserId	Score	Summary	Text	combined	n_tokens	embedding
1	0	B003XPF9BO	A3R7JR3FMEBXQB	5	where does one	Wanted to save	Title: where do	52	[0.007018072064
2	297	B003VXHGPK	A21VWSCGW7UUAR	4	Good, but not W	Honestly, I hav	Title: Good, bu	178	[-0.00314055196
3	296	B008JKTTUA	A34XBAIFT02B60	1	Should advertis	First, these sh	Title: Should a	78	[-0.01757248118
4	295	B000LKTTTW	A14MQ40CCU8B13	5	Best tomato sou	I have a hard t	Title: Best tom	111	[-0.00139322795
5	294	B001D09KAM	A34XBAIFT02B60	1	Should advertis	First, these sh	Title: Should a	78	[-0.01757248118
...
996	623	B0000CFXYA	A3GS4GWPIBV0NT	1	Strange inflamm	Truthfully wasn	Title: Strange	110	[0.000110913533
997	624	B0001BH5YM	A1BZ3HMAKK0NC	5	My favorite and	You've just got	Title: My favor	80	[-0.02086931467
998	625	B0009ET7TC	A2FSDQY5AI6TNX	5	My furbabies LO	Shake the conta	Title: My furba	47	[-0.00974910240
999	619	B007PA32L2	A15FF2P7RPKH6G	5	got this for th	all i have hear	Title: got this	50	[-0.00521062919
1000	999	B001EQ5GEO	A3VYU0VO6DYV6I	5	I love Maui Cof	My first experi	Title: I love M	118	[-0.00605782261
[1000 rows x 9 columns]

* Use DataGrid.save() to save to disk
** Use DataGrid.show() to start user interface

Now, we create a new DataGrid, converting each list of numbers into an Embedding:

In [8]:

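import ast  # parses the string form of a list of numbers into an actual list

dg = kg.DataGrid(
    name="openai_embeddings",
    columns=data.get_columns(),
    converters={"Score": str},  # store Score as text rather than integer
)
for row in data:
    # Columns are 0-indexed: row[3] is Score, row[4] is Summary, row[8] is embedding.
    embedding = ast.literal_eval(row[8])
    row[8] = kg.Embedding(
        embedding,
        name=str(row[3]),
        text="%s - %.10s" % (row[3], row[4]),
        projection="umap",
    )
    dg.append(row)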
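
The conversion relies on two small standard-library steps: `ast.literal_eval` turns the text stored in the CSV into a list of numbers, and the `%.10s` format truncates the summary to ten characters for the label. Here is a minimal sketch, assuming a shortened, made-up three-value vector (the real embeddings have 1536 values). `parse_embedding` and `make_label` are hypothetical helper names for illustration, not part of Kangas:

```python
import ast


def parse_embedding(text):
    """Parse a string such as "[0.1, -0.2]" into a list of floats."""
    values = ast.literal_eval(text)
    if not isinstance(values, list):
        raise ValueError("expected a list literal")
    return [float(v) for v in values]


def make_label(score, summary):
    """Build a label from the score and the first 10 characters of the summary."""
    return "%s - %.10s" % (score, summary)


vector = parse_embedding("[0.25, -0.5, 1.0]")  # illustrative values only
label = make_label(5, "where does one")
print(vector, label)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.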

The new DataGrid now has an embedding column with the proper datatype.

In [9]:

dg.info()

DataGrid (in memory)
    Name   : openai_embeddings
    Rows   : 1,000
    Columns: 9
#   Column                Non-Null Count DataGrid Type       
--- -------------------- --------------- --------------------
1   Column 1                       1,000 INTEGER             
2   ProductId                      1,000 TEXT                
3   UserId                         1,000 TEXT                
4   Score                          1,000 TEXT                
5   Summary                        1,000 TEXT                
6   Text                           1,000 TEXT                
7   combined                       1,000 TEXT                
8   n_tokens                       1,000 INTEGER             
9   embedding                      1,000 EMBEDDING-ASSET
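
Note that `Score` changed from INTEGER to TEXT, which follows from `converters={"Score": str}`. Conceptually, a converters mapping applies a callable to every value of the named column. The sketch below illustrates that idea in plain Python; it is not Kangas's actual implementation, and `apply_converters` is a hypothetical helper:

```python
def apply_converters(columns, row, converters):
    """Return a copy of row with each named column passed through its converter."""
    converted = []
    for name, value in zip(columns, row):
        convert = converters.get(name)
        converted.append(convert(value) if convert else value)
    return converted


columns = ["ProductId", "Score", "n_tokens"]
row = ["B003XPF9BO", 5, 52]  # values taken from the first row shown earlier
result = apply_converters(columns, row, {"Score": str})
print(result)  # Score becomes a string; the other columns are unchanged
```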

We simply save the DataGrid, and we're done.

In [ ]:

dg.save()

3. Render 2D Projections

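To render the data directly in the notebook, simply show it. Note that each row contains an embedding projection.

Scroll to the far right to see the embedding projection for each row.

The color of each point in the projection space represents the Score.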

In [11]:

dg.show()
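
Kangas computes the 2D coordinates itself (here with `projection="umap"`). To make the idea of mapping a high-dimensional vector to a 2D point concrete, here is a minimal sketch that uses a random linear projection as a simple stand-in. It is not UMAP and not what Kangas does internally; `random_projection_2d` is a hypothetical helper, and the input vectors are made up:

```python
import random


def random_projection_2d(vectors, seed=0):
    """Map each n-dimensional vector to (x, y) using two fixed random directions."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    axes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
    return [
        tuple(sum(a * v for a, v in zip(axis, vec)) for axis in axes)
        for vec in vectors
    ]


# Three made-up 1536-dimensional vectors standing in for real embeddings.
rng = random.Random(42)
vectors = [[rng.uniform(-1, 1) for _ in range(1536)] for _ in range(3)]
points = random_projection_2d(vectors)
print(points)
```

A linear projection like this keeps the example short; UMAP is nonlinear and generally preserves neighborhood structure far better.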

Group by "Score" to see the rows of each group.

In [22]:

dg.show(group="Score", sort="Score", rows=5, select="Score,embedding")
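
The call above groups the rows by `Score` and sorts the result by the same column. As a rough plain-Python analogy of grouping and sorting (not how Kangas implements it), using the scores and truncated summaries of the first five rows displayed earlier:

```python
from collections import defaultdict

# (Score, Summary) pairs copied from the first five rows of the DataGrid display.
rows = [
    ("5", "where does one"),
    ("4", "Good, but not W"),
    ("1", "Should advertis"),
    ("5", "Best tomato sou"),
    ("1", "Should advertis"),
]

groups = defaultdict(list)
for score, summary in rows:
    groups[score].append(summary)

# Sort by the group key, mirroring sort="Score".
grouped = sorted(groups.items())
for score, summaries in grouped:
    print(score, len(summaries), summaries)
```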

An example of this DataGrid is hosted here: https://kangas.comet.com/?datagrid=/data/openai_embeddings.datagrid