To try this example, go to Cell -> Run all.
Report problems with this example on GitHub Issues
Run this command first to install Jina 2.0 for this notebook:
!pip install jina
This notebook explains code adapted from the 38-Line Get Started.
The demo indexes every line of its own source code, then searches for the line most similar to "@request(on=something)". No other library and no external dataset are required: the dataset is the codebase.
For this demo, we only need to import numpy and jina:
import numpy as np
from jina import Document, DocumentArray, Executor, Flow, requests
To embed every line of the code, we represent it as a vector using a simple character embedding with mean-pooling.
The character embedding is a simple identity matrix, so each character maps to a one-hot vector.
To do that, we need to write a new Executor:
class CharEmbed(Executor):  # a simple character embedding with mean-pooling
    offset = 32  # ASCII code of the space character, the first printable character
    dim = 127 - offset + 1  # last pos reserved for `UNK`
    char_embd = np.eye(dim) * 1  # one-hot embedding for all chars

    @requests
    def foo(self, docs: DocumentArray, **kwargs):
        for d in docs:
            # map each character to its slot; anything outside the range goes to the last (`UNK`) slot
            r_emb = [ord(c) - self.offset if self.offset <= ord(c) <= 127 else (self.dim - 1) for c in d.text]
            d.embedding = self.char_embd[r_emb, :].mean(axis=0)  # mean-pooling
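To see what the mean-pooled one-hot embedding amounts to, the same arithmetic can be run outside Jina. The sketch below is only an illustration: `char_embed` and the upper-case constants are hypothetical names, not part of Jina, and the logic simply mirrors the list comprehension in `CharEmbed`. Because every row of the identity matrix is one-hot, the mean over rows is a normalized character histogram.

```python
import numpy as np

OFFSET = 32              # ASCII code of the space character
DIM = 127 - OFFSET + 1   # 96 slots; the last one also collects out-of-range characters
CHAR_EMBD = np.eye(DIM)  # row i is the one-hot vector for slot i


def char_embed(text: str) -> np.ndarray:
    """Mean-pool one-hot character vectors (hypothetical helper, not a Jina API)."""
    idx = [ord(c) - OFFSET if OFFSET <= ord(c) <= 127 else DIM - 1 for c in text]
    return CHAR_EMBD[idx, :].mean(axis=0)


vec = char_embed("aab")
print(vec.shape)               # one entry per character slot
print(vec[ord("a") - OFFSET])  # share of 'a' in the string
print(vec[ord("b") - OFFSET])  # share of 'b' in the string
```

Two lines that use the same characters in similar proportions therefore get nearby vectors, regardless of character order.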
To store and retrieve the encoded results, we need an indexer. At index time, it stores the DocumentArray in memory. At query time, it computes the Euclidean distance between the embeddings of the query Documents and the embeddings of all stored Documents.
Indexing and searching are handled by the methods decorated with @requests(on='/index') and @requests(on='/search'), respectively.
class Indexer(Executor):
    _docs = DocumentArray()  # for storing all documents in memory

    @requests(on='/index')
    def foo(self, docs: DocumentArray, **kwargs):
        self._docs.extend(docs)  # extend stored `docs`

    @requests(on='/search')
    def bar(self, docs: DocumentArray, **kwargs):
        docs.match(self._docs, metric='euclidean', limit=20)
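The query-time step can be pictured as a pairwise Euclidean distance computation followed by a sort. The sketch below is a plain NumPy illustration of that idea, using the same broadcasting trick as the hand-written indexer inside `source_code`; `top_k_euclidean` is a hypothetical helper, and nothing here is meant to describe how `match` is implemented internally.

```python
import numpy as np


def top_k_euclidean(queries: np.ndarray, stored: np.ndarray, k: int = 20):
    """Return (indices, distances) of the k nearest stored rows for each query row.

    Hypothetical helper for illustration only; not a Jina API.
    """
    # Broadcasting: (Q, 1, D) - (1, N, D) -> (Q, N, D); the norm over D gives (Q, N)
    dist = np.linalg.norm(queries[:, None, :] - stored[None, :, :], axis=-1)
    order = np.argsort(dist, axis=1)[:, :k]  # nearest first
    return order, np.take_along_axis(dist, order, axis=1)


stored = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
queries = np.array([[0.0, 0.0]])
idx, d = top_k_euclidean(queries, stored, k=2)
print(idx)  # [[0 2]]
print(d)    # [[0. 1.]]
```

This brute-force comparison is linear in the number of stored rows, which is fine for a few dozen lines of source code.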
The callback function is invoked when the search is done.
def print_matches(req):  # the callback function invoked when task is done
    for idx, d in enumerate(req.docs[0].matches[:3]):  # print top-3 matches
        print(f'[{idx}]{d.scores["euclid"].value:2f}: "{d.text}"')
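A small note on the format specifier used in the callback: `:2f` sets a minimum field width of 2 and keeps the default precision of six decimals, whereas `.2f` would round to two decimals. The snippet below is a standalone illustration with a made-up score value.

```python
score = 0.123462  # made-up value for illustration

print(f"{score:2f}")   # width 2, default six decimals -> 0.123462
print(f"{score:.2f}")  # precision 2 -> 0.12
```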
f = Flow(port_expose=12345).add(uses=CharEmbed, parallel=2).add(uses=Indexer) # build a Flow, with 2 parallel CharEmbed, tho unnecessary
source_code = """
import numpy as np
from jina import Document, DocumentArray, Executor, Flow, requests
class CharEmbed(Executor): # a simple character embedding with mean-pooling
offset = 32 # letter `a`
dim = 127 - offset + 1 # last pos reserved for `UNK`
char_embd = np.eye(dim) * 1 # one-hot embedding for all chars
@requests
def foo(self, docs: DocumentArray, **kwargs):
for d in docs:
r_emb = [ord(c) - self.offset if self.offset <= ord(c) <= 127 else (self.dim - 1) for c in d.text]
d.embedding = self.char_embd[r_emb, :].mean(axis=0) # average pooling
class Indexer(Executor):
_docs = DocumentArray() # for storing all documents in memory
@requests(on='/index')
def foo(self, docs: DocumentArray, **kwargs):
self._docs.extend(docs) # extend stored `docs`
@requests(on='/search')
def bar(self, docs: DocumentArray, **kwargs):
q = np.stack(docs.get_attributes('embedding')) # get all embeddings from query docs
d = np.stack(self._docs.get_attributes('embedding')) # get all embeddings from stored docs
euclidean_dist = np.linalg.norm(q[:, None, :] - d[None, :, :], axis=-1) # pairwise euclidean distance
for dist, query in zip(euclidean_dist, docs): # add & sort match
query.matches = [Document(self._docs[int(idx)], copy=True, scores={'euclid': d}) for idx, d in enumerate(dist)]
query.matches.sort(key=lambda m: m.scores['euclid'].value) # sort matches by their values
f = Flow(port_expose=12345, protocol='http', cors=True).add(uses=CharEmbed, parallel=2).add(uses=Indexer) # build a Flow, with 2 parallel CharEmbed, tho unnecessary
with f:
f.post('/index', DocumentArray([Document(text=t.strip()) for t in source_code.split('\n') if t.strip() ])) # index all lines of this notebook's source code
f.post('/search', Document(text='@request(on=something)'), on_done=print_matches)
"""
with f:
    f.post('/index', DocumentArray([Document(text=t.strip()) for t in source_code.split('\n') if t.strip()]))  # index all lines of this notebook's source code
    f.post('/search', Document(text='@request(on=something)'), on_done=print_matches)
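The indexing call turns the multi-line string into one Document per non-empty line. The sketch below shows only that preprocessing step on a short made-up string, without Jina; `sample` is a hypothetical stand-in for `source_code`.

```python
sample = """
import numpy as np

class Indexer(Executor):
    _docs = DocumentArray()
"""

# split on newlines, strip surrounding whitespace, and drop empty lines
lines = [t.strip() for t in sample.split('\n') if t.strip()]
print(lines)
```

Because each line is stripped before indexing, the indentation inside the string has no effect on the embeddings.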
It finds the lines most similar to "@request(on=something)" in the code snippet and prints the following:
[0]0.123462: "f.post('/search', Document(text='@request(on=something)'), on_done=print_matches)"
[1]0.157459: "@requests(on='/index')"
[2]0.171835: "@requests(on='/search')"
Need help understanding Jina? Ask the friendly Jina community on Slack (usual response time: 1 hr).