#!/usr/bin/env python
# coding: utf-8

# # STAM Tutorial: Standoff Text Annotation for Pythonistas
#
# ## Introduction
#
# [STAM](https://github.com/annotation/stam) is a data model, and accompanying
# tooling, for stand-off text annotation that allows researchers and developers
# to model annotations on text.
#
# An *annotation* is any kind of remark or classification/tagging on any
# particular portion(s) of a text, on a resource or annotation set as a whole
# (in which case we can interpret annotations as *metadata*), or on another
# annotation (*higher-order annotation*).
#
# Examples of annotation may be linguistic annotation, structure/layout
# annotation, editorial annotation, technical annotation, or whatever comes to
# mind. STAM does not define any vocabularies whatsoever. Instead, it provides a
# framework upon which you can model your annotations using whatever vocabulary
# you see fit.
#
# The model is thoroughly explained [in its specification
# document](https://github.com/annotation/stam/blob/master/README.md). We
# summarize only the most important data structures here; these have direct
# counterparts (classes) in the Python library we will be teaching in this
# tutorial:
#
# * `Annotation` - An instance of annotation. Associated with an annotation is a
#   `Selector` to select the target of the annotation, and one or more
#   `AnnotationData` instances that hold the *body* or *content* of the
#   annotation. This is explicitly decoupled from the annotation instance itself,
#   as multiple annotations may hold the very same content.
# * `Selector` - A selector identifies the target of an annotation and the part of the target that the annotation applies to. There are multiple types, which are described [here](https://github.com/annotation/stam/blob/master/README.md#class-selector). The `TextSelector` is an important one that selects a target resource and a specific text selection within it by specifying an offset.
# * `AnnotationData` - A key/value pair that acts as *body* or *content* for one or more annotations. The key is a reference to a `DataKey`, the value is a `DataValue`. (The term *feature* is also seen for this in certain annotation paradigms.)
# * `DataKey` - A key as referenced by `AnnotationData`.
# * `DataValue` - A value with some type information (e.g. string, integer, float).
# * `TextResource` - A textual resource that is made available for annotation. This holds the actual textual content.
# * `TextSelection` - A particular selection of text within a resource, i.e. a subslice of the text.
# * `AnnotationDataSet` - An annotation data set stores the keys (`DataKey`) and
#   values (`AnnotationData`) that are used by annotations. It effectively
#   defines a certain vocabulary, i.e. key/value pairs. How broad or narrow the
#   scope of the vocabulary is, is not defined by STAM but is entirely up to the
#   user.
# * `AnnotationStore` - The annotation store is essentially your *workspace*; it holds all
#   resources, annotation sets (i.e. keys and annotation data) and of course the
#   actual annotations. In the Python implementation it is a memory-based store
#   and you can put as much into it as you like (as long as it fits in memory).
#
# STAM is more than just a theoretical model; we offer practical implementations
# that allow you to work with it directly. In this tutorial we will be using Python and
# the Python library `stam`.
#
# **Note**: The STAM Python library is a so-called Python binding to a STAM library
# written in Rust. This means the library is not written in Python but is
# compiled to machine code and as such offers much better performance.
#
# ## Installation
#
# First of all, you will need to install the STAM Python library from the
# [Python Package Index](https://pypi.org/project/stam/) as follows:

# In[1]:

get_ipython().system('pip install stam')


# ## Annotating from scratch
#
# ### Adding a text
#
# Let us start with a mini corpus consisting of two quotes from the book
# *"Consider Phlebas"* by renowned sci-fi author Iain M. Banks.

# In[2]:

text = """
# Consider Phlebas
$ author=Iain M. Banks

## 1

Everything about us, everything around us,
everything we know [and can know of] is composed ultimately of patterns of nothing;
that’s the bottom line, the final truth. So where we find we have any control
over those patterns, why not make the most elegant ones, the most enjoyable
and good ones, in our own terms?

## 2

Besides, it left the humans in the Culture free to take care of the things
that really mattered in life, such as [sports, games, romance,] studying dead
languages, barbarian societies and impossible problems, and climbing high
mountains without the aid of a safety harness.
"""


# The format of this text is in no way prescribed by STAM, other than:
#
# * It must be plain text
# * It must be UTF-8 encoded
# * It should ideally be in Unicode Normalization Form C (don't worry if this means nothing to you yet)
#
# Before we can do anything we need to import the STAM library:

# In[3]:

import stam


# Let's add this text resource to an annotation store so we can annotate it:

# In[4]:

store = stam.AnnotationStore(id="tutorial")
resource_banks = store.add_resource(id="banks", text=text)


# Here we passed the text as a string, but it could just as well have been an
# external text file instead, the filename of which can be passed via the `file=`
# keyword argument.
#
# ### Creating an annotation dataset (vocabulary)
#
# Our example text is a bit Markdown-like: we have a title header *"Consider Phlebas"*, and
# two subheaders (*1* and *2*), each containing one quote from the book.
#
# As our first annotations, let's try to annotate this coarse structure. At this
# point we're already in need of some vocabulary to express the notions of *title
# header*, *section header* and *quote*, as STAM does not define any vocabulary.
# It is up to you to make these choices on how to represent the data.
#
# An annotation data set effectively defines a vocabulary. Let's invent our own
# simple annotation data set that defines the keys and values we use in this
# tutorial. In our `AnnotationDataSet` we can define a `DataKey` with ID
# `structuretype`, and have it take values like `titleheader`, `sectionheader`
# and `quote`.
#
# We can explicitly add the set and the key. We give the dataset a public ID
# (*tutorial-set*), just as we previously assigned a public ID to both the
# annotation store (*tutorial*) and the text resource (*banks*). It is good
# practice to assign IDs, though you can also let the library auto-generate them
# for you:

# In[5]:

dataset = store.add_dataset("tutorial-set")
key_structuretype = dataset.add_key("structuretype")
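# If you already know more of your vocabulary in advance, you can register
# additional keys upfront in the same way. A small sketch — the `author` key
# below is purely illustrative and not used further in this tutorial:

# In[ ]:

# Illustrative only: keys can be registered upfront and retrieved later by ID
key_author = dataset.add_key("author")
print("Key ID:", key_author.id())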
# ### The first annotations with text selectors
#
# To annotate the title header, we need to select the part of the text where it
# occurs by finding the offset, which consists of a *begin* and an *end* position. STAM
# follows the same indexing conventions Python does: positions are 0-indexed
# *unicode character points* (as opposed to (UTF-8) bytes) and the end is
# non-inclusive. After some clumsy manual counting on the source text we discover
# that the following coordinates hold:

# In[6]:

assert text[1:19] == "# Consider Phlebas"


# And we make the annotation:

# In[7]:

annotation = store.annotate(
    target=stam.Selector.textselector(resource_banks, stam.Offset.simple(1,19)),
    data={"id": "Data1", "key": key_structuretype, "value": "titleheader", "set": dataset },
    id="Annotation1")


# A fair amount happened there. We selected a part of the text of
# `resource_banks` by offset, and associated `AnnotationData` with the annotation
# saying that the `structuretype` key has the value `titleheader`, both of which
# we invented as part of our `AnnotationDataSet` with ID `tutorial-set`. Lastly, we
# assigned an ID to both the `AnnotationData` and the `Annotation` as
# a whole. In this example we reused some of the variables we had created
# earlier, but we could also have written it out in full as shown below:
#
# ```
# annotation = store.annotate(
#     target=stam.Selector.textselector(resource_banks, stam.Offset.simple(1,19)),
#     data={"id": "Data1", "key": "structuretype", "value": "titleheader", "set": "tutorial-set" },
#     id="Annotation1")
# ```
#
# This would also have been perfectly fine, and moreover, it would even have worked
# without us explicitly creating the `AnnotationDataSet` and the key as we did
# before! Those would have been created on-the-fly for us automatically. The
# only disadvantage is that under the hood more lookups are needed, so this is
# slightly less performant than passing Python variables.
#
# ### Inspecting data (1)
#
# We can inspect the annotation we just added:

# In[8]:

print("Annotation ID: ", annotation.id())
print("Target text: ", str(annotation))
print("Data: ")
for data in annotation.data():
    print(" - Data ID: ", data.id())
    print("   Data Key: ", data.key().id())
    print("   Data Value: ", str(data.value()))


# In the above example, we obtained an `Annotation` instance from the return value
# of the `annotate()` method. Once any annotation is in the store, we can retrieve
# it simply by its public ID using the `annotation()` method. An exception will be
# raised if the ID does not exist.

# In[9]:

annotation = store.annotation("Annotation1")


# A similar pattern holds for almost all other data structures in the STAM model:

# In[10]:

dataset = store.dataset("tutorial-set")            #AnnotationDataSet
resource_banks = store.resource("banks")           #TextResource
key_structuretype = dataset.key("structuretype")   #DataKey
data = dataset.annotationdata("Data1")             #AnnotationData


# There are also shortcut methods available to get keys and data directly from a
# store, without needing to first retrieve a dataset yourself:

# In[11]:

key_structuretype = store.key("tutorial-set","structuretype")   #DataKey
data = store.annotationdata("tutorial-set","Data1")             #AnnotationData
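# Since all of these lookup methods raise an exception for an unknown public ID,
# a defensive lookup looks as follows. This is a minimal sketch; we catch the
# generic `Exception` because the exact exception class is an implementation
# detail of the library:

# In[ ]:

# Unknown public IDs raise an exception rather than returning None:
try:
    store.annotation("NoSuchAnnotation")
except Exception as e:
    print(f"Lookup failed as expected: {e}")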
# ### Annotating via `find_text()`
#
# We now continue by adding annotations for the two section headers. Counting offsets
# manually is rather cumbersome, so we use the `find_text()` method on `TextResource`
# to find our target for annotation:

# In[12]:

results = resource_banks.find_text("## 1")
section1 = results[0]
print(f"Text {str(section1)} found at {section1.begin()}:{section1.end()}")

annotation = store.annotate(
    target=stam.Selector.textselector(resource_banks, section1.offset()),
    data={"id": "Data2", "key": "structuretype", "value": "sectionheader", "set": "tutorial-set" },
    id="Annotation2")


# The `find_text()` method returns a list of `TextSelection` instances. These
# carry an `Offset`, which is returned by the `offset()` method. Hooray, no more
# manual counting!
#
# We do the same for the last header:

# In[13]:

results = resource_banks.find_text("## 2")
section2 = results[0]
print(f"Text {str(section2)} found at {section2.begin()}:{section2.end()}")

annotation = store.annotate(
    target=stam.Selector.textselector(resource_banks, section2.offset()),
    data={"id": "Data2", "key": "structuretype", "value": "sectionheader", "set": "tutorial-set" },
    id="Annotation3")


# ### Inspecting data (2)
#
# In the previous code the attentive reader may have noted that we are reusing the `Data2` ID
# rather than introducing a new `Data3` ID, because the data for both
# `Annotation2` and `Annotation3` is, in fact, identical.
#
# This is an important feature of STAM: annotations and their data are
# decoupled precisely because the data may be referenced by multiple annotations, and
# if that's the case, we only want to keep the data in memory once. We don't want
# a copy for every annotation. Say we have `AnnotationData` with key
# `structuretype` and value `word`, and use that to tag all words in the
# text; there would be a huge amount of redundancy if there were no such
# decoupling between data and annotations. The fact that they all share the same data also
# enables us to quickly look up all those annotations via a *reverse index* that is kept internally:

# In[14]:

for annotationdata in store.data(set="tutorial-set", key="structuretype", value="sectionheader"):
    for annotation in annotationdata.annotations():
        assert annotation.id() in ("Annotation2","Annotation3")


# This can also be done in one go, which is typically more performant:

# In[15]:

for annotation in store.data(set="tutorial-set", key="structuretype", value="sectionheader").annotations():
    assert annotation.id() in ("Annotation2","Annotation3")


# Here we used `data()` on the store as a whole; this method provides an easy way
# to retrieve data from scratch.
# We could also have started from an annotation dataset, or even from a key within
# it if we already have an instance of it. In that case we use the `data()` method
# and pass the key (`DataKey`), which will act as a filter:

# In[16]:

key = dataset.key("structuretype")
for annotation in dataset.data(key, value="sectionheader").annotations():
    assert annotation.id() in ("Annotation2","Annotation3")


# However, since we have the key already, it is simpler and more performant to use
# it directly and reduce the example to the following:

# In[17]:

key = dataset.key("structuretype")
for annotation in key.data(value="sectionheader").annotations():
    assert annotation.id() in ("Annotation2","Annotation3")


# The ability to use any STAM object as a departing point for the retrieval of other
# objects is a characteristic of the API. The ability to pass arbitrary objects
# as a filter is also a characteristic that you will find on multiple methods.
#
# The `data()` method can also be used to search for all values indiscriminately:
# simply omit the `value` keyword parameter. Moreover, it can be used to search
# for non-exact values, using the following keyword arguments:
#
# * `value_not` - Negates a value
# * `value_greater` - Value must be greater than specified (int or float)
# * `value_less` - Value must be less than specified (int or float)
# * `value_greatereq` - Value must be greater than or equal to specified (int or float)
# * `value_lesseq` - Value must be less than or equal to specified (int or float)
# * `value_in` - Value must match any in the tuple (this is a logical OR statement)
# * `value_not_in` - Value must not match any in the tuple
# * `value_in_range` - Must be a numeric 2-tuple with min and max (inclusive) values
# * `value_not_in_range` - Must be a numeric 2-tuple with min and max (inclusive) values
#
# The `data()` method takes filter parameters as positional arguments. You can
# pass as many as you like. The object you pass as a filter determines what is
# being filtered: you can pass a `DataKey` instance, an `AnnotationData` instance,
# or even an `Annotation`. You can also pass the result of earlier data or annotation
# requests (`Data`, `Annotations`). If you want to filter against *one/any* of multiple
# values, use a tuple or list of any homogeneous type.
#
# Searching for data and then retrieving the corresponding annotations is a very
# common operation and easily accomplished by simply adding `.annotations()`, as
# we've seen in the above examples.
#
# We can apply data filtering operations directly to `annotations()` using the
# same keyword arguments we saw for `data()`. The following example provides
# results identical to the earlier one, but the way of getting there is
# slightly different (this takes all annotations first and tests the data filter
# on each; the other example takes the data first and goes over all annotations
# that make use of that data):
#
# ```
# key = dataset.key("structuretype")
# for annotation in store.annotations(key, value="sectionheader"):
#     assert annotation.id() in ("Annotation2","Annotation3")
# ```
#
# If you're interested in the underlying text selections, then you can just add
# `.textselections()`. This chaining of methods on collections is one of the
# characteristics of the STAM API.
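# Here is a short sketch of these keyword arguments applied to the data we have
# created so far (only string values exist at this point, so the numeric
# variants are not shown):

# In[ ]:

key = dataset.key("structuretype")

# Match any of multiple values (a logical OR), as described above:
for annotation in key.data(value_in=("titleheader", "sectionheader")).annotations():
    print(f"{annotation.id()}: {annotation}")

# Negation: all structuretype data except the title header:
for data in key.data(value_not="titleheader"):
    print(f"{data.key()}={data.value()}")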
# ### Annotations via text selections
#
# Now we will annotate the quotes themselves. The first one starts after the first
# subheader (Annotation2) and ends just before the next subheader (Annotation3).
# That would include some ugly leading and trailing whitespace/newlines, though.
# We use the `textselection()` method to obtain a text selection at our computed
# offset and subsequently strip the whitespace using the `strip_text()` method,
# effectively shrinking our text selection a bit:

# In[18]:

quote1_selection = resource_banks.textselection(stam.Offset.simple(section1.end(), section2.begin() - 1)).strip_text(" \t\r\n")
quote1 = store.annotate(
    target=stam.Selector.textselector(resource_banks, quote1_selection.offset()),
    data={"id": "Data3", "key": "structuretype", "value": "quote", "set": "tutorial-set" },
    id="AnnotationQuote1")


# The second quote goes until the end of the text, which we can retrieve using
# the `textlen()` method. This method is preferred over doing things in native
# Python like `len(str(resource_banks))` because it is far more efficient:

# In[19]:

quote2_selection = resource_banks.textselection(stam.Offset.simple(section2.end(), resource_banks.textlen())).strip_text(" \t\r\n")
quote2 = store.annotate(
    target=stam.Selector.textselector(resource_banks, quote2_selection.offset()),
    data={"id": "Data3", "set": "tutorial-set"},
    id="AnnotationQuote2")


# In this example we also show that, since we reference existing
# `AnnotationData`, just specifying the ID and the set suffices. Even shorter and
# better: you could pass a variable that is an instance of `AnnotationData`.
#
# There is another structural type we could annotate: the lines with
# corresponding line numbers. This is easy to do by splitting the text on
# newlines, for which we use the method `split_text()` on `TextResource`. As you
# see, various Python methods such as `split()`, `strip()` and `find()` have
# counterparts in STAM with a `*_text()` suffix, which return
# `TextSelection` instances and carry offset information:

# In[20]:

for linenr, line in enumerate(resource_banks.split_text("\n")):
    linenr += 1 #make it 1-indexed as is customary for line numbers
    print(f"Line {linenr}: {str(line)}")
    store.annotate(
        target=stam.Selector.textselector(resource_banks, line.offset()),
        data=[
            {"id": "Data4", "key": "structuretype", "value": "line", "set": "tutorial-set" },
            {"id": f"DataLine{linenr}", "key": "linenr", "value": linenr, "set": "tutorial-set" }
        ],
        id=f"AnnotationLine{linenr}")


# In this example we also extended our vocabulary on-the-fly with a new key
# `linenr`. All line annotations carry two `AnnotationData` elements. Remember
# that we can easily retrieve the data and any annotations on it with `data()`
# and `annotations()`:

# In[21]:

line8 = store.data(set="tutorial-set", key="linenr", value=8).annotations(limit=1)[0]
print(str(line8))


# Methods that return collections, such as `data()`, `annotations()` and
# `textselections()`, often take an optional `limit` parameter (sometimes as a
# keyword argument, sometimes as a normal parameter). This parameter limits the
# number of results returned. Using it can improve performance in certain cases.
# In the above example we know we're only going to use one result, so it is a
# good idea to set it. (Here we happen to know that there is only one result for
# `linenr` 8 anyway, so strictly speaking the parameter wouldn't be necessary,
# but we ignore that for the sake of teaching the use of `limit`.)
#
# When annotating, we don't have to work with the resource as a whole, but can
# also start relative from any text selection we have. Let's take line eight and
# annotate its first word (*"everything"*) manually:

# In[22]:

line8_textselection = line8.textselections(limit=1)[0] # there could be multiple, but in our case thus far we only have one
firstword = line8_textselection.textselection(stam.Offset.simple(0,10)) # we make a textselection on a textselection

# internally, the text selection will always use absolute coordinates for the resource:
print(f"Text selection spans: {firstword.begin()}:{firstword.end()}")

annotation = store.annotate(
    target=stam.Selector.textselector(resource_banks, firstword.offset()),
    data= {"key": "structuretype", "value": "word", "set": "tutorial-set" },
    id="AnnotationLine8Word1")
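# Relative operations work for searching too. As a sketch — assuming that
# `find_text()`, like its `*_text()` siblings `strip_text()` and `split_text()`,
# is also available on `TextSelection` — we can search within a text selection,
# with the resulting offsets again being absolute resource coordinates:

# In[ ]:

# Search within a text selection rather than within the whole resource (sketch):
for match in quote1_selection.find_text("everything"):
    print(f"Found \"{match}\" at {match.begin()}:{match.end()}")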
# ### Converting offsets
#
# We know the first word of line eight is also part of quote one, for which we
# already made an annotation (`AnnotationQuote1`) before.
# Say we are interested in knowing *where* in quote one the first word of line
# eight occurs; we can now easily compute that as follows:

# In[23]:

offset = firstword.relative_offset(quote1_selection)
print(f"Offset in quote one: {offset.begin()}:{offset.end()}")


# While we are at it, another conversion that may come in handy when working
# at a lower level is the conversion from/to UTF-8 byte offsets. Both STAM and
# Python use unicode character points. Internally, STAM already maps these to
# UTF-8 byte offsets for things like text slicing, but if you need this
# information you can extract it explicitly:

# In[24]:

beginbyte = resource_banks.utf8byte(firstword.begin())
endbyte = resource_banks.utf8byte(firstword.end())
print(f"Byte offset: {beginbyte}:{endbyte}")

#and back again:
beginpos = resource_banks.utf8byte_to_charpos(beginbyte)
endpos = resource_banks.utf8byte_to_charpos(endbyte)
assert beginpos == firstword.begin()
assert endpos == firstword.end()


# In this case they happen to be equal because we're basically only using ASCII
# in our text, but as soon as you deal with multibyte characters (diacritics,
# other scripts, etc.), they will not be!
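# To see the difference, here is a small sketch with a throwaway resource (the
# ID `utf8demo` is purely illustrative) that contains a multibyte character:

# In[ ]:

# 'café' counts 4 characters but 5 UTF-8 bytes: 'é' is encoded as two bytes
demo = store.add_resource(id="utf8demo", text="café")
assert demo.textlen() == 4                 # length in characters
assert demo.utf8byte(4) == 5               # character position 4 (the end) maps to byte 5
assert demo.utf8byte_to_charpos(5) == 4    # and back again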
# ### Tokenisation via regular expressions
#
# What else can we annotate? We can mark all individual words or tokens,
# effectively performing simple *tokenisation*. For this, we will use the regular
# expression search that is built into the STAM library: `find_text_regex()`. The
# regular expressions follow [Rust's regular expression
# syntax](https://docs.rs/regex/latest/regex/#syntax), which may differ slightly
# from Python's native implementation.

# In[25]:

expressions = [
    r"\w+(?:[-_]\w+)*",        # this detects words, possibly with hyphens or underscores as part of them
    r"[\.\?,/]+",              # this detects a variety of punctuation
    r"[0-9]+(?:[,\.][0-9]+)*", # this detects numbers, possibly with a fractional part
]
structuretypes = ["word", "punctuation", "number"]

for i, matchresult in enumerate(resource_banks.find_text_regex(expressions)):
    # (we only have one textselection per match, but a regular expression may
    #  result in multiple textselections if capture groups are used)
    textselection = matchresult['textselections'][0]
    structuretype = structuretypes[matchresult['expression_index']]
    print(f"Annotating \"{textselection}\" at {textselection.offset()} as {structuretype}")
    store.annotate(
        target=stam.Selector.textselector(resource_banks, textselection.offset()),
        data=[
            {"key": "structuretype", "value": structuretype, "set": "tutorial-set" }
        ],
        id=f"AnnotationToken{i+1}")


# In this code, each `matchresult` tracks which of the three expressions was
# matched, in `matchresult['expression_index']`. We conveniently use that
# information to assign new values for `structuretype`, all of which will be added
# to our vocabulary (`AnnotationDataSet`) on-the-fly.
#
# ### Annotating Metadata
#
# Thus far we have only seen annotations directly on the text, using
# `Selector.textselector()`, but STAM has various other selectors. Users may
# appreciate it if you add a bit of metadata about your texts. In STAM, these are
# annotations that point at the resource as a whole using a
# `Selector.resourceselector()`, rather than at the text specifically. We add one
# metadata annotation with various new fields:

# In[26]:

annotation = store.annotate(
    target=stam.Selector.resourceselector(resource_banks),
    data=[
        {"key": "name", "value": "Culture quotes from Iain Banks", "set": "tutorial-set" },
        {"key": "compiler", "value": "Dirk Roorda", "set": "tutorial-set" },
        {"key": "source", "value": "https://www.goodreads.com/work/quotes/14366-consider-phlebas", "set": "tutorial-set" },
        {"key": "version", "value": "0.2", "set": "tutorial-set" },
    ],
    id="Metadata1")


# Similarly, we could annotate an `AnnotationDataSet` (our vocabulary) with
# metadata, using a `Selector.datasetselector()`.
#
# ## Navigating through your data
#
# ### Basic iterating and counting
#
# If you followed all of the previous sections, we now have a fair amount of
# annotations. In fact, we have:

# In[27]:

print(f"{store.annotations_len()} annotations")
print(f"{store.resources_len()} resource(s)")
print(f"{store.datasets_len()} annotation dataset(s)")
print(f"{dataset.keys_len()} datakeys in our dataset")
print(f"{dataset.data_len()} annotationdata instances in our dataset")


# If we zoom in on the annotation data in our annotation dataset, we can extract
# some interesting frequency statistics right away:

# In[28]:

for data in dataset:
    count = data.annotations_len()
    print(f"{data.key()}: {data.value()} occurs in {count} annotation(s)")


# We can also aggregate by key only, although that is slightly less informative
# for our example case:

# In[29]:

for key in dataset.keys():
    count = key.annotations_count() # this one is called _count instead of _len because it is not instantaneous like the others
    print(f"{key} occurs in {count} annotation(s)")


# Just like we iterated over the annotation dataset above, we can also iterate
# over various things in the `AnnotationStore`. Let's write a small script that
# simply prints out most of the things in our store. At this point, though, the
# output will get a bit verbose:

# In[30]:

print("Datasets:")
for dataset in store.datasets():
    print(f" - ID: {dataset.id()}")

print("Resources:")
for resource in store.resources():
    print(f" - ID: {resource.id()}")
    print(f"   Text length: {resource.textlen()}")

print("Annotations:")
for annotation in store.annotations():
    print(f" - ID: {annotation.id()}")
    print(f"   Target selector type: {annotation.selector_kind()}")
    print(f"   Target resources: {annotation.resources()}")
    print(f"   Target offset: {annotation.offset()}")
    print(f"   Target text: {annotation.text()}")
    print(f"   Target annotations: ", [ a.id() for a in annotation.annotations_in_targets() ])
    print(f"   Data:")
    for data in annotation:
        print(f"    - ID: {data.id()}")
        print(f"      Set: {data.dataset().id()}")
        print(f"      Key: {data.key()}")
        print(f"      Value: {data.value()}")
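# These `*_len()` methods make quick aggregations cheap. As a small sketch, here
# is a frequency tally of all `structuretype` values, sorted by count, without
# iterating over a single annotation ourselves:

# In[ ]:

# Tally how often each structuretype value is used, straight from the reverse index
key = store.key("tutorial-set", "structuretype")
frequencies = {str(data.value()): data.annotations_len() for data in key.data()}
for value, count in sorted(frequencies.items(), key=lambda item: -item[1]):
    print(f"{value}: {count}")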
# ### Finding data
#
# We already introduced the methods `annotations()`, `data()` and
# `textselections()` in previous sections.
# They return collections: classes like `Annotations`, `Data` and
# `TextSelections`, which in turn contain instances of `Annotation`,
# `AnnotationData` and `TextSelection`, respectively.
#
# Internally, the STAM library maintains various forward and reverse indices,
# representing relationships between all kinds of entities in the STAM model. The
# aforementioned methods operate via these indices.
#
# The `annotations()` method is often a lookup via the reverse index. We have
# already seen some examples of it. Another nice example of the reverse index is
# that it allows us to obtain annotations for any arbitrary selection of the text
# we make:

# In[31]:

textselection = resource_banks.textselection(stam.Offset.simple(156,164))
for annotation in textselection.annotations():
    print(f" - ID: {annotation.id()}")
    print(f"   Text: {str(annotation)}")
    print(f"   Data:")
    for data in annotation:
        print(f"      {data.key()}={data.value()}")


# Of course, I cheated a bit here and knew in advance there was going to be a
# match for this offset, but the point to take home is that given any
# *text selection*, you can easily get the annotations that reference it.
#
# In the above example we iterate over all annotations and then over all the data
# pertaining to the found annotations. Often, though, you are searching for
# specific data and would have some kind of extra test in there. This is
# accomplished by passing filters, via positional arguments or keyword arguments
# like `value`, to the `annotations()` method. We have seen an example of this
# before; here is another:

# In[32]:

textselection = resource_banks.textselection(stam.Offset.simple(156,164))
dataset = store.dataset("tutorial-set")
key = dataset.key("structuretype")
for annotation in textselection.annotations(key, value="word"):
    print(f" - ID: {annotation.id()}")
    print(f"   Text: {str(annotation)}")


# The use of filters in methods like `annotations()` and `data()` is always
# preferable to writing it out manually in lower-level code, because the internal
# library is more performant and passing data back and forth to Python always
# comes with a performance penalty.
#
# In the example above, however, we see that we filter on data but do not
# actually get the matched data back as a return value. If you do want that, you
# need a two-step process as follows:

# In[33]:

textselection = resource_banks.textselection(stam.Offset.simple(156,164))
dataset = store.dataset("tutorial-set")
key = dataset.key("structuretype")
for annotation in textselection.annotations(key, value="word"):
    print(f" - ID: {annotation.id()}")
    print(f"   Text: {str(annotation)}")
    annotationdata = annotation.data(key, value="word", limit=1)[0]
    print(f"   Data: {str(annotationdata)}")


# Sometimes you don't really care to retrieve the data or the annotations, but
# merely want to test whether certain data is present on an annotation. For this
# you can use methods like `test_annotations()` and `test_data()`, which take the
# same parameters for filtering as their counterparts `annotations()` and
# `data()`, but instead of returning a collection they simply return a boolean,
# which is more performant.
#
# The following example confirms for us that the text selection is indeed a word:

# In[34]:

textselection = resource_banks.textselection(stam.Offset.simple(156,164))
dataset = store.dataset("tutorial-set")
key = dataset.key("structuretype")
assert textselection.test_data(key, value="word")


# It is possible to retrieve all *known* text selections for a given
# resource. A text selection is 'known' if there is at least one annotation that
# references it:

# In[35]:

for textselection in resource_banks.textselections():
    print(textselection)


# It's easy to see how you can combine some of the examples to retrieve all
# annotations in a reverse way (i.e. via the text).
#
# You can consider a STAM model as a graph in which the annotations, resources and
# data make up the nodes. The forward and reverse indices encode how these nodes
# are related and form the edges of the graph. These edges can be traversed in
# almost any direction using the various methods at your disposal in this STAM
# library. Methods like `data()`, `annotations()` and `textselections()`, with
# their filtering abilities, as well as their test counterparts, are essential
# tools to accomplish this.
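# To see this graph in action, here is a small sketch: starting from one
# annotation, we hop forward to its text selection(s), and then back via the
# reverse index to every annotation that shares that exact text selection:

# In[ ]:

# Forward edge: annotation -> text selection(s); reverse edge: text selection -> annotations
word = store.annotation("AnnotationLine8Word1")
for textselection in word.textselections():
    for annotation in textselection.annotations():
        print(f"{annotation.id()} also references {textselection.offset()}")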
# ### Text Relations
#
# Now we get to the fun part. When you select any two parts of a text, i.e.
# create two text selections, a number of *relationships* between these text
# selections may or may not hold:
#
# * The text selections may overlap
# * The text selections may be embedded entirely in one another (one overlaps fully with the other)
# * The text selections may come before or after one another, with any amount of distance in between
# * The text selections may succeed or precede one another: one's end is the other's begin, or vice versa
# * The text selections may have the very same begin and/or end offset
#
# In STAM, the `TextSelectionOperator` captures these relationships.
#
# Remember our example in which we annotated the first word of line eight? The
# text selection for this word *is embedded* within the text selection for line
# eight as a whole. We can test that as follows, using the `test()` method on
# `TextSelection`:

# In[36]:

assert firstword.test(stam.TextSelectionOperator.embedded(), line8_textselection)

# the reverse then also holds:
assert line8_textselection.test(stam.TextSelectionOperator.embeds(), firstword)

# an embedding is essentially a stricter form of an overlap relation, so this holds too:
assert firstword.test(stam.TextSelectionOperator.overlaps(), line8_textselection)
assert line8_textselection.test(stam.TextSelectionOperator.overlaps(), firstword)


# Not only can we test any given text selections, we can use this functionality
# to actively *find* text selections that are in a particular relationship with
# another; in other words, we find *related text selections*. This is a
# core feature of the STAM library and a primary method of finding text
# selections and their annotations. We use the `related_text()` method for this.
#
# Let's find all text selections (which we previously annotated) in line eight:

# In[37]:

for textselection in line8_textselection.related_text(stam.TextSelectionOperator.embeds()):
    print(f"{textselection} @{textselection.offset()}")


# Often, what we are interested in is not the text selections as such, but the
# annotations that reference these text selections. Simply add `.annotations()`:

# In[38]:

for annotation in line8_textselection.related_text(stam.TextSelectionOperator.embeds()).annotations():
    print(f" - ID: {annotation.id()}")
    print(f"   Text: {str(annotation)}")
    print(f"   Data:")
    for data in annotation:
        print(f"      {data.key()}={data.value()}")


# The `related_text()` method is available on `TextSelection` (and
# `TextSelections`) and on `Annotation` (and `Annotations`), in which case the
# latter is again a shortcut so you don't have to retrieve the text selections
# yourself first. As said before: do use all the shortcuts the library offers,
# because the more the library can do for you, the more performant things are,
# as it's compiled to machine code and not written in Python itself.
#
# In the last output, you may note that we got two annotations for the first word
# of line eight; that's because we made one manually, and the other one came from
# our regular-expression based tokeniser.
#
# In the previous example all we got was data with key `structuretype` and value `word`.
# We could have specifically selected for this by adding some filters to `annotations()`:

# In[39]:

key = store.dataset("tutorial-set").key("structuretype")
for annotation in line8_textselection.related_text(stam.TextSelectionOperator.embeds()).annotations(key, value="word"):
    print(f" - ID: {annotation.id()}")
    print(f"   Text: {str(annotation)}")
    print(f"   Data:")
    for data in annotation:
        print(f"      {data.key()}={data.value()}")


# ### Querying with STAMQL
#
# Instead of querying data using the various Python objects and methods we have
# seen thus far, it is also possible to formulate a query in a query language
# called STAMQL. The query language is described in detail in the
# [STAM documentation](https://github.com/annotation/stam). We will only cover
# some of the basics here and show how to call it from Python.
#
# A query starts with a `SELECT` statement, followed by a return type specifying
# what kind of data you want the query to return (`ANNOTATION`, `DATA`, `TEXT`,
# `KEY`, `DATASET`). Then you must specify a variable name to bind the results to
# (variables always start with a `?` in STAMQL), and a `WHERE` statement
# introducing a series of one or more constraints, each of which ends with a
# semicolon.
#
# Let's illustrate all this with an example: we obtain line 8 from our data,
# which we had explicitly annotated earlier:

# In[40]:

query = """
SELECT ANNOTATION ?a WHERE
    DATA "tutorial-set" "linenr" = 8;
"""
for result in store.query(query):
    annotation = result['a']
    assert isinstance(annotation, stam.Annotation)
    print("ID: ", annotation.id())
    print("Text: ", str(annotation))


# Here we formulated a query in STAMQL and passed it to the `query()` method as a
# string, and this gives us the results back as a list of dictionaries. The keys
# in the dictionaries correspond to the variable binds we chose in the `SELECT`
# statements (without the `?` prefix). In this case we obtain one result
# containing one variable `a`.
#
# Instead of querying for the annotation, we could have queried directly for the
# text as well; we could also add extra constraints that must all be satisfied:

# In[41]:

query = """
SELECT TEXT ?t WHERE
    DATA "tutorial-set" "linenr" = 8;
    DATA "tutorial-set" "structuretype" = "line";
"""
for result in store.query(query):
    print(result['t'])


# Querying for text rather than annotations makes a subtle difference when you
# add multiple `DATA` constraints like we did above. If we query for text, then
# we select text which has annotations with the specified data; the data does not
# necessarily have to pertain to the same annotation (as long as it covers the
# same text). If you query for annotations and have multiple `DATA` constraints,
# then a *single annotation* must have both data items.
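# The same query can also be expressed with the Python methods from earlier
# sections. As a sketch, this is the method-chaining equivalent of the first
# STAMQL query above:

# In[ ]:

# Equivalent to: SELECT ANNOTATION ?a WHERE DATA "tutorial-set" "linenr" = 8;
for annotation in store.data(set="tutorial-set", key="linenr", value=8).annotations():
    print("ID: ", annotation.id())
    print("Text: ", str(annotation))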
# The query language supports *query composition* to chain multiple
# queries/subqueries together. A subquery is introduced using curly braces.
# Take a look at the following example, where we again select line 8, and then
# all words in line 8 (here we use a textual embedding relation):

# In[42]:

query = """
SELECT ANNOTATION ?line WHERE
    DATA "tutorial-set" "linenr" = 8;
{
    SELECT ANNOTATION ?word WHERE
        RELATION ?line EMBEDS;
        DATA "tutorial-set" "structuretype" = "word";
}
"""
for result in store.query(query):
    #the ?line annotation will be returned for each result
    assert 'line' in result
    annotation = result['word']
    assert isinstance(annotation, stam.Annotation)
    print("ID: ", annotation.id())
    print("Text: ", str(annotation))


# The constraint ``RELATION ?line EMBEDS;`` in the subquery is essential here; it
# can be read as "?line embeds ?word" and ensures that there is a specific
# textual relation between the two select statements. It is even a requirement
# for a subquery to have a constraint that refers back to the parent query. Each
# subquery can itself have a subquery, so you can build long chains.
#
# Aside from `EMBEDS`, there are other relations you can use, such as `OVERLAPS`,
# `PRECEDES`, `SUCCEEDS`, `BEFORE`, `AFTER`, `SAMEBEGIN`, `SAMEEND` and `EQUALS`.
# These are the STAMQL keywords representing the `TextSelectionOperator` you have
# already seen before.
#
# You have the choice whether to express your queries through STAMQL or through
# Python objects and methods. Internally, the stam library will convert the
# latter to the former whenever you apply any filtering, so there is not too much
# difference performance-wise. There is some performance overhead, though, in the
# conversion of results when you call `query()` explicitly with a STAMQL query.
#
# When calling `query()`, you may inject context variables yourself via keyword
# arguments. These will subsequently be available to be used in constraints in
# your query. As an example, we repeat the previous query but inject the line
# variable manually; we already had an instance of it lying around anyway:

# In[43]:

query = """
SELECT ANNOTATION ?word WHERE
    RELATION ?line EMBEDS;
    DATA "tutorial-set" "structuretype" = "word";
"""
for result in store.query(query, line=line8_textselection):
    annotation = result['word']
    assert isinstance(annotation, stam.Annotation)
    print("ID: ", annotation.id())
    print("Text: ", str(annotation))


# ## Advanced annotation
#
# ### Higher-order Annotation
#
# All annotations we have done so far reference the text as a whole with absolute
# offsets via a *TextSelector*, even though we formulated some of these offsets
# (first word of line eight) in relative terms.
#
# STAM also allows you to adopt another annotation paradigm in which you point an
# annotation not at a text via a *TextSelector*, but at another annotation via an
# *AnnotationSelector*; that other annotation, or the final one of however many
# there are in between, points at the text with a *TextSelector*. You can specify
# an offset, which will then be interpreted relative to (the text selection of)
# the targeted annotation:

# In[44]:

line8 = store.annotation("AnnotationLine8")
annotation = store.annotate(
    target=stam.Selector.annotationselector(line8, stam.Offset.simple(0,10)),
    data= {"key": "structuretype", "value": "word", "set": "tutorial-set" },
    id="AnnotationLine8Word1_explicit")


# Here we are effectively annotating an annotation, so we call this a form of
# *higher-order annotation*. We explicitly capture and model a relationship.
# Whether to do this explicitly, or to use the STAM library's functionality to
# resolve it implicitly, is entirely up to you, the modeller, and your use-case!
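# Such an explicit relationship can be navigated from both ends. The sketch
# below assumes that `annotations()` on an `Annotation` follows the reverse
# index (who targets me?), as the counterpart of `annotations_in_targets()`
# (whom do I target?) that we used earlier:

# In[ ]:

# Sketch: navigate higher-order annotations in both directions
line8 = store.annotation("AnnotationLine8")
for annotation in line8.annotations():   # assumed: annotations referencing line8
    targets = [a.id() for a in annotation.annotations_in_targets()]
    print(f"{annotation.id()} targets {targets}")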
#
# We can also use higher-order annotation to associate metadata with annotations,
# such as encoding the person who did the annotation. In such cases, we can choose
# not to reference the text at all, because the annotation no longer says
# something about the text:

# In[45]:

line8 = store.annotation("AnnotationLine8")
annotation = store.annotate(
    target=stam.Selector.annotationselector(line8),
    data= [
        {"key": "annotator", "value": "Maarten van Gompel", "set": "tutorial-set" },
        {"key": "datetime", "value": "2023-04-18T17:48:56", "set": "tutorial-set" },
    ],
    id="AnnotationAnnotator")


# Note that we invented some more keys that were added on-the-fly to our
# annotation dataset (i.e. the vocabulary).
#
# This, too, needn't be a higher-order annotation; you can choose to associate
# the `AnnotationData` directly with the annotation. The idea behind an
# annotation, though, is that once it is made, it is immutable: no adding/editing
# of annotation data or targets at later points in time. Information such as
# annotators and date/time information could well be associated with the
# annotation upon creation, but sometimes there may be data which you want to
# associate with an annotation at a later point in time. That would be a use case
# for higher-order annotation.
#
# ### Complex selectors
#
# Rather than pointing at a single target, sometimes you want to annotate
# something that cannot be captured by a single simple selector. Take, for
# example, line eight from our text again:
#
# *everything we know [and can know of] is composed ultimately of patterns of nothing*
#
# Say we want to annotate the parts of the sentence without the portion in square
# brackets; then a single text selection could not capture it, because it is
# discontinuous. Two text selections, however, do the job. To combine the two
# text selectors (or any other type of simple selector), STAM has the
# *CompositeSelector*:

# In[46]:

part1 = line8_textselection.textselection(stam.Offset.simple(0,18))
part2 = line8_textselection.textselection(stam.Offset.simple(37,82))
line8mainsentence = store.annotate(
    target=stam.Selector.compositeselector(
        stam.Selector.textselector(resource_banks, part1.offset()),
        stam.Selector.textselector(resource_banks, part2.offset()),
    ),
    data= [
        {"key": "structuretype", "value": "mainsentence", "set": "tutorial-set" },
    ],
    id="AnnotationLine8Mainsentence")


# If we ask the STAM library to get the text using `str()`, it will concatenate
# the parts with a space, which may not always be appropriate:

# In[47]:

print(f"\"{line8mainsentence}\"")
assert str(line8mainsentence) == "everything we know is composed ultimately of patterns of nothing"


# Use the `text()` method instead if you want to retain the separate parts:

# In[48]:

print(line8mainsentence.text())
assert line8mainsentence.text() == ["everything we know", "is composed ultimately of patterns of nothing"]


# In a similar fashion, you can also call the `textselections()` method to
# obtain all text selections. We already used this method before and remarked
# that it always returns a `TextSelections` collection and not just a single
# `TextSelection`; now you know why.
#
# When the composite selector is used, the target *must* be interpreted jointly:
# the annotation applies to the whole composition rather than to the individual
# parts.
#
# There is also the *MultiSelector*, which selects multiple targets where the
# annotation applies to each of them individually and independently. It offers a
# convenient way to express multiple annotations more concisely, conserving
# memory usage.
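# As a sketch — assuming the constructor follows the same naming pattern as the
# selectors we have already used (`stam.Selector.multiselector(...)`) — we could
# tag both section headers with a single annotation:

# In[ ]:

# Sketch: one annotation applied to each target individually
# (constructor name assumed by analogy with compositeselector)
store.annotate(
    target=stam.Selector.multiselector(
        stam.Selector.textselector(resource_banks, section1.offset()),
        stam.Selector.textselector(resource_banks, section2.offset()),
    ),
    data={"key": "structuretype", "value": "sectionheader", "set": "tutorial-set"},
    id="AnnotationBothSectionHeaders")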
# Last, there is the *DirectionalSelector*, which expresses multiple targets in a
# specific order that is meaningful. For example, taking line eight again, we can
# express the dependency relation where the word *ultimately* is an adverbial
# modifier of the verb *composed*:

# In[49]:

head = line8_textselection.textselection(stam.Offset.simple(40,48))      # "composed"
dependant = line8_textselection.textselection(stam.Offset.simple(49,59)) # "ultimately"
dependency = store.annotate(
    target=stam.Selector.directionalselector(
        stam.Selector.textselector(resource_banks, head.offset()),
        stam.Selector.textselector(resource_banks, dependant.offset()),
    ),
    data= [
        {"key": "dependency", "value": "advmod", "set": "tutorial-set" },
    ],
    id="AnnotationDependency")


# You can interpret the different selectors under a directional selector akin to
# positional function parameters. You, the modeller, determine how the ordering
# is interpreted.
#
# ## Editing annotations
#
# We already explained that editing annotations is a bad idea and should be
# avoided: the canonical way to edit an annotation is to remove the old
# annotation from the store and make a new one. Removing an annotation, or any
# other STAM object, can be done by passing it to `AnnotationStore.remove()`.
#
# We can dive into the motivation behind this constraint a bit more: from a
# semantic perspective, annotations are essentially a commentary on something
# else. If that which you comment on is subject to change, possibly unbeknownst
# to you, then such a change might invalidate your commentary, as it is no longer
# the same thing as what you based your comment on! The STAM model prevents these
# pitfalls.
#
# Nevertheless, at the low level there are ways around this constraint. After
# all, as long as you don't publish the annotations, you have some liberty in
# editing them. Currently, though, the Python library does not yet expose this.
#
# When using `AnnotationStore.remove()` on any variable, you must take care
# yourself not to use that variable again. Also note that removing an item will
# remove everything that depends on it. So if you remove an item like an
# annotation, text resource or data item, then all annotations on it and
# everything that references it will be automatically removed as well.
#
# ## Saving and loading data
#
# All this time we've been annotating but have not committed our results to any
# form of persistent storage. You will likely want to save your annotation store
# to file, and load it all again at some later point in time.
#
# STAM's canonical serialisation format is [STAM JSON](https://github.com/annotation/stam#stam-json):

# In[50]:

store.set_filename("tutorial.store.stam.json")
store.save()


# The `save()` method will use the filename that the annotation store was
# initially loaded from. We had none yet, so we set it via `set_filename()`
# first. In our current example, everything is saved into a single JSON file.
#
# However, the `set_filename()` method is also available on `AnnotationDataSet`
# and `TextResource`. If set, these are kept in stand-off files. Annotation data
# sets usually use STAM JSON, but text resources generally just use plain text.
# The extension you use determines the file format.
#
# There is also a [STAM CSV
# format](https://github.com/annotation/stam/tree/master/extensions/stam-csv),
# defined as an extension, which is supported by this library. Whereas the JSON
# format is *very* verbose (meaning large files), the CSV format is a bit more
# concise.
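# Since the extension determines the format, switching to STAM CSV is, as a
# sketch — assuming the store honours the `.csv` extension in the same way as the
# stand-off files do — just a matter of setting a different filename:

# In[ ]:

# Sketch: serialise the same store as STAM CSV (extension assumed to select the format)
store.set_filename("tutorial.store.stam.csv")
store.save()
store.set_filename("tutorial.store.stam.json")  # switch back for the remainder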
# Loading an annotation store (including all stand-off files) is as simple as:

# In[51]:

store2 = stam.AnnotationStore(file="tutorial.store.stam.json")


# ## Visualising annotations
#
# Having made annotations, you may want to visualise them. This can be done via
# the `view()` method on `AnnotationStore`. It takes as input a *selection query*
# and zero or more *highlight queries*, all in STAMQL, and produces either HTML
# output or colored ANSI text. The HTML output is a self-contained, standalone
# document.
#
# The selection query determines what the main selection is, and can be anything
# you can query that has text (i.e. resources, annotations, text selections).
#
# The *highlight queries* determine what parts of the selections produced by the
# selection query you want to highlight. Highlighting is done by drawing a line
# underneath the text and optionally by a *tag* that shows extra information.
# Specific display options are configurable via *attributes* (starting with `@`)
# that precede the actual STAMQL query.
#
# Tags can be enabled by prepending the query with one of the following attributes:
#
# * `@KEYTAG` - Outputs a tag with the key, pertaining to the first DATA constraint in the query
# * `@KEYVALUETAG` - Outputs a tag with the key and the value, pertaining to the first DATA constraint in the query
# * `@VALUETAG` - Outputs a tag with the value only, pertaining to the first DATA constraint in the query
# * `@IDTAG` - Outputs a tag with the public identifier of the ANNOTATION that has been selected
#
# If you don't want to match the first DATA constraint but the *n*-th, then
# specify a number to refer to the DATA constraint (1-indexed) in the order
# specified. Note that only DATA constraints are counted:
#
# * `@KEYTAG=`*n* - Outputs a tag with the key, pertaining to the *n*-th DATA constraint in the query
# * `@KEYVALUETAG=`*n* - Outputs a tag with the key and the value, pertaining to the *n*-th DATA constraint in the query
# * `@VALUETAG=`*n* - Outputs a tag with the value only, pertaining to the *n*-th DATA constraint in the query
#
# Attributes may also be provided for styling HTML output:
#
# * `@STYLE=`*class* - Will associate the mentioned CSS class (it's up to you to provide a proper stylesheet). The default stylesheet predefines only a few simple classes: `italic`, `bold`, `red`, `green`, `blue`, `super`.
# * `@HIDE` - Do not draw the highlight underline and do not add an entry to the legend. This may be useful if you only want to apply `@STYLE`.
#
# If no attribute is provided, there will be no tags or styling shown for that
# query, only a highlight underline in the HTML output.
#
# **Note:** This is the same functionality as is exposed by the collection of
# command-line tools called
# [stam-tools](https://github.com/annotation/stam-tools).
#
# To display HTML in this Jupyter Notebook, we first import the following:

# In[52]:

from IPython.display import display, HTML
# Let's take a look at the data we have been creating thus far. First, let's just
# query for the text of the two quotes our document consists of:

# In[53]:

display(HTML(store.view('SELECT ANNOTATION ?quote WHERE DATA "tutorial-set" "structuretype" = "quote";')))


# We can add additional queries to *highlight* parts of this output, such as the
# words and line eight, both of which we annotated earlier:

# In[54]:

display(HTML(store.view('SELECT ANNOTATION ?quote WHERE DATA "tutorial-set" "structuretype" = "quote";', \
                        'SELECT ANNOTATION ?word WHERE RELATION ?quote EMBEDS; DATA "tutorial-set" "structuretype" = "word";', \
                        'SELECT ANNOTATION ?line_8 WHERE RELATION ?quote EMBEDS; DATA "tutorial-set" "linenr" = 8;')))


# It is important that highlight queries always reference the variable from the
# primary selection query (`?quote` in the above example); otherwise they query
# too much and performance becomes drastically suboptimal.
#
# We can also output additional tags by prepending an attribute (`@IDTAG`,
# `@KEYTAG`, `@VALUETAG` or `@KEYVALUETAG`) to a highlight query:

# In[55]:

display(HTML(store.view('SELECT ANNOTATION ?quote WHERE DATA "tutorial-set" "structuretype" = "quote";', \
                        '@IDTAG SELECT ANNOTATION ?word WHERE RELATION ?quote EMBEDS; DATA "tutorial-set" "structuretype" = "word";', \
                        '@KEYVALUETAG SELECT ANNOTATION ?line_8 WHERE RELATION ?quote EMBEDS; DATA "tutorial-set" "linenr" = 8;')))


# Alternatively, you can output annotations as text with ANSI escape sequences by
# setting the keyword argument `format="ansi"`. This is designed for terminal
# output, but it can also be visualised here:

# In[56]:

print(store.view('SELECT ANNOTATION ?quote WHERE DATA "tutorial-set" "structuretype" = "quote";', \
                 'SELECT ANNOTATION ?word WHERE RELATION ?quote EMBEDS; DATA "tutorial-set" "structuretype" = "word";', \
                 '@KEYVALUETAG SELECT ANNOTATION ?line_8 WHERE RELATION ?quote EMBEDS; DATA "tutorial-set" "linenr" = 8;', format="ansi"))


# ## Text alignment
#
# The STAM library provides algorithms to automatically align two similar
# versions of a text at the character level. The algorithms used come in two
# variants, originally developed in bioinformatics, where they are used for
# DNA/RNA sequence alignment:
#
# * Smith-Waterman - Local alignment - Does fuzzy matching of text A in a larger text B.
# * Needleman-Wunsch - Global alignment - Texts A and B are fuzzy-matched as a whole.
#
# The result of a text alignment is called a *transposition* and is stored in the
# STAM model as an annotation. The formal specification for transpositions can be
# found in the [STAM Transpose
# extension](https://github.com/annotation/stam/tree/master/extensions/stam-transpose).
# A transposition points at both texts and identifies an *identical* sequence of
# characters in each. There may be gaps in this sequence, with respect to the
# original text, and the sequence order may differ.
#
# Let's look at an example. First we add an extra resource, with some text to
# match against, to our example store:

# In[57]:

aligntest = store.add_resource(id="aligntest", text="patterns of something")


# And then we align the text of that resource with the text of the first quote in
# the model. The `align_texts()` method does the actual work and returns a list
# of annotations, where each annotation is a transposition.
# The `alignments()` method can subsequently be used to obtain the two
# constituent parts of the transposition:

# In[58]:

aligntextsel = aligntest.textselection(stam.Offset.whole())
quote1textsel = next(store.annotation("AnnotationQuote1").textselections())
print(f"Looking for alignment of \"{aligntextsel}\" in \"{quote1textsel}\"")
alignments = aligntextsel.align_texts(quote1textsel, algorithm="local")
print()
print(f"Found {len(alignments)} alignment (transposition):")
for alignment in alignments:
    for left, right in alignment.alignments():
        assert left.text() == right.text()
        print(f"\"{left.text()}\"\t{left.offset()}@{left.resource().id()}\t{right.offset()}@{right.resource().id()}")


# An important feature of transpositions is that they allow you to *transpose*
# existing annotations that reference one part of the transposition to the other
# part. That is to say, transpositions allow you to map different coordinate
# systems onto texts.
#
# Let's illustrate this with an example where we transpose one of the words in
# our aligned example:

# In[59]:

# Grab the annotation that corresponds to the word 'patterns'; we learnt its
# identifier in the visualisation example in the previous section:
annotation = store.annotation("AnnotationToken28")

# The first alignment result is a transposition that covers this word:
transposition = alignments[0]

# Now we can transpose the annotation via the transposition:
transposed = next(annotation.transpose(transposition))
print(f"Transposed \"{annotation}\" {annotation.offset()}@{annotation.resources()[0].id()} to \"{transposed}\" {transposed.offset()}@{transposed.resources()[0].id()} with data {transposed.data()[0].key()} = {transposed.data()[0].value()}")


# The transposed annotation is a copy of the original annotation, containing all
# its data, but transposed to refer to a new target text.
#
# Sometimes you're not so much interested in the exact alignment (the
# transposition) but rather in a simplified, approximate alignment. This can be
# computed by passing the keyword argument `grow=True` to the `align_texts()`
# method. The parameter is called `grow` because it effectively computes a
# transposition first, and then grows that alignment into something larger that
# is no longer a precise match. Additional keyword arguments like `max_errors`
# and `minimal_align_length` may be set to adjust the behaviour of the grow
# algorithm.
#
# This method no longer produces transpositions as its result, but *translations*
# instead. We use a wide definition of the term *translation* here: it is an
# annotation that maps one text onto another, but *where the textual content is
# no longer identical*. These are documented here: [STAM Translate
# extension](https://github.com/annotation/stam/tree/master/extensions/stam-translate).
#
# The following example illustrates this; compare it to the transposition example earlier:

# In[60]:

aligntextsel = aligntest.textselection(stam.Offset.whole())
quote1textsel = next(store.annotation("AnnotationQuote1").textselections())
print(f"Looking for alignment of \"{aligntextsel}\" in \"{quote1textsel}\"")
alignments = aligntextsel.align_texts(quote1textsel, algorithm="local", grow=True)
print()
print(f"Found {len(alignments)} alignment (translation):")
for alignment in alignments:
    for left, right in alignment.alignments():
        print(f"\"{left.text()}\"\t{left.offset()}@{left.resource().id()} --> \"{right.text()}\"\t{right.offset()}@{right.resource().id()}")
# ## Conclusion
#
# This concludes the tutorial. We hope to have shown you how to use the STAM
# Python library.

# In[ ]: