What are the data structures?¶

What exactly do we mean by data structures? Generally speaking, those are the objects capable of storing and retrieving an arbitrary number of values of any type, in a systematic way. In all other senses, data structures are similar to basic data types - they can be stored via variables, removed, changed, etc.

Built-in data structures provide a standard and high-performant way to work with the bulks of data; However, there is simply no silver-bullet one-suits-all data structure - benefits and shortcomings of each are built-in and inseparable from its general design.

Lists¶

Lists are probably the most frequent type of data structure in python. They are the simple, one-dimensional ordered array of elements, Each element has its own index number, starting with 0 - so the last element always has an index of L-1, where L is the number of elements in the list. Lists can store any mix of data types. It can also store any other data structure - for example, a matrix (2d array) can be represented as a list of lists; 3d-matrix can be represented as a list of lists of lists, and so on. They are also dynamic — you can add or drop items mutable - you can add or remove values, or even change their order in place, without rebuilding list from the scratch.

You can create a new list by using list() function, or using square brackets, with elements separated by commas:

In [1]:

fruits = ['banana', 'apple', 'plum']

If needed, you can create an empty list as well:

In [2]:

basket  = []
another_basket = list()

As with strings, we can check the length of the list using len built-in function:

In [3]:

len(fruits)

Out[3]:

In [5]:

len(basket)

Out[5]:

We always can add new elements to the end of the list via .append method :

In [7]:

fruits.append('pineapple')

In [8]:

fruits

Out[8]:

['banana', 'apple', 'plum', 'pineapple']

Or insert them into a specific position:

In [13]:

fruits.insert(2, 'orange')

In [14]:

fruits

Out[14]:

['banana', 'apple', 'orange', 'plum', 'pineapple']

We can also merge it with another list, using .extend:

In [16]:

fruits.extend(['melon', 'watermelon'])

In [17]:

fruits

Out[17]:

['banana', 'apple', 'orange', 'plum', 'pineapple', 'melon', 'watermelon']

n some cases, you want to retrieve one element, and also remove it from the list - for example, if you need to process elements, one by one. For that, .pop() method is ideal - it will get elements from the end of the list, removing it from the list at the same time:m

In [19]:

len(fruits)

Out[19]:

In [20]:

fruits.pop()

Out[20]:

'watermelon'

In [21]:

len(fruits)

Out[21]:

Finally, you can use the in statement to check if the list contains certain element; This will also work with any other iterables:

In [22]:

'melon' in fruits

Out[22]:

True

Slicing¶

As with strings, in order to get a single value from the list, you need to use its index in the square brackets, after the value:

In [23]:

fruits[0]  # first element

Out[23]:

'banana'

You can also obtain a sub-list, using slices - intervals of indexes, defined by 2 numbers, separated by the colon. Those numbers represent start and end indices; remarkably, the former one is inclusive, but the latter one is not. If one or both numbers are missing, python assumes those to be ends of the list:

In [24]:

fruits[0:2]

Out[24]:

['banana', 'apple']

In [25]:

fruits[:2]

Out[25]:

['banana', 'apple']

This is exactly the same interface we used with strings - and it called slicing. As with the strings, we can use negative indices - representing values, indexed in reverse, starting from the end, e.g. “-1” will represent the last element in the list, no matter what it’s actual index. One more feature of slicing is its ability to define the step of the increase. By default, it is equal to 1, — each element, one after another. However, if you state it as 2, each other element will be retrieved; Similarly, -1 will retrieve all elements, but in the reverse order:

In [26]:

fruits[::2]

Out[26]:

['banana', 'orange', 'pineapple']

In [27]:

fruits[:2:-1]

Out[27]:

['melon', 'pineapple', 'plum']

Tuples¶

Tuples often stay in the shadow of lists. On the surface, they are very similar - they are also 1-dimensional arrays, can mix different types, use same indices and can be sliced. Similar to lists, tuples can be created using tuple() function, or with the colons:

In [28]:

breakfast = ('oatmill', 'scrambled eggs', 'orange juice')

There is one major difference that makes tuple work better for some tasks - and inconvenient for the others. As we mentioned before, lists are dynamic - they can be modified in place to your likes. Tuples are not - once built, tuple remains static and cannot be changed (therefore, they don’t have append and extend methods). This property has a name - immutability.

In [30]:

breakfast.append(3)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-30-ba4eefcf99ae> in <module>
----> 1 breakfast.append(3)

AttributeError: 'tuple' object has no attribute 'append'

Dictionaries¶

Dictionaries are a different type of structures - instead of being ordered arrays, they are key-value storages; Dictionaries do not have any order per se* - instead, it stores everything as a key-value pair. As physical keys, the dictionary’s keys have to be unique and unambiguously static - in other words, immutable. Hence, there cannot be two keys of the same value, and lists cannot be used as keys - but tuples can. Most frequently, however, keys are strings - as it allows to add some sort of semantics to the structure:

In [32]:

person = {
 'name': 'Jim',
 'surname': 'Hawkins',
 'age':17
}

As you can see, dictionaries are defined by the curly brackets, with key-value pairs separated by the colon and split by the comma. Once the dictionary is assigned, you can retrieve values, using square brackets - same as for lists and tuples - but using keys instead of indices:

In [34]:

person['age']

Out[34]:

As there cannot be two keys of the same value, another assignment will override previous value - with no warnings!

In [36]:

person['hair'] = 'ginger'

As with lists, you can merge two dictionaries. One way to do that is through .update() method:

In [37]:

additional_info = {
    'gender': 'male',
    'nationality': 'british',
    'age': 16
}

person.update(additional_info)

Similar to lists, dictionaries have .pop() method - but in this case, it requires a specific key, for which the value to retrieve and remove:

In [38]:

person.pop('age')

Out[38]:

keys and values¶

On top of the methods above, dictionaries have a couple of handful methods up their sleeve. First, you can get either keys or values as list-like structures, using .keys() or .values(), respectfully. Sometimes, it is convenient to get iterable of key-value pairs - this can be achieved using .items() method:

если мы не уверены что ключ есть, но запросить нужно, можно использовать .get с дефолтной величиной:

In [39]:

print({'name':'Jim', 'surname': 'Hawkins'}.items())

dict_items([('name', 'Jim'), ('surname', 'Hawkins')])

get¶

When you try to get value by submitting a key that is not in the dictionary, a Keyerror is raised. In some cases, you don't want that behavior; In this case, use .get() method - it takes two values, first for the key, and second for the default value - one that .get() will return if there is no such data in the dictionary.

In [40]:

person.get('eye color', 'brown')

Out[40]:

'brown'

Last, dictionaries can behave as iterables; you can loop through it (more on that in the next chapter), or check if an element is in the dict; In both cases, however, it will work with the keys, not the values:

In [45]:

'name' in person

Out[45]:

True

In [46]:

'Jim' in person

Out[46]:

False

Sets¶

Sets are, in a way, dictionaries without values - first, they use same curly brackets, and second, their members cannot be duplicates - similar to dictionary keys. Because of that, they are handy to use for deduplication or membership tests. On top of that, sets have built-in mathematical operations, — union, intersection, difference, and symmetric difference.

As with other structures, you can create sets by either using curly brackets, or set() function:

In [52]:

names = set(['Sam', 'John', 'James', 'Sam'])
names

Out[52]:

{'James', 'John', 'Sam'}

In [54]:

other_names = {'James', 'Nikolai', 'Iliah'}

In [55]:

names.difference(other_names)

Out[55]:

{'John', 'Sam'}

elements which are either in names or other names - but not both

In [56]:

names.symmetric_difference(other_names)

Out[56]:

{'Iliah', 'John', 'Nikolai', 'Sam'}

Membership test time comparison¶

In [58]:

l = ['apple', 'banana', 'orange', 'grapefruit', 'plum', 'grape', 'pear']
s = set(l)

In [59]:

%timeit 'pear' in l

84.9 ns ± 1.99 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [61]:

%timeit 'pear' in s

31.6 ns ± 1.21 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [ ]:

As you can see, even for the short array, performance is essentially more than 2 times better. This difference will only increase on the larger arrays.

More data structures¶

Lists, tuples, dictionaries, and sets are the most popular and widespread data structures in python. However, they are not the only ones, and there are many more - even in the default python itself. As they are way more niche, we’ll cover each with a brief overview. You can find additional information in Python documentation.

Frozensets¶

In [64]:

frozenset('Hello')

Out[64]:

frozenset({'H', 'e', 'l', 'o'})

Defaultdict¶

defailtdicts live in the built-in collections module. It has a default value set up on creation - if a missing key is passed, it will return this default value instead of raising KeyError. While this behavior can be achieved through the get method of an ordinary dictionary, defaultdicts perform twice as fast on those missing values retrieval:

In [66]:

from collections import defaultdict
d = defaultdict(str)  # returns empty string as default value

In [67]:

d['name'] = 'John'

In [68]:

d['surname']

Out[68]:

''

Counter¶

In [69]:

from collections import Counter

Counter('Hello')

Out[69]:

Counter({'H': 1, 'e': 1, 'l': 2, 'o': 1})

In [70]:

c1 = Counter({'banana': 2, 'apple': 1})
c1['apple'] += 1
c1

Out[70]:

Counter({'banana': 2, 'apple': 2})

Queues¶

In [71]:

from queue import Queue

In [72]:

Q = Queue(maxsize=2)
Q.put('wash dishes')
Q.put('water flowers')

Q.qsize()

Out[72]:

In [73]:

Q.put_nowait('check mail') 

---------------------------------------------------------------------------
Full                                      Traceback (most recent call last)
<ipython-input-73-52ab83e4fe85> in <module>
----> 1 Q.put_nowait('check mail')

~/anaconda3/envs/py36/lib/python3.6/queue.py in put_nowait(self, item)
    182         Otherwise raise the Full exception.
    183         '''
--> 184         return self.put(item, block=False)
    185 
    186     def get_nowait(self):

~/anaconda3/envs/py36/lib/python3.6/queue.py in put(self, item, block, timeout)
    128                 if not block:
    129                     if self._qsize() >= self.maxsize:
--> 130                         raise Full
    131                 elif timeout is None:
    132                     while self._qsize() >= self.maxsize:

Full:

In [74]:

Q.get()

Out[74]:

'wash dishes'

In [75]:

Q.task_done()
Q.qsize()

Out[75]:

In [76]:

Q.put('check mail')
Q.qsize()

Out[76]:

Deques¶

In [77]:

from collections import deque
D = deque(['wash dishes', 'water flowers', 'check mail'])

In [78]:

D.pop()

Out[78]:

'check mail'

In [80]:

D.popleft()

Out[80]:

'wash dishes'

Named Tuples¶

The namedtuples collection are a convenient way to store data in an expressive way. You can think of them as a hybrid of tuples and dictionaries - getting small memory footprint and immutability from the former, and keyword-access from the latter. They are very effective if you work with multiple data points of the same structure and no bound logic - for example, users or listings. Before we actually create an instance, we'll have to specify it's properties, first:

In [81]:

from collections import namedtuple
user = namedtuple('User', 'name, surname, age')

peter = user(name='Peter', surname='Pan', age=13) 
wendy = user(name='Wendy', surname='Darling', age=13)

In [82]:

peter[0]

Out[82]:

'Peter'

In [83]:

peter[:2]

Out[83]:

('Peter', 'Pan')

In [84]:

wendy.surname

Out[84]:

'Darling'

In [85]:

hasattr(wendy, 'age')

Out[85]:

True

In [86]:

getattr(wendy, 'age')

Out[86]:

Namedtuples are classes, inheriting from a tuple - the resulting instances are 100% normal class instances. You can see the code under the hood by passing verbose=True argument.

With that being said, Python 3.7 introduces Dataclasses — pure python classes, which can solve the somewhat similar set of problems, and little more descriptive and flexible - although take somewhat more memory. We'll learn about data classes in chapter 7.

In [87]:

user = namedtuple('User', 'name, surname, age', verbose=True)

from builtins import property as _property, tuple as _tuple
from operator import itemgetter as _itemgetter
from collections import OrderedDict

class User(tuple):
    'User(name, surname, age)'

    __slots__ = ()

    _fields = ('name', 'surname', 'age')

    def __new__(_cls, name, surname, age):
        'Create new instance of User(name, surname, age)'
        return _tuple.__new__(_cls, (name, surname, age))

    @classmethod
    def _make(cls, iterable, new=tuple.__new__, len=len):
        'Make a new User object from a sequence or iterable'
        result = new(cls, iterable)
        if len(result) != 3:
            raise TypeError('Expected 3 arguments, got %d' % len(result))
        return result

    def _replace(_self, **kwds):
        'Return a new User object replacing specified fields with new values'
        result = _self._make(map(kwds.pop, ('name', 'surname', 'age'), _self))
        if kwds:
            raise ValueError('Got unexpected field names: %r' % list(kwds))
        return result

    def __repr__(self):
        'Return a nicely formatted representation string'
        return self.__class__.__name__ + '(name=%r, surname=%r, age=%r)' % self

    def _asdict(self):
        'Return a new OrderedDict which maps field names to their values.'
        return OrderedDict(zip(self._fields, self))

    def __getnewargs__(self):
        'Return self as a plain tuple.  Used by copy and pickle.'
        return tuple(self)

    name = _property(_itemgetter(0), doc='Alias for field number 0')

    surname = _property(_itemgetter(1), doc='Alias for field number 1')

    age = _property(_itemgetter(2), doc='Alias for field number 2')

Generators¶

Generators are not exactly data structures - they are functions. However, while “normal” functions compute their result and return it at once, generators can be stopped and resumed on the fly, resulting in an Iterable-ish behavior. In other words, you can loop over a generator, retrieving one value at a time. Unlike classic iterables, however, generators are lazy - they compute values once we ask for them, but not before. As a result of that, there are a few significant differences in their behavior:

First, generators use a fixed amount of memory; Even if you asked one to compute zillions of values, a generator will produce and store just one value every time you ask, which is great! In fact, generators can produce an infinite amount of values with no memory issues.
Second, as generators do not store the values, there is no way to retrieve values by their index - in order to get the third element, you need to compute the first two, first. Similarly, there is no way to get “back” to the previous element - if you didn’t store it, the value is lost. Also, there is no way to estimate the length of the generator, except than compute all the values - but again, generators can be infinite.

In order to create a generator function, just create some sort of loop within (we didn't cover loops yet - but bear with us, we'll explain them in the next chapter!), and use yield instead of return. Once the function is called, you can loop over it, as if it is a list or a tuple - or retrieve one value at a time, using Python's built-in next() function:

In [89]:

def my_generator(N, power=2):
    
    # `for` loop, which we'll cover in depth in the next chapter.
    # note that loops require another level of indentation
    for el in range(N):
        yield el**power

In [93]:

N = my_generator(4, power=2)

In [94]:

next(N)

Out[94]:

In [95]:

for el in N:
    print(el)

1
4
9

In the example above we used range() function, that takes 1 to 3 integer arguments, with only one required. If one provided, range will return a generator of numbers from o to this number, excluding it. If two - former will become the starter, and later - the end of the generator, again, excluded. If the third argument is provided, it will be used as a step. There are plenty of other functions within python which return generators. Don't worry if you need list or tuple as a result - just convert them:

In [97]:

list(range(5))

Out[97]:

[0, 1, 2, 3, 4]

range object has some "syntactic sugar" functionality - for example, it CAN check for inclusion without actually calculating the values (which is a very easy thing to do, if you think it thoroughly):

In [100]:

20346 in range(0, 100_000_000, 2)

Out[100]:

True

Useful Functions to use with data structures¶

`Sum`, `max` and `min`¶

In [102]:

sum({1:'A', 2:'B'})

Out[102]:

Note that min and max don't require elements to be integers or floats:

In [108]:

max({'A':1, 'B':2})

Out[108]:

'B'

`All` and `Any`¶

ll and any are self-explanatory as well - simply put. they are extended and and or operators, working with multiple values at once:

In [109]:

all([False, True, True])

Out[109]:

False

In [110]:

any([False, True, False])

Out[110]:

True

`Zip`¶

Zip is useful when you have N lists of M elements each, and you need to "transpose" - get M lists of N elements. As range, it will return generator:

In [111]:

data1, data2 = (1, 2, 3, 4, 5), ('A', 'B', 'C', 'D', 'E')

In [114]:

result = list(zip(data1, data2))
result

Out[114]:

[(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E')]

Zip can be done in reverse as well - using result variable as args:

In [115]:

list(zip(*result))

Out[115]:

[(1, 2, 3, 4, 5), ('A', 'B', 'C', 'D', 'E')]

zip can also be used to create data structures:

In [117]:

dict(zip(data1, data2))

Out[117]:

{1: 'A', 2: 'B', 3: 'C', 4: 'D', 5: 'E'}

Map, filter, and reduce¶

Other functions that you might find useful with data structures are map, filter, and reduce. Those are very useful in conjunction with other functionality - for example, lambdas.

map runs given function on every element of the iterable, returning generator:

In [118]:

data1, data2 = (1, 2, 3, 4, 5), ('A', 'B', 'C', 'D', 'E')

In [119]:

list(map(lambda x: x**2, data1))  # converting to list in order to se results

Out[119]:

[1, 4, 9, 16, 25]

In [120]:

list(map(lambda x: x.lower(), data2))

Out[120]:

['a', 'b', 'c', 'd', 'e']

Similarly, filter returns subarray of elements for which function returns true or truthy value:

In [121]:

list(filter(lambda x: x > 3, data1))

Out[121]:

[4, 5]

Comprehensions¶

Comprehensions are a nice and expressive way to work with data structures. Let's start with a simple example:

In [122]:

{el**2 for el in range(3)}

Out[122]:

{0, 1, 4}

Here, curly brackets define that we'll get set as a result. we use range to create the initial iterable, and then "loop" over its values, computing square of each. This is not a real loop, though - list comprehensions are actually faster than loops, and even map, as there is no lambdas - so, no additional costs for stack lookups:

In [126]:

%%timeit
s = set()

for el in range(10):
    s.add(el**2)

3.35 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [128]:

%timeit set(map(lambda x: x**2, range(10)))

3.72 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [127]:

%timeit {el**2 for el in range(10)}

3.11 µs ± 309 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

On top of that, comprehensions can be nested, use if statements (thus, replace filter function), and operate on different data structures:

In [130]:

characters = [
    {'name': 'Peter Pan', 'age': 13, 'type': 'boy'},
    {'name': 'Wendy Darling', 'age': 14, 'type': 'girl'},
    {'name': 'Captain Cook', 'age': 45, 'type': 'pirate'} # just guessing
]

[el['name'] for el in characters if el['age'] < 15]

Out[130]:

['Peter Pan', 'Wendy Darling']

We can even swap keys and values in the dictionary that way!

In [131]:

D = {'A':1, 'B':2 }
{v:k for k, v in D.items()}

Out[131]:

{1: 'A', 2: 'B'}