Department of Informatics and Telecommunications - Arta
University of Ioannina

Christos Gogos
http://chgogos.github.io/

Last updated: 16/3/2022

Regular expressions

A regular expression is a sequence of characters that defines a search pattern for text. The following elements can be used to build search patterns:

Special characters: . ^ $ * + ? { } [ ] \ | ( )

character	description
.	any character except a newline (unless re.DOTALL is set)
\d	any digit (0-9; with str patterns, any Unicode decimal digit)
\D	anything that is not a digit
\w	word character (a-z, A-Z, 0-9, _; with str patterns, also Unicode letters and digits, e.g. Greek)
\W	anything that is not a word character
\s	whitespace character (space, tab, newline)
\S	any character that is not whitespace
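As a small illustration of the table above, the following sketch checks how \w behaves with Python str patterns. The sample string is made up for this example; re.ASCII is the standard flag that restricts the character classes to ASCII.

```python
import re

sample = 'λόγος word_1 42'  # made-up sample text

# By default, str patterns are Unicode-aware: \w also matches Greek letters
unicode_words = re.findall(r'\w+', sample)

# With re.ASCII, \w is restricted to a-z, A-Z, 0-9 and _
ascii_words = re.findall(r'\w+', sample, re.ASCII)

print(unicode_words)
print(ascii_words)
```

This is why the Greek test strings used in the examples of this notebook match \w without any extra flag.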

Anchors

anchor	description
\b	word boundary
\B	not a word boundary
^	start of the string
$	end of the string

Character set: defined inside square brackets [], e.g. the character set [aei] matches exactly one of the characters a, e, i

Special characters inside character sets: - ^

[a-zA-Z] matches one character from a to z, lowercase or uppercase
[^a]bc matches any single character other than a, followed by bc
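A negated set still has to consume one character. The short sketch below, with made-up input strings, tries [^a]bc on a few cases to show this.

```python
import re

# [^a] must consume exactly one character that is not 'a'
found = re.findall(r'[^a]bc', 'abc xbc 1bc')
print(found)

# 'bc' on its own does not match, because no character precedes it
alone = re.search(r'[^a]bc', 'bc')
print(alone)
```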

Quantifiers

quantifier	description
*	0 or more
+	1 or more
?	0 or 1
{3}	exactly 3
{3,4}	range (minimum, maximum)
{3,}	at least 3

Groups: defined with parentheses ()

The re module

Finding patterns in text with finditer

In [1]:

import re

print(dir(re))

['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'Match', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pickle', '_special_chars_map', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']

In [2]:

# finditer finds all matches and their positions
# Besides finditer, other functions such as findall, match and search can be used (examples follow)

text = '''I felt happy because I saw the others were happy and because I knew I should feel happy, but I wasn’t really happy.
'''

pattern = re.compile(r'happy|because')
matches =  pattern.finditer(text)
for match in matches:
    print(match)

print("#" * 40)

# collect the results in a list with a comprehension (in 1 line)
print([x for x in re.compile(r'happy|because').finditer(text)])

<re.Match object; span=(7, 12), match='happy'>
<re.Match object; span=(13, 20), match='because'>
<re.Match object; span=(43, 48), match='happy'>
<re.Match object; span=(53, 60), match='because'>
<re.Match object; span=(82, 87), match='happy'>
<re.Match object; span=(109, 114), match='happy'>
########################################
[<re.Match object; span=(7, 12), match='happy'>, <re.Match object; span=(13, 20), match='because'>, <re.Match object; span=(43, 48), match='happy'>, <re.Match object; span=(53, 60), match='because'>, <re.Match object; span=(82, 87), match='happy'>, <re.Match object; span=(109, 114), match='happy'>]
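Each item produced by finditer is a Match object. As a minimal sketch, using a shortened version of the sentence above, the matched text and its position can be read with the standard methods group(), start() and end():

```python
import re

text = 'I felt happy because I saw the others were happy'
pattern = re.compile(r'happy|because')

# each Match object exposes the matched text and its position
results = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
for word, start, end in results:
    # slicing the original text with the span gives back the matched word
    print(word, start, end, text[start:end] == word)
```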

Examples of special characters, anchors, quantifiers and groups in regular expressions

In [3]:

# The special character |

text = 'προστακτικός λογικός συναρτησιακός'
pattern = re.compile(r'προστακτικός|λογικός|συναρτησιακός')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 12), match='προστακτικός'>, <re.Match object; span=(13, 20), match='λογικός'>, <re.Match object; span=(21, 34), match='συναρτησιακός'>]

In [4]:

# Parentheses

text = 'προγραμματισμός προγραμματιστής'
pattern = re.compile(r'προγραμματι(σμός|στής)')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 15), match='προγραμματισμός'>, <re.Match object; span=(16, 31), match='προγραμματιστής'>]

In [5]:

# Newline

text = '''Γλώσσες
Προγραμματισμού'''
pattern = re.compile(r'Γλώσσες\nΠρογραμματισμού')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 23), match='Γλώσσες\nΠρογραμματισμού'>]

In [6]:

# Character sets with []

text = 'λογικό λογική'
pattern = re.compile(r'λογικ[όή]')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 6), match='λογικό'>, <re.Match object; span=(7, 13), match='λογική'>]

In [7]:

# Character sets with []

text = 'αβηθικ αγηθικ αδηθικ αεηθικ αζηθικ'
pattern = re.compile(r'α[βγδεζ]ηθικ') # exactly one of the characters β, γ, δ, ε or ζ must appear between α and ηθικ
print([x.group(0) for x in pattern.finditer(text)])

['αβηθικ', 'αγηθικ', 'αδηθικ', 'αεηθικ', 'αζηθικ']

In [8]:

# Character sets with []

text = 'αβχψω αγχψω αδχψω αεχψω ακχψω αΒχψω αΓχψω αΔχψω'
pattern = re.compile(r'α[β-εκΒ-Δ]χψω') # exactly one character from β through ε, or κ, or Β through Δ, must appear between α and χψω
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 5), match='αβχψω'>, <re.Match object; span=(6, 11), match='αγχψω'>, <re.Match object; span=(12, 17), match='αδχψω'>, <re.Match object; span=(18, 23), match='αεχψω'>, <re.Match object; span=(24, 29), match='ακχψω'>, <re.Match object; span=(30, 35), match='αΒχψω'>, <re.Match object; span=(36, 41), match='αΓχψω'>, <re.Match object; span=(42, 47), match='αΔχψω'>]

In [9]:

# Character sets with []

text = 'αβω αγω αδω αεω αζω'
pattern = re.compile(r'α[^βδ]ω') # any single character other than β and δ must appear between α and ω
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(4, 7), match='αγω'>, <re.Match object; span=(12, 15), match='αεω'>, <re.Match object; span=(16, 19), match='αζω'>]

In [10]:

# The ? quantifier

text = 'αβγδεζ, αβγδζ'
pattern = re.compile(r'αβγδε?ζ') # the character ε may or may not be present
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 6), match='αβγδεζ'>, <re.Match object; span=(8, 13), match='αβγδζ'>]

In [11]:

# The * quantifier

text = 'αγδεζ αβγδεζ αββγδεζ αβββγδεζ'
pattern = re.compile(r'αβ*γδεζ') # the character β may be absent or appear 1 or more times
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 5), match='αγδεζ'>, <re.Match object; span=(6, 12), match='αβγδεζ'>, <re.Match object; span=(13, 20), match='αββγδεζ'>, <re.Match object; span=(21, 29), match='αβββγδεζ'>]

In [12]:

# The + quantifier

text = 'αγδεζ αβγδεζ αββγδεζ αβββγδεζ'
pattern = re.compile(r'αβ+γδεζ') # the character β must appear 1 or more times
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(6, 12), match='αβγδεζ'>, <re.Match object; span=(13, 20), match='αββγδεζ'>, <re.Match object; span=(21, 29), match='αβββγδεζ'>]

In [13]:

# Parentheses and +

text = 'αβγδεζ αβγβγδεζ αβγβγβγδεζ'
pattern = re.compile(r'α(βγ)+δεζ') # the string βγ must appear 1 or more times
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 6), match='αβγδεζ'>, <re.Match object; span=(7, 15), match='αβγβγδεζ'>, <re.Match object; span=(16, 26), match='αβγβγβγδεζ'>]

In [14]:

# The {} quantifier

text = 'α αα ααα αααα ααααα'
pattern = re.compile(r'α{2}') # the string αα
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(2, 4), match='αα'>, <re.Match object; span=(5, 7), match='αα'>, <re.Match object; span=(9, 11), match='αα'>, <re.Match object; span=(11, 13), match='αα'>, <re.Match object; span=(14, 16), match='αα'>, <re.Match object; span=(16, 18), match='αα'>]

In [15]:

# The {} quantifier

text = 'α αα ααα αααα ααααα'
pattern = re.compile(r'α{2,4}') # the strings αα, ααα, αααα
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(2, 4), match='αα'>, <re.Match object; span=(5, 8), match='ααα'>, <re.Match object; span=(9, 13), match='αααα'>, <re.Match object; span=(14, 18), match='αααα'>]

In [16]:

# The {} quantifier

text = 'α αα ααα αααα ααααα'
pattern = re.compile(r'α{2,}') # the strings αα, ααα, αααα, etc.
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(2, 4), match='αα'>, <re.Match object; span=(5, 8), match='ααα'>, <re.Match object; span=(9, 13), match='αααα'>, <re.Match object; span=(14, 19), match='ααααα'>]

In [17]:

# Escaping special characters with \

text = 'C*B*L F*RTR.N'
pattern = re.compile(r'C\*B\*L|F\*RTR\.N') # escape the special characters * and .
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 5), match='C*B*L'>, <re.Match object; span=(6, 13), match='F*RTR.N'>]

In [18]:

# The \d character (digit)

text = 'α 1 β 2 γ 3 δ 4 ... ω 24'
pattern = re.compile(r'\d')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(2, 3), match='1'>, <re.Match object; span=(6, 7), match='2'>, <re.Match object; span=(10, 11), match='3'>, <re.Match object; span=(14, 15), match='4'>, <re.Match object; span=(22, 23), match='2'>, <re.Match object; span=(23, 24), match='4'>]

In [19]:

# The word-boundary anchor \b

text = 'αβγ δεζη θι κλμ'
pattern = re.compile(r'\b\w{3}\b') # start of word, 3 word characters, end of word
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 3), match='αβγ'>, <re.Match object; span=(12, 15), match='κλμ'>]

In [20]:

# The inverse word-boundary anchor \B

text = 'αβγ δεζη θι κλμ'
pattern = re.compile(r'\b\w{3}\B') # start of word, 3 word characters, not at the end of a word
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(4, 7), match='δεζ'>]

In [21]:

# The special character .

text = 'αβγδε α123ε αχψωε'
pattern = re.compile(r'α...ε')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 5), match='αβγδε'>, <re.Match object; span=(6, 11), match='α123ε'>, <re.Match object; span=(12, 17), match='αχψωε'>]

In [22]:

# start of text (the special character ^)

text = 'αβγδ αβγδε αβγδεζ'
pattern = re.compile(r'^αβγ*')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 3), match='αβγ'>]

In [23]:

# end of text (the special character $)

text = 'αβγδεζ αβγδεζ αβγδεζ'
pattern = re.compile(r'δεζ$')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(17, 20), match='δεζ'>]
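By default ^ and $ refer to the whole string. The sketch below, on a made-up three-line string, suggests how the standard re.MULTILINE flag makes ^ match at the start of every line as well.

```python
import re

text = 'αβγ\nδεζ\nαβγ'  # three lines, made up for this example

# without flags, ^ matches only at the start of the whole string
single = [m.span() for m in re.finditer(r'^αβγ', text)]

# with re.MULTILINE, ^ also matches right after each newline
multi = [m.span() for m in re.finditer(r'^αβγ', text, re.MULTILINE)]

print(single)
print(multi)
```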

In [24]:

# anything that is not a digit

text = 'α1β2 γ3δ4 ε5ζ6 η7θ8'
pattern = re.compile(r'\D')
print([x.group(0) for x in pattern.finditer(text)])

['α', 'β', ' ', 'γ', 'δ', ' ', 'ε', 'ζ', ' ', 'η', 'θ']

In [25]:

# anything that is a word character

text = '* @ α1β2 γ3δ4 ε5ζ6 η7θ8 _'
pattern = re.compile(r'\w')
print([x.group(0) for x in pattern.finditer(text)])

['α', '1', 'β', '2', 'γ', '3', 'δ', '4', 'ε', '5', 'ζ', '6', 'η', '7', 'θ', '8', '_']

In [26]:

# anything that is not a word character

text = '* @ α1β2 γ3δ4 ε5ζ6 η7θ8 _'
pattern = re.compile(r'\W')
print([x.group(0) for x in pattern.finditer(text)])

['*', ' ', '@', ' ', ' ', ' ', ' ', ' ']

In [27]:

# anything that is whitespace

text = '* @ α1β2 γ3δ4 ε5ζ6 η7θ8 _  '
pattern = re.compile(r'\s')
print([x.group(0) for x in pattern.finditer(text)])

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

In [28]:

# anything that is not whitespace

text = '* @ α1β2 γ3δ4 ε5ζ6 η7θ8 _  '
pattern = re.compile(r'\S')
print([x.group(0) for x in pattern.finditer(text)])

['*', '@', 'α', '1', 'β', '2', 'γ', '3', 'δ', '4', 'ε', '5', 'ζ', '6', 'η', '7', 'θ', '8', '_']

Groups

Groups make it possible to extract parts of the text that matches the pattern being searched for. These parts are marked with parentheses and are called groups.

In [29]:

# groups
# group() or group(0) is the whole match
# group(1) is the first group
# group(2) is the second group
# and so on

text = 'Επικοινωνήστε με τα emails: john.doe@company.com ή jane.doe@institute.org'
pattern = re.compile(r'([\w\.-]+)@([\w-]+\.(\w{2,3}))')
for match in pattern.finditer(text):
    print(f'match={match.group()} group1={match.group(1)} group2={match.group(2)} group3={match.group(3)}')

match=john.doe@company.com group1=john.doe group2=company.com group3=com
match=jane.doe@institute.org group1=jane.doe group2=institute.org group3=org

In [30]:

# backreferences to groups
# e.g. pairs of 2 characters separated by a hyphen and repeated

text = 'αα-αα β-β αβ-αβ αβ-βα'
pattern = re.compile(r'(\w{2})-\1') # \1 refers to the first group
print([x.group(0) for x in pattern.finditer(text)])

['αα-αα', 'αβ-αβ']

In [50]:

# words that start and end with the same character (the match includes the trailing non-word character, if any)

text = 'area circle bomb urgent example'
pattern = re.compile(r'\b([a-z])[a-z]*\1(\W|$)') # \1 refers to the first group
print([x.group(0) for x in pattern.finditer(text)])

['area ', 'bomb ', 'example']
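The matches above carry a trailing space because (\W|$) consumes one character. One possible variant, sketched here on the same text, ends the pattern with \b, which asserts a word boundary without consuming anything:

```python
import re

text = 'area circle bomb urgent example'

# \b at the end asserts a word boundary without consuming a character
pattern = re.compile(r'\b([a-z])[a-z]*\1\b')
words = [m.group(0) for m in pattern.finditer(text)]
print(words)
```

Like the original pattern, this variant does not match single-letter words, since \1 needs a second occurrence of the first letter.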

Commonly used regexes

In [31]:

# Integers in a specific range (e.g. integers from 1 up to and including 42)

text = '1 18 35 42 56'
pattern = re.compile(r'\b([1-9]|[123]\d|4[0-2])\b') 
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 1), match='1'>, <re.Match object; span=(2, 4), match='18'>, <re.Match object; span=(5, 7), match='35'>, <re.Match object; span=(8, 10), match='42'>]

In [32]:

# Real numbers
text = '3.1415'
pattern = re.compile(r'^[+-]?([0-9]*[.])?[0-9]+$') 
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 6), match='3.1415'>]
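To get a feel for what this pattern accepts and rejects, the sketch below runs it over a handful of made-up candidate strings. Note that it rejects a trailing dot and exponent notation.

```python
import re

real_number = re.compile(r'^[+-]?([0-9]*[.])?[0-9]+$')

# made-up inputs: the first four should be accepted, the rest rejected
candidates = ['3.1415', '-2', '.5', '+0.25', '3.', '1e5', 'abc']
accepted = [s for s in candidates if real_number.search(s)]
rejected = [s for s in candidates if not real_number.search(s)]

print(accepted)
print(rejected)
```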

In [33]:

# Email addresses (see also http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html)

text = "john@company.com"
pattern = re.compile(r'^([\w\.-]+)@([\w-]+\.(\w{2,3}))$')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 16), match='john@company.com'>]

In [34]:

# Phone numbers

text = "2681012345"
pattern = re.compile(r'^\d{10}$')
print([x for x in pattern.finditer(text)])

[<re.Match object; span=(0, 10), match='2681012345'>]
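For validating a whole string, fullmatch is an alternative to wrapping the pattern in ^ and $. The helper name is_phone below is hypothetical; the sketch also shows one subtle difference, namely that $ accepts a trailing newline while fullmatch does not.

```python
import re

phone = re.compile(r'\d{10}')

def is_phone(s):
    """Return True only if the whole string is exactly 10 digits."""
    return phone.fullmatch(s) is not None

plain = is_phone('2681012345')
with_newline = is_phone('2681012345\n')

# $ also matches just before a newline at the very end of the string
dollar_with_newline = re.search(r'^\d{10}$', '2681012345\n') is not None

print(plain, with_newline, dollar_with_newline)
```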

Opening a file and searching it for patterns with regex

In [35]:

# matching text in a file

pattern = re.compile(r'happy')

# open the file with a context manager; the explicit encoding keeps the apostrophe intact on systems whose default encoding is not UTF-8
with open('../../../datasets/text_for_regex_experiments.txt', 'r', encoding='utf-8') as f:
    contents = f.read()

print(contents)

matches = pattern.finditer(contents)

for match in matches:
    print(match)

I felt happy because I saw the others were happy and because I knew I should feel happy, but I wasn’t really happy.
I felt sad because I saw the others were sad and because I knew I should feel sad, but I wasn’t really sad.
<re.Match object; span=(7, 12), match='happy'>
<re.Match object; span=(43, 48), match='happy'>
<re.Match object; span=(82, 87), match='happy'>
<re.Match object; span=(109, 114), match='happy'>
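When working with files it is often more useful to know the line on which a match occurs. The following is a minimal sketch that uses an in-memory string as a hypothetical stand-in for the file contents and reports (line number, column) pairs.

```python
import re

# in-memory stand-in for the file contents (hypothetical, three lines)
contents = 'I felt happy because I was happy.\nI felt sad.\nStill happy.'
pattern = re.compile(r'happy')

hits = []
for line_no, line in enumerate(contents.splitlines(), start=1):
    for m in pattern.finditer(line):
        # m.start() is the column of the match within this line
        hits.append((line_no, m.start()))

print(hits)
```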

Other regex matching functions

Besides finditer, there are other pattern-matching functions, such as findall, match and search

In [36]:

# findall returns the matches as a list of strings; if the pattern contains groups, it returns only the groups

pattern = re.compile(r'happy')

with open('../../../datasets/text_for_regex_experiments.txt', 'r', encoding='utf-8') as f:
    contents = f.read()

matches = pattern.findall(contents)
print(matches)

print('*'*40)

pattern = re.compile(r'ha(pp)y')
with open('../../../datasets/text_for_regex_experiments.txt', 'r', encoding='utf-8') as f:
    contents = f.read()

matches = pattern.findall(contents)
print(matches)

['happy', 'happy', 'happy', 'happy']
****************************************
['pp', 'pp', 'pp', 'pp']
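If parentheses are needed only for grouping and findall should still return the whole match, a non-capturing group (?:...) can be used. A short sketch, reusing the Greek test string from the parentheses example:

```python
import re

text = 'προγραμματισμός προγραμματιστής'

# capturing group: findall returns only the group contents
captured = re.findall(r'προγραμματι(σμός|στής)', text)

# non-capturing group: findall returns the whole match
whole = re.findall(r'προγραμματι(?:σμός|στής)', text)

print(captured)
print(whole)
```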

In [37]:

# match succeeds only if the pattern matches at the very beginning of the string; it does not scan forward, so 'sad' is not found here

pattern = re.compile(r'sad')

with open('../../../datasets/text_for_regex_experiments.txt', 'r', encoding='utf-8') as f:
    contents = f.read()

matches = pattern.match(contents)
print(matches)

None

In [38]:

# search returns only the first match, scanning the whole string (all lines)

pattern = re.compile(r'sad')

with open('../../../datasets/text_for_regex_experiments.txt', 'r', encoding='utf-8') as f:
    contents = f.read()

matches = pattern.search(contents)
print(matches)

<re.Match object; span=(123, 126), match='sad'>
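The difference between match and search can be seen on a short made-up string. The sketch also uses the optional start position that the match method of a compiled pattern accepts.

```python
import re

text = 'I felt sad'
pattern = re.compile(r'sad')

at_start = pattern.match(text)     # tries only at position 0
anywhere = pattern.search(text)    # scans forward for the first match
from_pos = pattern.match(text, 7)  # tries only at position 7

print(at_start)
print(anywhere.span())
print(from_pos.span())
```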

Flags and regexes with comments

re.IGNORECASE -> re.I (case-insensitive matching)
re.VERBOSE -> re.X (whitespace and comments are allowed inside the pattern)

In [39]:

# flags

pattern = r'''
    (\w+)   # Group 1: matches 1 or more letters, digits or underscores
    \s+     # 1 or more whitespace characters
    \1      # matches the same text as the first group
''' 
print(re.search(pattern, 'x_z  X_Z', re.VERBOSE | re.IGNORECASE))

<re.Match object; span=(0, 8), match='x_z  X_Z'>
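The two flags can also be used separately. The sketch below is illustrative only: it applies re.I to a made-up string and uses re.X to document a simple day/month/year pattern, which is an assumption of this example and not a complete date validator.

```python
import re

text = 'Python PYTHON python'
case_sensitive = re.findall(r'python', text)
case_insensitive = re.findall(r'python', text, re.I)

# re.X ignores whitespace in the pattern and treats # as a comment marker
date = re.compile(r'''
    (\d{1,2})   # day
    /
    (\d{1,2})   # month
    /
    (\d{4})     # year
''', re.X)
parts = date.search('Last updated: 16/3/2022').groups()

print(case_sensitive)
print(case_insensitive)
print(parts)
```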

Replacing regex patterns in text with re.sub

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

Finds a pattern in a string and replaces each occurrence with another string

In [40]:

# sub

text = '''I felt happy because I saw the others were happy and because I knew I should feel happy, but I wasn’t really happy.
'''

new_text = re.sub(r'happy|because', '<missing word>', text)
print(new_text)

I felt <missing word> <missing word> I saw the others were <missing word> and <missing word> I knew I should feel <missing word>, but I wasn’t really <missing word>.

In [41]:

# replace all digits with *

new_text = re.sub(r'\d', repl='*', string='Ο αριθμός τηλεφώνου του χρήστη είναι 1234567890')
print(new_text)

Ο αριθμός τηλεφώνου του χρήστη είναι **********

In [42]:

# a function can be passed as the repl parameter to get smarter behaviour (e.g. keeping the last 3 digits of the number visible); the function receives the Match object

def keep_last_3(s):
    return '*' * (len(s[0])-3) + s[0][-3:]

new_text = re.sub(r'\d+', repl=keep_last_3, string='Ο αριθμός τηλεφώνου του χρήστη είναι 1234567890')
print(new_text)

Ο αριθμός τηλεφώνου του χρήστη είναι *******890
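Two further features of re.sub are sketched below with made-up inputs: the replacement string can refer to groups with \1, \2, and so on, and the count parameter limits how many replacements are made.

```python
import re

# reorder day/month/year into year-month-day using group references
iso = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1', 'Date: 16/03/2022')

# replace only the first 3 digits
masked = re.sub(r'\d', '*', '1234567890', count=3)

print(iso)
print(masked)
```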

Splitting strings with re.split

re.split returns a list produced by splitting the string at the points where the pattern matches

In [43]:

# split

text = '''I felt happy because I saw the others were happy and because I knew I should feel happy, but I wasn’t really happy.
'''

a_list = re.split(r'happy', text)

print(a_list)

['I felt ', ' because I saw the others were ', ' and because I knew I should feel ', ', but I wasn’t really ', '.\n']
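A few more behaviours of re.split, sketched on made-up strings: the separator can be any pattern, a capturing group keeps the separators in the result, and maxsplit limits the number of splits.

```python
import re

# split on a comma or semicolon followed by optional whitespace
parts = re.split(r'[,;]\s*', 'α, β;γ,  δ')

# a capturing group keeps the separators in the result
with_sep = re.split(r'(-)', 'α-β-γ')

# at most 2 splits, so the remainder stays in one piece
limited = re.split(r'\s+', 'one two  three four', maxsplit=2)

print(parts)
print(with_sep)
print(limited)
```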

Greedy and non-greedy matches

*  --> greedy, 0 or more
*? --> non-greedy, 0 or more
+  --> greedy, 1 or more
+? --> non-greedy, 1 or more

In [44]:

text = 'αβγδεζηθικαλμνξοπρστυφχψωα'
match = re.search(r'α.*α', text) # greedy match
print(match)
match = re.search(r'α.*?α', text) # non-greedy match
print(match)

<re.Match object; span=(0, 26), match='αβγδεζηθικαλμνξοπρστυφχψωα'>
<re.Match object; span=(0, 11), match='αβγδεζηθικα'>
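A common place where the difference matters is text with paired delimiters. The sketch below uses a made-up HTML fragment: the greedy pattern runs to the last >, while the non-greedy one stops at the first.

```python
import re

html = '<b>bold</b>'  # made-up fragment

greedy = re.findall(r'<.*>', html)   # as much as possible
lazy = re.findall(r'<.*?>', html)    # as little as possible

print(greedy)
print(lazy)
```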

Additional examples with regular expressions

In [1]:

txt = """
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
"""

txt = txt.replace("\n", " ")
txt

Out[1]:

" The Zen of Python, by Tim Peters  Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea -- let's do more of those! "

In [10]:

import re

# all words whose first letter is uppercase
re.findall(r"[A-Z]\w*", txt)

Out[10]:

['The',
 'Zen',
 'Python',
 'Tim',
 'Peters',
 'Beautiful',
 'Explicit',
 'Simple',
 'Complex',
 'Flat',
 'Sparse',
 'Readability',
 'Special',
 'Although',
 'Errors',
 'Unless',
 'In',
 'There',
 'Although',
 'Dutch',
 'Now',
 'Although',
 'If',
 'If',
 'Namespaces']

In [11]:

# all words that start with the (lowercase) character e
re.findall(r"\be.*?\b", txt)

Out[11]:

['enough', 'explicitly', 'explain', 'easy', 'explain']

In [12]:

# all words (an apostrophe inside a contraction comes back as a separate token)
re.findall(r"\b\S+?\b", txt)

Out[12]:

['The',
 'Zen',
 'of',
 'Python',
 'by',
 'Tim',
 'Peters',
 'Beautiful',
 'is',
 'better',
 'than',
 'ugly',
 'Explicit',
 'is',
 'better',
 'than',
 'implicit',
 'Simple',
 'is',
 'better',
 'than',
 'complex',
 'Complex',
 'is',
 'better',
 'than',
 'complicated',
 'Flat',
 'is',
 'better',
 'than',
 'nested',
 'Sparse',
 'is',
 'better',
 'than',
 'dense',
 'Readability',
 'counts',
 'Special',
 'cases',
 'aren',
 "'",
 't',
 'special',
 'enough',
 'to',
 'break',
 'the',
 'rules',
 'Although',
 'practicality',
 'beats',
 'purity',
 'Errors',
 'should',
 'never',
 'pass',
 'silently',
 'Unless',
 'explicitly',
 'silenced',
 'In',
 'the',
 'face',
 'of',
 'ambiguity',
 'refuse',
 'the',
 'temptation',
 'to',
 'guess',
 'There',
 'should',
 'be',
 'one',
 'and',
 'preferably',
 'only',
 'one',
 'obvious',
 'way',
 'to',
 'do',
 'it',
 'Although',
 'that',
 'way',
 'may',
 'not',
 'be',
 'obvious',
 'at',
 'first',
 'unless',
 'you',
 "'",
 're',
 'Dutch',
 'Now',
 'is',
 'better',
 'than',
 'never',
 'Although',
 'never',
 'is',
 'often',
 'better',
 'than',
 'right',
 'now',
 'If',
 'the',
 'implementation',
 'is',
 'hard',
 'to',
 'explain',
 'it',
 "'",
 's',
 'a',
 'bad',
 'idea',
 'If',
 'the',
 'implementation',
 'is',
 'easy',
 'to',
 'explain',
 'it',
 'may',
 'be',
 'a',
 'good',
 'idea',
 'Namespaces',
 'are',
 'one',
 'honking',
 'great',
 'idea',
 'let',
 "'",
 's',
 'do',
 'more',
 'of',
 'those']
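In the output above, contractions such as aren't are split into three tokens. One possible way to keep them together, sketched here on a single sentence from the text, is to allow an optional apostrophe part after the word:

```python
import re

sentence = "Special cases aren't special enough to break the rules."

# a word, optionally followed by an apostrophe and more word characters
words = re.findall(r"\b\w+(?:'\w+)?\b", sentence)
print(words)
```

The group is non-capturing, so findall returns the whole word and not just the part after the apostrophe.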

In [18]:

# all words with 5 to 7 characters
re.findall(r"\b[a-z]{5,7}\b", txt.lower())

Out[18]:

['python',
 'peters',
 'better',
 'better',
 'simple',
 'better',
 'complex',
 'complex',
 'better',
 'better',
 'nested',
 'sparse',
 'better',
 'dense',
 'counts',
 'special',
 'cases',
 'special',
 'enough',
 'break',
 'rules',
 'beats',
 'purity',
 'errors',
 'should',
 'never',
 'unless',
 'refuse',
 'guess',
 'there',
 'should',
 'obvious',
 'obvious',
 'first',
 'unless',
 'dutch',
 'better',
 'never',
 'never',
 'often',
 'better',
 'right',
 'explain',
 'explain',
 'honking',
 'great',
 'those']

In [16]:

# all sentences (the text between a whitespace character and the next period)
re.findall(r"\s(.*?)\.", txt)

Out[16]:

['The Zen of Python, by Tim Peters  Beautiful is better than ugly',
 'Explicit is better than implicit',
 'Simple is better than complex',
 'Complex is better than complicated',
 'Flat is better than nested',
 'Sparse is better than dense',
 'Readability counts',
 "Special cases aren't special enough to break the rules",
 'Although practicality beats purity',
 'Errors should never pass silently',
 'Unless explicitly silenced',
 'In the face of ambiguity, refuse the temptation to guess',
 'There should be one-- and preferably only one --obvious way to do it',
 "Although that way may not be obvious at first unless you're Dutch",
 'Now is better than never',
 'Although never is often better than *right* now',
 "If the implementation is hard to explain, it's a bad idea",
 'If the implementation is easy to explain, it may be a good idea']

In [20]:

# all words that start and end with the same (lowercase) character
[x[0] + x[1] for x in re.findall(r"\b([a-z])(\S+?\1)\b", txt)]

Out[20]:

['that']
