JournalCrawler (soup)

Related source files: gummy/utils/journal_utils.py, gummy/journals.py, tests/data.py
You can create a new JournalCrawler whose crawl_type is "soup".
from gummy.utils import get_driver
from gummy.journals import *
Translation-Gummy ver.3.4.4
Checking available drivers... (if one of the drivers is built, there is no problem)
[success] local driver can be built.
[failure] remote driver can't be built.
> HTTPConnectionPool(host='selenium', port=4444): Max retries exceeded with url: /wd/hub/session (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x124f1c070>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
DRIVER_TYPE: local
class GoogleJournal(GummyAbstJournal):
pass
self = GoogleJournal()
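The idea above is to subclass the abstract crawler and override only the per-journal hooks. The following is a minimal stdlib-only sketch of that pattern; the names `AbstJournal`, `get_contents`, and the dict-based "soup" are illustrative stand-ins, not gummy's actual API.

```python
# Sketch of the subclass-and-override pattern behind GummyAbstJournal.
# NOTE: AbstJournal and its hooks are hypothetical stand-ins for illustration.

class AbstJournal:
    """Base crawler: runs the pipeline, delegates per-journal logic to hooks."""
    crawl_type = "soup"

    def get_title_from_soup(self, soup):
        return "[UNTITLED]"  # default used when a subclass does not override

    def get_sections_from_soup(self, soup):
        return []  # default: no sections found

    def get_contents(self, soup):
        # The pipeline itself never changes; only the hooks do.
        title = self.get_title_from_soup(soup)
        sections = self.get_sections_from_soup(soup)
        return {"title": title, "num_sections": len(sections)}

class GoogleJournal(AbstJournal):
    """Per-journal subclass: override just the extraction hooks."""
    def get_title_from_soup(self, soup):
        return soup.get("title", "[UNTITLED]")

    def get_sections_from_soup(self, soup):
        return soup.get("sections", [])

crawler = GoogleJournal()
result = crawler.get_contents({"title": "Google", "sections": ["a", "b", "c"]})
print(result)  # {'title': 'Google', 'num_sections': 3}
```

In gummy itself, `GummyAbstJournal` plays the role of `AbstJournal`, and the hooks you define below (`get_title_from_soup`, `get_sections_from_soup`, `get_head_from_section`) are attached to the instance.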
def get_soup_driver(url):
    with get_driver() as driver:
        soup = self.get_soup_source(url=url, driver=driver)
        cano_url = canonicalize(url=url, driver=driver)
    return soup, cano_url

def get_soup(url):
    cano_url = canonicalize(url=url, driver=None)
    soup = self.get_soup_source(url=url, driver=None)
    return soup, cano_url
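The two helpers differ in one thing: `get_soup_driver` renders the page through a Selenium-style driver (so JavaScript-injected content is present), while `get_soup` fetches raw HTML only. A stdlib sketch of that split, with `FakeDriver` standing in for a real WebDriver (it is hypothetical, not gummy's `get_driver`):

```python
# Two fetching paths: rendered (driver) vs raw HTML (no driver).
# FakeDriver is a hypothetical stand-in for a Selenium WebDriver.

class FakeDriver:
    """Stand-in for a WebDriver used as a context manager."""
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False  # do not suppress exceptions

    def get_page_source(self, url):
        # A real driver would execute JavaScript before returning the DOM.
        return f"<html><body>rendered {url}</body></html>"

def fetch_with_driver(url):
    """Driver path: dynamically injected tags are present in the result."""
    with FakeDriver() as driver:
        return driver.get_page_source(url)

def fetch_without_driver(url):
    """Driver-less path: faster, but misses JS-generated content."""
    return f"<html><body>raw {url}</body></html>"

page_url = "https://www.google.com/"
print(fetch_with_driver(page_url))
print(fetch_without_driver(page_url))
```

This difference explains the diverging results later in this page: the driver path finds the title element and three sections, the raw path does not.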
url = input()
https://www.google.com/
get_contents_soup
soup, cano_url = get_soup_driver(url)
self._store_crawled_info(cano_url=cano_url)
print(f"canonicalized URL: {toBLUE(cano_url)}")
DRIVER_TYPE: local
/Users/iwasakishuto/Github/Translation-Gummy/gummy/gateways.py:117: GummyImprementationWarning: UselessGateWay doesn't support any individual journal, please define a method corresponding to a journal named Hoge with a name _pass2hoge
  warnings.warn(message=msg, category=GummyImprementationWarning)
Use UselessGateWay._pass2others method.
Wait up to 3[s] for all page elements to load.
Scroll down to the bottom of the page.
Decompose unnecessary tags to make it easy to parse.
==============================
Decomposed <i> tag (0)
Decomposed <link> tag (0)
Decomposed <meta> tag (4)
Decomposed <noscript> tag (0)
Decomposed <script> tag (13)
Decomposed <style> tag (25)
Decomposed <sup> tag (0)
Decomposed <None> tag (0)
canonicalized URL: https://www.google.com/
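The "Decompose unnecessary tags" step strips tags such as `<script>` and `<style>` (and their contents) before parsing, counting what was removed. A self-contained sketch of that idea using only `html.parser` (the real library uses BeautifulSoup's `decompose`):

```python
# Strip and count "noise" tags before parsing, like gummy's decompose step.
# Stdlib-only sketch; gummy itself uses BeautifulSoup's Tag.decompose().
from html.parser import HTMLParser

DECOMPOSE = {"i", "link", "meta", "noscript", "script", "style", "sup"}
VOID = {"link", "meta"}  # void tags: counted, but have no closing tag

class TagDecomposer(HTMLParser):
    """Drops DECOMPOSE tags (and their inner text) and counts removals."""
    def __init__(self):
        super().__init__()
        self.depth = 0  # >0 while inside a decomposed element
        self.counts = {t: 0 for t in DECOMPOSE}
        self.kept = []  # text outside decomposed elements

    def handle_starttag(self, tag, attrs):
        if tag in DECOMPOSE:
            self.counts[tag] += 1
            if tag not in VOID:
                self.depth += 1

    def handle_endtag(self, tag):
        if tag in DECOMPOSE and tag not in VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.kept.append(data.strip())

html_doc = """<html><head><meta charset="utf-8"><style>body{}</style></head>
<body><p>keep me</p><script>var x=1;</script></body></html>"""

p = TagDecomposer()
p.feed(html_doc)
print(p.counts["script"], p.counts["style"], p.counts["meta"])  # 1 1 1
print(p.kept)  # ['keep me']
```

The counts printed here correspond to the `Decomposed <tag> tag (n)` lines in the log above.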
get_title_from_soup
title = find_target_text(soup=soup, name="div", attrs={"id": "SIvCob"}, strip=True, default=self.default_title)
print(f"title: {toGREEN(title)}")
title: Google 検索は次の言語でもご利用いただけます: English

(The Japanese banner text reads, in English: "Google Search offered in: English".)
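`find_target_text` looks up the first tag matching a name and attribute set and returns its text, falling back to a default. A stdlib stand-in for that behaviour (the class and function here are illustrative, not gummy's implementation):

```python
# Stdlib stand-in for gummy's find_target_text(soup, name=..., attrs=...,
# default=...). Illustrative only; gummy uses BeautifulSoup underneath.
from html.parser import HTMLParser

class TargetText(HTMLParser):
    """Collects the text of the first tag matching (name, attrs)."""
    def __init__(self, name, attrs):
        super().__init__()
        self.name, self.attrs = name, attrs
        self.inside = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if not self.done and tag == self.name and all(
            (k, v) in attrs for k, v in self.attrs.items()
        ):
            self.inside = True

    def handle_endtag(self, tag):
        if self.inside and tag == self.name:
            self.inside = False
            self.done = True  # only the first match is kept

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

def find_target_text(html, name, attrs, default="[UNTITLED]"):
    p = TargetText(name, attrs)
    p.feed(html)
    text = "".join(p.chunks).strip()
    return text or default  # fall back when no match / empty text

html_doc = '<body><div id="SIvCob">Google Search offered in: English</div></body>'
print(find_target_text(html_doc, "div", {"id": "SIvCob"}))
print(find_target_text(html_doc, "div", {"id": "missing"}))  # -> default
```

The `default` fallback is what produces the timestamp title in the driver-less run later on this page.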
get_sections_from_soup
sections = soup.find_all(name="center")
print(f"num sections: {toBLUE(len(sections))}")
num sections: 3
get_head_from_section
def get_head_from_section(section):
    head = section.find(name="input")
    return head

self.get_head_from_section = get_head_from_section
contents = self.get_contents_from_soup_sections(sections)
Show contents of the paper.
==============================
[1/3]
[2/3]
[3/3]
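Under the hood, `get_contents_from_soup_sections` walks each section, pulls its head via the hook you attached, and reports `[i/n]` progress as seen above. A minimal sketch of that loop with plain dicts in place of BeautifulSoup tags (names here are illustrative):

```python
# Sketch of the progress loop behind get_contents_from_soup_sections.
# Sections are plain dicts here, not real BeautifulSoup tags.

def get_head_from_section(section):
    # Mirrors section.find(name="input"); may return None if absent.
    return section.get("input")

def get_contents_from_sections(sections, get_head):
    results = []
    total = len(sections)
    for i, section in enumerate(sections, start=1):
        head = get_head(section)
        results.append({"head": head, "body": section.get("body", "")})
        print(f"[{i}/{total}]")  # the progress lines shown in the log
    return results

sections = [
    {"input": "Search", "body": "..."},
    {"input": None, "body": "..."},
    {"input": "Lucky", "body": "..."},
]
results = get_contents_from_sections(sections, get_head_from_section)
print(len(results))  # 3
```

Passing the hook as an argument mirrors how the notebook attaches `get_head_from_section` to the instance before calling the pipeline.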
soup, cano_url = get_soup(url)
self._store_crawled_info(cano_url=cano_url)
print(f"canonicalized URL: {toBLUE(cano_url)}")
Get HTML content from https://www.google.com/
Decompose unnecessary tags to make it easy to parse.
==============================
Decomposed <i> tag (0)
Decomposed <link> tag (0)
Decomposed <meta> tag (4)
Decomposed <noscript> tag (0)
Decomposed <script> tag (6)
Decomposed <style> tag (2)
Decomposed <sup> tag (0)
Decomposed <None> tag (0)
canonicalized URL: https://www.google.com/
get_title_from_soup
title = find_target_text(soup=soup, name="div", attrs={"id": "SIvCob"}, strip=True, default=self.default_title)
print(f"title: {toGREEN(title)}")
title: 2020-09-10@23.18.14

(Without the driver, the target div is not present in the raw HTML, so the `default` value — a timestamp from `self.default_title` — is returned instead.)
get_sections_from_soup
sections = soup.find_all(name="center")
print(f"num sections: {toBLUE(len(sections))}")
num sections: 1
get_head_from_section
def get_head_from_section(section):
    head = section.find(name="input")
    return head

self.get_head_from_section = get_head_from_section
contents = self.get_contents_from_soup_sections(sections)
Show contents of the paper.
==============================
[1/1]
NOTE: You also have to modify these variables:
from gummy import TranslationGummy
# model = TranslationGummy()
# model.toPDF(url=url)
If successful, edit here too: