from collections import OrderedDict

import h2o

h2o.init()  # start or connect to a local H2O cluster
documents = [
'H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.',
'Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent\'s net to score goals. The sport is known to be fast-paced and physical.',
'An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.'
]
doc_ids = list(range(len(documents)))
input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),
                           column_types=['numeric', 'string'])
input_frame.head()
Parse progress: |█████████████████████████████████████████████████████████| 100%
DocID | Document |
---|---|
0 | H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark. |
1 | Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. The sport is known to be fast-paced and physical. |
2 | An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses. |
from h2o.information_retrieval.tf_idf import tf_idf
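# With preprocess=False the text column is not split, so each whole document counts as one "word".
tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', preprocess=False, case_sensitive=False)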
tf_idf_out.head()
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
2 | an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses. | 1 | 0.693147 | 0.693147 |
0 | h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark. | 1 | 0.693147 | 0.693147 |
1 | ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical. | 1 | 0.693147 | 0.693147 |
from IPython.display import display

VALUES_CNT_TO_SHOW = 3

def tf_idf_output_summary(tf_idf_out):
    """Show the highest and lowest TF-IDF rows for every document in doc_ids."""
    for doc_id in doc_ids:
        # Ascending sort: the lowest values are at the head, the highest at the tail.
        sorted_doc_tf_idfs = tf_idf_out[tf_idf_out['DocID'] == doc_id].sort(by='TF-IDF')
        print('The highest TF-IDF values for document ' + str(doc_id) + ':')
        display(sorted_doc_tf_idfs.tail(VALUES_CNT_TO_SHOW))
        print('The lowest TF-IDF values for document ' + str(doc_id) + ':')
        display(sorted_doc_tf_idfs.head(VALUES_CNT_TO_SHOW))
        print('\n')
tf_idf_output_summary(tf_idf_out)
The highest TF-IDF values for document 0:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
0 | h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark. | 1 | 0.693147 | 0.693147 |
The lowest TF-IDF values for document 0:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
0 | h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark. | 1 | 0.693147 | 0.693147 |
The highest TF-IDF values for document 1:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
1 | ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical. | 1 | 0.693147 | 0.693147 |
The lowest TF-IDF values for document 1:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
1 | ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical. | 1 | 0.693147 | 0.693147 |
The highest TF-IDF values for document 2:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
2 | an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses. | 1 | 0.693147 | 0.693147 |
The lowest TF-IDF values for document 2:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
2 | an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses. | 1 | 0.693147 | 0.693147 |
# Split each document on whitespace so that every row holds one (DocID, word) pair.
preprocessed_data = [(doc_id, word)
                     for doc_id, document in enumerate(documents)
                     for word in document.split()]
preprocessed_input_frame = h2o.H2OFrame(preprocessed_data,
                                        column_names=['DocID', 'Document'],
                                        column_types=['numeric', 'string'])
preprocessed_input_frame.head()
Parse progress: |█████████████████████████████████████████████████████████| 100%
DocID | Document |
---|---|
0 | H2O |
0 | is |
0 | an |
0 | in-memory |
0 | platform |
0 | for |
0 | distributed, |
0 | scalable |
0 | machine |
0 | learning. |
tf_idf_out = tf_idf(preprocessed_input_frame, 'DocID', 'Document', preprocess=False)
tf_idf_out.head()
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
2 | (Ab), | 1 | 0.693147 | 0.693147 |
2 | (Ig), | 1 | 0.693147 | 0.693147 |
2 | An | 1 | 0.693147 | 0.693147 |
0 | Flow | 1 | 0.693147 | 0.693147 |
0 | H2O | 2 | 0.693147 | 1.38629 |
0 | Hadoop | 1 | 0.693147 | 0.693147 |
1 | Ice | 1 | 0.693147 | 0.693147 |
0 | JSON | 1 | 0.693147 | 0.693147 |
0 | Java, | 1 | 0.693147 | 0.693147 |
0 | Python, | 1 | 0.693147 | 0.693147 |
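The words in this output still carry punctuation (`(Ab),`, `Java,`, `Python,`), because a plain whitespace split keeps it attached. A minimal sketch of one way to clean tokens before building the frame follows; `clean_tokens` is a hypothetical helper written for illustration and is not part of H2O.

```python
import string

def clean_tokens(text):
    """Split on whitespace and strip leading/trailing punctuation from each token."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if token:  # drop tokens that consisted only of punctuation
            tokens.append(token)
    return tokens

print(clean_tokens('An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein.'))
```

Only the ends of each token are stripped, so inner characters such as the hyphen in `Y-shaped` are kept.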
tf_idf_output_summary(tf_idf_out)
The highest TF-IDF values for document 0:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
0 | works | 1 | 0.693147 | 0.693147 |
0 | H2O | 2 | 0.693147 | 1.38629 |
0 | like | 2 | 0.693147 | 1.38629 |
The lowest TF-IDF values for document 0:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
0 | and | 3 | 0 | 0 |
0 | is | 1 | 0 | 0 |
0 | an | 1 | 0.287682 | 0.287682 |
The highest TF-IDF values for document 1:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
1 | in | 2 | 0.693147 | 1.38629 |
1 | sport | 2 | 0.693147 | 1.38629 |
1 | their | 2 | 0.693147 | 1.38629 |
The lowest TF-IDF values for document 1:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
1 | and | 1 | 0 | 0 |
1 | is | 2 | 0 | 0 |
1 | known | 1 | 0.287682 | 0.287682 |
The highest TF-IDF values for document 2:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
2 | viruses. | 1 | 0.693147 | 0.693147 |
2 | as | 2 | 0.693147 | 1.38629 |
2 | by | 2 | 0.693147 | 1.38629 |
The lowest TF-IDF values for document 2:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
2 | and | 1 | 0 | 0 |
2 | is | 2 | 0 | 0 |
2 | a | 1 | 0.287682 | 0.287682 |
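The three distinct IDF values in these tables (0, 0.287682 and 0.693147) are consistent with a smoothed inverse document frequency, log((N + 1) / (DF + 1)), where N = 3 is the number of documents and DF is the number of documents containing the word. This formula is inferred from the numbers shown here and is an assumption, not taken from the H2O source. A short check:

```python
import math

def smoothed_idf(doc_freq, n_docs):
    """Smoothed IDF, assumed here to be log((N + 1) / (DF + 1))."""
    return math.log((n_docs + 1) / (doc_freq + 1))

# A word found in 1, 2 or all 3 of the 3 documents.
for doc_freq in (1, 2, 3):
    print(doc_freq, round(smoothed_idf(doc_freq, 3), 6))
# prints:
# 1 0.693147
# 2 0.287682
# 3 0.0
```

The TF-IDF column is then TF multiplied by IDF, for example 2 * 0.693147 = 1.38629 for `H2O` in document 0.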
input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),
                           column_types=['numeric', 'string'])
Parse progress: |█████████████████████████████████████████████████████████| 100%
tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', case_sensitive=False)
tf_idf_out.head()
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
2 | (ab), | 1 | 0.693147 | 0.693147 |
2 | (ig), | 1 | 0.693147 | 0.693147 |
1 | a | 3 | 0.287682 | 0.863046 |
2 | a | 1 | 0.287682 | 0.287682 |
2 | also | 1 | 0.693147 | 0.693147 |
0 | an | 1 | 0.287682 | 0.287682 |
2 | an | 2 | 0.287682 | 0.575364 |
0 | and | 3 | 0 | 0 |
1 | and | 1 | 0 | 0 |
2 | and | 1 | 0 | 0 |
tf_idf_output_summary(tf_idf_out)
The highest TF-IDF values for document 0:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
0 | works | 1 | 0.693147 | 0.693147 |
0 | h2o | 2 | 0.693147 | 1.38629 |
0 | like | 2 | 0.693147 | 1.38629 |
The lowest TF-IDF values for document 0:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
0 | and | 3 | 0 | 0 |
0 | is | 1 | 0 | 0 |
0 | the | 1 | 0 | 0 |
The highest TF-IDF values for document 1:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
1 | in | 2 | 0.693147 | 1.38629 |
1 | sport | 2 | 0.693147 | 1.38629 |
1 | their | 2 | 0.693147 | 1.38629 |
The lowest TF-IDF values for document 1:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
1 | and | 1 | 0 | 0 |
1 | is | 2 | 0 | 0 |
1 | the | 1 | 0 | 0 |
The highest TF-IDF values for document 2:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
2 | y-shaped | 1 | 0.693147 | 0.693147 |
2 | as | 2 | 0.693147 | 1.38629 |
2 | by | 2 | 0.693147 | 1.38629 |
The lowest TF-IDF values for document 2:
DocID | Word | TF | IDF | TF-IDF |
---|---|---|---|---|
2 | and | 1 | 0 | 0 |
2 | is | 2 | 0 | 0 |
2 | the | 1 | 0 | 0 |
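As a cross-check that does not need an H2O cluster, the same quantities can be sketched in plain Python. The helper `reference_tf_idf` is hypothetical, and it assumes whitespace tokenization, TF as a raw count, and IDF as log((N + 1) / (DF + 1)); these assumptions are inferred from the tables above rather than from the H2O implementation.

```python
import math
from collections import Counter

def reference_tf_idf(docs, case_sensitive=True):
    """Return {(doc_id, word): (tf, idf, tf_idf)} for whitespace-split words."""
    tokenized = []
    for text in docs:
        if not case_sensitive:
            text = text.lower()
        tokenized.append(text.split())

    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    n_docs = len(docs)
    result = {}
    for doc_id, words in enumerate(tokenized):
        for word, tf in Counter(words).items():
            idf = math.log((n_docs + 1) / (doc_freq[word] + 1))
            result[(doc_id, word)] = (tf, idf, tf * idf)
    return result

toy_docs = ['Red red blue', 'blue green', 'blue']
scores = reference_tf_idf(toy_docs, case_sensitive=False)
print(scores[(0, 'red')])   # TF 2, IDF log(4/2)
print(scores[(0, 'blue')])  # TF 1, IDF log(4/4) = 0
```

Applied to the three `documents` with `case_sensitive=False`, this sketch gives TF 2 and a TF-IDF of about 1.386 for `h2o` in document 0, and a TF-IDF of about 0.863 for `a` in document 1, in line with the tables above.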