client Package

webLyzard web service clients

classifier Module

Created on Jan 16, 2013

class weblyzard_api.client.classifier.Classifier(url='http://localhost:8080', usr=None, pwd=None)[source]

Bases: eWRT.ws.rest.MultiRESTClient

Classifier

Provides support for text classification.

Parameters:
  • url – URL of the classifier web service
  • usr – optional user name
  • pwd – optional password
CLASSIFIER_WS_BASE_PATH = '/joseph/rest/'
classify_v2(classifier_profile, weblyzard_xml, search_agents=None, num_results=1)[source]

Classify webLyzard XML documents based on the given classifier profile using the new classifier interface.

Parameters:
  • classifier_profile – the profile to use for classification (e.g. ‘COMET’, ‘MK’)
  • weblyzard_xml – weblyzard_xml representation of the document to classify
  • search_agents –

    a list of search agent dictionaries, composed as follows:

    [
      {"name": "Axa Winterthur",
       "id": 9,
       "product_list": [
          {"name": "AXA WINTERTHUR VERS. PRODUKTE RP", "id": 300682},
          {"name": "AXA WINTERTHUR FINANZ PERSONEN RP", "id": 300803},
          {"name": "AXA WINTERTHUR FINANZ PRODUKTE RP", "id": 300804},
       ]}
    ]

  • num_results – number of classes to return
Returns:

the classification result
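
Note

Example usage (a minimal sketch; the ‘COMET’ profile and the placeholder XML document are illustrative assumptions, not values guaranteed by the service):

from weblyzard_api.client.classifier import Classifier

client = Classifier()  # defaults to http://localhost:8080
# placeholder: a document in the webLyzard XML format
weblyzard_xml = '<wl:page ...>...</wl:page>'
result = client.classify_v2('COMET', weblyzard_xml, num_results=3)
print(result)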

hello_world()[source]

Simple hello world test.

train(classifier_profile, weblyzard_xml, correct_category, incorrect_category=None, document_timestamp=None)[source]

Trains (and corrects) the classifier’s knowledge base.

Parameters:
  • classifier_profile – the profile to use for classification (e.g. ‘COMET’, ‘MK’)
  • weblyzard_xml – weblyzard_xml representation of the document to learn
  • correct_category – the correct category for the document
  • incorrect_category – optional information on the incorrect category returned for this document
  • document_timestamp – an optional timestamp, specifying when the document has been classified (used for retraining temporal knowledge bases)
Returns:

a response object with a status code and message.
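
Note

Example usage (a hedged sketch; the category names and the placeholder document are assumptions):

from weblyzard_api.client.classifier import Classifier

client = Classifier()
# placeholder: a document in the webLyzard XML format
weblyzard_xml = '<wl:page ...>...</wl:page>'
response = client.train('COMET', weblyzard_xml,
                        correct_category='sports',
                        incorrect_category='politics')
print(response)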

domain_specificity Module

class weblyzard_api.client.domain_specificity.DomainSpecificity(url='http://localhost:8080', usr=None, pwd=None)[source]

Bases: eWRT.ws.rest.MultiRESTClient

Domain Specificity Web Service

Determines whether documents are relevant for a given domain by searching for domain relevant terms in these documents.

Workflow

  1. submit a domain-specificity profile with add_profile()
  2. obtain the domain-specificity of text documents with get_domain_specificity(), parse_documents() or search_documents().
Parameters:
  • url – URL of the domain specificity web service
  • usr – optional user name
  • pwd – optional password
URL_PATH = 'rest/domain_specificity'
add_profile(profile_name, profile_mapping)[source]

Adds a domain-specificity profile to the Web service.

Parameters:
  • profile_name – the name of the domain specificity profile
  • profile_mapping – a dictionary of keywords and their respective domain specificity values.
get_domain_specificity(profile_name, documents, is_case_sensitive=True)[source]
Parameters:
  • profile_name – the name of the domain specificity profile to use.
  • documents – a list of document dictionaries
  • is_case_sensitive – whether to consider case or not (default: True)
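
Note

Example usage (a minimal sketch of the workflow above; the profile name, keyword weights, and document fields are assumptions):

from weblyzard_api.client.domain_specificity import DomainSpecificity

client = DomainSpecificity()
# keyword -> domain specificity value (illustrative weights)
client.add_profile('energy', {'wind turbine': 1.0, 'solar panel': 0.9})
documents = [{'content_id': '1',
              'content': 'The wind turbine output rose sharply.'}]
print(client.get_domain_specificity('energy', documents,
                                    is_case_sensitive=False))
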
has_profile(profile_name)[source]

Returns whether the given profile exists on the server.

Parameters:profile_name – the name of the domain specificity profile to check.
Returns:True if the given profile exists on the server.
list_profiles()[source]
Returns:a list of all available domain specificity profiles.
meminfo()[source]
Returns:Information on the web service’s memory consumption
parse_documents(matview_name, documents, is_case_sensitive=False)[source]
Parameters:
  • matview_name – a comma-separated list of matview names to check for domain specificity.
  • documents – a list of document dictionaries
  • is_case_sensitive – whether to consider case or not
Returns:

dict (profile_name: (content_id, dom_spec))

search_documents(profile_name, documents, is_case_sensitive=False)[source]

jeremia Module

class weblyzard_api.client.jeremia.Jeremia(url='http://localhost:8080', usr=None, pwd=None)[source]

Bases: eWRT.ws.rest.MultiRESTClient

Jeremia Web Service

Pre-processes text documents and returns an annotated webLyzard XML document.

Blacklisting

Blacklisting is an optional service that removes sentences occurring repeatedly across different documents, such as document headers or footers.

Sentence blacklisting is handled by get_blacklist(), update_blacklist(), clear_blacklist() and submit_documents_blacklist() (see below).

Jeremia returns a webLyzard XML document. The weblyzard_api package provides the XMLContent class to process and manipulate webLyzard XML documents.

Note

Example usage

from weblyzard_api.client.jeremia import Jeremia
from pprint import pprint

doc = {'id': '192292', 
        'title': 'The document title.', 
        'body': 'This is the document text...', 
        'format': 'text/html', 
        'header': {}}
client = Jeremia()
result = client.submit_document(doc)
pprint(result)
Parameters:
  • url – URL of the jeremia web service
  • usr – optional user name
  • pwd – optional password
ATTRIBUTE_MAPPING = {'lang': 'lang', 'sentences_map': {'token': 'token', 'md5sum': 'id', 'pos': 'pos', 'value': 'value'}, 'content_id': 'id', 'sentences': 'sentence', 'title': 'title'}
URL_PATH = 'jeremia/rest'
clear_blacklist(source_id)[source]
Empties the existing sentence blacklisting cache for the given source_id.

Parameters:source_id – the blacklist’s source id

commit(batch_id, sentence_threshold=None)[source]
Parameters:batch_id – the batch_id to retrieve
Returns:a generator yielding all the documents of that particular batch
get_blacklist(source_id)[source]
Parameters:source_id – the blacklist’s source id
Returns:the sentence blacklist for the given source_id
get_xml_doc(text, content_id='1')[source]

Processes text and returns an XMLContent object.

Parameters:
  • text – the text to process
  • content_id – optional content id
status()[source]
Returns:the status of the Jeremia web service.
submit(batch_id, documents, source_id=None, use_blacklist=False, sentence_threshold=None)[source]

Convenience function to submit documents. It submits the list of documents and finally calls commit() to retrieve the results.

Parameters:
  • batch_id – ID of the batch
  • documents – list of documents (dict)
  • source_id – optional source id; determines the sentence blacklist to apply
  • use_blacklist – whether to use sentence blacklisting
Returns:

the processed documents as a list of dicts
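
Note

Example usage (a hedged sketch; the batch id is a placeholder and the document fields follow the class example above):

from weblyzard_api.client.jeremia import Jeremia

client = Jeremia()
documents = [{'id': '192292',
              'title': 'The document title.',
              'body': 'This is the document text...',
              'format': 'text/html',
              'header': {}}]
for annotated in client.submit('batch-2013-01-16', documents):
    print(annotated)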

submit_document(document)[source]

Processes and annotates a single document with Jeremia.

Parameters:document – the document to be processed
submit_documents(batch_id, documents)[source]
Parameters:
  • batch_id – batch_id to use for the given submission
  • documents – a list of document dictionaries
submit_documents_blacklist(batch_id, documents, source_id)[source]

Submits the documents and removes blacklisted sentences.

Parameters:
  • batch_id – batch_id to use for the given submission
  • documents – a list of document dictionaries
  • source_id – source_id for the documents, determines the blacklist
update_blacklist(source_id, blacklist)[source]

Updates an existing blacklist cache.

Parameters:
  • source_id – the blacklist’s source id
  • blacklist – the sentence blacklist to store for the given source_id
version()[source]
Returns:the version of Jeremia deployed on the server

jesaja Module

class weblyzard_api.client.jesaja.Jesaja(url='http://localhost:8080', usr=None, pwd=None)[source]

Bases: eWRT.ws.rest.MultiRESTClient

Provides access to the Jesaja keyword service.

Jesaja extracts associations (i.e. keywords) from text documents.

Parameters:
  • url – URL of the jesaja web service
  • usr – optional user name
  • pwd – optional password
ATTRIBUTE_MAPPING = {'lang': 'xml:lang', 'sentences_map': {'token': 'token', 'md5sum': 'id', 'pos': 'pos', 'value': 'value'}, 'content_id': 'id', 'sentences': 'sentence', 'title': 'title'}
URL_PATH = 'jesaja/rest'
VALID_CORPUS_FORMATS = ('xml', 'csv')
add_or_update_corpus(corpus_name, corpus_format, corpus, profile_name=None, skip_profile_check=False)[source]

Adds/updates a corpus at Jesaja.

Parameters:
  • corpus_name – the name of the corpus
  • corpus_format – either ‘csv’, ‘xml’, or ‘wlxml’
  • corpus – the corpus in the given format.
  • profile_name – the name of the profile used for tokenization (only used in conjunction with corpus_format ‘doc’).

Note

Supported corpus_format

  • csv

  • xml

  • wlxml:

    # xml_content: the content in the weblyzard xml format
    corpus = [ xml_content, ... ]  
    

Attention

uploading documents (corpus_format = doc, wlxml) requires a call to finalize_corpora to trigger the corpus generation!
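
Note

Example usage (a minimal sketch of a ‘wlxml’ corpus upload; the corpus name and XML content are placeholders):

from weblyzard_api.client.jesaja import Jesaja

client = Jesaja()
# placeholder: a document in the webLyzard XML format
xml_content = '<wl:page ...>...</wl:page>'
client.add_or_update_corpus('reference-corpus', 'wlxml', [xml_content])
# required for 'doc'/'wlxml' corpora: triggers the token count computation
client.finalize_corpora()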

add_or_update_stoplist(name, stoplist)[source]

Deprecated since version 0.1: Use add_stoplist() instead.

add_profile(profile_name, keyword_calculation_profile)[source]

Adds a keyword profile to the server.

Parameters:
  • profile_name – the name of the keyword profile
  • keyword_calculation_profile – the full keyword calculation profile (see below).

Note

Example keyword calculation profile

{ 
    'valid_pos_tags'                 : ['NN', 'P', 'ADJ'],
    'corpus_name'                    : reference_corpus_name,
    'min_phrase_significance'        : 2.0,
    'num_keywords'                   : 5,
    'keyword_algorithm'              : 'com.weblyzard.backend.jesaja.algorithm.keywords.YatesKeywordSignificanceAlgorithm', 
    'min_token_count'                : 5,
    'skip_underrepresented_keywords' : True,
    'stoplists'                      : [],
}

Note

Available keyword_algorithms

  • com.weblyzard.backend.jesaja.algorithm.keywords.YatesKeywordSignificanceAlgorithm
  • com.weblyzard.backend.jesaja.algorithm.keywords.LogLikelihoodKeywordSignificanceAlgorithm
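
Note

Example usage (a hedged sketch combining add_stoplist() and the profile above; the stoplist, profile, and corpus names are placeholders):

from weblyzard_api.client.jesaja import Jesaja

client = Jesaja()
client.add_stoplist('default-stoplist', ['the', 'a', 'an'])
client.add_profile('my-keywords', {
    'valid_pos_tags': ['NN', 'P', 'ADJ'],
    'corpus_name': 'reference-corpus',
    'min_phrase_significance': 2.0,
    'num_keywords': 5,
    'keyword_algorithm': 'com.weblyzard.backend.jesaja.algorithm.'
                         'keywords.YatesKeywordSignificanceAlgorithm',
    'min_token_count': 5,
    'skip_underrepresented_keywords': True,
    'stoplists': ['default-stoplist'],
})
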
add_stoplist(name, stoplist)[source]
Parameters:
  • name – name of the stopword list
  • stoplist – a list of stopwords for the keyword computation
change_log_level(level)[source]

Changes the log level of the keyword service

Parameters:level – the new log level to use.
classmethod convert_document(xml)[source]

Converts an XML string to a dictionary with the correct parameters (ignoring non-sentences and adding titles).

Parameters:xml – str representing the document
Returns:converted document
Return type:dict
finalize_corpora()[source]

Note

This function needs to be called after uploading ‘doc’ or ‘wlxml’ corpora, since it triggers the computations of the token counts based on the ‘valid_pos_tags’ parameter.

finalize_profile(profile_name)[source]
get_cache_stats()[source]
get_cached_corpora()[source]
get_corpus_size(profile_name)[source]
classmethod get_documents(xml_content_dict)[source]

Converts a list of webLyzard XML documents to the JSON format required by the Jesaja web service.

get_keywords(profile_name, documents)[source]
Parameters:
  • profile_name – keyword profile to use
  • documents – a list of webLyzard xml documents to annotate

Note

example documents list

documents = [
  {
     'title': 'Test document',
     'sentence': [
         {
           'id': '27150b5fae553ebab63332fe7b94d518',
           'pos': 'NNP VBZ VBN IN VBZ NNP . NNP VBZ NNP .',
           'token': '0,5 6,8 9,16 17,19 20,27 28,43 43,44 45,48 49,54 55,61 61,62',
           'value': 'CDATA is wrapped as follows <![CDATA[aha]]>. Ana loves Martin.'
         },
         {
           'id': 'f8ddd9b3c8cf4c7764a3348d14e84e79',
            'pos': "NN IN CD ' IN JJR JJR JJR JJR CC CC CC : : JJ NN .",
           'token': '0,4 5,7 8,9 10,11 12,16 17,18 18,19 19,20 20,21 22,23 23,24 25,28 29,30 30,31 32,39 40,45 45,46',
           'value': '10µm in € ” with <><> && and // related stuff.'
         }
     ],
     'content_id': '123k233',
     'lang': 'en',
     }
]
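
A corresponding call would look as follows (a minimal sketch; the profile name is a placeholder and the documents list is abbreviated):

from weblyzard_api.client.jesaja import Jesaja

client = Jesaja()
# abbreviated version of the documents list above
documents = [{'content_id': '123k233', 'lang': 'en',
              'title': 'Test document', 'sentence': []}]
print(client.get_keywords('my-keywords', documents))
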
get_keywords_xml(profile_name, documents)[source]

Converts each document to a dictionary and calculates the keywords.

has_profile(profile_name)[source]
list_profiles()[source]
list_stoplists()[source]
Returns:a list of all available stopword lists.
meminfo()[source]

pos Module

Part-of-speech (POS) tagging service

class weblyzard_api.client.pos.POS(url='http://voyager.srv.weblyzard.net/ws', usr=None, pwd=None)[source]

Bases: eWRT.ws.rest.RESTClient

Parameters:
  • url – URL of the POS web service
  • usr – optional user name
  • pwd – optional password
pos_tagging(text, lang)[source]

Tags the given text using the dictionary for the specified language.

Returns:the corresponding ANNIE compatible annotations
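
Note

Example usage (a minimal sketch; assumes the default service URL from the signature is reachable, and the sample text is illustrative):

from weblyzard_api.client.pos import POS

client = POS()
annotations = client.pos_tagging('Bill Gates founded Microsoft.', 'en')
print(annotations)  # ANNIE-compatible annotations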

recognize Module

class weblyzard_api.client.recognize.EntityLyzardTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

DOCS = [{'xml:lang': 'de', 'id': 99933, 'sentence': [{'token': '0,5 6,12 13,16 17,19 20,23 24,27 28,36 36,37', 'id': '50612085a00cf052d66db97ff2252544', 'value': u'Georg M\xfcller hat 10 Mio CHF gewonnen.', 'pos': 'NE NE VAFIN CARD NE NE VVPP $.'}, {'token': '0,4 5,12 13,19 20,23 24,27 28,35 36,39 40,42 43,46 47,50 50,51 52,55 56,59 60,65 66,72 73,84 84,85 86,92 93,101 101,102', 'id': 'a3b05957957e01060fd58af587427362', 'value': u'Herr Schmidt konnte mit dem Angebot von 10 Mio CHF, das ihm Georg M\xfcller hinterlegte, nichts anfangen.', 'pos': 'NN NE VMFIN APPR ART NN APPR CARD NE NE $, PRELS PPER NE NE VVFIN $, PIS VVINF $.'}]}, {'xml:lang': 'de', 'id': 99934, 'sentence': [{'token': '0,6 7,14 15,23 23,24 25,29 30,33 34,37 38,42 43,47 48,59 60,64 65,69 69,70', 'id': 'f98a0c4d2ddffd60b64b9b25f1f5657a', 'value': u'Rektor Kessler erkl\xe4rte, dass die HTW auch 2014 erfolgreich sein wird.', 'pos': 'NN NE VVFIN $, KOUS ART NN ADV CARD ADJD VAINF VAFIN $.'}]}]
DOCS_XML = ['\n <?xml version="1.0" encoding="UTF-8"?>\n <wl:page xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wl="http://www.weblyzard.com/wl/2013#" dc:title="" wl:id="99933" dc:format="text/html" xml:lang="de" wl:nilsimsa="030472f84612acc42c7206e07814e69888267530636221300baf8bc2da66b476" dc:related="http://www.heise.de http://www.kurier.at">\n <wl:sentence wl:id="50612085a00cf052d66db97ff2252544" wl:pos="NE NE VAFIN CARD NE NE VVPP $." wl:token="0,5 6,12 13,16 17,19 20,23 24,27 28,36 36,37" wl:sem_orient="0.0" wl:significance="0.0"><![CDATA[Georg M\xc3\xbcller hat 10 Mio CHF gewonnen.]]></wl:sentence>\n <wl:sentence wl:id="a3b05957957e01060fd58af587427362" wl:pos="NN NE VMFIN APPR ART NN APPR CARD NE NE $, PRELS PPER NE NE VVFIN $, PIS VVINF $." wl:token="0,4 5,12 13,19 20,23 24,27 28,35 36,39 40,42 43,46 47,50 50,51 52,55 56,59 60,65 66,72 73,84 84,85 86,92 93,101 101,102" wl:sem_orient="0.0" wl:significance="0.0"><![CDATA[Herr Schmidt konnte mit dem Angebot von 10 Mio CHF, das ihm Georg M\xc3\xbcller hinterlegte, nichts anfangen.]]></wl:sentence>\n </wl:page>\n ', '\n <?xml version="1.0" encoding="UTF-8"?>\n <wl:page xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wl="http://www.weblyzard.com/wl/2013#" dc:title="" wl:id="99934" dc:format="text/html" xml:lang="de" wl:nilsimsa="020ee211a20084bb0d2208038548c02405bb0110d2183061db9400d74c15553a" dc:related="http://www.heise.de http://www.kurier.at">\n <wl:sentence wl:id="f98a0c4d2ddffd60b64b9b25f1f5657a" wl:pos="NN NE VVFIN $, KOUS ART NN ADV CARD ADJD VAINF VAFIN $." wl:token="0,6 7,14 15,23 23,24 25,29 30,33 34,37 38,42 43,47 48,59 60,64 65,69 69,70" wl:sem_orient="0.0" wl:significance="0.0"><![CDATA[Rektor Kessler erkl\xc3\xa4rte, dass die HTW auch 2014 erfolgreich sein wird.]]></wl:sentence>\n </wl:page>\n ']
IS_ONLINE = True
TESTED_PROFILES = ['de.people.ng', 'en.geo.500000.ng', 'en.organization.ng', 'en.people.ng']
setUp()[source]
test_entity_lyzard()[source]
test_geo()[source]
test_geo_swiss()[source]

Tests the geo annotation service for Swiss media samples.

Note

de_CH.geo.5000.ng detects Swiss cities with more than 5,000 and worldwide cities with more than 500,000 inhabitants.

test_missing_profiles()[source]
test_organization()[source]
test_password()[source]
test_people()[source]
test_search_xml()[source]
xml = '\n <?xml version="1.0" encoding="UTF-8"?>\n <wl:page xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wl="http://www.weblyzard.com/wl/2013#" dc:title="" wl:id="99934" dc:format="text/html" xml:lang="de" wl:nilsimsa="020ee211a20084bb0d2208038548c02405bb0110d2183061db9400d74c15553a" dc:related="http://www.heise.de http://www.kurier.at">\n <wl:sentence wl:id="f98a0c4d2ddffd60b64b9b25f1f5657a" wl:pos="NN NE VVFIN $, KOUS ART NN ADV CARD ADJD VAINF VAFIN $." wl:token="0,6 7,14 15,23 23,24 25,29 30,33 34,37 38,42 43,47 48,59 60,64 65,69 69,70" wl:sem_orient="0.0" wl:significance="0.0"><![CDATA[Rektor Kessler erkl\xc3\xa4rte, dass die HTW auch 2014 erfolgreich sein wird.]]></wl:sentence>\n </wl:page>\n '
class weblyzard_api.client.recognize.Recognize(url='http://localhost:8080', usr=None, pwd=None)[source]

Bases: eWRT.ws.rest.MultiRESTClient

Provides access to the Recognize Web Service.

Workflow:
  1. pre-load the recognize profiles you need using the add_profile() call.
  2. submit the text or documents to analyze using one of the search_text(), search_document() or search_documents() calls.

Note

Example usage

from weblyzard_api.client.recognize import Recognize
from pprint import pprint

url = 'http://triple-store.ai.wu.ac.at/recognize/rest/recognize'
profile_names = ['en.organization.ng', 'en.people.ng', 'en.geo.500000.ng']
text = ('Microsoft is an American multinational corporation '
        'headquartered in Redmond, Washington, that develops, '
        'manufactures, licenses, supports and sells computer '
        'software, consumer electronics and personal computers '
        'and services. It was founded by Bill Gates and Paul '
        'Allen on April 4, 1975.')

client = Recognize(url)
result = client.search_text(profile_names,
            text,
            output_format='compact',
            max_entities=40,
            buckets=40,
            limit=40)  
pprint(result)
Parameters:
  • url – URL of the recognize web service
  • usr – optional user name
  • pwd – optional password
ATTRIBUTE_MAPPING = {'lang': 'xml:lang', 'sentences_map': {'token': 'token', 'md5sum': 'id', 'pos': 'pos', 'value': 'value'}, 'content_id': 'id', 'sentences': 'sentence'}
OUTPUT_FORMATS = ('standard', 'minimal', 'annie', 'compact')
URL_PATH = 'recognize/rest/recognize'
add_profile(profile_name, force=False)[source]

Pre-loads the given profile.

Parameters:profile_name – the name of the profile to load.

classmethod convert_document(xml)[source]

Converts an XML string to the document dictionary required for transmitting the document to Recognize.

Parameters:xml – weblyzard_xml representation of the document
Returns:the converted document
Return type:dict

Note

non-sentences are ignored and titles are added based on the XmlContent’s interpretation of the document.

get_focus(profile_names, doc_list, max_results=1)[source]
Parameters:
  • profile_names – a list of profile names
  • doc_list – a list of documents to analyze based on the weblyzardXML format
  • max_results – maximum number of results to include
Returns:

the focus and annotation of the given document

get_xml_document(document)[source]
Returns:the correct XML representation required by the Recognize service
list_configured_profiles()[source]
Returns:a list of all profiles supported in the current configuration
list_profiles()[source]
Returns:a list of all pre-loaded profiles
>>> r=Recognize()
>>> r.list_profiles()
[u'Cities.DACH.10000.de_en', u'People.DACH.de']
remove_profile(profile_name)[source]

Removes a profile from the list of pre-loaded profiles.

search_document(profile_names, document, debug=False, max_entities=1, buckets=1, limit=1, output_format='minimal')[source]
Parameters:
  • profile_names – a list of profile names
  • document – a single document to analyze (see example documents below)
  • debug – compute and return an explanation
  • buckets – only return n buckets of hits with the same score
  • max_entities – number of results to return (removes the top hit’s tokens and rescores the result list subsequently)
  • limit – only return that many results
  • output_format – the output format to use (‘standard’, ‘minimal’, ‘annie’)
Return type:

the tagged dictionary

Note

Example document

# option 1: document dictionary
{'content_id': 12, 
 'content': u'the text to analyze'}

# option 2: weblyzardXML
XMLContent('<?xml version="1.0"...').as_list()
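
A hedged sketch using option 1 (the profile name follows the class example above; the entity counts are illustrative):

from weblyzard_api.client.recognize import Recognize

client = Recognize()
client.add_profile('en.people.ng')  # pre-load the profile first
document = {'content_id': 12, 'content': u'Bill Gates founded Microsoft.'}
tagged = client.search_document(['en.people.ng'], document,
                                max_entities=5, buckets=5, limit=5,
                                output_format='minimal')
print(tagged)
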
search_documents(profile_names, doc_list, debug=False, max_entities=1, buckets=1, limit=1, output_format='annie')[source]
Parameters:
  • profile_names – a list of profile names
  • doc_list – a list of documents to analyze (see example below)
  • debug – compute and return an explanation
  • buckets – only return n buckets of hits with the same score
  • max_entities – number of results to return (removes the top hit’s tokens and rescores the result list subsequently)
  • limit – only return that many results
  • output_format – the output format to use (‘standard’, ‘minimal’, ‘annie’)
Return type:

the tagged dictionary

Note

Example document

# option 1: list of document dictionaries
( {'content_id': 12,
   'content': u'the text to analyze'}, )

# option 2: list of weblyzardXML dictionary representations
(XMLContent('<?xml version="1.0"...').as_list(),
 XMLContent('<?xml version="1.0"...').as_list(),)
search_text(profile_names, text, debug=False, max_entities=1, buckets=1, limit=1, output_format='minimal')[source]

Search text for entities specified in the given profiles.

Parameters:
  • profile_names – a list of profile names to search in
  • text – the text to search in
  • debug – compute and return an explanation
  • buckets – only return n buckets of hits with the same score
  • max_entities – number of results to return (removes the top hit’s tokens and rescores the result list subsequently)
  • limit – only return that many results
  • output_format – the output format to use (‘standard’, ‘minimal’, ‘annie’)
Return type:

the tagged text

status()[source]
Returns:the status of the Recognize web service.

sentiment_analysis Module

class weblyzard_api.client.sentiment_analysis.SentimentAnalysis(url='http://voyager.srv.weblyzard.net/ws', usr=None, pwd=None)[source]

Bases: eWRT.ws.rest.RESTClient

Sentiment Analysis Web Service

Parameters:
  • url – URL of the sentiment analysis web service
  • usr – optional user name
  • pwd – optional password
parse_document(text, lang)[source]

Returns the sentiment of the given text for the given language.

Parameters:
  • text – the input text
  • lang – the text’s language
Returns:

sv; n_pos_terms; n_neg_terms; list of tuples, where each tuple contains two dicts:

  • tuple[0]: ambiguous terms and their sentiment values after disambiguation
  • tuple[1]: the context terms with their number of occurrences in the document.
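
Note

Example usage (a minimal sketch; the sample text and language code are assumptions):

from weblyzard_api.client.sentiment_analysis import SentimentAnalysis

client = SentimentAnalysis()
result = client.parse_document('This is a great day.', 'en')
# bundles sv, n_pos_terms, n_neg_terms and the disambiguation tuples
print(result)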

parse_document_list(document_list, lang)[source]

Returns the sentiment of the given documents for the given language.

Parameters:
  • document_list – a list of input documents
  • lang – the text’s language
Returns:

sv; n_pos_terms; n_neg_terms; list of tuples, where each tuple contains two dicts:

  • tuple[0]: ambiguous terms and their sentiment values after disambiguation
  • tuple[1]: the context terms with their number of occurrences in the document.

reset(lang)[source]

Restores the default data files for the given language (if available).

Parameters:lang – the used language

Note

Currently this operation is only supported for German and English.

update_context(context_dict, lang)[source]

Uploads the given context dictionary to the Web service.

Parameters:
  • context_dict – a dictionary containing the context information
  • lang – the used language
update_lexicon(corpus_dict, lang)[source]

Uploads the given corpus dictionary to the Web service.

Parameters:
  • corpus_dict – a dictionary containing the corpus information
  • lang – the used language
update_negation(negation_trigger_dict, lang)[source]

Uploads the given negation triggers to the Web service.

Parameters:
  • negation_trigger_dict – a dictionary mapping negation triggers to their normalized form for the given language, example: {"doesn't": 'doesnt', ...}
  • lang – the used language
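
Note

Example usage (a hedged sketch; the trigger mapping mirrors the parameter example above):

from weblyzard_api.client.sentiment_analysis import SentimentAnalysis

client = SentimentAnalysis()
# negation trigger -> normalized form
client.update_negation({"doesn't": 'doesnt'}, 'en')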