<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>nltk on Shylock Hg</title>
    <link>/tags/nltk/</link>
    <description>Recent content in nltk on Shylock Hg</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Mon, 19 Feb 2018 00:00:00 +0000</lastBuildDate>
    
	<atom:link href="/tags/nltk/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Terms in NLP</title>
      <link>/post/2018/02/19/terms-in-nlp/</link>
      <pubDate>Mon, 19 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018/02/19/terms-in-nlp/</guid>
      <description>   Terms
lexicon: a wordbook or dictionary.
homonym: a word pronounced the same as another.
stopword: any of a number of very common words, such as 'the', 'to' and 'also'.
synonym: a word with the same or nearly the same meaning as another word in the language.
hyponym: a more specific word (e.g. 'motorcar' is a hyponym of 'vehicle').
hypernym: a more general word (e.g. 'vehicle' is a hypernym of 'motorcar').
meronym: a word naming a part of a whole (e.g. 'wheel' is a meronym of 'car').
holonym: a word naming the whole that contains a part (e.g. 'car' is a holonym of 'wheel').
code point: the numeric value of a character (such as its ASCII or Unicode value).</description>
    </item>
    
    <item>
      <title>Accessing web and local text</title>
      <link>/post/2018/02/17/accessing-web-and-local-text/</link>
      <pubDate>Sat, 17 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018/02/17/accessing-web-and-local-text/</guid>
      <description>1.Handling plain web text
1.1.Accessing web text
Accessing web text as below:
import urllib
url = 'http://www.gutenberg.org/files/2554/2554.txt'
# get the raw string of the text file
raw = urllib.urlopen(url).read()
# fetch through a proxy
proxy = {'http': 'http://www.yourproxy.com:443'}
raw = urllib.urlopen(url, proxies=proxy).read()

1.2.Tokenizing the text
Tokenizing a text (string) produces a list of tokens.
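The snippet above uses the Python 2 urllib API. Under Python 3 the same fetch would look roughly like this (a sketch; the User-Agent header is illustrative, not from the original post):

```python
from urllib.request import Request, urlopen

url = 'http://www.gutenberg.org/files/2554/2554.txt'
# build the request up front so headers can be attached if needed
req = Request(url, headers={'User-Agent': 'nltk-example'})
# raw = urlopen(req).read().decode('utf-8')  # performs the network fetch
```

Proxy handling moved too: in Python 3 you would install a ProxyHandler via urllib.request.build_opener rather than pass a proxies argument.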
# continuing from the previous snippet
import nltk
# tokenize the raw text string
tokens = nltk.word_tokenize(raw)

1.3.Creating nltk.Text object
We can handle text with nltk after creating an nltk.Text object.</description>
    </item>
    
    <item>
      <title>Lexical Resource</title>
      <link>/post/2018/02/17/lexical-resource/</link>
      <pubDate>Sat, 17 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018/02/17/lexical-resource/</guid>
      <description>1.Wordlist Corpora
Some corpora are nothing more than wordlists, such as the Unix wordlist /usr/dict/words. These are not running text, just collections of words from different areas. For example:
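As a concrete illustration of what a stopword list is for, a minimal filtering sketch (using a small hand-rolled stopword set rather than nltk.corpus.stopwords, so it runs without the NLTK data download):

```python
# a tiny stand-in for nltk.corpus.stopwords.words('english')
stopwords = {'the', 'to', 'and', 'a', 'of', 'is'}

words = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# drop the high-frequency function words, keep the content words
content = [w for w in words if w not in stopwords]
# content -> ['cat', 'sat', 'on', 'mat']
```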
Corpora and descriptions:
nltk.corpus.words: a corpus of common English words.
nltk.corpus.stopwords: high-frequency words with little lexical content, such as 'the'.
nltk.corpus.names: first names categorised by gender, stored in two separate files.</description>
    </item>
    
    <item>
      <title>wordnet</title>
      <link>/post/2018/02/17/wordnet/</link>
      <pubDate>Sat, 17 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018/02/17/wordnet/</guid>
      <description>1.Overview
WordNet is a semantically oriented dictionary of English.
2.Senses and Synonyms
Accessing synonyms as below:
from nltk.corpus import wordnet
# the synsets that the word 'motorcar' belongs to
wordnet.synsets('motorcar')
# the list of lemma names of 'car.n.01'
wordnet.synset('car.n.01').lemma_names()
# the list of lemmas of 'car.n.01'
wordnet.synset('car.n.01').lemmas()
# the definition string of 'car.n.01'
wordnet.synset('car.n.01').definition()
# example sentences for 'car.n.01'
wordnet.synset('car.n.01').examples()

3.The WordNet Hierarchy
Synsets are linked by parent &amp; child (is-a) relations, as shown in the figure in the full post. Accessing hyponyms as below:</description>
    </item>
    
    <item>
      <title>FreqDist in NLTK</title>
      <link>/post/2018/02/16/freqdist-in-nltk/</link>
      <pubDate>Fri, 16 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018/02/16/freqdist-in-nltk/</guid>
      <description>1.FreqDist.
Usage as below:
from nltk import FreqDist
fdist = FreqDist(samples)

FreqDist counts how often each item occurs in a sequence and behaves like a dictionary mapping each item to its count. It is a one-dimensional frequency distribution, so it can only tally one kind of item at a time.
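For plain counting, FreqDist behaves much like the standard library's collections.Counter, so the idea can be sketched without NLTK installed:

```python
from collections import Counter

# counting word tokens, much as FreqDist would
tokens = ['the', 'cat', 'and', 'the', 'hat']
fdist = Counter(tokens)
fdist['the']          # -> 2
fdist.most_common(1)  # -> [('the', 2)]
```

FreqDist adds NLP conveniences on top of this (plots, frequency proportions), but the dictionary-like counting interface is the same.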
API of FreqDist:
FreqDist(samples): create the FreqDist object.
fdist[sample] += 1: increment the count of this sample (written fdist.inc(sample) in old NLTK 2 code).
fdist['sample']: return the count of 'sample'.
fdist.</description>
    </item>
    
    <item>
      <title>Accessing Corpus by NLTK</title>
      <link>/post/2018/02/15/accessing-corpus-by-nltk/</link>
      <pubDate>Thu, 15 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018/02/15/accessing-corpus-by-nltk/</guid>
      <description> 1.Built-in corpora.
For the corpora bundled with NLTK, just import them as below:
from nltk.corpus import &lt;****&gt;

2.User corpora.
For your own corpora, load them with an NLTK corpus reader, as below:
# for plain text
from nltk.corpus import PlaintextCorpusReader
path = '~/text'
pattern = r'.*\.txt'
texts = PlaintextCorpusReader(path, pattern)
# for treebank .mrg files
from nltk.corpus.reader import BracketParseCorpusReader
mrg_path = '~/mrg'
mrg_pattern = r'.*/wsj_.*\.mrg'
mrgs = BracketParseCorpusReader(mrg_path, mrg_pattern)

3.API of corpus objects
fileids(): return the list of all file ids of this corpus.
fileids([categories]): return the list of file ids belonging to the given categories.
categories(): return the list of all categories of this corpus.
categories([fileids]): return the list of categories of the given file ids.
raw(): return the raw text of the corpus as one string.
raw(fileids=[...]): return the raw text of the given file ids.
raw(categories=[...]): return the raw text of the given categories.
words(): return the corpus as a list of words; accepts the same fileids/categories arguments as raw().
sents(): return the corpus as a list of sentences; same arguments again.
abspath(fileid): return the absolute path of a file id.
encoding(fileid): return the encoding of a file id.
open(fileid): open the file and return a file object.
root: the root path of the corpus.
readme(): return the contents of the corpus README file.</description>
    </item>
    
    <item>
      <title>Natural Language Understanding</title>
      <link>/post/2018/02/14/natural-language-understanding/</link>
      <pubDate>Wed, 14 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018/02/14/natural-language-understanding/</guid>
      <description> 1.Word Sense Disambiguation
2.Pronoun Resolution
2.1.Anaphora Resolution
2.2.Semantic Role Labeling
3.Generating Language Output
3.1.Question Answering
3.2.Machine Translation
4.Spoken Dialogue Systems
Understanding the meaning of a question and composing an answer in reply.
5.Textual Entailment </description>
    </item>
    
    <item>
      <title>Simple statistics in NLP</title>
      <link>/post/2018/02/14/simple-statistics-in-nlp/</link>
      <pubDate>Wed, 14 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018/02/14/simple-statistics-in-nlp/</guid>
      <description>1.Frequency Distributions of words.
Much of the time the frequent words, or the notably infrequent ones, reveal the topic of a text. So in this case we want the frequency distribution of its words or collocations.
We can do this with nltk as below:
fdist = FreqDist(text)

2.Selecting words by length.
Sometimes word length also tells us something about a text, especially the distribution of word lengths. We can do this with nltk as below:</description>
    </item>
    
    <item>
      <title>Tips on button detection</title>
      <link>/post/2018/02/06/tips-of-button-detect/</link>
      <pubDate>Tue, 06 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>/post/2018/02/06/tips-of-button-detect/</guid>
      <description>1.Problems
In embedded systems, user button event detection gets tricky when you need to detect multiple events (such as click, double click and long press) on one button.
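The timing decision can be sketched in a few lines (a hypothetical Python model of the logic, with made-up thresholds; an embedded implementation would do the same comparisons in C against a hardware timer):

```python
# Classify one button's press pattern from timestamps (milliseconds).
# Thresholds are illustrative, not taken from the original post.
LONG_PRESS_MS = 800   # held at least this long -> long press
DOUBLE_GAP_MS = 300   # second press within this gap -> double click

def classify(press_ms, release_ms, next_press_ms=None):
    """Classify a press given its press/release times and the time of
    the next press, if one arrived before the double-click gap expired."""
    if release_ms - press_ms >= LONG_PRESS_MS:
        return 'long_press'
    # a plain click can only be confirmed after the gap expires
    if next_press_ms is not None and next_press_ms - release_ms <= DOUBLE_GAP_MS:
        return 'double_click'
    return 'click'

classify(0, 1000)      # -> 'long_press'
classify(0, 100, 250)  # -> 'double_click'
classify(0, 100)       # -> 'click'
```

The point of the sketch is the third case: 'click' can only be returned once DOUBLE_GAP_MS has elapsed with no second press, which is exactly the reporting delay the post describes.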
As shown below (in the picture), there is a gap in time between the moment a click could be reported and the moment a double click or long press can be ruled out. So you cannot handle the click event immediately; you must wait until double click and long press have been ruled out.</description>
    </item>
    
  </channel>
</rss>