5/10/2019

Machine Learning Algorithm Notes: Bayesian Methods (Implementing a Spell Checker)

Bayesian Spell Checker Implementation

import re, collections

# Extract every word from the text: lowercase it and keep runs of letters only.
def words(text): return re.findall('[a-z]+', text.lower())

# Count word occurrences; a word that was never seen defaults to a count of 1.
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

# All strings at edit distance 1 from word.
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion

# Known words at edit distance 2 from word.
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

# Keep only the words that appear in the corpus.
def known(words): return set(w for w in words if w in NWORDS)

# Take the closest non-empty set of candidates and return its most frequent word.
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])

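Goal: $\arg\max_c P(c \mid w) = \arg\max_c P(w \mid c)\,P(c) / P(w)$. Because $P(w)$ is the same for every candidate $c$, it does not change which $c$ wins, so it is enough to maximize $P(w \mid c)\,P(c)$.

  • $P(c)$: the probability that a correctly spelled word $c$ appears in text, that is, how often $c$ occurs in English writing.
  • $P(w \mid c)$: the probability that the user types $w$ when they meant to type $c$, that is, how likely $c$ is to be mistyped as $w$.
  • $\arg\max_c$: enumerate every possible $c$ and pick the one with the highest probability.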
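To make the decomposition concrete, here is a minimal sketch with made-up numbers. The candidate words, the prior p_c, and the error model p_w_given_c are all hypothetical values chosen for illustration; none of them come from big.txt. Since P(w) is identical for every candidate, the score leaves it out.

```python
# Toy illustration of argmax_c P(w|c) * P(c) for the typed word w = "thew".
# Every probability below is invented for the example.
typed = "thew"

# P(c): how common each candidate correction is in English text (hypothetical).
p_c = {"the": 0.05, "thaw": 0.0001, "thew": 0.000001}

# P(w|c): probability of typing "thew" when c was intended (hypothetical).
p_w_given_c = {"the": 0.001, "thaw": 0.01, "thew": 0.95}

def score(c):
    # P(w) is the same for every c, so it does not affect the argmax.
    return p_w_given_c[c] * p_c[c]

best = max(p_c, key=score)
print(best)  # the
```

Even though "thew" typed as "thew" has by far the largest P(w|c), the much larger prior of "the" wins once the two factors are multiplied.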

# Extract every word from the corpus and lowercase it; any non-letter character
# acts as a separator, so "don't" becomes 'don' and 't'.
def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

nwords = train(words(open('big.txt').read()))

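What happens when we meet a new word we have never seen before? Suppose a word is spelled perfectly correctly but the corpus does not contain it, so it never shows up in the training set. We would then have to report that its probability is 0. That is a problem, because a probability of 0 says the event can never happen, whereas in our probability model we want a very small probability to stand for this case. That is the purpose of lambda: 1 in train: every word starts from a default count of 1.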
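The effect of the default count can be checked on a tiny hand-made word list. The sketch below reuses train as defined in the article; the three-word input is an assumption made only for the demonstration.

```python
import collections

def train(features):
    # Every word starts from a count of 1 the first time it is looked up.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

model = train(["the", "the", "cat"])

seen_twice = model["the"]      # 1 (default) + 2 occurrences = 3
seen_once = model["cat"]       # 1 (default) + 1 occurrence  = 2
was_known = "dog" in model     # a membership test does not create an entry
unseen = model["dog"]          # indexing returns the default count of 1 ...
now_known = "dog" in model     # ... and also stores the key as a side effect
```

One consequence worth noting: indexing an unknown word adds it to the dictionary, so a later membership test treats it as known. Reading with model.get(word, 1) returns the same default without storing anything.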

nwords

defaultdict(<function __main__.train.<locals>.<lambda>()>,
{'the': 80031,
 'project': 289,
 'gutenberg': 264,
 'ebook': 88,
 'of': 40026,
 'adventures': 18,
 'sherlock': 102,
 'holmes': 468,
 'by': 6739,
 'sir': 178,
 'arthur': 35,
 'conan': 5,
 'doyle': 6,
 'in': 22048,
 'our': 1067,
 'series': 129,
 'copyright': 70,
 'laws': 234,
 'are': 3631,
 'changing': 45,
 ...})

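Edit distance:

The edit distance between two words is the number of operations needed to turn one word into the other, where each operation is an insertion (add a single letter), a deletion (remove a single letter), a transposition (swap two adjacent letters), or an alteration (replace one letter with another).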
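The definition above can be turned into a small distance function. This is a sketch, not part of the article's checker: the name edit_distance is hypothetical, and it implements the optimal string alignment variant of Damerau-Levenshtein distance, which in rare cases can differ from what repeated applications of edits1 would reach.

```python
def edit_distance(a, b):
    """Count insertions, deletions, alterations and adjacent transpositions
    needed to turn a into b (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every letter of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert every letter of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # alteration (or match)
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(edit_distance('hello', 'hallo'))         # 1
print(edit_distance('something', 'soothing'))  # 2
```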

# Return the set of all strings at edit distance 1 from word
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion

The number of strings at edit distance 2 from something turns out to be 114,324.
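The blow-up starts with the size of a single edit step. For a word of length n, edits1 generates n deletions, n-1 transpositions, 26n alterations and 26(n+1) insertions, or 54n+25 strings before duplicates are removed. The sketch below checks this for something; raw_edit_count is a helper name introduced here for illustration only.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +                         # deletion
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +   # alteration
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])    # insertion

def raw_edit_count(n):
    # n deletions + (n-1) transpositions + 26n alterations + 26(n+1) insertions
    return n + (n - 1) + 26 * n + 26 * (n + 1)

word = 'something'
upper_bound = raw_edit_count(len(word))  # 54 * 9 + 25 = 511 before de-duplication
distinct = len(edits1(word))             # duplicates collapse inside the set

print(upper_bound, distinct)  # 511 494
```

Applying edits1 again to each of these few hundred strings is what pushes the distance-2 set past a hundred thousand entries.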

Optimization: among all the strings within edit distance 2, keep only those that are real words as candidates. That leaves just 3 words: 'smoothing', 'something' and 'soothing'.

# Return the set of all strings at edit distance 2 from word.
# This set is unfiltered; keeping only the real words among them is left to known().
def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))
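Filtering while generating avoids holding the full distance-2 set in memory. The sketch below follows known_edits2 from the full listing but takes the dictionary as a parameter; the four-word vocabulary is a hypothetical stand-in for the counts learned from big.txt.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +                         # deletion
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +   # alteration
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])    # insertion

def known_edits2(word, vocabulary):
    # Test membership as each string is produced, so only real words are stored.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in vocabulary)

# Hypothetical vocabulary used only for this demonstration.
vocabulary = {'something', 'soothing', 'smoothing', 'nothing'}
candidates = known_edits2('something', vocabulary)

print(sorted(candidates))  # ['smoothing', 'something', 'soothing']
```

The word nothing is excluded because it needs more than two edits to reach from something.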

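Ideally the error model would reflect that mistaking one vowel for another is more likely than mistaking a consonant (people often type hallo for hello), that getting the first letter of a word wrong is relatively unlikely, and so on.

For simplicity, though, a cruder rule is used: known words at edit distance 1 take priority over those at edit distance 2, and a known word at edit distance 0 takes priority over those at edit distance 1.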

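def known(words): return set(w for w in words if w in nwords)

# If known(...) returns a non-empty set, candidates takes that set and the
# alternatives after it are never evaluated.
def correct(word):
    # known(edits2(word)) gives the same result as known_edits2(word) in the
    # full listing, but checks against nwords.
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
    return max(candidates, key=lambda w: nwords[w])

correct('knona')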
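To try the whole pipeline without big.txt, the sketch below runs the same functions on a one-sentence corpus. The corpus text is made up for the demonstration, so the results only show how the priority rule behaves, not what the real corpus would return. One small deviation from the article's code: the final lookup uses nwords.get(w, 1) so that an unknown word is not stored in the dictionary as a side effect.

```python
import re
import collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Tiny made-up corpus standing in for big.txt.
corpus = "the known word is spelling and the spelling of a word is known"
nwords = train(words(corpus))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +                         # deletion
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +   # alteration
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])    # insertion

def known(ws):
    return set(w for w in ws if w in nwords)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

def correct(word):
    # Each tier is tried only if every earlier tier came back empty.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    # .get reads the count without inserting unknown words into nwords.
    return max(candidates, key=lambda w: nwords.get(w, 1))

print(correct('the'))      # the       (distance 0)
print(correct('speling'))  # spelling  (distance 1)
print(correct('knona'))    # known     (distance 2)
print(correct('zzzzz'))    # zzzzz     (no candidate, returned unchanged)
```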