
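Detailed Record

Author: 焉德葳
Author (English): Dewei Yen
Thesis title: 搜尋引擎與資訊索引中文斷詞方法
Thesis title (English): Chinese phrase segmentation method of Information Retrieval and Search Engine
Advisors: 洪朝貴, 陳毓璋
Advisors (English): Chao-Kuei Hung, Yu-Chang Chen
Degree: Master's
Institution: 樹德科技大學 (Shu-Te University)
Department: 資訊工程學系 (Department of Information Engineering)
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar)
Language: Chinese
Number of pages: 72
Keywords (Chinese): 搜尋引擎, 資訊索引, 斷詞, N-gram, Ozearch
Keywords (English): Search engine, Information retrieval, phrase segmentation, N-gram, Ozearch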
For most people, the search engine is a technology that is both familiar and unfamiliar. It is familiar because people use it constantly in their online activity, and many have heard of its well-known techniques or have even studied and improved them. In practice, however, few people understand how to build a complete search engine. This thesis tries to explain the kaleidoscope of techniques and theory inside a search engine through simple figures and examples. It uses Ozearch, an open source search engine built as part of this research, as the running example and presents the full implementation of each part.

The research adopts a segmentation method tailored to Chinese indexing that links the N-gram method with the word-based (lexicon) method. The main idea of this combined N-gram and word-based method is to take the strengths of both approaches, so that the search engine keeps good precision and recall while effectively reducing the number of index keys used per page. The thesis also lists shortcomings of the segmentation methods currently used by commercial search engines and proposes workable improvements.
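The combined approach described in the abstract can be illustrated with a minimal sketch. This is not the Ozearch implementation: the function names, the `max_word_len` parameter, the toy lexicon, and the use of forward maximum matching for the dictionary step are all assumptions made for illustration. The sketch also omits the linking of bigrams to neighbouring dictionary words and the part-of-speech check listed in sections 3.2.2 and 3.2.3 of the table of contents.

```python
def bigrams(text):
    """Return overlapping character bigrams; a lone character is kept as-is."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def hybrid_segment(text, lexicon, max_word_len=4):
    """Hypothetical hybrid segmenter: lexicon words first, bigrams elsewhere.

    Assumption: the dictionary step uses forward maximum matching.
    """
    keys = []
    unknown = ""  # run of characters not covered by any lexicon word
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, down to two characters.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match:
            keys.extend(bigrams(unknown))  # flush the pending unknown run
            unknown = ""
            keys.append(match)
            i += len(match)
        else:
            unknown += text[i]
            i += 1
    keys.extend(bigrams(unknown))
    return keys


toy_lexicon = {"搜尋引擎", "中文", "斷詞"}  # illustrative lexicon, not from the thesis
sample = "樹德的搜尋引擎"
index_keys = hybrid_segment(sample, toy_lexicon)
```

On this sample the sketch yields three index keys (two bigrams for the out-of-lexicon run plus one lexicon word), whereas pure bigram indexing of the same seven characters yields six. That is the kind of reduction in index keys the abstract refers to, though the actual savings reported in the thesis depend on its own lexicon and corpus.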
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Background
1.2 Research Motivation
1.3 Thesis Organization
Chapter 2 Search Engine Technology
2.1 Crawler
2.2 Natural Language Processing (NLP)
2.2.1 Stop Words
2.2.2 Stemming
2.2.3 Lemmatisation
2.3 Indexes
2.3.1 Biword Indexes
2.3.2 Inverted Indexes
2.3.3 Positional Indexes
2.3.4 Combined Indexes
2.4 Index Processes
2.4.1 Store Index
2.4.2 Multiple Threads
2.4.3 Distributed Systems: GFS & MapReduce
2.5 Query
2.5.1 Merge Algorithm
2.5.2 Description
2.6 Rank
2.6.1 PageRank
2.6.2 SGML Parser
2.6.3 Search Engine Spam
2.6.4 The Scoring Mechanism Used by Ozearch
2.7 Other Techniques
2.7.1 Using Unicode
2.7.2 Index Table
2.7.3 Data Compression
2.7.4 Image Search
2.7.5 Traditional/Simplified Chinese Synonym Table
2.7.6 RSS/Atom Search
Chapter 3 Chinese Word Segmentation System
3.1 Introduction to Chinese Word Segmentation
3.1.1 N-gram
3.1.2 Word-based Method
3.1.3 Comparison of the Methods
3.2 N-gram and Word-based Combined Algorithm
3.2.1 Match Against the Lexicon First and Obtain Word Positions
3.2.2 Split Non-dictionary Strings into Bigrams and Link Them with Dictionary Words
3.2.3 Check Keyword and Part-of-Speech Combinations, Removing Bigrams Formed by Function Words
3.2.4 Process Overview
Chapter 4 Experimental Design and Results
4.1 Convergence of N-gram
4.2 Precision Measurement
4.3 Experimental Results
4.3.1 Dictionary Errors
4.3.2 Errors Caused by Part-of-Speech Combinations
4.3.3 Index Segmentation Problems
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
5.2.1 Chinese Word Segmentation
5.2.2 Ranking
5.2.3 MapReduce
5.2.4 Layout Structure Analyzer
5.2.5 Integration with Other Open Source Search Engines
References
                                                                                                                                                                                                                                                                                                                                                                                                               