Trustworthiness analysis of web search results

November 01, 20

Nakamura Laboratory (Meiji University)

@nkmr-lab

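Satoshi Nakamura Laboratory, Department of Frontier Media Science, School of Interdisciplinary Mathematical Sciences, Meiji University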

Text of each page
1.

ECDL 2007, Budapest, Hungary
Trustworthiness Analysis of Web Search Results
Satoshi Nakamura, Shinji Konishi, Adam Jatowt, Hiroaki Ohshima, Hiroyuki Kondo, Taro Tezuka, Satoshi Oyama, Katsumi Tanaka
Kyoto University
nakamura@dl.kuis.kyoto-u.ac.jp

2.

Background
• People use Web search engines to obtain information and knowledge in daily life
• The trustworthiness of Web search results has become crucial
  – There are many commercial pages, phishing sites, spam weblogs, pages containing viruses, …
  – Widespread use of SEO (search engine optimization)

3.

Background: SEO Problems
• How to earn a lot of money with Google AdSense?
  – The page is ranked higher in Web search through SEO
  – The page has little or no content
How can such a site be detected before visiting it?

4.

Which search result is trustworthy?
• Before visiting a page, users cannot judge whether a Web search result
  – is in the majority or the minority on the Web,
  – contains the topics typical of the query on the Web, or
  – is supported uniformly throughout the world
The system should provide such additional information in the Web search results

5.

Goal of our work
• Find what kind of additional information helps users trust a Web search result
Figure labels: Official site, Government, Wikipedia

6.

Objective of our work
• Survey users of Web search engines
  – Which factors cause search engine users to trust Web search results?
• Enhance Web search with additional information based on the survey results
  – Early prototyping of our system

7.

Survey of Web search engine users
• Questionnaire
  – 26 questions about Web search
    • Situation, motivation, user's trust level, search ranking, additional information, future search, …
  – Date: 2006/12/25 - 2006/12/26
  – Subjects: 1000 Internet users (Japanese)

Age    Male  Female
20-29  125   125
30-39  125   125
40-49  125   125
50-59  125   125

8.

Situations in which users search the Web
• Browsing the Web and doing research are the major situations
• People often search the Web without a particular reason

9.

Reasons for searching the Web
• Obtaining detailed information or an explanation of the query
• Making comparisons is the third major reason

10.

How many results do users check?
• More than 50% of users check only the top five search results
• Only about 20% of users actually go further than the top five search results
Top-5 business model, SEO problem
  – Low-ranked Web pages are ignored by more than 50% of users

11.

Ranking algorithm?
• 18% of users believe that money paid to search engines is the main factor influencing the ranking of Web search results
Chart label: Money paid

12.

Users' trust level of search results
• 56.7% of users trust Web search results
• 10.4% of users do not trust Web search results

13.

Characteristics of pages that users trust
• Owner (author) information is important
Positive factors (for trusting): author or owner of the information, relevance to the search query, creation date
Negative factors (for not trusting): spelling errors, grammatical mistakes, biased information, uniqueness

14.

Failures caused by believing search results
• 12.3% of users experienced a failure after believing a search result
  – 3.5% of users accessed adult content, pages containing viruses, or phishing sites
  – 5.2% of users failed in the real world after believing Web search results that included old information, mistakes, biased content, and so on
    • The restaurant is closed, the food is tasteless, …

15.

About additional information
• What additional information should be provided?
  – Content date
  – Related words
  – Information about the page author or owner
  – Score reflecting the trustworthiness of the page
  – Page type
  – Thumbnail image of the page
  – Third-party evaluations
People require various kinds of additional information

16.

Future search that people want
• Additional information (48.1%)
• Context-aware search
• Domain-focused search (45.7%)
• Clustering
• Automatic analysis of trust level

17.

Prototype System
• The purpose of our system is not to determine the trustworthiness of content by itself
• Our system provides supplementary information that helps users judge trustworthiness

18.

Additional Information
• Topic majority (43.4% of respondents)
  – The number of pages similar to the search result that exist on the WWW or in the set of pages related to the query
• Topic coverage (63.2% of respondents)
  – The number of topics on the search result page
• Locality of link sources
  – Is the page supported from a wide area or a small area?
• Other information
  – Topic details (72.6% of respondents)
  – Publisher information (85.1% of respondents)
  – Number of social bookmarks (38.3% of respondents)
  – Last-modified date (61.1% of respondents)

19.

Topic majority in the Web
When the user inputs Q as a query …
Wikipedia by Q: A, Q, B
Search results by Q:
  Page 1: A, X, Q
  Page 2: A, Q, C
  Page 3: B, Z, Q
DF(A&B&X&Q) = 100 hits < DF(A&B&C&Q) = 500 hits >> DF(A&B&Z&Q) = 20 hits
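
The comparison on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say how terms are selected or how hit counts are obtained, so `hit_count` is a hypothetical lookup standing in for a search-engine document frequency (DF), and all function and variable names are assumptions. The counts are the slide's own example values (100, 500, 20).

```python
from typing import Callable, Dict, FrozenSet, Iterable


def topic_majority(
    query: str,
    reference_terms: Iterable[str],
    page_terms: Dict[str, str],
    hit_count: Callable[[FrozenSet[str]], int],
) -> Dict[str, int]:
    """Return DF(reference terms & page-specific term & query) for each page."""
    base = frozenset(reference_terms) | {query}
    return {page: hit_count(base | {term}) for page, term in page_terms.items()}


# Stub standing in for a search-engine hit count (assumption: no real API is called).
SLIDE_COUNTS = {
    frozenset({"A", "B", "X", "Q"}): 100,
    frozenset({"A", "B", "C", "Q"}): 500,
    frozenset({"A", "B", "Z", "Q"}): 20,
}


def stub_hit_count(terms: FrozenSet[str]) -> int:
    return SLIDE_COUNTS.get(terms, 0)


scores = topic_majority(
    "Q", ["A", "B"], {"Page 1": "X", "Page 2": "C", "Page 3": "Z"}, stub_hit_count
)
# Pages whose term combination is more common on the Web come first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Under these assumptions, Page 2 ranks first because its term combination has the largest document frequency.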

20.

Topic coverage
When the user inputs Q as a query …
Wikipedia by Q: A, B, Q, C, F, D, E
Search results by Q:
  Page 1: A, B, X, Y (50%)
  Page 2: F, Q, Y, B, W, X (17%)
  Page 3: A, Z, Q, E, C, F, D, B (100%)
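
One way to read this slide is that coverage is the fraction of the reference topics (here, those taken from Wikipedia for Q) that also appear on a result page. The sketch below assumes exactly that; the slide does not give the formula, and excluding the query term itself from the reference set is an additional assumption. `topic_coverage` and `example_page` are hypothetical names and data.

```python
from typing import Iterable


def topic_coverage(
    reference_topics: Iterable[str], page_terms: Iterable[str], query: str
) -> float:
    """Fraction of reference topics (excluding the query) found on the page."""
    reference = set(reference_topics) - {query}
    if not reference:
        return 0.0
    return len(reference & set(page_terms)) / len(reference)


reference = ["A", "B", "Q", "C", "F", "D", "E"]    # "Wikipedia by Q" on the slide
page_3 = ["A", "Z", "Q", "E", "C", "F", "D", "B"]  # terms listed for Page 3 on the slide
example_page = ["A", "B", "C", "X", "Y"]           # hypothetical page, not from the slide

print(topic_coverage(reference, page_3, "Q"))
print(topic_coverage(reference, example_page, "Q"))
```

With this definition, a page containing every reference topic scores 1.0, matching the 100% shown for Page 3.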

21.

Locality of supporting pages

22.

Locality of supporting pages
• Locality of supporting pages (L)
  – p, pi: Web pages
  – d(p, pi): distance between p and pi
  – n: number of linked pages
• Process of obtaining geographical coordinates
  – The system obtains linked URLs with the link: operator
  – The system converts URLs to IP addresses via DNS
  – The system obtains geographical coordinates from the IP address using GeoLite City by MaxMind
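
The coordinate-gathering steps above can be sketched as below. This only computes the distances d(p, pi); it does not compute L, because the formula is not preserved in this text. Several parts are assumptions: great-circle (haversine) distance is used since the slide does not name a metric, `ip_to_coord` is a hypothetical callable standing in for a GeoLite City lookup, and the demo uses made-up hosts and coordinates so that no network access is needed.

```python
import math
import socket
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urlparse

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance in km (assumed metric; Earth radius 6371 km)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, h)))


def url_to_ip(url: str) -> str:
    """Resolve a URL's host name to an IPv4 address via DNS."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {url!r}")
    return socket.gethostbyname(host)


def distances_to_supporters(
    page_url: str,
    linking_urls: Iterable[str],
    ip_to_coord: Callable[[str], Coord],
    resolve: Callable[[str], str] = url_to_ip,
) -> List[float]:
    """Return d(p, pi) in km for each page pi that links to p."""
    origin = ip_to_coord(resolve(page_url))
    return [haversine_km(origin, ip_to_coord(resolve(u))) for u in linking_urls]


# Demo with made-up data in place of DNS and the geolocation database.
fake_ips = {"http://p.example/": "192.0.2.1", "http://a.example/": "192.0.2.2"}
fake_geo = {"192.0.2.1": (0.0, 0.0), "192.0.2.2": (0.0, 180.0)}
d = distances_to_supporters(
    "http://p.example/", ["http://a.example/"],
    fake_geo.__getitem__, fake_ips.__getitem__,
)
print(d)
```

Passing the resolver and the geolocation lookup as parameters keeps the distance logic testable without DNS or a database file.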

23.

Examples of locality of supporting pages
• Google search engine:
  – L = 2.939 (http://www.google.com)
• Government of South Africa:
  – L = 2.427 (http://www.gov.za)
• Government of Australia:
  – L = 2.792 (http://www.australia.gov.au)
• Alachua County Today (local news site in Florida):
  – L = 42.240 (http://www.alachuatoday.com)

24.

Screenshot of prototype system (figure labels: Coverage, Majority)

25.

Displaying locality of supporting pages (powered by Google Maps)

26.

Performance of our system
• The average processing time per query is 7.2 seconds for the top 10 pages and 28 seconds for the top 50 pages
• Time analysis for locality support
We plan to implement an Ajax-based system that processes the additional information and displays it sequentially
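
The idea of showing results as they become available, rather than after all pages are processed, can be illustrated with a small server-side sketch. This is not the planned Ajax system: it is a Python analogue that yields each page's analysis as soon as it finishes. `analyze_incrementally` and `fake_analyze` are hypothetical names, and the analysis itself is a placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Iterator, Tuple


def analyze_incrementally(
    urls: Iterable[str],
    analyze: Callable[[str], Dict],
    max_workers: int = 4,
) -> Iterator[Tuple[str, Dict]]:
    """Yield (url, analysis) pairs in completion order, not submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(analyze, url): url for url in urls}
        for future in as_completed(futures):
            yield futures[future], future.result()


def fake_analyze(url: str) -> Dict:
    """Placeholder for majority/coverage/locality analysis of one page."""
    time.sleep(0.01)
    return {"url": url}


urls = ["http://a.example/", "http://b.example/", "http://c.example/"]
results = {}
for url, info in analyze_incrementally(urls, fake_analyze):
    results[url] = info  # a real front end would render this result immediately
print(len(results))
```

A user would then see the first annotated results well before the full batch completes.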

27.

Is Wikipedia a trustworthy site?
• Problem of using Wikipedia
  – Students at Middlebury College used Wikipedia to write history reports
    • About the Shimabara Rebellion
  – Wikipedia's text sometimes includes mistakes

28.

Conclusion
• Surveyed users of Web search engines
  – We learned how they search the Web
  – We learned how they determine the trustworthiness of search results
  – Additional information is required
• Enhanced Web search by displaying additional information based on the survey results
  – Topic majority, topic coverage, locality of supporting pages, other information
  – The supporting information that our system provides must be computed in real time when users execute queries

29.

Future work
• Plan to experimentally test the additional information shown in Web search results
• Plan to survey Web 2.0 and Search 2.0
• Plan to improve the algorithms that calculate topic majority, topic coverage, and so on

30.

Future work
• Integrate this work with our lab's other work
  – SBRank (Social Bookmark Rank)
    • Uses the number of social bookmarks to calculate majority or minority [Yanbe et al, JCDL2007, ICWE2007]
  – Journey to the past
    • Time analysis using the Internet Archive [Jatowt et al, ACM HyperText 2006]
  – Honto? search
    • Obtains aggregate knowledge from the search results [Yamamoto et al, APWeb2007]

31.

Köszönöm szépen!! Thank you!!
• Please check our paper or contact us if you are interested in our work
Satoshi Nakamura
Kyoto University
nakamura@dl.kuis.kyoto-u.ac.jp
http://calendar2.org/
http://webox.biz/