The large-scale web search sessions are available here. Each search session is organized as:
Column Id | Explanation | Remark |
---|---|---|
Qid | query id | |
Query | The user issued query | Sequential token ids separated by “”. |
Query Reformulation | The subsequent queries issued by users under the same search goal. | Sequential token ids separated by “”. |
Pos | The document’s displaying order on the screen. | [1,30] |
Url_md5 | The MD5 hash that identifies the URL. | |
Title | The title of document. | Sequential token ids separated by “”. |
Abstract | A query-related brief introduction of the document under the title. | Sequential token ids separated by “”. |
Multimedia Type | The type of the URL, for example advertisement, video, or map. | int |
Click | Whether the user clicked the document. | [0,1] |
- | - | - |
- | - | - |
Skip | Whether the user skipped the document on the screen. | [0,1] |
SERP Height | The vertical pixels of SERP on the screen. | Continuous Value |
Displayed Time | The document’s display time on the screen. | Continuous Value |
Displayed Time Middle | The document’s display time on the middle 1/3 of the screen. | Continuous Value |
First Click | The identifier of users’ first click in a query. | [0,1] |
Displayed Count | The document’s display count on the screen. | Discrete Number |
SERP’s Max Show Height | The max vertical pixels of SERP on the screen. | Continuous Value |
Slipoff Count After Click | The number of slipoffs after the user clicks the document. | Discrete Number |
Dwelling Time | The length of time a user spends on a document after clicking its link on the SERP and before returning to the SERP. | Continuous Value |
Displayed Time Top | The document’s display time on the top 1/3 of screen. | Continuous Value |
SERP to Top | The vertical pixels of the SERP to the top of the screen. | Continuous Value |
Displayed Count Top | The document’s display count on the top 1/3 of screen. | Discrete Number |
Displayed Count Bottom | The document’s display count on the bottom 1/3 of screen. | Discrete Number |
Slipoff Count | The number of times the document slipped off the screen. | |
- | - | - |
Final Click | The identifier of users’ last click in a query session. | |
Displayed Time Bottom | The document’s display time on the bottom 1/3 of screen. | Continuous Value |
Click Count | The document’s click count. | Discrete Number |
Displayed Count | The document’s display count on the screen. | Discrete Number |
- | - | - |
Last Click | The identifier of users’ last click in a query. | Discrete Number |
Reverse Display Count | The number of times the document was displayed while the user browsed in reverse order, from bottom to top. | Discrete Number |
Displayed Count Middle | The document’s display count on the middle 1/3 of screen. | Discrete Number |
- | - | - |
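The layout above can be turned into a small row parser. The following is only a minimal sketch: it assumes that each document row is a tab-separated line whose fields follow the table order from Pos onward, and that the `-` rows are placeholder columns to skip. None of that is stated in this section. The names `DOC_COLUMNS`, `parse_doc_row`, and `displayed_count_2` (for the second Displayed Count row) are hypothetical.

```python
# Column names in the order the table lists them, starting at Pos.
# None marks a "-" placeholder row; those fields are dropped.
# ASSUMPTION: document rows are tab-separated and follow this order.
DOC_COLUMNS = [
    "pos", "url_md5", "title", "abstract", "multimedia_type", "click",
    None, None,
    "skip", "serp_height", "displayed_time", "displayed_time_middle",
    "first_click", "displayed_count", "serp_max_show_height",
    "slipoff_count_after_click", "dwelling_time", "displayed_time_top",
    "serp_to_top", "displayed_count_top", "displayed_count_bottom",
    "slipoff_count",
    None,
    "final_click", "displayed_time_bottom", "click_count",
    "displayed_count_2",  # hypothetical name for the repeated Displayed Count row
    None,
    "last_click", "reverse_display_count", "displayed_count_middle",
    None,
]


def parse_doc_row(line, sep="\t"):
    """Map one raw document row to a dict of column name -> raw string value."""
    fields = line.rstrip("\n").split(sep)
    return {
        name: value
        for name, value in zip(DOC_COLUMNS, fields)
        if name is not None
    }
```

Values are kept as raw strings here; converting them to `int` or `float` according to the Remark column is left to the caller.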
The expert annotation dataset is available here. The schema of nips_annotation_data_0522.txt is:
Columns | Explanation | Remark |
---|---|---|
Query | The user issued query | Sequential token ids separated by "\x01". |
Title | The title of document. | Sequential token ids separated by "\x01". |
Abstract | A query-related brief introduction of the document under the title. | Sequential token ids separated by "\x01". |
Label | Expert annotation label. | [0,4] |
Bucket | Queries are split into 10 buckets in descending order of monthly search frequency: buckets 0, 1, and 2 hold high-frequency queries, while buckets 7, 8, and 9 hold tail queries. | [0,9] |
The file unigram_dict_0510_tokens.txt is a unigram set that records high-frequency words by their desensitized token ids.
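
As an illustration of the annotation schema, one line of nips_annotation_data_0522.txt could be read roughly as below. This is a sketch under assumptions, not the official loader: it assumes the five columns are tab-separated and appear in the listed order, and that token ids are integers. Only the `"\x01"` token separator and the value ranges come from the schema. The names `AnnotationRecord`, `parse_annotation_line`, and `frequency_group` are hypothetical, and so is the `"middle"` label for buckets 3 to 6, which the schema does not name.

```python
from typing import List, NamedTuple

TOKEN_SEP = "\x01"  # token separator given in the schema
COLUMN_SEP = "\t"   # ASSUMPTION: columns are tab-separated


class AnnotationRecord(NamedTuple):
    query: List[int]
    title: List[int]
    abstract: List[int]
    label: int   # expert annotation label in [0, 4]
    bucket: int  # frequency bucket in [0, 9]


def split_tokens(field: str) -> List[int]:
    """Split a token field on TOKEN_SEP; assumes token ids are integers."""
    return [int(tok) for tok in field.split(TOKEN_SEP) if tok]


def parse_annotation_line(line: str) -> AnnotationRecord:
    """Parse one line, assuming Query, Title, Abstract, Label, Bucket order."""
    query, title, abstract, label, bucket = line.rstrip("\n").split(COLUMN_SEP)
    label_id, bucket_id = int(label), int(bucket)
    if not 0 <= label_id <= 4:
        raise ValueError(f"label out of range [0, 4]: {label_id}")
    if not 0 <= bucket_id <= 9:
        raise ValueError(f"bucket out of range [0, 9]: {bucket_id}")
    return AnnotationRecord(
        split_tokens(query), split_tokens(title), split_tokens(abstract),
        label_id, bucket_id,
    )


def frequency_group(bucket: int) -> str:
    """Buckets 0-2 are high-frequency and 7-9 are tail, per the schema."""
    if bucket <= 2:
        return "high"
    if bucket >= 7:
        return "tail"
    return "middle"  # hypothetical label for buckets 3-6
```

Grouping records with `frequency_group` makes it easy to report results separately for high-frequency and tail queries.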