The large-scale web search session data are available here. Each search session is organized as follows:
| Column Id | Explanation | Remark |
|---|---|---|
| Qid | query id | |
| Query | The user-issued query. | Sequential token ids separated by "\x01". |
| Query Reformulation | The subsequent queries issued by the user under the same search goal. | Sequential token ids separated by "\x01". |
| Pos | The document’s display position on the screen. | [1,30] |
| Url_md5 | The MD5 hash identifying the URL. | |
| Title | The title of the document. | Sequential token ids separated by "\x01". |
| Abstract | A brief query-related summary of the document, shown under the title. | Sequential token ids separated by "\x01". |
| Multimedia Type | The type of URL, for example, advertisement, video, or map. | int |
| Click | Whether the user clicked the document. | [0,1] |
| - | - | - |
| - | - | - |
| Skip | Whether the user skipped the document on the screen. | [0,1] |
| SERP Height | The vertical pixels of SERP on the screen. | Continuous Value |
| Displayed Time | The document’s display time on the screen. | Continuous Value |
| Displayed Time Middle | The document’s display time on the middle 1/3 of the screen. | Continuous Value |
| First Click | The identifier of users’ first click in a query. | [0,1] |
| Displayed Count | The document’s display count on the screen. | Discrete Number |
| SERP’s Max Show Height | The max vertical pixels of SERP on the screen. | Continuous Value |
| Slipoff Count After Click | The number of slipoffs after the user clicks the document. | Discrete Number |
| Dwelling Time | The length of time a user spends looking at a document after they’ve clicked a link on a SERP page, but before clicking back to the SERP results. | Continuous Value |
| Displayed Time Top | The document’s display time on the top 1/3 of screen. | Continuous Value |
| SERP to Top | The vertical pixels of the SERP to the top of the screen. | Continuous Value |
| Displayed Count Top | The document’s display count on the top 1/3 of screen. | Discrete Number |
| Displayed Count Bottom | The document’s display count on the bottom 1/3 of screen. | Discrete Number |
| Slipoff Count | The number of times the document slipped off the screen. | |
| - | - | - |
| Final Click | The identifier of users’ last click in a query session. | |
| Displayed Time Bottom | The document’s display time on the bottom 1/3 of screen. | Continuous Value |
| Click Count | The document’s click count. | Discrete Number |
| Displayed Count | The document’s display count on the screen. | Discrete Number |
| - | - | - |
| Last Click | The identifier of users’ last click in a query. | Discrete Number |
| Reverse Display Count | The number of times the document was displayed while the user browsed in reverse order, from bottom to top. | Discrete Number |
| Displayed Count Middle | The document’s display count on the middle 1/3 of screen. | Discrete Number |
| - | - | - |
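
The token-id columns above (Query, Query Reformulation, Title, Abstract) can be decoded with a small helper. This is only a minimal sketch: it assumes the separator is the `"\x01"` control character, as stated in the annotation-file schema, and the names `TOKEN_SEP` and `parse_token_ids` are hypothetical, not part of the released code.

```python
from typing import List

# Assumed separator between token ids (taken from the annotation-file schema).
TOKEN_SEP = "\x01"


def parse_token_ids(field: str, sep: str = TOKEN_SEP) -> List[int]:
    """Split one token-id field into a list of integer token ids.

    An empty field yields an empty list; empty pieces produced by a
    leading or trailing separator are ignored.
    """
    field = field.strip()  # removes surrounding whitespace such as a newline
    if not field:
        return []
    return [int(tok) for tok in field.split(sep) if tok]


# Usage on a made-up field value (not real data from the dataset).
query_ids = parse_token_ids("12\x0134\x0156")
```

Keeping the separator as a parameter makes it easy to adjust if the released files turn out to use a different delimiter.
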
The expert annotation dataset is available here. The schema of nips_annotation_data_0522.txt:
| Columns | Explanation | Remark |
|---|---|---|
| Query | The user-issued query. | Sequential token ids separated by "\x01". |
| Title | The title of document. | Sequential token ids separated by "\x01". |
| Abstract | A query-related brief introduction of the document under the title. | Sequential token ids separated by "\x01". |
| Label | Expert annotation label. | [0,4] |
| Bucket | Queries are split into 10 buckets in descending order of monthly search frequency: buckets 0, 1, and 2 contain high-frequency queries, while buckets 7, 8, and 9 contain tail queries. | [0,9] |
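
As an illustration of the schema above, one annotation record could be parsed roughly as follows. This is a sketch under stated assumptions: it assumes each line holds the five columns in the listed order and that columns are tab-separated, neither of which is confirmed here. The names `AnnotatedPair`, `parse_annotation_line`, and `frequency_group` are hypothetical, and the `"mid"` group label for buckets 3 to 6 is introduced only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedPair:
    query: List[int]
    title: List[int]
    abstract: List[int]
    label: int   # expert annotation label in [0, 4]
    bucket: int  # query frequency bucket in [0, 9]


def _ids(field: str, sep: str = "\x01") -> List[int]:
    """Decode a "\x01"-separated token-id field into integers."""
    return [int(tok) for tok in field.split(sep) if tok]


def parse_annotation_line(line: str, col_sep: str = "\t") -> AnnotatedPair:
    """Parse one line, ASSUMING five tab-separated columns in schema order."""
    parts = line.rstrip("\n").split(col_sep)
    if len(parts) != 5:
        raise ValueError(f"expected 5 columns, got {len(parts)}")
    query, title, abstract, label, bucket = parts
    label_i, bucket_i = int(label), int(bucket)
    if not 0 <= label_i <= 4:
        raise ValueError(f"label out of range [0,4]: {label_i}")
    if not 0 <= bucket_i <= 9:
        raise ValueError(f"bucket out of range [0,9]: {bucket_i}")
    return AnnotatedPair(_ids(query), _ids(title), _ids(abstract), label_i, bucket_i)


def frequency_group(bucket: int) -> str:
    """Map a bucket to a coarse group: 0-2 high-frequency, 7-9 tail."""
    if bucket <= 2:
        return "high"
    if bucket >= 7:
        return "tail"
    return "mid"  # label chosen for this sketch only


# Usage on a made-up line (not real data from the dataset).
pair = parse_annotation_line("1\x012\t3\x014\t5\t3\t8\n")
group = frequency_group(pair.bucket)
```

Grouping by bucket in this way is one option for reporting ranking metrics separately on high-frequency and tail queries.
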
The file unigram_dict_0510_tokens.txt is a unigram dictionary that records high-frequency words by their desensitized token ids.