Wednesday, 19 October 2016

Touch Screen Lexicon Forensics (TextHarvester/WaitList.dat)

By Barnaby Skeggs

Preamble

Since the release of Windows 8, and the ‘Metro’ interface, touch screen input has been implemented in a rapidly rising number of Windows devices including Microsoft Surface Pro/Book, 2-in-1s, convertible laptops and tablets. Microsoft has catered for this trend, implementing conversion of touch/pen handwriting to computer text in software such as OneNote. In this paper I will detail my research into the forensic artefact ‘WaitList.dat’, which I believe to be associated with this functionality.
I identified the ‘WaitList.dat’ artefact while investigating a Windows 8.1 PC for the presence of a known email. I was provided with a copy of this email, and part of the investigation involved identifying whether or not it had ever existed on the custodian’s computer. After processing the .PST and .OST mailbox archives on the PC, I did not identify the email. I then processed shadow copies, and carved and processed various mailbox stores and email files, and still did not identify it. As a final attempt, I ran a string search for the email subject line across the whole forensic image. I received one hit, within ‘WaitList.dat’. Investigation of this 140 MB file identified metadata and full body text of over 36,000 emails and documents, spanning back three years.

Acknowledgements

Shaun Bettridge – Peer review, contribution to data structure analysis and being a sounding board for ideas throughout this analysis.
Carl - Peer review.

WaitList.dat

‘WaitList.dat’ (WaitList) is a data file which has been found to contain stripped text from email, contact and document files. The population of data within WaitList is associated with the ‘Microsoft Windows Search Indexer’ process. This process locks the WaitList file on a live system.
WaitList is located at the following path on Windows 8.1 and 10 systems (it may exist on other OS versions, but I do not have systems to test this):
C:\Users\%User%\AppData\Local\Microsoft\InputPersonalization\TextHarvester\WaitList.dat
I have only identified WaitList on PCs which have utilized touch screen handwriting recognition features. My own touch screen laptop did not contain this file, as I had not used the feature. In order to test its creation, I set up and began using the handwriting recognition in OneNote, and WaitList was soon created automatically. By the following morning, full text extracts of all emails I had received overnight had been populated within WaitList.
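As a rough illustration of checking for the artefact, the sketch below walks the 'Users' folder of a mounted forensic image and reports each profile that contains WaitList.dat at the path given above. The function name `find_waitlists` and the example mount point are hypothetical, and the sketch assumes the image is mounted so that the path components resolve with the casing shown.

```python
from pathlib import Path

# Location of WaitList.dat relative to a user profile, as given above.
WAITLIST_RELATIVE = Path(
    "AppData", "Local", "Microsoft", "InputPersonalization",
    "TextHarvester", "WaitList.dat",
)


def find_waitlists(users_root):
    """Return (profile name, path, size in bytes) for each WaitList.dat found.

    users_root is the 'Users' folder of a mounted image (hypothetical
    example: 'E:/Users'). Nothing is read from a live, locked file.
    """
    hits = []
    for profile in sorted(Path(users_root).iterdir()):
        candidate = profile / WAITLIST_RELATIVE
        if candidate.is_file():
            hits.append((profile.name, candidate, candidate.stat().st_size))
    return hits
```

A profile without the file is simply skipped, which matches the observation that WaitList is absent until handwriting recognition has been used.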
Registry comparison before and after my test showed the following registry key modifications:
Key: HKEY_CURRENT_USER\SOFTWARE\Microsoft\InputPersonalization


Lexicon definition: the vocabulary of a person, language, or branch of knowledge

The "App Lexicon Timestamp" key is a Windows 64 bit FILETIME (Big Endian) timestamp which matches (within ~10 seconds) the installation time of the COM Class Object 'UserLexiconManager'. This is not the date when this key was created, or the date from when WaitList population commenced. On my PC, this date was prior to my purchase of the laptop, likely associated with the initial Windows 10 installation.
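A FILETIME counts 100-nanosecond intervals since 1 January 1601 (UTC), so decoding one is a small calculation. The sketch below is a minimal helper, not part of WLrip; the name `filetime_to_datetime` is my own. The sample bytes are the example FILETIME shown at offset 0x0D in the Data Structure section, which is stored little endian; the `byteorder` parameter is there because the registry value is described above as big endian.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(raw, byteorder="little"):
    """Convert 8 raw FILETIME bytes to a timezone-aware UTC datetime."""
    if len(raw) != 8:
        raise ValueError("a FILETIME is exactly 8 bytes")
    ticks = int.from_bytes(raw, byteorder)  # 100-nanosecond intervals
    # Python datetimes resolve to microseconds, so drop the last digit.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


# Example value from offset 0x0D of the sample record (little endian).
sample = bytes.fromhex("40 59 7B 44 58 F8 D1 01")
print(filetime_to_datetime(sample))
# The same instant if the bytes were stored in big endian order.
print(filetime_to_datetime(sample[::-1], byteorder="big"))
```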

Alternatively, these registry key values can be created by enabling 'Personalised Handwriting Recognition' for a supported language in the Control Panel.

Control Panel\Clock, Language and Region\Language\Language options

Theory and Further Research

As of Windows Vista, Custom Dictionaries have been used to improve handwriting recognition results. The Input Personalisation System (IPS) collects user data, which a 'Text Trainer' 'tunes' and stores in 'lexicon blobs'.

The text trainer stores Application Lexicon Blobs, and User Lexicon blobs. Both blobs can be used by the Handwriting Recogniser, and both blobs are updated when new data is received by the IPS, thereby continually improving handwriting recognition accuracy.

For more information on this process, please read:
https://msdn.microsoft.com/en-us/library/bb265252.aspx


Representation of the relationship between Ink Applications and the IPS


The following related files exist alongside WaitList.dat, under the same InputPersonalization directory:
%User%\AppData\Local\Microsoft\InputPersonalization\TextHarvester\TextHarvester.dat
%User%\AppData\Local\Microsoft\InputPersonalization\TrainedDataStore\en-AU\*


The 'DocID' (Offset 0x1C detailed in Data Structure below) appears to match entries within WaitList to values within TextHarvester.dat. This link will be investigated and detailed in a future blog post.

Whilst further research is definitely required, it is possible that the 'Microsoft Windows Search Indexer' collects and stores user data in WaitList, following which TextHarvester acts as the 'Text Trainer', tuning the user data into TrainedDataStores (User Lexicon Blobs) for use by the 'Handwriting Recogniser'.

If you know more about these files please contact me at b2dfir@REMOVE.gmail.com.

File Contents

The following data has been identified within WaitList.dat records.

Microsoft Outlook Email:
·        Date/Time
·        Email subject
·        Sent flag
·        Type (Email/Document/Contact)
·        Recipients (Does not distinguish between ‘To’, ‘CC’ and ‘BCC’)
Note: Does not store a ‘From’ value, however this can often be identified in email signatures.
·        Meeting Location (only when email is a calendar invite)
·        Body of file
Contact:
·        Address
·        City
·        State
·        Country
·        Full Name
·        Title
·        Contact Details (email/phone/url)
Note: Contacts added from Skype/Lync may be recorded as a ‘sent’ email item, due to the way Outlook imports/stores the contact.
Documents (.pdf, .xlsx, .txt, .doc and .docx files have been tested):
·        Date/Time
·        DocumentID (use to compare document indexes over time) – format unknown
·        Body of file
·        Company
It is likely that other values are stored in additional data types, however this is the extent of data I have identified in my testing procedures.

Forensic Application

WaitList provides an additional source of evidence for email and document discovery. In addition to the existence and content of a document, WaitList will store multiple indexes for a single document over time. This gives a forensic examiner the ability to view historical iterations of a file, even when shadow copy is not enabled, or when the file has been deleted/wiped from the hard drive.
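To illustrate how historical iterations might be lined up, the sketch below groups already-parsed records by their DocID and orders each group by the stored Date/Time value. The record layout (dictionaries with `doc_id`, `modified` and `body` keys) is a hypothetical intermediate form chosen for this example, not WLrip's actual output format.

```python
from collections import defaultdict

# Emails carry an all-zero DocID, so they are excluded from the grouping.
EMPTY_DOC_ID = "0000000000000000"


def group_by_doc_id(records):
    """Map each non-zero DocID to its index records, oldest first."""
    grouped = defaultdict(list)
    for record in records:
        if record["doc_id"] != EMPTY_DOC_ID:
            grouped[record["doc_id"]].append(record)
    for versions in grouped.values():
        # 'modified' is the indexed file's last modification time.
        versions.sort(key=lambda r: r["modified"])
    return dict(grouped)
```

Any DocID that maps to more than one record is a candidate for comparing body text between index passes. Because duplicate documents also share a DocID, identical bodies within a group may simply be copies rather than revisions.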
The population of data within WaitList.dat is associated with the ‘Microsoft Windows Search Indexer’ process. This process locks the WaitList file on a live system. Existence of an index record within WaitList only indicates the existence of the file on the computer. User interaction with the file can only be inferred when the metadata stored within the record (e.g. ‘Sent flag’, ‘Recipient’) indicates a user action.
An email or document can be recorded in WaitList without being read or opened by the user.
Limitations of the Microsoft Windows Search Indexer apply to all records within WaitList. For example, files within an archive (.zip, .rar etc.) or encrypted documents cannot be indexed with default 'Microsoft Windows Search Indexer' settings, and therefore will not be stored within WaitList. Scanned (non-text searchable) PDFs may appear as records, however the body text will be empty.
For more information on the 'Windows Search Indexer' visit:
https://msdn.microsoft.com/en-us/library/ee805985(v=vs.85).aspx
‘Microsoft Windows Search Indexer’ will index emails and their attachments at a similar time. As a result, these files will occur within close proximity of each other when they are written to ‘WaitList.dat’. Whilst there does not appear to be a direct parent-to-child relationship value, attachment files will contain ‘Recipient’ values matching those of their parent email. ‘Date/Time’ and ‘Recipient’ values can therefore be used to associate emails with their likely attachments.
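The association described above could be sketched as follows. This is an assumption-laden example: the record dictionaries (`is_email`, `recipients`, `modified`) are a hypothetical parsed form, the five-minute window is an arbitrary starting point rather than a tested value, and a match only suggests a likely attachment.

```python
from datetime import timedelta


def likely_attachments(email, records, window=timedelta(minutes=5)):
    """Return non-email records that share the email's recipients and whose
    Date/Time value falls within `window` of the email's."""
    wanted = set(email["recipients"])
    if not wanted:
        return []  # nothing to match on; avoid pairing with every local file
    matches = []
    for record in records:
        if record["is_email"]:
            continue
        same_recipients = set(record["recipients"]) == wanted
        close_in_time = abs(record["modified"] - email["modified"]) <= window
        if same_recipients and close_in_time:
            matches.append(record)
    return matches
```

Widening or narrowing `window` trades false positives against missed attachments, so results should be reviewed manually.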

Parsing WaitList.dat

WLrip.py (WLrip) is a Python program I have written to parse the contents of WaitList.dat, based on my understanding of the data structure specified below. WLrip will extract the metadata and body text of each record to a new .txt file, and produce a metadata report in .csv format.
Running WLrip with the ‘-x’ option will produce a .xlsx report with hyperlinks to each .txt file created. This is the recommended method to run WLrip, however it requires the Python ‘XLSXWriter’ module (https://github.com/jmcnamara/XlsxWriter).
Recommended execution of WLrip.py is as follows:
WLrip.py -c -x -f <filename> -o <output directory>
Arguments:
Argument | Description
-c | Removes various null characters, in an attempt to clean up the text output.
-x | Produces a .xlsx report, as well as the default .csv report.
-k | Kills the ‘Microsoft Windows Search Indexer’ process, which will lock the WaitList.dat file on a live system. Requires administrator privileges.
-f | Specifies the WaitList.dat file location for processing.
-o | Specifies an output directory. If not included, the report will be generated within a new folder in the current directory.

I have done my best to write this program in a way that allows it to capture new values (which I have not yet encountered) in the ‘other’ field. Values captured in the ‘other’ field will be appended with a [type], to indicate the field value stored in the data structure. Please send unknown values to me and I can implement them in future releases.

https://github.com/B2dfir/wlrip

I have also compiled WLrip into a portable Windows executable using pyinstaller.

https://github.com/B2dfir/wlripEXE

Data Structure Analysis

I have performed analysis on the data structure of WaitList in order to understand how text and metadata are stored within each record. All values are in little endian.
Data Structure (Hex)
Data Structure (Detail)
Offset | Hex | Decimal | Length | Field Name and Description
0x00 | 64 00 00 00 00 | 100 | 5 bytes | WaitList.dat File Signature
0x05 | 03 0B 00 00 | 2819 | 4 bytes | Index Record Length (bytes)
0x09 | 03 0B 00 00 | 2819 | 4 bytes | Index Record Length (bytes), repeated
0x0D | 40 59 7B 44 58 F8 D1 01 | - | 8 bytes | Win 64-bit FILETIME: indexed file’s last modification time/date
0x15 | 0F 00 00 00 | 15 | 4 bytes | Record ID (incremental integer)
    Note: Not included in WLrip output.
0x19 | 00 | 0 | 1 byte | Sent Flag
    00 = sent email
    01 = everything else
    Note: Local files and email attachments will not contain the MailItem.Sent property*, and will default to 00.
0x1A | 00 | 0 | 1 byte | Unknown: always 00 in currently tested files. Possibly a part of 'Sent Flag' (if 'Sent Flag' is a 2 byte int)
    Note: Included in the report as 'Unkn' for community examination.
0x1B | 01 | 1 | 1 byte | Type
    00 = Not Email
    01 = Email
0x1C | 00 00 00 00 00 00 00 00 | 0 | 8 bytes | DocID: format unknown
    Filter on this value to view multiple indexes of a document over time.
    Known information:
    - Value for emails is always 0s
    - All documents contain a value here
    - Not a timestamp I could identify
    - Is not similar between documents with similar timestamps
    - Is the same for duplicate documents
    - Is the same for multiple index records pertaining to the same document (e.g. when more text has been saved to a report)
0x24 | 00 | 0 | 1 byte | More Metadata Flag
    00 = More metadata stored in this record (e.g. Blue data / Orange data structures)
    01 = No more metadata stored in this record prior to the body text.
    Note: Not included in WLrip output.
0x25 | 07 00 00 00 | 7 | 4 bytes | Index Record Metadata Type Flag
    04 00 00 00 = Recipient Email Address
    06 00 00 00 = Subject
    07 00 00 00 = Recipient Name
    10 00 00 00 = Full Name (contact)
    11 00 00 00 = Title (contact)
    12 00 00 00 = Last Name
    21 00 00 00 = State
    0B 00 00 00 = Address (contact)
    0C 00 00 00 = City (contact)
    0D 00 00 00 = Country (contact)
    0E 00 00 00 = Contact details (contact)
    0F 00 00 00 = First Name (contact)
    13 00 00 00 = Middle Name (contact)
    1B 00 00 00 = Location (meetings)
0x29 | 00 00 00 00 | 0 | 4 bytes | Grammar Proofing Type
    Note: See the following registry key for available proofing types: HKEY_CURRENT_USER\SOFTWARE\Microsoft\Shared Tools\Proofing Tools\Grammar\MSGrammar
    Note: Not included in WLrip output.
0x2D | 0C 00 00 00 | 12 | 4 bytes | Metadata Length (characters)
    Multiply integer value by 2 for byte length.
0x31 | - | - | 24 bytes | Metadata Text (Recipient name in this example)
0x49 | 00 | 0 | 1 byte | Another Metadata Value Flag
    00 = Get another value
    01 = No more values
0x4A | 04 00 00 00 | 4 | 4 bytes | Same as blue offset 0x25
0x4E | 00 00 00 00 | 0 | 4 bytes | Same as blue offset 0x29
0x52 | 13 00 00 00 | 19 | 4 bytes | Same as blue offset 0x2D
0x56 | - | - | 38 bytes | Metadata Text (Recipient email address in this example)
0x7C | 01 | 1 | 1 byte | Same as blue offset 0x49
0x7D | 00 00 00 00 | 0 | 4 bytes | Current Body Text Offset
    Increases with each length of indexed body text.
0x81 | 05 00 00 00 | 5 | 4 bytes | Body Type Flag
    05 00 00 00 = Email Body
    17 00 00 00 = Contact
    1D 00 00 00 = Document Body
0x85 | 09 0C 00 00 | 3081 | 4 bytes | Grammar Proofing Type
    Refer to ‘HKCU\SOFTWARE\Microsoft\Shared Tools\Proofing Tools\Grammar’ for types available on your system.
    Note: Not included in WLrip output.
0x89 | 81 02 00 00 | 641 | 4 bytes | Length of First Section of Body
    Multiply integer value by 2 for byte length.
0x8D | - | - | 1282 bytes | First Length of Body Text
0x58F | 01 | 1 | 1 byte | More Body Flag
    01 = there is another section of body
    00 = body is complete
0x590 | 80 02 00 00 | 640 | 4 bytes | Same as red offset 0x7D
0x594 | 05 00 00 00 | 5 | 4 bytes | Same as red offset 0x81
0x598 | 09 04 00 00 | 1033 | 4 bytes | Same as red offset 0x85
0x59C | 1C 00 00 00 | 28 | 4 bytes | Same as red offset 0x89
0x5A0 | - | - | 56 bytes | Second Length of Body Text
0xAC9 | 00 | 0 | 1 byte | Same as red offset 0x58F
0xACA | 06 00 00 00 | 6 | 4 bytes | Same as blue offset 0x25
0xACE | 09 04 00 00 | 1033 | 4 bytes | Same as blue offset 0x29
0xAD2 | 1B 00 00 00 | 27 | 4 bytes | Same as blue offset 0x2D
0xAD6 | - | - | 54 bytes | Metadata Text (Email subject in this example)
0xB0C | 00 | 0 | 1 byte | Same as blue offset 0x49
    Doesn’t necessarily terminate on a 1 at the end of the record. WLrip has record length checks to mitigate parsing errors.

* For more information on MailItem objects, see:
https://msdn.microsoft.com/en-us/library/office/ff861332.aspx
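The layout above can be exercised with a short sketch. This is not WLrip's code; it is a minimal reading of the table, covering the fixed fields at offsets 0x00 to 0x24 and the repeating metadata values that follow. The function names are my own, the sample name and email address are made up (chosen to be 12 and 19 characters long, as in the example record), and decoding the text as UTF-16LE is an assumption based on each character occupying two bytes.

```python
import struct

# Offsets 0x00-0x24: signature, length, length (repeated), FILETIME,
# record ID, sent flag, unknown byte, type, DocID, more-metadata flag.
HEADER = struct.Struct("<5sIIQIBBB8sB")


def parse_header(buf):
    """Unpack the fixed fields at the start of the file's first record."""
    (signature, length, length_repeat, filetime, record_id,
     sent_flag, unknown, type_flag, doc_id, more_flag) = HEADER.unpack_from(buf, 0)
    return {
        "signature": signature, "record_length": length,
        "record_length_repeat": length_repeat, "filetime": filetime,
        "record_id": record_id, "sent_flag": sent_flag, "unknown": unknown,
        "type_flag": type_flag, "doc_id": doc_id.hex(), "more_flag": more_flag,
    }


def parse_metadata_values(buf, offset):
    """Read (type, proofing type, length, text, flag) groups from `offset`.

    Returns the decoded (type, text) pairs and the offset following the
    last flag read. Stops on a 01 flag or when the buffer runs out, since
    records do not always terminate cleanly.
    """
    values = []
    while offset + 12 <= len(buf):
        type_flag, _proofing, chars = struct.unpack_from("<III", buf, offset)
        start = offset + 12
        end = start + chars * 2  # length is stored in characters
        if end + 1 > len(buf):
            break
        # Assumption: two bytes per character means UTF-16LE text.
        values.append((type_flag, buf[start:end].decode("utf-16-le", errors="replace")))
        flag = buf[end]
        offset = end + 1
        if flag == 0x01:  # 01 = no more values
            break
    return values, offset


def _value(type_flag, text, flag):
    """Build one synthetic metadata value for the demonstration below."""
    encoded = text.encode("utf-16-le")
    return struct.pack("<III", type_flag, 0, len(encoded) // 2) + encoded + bytes([flag])


# Header bytes copied from the table; metadata text is invented.
sample = bytes.fromhex(
    "64 00 00 00 00" "03 0B 00 00" "03 0B 00 00" "40 59 7B 44 58 F8 D1 01"
    "0F 00 00 00" "00" "00" "01" "00 00 00 00 00 00 00 00" "00"
) + _value(7, "Jane Example", 0) + _value(4, "jane@example.com.au", 1)

fields = parse_header(sample)
values, next_offset = parse_metadata_values(sample, HEADER.size)
print(fields["record_length"], fields["record_id"], values, hex(next_offset))
```

With the synthetic values sized like the example record, parsing stops at 0x7D, which is where the table places the first body text field.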

Conclusion

WaitList is an additional source of email, contact and document evidence to add to our arsenal of forensic examination and e-discovery tools. Should you have any questions, recommendations or corrections for any of the detail in this blog, please post in the comments section below.
---------------------------------------------------------------------------------------------------------------------
Disclaimer: The information detailed within this report is based on my limited testing and analysis of the ‘WaitList.dat’ file. I do not currently claim to have a complete understanding of the structure or function of this file. Confirmation and testing of my findings by the broader forensic community is required before this information should be relied upon.

3 comments:

  1. Thanks for putting this out there. Very interesting and helpful.

    1. No worries at all. I am glad you found it useful!

  2. Well done, this yielded some horrific results on my desktop I use to work from home, NB: Not a touch device. Something that bothered me was some of the data seems as though it could have only have come from a "clipboard" type vector whilst I was using a remote desktop / remote workspace connection. I CANNOT Confirm that, it is only speculation!!! I am investigating further at the moment...
