I had an enjoyable time presenting my work at the Digital Forensics Research Conference — DFRWS APAC 2023. Here’s what I presented, for your reading pleasure.
1.1 How the cache comes about
Remote Desktop Protocol (RDP) has been around for more than 20 years. It is the service that allows users to remotely access their Windows desktops. As the Windows desktop is a graphical interface, lots of images are loaded over the network. If there is high network latency, the user experience suffers badly. To prevent that, RDP clients provide an option to cache the images on the client’s end.
1.2 Where the cache is stored
Different RDP clients store the cache at different locations and in different formats. Here are 3 examples:
- mstsc.exe (Windows): %LocalAppData%\Microsoft\Terminal Server Client\Cache\Cache****.bin
- rdesktop: $HOME/.rdesktop/cache/pstcache*
- FreeRDP: depends on the `/cache:persist-file:<filename>` parameter.
The focus of this presentation will be the cache that is produced by the Windows client: mstsc.exe.
After the RDP session terminates, the cache becomes a forensics artefact as it essentially contains snippets of screenshots of the RDP session. Sounds interesting? Let’s take a look!
1.3 What is inside the cache
If we open up the cache and lay the images in the sequential order of how they were stored, we get a collage that looks like this:
What. A. Mess. It does not resemble the desktop at all!
In fact, the desktop has been cut up into small tiles, and there are over 6,000 such tiles stored in the cache.
The cache also has a size limit, so not every tile will be stored. This results in missing pieces.
We definitely need to do some post-processing work before we can make sense of it. That brings us to the existing approach and public tools to do so.
2.1 Existing approach and public tools
The following flow chart shows the existing approach and public tools that are used to examine the cache:
Firstly, we extract the tiles from the cache using BMC Tools and create the collage that was shown previously.
Next, we have 2 options.
- Just analyse the cache as-is. Eyeballing will work, but our eyes will hurt. Alternatively, we can run OCR on the collage and let it extract the text. The extracted text is then matched heuristically against a set of pre-defined keywords to determine whether any of them occur.
- Rearrange the tiles first. We can either do it manually using RdpCacheStitcher, or try to automate the process using RdpCacheStitcher or RDPieces. The catch is that both tools generate false positives. Alternatively, if we look into tools that solve jigsaw puzzles automatically, there is one called JigsawGAN which can solve a 16-piece jigsaw puzzle. However, our cache has over 6,000 pieces, which is on a much larger scale. Furthermore, JigsawGAN assumes a complete picture, whereas our cache has missing tiles.
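The heuristic keyword-matching step after OCR can be sketched in a few lines of Python. This is a minimal illustration that assumes the OCR output is already available as a plain string; the keyword list and similarity threshold below are made-up examples, not taken from any of the tools above.

```python
from difflib import SequenceMatcher

# Hypothetical keywords an examiner might search for.
KEYWORDS = ["administrator", "password", "powershell"]

def heuristic_matches(ocr_text, keywords=KEYWORDS, threshold=0.8):
    """Match OCR tokens against keywords, tolerating OCR noise.

    A token counts as a hit when its similarity ratio to a keyword
    meets the threshold, so a misread like 'Adm1nistrator' still
    matches 'administrator'.
    """
    hits = []
    for token in ocr_text.split():
        for kw in keywords:
            ratio = SequenceMatcher(None, token.lower(), kw).ratio()
            if ratio >= threshold:
                hits.append((token, kw, round(ratio, 2)))
    return hits

print(heuristic_matches("logged in as Adm1nistrator via powershell"))
```

The fuzzy ratio is what makes the matching "heuristic": exact string search would miss OCR misreads entirely.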
2.2.1 Not exactly solving jigsaw puzzles
We aren’t exactly solving jigsaw puzzles either.
As illustrated in the diagram above, these are 15 tiles taken from the same position on the desktop. If we are reconstructing the toolbar, all 15 tiles will fit that spot.
But is there really a need to reconstruct the toolbar?
Maybe not. We can already derive the context from these 15 tiles, i.e. the time period of the RDP session. Even if it is not the whole duration of the RDP session, this information can be used to supplement other event logs.
However, the premise is that we have already sorted out these 15 tiles and grouped them together beforehand.
2.2.2 Why grouping related tiles helps
Here is another scenario showing why grouping related tiles into the same group helps.
In this ‘live’ RDP session, the 2 windows — command prompt and browser — don’t overlap, so they can be reconstructed separately without affecting each other.
As for the desktop background, it is mostly just blue and uninteresting so there is no need to spend time on it.
If we can obtain all the tiles for the command prompt, browser and background in separate groups, we will be able to reconstruct the 2 windows separately and discard the background.
2.2.3 Why grouping related tiles helps
Here is yet another scenario to demonstrate why grouping related tiles helps.
This snippet of the collage shows the word “Administrator”, but it has been jumbled up. We can still pick it out easily because the 3 tiles happen to be close to one another. It wouldn’t have been so easy had they been far apart.
Therefore, it definitely helps to get related tiles to be in close proximity.
3. Where the gap is in the existing approach
I can’t emphasize enough the need to group related tiles together first, as that is the missing step before rearranging the tiles:
When it comes to grouping images, there are already well-established ways to do so in unsupervised machine learning, i.e., clustering.
So let’s utilize clustering to solve our problem!
4.1 Addressing the gap
Clustering of images is a multi-step process.
Broadly speaking, we start by extracting the features for 1 image, followed by generating the dataset for n images, then clustering and evaluating the results.
Steps 2 & 3 can be generalized once we have the dataset.
Step 1 is key. What features to extract and how to do so?
4.2 Feature extraction ideas
Let’s draw some ideas from our earlier scenarios on what features to extract.
4.2.1 Use colours to differentiate the tiles
We want all the black tiles to be in one group as that gives us all the tiles for the command prompt.
We want all the blue background tiles to be in another group so that we can discard it.
The browser tiles are more complicated, so we will put them aside for now.
Next, we extract the colours of the image to get the features.
We take the image histogram, which consists of 3 channels: Red, Green and Blue. Each channel has 256 intensity bins, so there are 768 features in total.
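The 768-value histogram can be sketched as follows. This is a minimal pure-Python illustration that operates on a list of (R, G, B) pixel tuples; in practice a library call such as Pillow’s Image.histogram() returns the same 768 values for an RGB image.

```python
def rgb_histogram(pixels):
    """Build a 768-dimensional feature vector from (R, G, B) pixels:
    256 bins per channel, concatenated as R | G | B."""
    hist = [0] * 768
    for r, g, b in pixels:
        hist[r] += 1          # red bins:   indices 0..255
        hist[256 + g] += 1    # green bins: indices 256..511
        hist[512 + b] += 1    # blue bins:  indices 512..767
    return hist

# A tiny 2x2 "tile": three black pixels and one pure-red pixel.
tile = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (255, 0, 0)]
features = rgb_histogram(tile)
print(len(features), features[0], features[255])  # → 768 3 1
```

A mostly-black command prompt tile and a mostly-blue background tile produce very different vectors, which is exactly what the clustering step needs.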
Next, we generate the dataset for n images, and apply Principal Component Analysis (PCA) to reduce the number of features to, say, a ballpark figure of 50.
Theoretically, rigorous math is involved to justify the number of principal components. In practice, I find it much quicker to experiment with a ballpark figure and not get too fixated on theory.
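The PCA step can be sketched with NumPy’s SVD, which is essentially what scikit-learn’s PCA does under the hood. The random data below is only a stand-in for the real histogram dataset.

```python
import numpy as np

def pca_reduce(X, n_components=50):
    """Project an (n_samples, n_features) matrix onto its top
    principal components via SVD of the centred data."""
    Xc = X - X.mean(axis=0)                     # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T             # (n_samples, n_components)

# Illustrative stand-in: 100 random "histograms" with 768 features.
rng = np.random.default_rng(0)
X = rng.random((100, 768))
reduced = pca_reduce(X, n_components=50)
print(reduced.shape)  # (100, 50)
```

Dropping from 768 to 50 dimensions makes the distance computations in the clustering step much cheaper while keeping most of the variance.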
4.2.2 Use the contents as features
The next idea is derived from the scenario where all 15 tiles look similar because they contain the time, date and a blue horizontal strip at the top. Is there a way to recognise these contents?
Drawing ideas from Computer Vision, specifically the section on feature extraction, there are ways to get lines, edges, ridges, localized interest points, or more complex ones like texture and shape. Let’s try a few of them and see what we get.
The original image is as shown below:
If we use the Prewitt operator to detect and extract the horizontal edges, we get the following output image:
If we use Histogram of Oriented Gradients (HOG) to extract the objects, we get the following output image:
After extracting either edges or objects, we obtain the features for 1 image by flattening the matrix output into a vector.
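The horizontal-edge pass and the flattening step can be sketched in NumPy on a toy image. This is an explicit-loop illustration of the Prewitt horizontal kernel; real code would use a library operator such as skimage.filters.prewitt_h instead.

```python
import numpy as np

# Prewitt kernel that responds to horizontal edges
# (i.e. vertical changes in intensity).
PREWITT_H = np.array([[ 1,  1,  1],
                      [ 0,  0,  0],
                      [-1, -1, -1]])

def prewitt_features(gray):
    """Slide the horizontal Prewitt kernel over a 2-D greyscale
    image ('valid' region only) and flatten the response matrix
    into a 1-D feature vector."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(gray[i:i + 3, j:j + 3] * PREWITT_H)
    return out.ravel()

# Toy 4x4 image: bright top half, dark bottom half —
# a strong horizontal edge across the middle.
img = np.array([[9, 9, 9, 9],
                [9, 9, 9, 9],
                [0, 0, 0, 0],
                [0, 0, 0, 0]], dtype=float)
print(prewitt_features(img))  # → [27. 27. 27. 27.]
```

The same flatten-to-vector step applies to the HOG output, so both feed into the clustering pipeline identically.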
We can also extract objects using a Convolutional Neural Network (CNN) such as a pre-trained VGG19, with the final classification layer removed so that it produces feature representations of the objects instead of class predictions.
The output of the fully connected layer (fc2) produces a vector of size 4096 and it is used as the features for 1 image. Similarly, we generate the dataset for n images and apply PCA to reduce the number of dimensions to 50.
4.2.3 Separate tiles with different sizes
The third idea is less obvious: the tiles in the cache can actually have different sizes. Depending on the screen resolution, tiles at the bottom row can be shorter than the rest.
When we group the tiles according to their sizes, we get 2 clusters.
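Splitting by size needs no machine learning at all. A sketch, assuming each tile is available as a name plus a (width, height) pair; the specific sizes below are illustrative, not taken from an actual cache.

```python
from collections import defaultdict

def group_by_size(tiles):
    """Bucket tiles by their (width, height); each distinct
    size becomes one cluster."""
    clusters = defaultdict(list)
    for name, size in tiles:
        clusters[size].append(name)
    return dict(clusters)

# Illustrative sizes: full-height tiles plus a shorter bottom row.
tiles = [("t0", (64, 64)), ("t1", (64, 64)),
         ("t2", (64, 32)), ("t3", (64, 64))]
groups = group_by_size(tiles)
print({size: len(names) for size, names in groups.items()})
```

With two distinct sizes in the cache, this yields exactly the 2 clusters described above.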
Tiles in Cluster 0 form the toolbar, so our requirement to derive the time period of the RDP session is more or less met.
Cluster 1 is still very large. Let’s use colours to split it up into smaller clusters.
5.1 Clustering using colours
The question is, how many clusters do we want? The silhouette score can be used to guide us. It measures how similar an object is to its own cluster as compared to the other clusters. Generally, higher scores indicate better clustering.
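The silhouette computation can be sketched in pure Python for 1-D data. In practice one would call sklearn.metrics.silhouette_score on the feature vectors; the sample values here are made up to show two well-separated clusters.

```python
def silhouette_score(points, labels):
    """Mean silhouette over all points: s = (b - a) / max(a, b),
    where a is the mean distance to other points in the same
    cluster and b is the lowest mean distance to another cluster."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)

    scores = []
    for i, label in enumerate(labels):
        same = [j for j in clusters[label] if j != i]
        if not same:                 # singleton cluster scores 0
            scores.append(0.0)
            continue
        a = sum(abs(points[i] - points[j]) for j in same) / len(same)
        b = min(
            sum(abs(points[i] - points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to +1.
pts = [1.0, 1.1, 1.2, 9.0, 9.1, 9.2]
lbl = [0, 0, 0, 1, 1, 1]
print(round(silhouette_score(pts, lbl), 3))
```

Scanning this score over a range of cluster counts is what produces the plot used below to pick the number of clusters.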
From the plot above, creating 2 clusters gives us the highest score, but that means we will still have, on average, 3,000 tiles per cluster, which is still too large to deal with.
The next two highest scores are at 5 and 13 clusters. Let’s use the following 2 diagrams to determine whether 5 or 13 is better.
Even if we increase from 5 to 13 clusters, that large chunk on the right isn’t split up very much anyway. So let’s go with 5 clusters, create the collages and observe the results.
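The clustering itself can be sketched with a minimal 1-D k-means; in practice one would run sklearn.cluster.KMeans on the PCA-reduced feature vectors, and the toy values below only show the mechanics.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: choose k starting centres, then
    alternate between assigning points to their nearest centre
    and recomputing each centre as its group's mean."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centres[c]))
            groups[nearest].append(p)
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres, groups

# Two obvious 1-D clusters; k-means recovers both centres.
centres, groups = kmeans([1.0, 1.1, 1.2, 9.0, 9.1, 9.2], 2)
print(sorted(round(c, 2) for c in centres))  # → [1.1, 9.1]
```

The real pipeline works the same way, just with 50-dimensional PCA vectors and Euclidean distance instead of 1-D absolute differences.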
We managed to get most of the command prompt tiles in the same group (Cluster 4). Some command prompt tiles seem to have ended up in Cluster 3 but it’s now easier to identify them and manually shift them to the correct cluster.
Cluster 1 is still large and using colours isn’t very effective. Let’s switch to clustering using contents.
5.2 Clustering using contents
We roughly see 3 types of tiles in Cluster 1 (replicated below), i.e. tiles containing:
- File explorer icons
- Lines of text
- Blue strips
Therefore, we expect to split it into at least 3 subclusters.
5.2.1 Clustering using contents — VGG19
The scores below indicate that we should go for at least 4 subclusters rather than 3:
If we go for 4 subclusters, only 1 of them (subcluster 3 as shown below) will have below average scores, i.e. many samples inside subcluster 3 have ended up in the wrong cluster:
If we go for 12 subclusters, things will improve:
For demonstration purposes, let’s go with 4 subclusters and this is what we get:
As we can see in the collages above, even with 4 clusters, the results are not bad: file explorer icons in one cluster, lines of text in another, etc.
5.2.2 Clustering using contents — HOG
Let’s compare the output of HOG against VGG19.
Unlike VGG19, we should not go with 4 subclusters because almost the entire subcluster (number 2) has negative scores, indicating that the majority of the samples are in the wrong cluster:
Both 9 and 15 subclusters look reasonable because the wrongly clustered samples in subcluster 2 above have been corrected:
Had we really gone with 4 subclusters, we would see from subcluster 2 below that the file explorer icon tiles are still mixed with the lines-of-text tiles:
This is the result of splitting into 9 subclusters:
It is much better now because we get 3 subclusters (1, 8, 3) that are related to the file explorer.
Most of the tiles that contain lines of text are put into subcluster 0.
Other subclusters that are not so interesting have been isolated, as shown below. Even if there are interesting ones embedded inside the wrong subcluster, we can now separate them much more easily.
5.2.3 Clustering using contents — Prewitt
The Prewitt operator extracts edges, and it is the last algorithm to try.
We observe that this method is not ideal because the large chunk in the middle refuses to split into smaller ones, even when we increase from 4 to 14 subclusters:
From the collage, we see that all the useful information is still clumped into Subcluster 0:
6 Conclusion
By utilizing unsupervised Machine Learning, specifically clustering, we have plugged a gap in our toolset to examine RDP bitmap caches.
Grouping of related tiles can now be done effortlessly, which leads to precious time savings.
Having smaller clusters allows us to:
- Reduce the size of the rearrangement problem
- Quickly identify interesting clusters that we should focus on first, and likewise discard the boring ones
- Derive the larger context of the information provided by the cluster, which may be more useful than trying to reconstruct the original image
The accompanying code that is used to generate the features, dataset, graphs, etc. is hosted at https://github.com/ng-ky/unravel.