Conti Inc.: Understanding the Internal Discussions of a large Ransomware-as-a-Service Operator with Machine Learning

Ransomware-as-a-service (RaaS) is increasing the scale and complexity of ransomware attacks. Understanding the internal operations behind RaaS has been a challenge due to the illegality of such activities. The recent chat leak of the Conti RaaS operator, one of the most infamous ransomware operators on the international scene, offers a key opportunity to better understand the inner workings of such organizations. This paper analyzes the main topic discussions in the Conti chat leak using machine learning techniques such as Natural Language Processing (NLP) and Latent Dirichlet Allocation (LDA), as well as visualization strategies. Five discussion topics are found: 1) Business, 2) Technical, 3) Internal tasking/Management, 4) Malware, and 5) Customer Service/Problem Solving. Moreover, the distribution of topics among Conti members shows that only 4% of individuals have specialized discussions while almost all individuals (96%) are all-rounders, meaning that their discussions revolve around the five topics. The results also indicate that a significant proportion of Conti discussions are non-tech related. This study thus highlights that running such large RaaS operations requires a workforce skilled beyond technical abilities, with individuals involved in various tasks, from management to customer service or problem solving. The discussion topics also show that the organization behind the Conti RaaS operator shares similarities with a large firm. We conclude that, although RaaS represents an example of specialization in the cybercrime industry, only a few members are specialized in one topic, while the rest run and coordinate the RaaS operation.


Introduction
In the past 10 years, there has been an increase in the scale, complexity, and number of ransomware attacks (Ryan, 2021). This was facilitated by the rise of ransomware-as-a-service (RaaS) business models, which provide the infrastructure and technology to conduct ransomware attacks (Salvi, 2019; Meland et al., 2020; Maurya et al., 2018; Alwashali et al., 2021). Among the well-known RaaS operators, Conti (formerly Ryuk) (Cimpanu, 2020) stands out as one of the most active and famous (Chainalysis, 2023).
Conti has been active since early 2020 (Shriebman, 2022), and its ransomware has targeted high-profile organizations, including government agencies, municipalities, healthcare facilities, law enforcement agencies, 9-1-1 dispatch centers and universities (CISA, 2020, 2024). Attacks attributed to Conti are known to demand high ransom payments, generally in Bitcoin, while also threatening to publish the victim's data if the payment is not made (Fokker and Tologonov, 2022).
In early 2022, following the Russian invasion of Ukraine, the Conti RaaS operator announced its support for the Russian government. This announcement allegedly led to the leak of hundreds of thousands of their internal chat logs (Vx-underground, n.d.). This leak represents a key opportunity to better understand the inner workings of the Conti RaaS operator. However, given the hundreds of thousands of conversations, a manual analysis would be time-consuming and monotonous.
This study uses machine learning algorithms to uncover insights into the organization of Conti. More precisely, it leverages well-established machine learning methods, including Natural Language Processing (NLP) and Latent Dirichlet Allocation (LDA), coupled with visualization strategies, to uncover the main topic discussions in Conti leaked chats. The results of the analysis showed five distinct topics that are distributed across discussions: (1) Business, (2) Technical, (3) Internal tasking/Management, (4) Malware, and (5) Customer Service/Problem-Solving. How these topics are distributed among well-known actors is compared with qualitative analyses conducted by other security researchers. The study's key takeaways are:
• The discussion topics uncovered highlight the enterprise-like organization of the Conti RaaS operator.
• A significant proportion of Conti discussions are non-tech related; large RaaS operations require a workforce skilled beyond technical abilities.
• Only 4% of individuals have specialized discussions, while most individuals (96%) are all-rounders with diverse discussions.
The results of the study corroborate the idea that running such a large RaaS operation translates to developing an enterprise-like structure. The importance of non-tech talk, including business discussions as well as internal tasking and management discussions, also shows that the coordination of large RaaS operations requires a workforce skilled beyond technical abilities. Moreover, only a few individuals need to be really specialized in one area, while the rest coordinate the activities between members and customers. Even for cybercrime organizations, the bigger the organization becomes, the more "all-rounder" individuals are required to sustain the economic activities. Finally, this study illustrates how to automatically extract actionable information on the organization of a sophisticated cybercrime organization. The rest of the paper is organized as follows: Sect. "Background and context" presents a short literature review on ransomware-as-a-service and the Conti group; Sect. "Methods and data" outlines the methods and data; Sect. "Results" presents the results of the study; Sect. "Discussion" provides a discussion; Sect. "Study limitations and future research" presents the limitations and future research; Sect. "Conclusion" concludes.

Background and context
This section starts by presenting the state of research on ransomware and the rise of ransomware-as-a-service. Then, what is known about the Conti organization is presented to provide context for the study.

Ransomware-as-a-service (RaaS) business model
Ransomware attacks have devastating impacts on enterprises worldwide (Brewer, 2016; Oosthoek et al., 2022; Kamil et al., 2022). Such attacks refer to an extortion scheme in which an attacker compromises one or several devices, locks the device(s) or encrypts the files, and then asks for money in return for restoring access to the device(s) and/or providing the key that can decrypt the files. Since the first known instance of ransomware, identified as the AIDS Trojan (Peattie, 1995), ransomware attacks have become a central threat to information technologies and the topic of several studies aimed at preventing and detecting them (Kirda, 2017; Kok et al., 2019; Richardson and North, 2017; Scaife et al., 2016; Song et al., 2016; Lee et al., 2018).
In the past ten years, the threat has evolved. Kharraz and colleagues (2015) analyzed 1359 samples from 15 ransomware families and found that the number of families with destructive capabilities was small. In the same vein, Gazet (2010) conducted a comparative analysis of 15 ransomware samples and concluded that ransomware attackers relied on small attacks for small ransoms, which added up to high amounts through mass propagation. In that study, the bulk of ransomware attackers followed a low-cost and low-risk business model. Since then, there has been an increase in the scale, complexity, and number of ransomware attacks (Ryan, 2021). Indeed, according to recent studies, ransomware attackers are now successful at compromising advanced information systems (Kalaimannan et al., 2017), and they are better at generating revenue through various extortion schemes (O'Kane et al., 2018).
Yet, such an increase in the capacities of ransomware attackers may also be due to the rise of as-a-service business models that now characterize the cybercrime industry (Huang et al., 2018; Manky, 2013; Hyslip, 2020). Specifically, the ransomware-as-a-service (RaaS) business model provides the infrastructure and technology to conduct ransomware attacks (Salvi, 2019; Meland et al., 2020; Maurya et al., 2018; Alwashali et al., 2021). RaaS clients, known as affiliates, can purchase pre-developed ransomware tools to execute attacks. Usually, affiliates need to connect to a platform, download the ransomware file, conduct the attack, and manage the victims (Hyslip, 2020). Some RaaS operators also provide more support to affiliates, such as negotiating ransoms and/or providing customer support. In some cases, the affiliate and the operator split the profit generated from the attack (Meland et al., 2020). In the end, RaaS models reduce the barriers to entry into the market but do not completely remove them, as affiliates still need good technical knowledge to purchase and use the service (Meland et al., 2020).
Recently, a study by Chainalysis (2023) suggested that a small number of affiliates are responsible for a large number of attacks and that these affiliates work with many RaaS operators. Such concentration was also observed among RaaS operators: according to the same study (Chainalysis, 2023), there are a few prolific RaaS operators, including Conti.
Given the scale and professionalization of these prolific cybercrime operators, their structure may resemble that of an enterprise. Indeed, Lusthaus (2018) interviewed over 200 individuals linked to cybercrime and suggested that some cybercrime organizations may now be organized as firms with offices, floors, and work days. Such a corporate-like structure develops where the forces of illegality [as defined by Reuter (1983)], and specifically the risks of arrest, are absent. Without the threat of law enforcement, individuals can openly organize (Lusthaus and Varese, 2021; Lusthaus, 2018). Nevertheless, note that most studies point towards cybercrime organizations being rather small and loosely organized (Leukfeldt et al., 2019; Leukfeldt, 2014; Leukfeldt et al., 2016; Leukfeldt and Holt, 2020; Leukfeldt et al., 2017, 2017a, 2017b, 2017c; Lusthaus, 2018). Yet, RaaS providers, and at least Conti, seem to be the exception to the rule. Understanding how these groups operate is key to countering their criminal activities.

The Conti RaaS operator
Active since 2020, the Conti RaaS operator successfully ran more than 700 campaigns (CheckPoint Research, 2022), generating a revenue, in 2021, of over $2.7 billion in cryptocurrency (Shriebman, 2022). To spread ransomware in their victims' networks, Conti was known to leverage phishing campaigns or exploit unpatched software vulnerabilities (Umar et al., 2021; Alzahrani et al., 2022). Their phishing campaigns usually contained a zip file or a link luring the victims into downloading a Trojan, which provided a backdoor to deploy their ransomware (Alzahrani et al., 2022).
Following the Russian invasion of Ukraine in February 2022, the Conti RaaS operator announced its support for the Russian government, which allegedly led to the leak of over 160,000 messages from their internal jabber chat logs (Vx-underground, n.d.). The person responsible for the leak used a newly created Twitter account, @ContiLeaks (ContiLeaks, 2022), to release the files, which also include the source code for the Conti ransomware and other internal project source codes that the Conti organization used to facilitate its operations.
Since then, qualitative analyses of the chat log have been conducted by various security researchers from the private industry (Fokker and Tologonov, 2022; Cimpanu, 2020; Krebs, 2022; CheckPoint Research, 2022; Kovacs, 2022). These analyses support the idea that Conti is organized as a firm with physical office buildings, a regular pay schedule and predefined departments such as human resources, finance and reversing (Fokker and Tologonov, 2022; Cimpanu, 2020; Krebs, 2022; CheckPoint Research, 2022; Kovacs, 2022). Conti's structure followed a classic organizational hierarchy, with team leaders who reported to upper management (CheckPoint Research, 2022). The operator had more than 100 people on its payroll, and employees were assigned a specific 5-day workweek (Krebs, 2022).
Recently, according to Kovacs (2022), the Conti organization has shut down the "Conti brand", transitioning to a different organizational structure involving multiple subgroups. Still, the leaked chat log represents a golden opportunity to uncover insights into the organization of the Conti RaaS operator beyond these manual qualitative investigations.

Methods and data
This section covers the methods and data used to conduct the analysis. It is detailed enough that researchers who wish to reproduce the analysis, on the Conti chat log or on any other data corpus, can do so easily. The data source, data preprocessing (cleanup), and modeling strategy are presented below. The goal of the analysis was to automatically detect the discussion topics of Conti members. To do so, we used (1) NLP to clean the data, (2) LDA topic modeling to create topic clusters, and (3) data visualizations to extract meaning from the results.

Dataset
The chat files used for the research were extracted from TheParmak GitHub (TheParmak, 2023), which was one of the first repositories providing an open source access to the Conti chats translated in English.
The available jabber chat logs cover the period from June 21, 2020, to March 2, 2022. The data consists of 168,711 chats. These chat logs list the discussions of 346 actors, including members of the organization as well as potential affiliates and customers. The files are in a JSON format, and each log contains the date, the sender, the receiver as well as the actual message. They are structured as follows:
• "ts": "2021-12-11T08:48:06.821161",
• "from": "Actor 34@q3mcco35auwcstmt.onion",
• "to": "Actor 77@q3mcco35auwcstmt.onion",
• "body": "hello"
We aggregated all chats sent per actor. Table 1 shows a summary of the aggregated chats per actor after the processing. This dataset is referred to below as the corpus.
When chats were posted as a general message in a channel containing several members, they appeared more than once in an actor's corpus. For example, if Actor A posted "hello guys" in a channel, it would appear X times in the actor's corpus, with X being the number of people in the channel, even though Actor A posted this message only once, as illustrated in Table 2. Such repetitive chats were problematic for the model developed below for two reasons: (1) they distorted what an actor "really" posted, so the actor's corpus would no longer accurately represent the actor's activity, and (2) they impaired the process of topic creation, as a topic is a set of words that are often seen together throughout documents. They were thus removed. Each actor's corpus was then cleaned using Natural Language Processing (NLP), as explained below.
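The aggregation and de-duplication steps described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `build_actor_corpora`, the shortened actor names, and the choice of (sender, timestamp, body) as the key that identifies a repeated channel post are all assumptions, since the exact rule is not stated here.

```python
import json
from collections import defaultdict


def build_actor_corpora(records):
    """Group message bodies by sender, keeping each channel post only once.

    Assumption: a channel post is stored once per recipient with the same
    sender, timestamp, and body, so that triple is used as the identity key.
    """
    seen = set()
    corpora = defaultdict(list)
    for rec in records:
        key = (rec["from"], rec["ts"], rec["body"])
        if key in seen:
            continue  # same post delivered to another channel member
        seen.add(key)
        corpora[rec["from"]].append(rec["body"])
    return dict(corpora)


# Hypothetical records using the field names of the leaked JSON logs.
sample = json.loads("""[
  {"ts": "2021-12-11T08:48:06", "from": "A", "to": "B", "body": "hello guys"},
  {"ts": "2021-12-11T08:48:06", "from": "A", "to": "C", "body": "hello guys"},
  {"ts": "2021-12-11T08:48:06", "from": "A", "to": "D", "body": "hello guys"},
  {"ts": "2021-12-11T08:50:00", "from": "A", "to": "B", "body": "any news?"},
  {"ts": "2021-12-11T08:51:00", "from": "B", "to": "A", "body": "hi"}
]""")

corpora = build_actor_corpora(sample)
```

In this toy input, the three copies of "hello guys" collapse into a single entry in A's corpus, while the later message is kept.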

Natural language processing (NLP)
To clean the chats, we used Natural Language Processing (NLP). NLP is a subfield of artificial intelligence that focuses on allowing a machine to understand natural language, that is, human language (Chowdhary, 2020; Raina and Krishnamurthy, 2022). Basically, NLP teaches a machine to learn, understand, and derive meaning from a language. It uses various algorithms to learn and follow grammatical rules, which are then used to derive meaning out of words and sentences (Chowdhary, 2020; Raina and Krishnamurthy, 2022). Some of the most commonly used algorithms are stemming (reducing words to their lexical root), lemmatization (converting a word into its canonical form), and tokenization (dividing the text into meaningful pieces). NLP is used in a myriad of fields such as biology (Ofer et al., 2021), translation (Zong and Hong, 2018), business intelligence (Vashisht and Dharia, 2020) and psychology (Andrew Stephen Henning, 2017), to name a few.
Using NLP algorithms, we were able to clean the chat logs, keeping only relevant words, such as "hack", "pay" or "malware". To do so, we first used normalization, which changed all words to lowercase. Second, we removed all irrelevant material from the text, like stop words, punctuation, and HTML links. Stop words are commonly used words that are not essential to the context or meaning of the sentence: "I", "is", "the", "you". Third, we tokenized the text, which consisted in dividing the text into meaningful pieces or elements for the algorithm. The message "I like blue birds" then became "[like; blue; birds]". Fourth, we lemmatized the text, which is the process of converting a word into its "canonical form". In other words, "codes" became "code" and "talked" became "talk". Thus, words in the third person were changed to the first person, and verbs in past and future tenses were put in the present tense.
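The four cleaning steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pipeline used in the study: `STOP_WORDS` is a tiny hand-written list and `LEMMAS` is a toy lookup table standing in for a real lemmatizer, both chosen only so the example runs without external data.

```python
import re

# Assumption: a tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"i", "is", "the", "you", "a", "an", "and", "to", "of"}
# Assumption: a toy lookup table standing in for a real lemmatizer.
LEMMAS = {"codes": "code", "talked": "talk"}


def preprocess(message):
    """Lowercase, strip links and punctuation, tokenize, drop stop words, lemmatize."""
    text = message.lower()                               # 1. normalization
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # 2. remove links
    text = re.sub(r"[^\w\s]", " ", text)                 # 2. remove punctuation
    tokens = text.split()                                # 3. tokenization
    # Stop words are filtered after splitting here, purely for simplicity.
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 2. remove stop words
    return [LEMMAS.get(t, t) for t in tokens]            # 4. lemmatization


tokens = preprocess("I like blue birds")
```

With the toy table, "I like blue birds" yields the three tokens given in the text, and "talked" and "codes" are mapped to "talk" and "code".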
This process allowed us to identify some actors who stood out for their small corpus compared to others. Some had more than 4000 words, whereas others had only ten relevant words, or even two, after the data processing. For the algorithm (presented below) to process the meaning of discussions, an actor has to have a substantial amount of chats. Hence, we removed actors whose corpus contained fewer than 100 words, reducing the

Latent dirichlet allocation (LDA)
To find the discussion topics of Conti members, we computed Latent Dirichlet Allocation (LDA) topic models based on actors' corpus. LDA is a topic modeling method based on a generative probabilistic model for text corpora. It is widely applied with NLP to uncover topics from unordered corpora of documents (Blei et al., 2003). The basic idea behind LDA is that each document is represented as a finite mixture of latent topics, and each topic is characterized by its own distribution over words. LDA thus extracts the latent topics from a corpus of documents and simultaneously assigns a probabilistic mixture of these topics to each document. The topic probabilities therefore provide an explicit representation of a document. Topic models are applied in various fields, including political science (Zhou and Na, 2019), medicine (Wu et al., 2011) and cybersecurity (Kolini and Janczewski, 2017).
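For readers who prefer a formula, the mixture idea can be written compactly. The symbols below are notation introduced here for illustration (they do not appear elsewhere in the text): $\theta_d$ is the topic mixture of document $d$ (here, one actor's corpus), $\phi_k$ is the word distribution of topic $k$, $K$ is the number of topics, and $\alpha$ is the Dirichlet prior on the mixtures.

```latex
% Generative view: each word of document d gets a topic, then a word from that topic
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})

% Resulting probability of word w in document d: a mixture over the K topics
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\, \phi_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w} \phi_{k,w} = 1
```

The vector $\theta_d$ is the "probabilistic mixture of topics" assigned to each document, and each $\phi_k$ is the word list that characterizes a topic.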
The LDA model was implemented using MALLET 2.0.8 (MALLET, 2018) and the gensim wrapper (Gensim, 2023). To find the best model, we developed a strategy that combined the traditional coherence score with heuristic interpretations of the main topics discussed in each cluster. The coherence score helped distinguish topics that were semantically interpretable from topics that were simple artifacts of statistical inference. The score ranges from zero to one, and the higher the score, the better the model should be.
The clusters found were evaluated through visualizations created with WordClouds (to visualize the most important words) and a semantic space using pyLDAvis (pyLDAvis, 2018). For the latter, the clusters were plotted onto a semantic space where two words in the same lexical field, or synonyms, were correlated and thus "close" to each other in the space. The larger the topic cluster, the more conversations actors had about that topic. The farther apart the clusters (and thus words) were, the more these clusters had their own vocabularies. Overlapping clusters had similar vocabularies. A model with no overlapping clusters was therefore considered good. The best model selected had the highest coherence score and the best visual representation, with far-apart clusters.
After training various models with a different number of topics (k), investigating the coherence scores, as shown in Fig. 1, and inspecting the resulting clusters (with WordClouds and the semantic space representation of the topics), the most promising model was the one with k = 5 topics.

Topic distribution
The five topics span each actor's corpus with different weights, as each actor can be represented as a mixture of the identified topics: topic 1 may represent 100% of Actor A's corpus but only 60% of Actor B's corpus. For example, Table 4 shows how the five topics are distributed in Actor 112's corpus and Actor 83's corpus. In this example, Actor 112's discussions revolve clearly around topic 1 whereas Actor 83's discussions revolve around the five topics.
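The idea of an actor as a mixture of topics, with one topic possibly dominating, can be illustrated with a short sketch. The weights below are hypothetical values chosen for illustration and are not taken from Table 4; the helper name `dominant_topic` is likewise an assumption.

```python
TOPICS = [
    "Business",
    "Technical",
    "Internal tasking/Management",
    "Malware",
    "Customer Service/Problem Solving",
]


def dominant_topic(weights):
    """Return (topic name, weight) for the largest entry of a topic mixture."""
    if len(weights) != len(TOPICS):
        raise ValueError("expected one weight per topic")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("topic weights must sum to 1")
    idx = max(range(len(weights)), key=weights.__getitem__)
    return TOPICS[idx], weights[idx]


# Hypothetical mixtures: one focused actor and one all-rounder.
focused_actor = [0.97, 0.01, 0.01, 0.005, 0.005]
mixed_actor = [0.15, 0.25, 0.30, 0.10, 0.20]

focused_result = dominant_topic(focused_actor)
mixed_result = dominant_topic(mixed_actor)
```

The focused actor is almost entirely described by one topic, whereas the all-rounder has a dominant topic that covers less than a third of the corpus.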

Topic interpretation
The LDA model gives topics that are composed of a word list, often appearing together within chats.It is the researcher's role to make sense of these topics by giving them a theme or a name based on what they are made of.To do so, we went over the words in the five clusters, interpreting their meaning.We also took the main actors in each cluster (those whose corpus was mainly related to a topic) and read their discussions to have contextual information around the words.The interpretation of the topics is presented below, along with how the topics are distributed among actors.

Comparing the study results
Finally, to compare the results of the study, we went through summaries of qualitative analyses conducted by security researchers. We found four relevant blog articles, by CheckPoint (CheckPoint Research, 2022), KrebsonSecurity (Krebs, 2022), Cyberint (Shriebman, 2022), and Trellix (Fokker and Tologonov, 2022), that conducted a qualitative analysis of the Conti chat logs to paint a picture of the organization. Each blog article attempts to uncover the roles and importance of each member, providing a description of a few actors identified as key.
From these documents, we extracted the role attributed to those well-known actors and compared them with the topic distribution found in this study.

Ethical considerations
The study has been approved by the ethics committee at the University of Montreal (project N.2023-4659) under minimal risks. The study required asking for a waiver of consent in line with Article 5.5A of the Canadian Tri-Council Policy Statement on Research Ethics. To ensure participants' confidentiality and privacy, the real pseudonyms of the actors are not displayed throughout the text.

Results
The best model included five topics that encompassed actors' discussions. The interpretation of the topics is presented below, followed by how they are distributed among actors' corpus. We then compare the results of this study with previous qualitative research conducted on the role of some of these actors.

From business to tech topics
The five topics that span actors' corpus are: (1) Business, (2) Technical, (3) Internal tasking/Management, (4) Malware, and (5) Customer Service/Problem Solving. Each topic is accompanied by an excerpt of a discussion from an actor's corpus whose main topic is the one being presented. These excerpts are for illustrative purposes only and do not reflect the format of the actor's corpus provided to the algorithm nor the full range of discussions found within the actor's corpus.

Business topic
The first topic encompassed discussions regarding planning and internal tasking within a project. Actor 118, Actor 112 and Actor 23 were often quoted within chats to repeat what was said or ordered. The topic included words like build, office, task, and report, referring to some sort of task management. Words like system, hacker, coder, and software were also included, referring to employees and their work tools. Actors with the first topic as their dominant topic could be seen as "higher-ups" or as participating in the management activities of the Conti organization.
Here is an excerpt of a discussion from Actor 118, whose main topic is Business: "This is an important task, then let's build a system for it [...]. I suggest that you allocate people and build a system that will analyze and report information from these office-based documents, [...] prepare reports by sector, the main department will prepare attacks [...]."

Technical topic
The second topic revolved around technical talks and developing technical projects. The vocabulary of this topic was very much focused on computer science, including words like version, command, module, program, function, system, window. Some other words were even more specific and denoted an attack vector or part of it: script, loader, backdoor and .exe. Actors with a tendency towards this topic could be taking part in delivering attacks. Here are excerpts from the corpus of Actor 86 and Actor 54, whose main topic is Technical: "When an error occurs during process hollowing creation, do you send an error code to the server? [...]" and "I tried to shift the .exe file image in the process address space (i.e. to modify the process hollowing) and to write it to an arbitrary address, but this didn't work."

Internal tasking/management topic
The third topic was the only one without any computer science or technical words in it. The core of this topic was about human resources, management, and salaries. The topic included words like salary, people, money, email, network, talk, team, buy, month, touch, company, blog and offer. The words onion and protonmail_com were also there, which are both domains used to communicate or add actors to different channels. Actors with a high percentage of correspondence to this topic may have been involved in human resources, internal tasking and management tasks.
Here is an example of a discussion from Actor 124's corpus, whose main topic is Internal tasking/Management: "I'll help you when you get your salary. Add to your contacts Actor 101, this is your team leader. [...] salary pay 2 times a month to your bank card. [...] workday 10-11 to 7:20 p.m, but it's best to discuss this with your supervisor [...]."

Malware topic
The fourth topic was directed toward one type of attack vector: malware and/or ransomware. Many of the words that made up this topic alluded to the injection or implementation of the malware as well as stratagems to avoid detection: DLL (refers to DLL hijacking), detect, crypto, crypt, loader and pour (term used as a synonym of launch/inject). An actor with the fourth topic as its main topic was likely taking part in the conception of malware as an attack vector.
Here are excerpts of discussions from Actor 11 and Actor 85, whose main topic is Malware: "As long as it is through rundll32 and dll pathmake [...] with pdf icon. [...] I've run a new version of loader [...]" and "Don't crypt [encrypt files] if you're going to, I'll be pouring in new files soon."

Customer service/problem-solving topic
The fifth and last topic appeared to be a bit blurrier, including two subtopics. The first revolved around customer service, with words like order, payment, client and receive. The second subtopic related to what seemed to be attack assistance or problem-solving, with words like log (i.e., record of the events), error, module, proxy and IP. Actors with this topic as their main topic would represent actors who solved problems while also dealing with clients.
Here is a quote from Actor 36's corpus, whose main topic is Customer Service/Problem-solving: "if a lib [library]

Multifaceted discussions of Conti actors
Figure 2 displays the distribution of topics for each actor through a stacked bar graph. The colored brackets roughly indicate where the prevalence of a topic is high across the actors' corpus. The figure shows that only a small number of actors (including Actor 112, Actor 118, Actor 86, Actor 94, Actor 11, Actor 85, Actor 126, and Actor 71) have discussions that centered around a single topic. For these actors, the stacked bar is almost monochrome, meaning that their discussions were almost entirely focused on a single topic. By contrast, the stacked bar of the rest of the studied actors is a mixture of multiple topics, illustrating the diverse, all-rounder discussions that most actors had.
Figure 2 also shows that the Business [red] and the Malware [green] topics are the rarest ones in members' discussions. Moreover, the number of actors whose corpus specializes in one of these two topics is small, including Actor 112 and Actor 118 for the Business topic, as well as Actor 11 and Actor 85 for the Malware topic.
In the same fashion, the Customer Service/Problem Solving [light blue] and Technical [dark blue] topics are spread among actors, with a few of them having their discussions centered specifically on one of these two topics.
On the other hand, the Internal tasking/Management topic [pink] is widely spread among actors. This topic is present in almost every actor's corpus and accounts for a moderate to high part of actors' discussions. Like the Business topic, it is not technical; it included discussions on human resources, management, and salaries. This result illustrates the intensive non-technical aspect of RaaS operations, which seemed to take up time and effort for a large proportion of Conti actors.
Finally, out of 137 actors, six had specialized discussions, with 95% of their discussions revolving around a single topic. Table 5 shows the six actors and the topic they specialized in. In short, the discussions of Actor 118 and Actor 112 were mainly about Business, Actor 11 focused on Malware, Actor 86 on Technical, and Actor 126 and Actor 71 on Customer Service/Problem Solving.
All in all, this means that only 4.38% of the studied actors were specialized in a single topic, whereas 95.62% were all-rounders, with a corpus of discussion revolving around the five topics.
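The percentages above follow directly from the counts (6 specialized actors out of 137). The sketch below reproduces that arithmetic and shows one possible way to flag specialized actors from their topic mixtures. The function name, the toy mixtures, and the reading of the 95% criterion as "largest topic weight of at least 0.95" are assumptions made for illustration.

```python
def specialization_share(mixtures, threshold=0.95):
    """Return (specialized actors, their share in percent).

    Assumption: an actor counts as specialized when the largest topic
    weight in the mixture is at least `threshold`.
    """
    specialists = [actor for actor, w in mixtures.items() if max(w) >= threshold]
    share = 100 * len(specialists) / len(mixtures)
    return specialists, share


# Hypothetical mixtures for four actors (not values from the study).
toy_mixtures = {
    "a": [0.96, 0.01, 0.01, 0.01, 0.01],
    "b": [0.30, 0.20, 0.20, 0.20, 0.10],
    "c": [0.10, 0.40, 0.20, 0.20, 0.10],
    "d": [0.25, 0.25, 0.20, 0.20, 0.10],
}
toy_specialists, toy_share = specialization_share(toy_mixtures)

# Arithmetic behind the reported figures: 6 of 137 actors are specialized.
paper_specialized = round(100 * 6 / 137, 2)
paper_all_rounders = round(100 * (137 - 6) / 137, 2)
```

Rounded to two decimals, the two shares computed from the counts match the 4.38% and 95.62% reported above.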

Topic distribution of well-known actors
This section compares the results obtained using machine learning with those obtained by external sources, in which humans read the chat logs, to assess whether the model's output is coherent with human judgment.
To compare the results of this study, we went through previously published blogs in which the Conti chats were analyzed qualitatively and extracted the role of well-known actors according to these sources. Table 6 presents the role assigned to well-known actors by external researchers and their distribution of topics based on the results of this analysis. To facilitate the analysis, we focus on their dominant topic, meaning the topic with the highest percentage in the actor's corpus.
As shown in Table 6, the two actors whose dominant topic is Business are Actor 112 and Actor 118. They were both interpreted as being the organization's bosses in other blogs. Hence, talking about business is related to being at a high level in the organization.
Three individuals (Actor 55, Actor 65, and Actor 94) were interpreted as either penetration testers, coders, or hackers by previous researchers. In our study, their dominant topic was the Technical topic, which relates to coding, testing, and hacking. Our results are thus consistent with previous research.
Five actors were interpreted as managers with various specializations (see Table 6) in previous analyses. In our analysis, the dominant topic of these actors was Internal tasking/Management. This result is also consistent, as it shows how managers, regardless of their specialization, are involved in internal and management tasks. On the other hand, three actors (Actor 85, Actor 132, and Actor 11) were interpreted as managers of technical teams in previous external analyses while, in our analysis, Malware is their dominant topic. These managers may thus have been more technical, hands-on managers. Note that Actor 132 and Actor 85 are a pair in this table because they were referred to as being the same actor with two different pseudonyms (CheckPoint Research, 2022).
Finally, two actors (Actor 23 and Actor 36) had Customer Service/Problem Solving as their dominant topic. One was interpreted as a technical manager responsible for coders in other blogs. The other was interpreted as a manager/Chief Operating Officer. These two roles align with having a high prevalence of the Customer Service/Problem Solving topic.
While our results align with those from external sources, there are also some discrepancies. For instance, as shown in Table 6, the focus of a Conti Chief Operating Officer (COO) appears to be primarily on customer service and problem-solving (Actor 36). However, this COO was also classified as a "manager" by another source, showing discrepancies in role assignments from external sources. This is because assigning roles to individuals based on

Table 6 Roles of well-known actors and their topic distribution
Table 6 presents a comparative analysis of well-known actors' dominant topic found in our machine learning model and the roles reported in existing qualitative analysis by CheckPoint (CheckPoint Research, 2022), KrebsonSecurity (Krebs, 2022), Cyberint (Shriebman, 2022), and Trellix (Fokker and Tologonov, 2022). This comparison does not aim to establish the absolute truth regarding the actors' roles. It aims to assess the coherence of the results of this study, given other qualitative studies on the topic. It also shows the level of agreement and potential discrepancies between automated and human analysis methods.

Discussion
The results obtained are in line with large cybercrime organizations being organized similarly to firms (Lusthaus, 2018). This is highlighted by the three discussion points below: (1) the importance of non-tech talks, (2) culprit of specialization, yet diverse discussions, and (3) higher-ups are business-focused. The study results also corroborate key findings highlighted in previous qualitative research on the Conti RaaS operator (Fokker and Tologonov, 2022; Cimpanu, 2020; CheckPoint Research, 2022; Krebs, 2022; Kovacs, 2022). However, note that Conti is one of the biggest RaaS operators, and this finding may therefore be an outlier. Whether a RaaS operator becomes organized in this way probably depends on its size and scope as well as its success. Where the members of a RaaS operator are located may also affect its structure, as places where the risk of arrest is low may facilitate the development of structured criminal organizations (Lusthaus, 2018; Lusthaus and Varese, 2021). Further research should investigate other cybercrime organizations to see what influences their structure.
The importance of non-tech talks
The results of the study illustrate that a large proportion of discussions are non-technical and that such discussion topics span almost all Conti members. Non-tech talks encompass the Business and the Internal tasking/Management topics, while focused tech talks encompass the Malware and the Technical topics. The fifth topic, Customer Service/Problem Solving, includes both. Merging non-tech talks and tech talks shows that, on average, 44.2% (std=21.9) of actors' corpus involved non-tech talks, while 31.8% (std=23.0) involved tech talks. The Customer Service/Problem Solving topic formed, on average, 24% (std=17.6) of actors' corpus. These results show that Conti's daily operations required a lot of organization beyond writing malicious code to compromise networks.
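The merging step described above can be illustrated with a minimal sketch. The per-actor topic shares below are hypothetical values chosen for illustration, not data from the study, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

# Hypothetical per-actor topic shares (each row sums to 1); NOT data from the study.
ACTOR_TOPIC_SHARES = {
    "actor_a": {"Business": 0.40, "Internal tasking/Management": 0.20,
                "Technical": 0.10, "Malware": 0.10,
                "Customer Service/Problem Solving": 0.20},
    "actor_b": {"Business": 0.10, "Internal tasking/Management": 0.10,
                "Technical": 0.30, "Malware": 0.30,
                "Customer Service/Problem Solving": 0.20},
    "actor_c": {"Business": 0.25, "Internal tasking/Management": 0.25,
                "Technical": 0.15, "Malware": 0.05,
                "Customer Service/Problem Solving": 0.30},
}

# Topic grouping as described in the text.
GROUPS = {
    "non-tech": ("Business", "Internal tasking/Management"),
    "tech": ("Technical", "Malware"),
    "mixed": ("Customer Service/Problem Solving",),
}


def group_percentages(shares_by_actor, topics):
    """Return, for each actor, the percentage of the corpus covered by `topics`."""
    return [100 * sum(shares[t] for t in topics)
            for shares in shares_by_actor.values()]


def summarize(shares_by_actor, groups):
    """Mean and sample standard deviation of each group's percentage across actors."""
    summary = {}
    for name, topics in groups.items():
        values = group_percentages(shares_by_actor, topics)
        summary[name] = (mean(values), stdev(values))
    return summary


SUMMARY = summarize(ACTOR_TOPIC_SHARES, GROUPS)
for name, (avg, sd) in SUMMARY.items():
    print(f"{name}: mean={avg:.1f}% (std={sd:.1f})")
```

Because every actor's shares sum to 1, the three group means always add up to 100%, which matches the averages reported above (44.2% + 31.8% + 24%).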
Culprit of specialization, yet diverse discussions
The results of the study also highlight that only a few members had a corpus that mainly represented a single topic. On the other hand, most actors in the dataset were diverse in their discussion topics: they mixed Customer Service/Problem Solving with Internal tasking/Management as well as Business, Technical, and Malware discussions. Hence, Conti's staff needed to work across multiple fields and have expertise in various areas. Moreover, as shown in Table 6, some of Conti's managers discussed Customer Service/Problem Solving, while others were more specialized, discussing Technical or Malware topics more. Hence, some managers no longer talked as much about technical subjects, focusing instead on managing their team and dealing with customers. These different management roles were also noted in another blog (CheckPoint Research, 2022). Hence, although such a RaaS operator represents the culprit of specialization in the cybercrime industry (Salvi, 2019; Meland et al., 2020; Maurya et al., 2018; Alwashali et al., 2021), the bulk of its members appeared to have non-tech and diverse discussions. Such discussions are likely required to coordinate the economic activities of a large criminal organization.
Higher-ups are business-focused
According to previous blogs (CheckPoint Research, 2022; Krebs, 2022), Conti higher-ups were always trying to find new ways to expand the firm's operations and generate more profit. Some of them even followed corporate tradition and held yearly performance reviews, discussing employees' efficiency and deliberating on the employee of the month. The discussions of two actors, Actor 112 and Actor 118, revolved almost solely around the Business topic. As shown in Table 6, Actor 112 and Actor 118 were identified as "Big Boss" and "Effective head of office operations", and 99% of each actor's discussions revolved around the Business topic. This finding supports the claim that Conti was indeed an organized firm with leaders constantly seeking fresh approaches to grow the company's activities and generate greater profits.

Study limitations and future research
A first limitation of this study lies in the dataset, as only the Jabber chat logs were used, while the whole leak also included the Rocket.Chat logs. To address this limitation, further studies could use the Rocket.Chat logs or combine them with the Jabber ones to investigate whether the findings of this study hold with this additional corpus. Moreover, the original messages were written in Russian, and the translations were therefore likely limited by the use of Russian slang or abbreviations. Part of the meaning or nuance of a sentence may have been altered or lost through translation. Interpretation and reuse of the results must take this limitation into account. Consequently, it would also be interesting to carry out this research using the original chat logs in Russian. Using the original chat logs would preserve all the meaning present in the data and could provide additional material and nuance the results.
Another limitation lies in the interpretation of the results. This study did not consider the size of the corpus, the timeline of the chats, or the "member status" of the individuals. First, the corpus size may influence the topic distribution, as individuals who discuss more may be more inclined to have generalist talks. Further studies should investigate how topic distribution influences the types of discussions in which individuals engage. Second, actors' experience in the organization was not considered, limiting the interpretation of the results. For example, individuals who had just arrived in the organization may have been more involved in specific discussion topics, such as human resources, due to their newcomer status. More qualitative research focusing on the timeline of each actor could dive deeper into the data and analyze how an actor changes over time. Third, this research does not consider the official status of the studied actors. Some actors are official members, whereas others could be affiliates or even customers. Further studies should examine whether topic distributions vary when the status of the actors studied is considered.
A final limitation lies in the use of LDA models: such a model cannot capture contextual information, as it only considers the frequency of words in a corpus. To overcome this, the conversations of actors with a given topic as their dominant one were read and interpreted, thus providing contextual information around that topic. Further studies could deepen this analysis by conducting a qualitative thematic analysis of the conversations and comparing the results with this study.
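The loss of contextual information mentioned above can be illustrated with a minimal sketch of a bag-of-words representation, the kind of word-count input that LDA typically operates on. The two messages are hypothetical, and the whitespace tokenization is a simplification that does not reproduce the preprocessing used in this study.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())


# Two hypothetical messages with the same words but different meanings.
msg_a = "the client paid but the admin did not reply"
msg_b = "the admin paid but the client did not reply"

# Both messages map to the same word counts, so a model that only
# sees these counts cannot tell them apart.
same_counts = bag_of_words(msg_a) == bag_of_words(msg_b)
print(same_counts)  # True
```

This is why reading the underlying conversations, as done here, is needed to recover who did what in a given exchange.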

Conclusion
Leveraging the Conti chat leaks, this study uses machine learning algorithms to uncover insights into the organization of the Conti RaaS operator. The study shows that the discussions of the large RaaS operator Conti revolved around five topics: (1) Business, (2) Technical, (3) Internal tasking/Management, (4) Malware, and (5) Customer Service/Problem Solving. Moreover, the topic distribution illustrates that only a few actors had specialized discussions in one topic, while the rest were all-rounders. The results corroborate that large cybercrime organizations are organized similarly to firms (Lusthaus, 2018). This is highlighted by the importance of non-tech talks in the chats, the diverse discussion topics (although the organization represents the culprit of specialization), the varied management styles of actors, and the fact that higher-ups, specifically the two bosses, were business-focused in their discussions. Finally, this study illustrates how to automatically extract actionable information on the organization of a sophisticated cybercrime organization.

Fig. 1
Coherence score per number of topics k

Table 1
Structure of the data after aggregation per actor

Table 2
Chat logs

Table 3
Post-processing descriptive statistics
crashes, it means the client [affiliates] isn't sending what the lib is expecting [...] so the http parser crashes. You should give specifications to those who write to clients, what this lib can and cannot do. This is an industrial solution... and a lot of people use it.".

Table 5
Specialized actors with percentage of dominant topic in their corpus