Posts by Collection

publications

Recommendation for product configuration: an experimental evaluation (Best paper award)

Published in 18th International Configuration Workshop, 2016

The present work deals with the the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. From this point, two directions have been explored. The first one is to select a set of similar configurations in the sale history (typically,the k closest ones, using a distance measure) and to compute the best recommendation from this set - this is the line proposed by [9]. The second one, that we propose here, is to learn a Bayesian network from the entire sample as model of the users’ preferences, and to use it to recommend a pertinent value.

Recommended citation: Fargier, H., Gimenez, P. F., & Mengin, J. (2016, September). Recommendation for product configuration: an experimental evaluation. In 18th International Configuration Workshop (CWS 2016) within CP 2016: 22nd International Conference on Principles and Practice of Constraint Programming (pp. pp-9). https://hal.archives-ouvertes.fr/hal-01445239/file/fargier_17217.pdf

Learning lexicographic preference trees from positive examples

Published in AAAI Conference on Artificial Intelligence, 2018

This paper considers the task of learning the preferences of users on a combinatorial set of alternatives, as it can be the case for example with online configurators. In many settings, what is available to the learner is a set of positive examples of alternatives that have been selected during past interactions. We propose to learn a model of the users’ preferences that ranks previously chosen alternatives as high as possible. In this paper, we study the particular task of learning conditional lexicographic preferences. We present an algorithm to learn several classes of lexicographic preference trees, prove convergence properties of the algorithm, and experiment on both synthetic data and on a real-world bench in the domain of recommendation in interactive configuration.

Recommended citation: Fargier, H., Gimenez, P. F., & Mengin, J. (2018, April). Learning lexicographic preference trees from positive examples. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1). https://ojs.aaai.org/index.php/AAAI/article/download/11808/11667

Apprentissage de préférences en espace combinatoire et application à la recommandation en configuration interactive

Published in Université Paul Sabatier - Toulouse III, 2018

The analysis and the exploitation of preferences occur in multiple domains, such as economics, humanities and psychology. E-commerce got interested in the subject a few years ago with the surge of product personalisation. Our study deals with the representation and the learning of preferences on objects described by a set of attributes. These combinatorial spaces are huge, which makes the representation of an ordering in extenso intractable. That’s why preference representation languages have been built: they can represent preferences compactly on these huge spaces. In this dissertation, we study preference representation languages and preference learning.

Recommended citation: Pierre-François Gimenez. Apprentissage de préférences en espace combinatoire et application à la recommandation en configuration interactive. Intelligence artificielle [cs.AI]. Université Paul Sabatier - Toulouse III, 2018. Français. ⟨NNT : 2018TOU30182⟩. ⟨tel-02303275⟩ https://www.theses.fr/2018TOU30182

Experimental evaluation of three value recommendation methods in interactive configuration

Published in Journal of Universal Computer Science, 2020

The present work deals with the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sales histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. From this point, two directions have been explored. The first one is to select a set of similar configurations in the sales history (typically, the k closest ones, using a distance measure) and to compute the best recommendation from this set – this is the line proposed by [Coster et al., 2002]. The second one, that we propose here, is to learn a model from the entire sample as representation of the users’ preferences, and to use it to recommend a pertinent value; three families of models are experimented: the Bayesian networks, the naive Bayesian networks and the lexicographic preferences trees.

Recommended citation: H Fargier, PF Gimenez, J Mengin - Journal of Universal Computer Science, 2020 https://www.jucs.org/jucs_26_3/experimental_evaluation_of_three.html

On-board Diagnosis: A First Step from Detection to Prevention of Intrusions on Avionics Applications

Published in IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), 2020

Nowadays, air travel is one of the safest transportation means. While safety is historically well integrated into avionics systems, it is becoming increasingly important to take into account the security of such systems for the future. In particular, Host-based Intrusion Detection Systems (HIDS) are commonly used in traditional information systems to improve their security. The adaptation of such systems for deployment inside an aircraft has been studied in another work and has shown to be effective in detecting anomalous behavior in an avionic application. However, the detection itself is not sufficient to provide an on-board reaction, and to prevent such intrusion. This paper proposes to improve such HIDS by introducing a signature-based system capable of providing a first diagnosis after the detection of an anomalous behavior. The proposed diagnosis approach is based on the definition of the signature of an alert, and its comparison with a knowledge database that is regularly updated throughout aircraft lifetime. This approach has been implemented on a real avionic computer and yielded good results in terms of classification accuracy and resources consumption.

Recommended citation: Damien, A., Gimenez, P. F., Feyt, N., Nicomette, V., Kaâniche, M., & Alata, E. (2020, October). On-board Diagnosis: A First Step from Detection to Prevention of Intrusions on Avionics Applications. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE) (pp. 358-368). IEEE. https://ieeexplore.ieee.org/document/9251064

Hardware-Performance-Counters-based anomaly detection in massively deployed smart industrial devices

Published in IEEE 19th International Symposium on Network Computing and Applications (NCA), 2020

Energy providers are massively deploying devices to manage distributed resources or equipment. These devices are used for example to manage the energy of smart factories efficiently or to monitor the infrastructure of smart-grids. By design, they typically exhibit homogeneous behavior, with similar software and hardware architecture. Unfortunately, these devices are also of interest to attackers aiming to develop botnets or compromise companies’ security. This paper presents a new protection approach based on Hardware Performance Counters (HPC) to detect anomalies in massively deployed devices. These HPC are processed using outlier detection algorithms. Compared to existing solutions, we propose a lightweight approach based on a comparative analysis of devices’ HPC without relying on the modeling of the software applications running on the devices. To assess the relevance and the effectiveness of the approach, a thorough experimental analysis is carried out in a representative industrial-type environment, sampling the data from 100 Raspberry Pi to simulate about 10,000 devices deployed simultaneously. The results show high detection and performance efficiency under different software profiles and attack payloads. Moreover, the calibration of the approach depends primarily on the hardware rather than the application software running on the devices. It should ease its deployment in an operational environment.

Recommended citation: Bourdon, M., Gimenez, P. F., Alata, E., Kaaniche, M., Migliore, V., Nicomette, V., & Laarouchi, Y. (2020, November). Hardware-Performance-Counters-based anomaly detection in massively deployed smart industrial devices. In 2020 IEEE 19th International Symposium on Network Computing and Applications (NCA) (pp. 1-8). IEEE. https://ieeexplore.ieee.org/document/9306726

RIDS: Radio intrusion detection and diagnosis system for wireless communications in smart environment

Published in ACM Transactions on Cyber-Physical Systems, 2021

The expansion of the Internet-of-Things (IoT) market is visible in homes, factories, public places, and smart cities. While the massive deployment of connected devices offers opportunities to improve quality of life and to develop new services, the impact of such devices on the security of the users in a context where the level of malicious threat continues to increase is a major concern. One of the challenges is the heterogeneity and constant evolution of wireless technologies and protocols used. To overcome this problem, we propose RIDS, a Radio Intrusion Detection System that is based on the monitoring and profiling of radio communications at the physical layer level using autoencoder neural networks. RIDS is independent of the wireless protocols and modulation technologies used. Besides, it is designed to provide a threefold diagnosis of the detected anomalies: temporal (start and end date of the detected anomaly), frequential (main frequency of the anomaly), and spatial (location of the origin of the anomaly). To demonstrate the relevance and the efficiency of our approach, we collected a large dataset of radio-communications recorded with three different probes deployed in an experimental room. Multiple real-world attacks involving a wide variety of communication technologies are also injected to assess the detection and diagnosis efficiency. The results demonstrate the efficiency of RIDS in detecting and diagnosing anomalies that occurred in the 400–500 Mhz and 800–900 Mhz frequency bands. It is noteworthy that compromised devices and attacks using these communication bands are generally not easily covered by traditional solutions.

Recommended citation: Gimenez, P. F., Roux, J., Alata, E., Auriol, G., Kaâniche, M., & Nicomette, V. (2021). RIDS: Radio intrusion detection and diagnosis system for wireless communications in smart environment. ACM Transactions on Cyber-Physical Systems, 5(3), 1-1. https://dl.acm.org/doi/10.1145/3441458

GUI-Mimic, a cross-platform recorder and fuzzer of Graphical User Interface

Published in 9th GreHack Conference, 2021

In program analysis, a fuzzing toolset is needed to automatically trigger software operations in a natural while efficient way. Especially in dynamic analysis of malware, such a toolset can help execute the suspicious files to unveil their malicious payloads hidden by other benign-looking behaviors. In the fields of software testing, this tool is necessary for triggering and testing the programmed functionalities. Nevertheless, there has not yet been an easy-to-use tool that works on Windows for the purpose of generating activity through the Graphical User Interface (GUI). To meet this requirement, in our work, we develop GUI-Mimic. It is designed to integrate some useful features for stimulating different types of software – mouse and keyboard recording, random mouse and keyboard inputs, editing, trimming, randomization, transformations – to deliver an easy-to-deploy GUI fuzzer over different Operating Systems.

Recommended citation: Raulin, V., Gimenez, P. F., Han, Y., Viet Triem Tong, V., Ouairy, L. (2021). GUI-Mimic, a cross-platform recorder and fuzzer of Graphical User Interface. 9th GreHack Conference https://hal.archives-ouvertes.fr/hal-03449827/document

Towards generic quality assessment of synthetic traffic for evaluating intrusion detection systems

Published in RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2022

Network Intrusion Detection Systems (NIDSes) evaluation requires background traffic. However, real background traffic is hard to collect. We hence rely on synthetic traffic generated especially for this task. The quality of the generated traffic has to be evaluated according to some clearly defined criteria. In this paper, we show how to adapt the quality assessment solutions proposed for different fields of data generation such as image or text generation to network traffic. We summarize our study by discussing the criteria that evaluate the quality of a generated network traffic and by proposing functions to evaluate these criteria. This is the first contribution in the context of the Ph.D. thesis of Adrien Schoen.

Recommended citation: Schoen, A., Blanc, G., Gimenez, P. F., Han, Y., Majorczyk, F., & Mé, L. (2022, May). Towards generic quality assessment of synthetic traffic for evaluating intrusion detection systems. In RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information. https://hal.archives-ouvertes.fr/hal-03675359/document

Detecting APT through graph anomaly detection

Published in RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2022

Despite fruitful achievements made by unsupervised machine learning-based anomaly detection for network intrusion detection systems, they are still prone to the issue of high false alarm rates, and it is still difficult to reach very high recalls. In 2020, Leichtnam et al. proposed Sec2graph, an unsupervised approach applied to security objects graphs that exhibited interesting results on single-step attacks. The graph representation and the embedding allowed for better detection since it creates qualitative features. In this paper, we present new experiments to assess the performances of this approach for detecting APT attacks. We achieve better detection performances than the original work’s baseline detection methods on the DAPT2020 dataset. This work is realised in the context of the Ph.D. thesis of Maxime Lanvin, which started in October 2021.

Recommended citation: Lanvin, M., Gimenez, P. F., Han, Y., Majorczyk, F., Mé, L., & Totel, É. (2022, May). Detecting APT through graph anomaly detection. In RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information. https://hal.archives-ouvertes.fr/hal-03675346/document

Towards a Representation of Malware Execution Traces for Experts and Machine Learning

Published in RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2022

Dynamic analysis is a common technique to analyze the run-time behavior of software and identify malware (malicious software). Execution traces typically contain the list of system calls with their parameters, the list of accessed files, etc. Several representations have been proposed to organize these data better and help both human experts and automated tools analyze them effectively. This paper reviews these representations and identifies four research problems that the first author plans to investigate during his Ph.D.

Recommended citation: Raulin, V., Gimenez, P. F., Han, Y., & Tong, V. V. T. (2022). Towards a Representation of Malware Execution Traces for Experts and Machine Learning. RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information. https://hal.archives-ouvertes.fr/hal-03675366/document

Debiasing Android Malware Datasets: How can I trust your results if your dataset is biased?

Published in IEEE Transactions on Information Forensics and Security, 2022

Android security has received a lot of attention over the last decade, especially malware investigation. Researchers attempt to highlight applications’ security-relevant characteristics to better understand malware and effectively distinguish malware from benign applications. The accuracy and the completeness of their proposals are evaluated experimentally on malware and goodware datasets. Thus, the quality of these datasets is of critical importance: if the datasets are outdated or not representative of the studied population, the conclusions may be flawed. We specify different types of experimental scenarios. Some of them require unlabeled but representative datasets of the entire population. Others require datasets labeled with valuable characteristics that may be difficult to compute, such as malware datasets. We discuss the irregularities of datasets used in experiments, questioning the validity of the performances reported in the literature. This article focuses on providing guidelines for designing debiased datasets. First, we propose guidelines for building representative datasets from unlabeled ones. Second, we propose and experiment a debiasing algorithm that, given a biased labeled dataset and a target representative dataset, builds a representative and labeled dataset. Finally, from the previous debiased datasets, we produce datasets for experiments on Android malware detection or classification with machine learning algorithms. Experiments show that debiased datasets perfom better when classifying with machine learning algorithms.

Recommended citation: Miranda, T. C., Gimenez, P. F., Lalande, J. F., Tong, V. V. T., & Wilke, P. (2022). Debiasing Android Malware Datasets: How can I trust your results if your dataset is biased?. IEEE Transactions on Information Forensics and Security. https://ieeexplore.ieee.org/abstract/document/9787514

The complexity of unsupervised learning of lexicographic preferences

Published in M-PREF 2022 13th Multidisciplinary Workshop on Advances in Preference Handling, 2022

This paper considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees.

Recommended citation: Fargier, H., Gimenez, P. F., Mengin, J., & Le Nguyen, B. N. (2022, July). The complexity of unsupervised learning of lexicographic preferences. In 13th Multidisciplinary Workshop on Advances in Preference Handling (M-pref 2022)@ IJCAI 2022 https://hal.inria.fr/hal-03784693/document

Explainable artificial intelligence for cybersecurity: a literature survey

Published in Annals of Telecommunications, 2022

Artificial intelligence (AI) and machine learning (ML) are increasingly becoming essential in the development of cybersecurity solutions, with deep learning (DL) algorithms being extensively applied in recent years, e.g., for detecting Android malware or vulnerable source code. However, sharing the same fundamental limitation with other DL application domains, such as computer vision and natural language processing, AI-based cybersecurity solutions lack the capability of justifying the results (ranging from detection and prediction to reasoning and decision-making) and making them human-understandable. As a result, explainable AI (XAI) has emerged as a paramount topic addressing the related challenges of making AI models explainable or interpretable to human users. It is particularly relevant in the cybersecurity domain, in that XAI may allow security operators, who are overwhelmed with tens of thousands of security alerts per day (most of which are false positives) per day, to better assess the potential threats and reduce alert fatigue. With such a strong motivation, we conduct an extensive literature review on the intersection between XAI and cybersecurity. Particularly, we investigate the academic literature from two perspectives: the applications of XAI to cybersecurity (e.g., intrusion detection, malware classification, etc.) and the application of cybersecurity to XAI (e.g., attacks on XAI pipelines, potential countermeasures, etc.). We characterize the security of XAI with several security properties that have been discussed in the literature. We also formulate open questions that are either left out or not properly addressed in the literature and provide tentative answers.

Recommended citation: Fabien Charmet, Harry Chandra Tanuwidjaja, Solayman Ayoubi, Pierre-François Gimenez, Yufei Han, Houda Jmila, Gregory Blanc, Takeshi Takahashi & Zonghua Zhang. Explainable artificial intelligence for cybersecurity: a literature survey. Ann. Telecommun. (2022). https://doi.org/10.1007/s12243-022-00926-7

Errors in the CICIDS2017 dataset and the significant differences in detection performances it makes

Published in CRiSIS 2022 17th International Conference on Risks and Security of Internet and Systems, 2022

Among the difficulties encountered in building datasets to evaluate intrusion detection tools, a tricky part is the process of labelling the events into malicious and benign classes. The labelling correctness is paramount for the quality of the evaluation of intrusion detection systems but is often considered as the ground truth by practitioners and is rarely verified. Another difficulty lies in the correct capture of the network packets. If it is not the case, the characteristics of the network flows generated from the capture could be modified and lead to false results. In this paper, we present several flaws we identified in the labelling of the CICIDS2017 dataset and in the traffic capture, such as packet misorder, packet duplication and attack that were performed but not correctly labelled. Finally, we assess the impact of these different corrections on the evaluation of supervised intrusion detection approaches.

Recommended citation: Lanvin, M., Gimenez, P. F., Han, Y., Majorczyk, F., Mé, L., & Totel, E. (2022, December). Errors in the CICIDS2017 dataset and the significant differences in detection performances it makes. In CRiSIS 2022-International Conference on Risks and Security of Internet and Systems. https://link.springer.com/chapter/10.1007/978-3-031-31108-6_2

BAGUETTE: Hunting for evidence of malicious behavior in dynamic analysis reports

Published in 20th International conference on security and cryptography SECRYPT 2023, 2023

Malware analysis consists of studying a sample of suspicious code to understand it and producing a representation or explanation of this code that can be used by a human expert or a clustering/classification/detection tool. The analysis can be static (only the code is studied) or dynamic (only the interaction between the code and its host during one or more executions is studied). The quality of the interpretation of a code and its later detection depends on the quality of the information contained in this representation. To date, many analyses produce voluminous reports that are difficult to handle quickly. In this article, we present BAGUETTE, a graph-based representation of the interactions of a sample and the resources offered by the host system during one execution. We explain how BAGUETTE helps automatically search for specific behaviors in a malware database and how it efficiently assists the expert in analyzing samples.

Recommended citation: Vincent Raulin, Pierre-François Gimenez, Yufei Han, Valérie Viet Triem Tong. BAGUETTE: Hunting for Evidence of Malicious Behavior in Dynamic Analysis Reports. 20th International conference on security and cryptography SECRYPT 2023, Jul 2023, Rome, Italy. https://hal.science/hal-04102144/

Conditionally Acyclic CO-Networks for Efficient Preferential Optimization

Published in 26th European Conference on Artificial Intelligence (ECAI 2023), 2023

This paper focuses on graphical models for modelling preferences in combinatorial space and their use for item optimisation. The preferential optimisation task seeks to find the preferred item containing some defined values, which is useful for many recommendation settings in e-commerce. We show that efficient (i.e., with polynomial time complexity) preferential optimisation is achieved with a subset of cyclic CP-nets called conditional acyclic CP-net. We also introduce a new graphical preference model, called Conditional-Optimality networks (CO-networks), that are more concise than conditional acyclic CP-nets and LP-trees but have the same expressiveness with respect to optimisation. Finally, we empirically show that preferential optimisation can be used for encoding alternatives into partial instantiations and vice versa, paving the way towards CO-nets and CP-nets unsupervised learning with the minimal description length (MDL) principle.

Recommended citation: Gimenez, Pierre-François & Mengin, Jérôme. (2023). Conditionally Acyclic CO-Networks for Efficient Preferential Optimization. 10.3233/FAIA230352. https://www.researchgate.net/publication/374310696

Towards Understanding Alerts raised by Semi-supervised Network Intrusion Detection Systems

Published in 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2023), 2023

The use of Machine Learning for anomaly detection in cyber security-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on semi-supervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our work identifies the most informative network traffic features associated with an anomaly alert, providing interpretations for the generated alerts. We conduct an empirical study using a large-scale network intrusion dataset, CICIDS2017, to compare the proposed AE-pvalues method with two state-of-the-art baselines applied in the semi-supervised anomaly detection task. Our experimental results show that the AE-pvalues method accurately identifies abnormal influential network traffic features. Furthermore, our study demonstrates that the explanation outputs can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures.

Recommended citation: Maxime Lanvin, Pierre-François Gimenez, Yufei Han, Frédéric Majorczyk, Ludovic Mé, et al.. Towards Understanding Alerts raised by Unsupervised Network Intrusion Detection Systems. The 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2023) https://dl.acm.org/doi/abs/10.1145/3607199.3607247

Survey on system-level graph-based and anomaly-based intrusion detection

Published in RESSI 2024-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2024

Intrusion Detection Systems (IDS) are tools for monitoring a system, in order to identify potential malicious activities within it. This survey presents an analysis of graph-based anomaly-based IDS at system level. We also present open issues regarding those IDS, and propose a taxonomy of suitable features to compare them.

Recommended citation: Dijoud, F., Gimenez, P. F., Hurfin, M., Majorczyk, F., & Pilastre, B. (2024, May). Survey on system-level graph-based and anomaly-based intrusion detection. In RESSI 2024-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information (pp. 1-2). https://inria.hal.science/hal-04714325/document

A Parser-Based Data Collector for Intrusion Detection

Published in RESSI 2024-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2024

Intrusion detection systems often struggle to identify attacks directed at applications. A contributing factor is the various syntactical forms these attacks can take. This paper introduces a methodology to design and adapt applicative data collectors (DCs) to software projects by integrating them into the application’s parsers. This data collector aims to enhance applications’ security by providing semantic information to intrusion detection mechanisms.

Recommended citation: Grégor Quétel, Eric Alata, Pierre-François Gimenez, Laurent Pautet, Thomas Robert. A Parser-Based Data Collector for Intrusion Detection. Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, May 2024, Eppe-Sauvage, France. https://hal.science/hal-04871463

A Tale of Two Methods: Unveiling the limitations of GAN and the Rise of Bayesian Networks for Synthetic Network Traffic Generation

Published in 9th International Workshop on Traffic Measurements for Cybersecurity (WTMC 2024), 2024

The evaluation of network intrusion detection systems requires a sufficient amount of mixed network traffic, i.e., composed of both malicious and legitimate flows. In particular, obtaining realistic legitimate traffic is hard. Synthetic network traffic is one of the tools to respond to insufficient or incomplete real-world datasets. In this paper, we only focus on synthetically generating high-quality legitimate traffic and we do not delve into malicious traffic generation. For this specific task, recent contributions make use of advanced machine learning-driven approaches, notably through Generative Adversarial Networks (GANs). However, evaluations of GAN-generated data often disregards pivotal attributes, such as protocol adherence. Our study addresses the gap by proposing a comprehensive set of metrics that assess the quality of synthetic legitimate network traffic. To illustrate the value of these metrics, we empirically compare advanced network-oriented GANs with a simple and yet effective probabilistic generative model, Bayesian Networks (BN). According to our proposed evaluation metrics, BN-based network traffic generation outperforms the state-of-the-art GAN-based opponents. In our study, BN yields substantially more realistic and useful synthetic benign traffic and minimizes the computational costs simultaneously.

Recommended citation: Schoen, A., Blanc, G., Gimenez, P. F., Han, Y., Majorczyk, F., & Mé, L. (2024). A Tale of Two Methods: Unveiling the limitations of GAN and the Rise of Bayesian Networks for Synthetic Network Traffic Generation. In Proceedings of the 9th International Workshop on Traffic Measurements for Cybersecurity (WTMC 2024). https://doi.org/10.1109/EuroSPW61312.2024.00036

Learning Conditional Preference Networks: an Approach Based on the Minimum Description Length Principle

Published in IJCAI International Joint Conference on Artificial Intelligence, 2024

CP-nets are a very expressive graphical model for representing preferences over combinatorial spaces. They are particularly well suited for settings where an important task is to compute the optimal completion of some partially specified alternative; this is, for instance, the case of interactive configurators, where preferences can be used at every step of the interaction to guide the decision maker towards a satisfactory configuration. Learning CP-nets is challenging when the input data has the form of pairwise comparisons between alternatives. Furthermore, this type of preference data is not commonly stored: it can be elicited but this puts an additional burden on the decision maker. In this article, we propose a new method for learning CP-nets from sales history, a kind of data readily available in many e-commerce applications. The approach is based on the minimum description length (MDL) principle. We show some theoretical properties of this learning task, namely its sample complexity and its NP-completeness, and we experiment with this learning algorithm in a recommendation setting with real sales history from a car maker.

Recommended citation: Gimenez, P. F., & Mengin, J. (2024). Learning Conditional Preference Networks: an Approach Based on the Minimum Description Length Principle. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). https://doi.org/10.24963/ijcai.2024/376

FlowChronicle: Synthetic Network Flow Generation through Pattern Set Mining

Published in 20th International Conference on emerging Networking EXperiments and Technologies (CoNEXT), 2024

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. Most methods generating synthetic network flows disregard the temporal dependencies between them, leading to unrealistic traffic. To address this issue, we introduce FlowChronicle, a novel synthetic network flow generation tool that relies on pattern mining and statistical models to preserve temporal dependencies. We empirically compare our method against state-of-the-art techniques on several criteria, namely realism, diversity, compliance, and novelty. This evaluation demonstrates the capability of FlowChronicle to achieve high-quality generation while significantly outperforming the other methods in preserving temporal dependencies between flows. Besides, in contrast to deep learning methods, the patterns identified by FlowChronicle are explainable, and experts can verify their soundness. Our work substantially advances synthetic network traffic generation, offering a method that enhances both the utility and trustworthiness of the generated network flows.

Recommended citation: Cüppers, J., Schoen, A., Blanc, G. & Gimenez, P. F., (2024, December). FlowChronicle: Synthetic Network Flow Generation through Pattern Set Mining Generation. In the 20th International Conference on emerging Networking EXperiments and Technologies (CoNEXT). https://dl.acm.org/doi/10.1145/3696407

Mapping AI Benchmark Data to Quantitative Risk Estimates Through Expert Elicitation

Published in arXiv, 2025

The literature and multiple experts point to many potential risks from large language models (LLMs), but there are still very few direct measurements of the actual harms posed. AI risk assessment has so far focused on measuring the models’ capabilities, but the capabilities of models are only indicators of risk, not measures of risk. Better modeling and quantification of AI risk scenarios can help bridge this disconnect and link the capabilities of LLMs to tangible real-world harm. This paper makes an early contribution to this field by demonstrating how existing AI benchmarks can be used to facilitate the creation of risk estimates. We describe the results of a pilot study in which experts use information from Cybench, an AI benchmark, to generate probability estimates. We show that the methodology seems promising for this purpose, while noting improvements that can be made to further strengthen its application in quantitative AI risk assessment.

Recommended citation: Malcolm Murray, Henry Papadatos, Otter Quarks, Pierre-François Gimenez, Simeon Campos. Mapping AI Benchmark Data to Quantitative Risk Estimates Through Expert Elicitation. https://doi.org/10.48550/arXiv.2503.04299

TADAM: Learning Timed Automata From Noisy Observations

Published in SIAM International Conference on Data Mining (SDM25), 2025

Timed Automata (TA) are formal models capable of representing regular languages with timing constraints, making them well-suited for modeling systems where behavior is driven by events occurring over time. Most existing work on TA learning relies on active learning, where access to a teacher is assumed to answer membership queries and provide counterexamples. While this framework offers strong theoretical guarantees, it is impractical for many real-world applications where such a teacher is unavailable. In contrast, passive learning approaches aim to infer TA solely from sequences accepted by the target automaton. However, current methods struggle to handle noise in the data, such as symbol omissions, insertions, or permutations, which often result in excessively large and inaccurate automata. In this paper, we introduce TADAM, a novel approach that leverages the Minimum Description Length (MDL) principle to balance model complexity and data fit, allowing it to distinguish between meaningful patterns and noise. We show that TADAM is significantly more robust to noisy data than existing techniques, less prone to overfitting, and produces concise models that can be manually audited. We further demonstrate its practical utility through experiments on real-world tasks, such as network flow classification and anomaly detection.

Recommended citation: Cornanguer, L. & Gimenez, P. F., (2025 May). TADAM: Learning Timed Automata From Noisy Observations. In the SIAM International Conference on Data Mining (SDM25). https://doi.org/10.1137/1.9781611978520

Towards programming languages free of injection-based vulnerabilities by design

Published in The 11th Workshop on Language-theoretic Security and Applications (LangSec25), 2025

Many systems are controlled via commands built upon user inputs. For systems that deal with structured commands, such as SQL queries, XML documents, or network messages, such commands are generally constructed in a “fill-in-the-blank” fashion: the user input is concatenated with a fixed part written by the developer (the template). However, the user input can be crafted to modify the command’s semantics intended by the developer and lead to the system’s malicious usages. Such an attack, called an injection-based attack, is considered one of the most severe threat to web applications. Solutions to prevent such vulnerabilities exist but are generally ad hoc and rely on the developer’s expertise and diligence. Our approach addresses these vulnerabilities from the formal language theory’s point of view. We formally define two new security properties. The first one, “intent-equivalence”, guarantees that a developer’s template cannot lead to malicious injections. The second one, “intent-security”, guarantees that every possible template is intent-equivalent, and therefore that the programming language itself is secure. From these definitions, we show that new design patterns can help create programming languages that are secure by design.

Recommended citation: Alata, E. & Gimenez, P. F., (2025 May). Towards programming languages free of injection-based vulnerabilities by design. In the 11th Workshop on Language-theoretic Security and Applications (LangSec25). https://doi.org/10.1109/SPW67851.2025.00008

Certifiably robust malware detectors by design

Published in 40th International Conference on ICT Systems Security and Privacy Protection (IFIPSEC25), 2025

Malware analysis involves analyzing suspicious software to detect malicious payloads. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behaviour of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. We propose a new model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. Our framework ERDALT is based on this structure. We compare and validate these approaches with machine-learning-based malware detection methods, allowing for robust detection with limited reduction in detection performance.

Recommended citation: Gimenez, P. F., Sivaprasad, S. & Fritz, M., (2025 May). Certifiably robust malware detectors by design. In the 40th International Conference on ICT Systems Security and Privacy Protection (IFIPSEC25) https://link.springer.com/chapter/10.1007/978-3-031-92886-4_9

A theory of injection-based vulnerabilities in formal grammars

Published in Journal of Theoretical Computer Science (TCS), 2025

Many systems are controlled via commands built upon user inputs. For systems that deal with structured commands, such as SQL queries, XML documents, or network messages, such commands are generally constructed in a “fill-in-the-blank” fashion: the user input is concatenated with a fixed part written by the developer (the template). However, the user input can be crafted to modify the command’s semantics intended by the developer and lead to the system’s malicious usages. Such an attack, called an injection-based attack, is considered one of the most severe threat to web applications. Solutions to prevent such vulnerabilities exist but are generally ad hoc and rely on the developer’s expertise and diligence. Our approach addresses these vulnerabilities from the formal language theory’s point of view. We formally define two new security properties. The first one, “intent-equivalence”, guarantees that a developer’s template cannot lead to malicious injections. The second one, “intent-security”, guarantees that every possible template is intent-equivalent, and therefore that the programming language itself is secure. We thoroughly analyze the decidability of these properties for the most common grammar classes. We conclude by highlighting the technical implications of these results for various settings and tools.

Recommended citation: Alata, E. & Gimenez, P. F., (2025). A theory of injection-based vulnerabilities in formal grammars. In Theoretical Computer Science. https://doi.org/10.1016/j.tcs.2025.115554

Synthetic Network Traffic Generation for Intrusion Detection Systems: a Systematic Literature Review

Published in ESORICS 2025 International Workshops, 2025

Network data can be difficult to collect due to privacy and confidentiality reasons. For these reasons, network datasets are typically created with controlled environments called testbeds. However, these datasets are regularly criticized for their limited size, class imbalance, obsolescence, and lack of actual user activity. Following the rapid development of generative artificial intelligence, new methods have been applied to synthetic network traffic generation without emulation or simulation. This systematic literature review assesses the current state of synthetic network traffic generation for intrusion detection systems.

Recommended citation: Gimenez, P. F., (2025). Synthetic Network Traffic Generation for Intrusion Detection Systems: a Systematic Literature Review. In the ESORICS 2025 International Workshops.

Superviz25-SQL: High-Quality Dataset to Empower Unsupervised SQL Injection Detection Systems

Published in ESORICS 2025 International Workshops, 2025

The digitalization of public and private services led to more sophisticated and serious cybersecurity threats. Among them, SQL injection attacks leverage user inputs to remotely execute malicious actions on a database, such as data exfiltration and deletion, or privilege escalation. They are regularly classified as one of the most prominent threats to web services. Intrusion detection systems are widely used to detect such injection attacks and react to them, but it is difficult to assess their actual effectiveness and compare them because of a lack of high-quality datasets. Current SQL injection datasets lack structural diversity, are poorly documented, and generated samples are not representative of real-world infrastructures. This article presents a new dataset, called S25-SQL, whose design is structured around four quality dimensions: realism, diversity, benchmarking capabilities and the presence of good documentation. Nine novelty-detection baselines leveraging standard feature extraction techniques and machine learning algorithms are provided, serving as reference points for future SQL injection detection techniques.

Recommended citation: Quetel, G., Alata, A., Gimenez, P. F., Pautet, L. & Robert, T., (2025). SuperviZ25-SQL: High-Quality Dataset to Empower Unsupervised SQL Injection Detection Systems. In the ESORICS 2025 International Workshops.

ROSCA: Robust and Scalable Security Alert Correlation and Prioritisation using the MITRE ATT&CK Framework

Published in First International Workshop on Analytics, Telemetry, and Cybersecurity for HPCC (High Performance Computing and Communications) (WATCH2025), 2025

In large organisations and complex infrastructures, the overwhelming volume of security alerts often results in analyst fatigue, delayed responses, and missed attacks. Moreover, Security Operations Centers (SOCs) typically rely on commercial solutions that operate as black boxes, offering limited transparency into their alert classification mechanisms and lacking the flexibility for in-house adaptation or re-implementation. To address these limitations and improve situational awareness, this paper proposes ROSCA, an efficient alert prioritisation method grounded in the MITRE ATT&CK kill chain model. The proposed approach automatically aggregates and correlates alerts based on their shared attributes, enabling the construction of contextualised cases. Each case is then assigned a score reflecting its threat relevance and urgency, before being presented to analysts within a prioritised queue. Our method emphasises the detection of multi-stage attack patterns and supports rapid processing through a robust, noise-tolerant scoring mechanism designed for interpretability and operational integration. We validate the effectiveness of ROSCA using real-world alert data and compare it against the MATE framework, demonstrating superior prioritisation accuracy and more reliable identification of critical alerts.

Recommended citation: Garcia, R., Lahmadi, A., Gimenez, P. F. & Sala, C. ROSCA: Robust and Scalable Security Alert Correlation and Prioritisation using the MITRE ATT&CK Framework. First International Workshop on Analytics, Telemetry, and Cybersecurity for HPCC (High Performance Computing and Communications) (WATCH2025)

CasinoLimit: An Offensive Dataset Labeled with MITRE ATT&CK Techniques

Published in 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2025), 2025

Cybersecurity exercises are a common way to train and evaluate the skills of cybersecurity professionals. These exercises also provide a unique opportunity to generate datasets with realistic attack traces on non-sensitive systems. Nevertheless, the collected logs are unlabeled, and deciding which logs are related to pentesters is a difficult problem. In this paper, we present a novel methodology to label efficiently both system and network logs using MITRE ATT&CK techniques. To demonstrate the effectiveness of our approach, we introduce CL, a dataset generated from a pentest exercise that has been played by 114 participants where we collected 540 Go of attack data. We apply our methodology to accurately label these logs with a semiautomatic approach: labels are inferred from the shell sessions and propagated to the network sessions, and eventually corrected by a junior analyst. An expert analyst has manually reviewed all the labels that have been computed to ensure the quality of the labeling process. The results of the pentest exercise are deeply discussed. We show the variability of players’ behaviors and that players can be distinguished by their command line habits. In addition, the high level of granularity of labels coupled with the number of participants enables multiple other applications. With this paper, we release the full dataset and the associated labeling tool, MANATEE, which can be used to browse the logs and labels. To support the generalization of our approach, we made it possible to load other datasets with this tool.

Recommended citation: Sebastian Killian, et al.. CasinoLimit: An Offensive Dataset Labeled with MITRE ATT&CK Techniques. The 28th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2025)

software

ecf: empirical Christoffel function for anomaly detection

Published:

Authors: Pierre-François Gimenez

License: MIT

This python module is an free and open source implementation of the empirical Christoffel function applied to outlier detection as proposed by Lasserre and Pauwels in The empirical Christoffel function with applications in data analysis. Pypi page. Source code.

TADAM

Published:

Authors: Lénaïg Cornanguer and Pierre-François Gimenez

License: BSD-3-Clause

TADAM is a probabilistic time automata learner from noisy observations, made in the context of the SecGen associate team between Inria and CISPA. It an implementation of the learning algorithm described in TADAM: Learning Timed Automata From Noisy Observations. Pypi page. Source code.

talks

Learning Lexicographic Preference Trees from Positive Examples

Published:

We consider the task of learning the preferences of users on a combinatorial set of alternatives, as it can be the case for example with online configurators. In many settings, what is available to the learner is a set of positive examples of alternatives that have been selected during past interactions. We propose to learn a model of the users’ preferences that ranks previously chosen alternatives as high as possible. Here, we study the particular task of learning conditional lexicographic preferences. We present an algorithm to learn several classes of lexicographic preference trees, prove convergence properties of the algorithm, and experiment on both synthetic data and on a real-world bench in the domain of recommendation in interactive configuration. Slides

Interactive configuration with constraints consistency and recommendation

Published:

We present our work on the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. This presentation details how we handle constraints about the configuration and highlights some experimental results. Slides

Un système agnostique de détection d’intrusion radio pour protéger l’Internet des objets

Published:

L’expansion de l’Internet des objets (IoT) entraîne l’apparition de maisons intelligentes, d’usines intelligentes et même de villes intelligentes. Bien que ces objets améliorent la qualité de vie de ses utilisateurs et offrent de nouvelles opportunités économiques, ils sont aussi un important vecteur d’attaques (le botnet Mirai étant sûrement l’exemple le plus connu). Pour protéger ces environnements, des systèmes de détection d’intrusion (IDS) sont développés. Ces IDS rencontrent des problématiques uniques à l’IoT, telles que l’évolution rapide des technologies et des protocoles ou encore leur réseau décentralisé. Pour surmonter ces problèmes, nous proposons un IDS qui surveille de larges bandes de fréquences au niveau de la couche physique sans faire d’hypothèses sur les protocoles ou les technologies présentes. De plus, notre solution propose pour chaque attaque détectée un diagnostic triple : temporel (les dates exactes de l’anomalie détectée), fréquentiel (la fréquence principale de l’anomalie) et spatial (la position estimée de l’origine de l’anomalie). Nous avons expérimenté notre méthode avec une expérimentation grandeur nature: notre système a pu efficacement détecter et diagnostiquer les attaques lancées sur les bandes 400-500 MHz et 800-900 MHz, deux bandes qui ne sont pas couvertes par les solutions traditionnelles. Slides

A formal study of injection-based vulnerabilities and some tools it will enable

Published:

Many systems work by receiving instructions and processing them: e.g., a browser receives and then displays an HTML page and executes Javascript scripts, a database receives a query and then applies it to its data, an embedded system controlled through a protocol receives and then processes a message. When such instructions depend on user input, one generally constructs them with concatenation or insertion. It can lead to injection-based attacks: when the user input modifies the query’s intended semantics and leads to a security breach. Protections do exist but are not sufficient as they never tackle the origin of the problem: the language itself. We propose a new formal approach based on formal languages to assess risk, enhance static analysis, and enable new tools. This approach is general and can be applied to query, programming, and domain-specific languages as well as network protocols. We are setting up an ANR project to go into this subject in more depth. The presentation, in French, is available on Youtube. Slides

Machine Learning 101

Published:

Machine learning is applied successfully in various domains, including cybersecurity, where it has been used for intrusion detection, malware analysis, and attack comprehension, for example. Therefore, many cybersecurity researchers seek to catch up and introduce such techniques in their research. This presentation aims to provide the basics of machine learning for cybersecurity researchers. The presentation is available on Youtube. Slides

La sécurité informatique à l’ère de l’intelligence artificielle

Published:

Avec la numérisation de nos vies, les systèmes informatiques ont pris une place prépondérante dans notre société et sont naturellement devenus la cible d’attaquants, du “script kiddie” au groupe organisé à l’objectif politique. Depuis quelques années, l’intelligence artificielle s’est invitée à la fête : qu’elle permette de détecter automatiquement des attaques ou qu’elle soit subrepticement manipulée dans les voitures autonomes, elle amène son lot de problèmes et de solutions. Ce séminaire a pour objectif de présenter ces deux domaines, leurs enjeux et leurs interactions, et de mettre en avant les pistes de recherche que la communauté scientifique privilégie pour lutter contre ces menaces. Slides

Machine learning et sécurité : entre menaces et opportunités

Published:

Cette présentation rappelle les concepts fondamentaux en sécurité et présente les applications des techniques de machine learning aux multiples problématiques liées à la sécurité. Le dernier axe abordé est celui de la sécurité des méthodes de machine learning, qui sont directement attaquées par de nombreuses techniques aux objectifs et aux moyens multiples. Séminaire présenté avec Ludovic Mé. Slides

Une introduction aux méthodes d’IA explicables

Published:

Avec l’utilisation toujours croissantes des techniques d’IA, le besoin de vérification des prédictions se fait de plus en plus pressant. C’est l’objectif des méthodes d’explicabilité : permettre à un utilisateur de savoir pourquoi une décision a été prise par le système. Cette introduction fait un point sur les nombreuses familles de techniques disponibles. Slides

Interactive configuration and recommendation in presence of constraints

Published:

We present our work on the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. This presentation details how we handle constraints about the configuration and highlights some experimental results. Slides

The complexity of unsupervised learning of lexicographic preferences

Published:

This work considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees. Slides

Behavioral intrusion detection system based on machine learning

Published:

In this talk, I present the sec2graph approach, its performances, and its explanation mechanism. This mechanism helped us identify several flaws we identified in the labelling of the CICIDS2017 dataset and in the traffic capture, such as packet misorder, packet duplication and attack that were performed but not correctly labelled. Slides

Anomaly detection and explanation in networks with machine learning

Published:

This talk presents recent work on anomaly detection in network data and anomaly explanation. Our approach represents the network data with a security objects graph analyzed by an autoencoder. We introduce a new statistical explanation technique for reconstruction-based methods and compare it with SHAP. Finally, we use these explanations to analyze the dataset CICIDS2017 and check whether they match the expert’s expectations. Slides

The complexity of unsupervised learning of lexicographic preferences

Published:

This work considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees. Slides

A theory of injection-based vulnerabilities in formal grammars

Published:

Many systems work by receiving instructions and processing them: e.g., a browser receives and then displays an HTML page and executes Javascript scripts, a database receives a query and then applies it to its data, an embedded system controlled through a protocol receives and then processes a message. When such instructions depend on user input, one generally constructs them with concatenation or insertion. It can lead to injection-based attacks: when the user input modifies the query’s intended semantics and leads to a security breach. Protections do exist but are not sufficient as they never tackle the origin of the problem: the language itself. We propose a new formal approach based on formal languages to assess risk, enhance static analysis, and enable new tools. This approach is general and can be applied to query, programming, and domain-specific languages as well as network protocols. Slides

Certifiably robust malware detectors by design

Published:

Malware analysis consists in analyzing suspicious software to detect malicious payload. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. This article proposes a taxonomy of the transformations an attacker can use depending on the threat models that modelize their capability. We show the effectiveness of this taxonomy by proposing a new set of features and model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. We compare and validate these approaches with various machine-learning-based malware detection methods, allowing for robust detection with minimal detection performance reduction.

Network traffic generation: a non-technical look at its characteristics and stakes

Published:

Intrusion detection is an essential mechanism in information systems security. Machine learning has been successfully applied to this problem. These techniques rely on training data used to train a detection model. This training data generally comes from datasets that are often more or less automatically generated. Worse, the number of datasets remains small enough that the diversity of the dataset is questionable, and its aging is problematic. A solution to these problems is synthetic data generation: it would be free of experimental inaccuracies, could be easily updated, and alleviate the class imbalance by generating more data on rare classes. We plan to generate benign data only, as attacks are easier to generate with dedicated tools. This talk highlights the characteristics of network traffic generation from a non-technical point of view, adapted to data mining practitioners, as well as the issues and opportunities. Slides

Conditionally Acyclic CO-Networks for Efficient Preferential Optimization

Published:

This paper focuses on graphical models for modelling preferences in combinatorial space and their use for item optimisation. The preferential optimisation task seeks to find the preferred item containing some defined values, which is useful for many recommendation settings in e-commerce. We show that efficient (i.e., with polynomial time complexity) preferential optimisation is achieved with a subset of cyclic CP-nets called conditional acyclic CP-net. We also introduce a new graphical preference model, called Conditional-Optimality networks (CO-networks), that are more concise than conditional acyclic CP-nets and LP-trees but have the same expressiveness with respect to optimisation. Finally, we empirically show that preferential optimisation can be used for encoding alternatives into partial instantiations and vice versa, paving the way towards CO-nets and CP-nets unsupervised learning with the minimal description length (MDL) principle. Poster

Towards Understanding Alerts raised by Unsupervised Network Intrusion Detection Systems

Published:

The use of Machine Learning for anomaly detection in cyber security-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on unsupervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our work identifies the most informative network traffic features associated with an anomaly alert, providing interpretations for the generated alerts. We conduct an empirical study using a large-scale network intrusion dataset, CICIDS2017, to compare the proposed AE-pvalues method with two state-of-the-art baselines applied in the unsupervised anomaly detection task. Our experimental results show that the AE-pvalues method accurately identifies abnormal influential network traffic features. Furthermore, our study demonstrates that the explanation outputs can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures. Slides

Intelligence artificielle : d’où vient-elle, jusqu’où ira-t-elle ?

Published:

Un exposé de vulgarisation d’une heure sur le thème de l’intelligence artificielle à l’occasion des 30 ans du GIP RENATER. Dans cette présentation, je reviens sur l’histoire de l’intelligence artificielle et je démystifie un peu son fonctionnement. Je présente ensuite les principaux domaines révolutionnés par l’intelligence artificielle et les nombreux risques qu’elle soulève. Slides

Three new challenges on network data generation

Published:

In this presentation, I present three potential ideas to investigate within the SecGen project: 1) generation for concept drift evaluation, 2) causal learning and 3) system logs generation. Slides

BAGUETTE: Hunting for Evidence of Malicious Behavior in Dynamic Analysis Reports

Published:

Malware analysis consists of studying a sample of suspicious code to understand it and producing a representation or explanation of this code that can be used by a human expert or a clustering/classification/detection tool. The analysis can be static (only the code is studied) or dynamic (only the interaction between the code and its host during one or more executions is studied). The quality of the interpretation of a code and its later detection depends on the quality of the information contained in this representation. To date, many analyses produce voluminous reports that are difficult to handle quickly. In this article, we present BAGUETTE, a graph-based representation of the interactions of a sample and the resources offered by the host system during one execution. We explain how BAGUETTE helps automatically search for specific behaviors in a malware database and how it efficiently assists the expert in analyzing samples. We also develop a possible use case of BAGUETTE being currently researched: explainable unsupervised malware behavior clustering. Slides - Video

Robust malware detectors by design

Published:

This work present the cooperation with CISPA on robust malware detectors. Malware analysis involves analyzing suspicious software to detect malicious payload. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. We propose a taxonomy of the transformations an attacker can use depending on the threat models that modelize their capability. We show the effectiveness of this taxonomy by proposing a new set of features and model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. Our framework ERDALT is based on this structure. Slides

Learning Conditional Preference Networks: an Approach Based on the Minimum Description Length Principle

Published:

CP-nets are a very expressive graphical model for representing preferences over combinatorial spaces. They are particularly well suited for settings where an important task is to compute the optimal completion of some partially specified alternative; this is, for instance, the case of interactive configurators, where preferences can be used at every step of the interaction to guide the decision maker towards a satisfactory configuration. Learning CP-nets is challenging when the input data has the form of pairwise comparisons between alternatives. Furthermore, this type of preference data is not commonly stored: it can be elicited but this puts an additional burden on the decision maker. In this article, we propose a new method for learning CP-nets from sales history, a kind of data readily available in many e-commerce applications. The approach is based on the minimum description length (MDL) principle. We show some theoretical properties of this learning task, namely its sample complexity and its NP-completeness, and we experiment with this learning algorithm in a recommendation setting with real sales history from a car maker. Slides - Poster

Leveraging explainability to increase the usability of intrusion detection systems

Published:

The use of Machine Learning for anomaly detection in cyber security-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on unsupervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our study demonstrates that these explanations can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures. Slides

Towards programming languages free of injection-based vulnerabilities by design

Published:

Many systems are controlled via commands built upon user inputs. For systems that deal with structured commands, such as SQL queries, XML documents, or network messages, such commands are generally constructed in a “fill-in-the-blank” fashion: the user input is concatenated with a fixed part written by the developer (the template). However, the user input can be crafted to modify the command’s semantics intended by the developer and lead to the system’s malicious usages. Such an attack, called an injection-based attack, is considered one of the most severe threat to web applications. Solutions to prevent such vulnerabilities exist but are generally ad hoc and rely on the developer’s expertise and diligence. Our approach addresses these vulnerabilities from the formal language theory’s point of view. We formally define two new security properties. The first one, “intent-equivalence”, guarantees that a developer’s template cannot lead to malicious injections. The second one, “intent-security”, guarantees that every possible template is intent-equivalent, and therefore that the programming language itself is secure. From these definitions, we propose new techniques to create programming language that are secure by design, and present two secure, simplified versions of widespread languages.

Keynote: can generative AI help us better assess security solutions?

Published:

In this keynote, I present the work that lead us to work on synthetic data generation, and how we generate security data. This works aims at creating realistic and diverse network and system datasets to better evaluate security solutions, most notably supervision solutions. I will notably present two works on network flow records generation using pattern mining and on packet header generation using probabilistic timed automata generation. Slides

Ongoing work on synthetic network traffic generation for IDS evaluation

Published:

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. This presentation entails some recent work on this subject. Slides

FlowChronicle: Synthetic Network Flow Generation through Pattern Set Mining

Published:

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. Most methods generating synthetic network flows disregard the temporal dependencies between them, leading to unrealistic traffic. To address this issue, we introduce FlowChronicle, a novel synthetic network flow generation tool that relies on pattern mining and statistical models to preserve temporal dependencies. We empirically compare our method against state-of-the-art techniques on several criteria, namely realism, diversity, compliance, and novelty. This evaluation demonstrates the capability of FlowChronicle to achieve high-quality generation while significantly outperforming the other methods in preserving temporal dependencies between flows. Besides, in contrast to deep learning methods, the patterns identified by FlowChronicle are explainable, and experts can verify their soundness. Our work substantially advances synthetic network traffic generation, offering a method that enhances both the utility and trustworthiness of the generated network flows. Slides - Video

Towards more realistic honeypots with synthetic network traffic injection

Published:

Honeypots and honeynets need to be realistic to attract and convince attackers to reveal their techniques. Several work on realistic file systems, but realistic local network communication is still an open question. In this work, we propose to generate synthetic network traffic using generative machine learning techniques and inject it into the network. This presentation entails some recent, ongoing work on this subject.

Synthetic network traffic with Fos-R

Published:

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. This presentation entails some recent work on this subject.

TADAM: Learning Timed Automata from Noisy Observations

Published:

Timed Automata (TA) are formal models capable of representing regular languages with timing constraints, making them well-suited for modeling systems where behavior is driven by events occurring over time. Most existing work on TA learning relies on active learning, where access to a teacher is assumed to answer membership queries and provide counterexamples. While this framework offers strong theoretical guarantees, it is impractical for many real-world applications where such a teacher is unavailable. In contrast, passive learning approaches aim to infer TA solely from sequences accepted by the target automaton. However, current methods struggle to handle noise in the data, such as symbol omissions, insertions, or permutations, often resulting in excessively large and inaccurate automata. In this paper, we introduce TADAM, a novel approach that leverages the Minimum Description Length (MDL) principle to balance model complexity and data fit, allowing it to distinguish between meaningful patterns and noise. We show that TADAM is significantly more robust to noisy data than existing techniques, less prone to overfitting, and produces concise models that can be manually audited. We further demonstrate its practical utility through experiments on real-world tasks, such as network flow classification and anomaly detection. Slides - Poster

Towards programming languages free of injection-based vulnerabilities by design

Published:

Many systems are controlled via commands built upon user inputs. For systems that deal with structured commands, such as SQL queries, XML documents, or network messages, such commands are generally constructed in a “fill-in-the-blank” fashion: the user input is concatenated with a fixed part written by the developer (the template). However, the user input can be crafted to modify the command’s semantics intended by the developer and lead to the system’s malicious usages. Such an attack, called an injection-based attack, is considered one of the most severe threat to web applications. Solutions to prevent such vulnerabilities exist but are generally ad hoc and rely on the developer’s expertise and diligence. Our approach addresses these vulnerabilities from the formal language theory’s point of view. We formally define two new security properties. The first one, “intent-equivalence”, guarantees that a developer’s template cannot lead to malicious injections. The second one, “intent-security”, guarantees that every possible template is intent-equivalent, and therefore that the programming language itself is secure. From these definitions, we show that new design patterns can help create programming languages that are secure by design. Slides

Robust malware detectors by design

Published:

This work present the cooperation with CISPA on robust malware detectors. Malware analysis involves analyzing suspicious software to detect malicious payload. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. We propose a taxonomy of the transformations an attacker can use depending on the threat models that modelize their capability. We show the effectiveness of this taxonomy by proposing a new set of features and model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. Our framework ERDALT is based on this structure. Slides

Towards more realistic honeypots with synthetic network traffic injection

Published:

Honeypots and honeynets need to be realistic to attract and convince attackers to reveal their techniques. Several work on realistic file systems, but realistic local network communication is still an open question. In this work, we propose to generate synthetic network traffic using generative machine learning techniques and inject it into the network. This presentation entails some recent, ongoing work on this subject.

AI for Cybersecurity: Three Applications for Network Security

Published:

Researchers have experimented with many AI techniques for detecting network intrusions. This presentation details how curious results led us to a new research direction at the intersection of AI and cybersecurity. This talk will rely on three recent works: 1) how to apply AI to intrusion detection, 2) a new XAI (eXplainable AI) framework dedicated to anomaly detection and 3) a new method for assessing data quality and generating synthetic network data with AI. Slides

Background traffic generation with statistical AI

Published:

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. This presentation entails some recent work on this subject.

Synthetic Network Traffic Generation for Intrusion Detection Systems: a Systematic Literature Review

Published:

Network data can be difficult to collect due to privacy and confidentiality reasons. For these reasons, network datasets are typically created with controlled environments called testbeds. However, these datasets are regularly criticized for their limited size, class imbalance, obsolescence, and lack of actual user activity. Following the rapid development of generative artificial intelligence, new methods have been applied to synthetic network traffic generation without emulation or simulation. This systematic literature review assesses the current state of synthetic network traffic generation for intrusion detection systems. Slides

Synthetic Network Traffic Generation for Intrusion Detection Systems: a Systematic Literature Review

Published:

Network data can be difficult to collect due to privacy and confidentiality reasons. For these reasons, network datasets are typically created with controlled environments called testbeds. However, these datasets are regularly criticized for their limited size, class imbalance, obsolescence, and lack of actual user activity. Following the rapid development of generative artificial intelligence, new methods have been applied to synthetic network traffic generation without emulation or simulation. This systematic literature review assesses the current state of synthetic network traffic generation for intrusion detection systems.

Towards more realistic honeypots with synthetic network traffic injection

Published:

Honeypots and honeynets need to be realistic to attract and convince attackers to reveal their techniques. Several work on realistic file systems, but realistic local network communication is still an open question. In this work, we propose to generate synthetic network traffic using generative machine learning techniques and inject it into the network. This presentation entails some recent, ongoing work on this subject.

FlowChronicle: Synthetic Network Flow Generation through Pattern Set Mining

Published:

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. Most methods generating synthetic network flows disregard the temporal dependencies between them, leading to unrealistic traffic. To address this issue, we introduce FlowChronicle, a novel synthetic network flow generation tool that relies on pattern mining and statistical models to preserve temporal dependencies. We empirically compare our method against state-of-the-art techniques on several criteria, namely realism, diversity, compliance, and novelty. This evaluation demonstrates the capability of FlowChronicle to achieve high-quality generation while significantly outperforming the other methods in preserving temporal dependencies between flows. Besides, in contrast to deep learning methods, the patterns identified by FlowChronicle are explainable, and experts can verify their soundness. Our work substantially advances synthetic network traffic generation, offering a method that enhances both the utility and trustworthiness of the generated network flows.

teaching

Machine learning, classification and Bayesian network

M1, UPSSITECH engineering school, 2015

This course aims to train the students in machine learning techniques:

  • Overview of various ML techniques. Supervised learning (decision tree learning, k-nn, SVM) and unsupervised learning.
  • Experiments with Weka, a data mining tool
  • Bayesian reasoning and learning
  • Project on reinforcement learning: the students must develop an agent that learns to escape a maze with Q-learning.

Artificial intelligence

L3, University of Toulouse, 2015

This course aims to train the students in general artificial intelligence techniques:

  • Representation of problems with state diagrams
  • Automatic resolution with search algorithms (breadth-first search, depth-first search, iterative depth-first search, best-first, greedy search and A*)
  • Exploration of game tree (minimax algorithm)
  • Graph coloration (Welsh and Power algorithm and DSatur)
  • Resolution of constraint satisfaction problems (backtrack and forward checking algorithms)

Graphs and algorithms

L3, INSA-Toulouse, 2018

This course presents some graph theory and their algorithms:

  • Topological sort and level structure
  • Breadth-first and depth-first search
  • Shortest path algorithms: Dijkstra, Bellman-Ford
  • Maximum flow / minimum cut with Ford-Fulkerson algorithm

Web development and security

M2, CentraleSupélec, 2022

I was the coordinator of this module in 2022-2024. Besides, I teach the Flash backend framework.