Page Not Found
Page not found. Your pixels are in another canvas.
A list of all the posts and pages found on the site. For you robots out there is an XML version available for digesting as well.
Page not found. Your pixels are in another canvas.
About me
This is a page not in th emain menu
Published in 18th International Configuration Workshop, 2016
The present work deals with the the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. From this point, two directions have been explored. The first one is to select a set of similar configurations in the sale history (typically,the k closest ones, using a distance measure) and to compute the best recommendation from this set - this is the line proposed by [9]. The second one, that we propose here, is to learn a Bayesian network from the entire sample as model of the users’ preferences, and to use it to recommend a pertinent value.
Recommended citation: Fargier, H., Gimenez, P. F., & Mengin, J. (2016, September). Recommendation for product configuration: an experimental evaluation. In 18th International Configuration Workshop (CWS 2016) within CP 2016: 22nd International Conference on Principles and Practice of Constraint Programming (pp. pp-9). https://hal.archives-ouvertes.fr/hal-01445239/file/fargier_17217.pdf
Published in AAAI Conference on Artificial Intelligence, 2018
This paper considers the task of learning the preferences of users on a combinatorial set of alternatives, as it can be the case for example with online configurators. In many settings, what is available to the learner is a set of positive examples of alternatives that have been selected during past interactions. We propose to learn a model of the users’ preferences that ranks previously chosen alternatives as high as possible. In this paper, we study the particular task of learning conditional lexicographic preferences. We present an algorithm to learn several classes of lexicographic preference trees, prove convergence properties of the algorithm, and experiment on both synthetic data and on a real-world bench in the domain of recommendation in interactive configuration.
Recommended citation: Fargier, H., Gimenez, P. F., & Mengin, J. (2018, April). Learning lexicographic preference trees from positive examples. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1). https://ojs.aaai.org/index.php/AAAI/article/download/11808/11667
Published in Université Paul Sabatier - Toulouse III, 2018
The analysis and the exploitation of preferences occur in multiple domains, such as economics, humanities and psychology. E-commerce got interested in the subject a few years ago with the surge of product personalisation. Our study deals with the representation and the learning of preferences on objects described by a set of attributes. These combinatorial spaces are huge, which makes the representation of an ordering in extenso intractable. That’s why preference representation languages have been built: they can represent preferences compactly on these huge spaces. In this dissertation, we study preference representation languages and preference learning.
Recommended citation: Pierre-François Gimenez. Apprentissage de préférences en espace combinatoire et application à la recommandation en configuration interactive. Intelligence artificielle [cs.AI]. Université Paul Sabatier - Toulouse III, 2018. Français. ⟨NNT : 2018TOU30182⟩. ⟨tel-02303275⟩ https://www.theses.fr/2018TOU30182
Published in Journal of Universal Computer Science, 2020
The present work deals with the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sales histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. From this point, two directions have been explored. The first one is to select a set of similar configurations in the sales history (typically, the k closest ones, using a distance measure) and to compute the best recommendation from this set – this is the line proposed by [Coster et al., 2002]. The second one, that we propose here, is to learn a model from the entire sample as representation of the users’ preferences, and to use it to recommend a pertinent value; three families of models are experimented: the Bayesian networks, the naive Bayesian networks and the lexicographic preferences trees.
Recommended citation: H Fargier, PF Gimenez, J Mengin - Journal of Universal Computer Science, 2020 https://www.jucs.org/jucs_26_3/experimental_evaluation_of_three.html
Published in IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), 2020
Nowadays, air travel is one of the safest transportation means. While safety is historically well integrated into avionics systems, it is becoming increasingly important to take into account the security of such systems for the future. In particular, Host-based Intrusion Detection Systems (HIDS) are commonly used in traditional information systems to improve their security. The adaptation of such systems for deployment inside an aircraft has been studied in another work and has shown to be effective in detecting anomalous behavior in an avionic application. However, the detection itself is not sufficient to provide an on-board reaction, and to prevent such intrusion. This paper proposes to improve such HIDS by introducing a signature-based system capable of providing a first diagnosis after the detection of an anomalous behavior. The proposed diagnosis approach is based on the definition of the signature of an alert, and its comparison with a knowledge database that is regularly updated throughout aircraft lifetime. This approach has been implemented on a real avionic computer and yielded good results in terms of classification accuracy and resources consumption.
Recommended citation: Damien, A., Gimenez, P. F., Feyt, N., Nicomette, V., Kaâniche, M., & Alata, E. (2020, October). On-board Diagnosis: A First Step from Detection to Prevention of Intrusions on Avionics Applications. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE) (pp. 358-368). IEEE. https://ieeexplore.ieee.org/document/9251064
Published in IEEE 19th International Symposium on Network Computing and Applications (NCA), 2020
Energy providers are massively deploying devices to manage distributed resources or equipment. These devices are used for example to manage the energy of smart factories efficiently or to monitor the infrastructure of smart-grids. By design, they typically exhibit homogeneous behavior, with similar software and hardware architecture. Unfortunately, these devices are also of interest to attackers aiming to develop botnets or compromise companies’ security. This paper presents a new protection approach based on Hardware Performance Counters (HPC) to detect anomalies in massively deployed devices. These HPC are processed using outlier detection algorithms. Compared to existing solutions, we propose a lightweight approach based on a comparative analysis of devices’ HPC without relying on the modeling of the software applications running on the devices. To assess the relevance and the effectiveness of the approach, a thorough experimental analysis is carried out in a representative industrial-type environment, sampling the data from 100 Raspberry Pi to simulate about 10,000 devices deployed simultaneously. The results show high detection and performance efficiency under different software profiles and attack payloads. Moreover, the calibration of the approach depends primarily on the hardware rather than the application software running on the devices. It should ease its deployment in an operational environment.
Recommended citation: Bourdon, M., Gimenez, P. F., Alata, E., Kaaniche, M., Migliore, V., Nicomette, V., & Laarouchi, Y. (2020, November). Hardware-Performance-Counters-based anomaly detection in massively deployed smart industrial devices. In 2020 IEEE 19th International Symposium on Network Computing and Applications (NCA) (pp. 1-8). IEEE. https://ieeexplore.ieee.org/document/9306726
Published in ACM Transactions on Cyber-Physical Systems, 2021
The expansion of the Internet-of-Things (IoT) market is visible in homes, factories, public places, and smart cities. While the massive deployment of connected devices offers opportunities to improve quality of life and to develop new services, the impact of such devices on the security of the users in a context where the level of malicious threat continues to increase is a major concern. One of the challenges is the heterogeneity and constant evolution of wireless technologies and protocols used. To overcome this problem, we propose RIDS, a Radio Intrusion Detection System that is based on the monitoring and profiling of radio communications at the physical layer level using autoencoder neural networks. RIDS is independent of the wireless protocols and modulation technologies used. Besides, it is designed to provide a threefold diagnosis of the detected anomalies: temporal (start and end date of the detected anomaly), frequential (main frequency of the anomaly), and spatial (location of the origin of the anomaly). To demonstrate the relevance and the efficiency of our approach, we collected a large dataset of radio-communications recorded with three different probes deployed in an experimental room. Multiple real-world attacks involving a wide variety of communication technologies are also injected to assess the detection and diagnosis efficiency. The results demonstrate the efficiency of RIDS in detecting and diagnosing anomalies that occurred in the 400–500 Mhz and 800–900 Mhz frequency bands. It is noteworthy that compromised devices and attacks using these communication bands are generally not easily covered by traditional solutions.
Recommended citation: Gimenez, P. F., Roux, J., Alata, E., Auriol, G., Kaâniche, M., & Nicomette, V. (2021). RIDS: Radio intrusion detection and diagnosis system for wireless communications in smart environment. ACM Transactions on Cyber-Physical Systems, 5(3), 1-1. https://dl.acm.org/doi/10.1145/3441458
Published in 9th GreHack Conference, 2021
In program analysis, a fuzzing toolset is needed to automatically trigger software operations in a natural while efficient way. Especially in dynamic analysis of malware, such a toolset can help execute the suspicious files to unveil their malicious payloads hidden by other benign-looking behaviors. In the fields of software testing, this tool is necessary for triggering and testing the programmed functionalities. Nevertheless, there has not yet been an easy-to-use tool that works on Windows for the purpose of generating activity through the Graphical User Interface (GUI). To meet this requirement, in our work, we develop GUI-Mimic. It is designed to integrate some useful features for stimulating different types of software – mouse and keyboard recording, random mouse and keyboard inputs, editing, trimming, randomization, transformations – to deliver an easy-to-deploy GUI fuzzer over different Operating Systems.
Recommended citation: Raulin, V., Gimenez, P. F., Han, Y., Viet Triem Tong, V., Ouairy, L. (2021). GUI-Mimic, a cross-platform recorder and fuzzer of Graphical User Interface. 9th GreHack Conference https://hal.archives-ouvertes.fr/hal-03449827/document
Published in RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2022
Network Intrusion Detection Systems (NIDSes) evaluation requires background traffic. However, real background traffic is hard to collect. We hence rely on synthetic traffic generated especially for this task. The quality of the generated traffic has to be evaluated according to some clearly defined criteria. In this paper, we show how to adapt the quality assessment solutions proposed for different fields of data generation such as image or text generation to network traffic. We summarize our study by discussing the criteria that evaluate the quality of a generated network traffic and by proposing functions to evaluate these criteria. This is the first contribution in the context of the Ph.D. thesis of Adrien Schoen.
Recommended citation: Schoen, A., Blanc, G., Gimenez, P. F., Han, Y., Majorczyk, F., & Mé, L. (2022, May). Towards generic quality assessment of synthetic traffic for evaluating intrusion detection systems. In RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information. https://hal.archives-ouvertes.fr/hal-03675359/document
Published in RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2022
Despite fruitful achievements made by unsupervised machine learning-based anomaly detection for network intrusion detection systems, they are still prone to the issue of high false alarm rates, and it is still difficult to reach very high recalls. In 2020, Leichtnam et al. proposed Sec2graph, an unsupervised approach applied to security objects graphs that exhibited interesting results on single-step attacks. The graph representation and the embedding allowed for better detection since it creates qualitative features. In this paper, we present new experiments to assess the performances of this approach for detecting APT attacks. We achieve better detection performances than the original work’s baseline detection methods on the DAPT2020 dataset. This work is realised in the context of the Ph.D. thesis of Maxime Lanvin, which started in October 2021.
Recommended citation: Lanvin, M., Gimenez, P. F., Han, Y., Majorczyk, F., Mé, L., & Totel, É. (2022, May). Detecting APT through graph anomaly detection. In RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information. https://hal.archives-ouvertes.fr/hal-03675346/document
Published in RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2022
Dynamic analysis is a common technique to analyze the run-time behavior of software and identify malware (malicious software). Execution traces typically contain the list of system calls with their parameters, the list of accessed files, etc. Several representations have been proposed to organize these data better and help both human experts and automated tools analyze them effectively. This paper reviews these representations and identifies four research problems that the first author plans to investigate during his Ph.D.
Recommended citation: Raulin, V., Gimenez, P. F., Han, Y., & Tong, V. V. T. (2022). Towards a Representation of Malware Execution Traces for Experts and Machine Learning. RESSI 2022-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information. https://hal.archives-ouvertes.fr/hal-03675366/document
Published in IEEE Transactions on Information Forensics and Security, 2022
Android security has received a lot of attention over the last decade, especially malware investigation. Researchers attempt to highlight applications’ security-relevant characteristics to better understand malware and effectively distinguish malware from benign applications. The accuracy and the completeness of their proposals are evaluated experimentally on malware and goodware datasets. Thus, the quality of these datasets is of critical importance: if the datasets are outdated or not representative of the studied population, the conclusions may be flawed. We specify different types of experimental scenarios. Some of them require unlabeled but representative datasets of the entire population. Others require datasets labeled with valuable characteristics that may be difficult to compute, such as malware datasets. We discuss the irregularities of datasets used in experiments, questioning the validity of the performances reported in the literature. This article focuses on providing guidelines for designing debiased datasets. First, we propose guidelines for building representative datasets from unlabeled ones. Second, we propose and experiment a debiasing algorithm that, given a biased labeled dataset and a target representative dataset, builds a representative and labeled dataset. Finally, from the previous debiased datasets, we produce datasets for experiments on Android malware detection or classification with machine learning algorithms. Experiments show that debiased datasets perfom better when classifying with machine learning algorithms.
Recommended citation: Miranda, T. C., Gimenez, P. F., Lalande, J. F., Tong, V. V. T., & Wilke, P. (2022). Debiasing Android Malware Datasets: How can I trust your results if your dataset is biased?. IEEE Transactions on Information Forensics and Security. https://ieeexplore.ieee.org/abstract/document/9787514
Published in M-PREF 2022 13th Multidisciplinary Workshop on Advances in Preference Handling, 2022
This paper considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees.
Recommended citation: Fargier, H., Gimenez, P. F., Mengin, J., & Le Nguyen, B. N. (2022, July). The complexity of unsupervised learning of lexicographic preferences. In 13th Multidisciplinary Workshop on Advances in Preference Handling (M-pref 2022)@ IJCAI 2022 https://hal.inria.fr/hal-03784693/document
Published in Annals of Telecommunications, 2022
Artificial intelligence (AI) and machine learning (ML) are increasingly becoming essential in the development of cybersecurity solutions, with deep learning (DL) algorithms being extensively applied in recent years, e.g., for detecting Android malware or vulnerable source code. However, sharing the same fundamental limitation with other DL application domains, such as computer vision and natural language processing, AI-based cybersecurity solutions lack the capability of justifying the results (ranging from detection and prediction to reasoning and decision-making) and making them human-understandable. As a result, explainable AI (XAI) has emerged as a paramount topic addressing the related challenges of making AI models explainable or interpretable to human users. It is particularly relevant in the cybersecurity domain, in that XAI may allow security operators, who are overwhelmed with tens of thousands of security alerts per day (most of which are false positives) per day, to better assess the potential threats and reduce alert fatigue. With such a strong motivation, we conduct an extensive literature review on the intersection between XAI and cybersecurity. Particularly, we investigate the academic literature from two perspectives: the applications of XAI to cybersecurity (e.g., intrusion detection, malware classification, etc.) and the application of cybersecurity to XAI (e.g., attacks on XAI pipelines, potential countermeasures, etc.). We characterize the security of XAI with several security properties that have been discussed in the literature. We also formulate open questions that are either left out or not properly addressed in the literature and provide tentative answers.
Recommended citation: Fabien Charmet, Harry Chandra Tanuwidjaja, Solayman Ayoubi, Pierre-François Gimenez, Yufei Han, Houda Jmila, Gregory Blanc, Takeshi Takahashi & Zonghua Zhang. Explainable artificial intelligence for cybersecurity: a literature survey. Ann. Telecommun. (2022). https://doi.org/10.1007/s12243-022-00926-7
Published in CRiSIS 2022 17th International Conference on Risks and Security of Internet and Systems, 2022
Among the difficulties encountered in building datasets to evaluate intrusion detection tools, a tricky part is the process of labelling the events into malicious and benign classes. The labelling correctness is paramount for the quality of the evaluation of intrusion detection systems but is often considered as the ground truth by practitioners and is rarely verified. Another difficulty lies in the correct capture of the network packets. If it is not the case, the characteristics of the network flows generated from the capture could be modified and lead to false results. In this paper, we present several flaws we identified in the labelling of the CICIDS2017 dataset and in the traffic capture, such as packet misorder, packet duplication and attack that were performed but not correctly labelled. Finally, we assess the impact of these different corrections on the evaluation of supervised intrusion detection approaches.
Recommended citation: Lanvin, M., Gimenez, P. F., Han, Y., Majorczyk, F., Mé, L., & Totel, E. (2022, December). Errors in the CICIDS2017 dataset and the significant differences in detection performances it makes. In CRiSIS 2022-International Conference on Risks and Security of Internet and Systems. https://link.springer.com/chapter/10.1007/978-3-031-31108-6_2
Published in 20th International conference on security and cryptography SECRYPT 2023, 2023
Malware analysis consists of studying a sample of suspicious code to understand it and producing a representation or explanation of this code that can be used by a human expert or a clustering/classification/detection tool. The analysis can be static (only the code is studied) or dynamic (only the interaction between the code and its host during one or more executions is studied). The quality of the interpretation of a code and its later detection depends on the quality of the information contained in this representation. To date, many analyses produce voluminous reports that are difficult to handle quickly. In this article, we present BAGUETTE, a graph-based representation of the interactions of a sample and the resources offered by the host system during one execution. We explain how BAGUETTE helps automatically search for specific behaviors in a malware database and how it efficiently assists the expert in analyzing samples.
Recommended citation: Vincent Raulin, Pierre-François Gimenez, Yufei Han, Valérie Viet Triem Tong. BAGUETTE: Hunting for Evidence of Malicious Behavior in Dynamic Analysis Reports. 20th International conference on security and cryptography SECRYPT 2023, Jul 2023, Rome, Italy. https://hal.science/hal-04102144/
Published in 26th European Conference on Artificial Intelligence (ECAI 2023), 2023
This paper focuses on graphical models for modelling preferences in combinatorial space and their use for item optimisation. The preferential optimisation task seeks to find the preferred item containing some defined values, which is useful for many recommendation settings in e-commerce. We show that efficient (i.e., with polynomial time complexity) preferential optimisation is achieved with a subset of cyclic CP-nets called conditional acyclic CP-net. We also introduce a new graphical preference model, called Conditional-Optimality networks (CO-networks), that are more concise than conditional acyclic CP-nets and LP-trees but have the same expressiveness with respect to optimisation. Finally, we empirically show that preferential optimisation can be used for encoding alternatives into partial instantiations and vice versa, paving the way towards CO-nets and CP-nets unsupervised learning with the minimal description length (MDL) principle.
Recommended citation: Gimenez, Pierre-François & Mengin, Jérôme. (2023). Conditionally Acyclic CO-Networks for Efficient Preferential Optimization. 10.3233/FAIA230352. https://www.researchgate.net/publication/374310696
Published in 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2023), 2023
The use of Machine Learning for anomaly detection in cyber security-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on semi-supervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our work identifies the most informative network traffic features associated with an anomaly alert, providing interpretations for the generated alerts. We conduct an empirical study using a large-scale network intrusion dataset, CICIDS2017, to compare the proposed AE-pvalues method with two state-of-the-art baselines applied in the semi-supervised anomaly detection task. Our experimental results show that the AE-pvalues method accurately identifies abnormal influential network traffic features. Furthermore, our study demonstrates that the explanation outputs can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures.
Recommended citation: Maxime Lanvin, Pierre-François Gimenez, Yufei Han, Frédéric Majorczyk, Ludovic Mé, et al.. Towards Understanding Alerts raised by Unsupervised Network Intrusion Detection Systems. The 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2023), Oct 2023, Hong Kong, France. ⟨10.1145/3607199.3607247⟩ https://dl.acm.org/doi/abs/10.1145/3607199.3607247
Published in RESSI 2024-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2024
Intrusion Detection Systems (IDS) are tools for monitoring a system, in order to identify potential malicious activities within it. This survey presents an analysis of graph-based anomaly-based IDS at system level. We also present open issues regarding those IDS, and propose a taxonomy of suitable features to compare them.
Recommended citation: Dijoud, F., Gimenez, P. F., Hurfin, M., Majorczyk, F., & Pilastre, B. (2024, May). Survey on system-level graph-based and anomaly-based intrusion detection. In RESSI 2024-Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information (pp. 1-2). https://inria.hal.science/hal-04714325/document
Published in 9th International Workshop on Traffic Measurements for Cybersecurity (WTMC 2024), 2024
The evaluation of network intrusion detection systems requires a sufficient amount of mixed network traffic, i.e., composed of both malicious and legitimate flows. In particular, obtaining realistic legitimate traffic is hard. Synthetic network traffic is one of the tools to respond to insufficient or incomplete real-world datasets. In this paper, we only focus on synthetically generating high-quality legitimate traffic and we do not delve into malicious traffic generation. For this specific task, recent contributions make use of advanced machine learning-driven approaches, notably through Generative Adversarial Networks (GANs). However, evaluations of GAN-generated data often disregards pivotal attributes, such as protocol adherence. Our study addresses the gap by proposing a comprehensive set of metrics that assess the quality of synthetic legitimate network traffic. To illustrate the value of these metrics, we empirically compare advanced network-oriented GANs with a simple and yet effective probabilistic generative model, Bayesian Networks (BN). According to our proposed evaluation metrics, BN-based network traffic generation outperforms the state-of-the-art GAN-based opponents. In our study, BN yields substantially more realistic and useful synthetic benign traffic and minimizes the computational costs simultaneously.
Recommended citation: Schoen, A., Blanc, G., Gimenez, P. F., Han, Y., Majorczyk, F., & Mé, L. (2024). A Tale of Two Methods: Unveiling the limitations of GAN and the Rise of Bayesian Networks for Synthetic Network Traffic Generation. In Proceedings of the 9th International Workshop on Traffic Measurements for Cybersecurity (WTMC 2024).
Published in IJCAI International Joint Conference on Artificial Intelligence, 2024
CP-nets are a very expressive graphical model for representing preferences over combinatorial spaces. They are particularly well suited for settings where an important task is to compute the optimal completion of some partially specified alternative; this is, for instance, the case of interactive configurators, where preferences can be used at every step of the interaction to guide the decision maker towards a satisfactory configuration. Learning CP-nets is challenging when the input data has the form of pairwise comparisons between alternatives. Furthermore, this type of preference data is not commonly stored: it can be elicited but this puts an additional burden on the decision maker. In this article, we propose a new method for learning CP-nets from sales history, a kind of data readily available in many e-commerce applications. The approach is based on the minimum description length (MDL) principle. We show some theoretical properties of this learning task, namely its sample complexity and its NP-completeness, and we experiment with this learning algorithm in a recommendation setting with real sales history from a car maker.
Recommended citation: Gimenez, P. F., & Mengin, J. (2024). Learning Conditional Preference Networks: an Approach Based on the Minimum Description Length Principle. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
Published in 20th International Conference on emerging Networking EXperiments and Technologies (CoNEXT), 2024
Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. Most methods generating synthetic network flows disregard the temporal dependencies between them, leading to unrealistic traffic. To address this issue, we introduce FlowChronicle, a novel synthetic network flow generation tool that relies on pattern mining and statistical models to preserve temporal dependencies. We empirically compare our method against state-of-the-art techniques on several criteria, namely realism, diversity, compliance, and novelty. This evaluation demonstrates the capability of FlowChronicle to achieve high-quality generation while significantly outperforming the other methods in preserving temporal dependencies between flows. Besides, in contrast to deep learning methods, the patterns identified by FlowChronicle are explainable, and experts can verify their soundness. Our work substantially advances synthetic network traffic generation, offering a method that enhances both the utility and trustworthiness of the generated network flows.
Recommended citation: Cüppers, J., Schoen, A., Blanc, G. & Gimenez, P. F., (2024, December). FlowChronicle: Synthetic Network Flow Generation through Pattern Set Mining Generation. In the 20th International Conference on emerging Networking EXperiments and Technologies (CoNEXT). https://dl.acm.org/doi/10.1145/3696407
Published:
Authors: Pierre-François Gimenez
License: MIT
This python module is an free and open source implementation of the empirical Christoffel function applied to outlier detection as proposed by Lasserre and Pauwels in The empirical Christoffel function with applications in data analysis. Pypi page. Source code.
Published:
Authors: Pierre-François Gimenez
License: MIT
Implementation of the CP-net learner described in Learning Conditional Preference Networks: an Approach Based on the Minimum Description Length Principle. Pypi page. Source code.
Published:
Authors: Joscha Cüppers, Pierre-François Gimenez and Adrien Schoen
License: MIT
FlowChronicle is a network flow records data miner and generator, made in the context of the SecGen associated team between Inria and CISPA. It an implementation of the system described in FlowChronicle: Synthetic Network Flow Generation through Pattern Set Mining. Pypi page. Source code.
Published:
We consider the task of learning the preferences of users on a combinatorial set of alternatives, as it can be the case for example with online configurators. In many settings, what is available to the learner is a set of positive examples of alternatives that have been selected during past interactions. We propose to learn a model of the users’ preferences that ranks previously chosen alternatives as high as possible. Here, we study the particular task of learning conditional lexicographic preferences. We present an algorithm to learn several classes of lexicographic preference trees, prove convergence properties of the algorithm, and experiment on both synthetic data and on a real-world bench in the domain of recommendation in interactive configuration. Slides
Published:
We present our work on the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. This presentation details how we handle constraints about the configuration and highlights some experimental results. Slides
Published:
L’expansion de l’Internet des objets (IoT) entraîne l’apparition de maisons intelligentes, d’usines intelligentes et même de villes intelligentes. Bien que ces objets améliorent la qualité de vie de ses utilisateurs et offrent de nouvelles opportunités économiques, ils sont aussi un important vecteur d’attaques (le botnet Mirai étant sûrement l’exemple le plus connu). Pour protéger ces environnements, des systèmes de détection d’intrusion (IDS) sont développés. Ces IDS rencontrent des problématiques uniques à l’IoT, telles que l’évolution rapide des technologies et des protocoles ou encore leur réseau décentralisé. Pour surmonter ces problèmes, nous proposons un IDS qui surveille de larges bandes de fréquences au niveau de la couche physique sans faire d’hypothèses sur les protocoles ou les technologies présentes. De plus, notre solution propose pour chaque attaque détectée un diagnostic triple : temporel (les dates exactes de l’anomalie détectée), fréquentiel (la fréquence principale de l’anomalie) et spatial (la position estimée de l’origine de l’anomalie). Nous avons expérimenté notre méthode avec une expérimentation grandeur nature: notre système a pu efficacement détecter et diagnostiquer les attaques lancées sur les bandes 400-500 MHz et 800-900 MHz, deux bandes qui ne sont pas couvertes par les solutions traditionnelles. Slides
Published:
Many systems work by receiving instructions and processing them: e.g., a browser receives and then displays an HTML page and executes Javascript scripts, a database receives a query and then applies it to its data, an embedded system controlled through a protocol receives and then processes a message. When such instructions depend on user input, one generally constructs them with concatenation or insertion. It can lead to injection-based attacks: when the user input modifies the query’s intended semantics and leads to a security breach. Protections do exist but are not sufficient as they never tackle the origin of the problem: the language itself. We propose a new formal approach based on formal languages to assess risk, enhance static analysis, and enable new tools. This approach is general and can be applied to query, programming, and domain-specific languages as well as network protocols. We are setting up an ANR project to go into this subject in more depth. The presentation, in French, is available on Youtube. Slides
Published:
Machine learning is applied successfully in various domains, including cybersecurity, where it has been used for intrusion detection, malware analysis, and attack comprehension, for example. Therefore, many cybersecurity researchers seek to catch up and introduce such techniques in their research. This presentation aims to provide the basics of machine learning for cybersecurity researchers. The presentation is available on Youtube. Slides
Published:
Avec la numérisation de nos vies, les systèmes informatiques ont pris une place prépondérante dans notre société et sont naturellement devenus la cible d’attaquants, du “script kiddie” au groupe organisé à l’objectif politique. Depuis quelques années, l’intelligence artificielle s’est invitée à la fête : qu’elle permette de détecter automatiquement des attaques ou qu’elle soit subrepticement manipulée dans les voitures autonomes, elle amène son lot de problèmes et de solutions. Ce séminaire a pour objectif de présenter ces deux domaines, leurs enjeux et leurs interactions, et de mettre en avant les pistes de recherche que la communauté scientifique privilégie pour lutter contre ces menaces. Slides
Published:
Cette présentation rappelle les concepts fondamentaux en sécurité et présente les applications des techniques de machine learning aux multiples problématiques liées à la sécurité. Le dernier axe abordé est celui de la sécurité des méthodes de machine learning, qui sont directement attaquées par de nombreuses techniques aux objectifs et aux moyens multiples. Séminaire présenté avec Ludovic Mé. Slides
Published:
Avec l’utilisation toujours croissantes des techniques d’IA, le besoin de vérification des prédictions se fait de plus en plus pressant. C’est l’objectif des méthodes d’explicabilité : permettre à un utilisateur de savoir pourquoi une décision a été prise par le système. Cette introduction fait un point sur les nombreuses familles de techniques disponibles. Slides
Published:
We present our work on the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. This presentation details how we handle constraints about the configuration and highlights some experimental results. Slides
Published:
This presentation is an overview on four PhDs recently started in the CIDRE team, on intrusion detection, data generation, malware analysis and botnet detection. Slides
Published:
This work considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees. Slides
Published:
In this talk, I present the sec2graph approach, its performances, and its explanation mechanism. This mechanism helped us identify several flaws we identified in the labelling of the CICIDS2017 dataset and in the traffic capture, such as packet misorder, packet duplication and attack that were performed but not correctly labelled. Slides
Published:
Une vue synthétique des différentes représentations de langages réguliers (expression régulière, grammaire régulière, automate à états finis) et une potentielle application à la génération de données réseaux. Slides
Published:
This talk presents recent work on anomaly detection in network data and anomaly explanation. Our approach represents the network data with a security objects graph analyzed by an autoencoder. We introduce a new statistical explanation technique for reconstruction-based methods and compare it with SHAP. Finally, we use these explanations to analyze the dataset CICIDS2017 and check whether they match the expert’s expectations. Slides
Published:
This work considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees. Slides
Published:
Many systems work by receiving instructions and processing them: e.g., a browser receives and then displays an HTML page and executes Javascript scripts, a database receives a query and then applies it to its data, an embedded system controlled through a protocol receives and then processes a message. When such instructions depend on user input, one generally constructs them with concatenation or insertion. It can lead to injection-based attacks: when the user input modifies the query’s intended semantics and leads to a security breach. Protections do exist but are not sufficient as they never tackle the origin of the problem: the language itself. We propose a new formal approach based on formal languages to assess risk, enhance static analysis, and enable new tools. This approach is general and can be applied to query, programming, and domain-specific languages as well as network protocols. Slides
Published:
Malware analysis consists in analyzing suspicious software to detect malicious payload. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. This article proposes a taxonomy of the transformations an attacker can use depending on the threat models that modelize their capability. We show the effectiveness of this taxonomy by proposing a new set of features and model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. We compare and validate these approaches with various machine-learning-based malware detection methods, allowing for robust detection with minimal detection performance reduction.
Published:
In this round-table discussion, I present the current stakes of network supervision, how machine learning can answer them, and what challenges it brings.
Published:
Intrusion detection is an essential mechanism in information systems security. Machine learning has been successfully applied to this problem. These techniques rely on training data used to train a detection model. This training data generally comes from datasets that are often more or less automatically generated. Worse, the number of datasets remains small enough that the diversity of the dataset is questionable, and its aging is problematic. A solution to these problems is synthetic data generation: it would be free of experimental inaccuracies, could be easily updated, and alleviate the class imbalance by generating more data on rare classes. We plan to generate benign data only, as attacks are easier to generate with dedicated tools. This talk highlights the characteristics of network traffic generation from a non-technical point of view, adapted to data mining practitioners, as well as the issues and opportunities. Slides
Published:
This paper focuses on graphical models for modelling preferences in combinatorial space and their use for item optimisation. The preferential optimisation task seeks to find the preferred item containing some defined values, which is useful for many recommendation settings in e-commerce. We show that efficient (i.e., with polynomial time complexity) preferential optimisation is achieved with a subset of cyclic CP-nets called conditional acyclic CP-net. We also introduce a new graphical preference model, called Conditional-Optimality networks (CO-networks), that are more concise than conditional acyclic CP-nets and LP-trees but have the same expressiveness with respect to optimisation. Finally, we empirically show that preferential optimisation can be used for encoding alternatives into partial instantiations and vice versa, paving the way towards CO-nets and CP-nets unsupervised learning with the minimal description length (MDL) principle. Poster
Published:
The use of Machine Learning for anomaly detection in cyber security-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on unsupervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our work identifies the most informative network traffic features associated with an anomaly alert, providing interpretations for the generated alerts. We conduct an empirical study using a large-scale network intrusion dataset, CICIDS2017, to compare the proposed AE-pvalues method with two state-of-the-art baselines applied in the unsupervised anomaly detection task. Our experimental results show that the AE-pvalues method accurately identifies abnormal influential network traffic features. Furthermore, our study demonstrates that the explanation outputs can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures. Slides
Published:
An overview of scientific questions about network traffic generation. Slides
Published:
Un exposé de vulgarisation d’une heure sur le thème de l’intelligence artificielle à l’occasion des 30 ans du GIP RENATER. Dans cette présentation, je reviens sur l’histoire de l’intelligence artificielle et je démystifie un peu son fonctionnement. Je présente ensuite les principaux domaines révolutionnés par l’intelligence artificielle et les nombreux risques qu’elle soulève. Slides
Published:
This round-table discussion is dedicated to the effect of AI on cybersecurity: its limitations, its opportunities and its risks. Video
Published:
In this presentation, I present three potential ideas to investigate within the SecGen project: 1) generation for concept drift evaluation, 2) causal learning and 3) system logs generation. Slides
Published:
Dans cette présentation, je vais faire un tour d’horizon des modélisations de données de sécurité et montrer l’importance qu’elles ont pour une utilisation avec des techniques d’apprentissage automatique. Slides
Published:
Malware analysis consists of studying a sample of suspicious code to understand it and producing a representation or explanation of this code that can be used by a human expert or a clustering/classification/detection tool. The analysis can be static (only the code is studied) or dynamic (only the interaction between the code and its host during one or more executions is studied). The quality of the interpretation of a code and its later detection depends on the quality of the information contained in this representation. To date, many analyses produce voluminous reports that are difficult to handle quickly. In this article, we present BAGUETTE, a graph-based representation of the interactions of a sample and the resources offered by the host system during one execution. We explain how BAGUETTE helps automatically search for specific behaviors in a malware database and how it efficiently assists the expert in analyzing samples. We also develop a possible use case of BAGUETTE being currently researched: explainable unsupervised malware behavior clustering. Slides - Video
Published:
This work present the cooperation with CISPA on robust malware detectors. Malware analysis involves analyzing suspicious software to detect malicious payload. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. We propose a taxonomy of the transformations an attacker can use depending on the threat models that modelize their capability. We show the effectiveness of this taxonomy by proposing a new set of features and model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. Our framework ERDALT is based on this structure.
Published:
This event has been cancelled.
Published:
CP-nets are a very expressive graphical model for representing preferences over combinatorial spaces. They are particularly well suited for settings where an important task is to compute the optimal completion of some partially specified alternative; this is, for instance, the case of interactive configurators, where preferences can be used at every step of the interaction to guide the decision maker towards a satisfactory configuration. Learning CP-nets is challenging when the input data has the form of pairwise comparisons between alternatives. Furthermore, this type of preference data is not commonly stored: it can be elicited but this puts an additional burden on the decision maker. In this article, we propose a new method for learning CP-nets from sales history, a kind of data readily available in many e-commerce applications. The approach is based on the minimum description length (MDL) principle. We show some theoretical properties of this learning task, namely its sample complexity and its NP-completeness, and we experiment with this learning algorithm in a recommendation setting with real sales history from a car maker. Slides - Poster
Published:
The use of Machine Learning for anomaly detection in cyber security-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on unsupervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our study demonstrates that these explanations can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures. Slides
Published:
In the talk, I present the FosR project on security data generation, and specifically the results obtained in the SecGen associated team. Slides
Published:
Many systems are controlled via commands built upon user inputs. For systems that deal with structured commands, such as SQL queries, XML documents, or network messages, such commands are generally constructed in a “fill-in-the-blank” fashion: the user input is concatenated with a fixed part written by the developer (the template). However, the user input can be crafted to modify the command’s semantics intended by the developer and lead to the system’s malicious usages. Such an attack, called an injection-based attack, is considered one of the most severe threat to web applications. Solutions to prevent such vulnerabilities exist but are generally ad hoc and rely on the developer’s expertise and diligence. Our approach addresses these vulnerabilities from the formal language theory’s point of view. We formally define two new security properties. The first one, “intent-equivalence”, guarantees that a developer’s template cannot lead to malicious injections. The second one, “intent-security”, guarantees that every possible template is intent-equivalent, and therefore that the programming language itself is secure. From these definitions, we propose new techniques to create programming language that are secure by design, and present two secure, simplified versions of widespread languages.
Published:
During this workshop with AISI, I present my recent work on network data generation using AI and I present new research questions on the role of LLMs in this kind of application. Slides
Published:
In this keynote, I present the work that lead us to work on synthetic data generation, and how we generate security data. This works aims at creating realistic and diverse network and system datasets to better evaluate security solutions, most notably supervision solutions. I will notably present two works on network flow records generation using pattern mining and on packet header generation using probabilistic timed automata generation. Slides
Published:
Table ronde: “Vision d’avenir pour la cybersécurité en Bretagne et en Europe : quels challenges et quelle place pour l’IA ?”
M1, UPSSITECH engineering school, 2015
This course aims to train the students in machine learning techniques:
L3, University of Toulouse, 2015
This course aims to train the students in general artificial intelligence techniques:
L3, INSA-Toulouse, 2018
This course presents some graph theory and their algorithms:
M2, CentraleSupélec, 2020
I was the coordinator of this module in 2023-2024.
M1, CentraleSupélec, 2020
I am in charge of this module since 2020.
M2, CentraleSupélec, 2022
I was the coordinator of this module in 2022-2024. Besides, I teach the Flash backend framework.
M2, CentraleSupélec, 2024
I am the coordinator of this module since 2024.