TBA
Talk, Séminaire PIRAT, CentraleSupélec
TBA
Talk, Séminaire PIRAT, CentraleSupélec
TBA
Talk, Inria Evaluation Seminar, Rungis, France
Talk, Séminaire Sci-Rennes, Centre Inria de l’Université de Rennes
The use of Machine Learning for anomaly detection in cyber security-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on unsupervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our study demonstrates that these explanations can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures.”
Talk, IJCAI 2024, the 33rd International Joint Conference on Artificial Intelligence, Jeju, Korea
CP-nets are a very expressive graphical model for representing preferences over combinatorial spaces. They are particularly well suited for settings where an important task is to compute the optimal completion of some partially specified alternative; this is, for instance, the case of interactive configurators, where preferences can be used at every step of the interaction to guide the decision maker towards a satisfactory configuration. Learning CP-nets is challenging when the input data has the form of pairwise comparisons between alternatives. Furthermore, this type of preference data is not commonly stored: it can be elicited but this puts an additional burden on the decision maker. In this article, we propose a new method for learning CP-nets from sales history, a kind of data readily available in many e-commerce applications. The approach is based on the minimum description length (MDL) principle. We show some theoretical properties of this learning task, namely its sample complexity and its NP-completeness, and we experiment with this learning algorithm in a recommendation setting with real sales history from a car maker. Slides - Poster
Talk, Séminaire des experts du Groupe Orange, Couvent des Jacobins, Rennes
This event has been cancelled.
Talk, DefMal annual workshop, Saint-Malo, France
This work present the cooperation with CISPA on robust malware detectors. Malware analysis involves analyzing suspicious software to detect malicious payload. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. We propose a taxonomy of the transformations an attacker can use depending on the threat models that modelize their capability. We show the effectiveness of this taxonomy by proposing a new set of features and model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. Our framework ERDALT is based on this structure.
Talk, Toulouse Hacking Conference (THCon), Toulouse, France
Malware analysis consists of studying a sample of suspicious code to understand it and producing a representation or explanation of this code that can be used by a human expert or a clustering/classification/detection tool. The analysis can be static (only the code is studied) or dynamic (only the interaction between the code and its host during one or more executions is studied). The quality of the interpretation of a code and its later detection depends on the quality of the information contained in this representation. To date, many analyses produce voluminous reports that are difficult to handle quickly. In this article, we present BAGUETTE, a graph-based representation of the interactions of a sample and the resources offered by the host system during one execution. We explain how BAGUETTE helps automatically search for specific behaviors in a malware database and how it efficiently assists the expert in analyzing samples. We also develop a possible use case of BAGUETTE being currently researched: explainable unsupervised malware behavior clustering. Slides - Video
Talk, Académie des Technologies, En visioconférence
Dans cette présentation, je vais faire un tour d’horizon des modélisations de données de sécurité et montrer l’importance qu’elles ont pour une utilisation avec des techniques d’apprentissage automatique. Slides
Talk, SecGen plenary meeting, CISPA, Saarbrücken, Germany
In this presentation, I present three potential ideas to investigate within the SecGen project: 1) generation for concept drift evaluation, 2) causal learning and 3) system logs generation. Slides
Talk, Séminaire IA du Commandement de la cyberdéfense, Maison des Polytechniciens, Paris 7e, France
This round-table discussion is dedicated to the effect of AI on cybersecurity: its limitations, its opportunities and its risks. Video
Talk, Séminaire GIP RENATER, Roscoff, France
Un exposé de vulgarisation d’une heure sur le thème de l’intelligence artificielle à l’occasion des 30 ans du GIP RENATER. Dans cette présentation, je reviens sur l’histoire de l’intelligence artificielle et je démystifie un peu son fonctionnement. Je présente ensuite les principaux domaines révolutionnés par l’intelligence artificielle et les nombreux risques qu’elle soulève. Slides
Talk, CISPA-Inria Workshop, Campus Cyber, Puteaux, France
An overview of scientific questions about network traffic generation. Slides
Talk, FADEx Seminar, IRISA, Rennes, France
The use of Machine Learning for anomaly detection in cyber security-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on unsupervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our work identifies the most informative network traffic features associated with an anomaly alert, providing interpretations for the generated alerts. We conduct an empirical study using a large-scale network intrusion dataset, CICIDS2017, to compare the proposed AE-pvalues method with two state-of-the-art baselines applied in the unsupervised anomaly detection task. Our experimental results show that the AE-pvalues method accurately identifies abnormal influential network traffic features. Furthermore, our study demonstrates that the explanation outputs can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures. Slides
Talk, 26th European Conference on Artificial Intelligence ECAI 2023, Kraków, Poland
This paper focuses on graphical models for modelling preferences in combinatorial space and their use for item optimisation. The preferential optimisation task seeks to find the preferred item containing some defined values, which is useful for many recommendation settings in e-commerce. We show that efficient (i.e., with polynomial time complexity) preferential optimisation is achieved with a subset of cyclic CP-nets called conditional acyclic CP-net. We also introduce a new graphical preference model, called Conditional-Optimality networks (CO-networks), that are more concise than conditional acyclic CP-nets and LP-trees but have the same expressiveness with respect to optimisation. Finally, we empirically show that preferential optimisation can be used for encoding alternatives into partial instantiations and vice versa, paving the way towards CO-nets and CP-nets unsupervised learning with the minimal description length (MDL) principle. Poster
Talk, SecGen kick-off meeting, Campus Cyber, Puteaux, France
Intrusion detection is an essential mechanism in information systems security. Machine learning has been successfully applied to this problem. These techniques rely on training data used to train a detection model. This training data generally comes from datasets that are often more or less automatically generated. Worse, the number of datasets remains small enough that the diversity of the dataset is questionable, and its aging is problematic. A solution to these problems is synthetic data generation: it would be free of experimental inaccuracies, could be easily updated, and alleviate the class imbalance by generating more data on rare classes. We plan to generate benign data only, as attacks are easier to generate with dedicated tools. This talk highlights the characteristics of network traffic generation from a non-technical point of view, adapted to data mining practitioners, as well as the issues and opportunities. Slides
Talk, La Cyber au rendez-vous de l’IA de confiance, Campus Cyber, Puteaux, France
In this round-table discussion, I present the current stakes of network supervision, how machine learning can answer them, and what challenges it brings.
Talk, CIDRE seminar, Ploërmel, France
Malware analysis consists in analyzing suspicious software to detect malicious payload. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. This article proposes a taxonomy of the transformations an attacker can use depending on the threat models that modelize their capability. We show the effectiveness of this taxonomy by proposing a new set of features and model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. We compare and validate these approaches with various machine-learning-based malware detection methods, allowing for robust detection with minimal detection performance reduction.
Talk, 2023 Annual Meeting of the WG "Formal Methods for Security", Roscoff, France
Many systems work by receiving instructions and processing them: e.g., a browser receives and then displays an HTML page and executes Javascript scripts, a database receives a query and then applies it to its data, an embedded system controlled through a protocol receives and then processes a message. When such instructions depend on user input, one generally constructs them with concatenation or insertion. It can lead to injection-based attacks: when the user input modifies the query’s intended semantics and leads to a security breach. Protections do exist but are not sufficient as they never tackle the origin of the problem: the language itself. We propose a new formal approach based on formal languages to assess risk, enhance static analysis, and enable new tools. This approach is general and can be applied to query, programming, and domain-specific languages as well as network protocols. Slides
Talk, Séminaire ANITI, IRIT, Toulouse, France
This work considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees. Slides
Talk, NICT seminar, Campus Cyber, Puteaux, France
This talk presents recent work on anomaly detection in network data and anomaly explanation. Our approach represents the network data with a security objects graph analyzed by an autoencoder. We introduce a new statistical explanation technique for reconstruction-based methods and compare it with SHAP. Finally, we use these explanations to analyze the dataset CICIDS2017 and check whether they match the expert’s expectations. Slides
Talk, "Papers, please" seminar, Rennes, France
Une vue synthétique des différentes représentations de langages réguliers (expression régulière, grammaire régulière, automate à états finis) et une potentielle application à la génération de données réseaux. Slides
Talk, Supsec 3rd workshop: AI for supervision, Rennes, France
In this talk, I present the sec2graph approach, its performances, and its explanation mechanism. This mechanism helped us identify several flaws we identified in the labelling of the CICIDS2017 dataset and in the traffic capture, such as packet misorder, packet duplication and attack that were performed but not correctly labelled. Slides
Talk, M-PREF Workshop of IJCAI, Vienne, Autriche
This work considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preferences trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees. Slides
Talk, GT stats seminar, Rennes, France
This presentation is an overview on four PhDs recently started in the CIDRE team, on intrusion detection, data generation, malware analysis and botnet detection. Slides
Talk, Séminaire ANITI, IRIT, Toulouse, France
We present our work on the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. This presentation details how we handle constraints about the configuration and highlights some experimental results. Slides
Talk, "Papers, please" seminar, Rennes, France
Avec l’utilisation toujours croissantes des techniques d’IA, le besoin de vérification des prédictions se fait de plus en plus pressant. C’est l’objectif des méthodes d’explicabilité : permettre à un utilisateur de savoir pourquoi une décision a été prise par le système. Cette introduction fait un point sur les nombreuses familles de techniques disponibles. Slides
Talk, MeetUp LumenAI, Rennes, France
Cette présentation rappelle les concepts fondamentaux en sécurité et présente les applications des techniques de machine learning aux multiples problématiques liées à la sécurité. Le dernier axe abordé est celui de la sécurité des méthodes de machine learning, qui sont directement attaquées par de nombreuses techniques aux objectifs et aux moyens multiples. Séminaire présenté avec Ludovic Mé. Slides
Talk, Séminaires du département informatique, Rennes, France
Avec la numérisation de nos vies, les systèmes informatiques ont pris une place prépondérante dans notre société et sont naturellement devenus la cible d’attaquants, du “script kiddie” au groupe organisé à l’objectif politique. Depuis quelques années, l’intelligence artificielle s’est invitée à la fête : qu’elle permette de détecter automatiquement des attaques ou qu’elle soit subrepticement manipulée dans les voitures autonomes, elle amène son lot de problèmes et de solutions. Ce séminaire a pour objectif de présenter ces deux domaines, leurs enjeux et leurs interactions, et de mettre en avant les pistes de recherche que la communauté scientifique privilégie pour lutter contre ces menaces. Slides
Talk, Hands-on Machine Learning for Security seminar, Rennes, France
Machine learning is applied successfully in various domains, including cybersecurity, where it has been used for intrusion detection, malware analysis, and attack comprehension, for example. Therefore, many cybersecurity researchers seek to catch up and introduce such techniques in their research. This presentation aims to provide the basics of machine learning for cybersecurity researchers. The presentation is available on Youtube. Slides
Talk, SoSySec seminars at IRISA, Rennes, France
Many systems work by receiving instructions and processing them: e.g., a browser receives and then displays an HTML page and executes Javascript scripts, a database receives a query and then applies it to its data, an embedded system controlled through a protocol receives and then processes a message. When such instructions depend on user input, one generally constructs them with concatenation or insertion. It can lead to injection-based attacks: when the user input modifies the query’s intended semantics and leads to a security breach. Protections do exist but are not sufficient as they never tackle the origin of the problem: the language itself. We propose a new formal approach based on formal languages to assess risk, enhance static analysis, and enable new tools. This approach is general and can be applied to query, programming, and domain-specific languages as well as network protocols. We are setting up an ANR project to go into this subject in more depth. The presentation, in French, is available on Youtube. Slides
Talk, Nouvelles Avancées en Sécurité des Systèmes d'Information, INSA-Toulouse, Toulouse, France
L’expansion de l’Internet des objets (IoT) entraîne l’apparition de maisons intelligentes, d’usines intelligentes et même de villes intelligentes. Bien que ces objets améliorent la qualité de vie de ses utilisateurs et offrent de nouvelles opportunités économiques, ils sont aussi un important vecteur d’attaques (le botnet Mirai étant sûrement l’exemple le plus connu). Pour protéger ces environnements, des systèmes de détection d’intrusion (IDS) sont développés. Ces IDS rencontrent des problématiques uniques à l’IoT, telles que l’évolution rapide des technologies et des protocoles ou encore leur réseau décentralisé. Pour surmonter ces problèmes, nous proposons un IDS qui surveille de larges bandes de fréquences au niveau de la couche physique sans faire d’hypothèses sur les protocoles ou les technologies présentes. De plus, notre solution propose pour chaque attaque détectée un diagnostic triple : temporel (les dates exactes de l’anomalie détectée), fréquentiel (la fréquence principale de l’anomalie) et spatial (la position estimée de l’origine de l’anomalie). Nous avons expérimenté notre méthode avec une expérimentation grandeur nature: notre système a pu efficacement détecter et diagnostiquer les attaques lancées sur les bandes 400-500 MHz et 800-900 MHz, deux bandes qui ne sont pas couvertes par les solutions traditionnelles. Slides
Talk, Panorama des recherches dans le domaine automobile, LAAS-IRIT-Laplace, Toulouse, France
We present our work on the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. This presentation details how we handle constraints about the configuration and highlights some experimental results. Slides
Talk, AAAI’18 Technical Track, New Orleans, USA
We consider the task of learning the preferences of users on a combinatorial set of alternatives, as it can be the case for example with online configurators. In many settings, what is available to the learner is a set of positive examples of alternatives that have been selected during past interactions. We propose to learn a model of the users’ preferences that ranks previously chosen alternatives as high as possible. Here, we study the particular task of learning conditional lexicographic preferences. We present an algorithm to learn several classes of lexicographic preference trees, prove convergence properties of the algorithm, and experiment on both synthetic data and on a real-world bench in the domain of recommendation in interactive configuration. Slides