Talks and presentations

Protocol Semantics vs. User Semantics: Predicting Cross-Domain Generalization in Intrusion Detection

June 26, 2026

Talk, Bi-annual PIRAT Seminar, La Baule-Escoublac, France

Machine-learning-based intrusion detection systems rarely generalize across deployment environments, often degrading when transferred to new network domains, which greatly limits their applicability to real-world settings. We hypothesize that feature category is a key driver of cross-domain transferability in intrusion detection because L7 features (grounded in standardized protocols) are domain-invariant, while L8 features (encoding the real-world action the user is performing) absorb domain-specific noise and hurt generalization. This article introduces a novel taxonomy distinguishing lexical, syntactic, and two levels of semantic features: protocol-level (L7) and user-level (L8). To study this hypothesis, we publish Superviz26-SQL, the first four-domain SQL attack detection benchmark, and we show that feature category is a key driver of domain generalization. Our results confirm that handcrafted Feature Extractors (FEs) relying on L7 features generalize well, while FEs using L8 semantics drop by up to 8 AUROC points. Strikingly, FEs using pretrained models (despite high in-domain AUROC scores) can collapse to near-random performance under domain shift, and the same robustness hierarchy holds under sudden concept drift. Finally, fine-tuning experiments show that the cross-domain gap is recoverable for most FEs with at most 10,000 target-domain samples. These findings offer guidance for practitioners: L7-semantic FEs for immediate zero-shot transfer, or pretrained encoders at the cost of collecting target-domain samples and retraining.

SKYFALL[M] challenge: lessons learned

June 26, 2026

Talk, Bi-annual PIRAT Seminar, La Baule-Escoublac, France

Lessons learned from the SKYFALL[M] challenge, which was played at BreizhCTF 2026.

CasinoLimit: An Offensive Dataset Labeled with MITRE ATT&CK Techniques

June 24, 2026

Talk, SequoIA workshop on AI and cybersecurity, Rennes, France

Cybersecurity exercises are a common way to train and evaluate the skills of cybersecurity professionals. These exercises also provide a unique opportunity to generate datasets with realistic attack traces on non-sensitive systems. Nevertheless, the collected logs are unlabeled, and deciding which logs are related to pentesters is a difficult problem. In this paper, we present a novel methodology to efficiently label both system and network logs using MITRE ATT&CK techniques. To demonstrate the effectiveness of our approach, we introduce CasinoLimit, a dataset generated from a pentest exercise played by 114 participants, during which we collected 540 GB of attack data. We apply our methodology to accurately label these logs with a semi-automatic approach: labels are inferred from the shell sessions, propagated to the network sessions, and eventually corrected by a junior analyst. An expert analyst has manually reviewed all the computed labels to ensure the quality of the labeling process. The results of the pentest exercise are discussed in depth. We show the variability of players’ behaviors and that players can be distinguished by their command-line habits. In addition, the high level of granularity of labels coupled with the number of participants enables multiple other applications. With this paper, we release the full dataset and the associated labeling tool, Manatee, which can be used to browse the logs and labels. To support the generalization of our approach, we made it possible to load other datasets with this tool. Slides

A look back at the SecGen associate team

June 22, 2026

Talk, 3rd CISPA-Inria Workshop, Saarbrücken, Germany

This talk looks back on the SecGen collaboration between Inria and CISPA. Slides

Rust for research: lessons learned

June 09, 2026

Talk, Rust Paris 2026, Paris, France

As an Inria researcher, I regularly develop proofs of concept for research projects. Python is the most widely used language in this context. For a scientific project involving systems programming and high performance, I chose Rust. In this talk, I report on the advantages and drawbacks I encountered. Overall, my experience is very positive, even though I do not expect Rust to take an important place in research. Slides

Protocol Semantics vs. User Semantics: Predicting Cross-Domain Generalization in Intrusion Detection

June 08, 2026

Talk, Superviz plenary meeting, Campus Cyber, Puteaux, France

Machine-learning-based intrusion detection systems rarely generalize across deployment environments, often degrading when transferred to new network domains, which greatly limits their applicability to real-world settings. We hypothesize that feature category is a key driver of cross-domain transferability in intrusion detection because L7 features (grounded in standardized protocols) are domain-invariant, while L8 features (encoding the real-world action the user is performing) absorb domain-specific noise and hurt generalization. This article introduces a novel taxonomy distinguishing lexical, syntactic, and two levels of semantic features: protocol-level (L7) and user-level (L8). To study this hypothesis, we publish Superviz26-SQL, the first four-domain SQL attack detection benchmark, and we show that feature category is a key driver of domain generalization. Our results confirm that handcrafted Feature Extractors (FEs) relying on L7 features generalize well, while FEs using L8 semantics drop by up to 8 AUROC points. Strikingly, FEs using pretrained models (despite high in-domain AUROC scores) can collapse to near-random performance under domain shift, and the same robustness hierarchy holds under sudden concept drift. Finally, fine-tuning experiments show that the cross-domain gap is recoverable for most FEs with at most 10,000 target-domain samples. These findings offer guidance for practitioners: L7-semantic FEs for immediate zero-shot transfer, or pretrained encoders at the cost of collecting target-domain samples and retraining.

SKYFALL[M]: feedback from our BreizhCTF 2026 challenge

June 02, 2026

Talk, DeceptIA Plenary Meeting, Rennes, France

Feedback on the SKYFALL[M] challenge we proposed at BreizhCTF 2026.

JunId: Empowering Static Analysis with Asymptotically Optimal Intersection with Incomplete Sentences

May 21, 2026

Talk, Twelfth LangSec Workshop at IEEE Security & Privacy, San Francisco, USA

Formal language theory led to groundbreaking theoretical results on the decidability of many problems, often proven using algorithms with intractable worst-case complexity. This article focuses on the efficient computation of the intersection between context-free and regular languages, which can be highly valuable for static analysis of source code. We propose two new, faster algorithms: 1) a general-purpose algorithm for fast intersection based on heuristics and 2) an asymptotically optimal algorithm named JunId (for “Junction Identification”) for intersection with incomplete sentences, which is especially important for statically detecting injection vulnerabilities. An experimental study supports our claims. The implementations of the algorithms and the proofs verified with the Coq proof assistant are available to the reader. Slides

Fixing Injection Vulnerabilities at the Root: Design Patterns for Secure Programming Languages

May 05, 2026

Talk, Toulouse Hacking Conference (THCon), Toulouse, France

Injection attacks (SQL or other) are still all too common, and for good reason: the vulnerability stems from the structure of the languages themselves. In this presentation, I will discuss the applications of theoretical work on the definition of injection vulnerabilities, and I will show that it is possible to create languages that are not vulnerable to these attacks. I will use an example to illustrate this: slight modifications to the LDAP language make it possible to obtain a more secure version. Slides - Video

AI for Cybersecurity: Three Applications for Network Security

February 12, 2026

Talk, CyberSchool École d’Hiver Recherche, Rennes, France

Researchers have experimented with many AI techniques for detecting network intrusions. This presentation details how curious results led us to a new research direction at the intersection of AI and cybersecurity. This talk draws on three recent works: 1) how to apply AI to intrusion detection, 2) a new XAI (eXplainable AI) framework dedicated to anomaly detection and 3) a new method for assessing data quality and generating synthetic network data with AI. Slides

Learning timed automata for synthetic network traffic generation

February 03, 2026

Talk, Les TransNumériques, Rennes, France

In this talk, I will present how to use automata to generate synthetic network traffic, and I’ll describe how we developed a new method, called TADAM, to effectively learn network protocol automata from noisy observations. Slides

FlowChronicle: Synthetic Network Flow Generation through Pattern Set Mining

November 20, 2025

Talk, European Symposium on Security and Artificial Intelligence, Rennes, France

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. Most methods generating synthetic network flows disregard the temporal dependencies between them, leading to unrealistic traffic. To address this issue, we introduce FlowChronicle, a novel synthetic network flow generation tool that relies on pattern mining and statistical models to preserve temporal dependencies. We empirically compare our method against state-of-the-art techniques on several criteria, namely realism, diversity, compliance, and novelty. This evaluation demonstrates the capability of FlowChronicle to achieve high-quality generation while significantly outperforming the other methods in preserving temporal dependencies between flows. Besides, in contrast to deep learning methods, the patterns identified by FlowChronicle are explainable, and experts can verify their soundness. Our work substantially advances synthetic network traffic generation, offering a method that enhances both the utility and trustworthiness of the generated network flows. Slides

AI for cybersecurity: focus on generative AI (round table)

November 19, 2025

Talk, European Cyber Week, Rennes, France

Round table: “AI for cybersecurity: focus on generative AI”

Towards more realistic honeypots with synthetic network traffic injection

November 04, 2025

Talk, DeceptIA meeting, Tokyo, Japan

Honeypots and honeynets need to be realistic to attract and convince attackers to reveal their techniques. Several works address realistic file systems, but realistic local network communication is still an open question. In this work, we propose to generate synthetic network traffic using generative machine learning techniques and inject it into the network. This presentation covers some recent, ongoing work on this subject.

Synthetic Network Traffic Generation for Intrusion Detection Systems: a Systematic Literature Review

October 30, 2025

Talk, PIRAT Seminar, Rennes, France

Network data can be difficult to collect for privacy and confidentiality reasons. As a result, network datasets are typically created in controlled environments called testbeds. However, these datasets are regularly criticized for their limited size, class imbalance, obsolescence, and lack of actual user activity. Following the rapid development of generative artificial intelligence, new methods have been applied to synthetic network traffic generation without emulation or simulation. This systematic literature review assesses the current state of synthetic network traffic generation for intrusion detection systems.

Synthetic Network Traffic Generation for Intrusion Detection Systems: a Systematic Literature Review

September 26, 2025

Talk, ANUBIS Workshop, Toulouse, France

Network data can be difficult to collect for privacy and confidentiality reasons. As a result, network datasets are typically created in controlled environments called testbeds. However, these datasets are regularly criticized for their limited size, class imbalance, obsolescence, and lack of actual user activity. Following the rapid development of generative artificial intelligence, new methods have been applied to synthetic network traffic generation without emulation or simulation. This systematic literature review assesses the current state of synthetic network traffic generation for intrusion detection systems. Slides

Background traffic generation with statistical AI

September 02, 2025

Talk, Joint Superviz-HiSec Workshop, Paris, France

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. This presentation covers some recent work on this subject.

AI for Cybersecurity: Three Applications for Network Security

July 01, 2025

Talk, Summer School "AI-driven Cyber security", Rennes, France

Researchers have experimented with many AI techniques for detecting network intrusions. This presentation details how curious results led us to a new research direction at the intersection of AI and cybersecurity. This talk draws on three recent works: 1) how to apply AI to intrusion detection, 2) a new XAI (eXplainable AI) framework dedicated to anomaly detection and 3) a new method for assessing data quality and generating synthetic network data with AI. Slides

Towards more realistic honeypots with synthetic network traffic injection

June 19, 2025

Talk, DeceptIA kick-off meeting, Nancy, France

Honeypots and honeynets need to be realistic to attract and convince attackers to reveal their techniques. Several works address realistic file systems, but realistic local network communication is still an open question. In this work, we propose to generate synthetic network traffic using generative machine learning techniques and inject it into the network. This presentation covers some recent, ongoing work on this subject.

Robust malware detectors by design

May 22, 2025

Talk, IFIPSEC conference, Maribor, Slovenia

This work presents the cooperation with CISPA on robust malware detectors. Malware analysis involves analyzing suspicious software to detect malicious payloads. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples, where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. We propose a taxonomy of the transformations an attacker can use depending on the threat models that model their capabilities. We show the effectiveness of this taxonomy by proposing a new set of features and model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. Our framework ERDALT is based on this structure. Slides

Towards programming languages free of injection-based vulnerabilities by design

May 15, 2025

Talk, Eleventh LangSec Workshop at IEEE Security & Privacy, San Francisco, USA

Many systems are controlled via commands built upon user inputs. For systems that deal with structured commands, such as SQL queries, XML documents, or network messages, such commands are generally constructed in a “fill-in-the-blank” fashion: the user input is concatenated with a fixed part written by the developer (the template). However, the user input can be crafted to modify the command semantics intended by the developer and lead to malicious use of the system. Such an attack, called an injection-based attack, is considered one of the most severe threats to web applications. Solutions to prevent such vulnerabilities exist but are generally ad hoc and rely on the developer’s expertise and diligence. Our approach addresses these vulnerabilities from the point of view of formal language theory. We formally define two new security properties. The first one, “intent-equivalence”, guarantees that a developer’s template cannot lead to malicious injections. The second one, “intent-security”, guarantees that every possible template is intent-equivalent, and therefore that the programming language itself is secure. From these definitions, we show that new design patterns can help create programming languages that are secure by design. Slides
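The “fill-in-the-blank” construction described in this abstract can be illustrated with a minimal Python sketch. The template, the table, and the helper names (`fill_in_the_blank`, `run`) are hypothetical and made up for this illustration; the sketch only shows how a crafted input changes the semantics of a concatenated command, and it does not implement the intent-equivalence or intent-security properties defined in the paper.

```python
import sqlite3

# Hypothetical fixed parts written by the developer; the blank sits between them.
PREFIX = "SELECT name FROM users WHERE name = '"
SUFFIX = "' AND active = 1"


def fill_in_the_blank(user_input: str) -> str:
    """Build the command by plain concatenation (the vulnerable pattern)."""
    return PREFIX + user_input + SUFFIX


def run(query: str) -> list:
    """Run a query against a tiny in-memory table made up for this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, active INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", 1), ("bob", 0), ("carol", 1)],
    )
    rows = conn.execute(query).fetchall()
    conn.close()
    return sorted(name for (name,) in rows)


# Intended use: the input stays inside the string literal.
benign = run(fill_in_the_blank("alice"))

# Crafted input: closes the literal, adds a tautology, comments out the suffix.
injected = run(fill_in_the_blank("x' OR '1'='1' --"))

print(benign)
print(injected)
```

With the benign input, the query keeps the structure the developer intended and returns a single active user. With the crafted input, the `active = 1` filter is commented out and the condition becomes always true, so every row is returned, including the inactive account.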

TADAM: Learning Timed Automata from Noisy Observations

May 01, 2025

Talk, SIAM International Conference on Data Mining (SDM25), Alexandria, Virginia, USA

Timed Automata (TA) are formal models capable of representing regular languages with timing constraints, making them well-suited for modeling systems where behavior is driven by events occurring over time. Most existing work on TA learning relies on active learning, where access to a teacher is assumed to answer membership queries and provide counterexamples. While this framework offers strong theoretical guarantees, it is impractical for many real-world applications where such a teacher is unavailable. In contrast, passive learning approaches aim to infer TA solely from sequences accepted by the target automaton. However, current methods struggle to handle noise in the data, such as symbol omissions, insertions, or permutations, often resulting in excessively large and inaccurate automata. In this paper, we introduce TADAM, a novel approach that leverages the Minimum Description Length (MDL) principle to balance model complexity and data fit, allowing it to distinguish between meaningful patterns and noise. We show that TADAM is significantly more robust to noisy data than existing techniques, less prone to overfitting, and produces concise models that can be manually audited. We further demonstrate its practical utility through experiments on real-world tasks, such as network flow classification and anomaly detection. Slides - Poster
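The trade-off between model complexity and data fit behind the MDL principle can be sketched on a toy problem. The example below is not TADAM's encoding: it assumes a crude two-part code for binary strings, where a fair-coin model costs nothing to describe and a biased-coin model costs roughly 0.5·log2(n) bits for its parameter. All function names are hypothetical.

```python
import math


def entropy_bits(p: float) -> float:
    """Binary entropy in bits; defined as 0 at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def description_lengths(bits: str) -> dict:
    """Total code length L(model) + L(data | model) for two candidate models."""
    n = len(bits)
    p = bits.count("1") / n
    return {
        # Fair coin: no parameter to transmit, one bit per symbol.
        "fair_coin": float(n),
        # Biased coin: about 0.5*log2(n) bits for the parameter, then n*H(p).
        "biased_coin": 0.5 * math.log2(n) + n * entropy_bits(p),
    }


def mdl_choice(bits: str) -> str:
    """Pick the model with the shortest total description length."""
    lengths = description_lengths(bits)
    return min(lengths, key=lengths.get)


skewed = "1" * 90 + "0" * 10        # strong regularity
slightly_off = "1" * 52 + "0" * 48  # small deviation from balance

print(mdl_choice(skewed))
print(mdl_choice(slightly_off))
```

On the strongly skewed string, the extra parameter pays for itself and the biased model is selected. On the nearly balanced string, the saving in data cost is smaller than the cost of the parameter, so the simpler model wins and the small deviation is treated as noise.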

Synthetic network traffic with Fos-R

April 16, 2025

Talk, PIRAT seminars, Rennes, France

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. This presentation covers some recent work on this subject.

Towards more realistic honeypots with synthetic network traffic injection

April 15, 2025

Talk, DefMal Webinar, Rennes, France

Honeypots and honeynets need to be realistic to attract and convince attackers to reveal their techniques. Several works address realistic file systems, but realistic local network communication is still an open question. In this work, we propose to generate synthetic network traffic using generative machine learning techniques and inject it into the network. This presentation covers some recent, ongoing work on this subject.

FlowChronicle: Synthetic Network Flow Generation through Pattern Set Mining

April 10, 2025

Talk, Toulouse Hacking Conference (THCon), Toulouse, France

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. Most methods generating synthetic network flows disregard the temporal dependencies between them, leading to unrealistic traffic. To address this issue, we introduce FlowChronicle, a novel synthetic network flow generation tool that relies on pattern mining and statistical models to preserve temporal dependencies. We empirically compare our method against state-of-the-art techniques on several criteria, namely realism, diversity, compliance, and novelty. This evaluation demonstrates the capability of FlowChronicle to achieve high-quality generation while significantly outperforming the other methods in preserving temporal dependencies between flows. Besides, in contrast to deep learning methods, the patterns identified by FlowChronicle are explainable, and experts can verify their soundness. Our work substantially advances synthetic network traffic generation, offering a method that enhances both the utility and trustworthiness of the generated network flows. Slides - Video

Ongoing work on synthetic network traffic generation for IDS evaluation

March 11, 2025

Talk, Superviz plenary meeting, Campus Cyber, Puteaux, France

Network traffic datasets are regularly criticized, notably for the lack of realism and diversity in their attack or benign traffic. Generating synthetic network traffic using generative machine learning techniques is a recent area of research that could complement experimental test beds and help assess the efficiency of network security tools such as network intrusion detection systems. This presentation covers some recent work on this subject. Slides

AI and pattern mining for synthetic security data generation

February 04, 2025

Talk, 2nd CISPA-Inria Workshop, Campus Cyber, Puteaux, France

This event has been cancelled.

An introduction to the systematic literature review

January 23, 2025

Talk, "Papers, please" seminar, Rennes, France

This talk, aimed at the PhD students of the PIRAT and SUSHI teams, is an introduction to the systematic literature review: its value, its method, and its limitations.

A vision for the future of cybersecurity in Brittany and in Europe: what challenges, and what role for AI? (round table)

November 20, 2024

Talk, European Cyber Week, Rennes, France

Round table: “A vision for the future of cybersecurity in Brittany and in Europe: what challenges, and what role for AI?”

Keynote: can generative AI help us better assess security solutions?

November 08, 2024

Talk, Journées Informatiques en Région Centre 2024, Bourges, France

In this keynote, I present the work that led us to study synthetic data generation, and how we generate security data. This work aims at creating realistic and diverse network and system datasets to better evaluate security solutions, most notably supervision solutions. In particular, I will present two works: one on network flow record generation using pattern mining, and one on packet header generation using probabilistic timed automata generation. Slides

Generative AI for assessing network intrusion detection systems

November 06, 2024

Talk, Inria/UK-AISI Workshop, UK Embassy, Paris

During this workshop with AISI, I present my recent work on network data generation using AI and raise new research questions on the role of LLMs in this kind of application. Slides

Towards programming languages free of injection-based vulnerabilities by design

October 24, 2024

Talk, Séminaire PIRAT, Rennes, France

Many systems are controlled via commands built upon user inputs. For systems that deal with structured commands, such as SQL queries, XML documents, or network messages, such commands are generally constructed in a “fill-in-the-blank” fashion: the user input is concatenated with a fixed part written by the developer (the template). However, the user input can be crafted to modify the command semantics intended by the developer and lead to malicious use of the system. Such an attack, called an injection-based attack, is considered one of the most severe threats to web applications. Solutions to prevent such vulnerabilities exist but are generally ad hoc and rely on the developer’s expertise and diligence. Our approach addresses these vulnerabilities from the point of view of formal language theory. We formally define two new security properties. The first one, “intent-equivalence”, guarantees that a developer’s template cannot lead to malicious injections. The second one, “intent-security”, guarantees that every possible template is intent-equivalent, and therefore that the programming language itself is secure. From these definitions, we propose new techniques to create programming languages that are secure by design, and present two secure, simplified versions of widespread languages.

FosR project: can generative AI help us better assess security solutions?

October 15, 2024

Talk, Inria Evaluation Seminar, Rungis, France

In this talk, I present the FosR project on security data generation, and specifically the results obtained in the SecGen associate team. Slides

Leveraging explainability to increase the usability of intrusion detection systems

October 14, 2024

Talk, Séminaire Sci-Rennes, Rennes, France

The use of Machine Learning for anomaly detection in cybersecurity-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on unsupervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our study demonstrates that these explanations can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures. Slides

Learning Conditional Preference Networks: an Approach Based on the Minimum Description Length Principle

August 08, 2024

Talk, IJCAI 2024, the 33rd International Joint Conference on Artificial Intelligence, Jeju, Korea

CP-nets are a very expressive graphical model for representing preferences over combinatorial spaces. They are particularly well suited for settings where an important task is to compute the optimal completion of some partially specified alternative; this is, for instance, the case of interactive configurators, where preferences can be used at every step of the interaction to guide the decision maker towards a satisfactory configuration. Learning CP-nets is challenging when the input data has the form of pairwise comparisons between alternatives. Furthermore, this type of preference data is not commonly stored: it can be elicited but this puts an additional burden on the decision maker. In this article, we propose a new method for learning CP-nets from sales history, a kind of data readily available in many e-commerce applications. The approach is based on the minimum description length (MDL) principle. We show some theoretical properties of this learning task, namely its sample complexity and its NP-completeness, and we experiment with this learning algorithm in a recommendation setting with real sales history from a car maker. Slides - Poster

Attacks on machine learning: challenges and solutions

June 26, 2024

Talk, Séminaire des experts du Groupe Orange, Rennes, France

This event has been cancelled.

Robust malware detectors by design

June 03, 2024

Talk, DefMal annual workshop, Saint-Malo, France

This work presents the cooperation with CISPA on robust malware detectors. Malware analysis involves analyzing suspicious software to detect malicious payloads. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples, where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. We propose a taxonomy of the transformations an attacker can use depending on the threat models that model their capabilities. We show the effectiveness of this taxonomy by proposing a new set of features and model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. Our framework ERDALT is based on this structure. Slides

BAGUETTE: Hunting for Evidence of Malicious Behavior in Dynamic Analysis Reports

April 04, 2024

Talk, Toulouse Hacking Conference (THCon), Toulouse, France

Malware analysis consists of studying a sample of suspicious code to understand it and producing a representation or explanation of this code that can be used by a human expert or a clustering/classification/detection tool. The analysis can be static (only the code is studied) or dynamic (only the interaction between the code and its host during one or more executions is studied). The quality of the interpretation of a code and its later detection depends on the quality of the information contained in this representation. To date, many analyses produce voluminous reports that are difficult to handle quickly. In this article, we present BAGUETTE, a graph-based representation of the interactions between a sample and the resources offered by the host system during one execution. We explain how BAGUETTE helps automatically search for specific behaviors in a malware database and how it efficiently assists the expert in analyzing samples. We also develop a possible use case of BAGUETTE that is currently under investigation: explainable unsupervised malware behavior clustering. Slides - Video

Modeling security data for machine learning

March 11, 2024

Talk, Académie des Technologies, Rennes, France

In this presentation, I give an overview of security data models and show how important they are when used with machine learning techniques. Slides

Three new challenges on network data generation

March 06, 2024

Talk, SecGen plenary meeting, Saarbrücken, Germany

In this presentation, I present three potential ideas to investigate within the SecGen project: 1) generation for concept drift evaluation, 2) causal learning and 3) system logs generation. Slides

AI, a force multiplier for cyber defense? (round table)

January 31, 2024

Talk, Séminaire IA du Commandement de la cyberdéfense, Paris 7e, France

This round-table discussion is dedicated to the effect of AI on cybersecurity: its limitations, its opportunities and its risks. Video

Artificial intelligence: where does it come from, and how far will it go?

November 22, 2023

Talk, Séminaire GIP RENATER, Roscoff, France

A one-hour popular science talk on artificial intelligence, given for the 30th anniversary of GIP RENATER. In this presentation, I look back at the history of artificial intelligence and demystify how it works. I then present the main fields that artificial intelligence has transformed and the many risks it raises. Slides

Network traffic generation, between machine learning and cybersecurity

November 06, 2023

Talk, CISPA-Inria Workshop, Campus Cyber, Puteaux, France

An overview of scientific questions about network traffic generation. Slides

Towards Understanding Alerts raised by Unsupervised Network Intrusion Detection Systems

October 19, 2023

Talk, FADEx Seminar, Rennes, France

The use of Machine Learning for anomaly detection in cyber security-critical applications, such as intrusion detection systems, has been hindered by the lack of explainability. Without understanding the reason behind anomaly alerts, it is too expensive or impossible for human analysts to verify and identify cyber-attacks. Our research addresses this challenge and focuses on unsupervised network intrusion detection, where only benign network traffic is available for training the detection model. We propose a novel post-hoc explanation method, called AE-pvalues, which is based on the p-values of the reconstruction errors produced by an Auto-Encoder-based anomaly detection method. Our work identifies the most informative network traffic features associated with an anomaly alert, providing interpretations for the generated alerts. We conduct an empirical study using a large-scale network intrusion dataset, CICIDS2017, to compare the proposed AE-pvalues method with two state-of-the-art baselines applied in the unsupervised anomaly detection task. Our experimental results show that the AE-pvalues method accurately identifies abnormal influential network traffic features. Furthermore, our study demonstrates that the explanation outputs can help identify different types of network attacks in the detected anomalies, enabling human security analysts to understand the root cause of the anomalies and take prompt action to strengthen security measures. Slides
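The idea of scoring features by the p-values of their reconstruction errors can be illustrated with a minimal sketch. It is not the AE-pvalues implementation: it assumes per-feature reconstruction errors have already been computed by some auto-encoder, and it uses a simple empirical p-value against errors observed on benign traffic. The function names `empirical_p_values` and `rank_features` are hypothetical.

```python
def empirical_p_values(benign_errors, sample_errors):
    """Per-feature empirical p-values of one sample's reconstruction errors.

    benign_errors: list of rows, one per benign sample, each row holding the
                   per-feature reconstruction errors (assumed precomputed).
    sample_errors: per-feature reconstruction errors of the alert to explain.
    """
    n = len(benign_errors)
    p_values = []
    for j, observed in enumerate(sample_errors):
        # Count benign samples whose error on feature j is at least as large.
        at_least = sum(1 for row in benign_errors if row[j] >= observed)
        # The +1 terms keep the estimate strictly positive.
        p_values.append((1 + at_least) / (n + 1))
    return p_values


def rank_features(names, p_values):
    """Sort features from most to least surprising (smallest p-value first)."""
    return sorted(zip(names, p_values), key=lambda pair: pair[1])
```

A feature with a small p-value has a reconstruction error that is rarely reached on benign traffic, which makes it a natural candidate explanation for the alert.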

Conditionally Acyclic CO-Networks for Efficient Preferential Optimization

September 30, 2023

Talk, 26th European Conference on Artificial Intelligence ECAI 2023, Kraków, Poland

This paper focuses on graphical models for modelling preferences in combinatorial spaces and their use for item optimisation. The preferential optimisation task seeks to find the preferred item containing some defined values, which is useful for many recommendation settings in e-commerce. We show that efficient (i.e., with polynomial time complexity) preferential optimisation is achieved with a subset of cyclic CP-nets called conditionally acyclic CP-nets. We also introduce a new graphical preference model, called Conditional-Optimality networks (CO-networks), which is more concise than conditionally acyclic CP-nets and LP-trees but has the same expressiveness with respect to optimisation. Finally, we empirically show that preferential optimisation can be used for encoding alternatives into partial instantiations and vice versa, paving the way towards unsupervised learning of CO-nets and CP-nets with the minimum description length (MDL) principle. Poster

Network traffic generation: a non-technical look at its characteristics and stakes

June 21, 2023

Talk, SecGen kick-off meeting, Campus Cyber, Puteaux, France

Intrusion detection is an essential mechanism in information systems security. Machine learning has been successfully applied to this problem. These techniques rely on training data used to train a detection model. This training data generally comes from datasets that are often more or less automatically generated. Worse, the number of datasets remains small enough that the diversity of the dataset is questionable, and its aging is problematic. A solution to these problems is synthetic data generation: it would be free of experimental inaccuracies, could be easily updated, and alleviate the class imbalance by generating more data on rare classes. We plan to generate benign data only, as attacks are easier to generate with dedicated tools. This talk highlights the characteristics of network traffic generation from a non-technical point of view, adapted to data mining practitioners, as well as the issues and opportunities. Slides

State of the art of research in Cyber & AI (round table)

June 20, 2023

Talk, La Cyber au rendez-vous de l’IA de confiance, Campus Cyber, Puteaux, France

In this round-table discussion, I present the current stakes of network supervision, how machine learning can answer them, and what challenges it brings.

Certifiably robust malware detectors by design

June 01, 2023

Talk, CIDRE seminar, Ploërmel, France

Malware analysis consists of analyzing suspicious software to detect malicious payloads. Static malware analysis, which does not require software execution, relies increasingly on machine learning techniques to achieve scalability. Although such techniques obtain very high detection accuracy, they can be easily evaded with adversarial examples, where a few modifications of the sample can dupe the detector without modifying the behavior of the software. Unlike in other domains, such as computer vision, creating an adversarial example for malware without altering its functionality requires specific transformations. This article proposes a taxonomy of the transformations an attacker can use, depending on the threat models that describe their capabilities. We show the effectiveness of this taxonomy by proposing a new set of features and a model architecture that can lead to certifiably robust malware detection by design. In addition, we show that every robust detector can be decomposed into a specific structure, which can be applied to learn empirically robust malware detectors, even on fragile features. We compare and validate these approaches with various machine-learning-based malware detection methods, allowing for robust detection with minimal detection performance reduction.

A theory of injection-based vulnerabilities in formal grammars

March 29, 2023

Talk, 2023 Annual Meeting of the WG "Formal Methods for Security", Roscoff, France

Many systems work by receiving instructions and processing them: e.g., a browser receives and then displays an HTML page and executes JavaScript scripts, a database receives a query and then applies it to its data, an embedded system controlled through a protocol receives and then processes a message. When such instructions depend on user input, one generally constructs them with concatenation or insertion. This can lead to injection-based attacks, in which the user input modifies the query’s intended semantics and leads to a security breach. Protections do exist but are not sufficient as they never tackle the origin of the problem: the language itself. We propose a new formal approach based on formal languages to assess risk, enhance static analysis, and enable new tools. This approach is general and can be applied to query, programming, and domain-specific languages as well as network protocols. Slides
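As a concrete illustration of the database case described above, the following minimal sketch builds an SQL query by concatenation and shows how a crafted input changes its intended semantics. It only illustrates the attack, not the formal approach of the talk; the table, the data, and the helper names `make_demo_db` and `find_user_unsafe` are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create a small in-memory database with two users (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    """Look up a user by name, building the query by string concatenation."""
    # The user input is inserted directly into the query text, so it can
    # contribute SQL syntax instead of remaining a plain string value.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


conn = make_demo_db()
# Intended use: the input is a single name.
benign_result = find_user_unsafe(conn, "alice")
# Injection: the input closes the string literal and adds a condition that is
# always true, so the query no longer means "the user with this name".
injected_result = find_user_unsafe(conn, "x' OR '1'='1")
```

With the benign input the query returns a single row, whereas the injected input makes the `WHERE` clause true for every row of the table.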

The complexity of unsupervised learning of lexicographic preferences

March 17, 2023

Talk, Séminaire ANITI, IRIT, Toulouse, France

This work considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preference trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP-tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees. Slides

Anomaly detection and explanation in networks with machine learning

March 02, 2023

Talk, NICT seminar, Campus Cyber, Puteaux, France

This talk presents recent work on anomaly detection in network data and anomaly explanation. Our approach represents the network data with a security objects graph analyzed by an autoencoder. We introduce a new statistical explanation technique for reconstruction-based methods and compare it with SHAP. Finally, we use these explanations to analyze the dataset CICIDS2017 and check whether they match the expert’s expectations. Slides

Introduction to regular languages, finite-state automata and regular expressions, and their applications in machine learning

December 15, 2022

Talk, "Papers, please" seminar, Rennes, France

A concise overview of the different representations of regular languages (regular expression, regular grammar, finite-state automaton) and a potential application to network data generation. Slides

Behavioral intrusion detection system based on machine learning

September 20, 2022

Talk, Supsec 3rd workshop: AI for supervision, Rennes, France

In this talk, I present the sec2graph approach, its performance, and its explanation mechanism. This mechanism helped us identify several flaws in the labelling of the CICIDS2017 dataset and in the traffic capture, such as packet misordering, packet duplication and attacks that were performed but not correctly labelled. Slides

The complexity of unsupervised learning of lexicographic preferences

July 23, 2022

Talk, M-PREF Workshop of IJCAI, Vienna, Austria

This work considers the task of learning users’ preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users’ preferences that ranks previously chosen alternatives as high as possible; and an algorithm to learn, in this setting, a particular model of preferences: lexicographic preference trees (LP-trees). In this paper, we study complexity-theoretical problems related to this approach. We give an upper bound on the sample complexity of learning an LP-tree, which is logarithmic in the number of attributes. We also prove that computing the LP-tree that minimises the empirical risk can be done in polynomial time when restricted to the class of linear LP-trees. Slides

Some work of starting PhDs in CIDRE on AI and cybersecurity

June 09, 2022

Talk, GT stats seminar, Rennes, France

This presentation is an overview of four PhDs recently started in the CIDRE team, on intrusion detection, data generation, malware analysis and botnet detection. Slides

Interactive configuration and recommendation in presence of constraints

May 20, 2022

Talk, Séminaire ANITI, IRIT, Toulouse, France

We present our work on the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. This presentation details how we handle constraints about the configuration and highlights some experimental results. Slides
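The basic idea described above, recommending a value chosen by other users in a similar context, can be sketched as a simple frequency count over the sale histories. This is a naive baseline written under stated assumptions, not the method presented in the talk: it requires an exact match on the already decided variables and ignores configuration constraints. The function name `recommend_value` and the data layout (one dictionary per sold product) are hypothetical.

```python
from collections import Counter


def recommend_value(histories, context, variable):
    """Recommend a value for `variable` given the current partial configuration.

    histories: list of dicts, one per previously sold product (variable -> value).
    context:   dict of the variables already decided by the current user.
    Returns the most frequent value among matching histories, or None.
    """
    # Keep only past configurations that agree with every decided variable.
    matching = [
        h for h in histories
        if all(h.get(var) == val for var, val in context.items())
    ]
    counts = Counter(h[variable] for h in matching if variable in h)
    if not counts:
        return None  # no history shares this context
    return counts.most_common(1)[0][0]
```

For example, with a history in which most diesel cars were sold in red, calling `recommend_value(histories, {"engine": "diesel"}, "color")` would suggest red to a user who has already chosen a diesel engine.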

An introduction to explainable AI methods

March 29, 2022

Talk, "Papers, please" seminar, Rennes, France

With the ever-growing use of AI techniques, the need to verify predictions is becoming increasingly pressing. This is the goal of explainability methods: letting a user know why the system made a decision. This introduction reviews the many families of techniques available. Slides

Machine learning and security: between threats and opportunities

November 29, 2021

Talk, MeetUp LumenAI, Rennes, France

This presentation recalls the fundamental concepts of security and presents the applications of machine learning techniques to the many problems related to security. The last topic covered is the security of machine learning methods themselves, which are directly targeted by numerous attack techniques with varied goals and means. Seminar presented with Ludovic Mé. Slides

Computer security in the age of artificial intelligence

October 13, 2021

Talk, Séminaires du département informatique, Rennes, France

With the digitization of our lives, computer systems have taken a prominent place in our society and have naturally become the target of attackers, from the “script kiddie” to the organized group with a political agenda. In recent years, artificial intelligence has joined the party: whether it enables automatic attack detection or is surreptitiously manipulated in autonomous cars, it brings its share of problems and solutions. This seminar aims to present these two fields, their stakes and their interactions, and to highlight the research directions favored by the scientific community to counter these threats. Slides

Machine Learning 101

September 24, 2021

Talk, Hands-on Machine Learning for Security seminar, Rennes, France

Machine learning is applied successfully in various domains, including cybersecurity, where it has been used for intrusion detection, malware analysis, and attack comprehension, for example. Therefore, many cybersecurity researchers seek to catch up and introduce such techniques in their research. This presentation aims to provide the basics of machine learning for cybersecurity researchers. The presentation is available on YouTube. Slides

A formal study of injection-based vulnerabilities and some tools it will enable

February 19, 2021

Talk, SoSySec seminars at IRISA, Rennes, France

Many systems work by receiving instructions and processing them: e.g., a browser receives and then displays an HTML page and executes JavaScript scripts, a database receives a query and then applies it to its data, an embedded system controlled through a protocol receives and then processes a message. When such instructions depend on user input, one generally constructs them with concatenation or insertion. This can lead to injection-based attacks, in which the user input modifies the query’s intended semantics and leads to a security breach. Protections do exist but are not sufficient as they never tackle the origin of the problem: the language itself. We propose a new formal approach based on formal languages to assess risk, enhance static analysis, and enable new tools. This approach is general and can be applied to query, programming, and domain-specific languages as well as network protocols. We are setting up an ANR project to go into this subject in more depth. The presentation, in French, is available on YouTube. Slides

An agnostic radio intrusion detection system to protect the Internet of Things

January 22, 2020

Talk, Nouvelles Avancées en Sécurité des Systèmes d'Information, INSA-Toulouse, Toulouse, France

The expansion of the Internet of Things (IoT) is bringing about smart homes, smart factories and even smart cities. Although these devices improve their users’ quality of life and offer new economic opportunities, they are also a major attack vector (the Mirai botnet being probably the best-known example). To protect these environments, intrusion detection systems (IDS) are being developed. These IDSs face problems unique to the IoT, such as the rapid evolution of technologies and protocols, or their decentralized networks. To overcome these problems, we propose an IDS that monitors wide frequency bands at the physical layer without making assumptions about the protocols or technologies present. In addition, for each detected attack, our solution provides a threefold diagnosis: temporal (the exact times of the detected anomaly), frequency (the main frequency of the anomaly) and spatial (the estimated position of the anomaly’s origin). We evaluated our method in a full-scale experiment: our system was able to effectively detect and diagnose attacks launched on the 400-500 MHz and 800-900 MHz bands, two bands that are not covered by traditional solutions. Slides

Interactive configuration with constraints consistency and recommendation

June 12, 2018

Talk, Panorama des recherches dans le domaine automobile, LAAS-IRIT-Laplace, Toulouse, France

We present our work on the recommendation of values in interactive configuration, with no prior knowledge about the user, but given a list of products previously configured and bought by other users (“sale histories”). The basic idea is to recommend, for a given variable at a given step of the configuration process, a value that has been chosen by other users in a similar context, where the context is defined by the variables that have already been decided, and the values that the current user has chosen for these variables. This presentation details how we handle constraints about the configuration and highlights some experimental results. Slides

Learning Lexicographic Preference Trees from Positive Examples

February 07, 2018

Talk, AAAI’18 Technical Track, New Orleans, USA

We consider the task of learning the preferences of users on a combinatorial set of alternatives, as can be the case, for example, with online configurators. In many settings, what is available to the learner is a set of positive examples of alternatives that have been selected during past interactions. We propose to learn a model of the users’ preferences that ranks previously chosen alternatives as high as possible. Here, we study the particular task of learning conditional lexicographic preferences. We present an algorithm to learn several classes of lexicographic preference trees, prove convergence properties of the algorithm, and experiment on both synthetic data and a real-world benchmark in the domain of recommendation in interactive configuration. Slides

Pierre-François Gimenez
