School / Prep
ENSEIRB-MATMECA
Internal code
EI9IS329
Description
The course has a twofold aim and is structured around two projects.
The first project aims to present and implement some techniques for extracting information from textual data.
We will first see how established algorithms such as the Bag-of-Words model and TF-IDF can be used to extract relevant data from documents.
We will then look at vector embedding methods, studying the Word2Vec model, which can be used to extract contextual data.
Finally, we will see how this information can be used to identify semantically related texts or categorize them using clustering algorithms.
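As an illustration of the first two steps, here is a minimal sketch in pure Python of a Bag-of-Words count, a TF-IDF weighting and a cosine similarity between documents. It is not the course's reference implementation: the helper names (`tokenize`, `bag_of_words`, `tf_idf`, `cosine`), the whitespace tokenizer, the unsmoothed `log(N / df)` weighting and the three sample sentences are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return one term-count Counter per document."""
    return [Counter(tokenize(doc)) for doc in docs]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (count / document length) * log(N / df).
    A term present in every document therefore gets weight 0.
    """
    bows = bag_of_words(docs)
    n_docs = len(bows)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for bow in bows for term in bow)
    vectors = []
    for bow in bows:
        total = sum(bow.values())
        vectors.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in bow.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock markets fell sharply today",
]
vectors = tf_idf(docs)
sim_related = cosine(vectors[0], vectors[1])    # documents sharing "the" and "cat"
sim_unrelated = cosine(vectors[0], vectors[2])  # documents with no term in common
```

In this sketch the two sentences about the cat share weighted terms and so get a positive similarity, while the third shares no term with the first and scores zero. Such pairwise similarities are the kind of input a clustering algorithm can then use to group related texts.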
The second project will address a similar problem, but in the context of visual data.
The lectures will be accompanied by tutorials and lab sessions (TDs/TPs) in which students implement the algorithms presented above.
The two projects, one on textual data and the other on visual data, both use real data. They will let students apply the algorithms seen in lectures and draw on their distributed computing skills to process the full volume of the dataset in a reasonable time.
Teaching hours
- TD (Tutorial): 21.33 h
Mandatory prerequisites
Basic knowledge of Python, algorithms and linear algebra
Further information
Natural language processing (NLP) is one of today's major challenges in Artificial Intelligence. Advances in this field are used daily in search engines, chatbots and email services (spam detection, targeted advertising, etc.).
Processing large numbers of images to extract information is another of today's major challenges. The applications are many and varied: automatic detection of people and signs, object recognition, and so on.
Assessment of knowledge
Initial assessment / Main session - Tests
Type of assessment | Type of test | Duration (in minutes) | Number of tests | Test coefficient | Eliminatory mark in the test | Remarks |
---|---|---|---|---|---|---|
Project | Continuous assessment | 1 |