Thanks to our daily data collection and test-driving work, Zenseact stores a huge amount of data in our data centers. Presenting and managing this data in a user-friendly way is a challenge. Today, we typically extract metadata upfront at ingestion time by analyzing the data as it is uploaded. We would like to explore how we could extract metadata from, and index, data already present in our storage systems by designing a data crawler. The crawler would need a basic understanding of our file system layout, the file formats we use, and their internal structure. This way, if we want to extract new information from our dataset, we can add new analysis logic to the crawler, and the data becomes available once crawling has completed.
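To give a flavor of what such a crawler could look like, here is a minimal sketch in Python. The directory layout, the analyzers, and their names are hypothetical; the point is the pluggable structure, where new analysis logic can be added without changing the traversal.

```python
import os

# Hypothetical example analyzers; real ones would understand Zenseact's
# file formats and internal structure.

def analyze_size(path):
    """Record the file size in bytes."""
    return {"size_bytes": os.path.getsize(path)}

def analyze_extension(path):
    """Record the file extension."""
    return {"extension": os.path.splitext(path)[1]}

# New analysis logic plugs in here.
ANALYZERS = [analyze_size, analyze_extension]

def crawl(root):
    """Walk the tree under `root` and build a path -> metadata index."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            record = {}
            for analyzer in ANALYZERS:
                record.update(analyzer(path))
            index[path] = record
    return index
```

In a production setting the in-memory dictionary would be replaced by a persistent index (for example, a database), and the traversal would need to cope with parallel file systems and incremental re-crawling.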
Building on the data crawler work, a high-performance file searching algorithm should be explored. Ideally, this algorithm should return detailed results with very low response time. It could start from the index maintained by the crawler, then go on to inspect file contents in an efficient way. It could also support some form of keyword-based search; this part could integrate tags created by teams in the company or use machine learning methods to extract relevant keywords.
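The two-stage idea above can be sketched as follows, assuming an index of the shape a crawler might maintain (path to metadata mapping). The function name and parameters are illustrative only: a cheap filter over indexed metadata first, then content inspection only for the surviving candidates.

```python
def search(index, *, extension=None, keyword=None):
    """Two-stage search: filter on indexed metadata, then inspect
    file contents only for the narrowed candidate set."""
    # Stage 1: cheap filter using the metadata index alone.
    candidates = [
        path for path, meta in index.items()
        if extension is None or meta.get("extension") == extension
    ]
    if keyword is None:
        return candidates
    # Stage 2: open and scan only the candidates from stage 1.
    hits = []
    for path in candidates:
        with open(path, errors="ignore") as fh:
            if keyword in fh.read():
                hits.append(path)
    return hits
```

The stage-1 filter is what makes low response times plausible: the expensive content scan touches only a small fraction of the dataset.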
A data center user interface could be the final outcome, combining the functionality of searching for and analyzing data as well as monitoring the file crawling process. Down the line, this interface could potentially be integrated into Zenseact's data platform alongside our other data access user interfaces.
Depending on the interest of the candidates we are open to focusing on different parts of the above ideas, or alternative approaches within this problem domain.
In this master's thesis project, you will focus on:
- Building a data crawler to extract and index metadata from files stored in parallel file systems.
- Designing a high-performance searching system with low response time based on the index.
- Constructing a data center interface that can be used to navigate the index and execute the data searching algorithm.
We are looking for 2 students with a background in Computer Science, Engineering, or a related field, preferably with good knowledge of:
- C++ and/or Python programming
- Distributed systems
- Databases and querying engines
- Data visualization
Good to have:
- Machine Learning knowledge
- User interface design
Please send in individual applications with CV, motivational letter, and grade transcripts.
Planned start: January 2022, with some flexibility.
Final application date: 20 November 2021, but we screen candidates continuously, so please submit your application as soon as possible.
Scope: 30 ECTS
For questions regarding the project, please contact firstname.lastname@example.org or email@example.com.