Here are projects that I am, or was at some point, interested in working on but don’t have time to tackle. If you are interested in any of them, please contact me and I might be able to guide you.
Projects are sorted by decreasing practicality, i.e. the projects at the end are more research-oriented and might yield no practical results.
Implement the Burrows–Wheeler transform for any data type in pydivsufsort
pydivsufsort provides quality bindings to the divsufsort algorithm for suffix arrays, as well as other useful string algorithms.
Currently, the Burrows–Wheeler transform is implemented natively by divsufsort. However, it does not work on types wider than a byte (see issue). It would be nice to implement it in Cython, along with the inverse Burrows–Wheeler transform.
Reference: https://www.cs.waikato.ac.nz/~tcs/COMP317/bwt.html
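To make the goal concrete, here is a minimal pure-Python sketch of the transform and its inverse on any sequence of comparable items (for example integers larger than a byte). The function names `bwt` and `inverse_bwt` are hypothetical and are not part of pydivsufsort. The forward transform sorts rotations naively, which is far too slow for real data; an actual implementation would presumably derive the result from a suffix array instead.

```python
from typing import Any, List, Sequence, Tuple


def bwt(seq: Sequence[Any]) -> Tuple[List[Any], int]:
    """Naive Burrows-Wheeler transform (hypothetical helper).

    Returns the last column of the sorted rotation matrix and the row
    index of the original sequence in that matrix.
    """
    n = len(seq)
    if n == 0:
        return [], 0
    # Sort rotation start offsets by the rotation they denote.
    # This is O(n^2 log n): fine for a sketch, not for real inputs.
    order = sorted(range(n), key=lambda i: [seq[(i + k) % n] for k in range(n)])
    last = [seq[(i - 1) % n] for i in order]
    return last, order.index(0)


def inverse_bwt(last: Sequence[Any], primary: int) -> List[Any]:
    """Invert bwt() given the last column and the primary index."""
    n = len(last)
    if n == 0:
        return []
    # A stable sort of positions by symbol maps each entry of the first
    # column to the position of the same occurrence in the last column.
    idx = sorted(range(n), key=lambda i: last[i])
    out = []
    j = primary
    for _ in range(n):
        j = idx[j]
        out.append(last[j])
    return out
```

Because both functions only rely on comparisons between items, the same code works on bytes, characters or arbitrary integers.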
MIDI indexing using suffix arrays
tags: music, suffix arrays, web
How cool would it be to be able to search for a song by its melody? This is the idea behind the project.
The project is described on https://github.com/louisabraham/what-the-midi
I implemented part of it but I am not sure I will have time to finish it any time soon.
The idea is to use suffix arrays to index MIDI files. A variant could be to store the intervals between consecutive notes instead of the notes themselves, which makes the search invariant to transposition.
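The interval variant can be sketched in a few lines. This is only an illustration and is not the code from the repository: the helper names are hypothetical, melodies are assumed to be plain lists of MIDI pitch numbers, and the suffix array is built naively rather than with a dedicated library.

```python
from typing import List, Sequence


def to_intervals(pitches: Sequence[int]) -> List[int]:
    """Differences between consecutive pitches (transposition invariant)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def build_suffix_array(seq: List[int]) -> List[int]:
    """Naive suffix array: suffix start offsets in lexicographic order."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def find_occurrences(seq: List[int], sa: List[int], pattern: List[int]) -> List[int]:
    """Binary-search the suffix array for suffixes starting with pattern."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


# Hypothetical data: a melody and a query played a fifth higher.
melody = to_intervals([60, 62, 64, 60, 62, 64, 65])
query = to_intervals([67, 69, 71])
positions = find_occurrences(melody, build_suffix_array(melody), query)
```

Here `positions` lists the offsets in the interval sequence where the transposed query matches. Indexing many files would additionally require mapping offsets back to the file they came from.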
There are huge databases like this one.
Currently the repository contains a proof-of-concept for KMP search and a suffix array database.
The goal is to make a web app out of it and to make it more performant.
AI code review
tags: machine learning, NLP, language models
This post showed that some publicly available language models are able to detect bugs in code.
The project would be to build a tool similar to the one shown in the post that could integrate with an IDE like VSCode and suggest places where there might be bugs.
LassoNet and VAE/GAN
tags: machine learning, neural networks, feature selection
I co-authored this paper about feature selection for neural networks.
The problem with feature selection is that it is meaningful only if the features are interpretable.
For example, if the input of the neural network is composed of hand-crafted features, then the features are interpretable. However, if the input is a raw image, then the features are not really interpretable. All feature selection could tell us is that some regions of the image are always irrelevant, but not which patterns are useful.
I’m thinking about a way to combine feature selection with latent space representation. The issue is that the selected features would still not necessarily be interpretable, but there seem to be solutions like https://arxiv.org/abs/2002.03754.
A first experiment would be “offline”:
- train an interpretable latent space representation
- run LassoNet on this space
LassoNet would thus produce a classifier that is both performant and concisely explainable.
An obvious comparison would be with interpretability techniques (e.g. SHAP) that provide local explanations. LassoNet would have the advantage of stability and reliability thanks to its sparsity.
If this POC works, other experiments would train LassoNet AND the latent space together to get even better results.
Detecting when a piano is out of tune
tags: music, machine learning
When practicing every day on the same piano, it is easy to get used to the fact that the piano is out of tune.
Thus, it would be useful to build some kind of tool that would detect when the piano is out of tune.
Two variants can be considered:
- an “offline” tool that would require the user to play some scales
- an “online” tool that would detect when the piano is out of tune while the user is playing anything
I made some bibliographical notes about techniques that would make this possible. Challenges are the inharmonicity of piano strings and the fact that people play multiple notes at the same time, making it hard to detect the pitch of a single note in the “online” variant.
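As a rough illustration of the quantities involved, the sketch below measures how far a detected frequency is from its equal-tempered target, in cents, and computes the partials of a stiff string with the usual model f_n = n · f_0 · sqrt(1 + B · n²). The helper names, the 440 Hz reference and the value of B used below are assumptions for the example; real pianos are tuned with stretched octaves precisely because of inharmonicity, so equal temperament is only a first reference.

```python
import math

A4_HZ = 440.0  # assumed reference pitch for MIDI note 69


def equal_tempered_hz(midi_note: int, a4: float = A4_HZ) -> float:
    """Target frequency of a MIDI note in 12-tone equal temperament."""
    return a4 * 2 ** ((midi_note - 69) / 12)


def cents(measured_hz: float, reference_hz: float) -> float:
    """Signed deviation in cents (100 cents = one semitone)."""
    return 1200 * math.log2(measured_hz / reference_hz)


def partial_hz(f0: float, n: int, b: float) -> float:
    """n-th partial of a stiff string with inharmonicity coefficient b."""
    return n * f0 * math.sqrt(1 + b * n * n)


# Hypothetical measurement: A4 detected at 437 Hz is flat.
deviation = cents(437.0, equal_tempered_hz(69))

# With b > 0 the partials are sharper than exact multiples of f0,
# which is what confuses naive harmonic pitch detectors.
second_partial = partial_hz(220.0, 2, 0.0004)  # 0.0004 is an assumed value
```

An “offline” tool could apply `cents` to each note of a scale, while the “online” variant would first have to separate simultaneous notes before any such measurement.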
Citation graph visualization
tags: graph, visualization, web
When reading about a new topic, it would be nice to be able to visualize the citation graph of the papers that are relevant to the topic.
Then, it would be really helpful to be able to extract the passages in which each reference is cited.
Thus, one would understand much faster how each paper is related to the others.
Google Scholar makes it very easy to get the papers that cite a given paper. However, it is not possible to get the papers that are cited by a given paper.
This is very sad, and I’m not the only one to think so.
Resources:
- scholarly seems useful to scrape Google Scholar
- it looks like (thanks Clémentine for pointing this out) Semantic Scholar has an API that returns the references of a paper. While it is not open-source, not totally accurate and is missing some papers, it is a good start.
- pdfminer could be useful to extract the references from a PDF
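Once reference lists have been collected from any of these sources, building the graph itself is simple. The sketch below assumes the data has already been gathered into a dictionary mapping each paper to the papers it cites; the paper identifiers and helper names are hypothetical and no scraping or API call is shown.

```python
from collections import defaultdict
from typing import Dict, List, Set


def invert_references(references: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Turn 'paper -> papers it cites' into 'paper -> papers citing it'."""
    cited_by: Dict[str, Set[str]] = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            cited_by[ref].add(paper)
    return dict(cited_by)


def shared_references(references: Dict[str, List[str]], a: str, b: str) -> Set[str]:
    """References that two papers have in common."""
    return set(references.get(a, [])) & set(references.get(b, []))


# Hypothetical collected data.
references = {
    "paper_A": ["paper_C", "paper_D"],
    "paper_B": ["paper_C"],
    "paper_C": ["paper_D"],
}
cited_by = invert_references(references)
common = shared_references(references, "paper_A", "paper_B")
```

The two dictionaries together give both directions of every edge, which is what a visualization front end would need.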
Old projects
Practical zero knowledge proofs over Ethereum
tags: cryptography, blockchain, zkSNARKs
The paper Cao2021 uses zero knowledge proofs to implement a DRM system. However, the authors do not describe any implementation.
There are multiple implementations of zkSNARKs on Ethereum, like ZoKrates.
The project would be to implement the system described in the paper using ZoKrates and to evaluate the performance of the system, e.g. in terms of gas consumption.
Reference:
- Cao, Z. and Zhao, L., 2021, March. A Design of Key Distribution Mechanism in Decentralized Digital Rights Management Based on Blockchain and Zero-Knowledge Proof. In 2021 The 3rd International Conference on Blockchain Technology (pp. 53-59).
Update: I actually did this one here.