Records for the same person can be inconsistent across databases because of typos, missing information, and misspellings. This algorithm merges multiple data sets into one while avoiding redundant records.
What it does
Groups together the records that it believes belong to the same person.
How we built it
Used Levenshtein distance and Double Metaphone to measure how close two records were. We handled edge cases such as swapped fields, potential misspellings, abbreviations, shortened versions of names and places, and nicknames. After completing the main algorithm, we chose per-field weights based on how likely someone is to accidentally input wrong data in that field. We tested various confidence thresholds to determine the ideal number of groups for the given test data set.
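As a minimal sketch of the string-comparison step, here is a self-contained Levenshtein distance plus a normalized similarity in [0, 1] (the function names and the normalization by the longer string's length are illustrative choices, not the project's exact code):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Normalize edit distance to a similarity score: 1.0 means identical."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

For example, `name_similarity("Jon", "John")` is 0.75, so a misspelled name still scores highly. A library such as `jellyfish` could supply the Double Metaphone encoding to catch phonetically equivalent spellings the edit distance misses.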
Challenges we ran into
Dealing with unnormalized data and the numerous edge cases of how someone could enter a record incorrectly, and deciding how much weight each column's similarity should have on the overall confidence score.
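One way to frame the weighting problem is a weighted average of per-column similarities. The column names and weight values below are purely hypothetical, chosen to illustrate the idea that some fields are more reliable identifiers than others:

```python
# Hypothetical per-column weights: fields that are rarely mistyped
# (e.g., date of birth) get more influence than free-text fields.
WEIGHTS = {"first_name": 0.2, "last_name": 0.3, "dob": 0.3, "zip": 0.2}

def record_confidence(rec_a: dict, rec_b: dict, similarity) -> float:
    """Weighted average of per-column similarity scores, each in [0, 1]."""
    total = sum(WEIGHTS.values())
    score = sum(w * similarity(rec_a[col], rec_b[col])
                for col, w in WEIGHTS.items())
    return score / total
```

With this framing, tuning the weights is exactly the trial-and-error process described above: raising a column's weight makes disagreement in that column drag the overall confidence down faster.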
Accomplishments that we're proud of
- Sorting and confidence algorithm
- Implementing different ways of determining if one record is the same as another and how tolerant to be
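A sketch of the grouping step, assuming a pairwise confidence function and a tunable threshold (the union-find approach and the default threshold value are illustrative assumptions, not necessarily the project's implementation):

```python
from itertools import combinations

def group_records(records, confidence, threshold=0.85):
    """Merge any pair of records whose confidence score meets the threshold,
    using union-find so that transitive matches end up in the same group."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if confidence(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)  # union the two groups

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(records[i])
    return list(groups.values())
```

Lowering the threshold merges more aggressively and yields fewer, larger groups; raising it does the opposite, which is why sweeping thresholds against the test data set reveals the ideal group count.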
What we learned
- How to use SQL databases and integrate them with Python
- Algorithm design
What's next for Patient Match
- Import using a SQL database rather than using a .csv file
- Using statistics and math to determine proper weights rather than trial and error
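The planned SQL import could look like the sketch below, using Python's built-in `sqlite3` module; the `patients` table name and its columns are hypothetical placeholders for whatever schema the database actually uses:

```python
import sqlite3

def load_records(db_path="patients.db"):
    """Read patient rows from a SQLite database instead of parsing a .csv file."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # lets us access columns by name
    try:
        rows = conn.execute("SELECT first_name, last_name, dob FROM patients")
        return [dict(row) for row in rows]
    finally:
        conn.close()
```

Each returned dict has the same shape as a parsed .csv row, so the matching algorithm would not need to change.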
Try it out