## What it does

Our project is a solution to OpenAlly's first Challenge: Patient Matching.

## How I built it

• There is an is_same_person() function that takes in two rows and a logging file pointer, and returns a score
• The lower the score, the more likely it is that the two rows describe the same person
• The exact threshold at which two rows are judged to be different people was determined through testing
• Most mistakes in data fields (typos) can be measured using the Damerau-Levenshtein distance algorithm
• Damerau-Levenshtein distance measures string similarity as the minimum number of character substitutions, insertions, deletions, and adjacent character swaps needed to transform one string into another
• I use this algorithm for many of the columns to determine if they are equivalent
• if the distance for a column is below that column's threshold, the values are considered equivalent
• for last and first names, the score modifier scales with HOW different the two strings are
• for example, "Austin" vs. "Marco" would return a high number, as the names are very different
• if the names are very different, then it becomes increasingly likely that they are different people
• for more unpredictable, non-name fields, the score modifiers are static
• because the data had so many inconsistencies, we had to write a lot of logic for each column to normalize its contents before measuring similarity
• for example, some of the State entries were postal abbreviations and others were expanded names
• to solve this, I wrote different cases based on the combination of rows
• if only one row is abbreviated, abbreviate the other
• get the distance between the two; if they are the same, subtract 5 from the score
• if they are only 1 apart, then add 5 to the score
• if they are 2 apart (entirely different), add 15 to the score
• if both were abbreviated, use the same logic as before, except add 7.5 if they are 1 apart
• since no conversion happened, a near miss more likely means two genuinely different states
• if neither is abbreviated, get the distance as is; if the distance > 2, add 15 to the score
• otherwise, subtract 5
• another clever solution is in handle_names(), the function that takes care of first and last names
• in the data, there are instances where a patient's name is in its short form in one row and in its long form in another.
• while this cannot be known with certainty, it must be accounted for to prevent false positives
• the solution was to see if the difference in length between the two names was more than 2
• if so, then check if one of the names starts with the other name
• if it does, then subtract from the score
• otherwise, if the two names appear to be distinct, return a positive score proportional to how different they are
• there are more things in the code that I found interesting, but I will leave them out so that this does not become an essay
• I made a conscious effort to comment all of my code, so I hope that it is readable
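The Damerau-Levenshtein scoring described above can be sketched as follows. This is illustrative only: `dl_distance` assumes the optimal-string-alignment variant of the distance, and `handle_names_sketch` (with its `-5` bonus) is a made-up stand-in for the real `handle_names()` and its tuned modifiers.

```python
def dl_distance(a, b):
    # Optimal-string-alignment variant of Damerau-Levenshtein distance:
    # minimum substitutions, insertions, deletions, and adjacent swaps.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[m][n]


def handle_names_sketch(a, b):
    # Hypothetical version of the name logic: a length gap over 2 plus a
    # shared prefix suggests a short form ("Dan" vs. "Daniel"), which lowers
    # the score; otherwise the penalty grows with how different the names are.
    a, b = a.lower().strip(), b.lower().strip()
    if abs(len(a) - len(b)) > 2 and (a.startswith(b) or b.startswith(a)):
        return -5  # assumed bonus value, not the project's actual number
    return dl_distance(a, b)
```

So "Austin" vs. "Marco" yields a large positive penalty, while "Dan" vs. "Daniel" is rewarded as a likely short form.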
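The state-comparison cases can be condensed into a sketch like this; `US_STATES` is a truncated stand-in for a full name-to-abbreviation table, and `_lev` is a plain edit distance used here only for illustration:

```python
US_STATES = {"california": "CA", "new york": "NY", "texas": "TX"}  # sample

def _lev(a, b):
    # Ordinary Levenshtein distance; enough for comparing short codes.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def state_score(a, b):
    a, b = a.strip(), b.strip()
    a_abbr, b_abbr = len(a) == 2, len(b) == 2
    if a_abbr != b_abbr:
        # Only one row is abbreviated: abbreviate the other, then compare.
        if not a_abbr:
            a = US_STATES.get(a.lower(), a)
        else:
            b = US_STATES.get(b.lower(), b)
        d = _lev(a.upper(), b.upper())
        return -5 if d == 0 else (5 if d == 1 else 15)
    if a_abbr and b_abbr:
        # No conversion happened, so a near miss is penalized more heavily.
        d = _lev(a.upper(), b.upper())
        return -5 if d == 0 else (7.5 if d == 1 else 15)
    # Neither is abbreviated: compare the full names as-is.
    return 15 if _lev(a.lower(), b.lower()) > 2 else -5
```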

## Grouping

• I thought the way I got grouping to work was pretty interesting
• a set of indices is maintained to keep track of which rows have unique groups
• each row is given a pointer to a list of sibling rows
• when comparing 2 rows, if the rows are deemed equal, then they are put in the same group
• if neither has a group at that point, then one of the rows makes a new group containing both rows and adds the index of one of the rows to the set
• the other row's group points to this list
• if one of them has a group but the other does not, the groupless row is added to that group and its pointer is set to point at the group
• if both of them have groups, then the smaller group is appended onto the larger group
• the smaller group is then removed from the set
• the end result is that all of the rows that are equivalent are put in their own grouping
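The merging scheme above can be sketched like so, with `same(i, j)` standing in for the real `is_same_person()` comparison:

```python
def group_rows(n, same):
    # Sketch of the grouping scheme: `group` maps each row index to a shared
    # sibling list, and `reps` holds one representative index per live group.
    group = [None] * n
    reps = set()
    for i in range(n):
        for j in range(i + 1, n):
            if not same(i, j):
                continue
            gi, gj = group[i], group[j]
            if gi is None and gj is None:
                g = [i, j]                 # brand-new group for both rows
                group[i] = group[j] = g
                reps.add(i)
            elif gj is None:
                gi.append(j)               # add the groupless row
                group[j] = gi
            elif gi is None:
                gj.append(i)
                group[i] = gj
            elif gi is not gj:
                small, big = (gi, gj) if len(gi) < len(gj) else (gj, gi)
                big.extend(small)          # smaller group joins the larger
                for r in small:
                    group[r] = big         # repoint the merged rows
                reps.discard(small[0])     # the smaller group's rep retires
    groups = [sorted(group[r]) for r in sorted(reps)]
    grouped = {r for g in groups for r in g}
    return groups + [[r] for r in range(n) if r not in grouped]
```

Merging the smaller group into the larger one bounds how often any row gets repointed, the same idea as union-by-size in union-find.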

## Challenges I ran into

• figuring out how best to group the patients
• I tested the algorithm only by comparing adjacent rows
• I did not put a focus on the end goal; I just wanted to get accuracy
• I realized that the runtime complexity was quadratic and that it was going to be tested against hundreds of thousands of records
• if I had realized this sooner, I maybe could have changed my logic to incorporate some sort of hashing
• this would mean being able to cut down on the main bottleneck: the is_same_person() function
• hashing would mean less computation in the long run
• realizing that using numpy arrays was not going to instantly speed things up
• when the dtype of the array is object, there is a lot of overhead from the pointer dereferencing needed to reach each element
• arrays are much more beneficial when using primitives
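The "some sort of hashing" idea is usually called blocking in record linkage: bucket rows by a cheap key and only run the expensive comparison within each bucket. A minimal sketch, where the key function (e.g. first letter of last name) is a made-up example:

```python
from collections import defaultdict

def candidate_pairs(rows, key):
    # Bucket rows by a cheap hash key; only rows that share a bucket get the
    # expensive is_same_person() comparison, instead of all O(n^2) pairs.
    buckets = defaultdict(list)
    for i, row in enumerate(rows):
        buckets[key(row)].append(i)
    pairs = []
    for idx in buckets.values():
        pairs.extend((idx[a], idx[b])
                     for a in range(len(idx))
                     for b in range(a + 1, len(idx)))
    return pairs
```

The trade-off is recall: a key that is too strict (say, full last name plus birth year) will put typo-laden duplicates into different buckets and miss them entirely.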

## What I learned

• I learned a lot about NLP in doing this project
• Learning about the different string similarity algorithms was really interesting for me
• I learned about the importance of randomness in algorithmic design
• I already knew that many quicksort implementations randomize the order of the data before processing, but this project made me realize how beneficial that is
• With the default order, running the `test()` function, I was maxing out at 95% accuracy
• However, when randomizing the order this number approached 100%, even hitting it at one point
• this was happening because a lot of the changes between different patients were "gradual" in the dataset
• it was as if a gradient was applied, smoothing out sharp changes between rows
• I was able to reach 95% accuracy even with the gradient, but randomizing the order of the rows to put distinct rows next to each other bumped up the performance even more
• While this may be a case of overfitting the weights to the data, there were very few sacrifices for accuracy that had to be made to account for edge cases
• Therefore, I believe that if the algorithm did NOT run in quadratic time, it would be very useful in industrial applications
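The reordering trick described above amounts to a seeded shuffle before processing; a tiny sketch (the seed value is an arbitrary assumption):

```python
import random

def shuffle_rows(rows, seed=42):
    # Shuffle a copy so near-duplicate rows stop clustering back to back;
    # a fixed seed keeps test runs reproducible. The original order (and
    # its "gradual" drift between patients) is left untouched.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```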

## Technologies

natural-language-processing, nltk, numpy, python
