This was a submission to the Redis #beyondcache #hackathon:

The whole pipeline is open-sourced at https://github.com/AlexMikhalev/cord19redisknowledgegraph

Vision

Mission

To build a natural language processing pipeline capable of handling a large number of documents and concepts, incorporating System 1 AI (fast, intuitive reasoning) and System 2 AI (high-level reasoning), and then to present the knowledge in a modern VR/AR visualisation. Knowledge should be re-usable and shareable.

Goal

Build an NLP pipeline that leverages the Redis ecosystem wherever possible. We use the COVID-19 (CORD19) corpus of medical articles as input and draw on the experience of participating in the Kaggle CORD19 competition. My major challenge in the CORD19 competition was running out of memory/storage and losing processed steps, then building and re-loading snapshots while leveraging modern frameworks like spaCy/PyTorch. Overall I lost about a week in the competition re-processing the data. Hence this implementation is designed to avoid running out of memory, taking baby steps initially.
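The "baby steps" idea can be sketched roughly as follows: process one document at a time and record progress after each, so that a crash or restart never forces a full re-run. This is a minimal illustration, not the code from the repository. `InMemoryStore` is a hypothetical stand-in whose method names mirror the Redis commands SADD, SISMEMBER and HSET, so the example runs without a server. The key names (`done:<step>`, `result:<step>`), the `process_corpus` helper and the toy tokeniser are likewise assumptions made for the example.

```python
import json
from typing import Callable, Dict, Iterable, List, Set, Tuple


class InMemoryStore:
    """Hypothetical stand-in for a Redis client (SADD / SISMEMBER / HSET semantics)."""

    def __init__(self) -> None:
        self.sets: Dict[str, Set[str]] = {}
        self.hashes: Dict[str, Dict[str, str]] = {}

    def sadd(self, key: str, member: str) -> int:
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0  # member was already present
        members.add(member)
        return 1

    def sismember(self, key: str, member: str) -> bool:
        return member in self.sets.get(key, set())

    def hset(self, key: str, field: str, value: str) -> None:
        self.hashes.setdefault(key, {})[field] = value


def process_corpus(
    store: InMemoryStore,
    docs: Iterable[Tuple[str, str]],
    step_name: str,
    step: Callable[[str], List[str]],
) -> int:
    """Run `step` once per document, checkpointing after every document."""
    done_key = f"done:{step_name}"      # assumed key naming, for illustration only
    result_key = f"result:{step_name}"
    processed = 0
    for doc_id, text in docs:
        if store.sismember(done_key, doc_id):
            continue  # handled in an earlier run, skip it
        result = step(text)
        store.hset(result_key, doc_id, json.dumps(result))
        store.sadd(done_key, doc_id)  # mark done only after the result is saved
        processed += 1
    return processed


# Toy corpus and a toy "tokeniser" (plain whitespace split, not BERT).
docs = [
    ("paper-1", "Coronavirus spike protein"),
    ("paper-2", "ACE2 receptor binding"),
]


def tokenise(text: str) -> List[str]:
    return text.lower().split()


store = InMemoryStore()
first_run = process_corpus(store, docs, "tokenise", tokenise)
second_run = process_corpus(store, docs, "tokenise", tokenise)  # simulated restart
print(first_run, second_run)
```

Because each document is marked as done only after its result is stored, the second call finds nothing left to do. With a real Redis instance behind the same interface, that state would also survive a restart of the worker process.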

Implementation

Conclusion and lessons learned

We took OCR scans in JSON format and turned them into a Knowledge Graph, demonstrating how you can apply modern techniques like BERT tokenisation alongside more traditional Semantic Network/OWL/Metathesaurus techniques based on the Unified Medical Language System. The Redis ecosystem offers a lot to the data science community, and I hope it will take its place at the core of Kaggle notebooks and ML frameworks and make them more deployable. The success of our industry depends on how well our tools work together, whether they are engineering, data science, machine learning, organisational or architectural.

PS. Building a distributed system, even one as simple as this, with kids around is “difficult/funny”. Special thank you to my patient wife, Karine.

Further steps


Written on June 11, 2020 by Alex Mikhalev.

Originally published on Medium