Data science is a highly complex field that sits at the intersection of several applied disciplines. In one sense, it’s a branch of statistics, but it has gone so far beyond the bounds of traditional statistics that it’s become a discipline in its own right. College curricula for data science majors include classes in statistics as well as math, which provides the foundation for all statistical calculations. In addition, you’re going to need to know how to program, work with databases, and know a thing or two about machine learning. It’s a lot to learn, and you’re usually not considered to have covered it in depth until you’ve done graduate work specifically in the field of data science.
Mathematics
First and foremost, the elephant in the room. There’s some debate as to whether you need an extensive math background for data science. Most professional education courses and bootcamps offer data science with a minimum of pure mathematics, but college curricula usually don’t stint when it comes to a math requirement. Indeed, you’ll be expected to take calculus (I and II) and linear algebra, on the basis that statistics is a branch of mathematics, and you should have the mathematical underpinnings to perform the complex calculations that come with advanced statistics.
Does a data scientist regularly resort to multiplying matrices and adding vectors by hand, or does the computer do that heavy lifting nowadays? The same question can be asked of calculus, and the answer given by the people who draw up degree requirements is that the understanding still matters: the better you grasp calculus and linear algebra, the better you’ll understand how statistics works. In other words, knowing a lot of math is not going to hurt you in your effort to become a data science professional.
Statistics
Whereas knowing how to add vectors or multiply matrices may not be essential to data science, one branch of math is unquestionably necessary in any data scientist’s toolkit: statistics. In some ways, data science is nothing but statistics or, if you prefer, statistics on steroids. As a result, you’re going to have to know as much as you can about statistics and statistical modeling to make it in the field. That means knowing a great deal more than how to isolate a median or calculate a standard deviation (although you’ll need to start from there). You’ll also need to know about probability theory, probability distributions, dimensionality reduction, hypothesis testing, and regression, and have a handle on the principles of Bayesian statistics, a paradigm in which established beliefs are updated as new data enter the picture. (The approach takes its name from Thomas Bayes, the 18th-century statistician whose theorem underlies this type of analysis.)
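To make that concrete, here’s a small sketch using nothing but Python’s standard-library statistics module (the numbers are invented for illustration): a median, a standard deviation, and a toy Bayesian update of a prior belief.

```python
from statistics import median, stdev

# Descriptive statistics: the starting point
samples = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9]
print(median(samples))        # the middle value of the sorted data
print(stdev(samples))         # the sample standard deviation

# A toy Bayesian update: revise a prior belief after new evidence arrives.
prior = 0.30                  # P(hypothesis) before seeing the data
likelihood = 0.80             # P(evidence | hypothesis is true)
false_positive = 0.10         # P(evidence | hypothesis is false)

# Bayes' theorem: posterior = prior * likelihood / P(evidence)
evidence = prior * likelihood + (1 - prior) * false_positive
posterior = prior * likelihood / evidence
print(round(posterior, 3))    # the belief, strengthened by the data
```

Real-world analyses use richer models than this, but the pattern — start from descriptive statistics, then update beliefs as evidence accumulates — is the same one that scales up.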
Statistics is a rich and complex field in and of itself, and many future data scientists earn their bachelor’s degrees in statistics and specialize in data science when it comes time to select a master’s program. There’s no getting around learning statistics for data science (if you really have an aversion to the field, you’re not going to like data science any better), and the more skilled you are at statistics, the more capable a data scientist you’ll be.
Programming
If there is something that separates data scientists from statisticians, it’s the fact that the former are expected to be programmers as well as mathematicians. In four years of college, you will learn multiple programming languages. That said, the programming language preferred by most data scientists is Python, and you’ll have to learn to program in the language in order to work in the data science field. As computer languages go, Python is relatively easy to learn: most of its commands are ordinary English words, and its syntax is designed to be streamlined and readily understandable to speakers of natural (human) languages. When Guido van Rossum thought up the language over one Christmas vacation when he had nothing else to do, that kind of intelligibility was one of his goals, and the result was a multi-purpose language that remains incredibly popular (indeed, has only gained in popularity) thirty years after its first release.
There is an alternative to Python when it comes to finding a programming language that can perform all manner of complex calculations while standing on one foot and whistling while eating peanut butter crackers: R. Also in its fourth decade of existence, R was designed specifically for statistical computing and data visualization, and has many adherents in the world of data science. The primacy of Python, however, is the result of Python’s simpler syntax (R’s syntax is on the idiosyncratic side), and the fact that it is very frequently taught as a first programming language, even to children.
Python and R both offer robust collections of ready-to-use packages of code and documentation that extend the functionality of the language. Among the Python packages and libraries that data scientists use on a daily basis is a triptych of technologies: NumPy, pandas, and Matplotlib. The first is a library that plugs into Python and turns it into a giant supercalculator, able to do just about everything you can legally do to matrices and arrays.
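As an illustrative sketch (assuming NumPy is installed; the numbers are made up), here is the supercalculator in action:

```python
import numpy as np

# Vectors and matrices become first-class objects
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a + b)                  # element-wise vector addition

m = np.array([[1, 2],
              [3, 4]])
n = np.array([[5, 6],
              [7, 8]])
print(m @ n)                  # matrix multiplication with one operator
print(m.mean(axis=0))         # column means, computed in fast compiled code
```

The point isn’t that you’ll type these particular operations every day; it’s that NumPy makes the linear algebra from your math classes directly available in code.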
For its part, pandas (spelled with a lowercase p for reasons best known to the library’s creators) is the library made with data science operations in mind. It’s not cute like those big black-and-white bears that eat bamboo and live in China, but pandas was designed to help data scientists with vast datasets. The library is something of a Swiss Army knife in that it can perform a multitude of data science operations, including data alignment, reshaping, pivoting, merging and joining of datasets, and hierarchical axis indexing (most of which are means of bringing big data down to size).
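A small taste of the merging-and-summarizing side of that knife, with two invented toy tables (real datasets are vastly larger, but the operations are the same):

```python
import pandas as pd

# Two small datasets that share a key column
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Grace", "Edsger"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "total": [25.0, 40.0, 15.0],
})

# Join the two tables on their shared key, keeping every customer
merged = customers.merge(orders, on="customer_id", how="left")

# Then collapse the detail rows into one summary number per person
spend = merged.groupby("name")["total"].sum()
print(spend)
```

That merge–group–summarize pattern is the bread and butter of day-to-day pandas work.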
Finally, Matplotlib isn’t a library for dyslexic pilots. It assists in the creation of data visualizations: the tables, charts, and graphs that summarize your findings and make them accessible to the people for whom you’re performing your analytical prestidigitation. Those visualizations can be static, animated, or even interactive. In addition to Matplotlib, which is the visualization library most closely integrated with Python, you will probably also learn to use business intelligence (BI) software such as Tableau, which adds its own wrinkles to the visualization process and is handier with larger datasets.
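A minimal sketch of a static chart (assuming Matplotlib is installed; the figure is rendered off-screen, and the revenue numbers are invented):

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen; no display window needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [10.2, 11.5, 9.8, 13.1]

fig, ax = plt.subplots()
ax.bar(months, revenue)        # a simple static bar chart
ax.set_title("Quarterly Revenue")
ax.set_ylabel("Revenue ($M)")
fig.savefig("revenue.png")     # export for a report or slide deck
```

Animated and interactive figures build on the same figure-and-axes objects shown here.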
SQL/NoSQL
Data today come in two flavors: structured and unstructured. Structured data is roughly like a stack of file cards with fields neatly filled out for, at their most basic, names, addresses, and phone numbers. There can be as many fields as the person creating the database wants, but there’s always that element of needing a round peg for a round hole. To get something out of a structured dataset, you need a language that will allow you to query the database and bring back information. This is most frequently done using the Structured Query Language, SQL, which can add, delete, retrieve, and manipulate data within the confines of a structured database. While it’s not without its quirks, SQL can be learned far more rapidly than any of the other items on this list.
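To see the round peg meet the round hole, here’s a small sketch using Python’s built-in sqlite3 module (the table and contacts are invented for illustration):

```python
import sqlite3

# An in-memory database stands in for a real structured data store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, city TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [("Ada", "London", "555-0100"),
     ("Grace", "New York", "555-0101"),
     ("Alan", "London", "555-0102")],
)

# SQL retrieves exactly the rows that fit the query
rows = conn.execute(
    "SELECT name FROM contacts WHERE city = ? ORDER BY name", ("London",)
).fetchall()
print(rows)                    # [('Ada',), ('Alan',)]
```

The same SELECT/INSERT vocabulary carries over to production databases such as PostgreSQL and MySQL; only the connection details change.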
That’s the good news. The bad news is that data are defined ever more broadly these days, and that some estimates put 80% of today’s data into the equivalent of a giant cardboard box filled with file folders that contain heaps of information grouped higgledy-piggledy in what are called unstructured or NoSQL (Not Only SQL) databases. These can be enormous amounts of all sorts of data that you might not even consider data, like images, social media posts, audio files, and even medical information.
The current trend is very much in favor of NoSQL databases, which require their own set of tools, the most popular of which is probably MongoDB, which comes with its own query language, MQL (MongoDB Query Language). Although Mongo is usually associated with JavaScript rather than Python, there’s a native Python driver for Mongo (PyMongo) that will let you take advantage of Mongo while talking to your computer in Python. You’re eventually going to have to face up to learning how to navigate the muddy, churning waters of an unstructured database; you might as well turn your attention to it as soon as you’ve gotten a handle on SQL.
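Conveniently, when you work through PyMongo, MQL filters are just Python dictionaries. The sketch below builds one without connecting to a server; the field names, database name, and the commented-out connection code are all hypothetical:

```python
# An MQL query document is a plain Python dictionary describing which
# documents to match; with PyMongo you would pass it to find().
query = {
    "likes": {"$gt": 100},                # numeric comparison operator
    "tags": {"$in": ["python", "data"]},  # match any of the listed tags
}

# A projection limits which fields come back from each matched document
projection = {"author": 1, "text": 1, "_id": 0}

# With a live server, the call would look roughly like this
# (hypothetical names; requires a running MongoDB instance):
#   import pymongo
#   client = pymongo.MongoClient("mongodb://localhost:27017")
#   results = client.blog.posts.find(query, projection)
print(query)
```

The takeaway: the query language is made of the same dictionaries and lists you already use in everyday Python, which flattens the learning curve considerably.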
Machine Learning
AI (artificial intelligence) is gradually taking over the world. One of its more benign applications is in the data science field, where, under the name of machine learning, it can be used to extract conclusions from gigantic piles of data at a rate far faster than an entire human lifetime spent squeezing meaning from big datasets. It’s a bit of an oversimplification, but machine learning steps in once the data scientist has processed, cleaned, wrangled, or otherwise made manageable oversized unstructured datasets, such as information gleaned from that giant data cesspool, social media.
How do you go about learning machine learning? You’ll require all the skills outlined above, along with some new ones specifically designed for machine learning. For Python users, that means the scikit-learn library (no, it’s not about teaching a dog to attack something), which helps out with every step of your workflow and makes it possible to plug the most-used machine learning algorithms into your Python code. (For those using R, there is a roughly equivalent library called caret, as in the character you get when you hit shift + 6, not as in what Bugs Bunny eats.) Other libraries that extend Python’s machine learning reach include TensorFlow and PyTorch, two deep learning frameworks widely used for tasks such as natural language processing (NLP), along with the older Theano, which has largely been superseded. Machine learning is a dense field developing at an alarming rate, so you can’t possibly learn everything there is to know about it, let alone in the space of a few months or even years. Still, you’ll need a solid understanding of the subject and the algorithms and technologies it employs to get data to give up their secrets and guide business decisions.
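As a hedged sketch of that workflow (assuming scikit-learn is installed; it ships with the classic Iris flower dataset used here), a complete train-and-evaluate loop fits in a dozen lines:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# A tiny but complete machine learning workflow
X, y = load_iris(return_X_y=True)          # 150 flowers, 4 measurements each
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # hold out data the model never sees
)

model = LogisticRegression(max_iter=200)   # a classic, interpretable classifier
model.fit(X_train, y_train)                # learn patterns from training data

accuracy = model.score(X_test, y_test)     # evaluate on the held-out data
print(round(accuracy, 2))
```

Swapping in a different algorithm is usually a one-line change — that uniform fit/predict/score interface is a large part of scikit-learn’s appeal.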
Yes, that’s a great deal to have to learn, but it doesn’t all have to be learned in one sitting. And, in any event, working in data science will compel you to keep in touch with the newest emerging trends and technologies. Thus, the learning process is never really complete. Moreover, some of it can be done while you’re on the job, where everyone is going to be learning and innovating along with you.
Learn Data Science with Noble Desktop
If you’ve settled on learning what you need to know to start a career in data science, Noble Desktop offers a thorough Data Science Certificate program. Over the course of roughly five weeks, the Python-centered curriculum will teach you the language’s fundamentals, followed by its application to data science, data visualization, and machine learning. You’ll also learn to use SQL. A part-time option is available for those who are unable to devote 40 hours per week to classwork. The course is taught by expert instructors to small classes, so that you can receive your fair share of the teacher’s attention. Your Noble Desktop tuition includes a free retake option, access to recordings of your classroom sessions, Noble’s state-of-the-art workbooks and learning materials, and half a dozen 1-on-1 sessions with a dedicated mentor who can assist you every step of the way as the program unfolds.
How to Learn Data Science
Master data science with hands-on training. Data science is a field that focuses on creating and improving tools to clean and analyze large amounts of raw data.
- Data Science Certificate at Noble Desktop: live, instructor-led course available in NYC or live online
- Find Data Science Classes Near You: search & compare dozens of available in-person courses
- Attend a data science class live online (remote/virtual training) from anywhere
- Find & compare the best online data science classes (on-demand) from the top providers and platforms
- Train your staff with corporate and onsite data science training