Set up your environment
As a Java developer, I tend to pick up a new subject using a language I am already familiar with, so I have one less thing to worry about. However, for machine learning, I suggest we get out of our comfort zone and learn Python. Here are some reasons:
- Python is a popular language in data science due to the strength of its core libraries (NumPy, SciPy, pandas, matplotlib, IPython).
- Python is one of the most used languages at Google. Another is Java.
- TensorFlow offers a Python API, and we will use it to scale our machine learning pipeline later.
- I know I am going to get into deep learning, and there is great framework and library support in this space. To name a few: Gensim, TensorFlow, Keras, Caffe, nolearn, and more. Check this link to grab the list.
The major issues I came across when coding in Python are dependency management and version compatibility. Python 3 is not fully backward compatible with Python 2. Most data science tasks need quite a few third-party dependencies, and the last thing you want is to sort out how to make all the different library versions fit together. To lessen the headache, I recommend installing Anaconda, which bundles popular data science libraries and lets you create your own virtual environments so that Python 2 and Python 3 libraries do not step on each other. You can follow this video to set it up. On top of that, it comes with Jupyter Notebook (formerly known as IPython Notebook), which gives you an interactive environment for coding in Python. You can get familiar with it from this video.
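To make the Python 2 versus Python 3 incompatibility concrete, here is a minimal sketch that reports how the running interpreter treats `/` between two integers. The function name `division_style` is made up for illustration; the behavior it probes (floor division in Python 2, true division in Python 3) is standard.

```python
def division_style():
    """Report how `/` behaves for two integers in this interpreter."""
    # Python 2 evaluates 7 / 2 to 3, Python 3 evaluates it to 3.5.
    return "true division" if 7 / 2 == 3.5 else "floor division"


if __name__ == "__main__":
    # `//` is floor division in both major versions, so it is the portable spelling.
    print("7 / 2 uses {}".format(division_style()))
    print("7 // 2 = {}".format(7 // 2))
```

Code that silently relies on one behavior can produce different numbers under the other interpreter, which is one reason to keep each project in its own environment.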
# check conda version
$ conda -V
# update conda
$ conda update conda
# check which Python versions are available
$ conda search "^python$"
# create an environment with Python 2.7.12
$ conda create --name abc python=2.7.12 [you need one package here]
# There is no need to install Anaconda again. Conda, the package manager for Anaconda, fully supports separate environments.
# The easiest way to create an environment for Python 2.7 is to do
$ conda create -n py27_13 python=2.7.13 anaconda
# activate an environment (conda 4.4 and later also accept: conda activate py27_13)
$ source activate py27_13
# check that the activated environment runs Python 2.7.13
(py27_13) $ python
Python 2.7.13 |Anaconda 4.3.1 (x86_64)| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
# install, update, or remove packages inside the activated environment
(py27_13) $ conda install [package_name]
(py27_13) $ conda update [package_name]
(py27_13) $ conda remove [package_name]
(py27_13) $ pip install [package_name]
# install a package into a particular environment without activating it
$ conda install -n py27_13 [package_name]
# list all environments created so far. NOTE: the asterisk marks the currently active environment (root here).
$ conda env list
# conda environments:
root * //anaconda
# deactivate an environment
(py27_13) $ source deactivate
# remove an environment
$ conda remove --name py27_13 --all
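After activating an environment, it can be reassuring to confirm from inside Python which interpreter is actually running. The following is a minimal sketch using only the standard `sys` module; the helper name `environment_summary` is hypothetical, and the paths it prints depend on where Anaconda is installed on your machine.

```python
import sys


def environment_summary():
    """Collect the interpreter path, environment prefix, and version."""
    return {
        "executable": sys.executable,  # full path of the running python binary
        "prefix": sys.prefix,          # root directory of the active environment
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
    }


if __name__ == "__main__":
    for key, value in sorted(environment_summary().items()):
        print("{}: {}".format(key, value))
```

If the environment is active, the prefix should point at that environment's directory rather than at the root Anaconda install.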
Common packages and libraries that you may need to install
$ source activate py27_13
# install common machine learning packages for Python
(py27_13) $ conda install -c conda-forge tensorflow # install tensorflow
(py27_13) $ pip install -U nltk pandas scikit-learn matplotlib beautifulsoup4 gensim seaborn tabulate
- tensorflow – Google deep learning library
- nltk – Natural language processing (NLP) toolkit
- pandas – Data analysis library for tabular data
- scikit-learn – Popular machine learning library for Python
- matplotlib & seaborn – Data visualization tools
- tabulate – Pretty printing for tabular data
- beautifulsoup4 – HTML parser
- gensim – Topic modeling and word embedding (e.g. word2vec) library
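Once the packages above are installed, a quick check that they are visible to the active interpreter can save some confusion later. The sketch below looks each module up without importing it. Note that some import names differ from the pip names (`scikit-learn` is imported as `sklearn`, `beautifulsoup4` as `bs4`). The names `ML_PACKAGES`, `is_installed`, and `report` are made up for this illustration, and the Python 2.7 fallback assumes `pkgutil.find_loader`, which is only called when `importlib.util` is unavailable.

```python
import pkgutil

try:
    from importlib.util import find_spec  # available on Python 3.4+
except ImportError:
    find_spec = None  # Python 2.7: fall back to pkgutil below

# pip/conda package name -> module name used in `import ...`
ML_PACKAGES = {
    "tensorflow": "tensorflow",
    "nltk": "nltk",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "beautifulsoup4": "bs4",
    "gensim": "gensim",
    "seaborn": "seaborn",
    "tabulate": "tabulate",
}


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    if find_spec is not None:
        return find_spec(module_name) is not None
    return pkgutil.find_loader(module_name) is not None


def report(packages):
    """Map each package name to True/False depending on availability."""
    return {name: is_installed(module) for name, module in packages.items()}


if __name__ == "__main__":
    for name, found in sorted(report(ML_PACKAGES).items()):
        print("{:15s} {}".format(name, "ok" if found else "MISSING"))
```

Run it with the environment activated; anything reported as MISSING can be installed with `conda install` or `pip install` as shown earlier.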
Write Python Code in IntelliJ
Actually, you can use any text editor to write Python code. But I prefer IntelliJ because I like to step through Python code in debug mode, so I can examine the variables without printing them out. Below are the steps to follow if you want to set up IntelliJ for Python.
- Create new project (File > New > Project) and select “Python”.
- Under Project SDK, click the New button and point it to the interpreter at “/anaconda/envs/abc/bin/python”.
- Create a new Python file.
After that, you can type in Python code and run it. If you want, you can add a breakpoint and run it in debug mode.
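If you want something small to try the debugger on, here is a hypothetical practice script (the function `running_totals` is invented for this purpose, not part of any library). Set a breakpoint on the marked line, start a debug run, and watch `current` and `totals` change on each pass through the loop.

```python
def running_totals(values):
    """Return the cumulative sums of a list of numbers."""
    totals = []
    current = 0
    for value in values:
        current += value  # set a breakpoint here and inspect `current`
        totals.append(current)
    return totals


if __name__ == "__main__":
    print(running_totals([3, 1, 4, 1, 5]))  # prints [3, 4, 8, 9, 14]
```

Stepping over the loop body a few times shows the variables pane updating, which is the main advantage over sprinkling print statements through the code.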