[Data Mining] A Roundup of Java ETL Techniques

ETL

ETL = Extract, Transform, Load

Process

Raw Data -> ETL Script -> Tidy Data (structured data)
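To make the three stages concrete, here is a minimal sketch of the flow using only the JDK. The class name EtlSketch, the three method names, and the sample rows are made up for illustration; a real job would read from a file or web page and load into a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical three-stage pipeline; all names here are illustrative.
class EtlSketch {

    // Extract: split the raw text into lines.
    static List<String> extract(String raw) {
        return Arrays.asList(raw.split("\\R"));
    }

    // Transform: drop blank lines, split on tabs, and trim every field.
    static List<String[]> transform(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;
            String[] fields = line.split("\t");
            for (int i = 0; i < fields.length; i++) fields[i] = fields[i].trim();
            rows.add(fields);
        }
        return rows;
    }

    // Load: here each row is simply joined with commas.
    static List<String> load(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) out.add(String.join(",", row));
        return out;
    }

    public static void main(String[] args) {
        String raw = " 2330 \tTSMC\n\n2317\t Hon Hai \n"; // messy sample input
        for (String row : load(transform(extract(raw)))) {
            System.out.println(row);
        }
    }
}
```

With the sample input, the blank line is dropped and the stray spaces are trimmed, leaving two tidy rows.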

Java

FileReader

Reads the bytes of a file and decodes them into readable characters.

FileReader fReader = new FileReader("filename");

BufferedReader

Adds a buffer on top of the reader it wraps.

BufferedReader bReader = new BufferedReader(fReader);

e.g.

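// Needs java.io.BufferedReader and java.io.FileReader; these calls can throw IOException.
FileReader fReader = new FileReader("filename.txt");
BufferedReader bReader = new BufferedReader(fReader);
String line;
// readLine() returns null once the end of the file is reached
while ((line = bReader.readLine()) != null) {
    System.out.println(line);
}
bReader.close(); // also closes the wrapped FileReader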

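***Why use a BufferedReader? (Wrapping a FileReader in a BufferedReader)

When the data is large and main memory is limited, reading everything in at once would use a lot of memory. A BufferedReader instead pulls the file into an internal buffer one chunk at a time, so the program can read it line by line, and most reads are served from the buffer rather than going back to the disk.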
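As a rough illustration of the point about memory, the sketch below counts lines while holding only one line at a time. The class name LineCounter and the method countLines are made up for this example, and a StringReader stands in for new FileReader("filename.txt") so it runs without a file on disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper; the names are illustrative.
class LineCounter {

    // Counts lines without ever holding the whole input in memory.
    static int countLines(Reader source) {
        // try-with-resources closes the BufferedReader and the wrapped reader.
        try (BufferedReader reader = new BufferedReader(source)) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Swap the StringReader for a FileReader to count lines in a real file.
        System.out.println(countLines(new StringReader("a\nb\nc"))); // prints 3
    }
}
```

The try-with-resources form also removes the need for an explicit close() call, even when an exception is thrown halfway through.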

Scanner

Scanner can parse primitive types and strings using regular expressions.

Scanner sc = new Scanner(new File("FileName"));
while (sc.hasNextLine()) {
    String next = sc.nextLine();
}

The data file in the example below is 上市股票代碼表.txt (a table of listed stock codes):

e.g.

try {
    Scanner sc = new Scanner(new File("上市股票代碼表.txt"));
    while (sc.hasNextLine()) {
        String next = sc.nextLine();
        System.out.println(next);
    }
    sc.close();
} catch (IOException e) {
    e.printStackTrace();
}

Splitting data with useDelimiter

-> The delimiter can be a literal separator or a regular expression

e.g. splitting on ","

Scanner sc = new Scanner("123,text,789.123");
sc.useDelimiter(",");
System.out.println(sc.nextInt());
System.out.println(sc.next());
System.out.println(sc.nextFloat());
sc.close();
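Since the delimiter is itself a regular expression, one pattern can cover several separators. The sketch below is only an illustration: the class name DelimiterSketch, the helper tokens, and the sample input are made up, and the pattern treats a comma or semicolon with optional surrounding whitespace as the separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper; the names are illustrative.
class DelimiterSketch {

    // Splits the input on a regular-expression delimiter and returns the tokens.
    static List<String> tokens(String input, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Delimiter: a comma or semicolon, with optional whitespace on either side.
        System.out.println(tokens("123 , text;789.123", "\\s*[,;]\\s*")); // prints [123, text, 789.123]
    }
}
```

Folding the whitespace into the delimiter means the tokens come back already trimmed.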

Splitting data with findInLine

e.g. splitting with a regular expression

Scanner s = new Scanner("123,test|789.123");
// three groups: digits, a word, then everything after the literal '|'
s.findInLine("(\\d+),(\\w+)\\|([^ ]+)");
MatchResult result = s.match(); // java.util.regex.MatchResult
for (int i = 1; i <= result.groupCount(); i++)
    System.out.println(result.group(i));
s.close();

Using Scanner to get the first column of the data

-> Read with two nested Scanners

-> useDelimiter("\t") splits on the Tab character

Scanner sc;
try {
    sc = new Scanner(new File("上市股票代碼表.txt"));
    while (sc.hasNextLine()) {
        String next = sc.nextLine();
        Scanner sc2 = new Scanner(next);
        sc2.useDelimiter("\t");
        System.out.println(sc2.next());
        sc2.close();
    }
    sc.close();
} catch (IOException e) {
    e.printStackTrace();
}

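How to read CSV files

CSV (Comma-Separated Values)

-> Fields in a CSV file are separated by ",".

**Splitting with useDelimiter alone gives wrong results

In the trading-volume column, numeric values are wrapped in double quotes ("") and contain a "," as a thousands separator every three digits. useDelimiter alone cannot handle this, so we might have to write a regular expression to deal with it.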
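To see the problem and one way around it without a regular expression, here is a minimal hand-rolled sketch. The class name CsvLineSketch, the method parseLine, and the sample row are made up for illustration, and the parser is deliberately simplified: it handles quoted fields and doubled quotes on a single line, but not fields that span several lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are illustrative.
class CsvLineSketch {

    // Splits one CSV line, treating commas inside double quotes as data.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // "" inside a quoted field is an escaped quote
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // a comma outside quotes ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field
        return fields;
    }

    public static void main(String[] args) {
        String line = "2330,TSMC,\"12,345,678\""; // made-up sample row
        System.out.println(line.split(",").length); // naive split: prints 5
        System.out.println(parseLine(line).size()); // quote-aware: prints 3
    }
}
```

The naive split cuts the quoted volume into three pieces, while the quote-aware version keeps 12,345,678 as a single field.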

But there is a simpler way:

OpenCSV

CSVReader csvReader = new CSVReader(new FileReader("103年06月18日成交量前二十名證劵.csv"));
String[] row = null;
while ((row = csvReader.readNext()) != null) {
    System.out.println(row[0] + "#" + row[1]);
}
csvReader.close();

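Handling Excel files

You can use JExcel (jxl) or Apache POI

jxl is faster and uses less memory, but has fewer features and does not support .xlsx files

Apache POI allows fine-grained operations on Excel cells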

Reading Excel data with jxl

Workbook workbook = Workbook.getWorkbook(new File("FileName"));
Sheet sheet = workbook.getSheet("SheetName"); // select a worksheet in the Excel file

However, jxl does not support xlsx (the newer file format),

so you can use Apache POI instead:

HSSF (Horrible SpreadSheet Format) -> works with xls

XSSF (XML SpreadSheet Format) -> works with xlsx

XSSFWorkbook workbook = new XSSFWorkbook(file);
XSSFSheet sheet = workbook.getSheetAt(0);

How to read XML files

XML: eXtensible Markup Language

Data format: <tag>data</tag>, for example <stock>...</stock>
A document has exactly one root node

XML parsing libraries

DOM, SAX, JDOM, DOM4J

DOM4J:

https://mvnrepository.com/artifact/dom4j/dom4j

Jaxen:

https://mvnrepository.com/artifact/jaxen/jaxen/1.2.0

Reading XML data with DOM4J

Example:

SAXReader saxReader = new SAXReader();
Document document = saxReader.read(new File("FileName"));
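DOM, the first parser in the list above, ships with the JDK (javax.xml.parsers), so a rough equivalent can be sketched without extra jars. This uses the standard org.w3c.dom API, not DOM4J, so Document here is org.w3c.dom.Document. The class name DomSketch, the helper textOf, and the sample <stocks> document are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper built on the JDK's DOM parser; the names are illustrative.
class DomSketch {

    // Returns the text content of every element with the given tag name.
    static List<String> textOf(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = document.getElementsByTagName(tagName);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            // Covers ParserConfigurationException, SAXException and IOException.
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up document with a single root node, <stocks>.
        String xml = "<stocks><stock><code>2330</code></stock>"
                + "<stock><code>2317</code></stock></stocks>";
        System.out.println(textOf(xml, "code")); // prints [2330, 2317]
    }
}
```

To read from a file instead, builder.parse also accepts a java.io.File.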

How to read JSON files

JSON is made up of {key: value} pairs

json-simple

Reading JSON data with json-simple

JSONParser parser = new JSONParser();
Object obj = parser.parse(new FileReader("2330.json"));
JSONObject jsonObject = (JSONObject) obj;

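How to scrape web page data

First, determine whether the page you want to scrape is requested with GET or POST (the request method).

You can check this with the browser's developer tools (F12).

GET is like a postcard: the content is written on the card, where it can be read directly. Parameters are passed in the URL itself, and the server responds accordingly.

POST is like a letter in an envelope: the content sits inside the envelope, so it can carry more.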
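The difference is mostly about where the parameters travel. The sketch below encodes a set of parameters once and shows the two places the encoded string can go; it makes no network request. The class name FormEncodingSketch, the helper encode, and the example.com URL are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the names and the URL are illustrative.
class FormEncodingSketch {

    // Encodes parameters as key=value pairs joined by '&' (URL form encoding).
    static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) sb.append('&');
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always supported
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("SearchDate", "2020/06/12");
        params.put("SearchTime", "11:00");
        String encoded = encode(params);
        // GET: the parameters are visible in the URL, after '?'.
        System.out.println("GET  http://example.com/search?" + encoded);
        // POST: the same string is sent in the request body instead.
        System.out.println("POST http://example.com/search  body: " + encoded);
    }
}
```

Here encoded is SearchDate=2020%2F06%2F12&SearchTime=11%3A00, because "/" and ":" are percent-encoded.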

Fetching a page with Jsoup via GET

Example:

Document doc;
doc = Jsoup.connect("http://...").get();
String html = doc.html();
System.out.println(html);

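Fetching a page with Jsoup via POST

Take the Taiwan High Speed Rail "Timetable and Fare Search" page as an example:

https://www.thsrc.com.tw/tw/TimeTable/SearchResult

If you enter data and run a search, the Headers tab shows that the Request Method is POST.

Form Data shows the value of each query parameter: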

StartStation:"977abb69-413a-4ccf-a109-0272c24fd490"

EndStation:"2f940836-cedc-41ef-8e28-c2336ac8fe68"

SearchDate:"2020/06/12"

SearchTime:"11:00"

SearchType:"S"

In Jsoup, you can use post() to fetch the page for specific parameters:

Document doc;
doc = Jsoup.connect("https://www.thsrc.com.tw/tw/TimeTable/Search")
    .data("StartStation", "977abb69-413a-4ccf-a109-0272c24fd490")
    .data("EndStation", "2f940836-cedc-41ef-8e28-c2336ac8fe68")
    .data("SearchDate", "2020/06/12")
    .data("SearchTime", "11:00")
    .data("SearchType", "S")
    .post();
String html = doc.html();
System.out.println(html);

Data filtering (Filter)

Google Guava

Extends and enhances Java's built-in Collections

Filter methods

Collections2.filter(Collection, Predicate)

FluentIterable.filter(Predicate)

Iterables.filter(Iterable, Predicate)

List<Stock> stock;
stock = Lists.newArrayList();
try {
    Scanner sc = new Scanner(new File("20410625收盤行情.csv"));
    while (sc.hasNextLine()) {
        String next = sc.nextLine();
        Scanner sc2 = new Scanner(next);
        sc2.useDelimiter(",");
        stock.add(new Stock(sc2.next(),sc2.next()
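The filtering step itself is not shown above. As a rough sketch of the idea, the example below uses the JDK's own java.util.function.Predicate and Stream.filter in place of Guava, so it runs without extra jars. The Stock class here is hypothetical (the fields of the article's Stock class are not shown), and the class name FilterSketch and the sample data are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// JDK-only stand-in for a Guava filter; all names are illustrative.
class FilterSketch {

    // Hypothetical record type with two String fields.
    static class Stock {
        final String code;
        final String name;

        Stock(String code, String name) {
            this.code = code;
            this.name = name;
        }
    }

    // Keeps only the stocks that satisfy the predicate, similar in spirit to
    // Iterables.filter(Iterable, Predicate), but returning a new list.
    static List<Stock> filter(List<Stock> stocks, Predicate<Stock> predicate) {
        return stocks.stream().filter(predicate).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Stock> stocks = Arrays.asList(
                new Stock("2330", "TSMC"),
                new Stock("2317", "Hon Hai"));
        // Keep only the stocks whose code starts with "233".
        for (Stock kept : filter(stocks, s -> s.code.startsWith("233"))) {
            System.out.println(kept.code + " " + kept.name); // prints 2330 TSMC
        }
    }
}
```

One difference to keep in mind: Guava's filter methods return a live view of the original collection, while this sketch copies the matches into a new list.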

Filtering web page data with Jsoup

Elements headlines = doc.select("#id");

String html = " <html> <body>"
    + "<h1 id=\"title\"> Hello World </h1>"
    + " <a href=\"#\" class=\"link\"> This is link1</a>"
    + " <a href=\"#link2\" class=\"link\"> This is link2</a>"
    + "</body> </html>";
Document doc = Jsoup.parse(html);
System.out.println(doc.select("h1").text());
